Blog: Winning with Data

Archive for the ‘Analytics’ Category



Posted on February 10th, 2009 by Shawn Kung


Cloud computing is a fascinating concept.  It offers greenfield opportunities (or more appropriately, blue sky frontiers) for businesses to affordably scale their infrastructure needs without plunking down a huge hardware investment (and the space/power/cooling costs associated with managing your own hosted environment).  This removes the risks of mis-provisioning by enabling on-demand scaling according to your data growth needs.  Especially in these economic times, the benefits of Cloud computing are very attractive.

But let’s face it - there’s also a lot of hype, and it’s hard to separate truth from fiction.  For example, what qualities would you say are key to data warehousing in the cloud?

Here’s a checklist of things I think are important:

[1] Time-To-Scalability.  The whole point of clouds is to offer easy access to virtualized resources.  A cloud warehouse needs to quickly scale-out and scale-in to adapt to changing needs.  It can’t take days to scale…it has to happen on-demand in minutes (<1 hour).

[2] Manageability.  You go with clouds because you not only want to save on hardware, but also on the operational people costs of maintaining that infrastructure.  A cloud warehouse needs to offer one-click scaling, easy install/upgrade, and self-managed resiliency.

[3] Ecosystem.  While clouds offer *you* huge TCO savings, you can’t compromise service levels for your customers - especially if you run your business on the cloud.  BI/ETL/monitoring tools, Backup & Recovery, and ultra-fast data loading can’t be overlooked for “frontline” mission-critical warehousing on the cloud.

[4] Analytics.  Lots of valuable data is generated via the cloud and there are opportunities to subscribe to new data feed services.  It’s insufficient for a cloud warehouse to just do basic SQL reporting.  Rather, it must offer the ability to do deep analytics very quickly.

[5] Choice.  A truly best-in-class cloud warehouse won’t lock you in to a single cloud vendor.  Rather, it will offer portability by enabling you to choose the best cloud for you to run your business on.

Finally, here are a couple of ideas on the future of cloud warehousing.  What if you could link multiple cloud warehouses together and do interesting queries across clouds?  And what about the opportunities for game-changing new analytics - with so many emerging data subscription services, wouldn’t this offer ripe opportunities for mash-up analytics (e.g., using Aster SQL/MapReduce)?

What do you think are the standards for “best-in-class” cloud warehousing?

Posted on February 10th, 2009 by Steve Wooledge

It may not sound sexy, but analyzing a sequence of events over time is non-trivial in standard SQL. When Aster started providing customers ways to express time-series analysis more elegantly and have it return result sets in 90 seconds as opposed to 90 minutes, we started getting comments like,  “That’s beautiful … I would make out with my screen if I could!”.

Crazy, I know.

Time-series, or sequential pattern analysis, is useful in a number of industries and applications such as:

- Price changes over time in financial market data

- Path analysis within click-stream data to define anything from top referring sites to the “golden” paths customers navigate before purchasing

- Patterns which detect deviant activity such as spamming or insurance claims fraud

- Sessionization (mapping each event in a clickstream to a human user session; a standard-SQL sketch of this follows below)
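
To make that last item concrete, here is a minimal sketch of sessionization in standard SQL - a generic window-function approach, not Aster-specific syntax - assuming a hypothetical clicks(userid, ts, page) table and a 30-minute inactivity timeout:

-- A gap of more than 30 minutes starts a new session; the running sum numbers sessions per user.
SELECT userid, ts, page,
       SUM(is_new_session) OVER (PARTITION BY userid ORDER BY ts) AS session_id
FROM (
  SELECT userid, ts, page,
         CASE WHEN LAG(ts) OVER (PARTITION BY userid ORDER BY ts) IS NULL
                OR ts - LAG(ts) OVER (PARTITION BY userid ORDER BY ts) > INTERVAL '30 minutes'
              THEN 1 ELSE 0 END AS is_new_session
  FROM clicks
) flagged;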

More specifically, one customer wanted to improve the effectiveness of their advertising and drive more visitors to their site. They asked us to help determine the top paths people take to get to their site and the top paths people take after leaving it. Knowing the “before” path gave them insight into which sites to place advertisements on to drive traffic to their site. Additionally, pathing helps the customer understand the behavior and preferences of users who visit their site (e.g., if espn.com frequently appears among the five pages leading up to the customer’s site, they know that many visitors like sports).

However, discovering relationships between rows of data is difficult to express in standard SQL, which requires multiple self-joins of the data. These joins dramatically expand the amount of data involved in the query and slow down query performance - not to mention the complexity of developing and parsing these expressions.
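
As an illustration of the pain - again using a hypothetical clicks(userid, ts, page) table - even a simple three-step path (product page, then cart, then checkout) already needs two self-joins in plain SQL, and every additional step in the pattern adds another:

-- Users who viewed a product page, later added to cart, and later checked out.
SELECT DISTINCT a.userid
FROM clicks a
JOIN clicks b ON b.userid = a.userid AND b.ts > a.ts
JOIN clicks c ON c.userid = b.userid AND c.ts > b.ts
WHERE a.page = 'product'
  AND b.page = 'cart'
  AND c.page = 'checkout';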

What’s the solution? There are a few, but Aster’s approach has been to develop extensions to SQL that execute in-database, in a single pass over the data, in a massively parallel fashion. The centerpiece is nPath, a SQL-MapReduce (SQL/MR) function that performs regular-expression-style pattern matching over a sequence of rows. It allows users to:

- Specify any pattern in an ordered collection - a sequence - of rows with symbols;

- Specify additional conditions on the rows matching these symbols; and

- Extract useful information from these row sequences.

I’ll share the syntax of nPath here to give you more context on how the query operates:

SELECT…
FROM nPath (
ON {table|query}
…various parameters…
)
WHERE…
GROUP BY…
HAVING…
ORDER BY…
etc.

nPath performs pattern matching and computes aggregates. The results of this computation are the output rows of nPath. The rows from nPath can subsequently be used like any other table in a SQL query: rows from nPath may be filtered with the WHERE clause, joined to rows from other tables, grouped with the GROUP BY clause, and so on.
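
To make this concrete, here is a rough sketch of a “top paths to purchase” query. The skeleton above deliberately elides nPath’s parameters, so take the clause names used here (PARTITION BY, ORDER BY, MODE, PATTERN, SYMBOLS, RESULT) and the hypothetical clicks(userid, ts, page) table as illustrative assumptions rather than exact syntax:

SELECT page_path, COUNT(*) AS path_count
FROM nPath(
  ON clicks
  PARTITION BY userid
  ORDER BY ts
  MODE (NONOVERLAPPING)
  -- any run of non-checkout pages followed by a checkout page
  PATTERN ('OTHER*.PURCHASE')
  SYMBOLS (page = 'checkout' AS PURCHASE,
           page <> 'checkout' AS OTHER)
  RESULT (ACCUMULATE(page OF ANY(OTHER, PURCHASE)) AS page_path)
)
GROUP BY page_path
ORDER BY path_count DESC
LIMIT 10;

Because nPath’s output is just rows, the surrounding GROUP BY, ORDER BY, and LIMIT behave exactly as they would over an ordinary table.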

The result? Incredibly powerful insight into a series of events - the kind that reveals a pattern or a segment - can be expressed in SQL, run in parallel on massive clusters of compute power in an extremely efficient manner via a single pass over the data, and made accessible to business analysts through traditional BI tools.

What do you think? What other methods are people using to tackle these types of problems?

Posted on November 19th, 2008 by jguevara

I recently attended a panel discussion in New York on media fragmentation, featuring media agency execs including:

- Bant Breen (Interpublic - Initiative – President, Worldwide Digital Communications),
- John Donahue (Omnicom Media Group - Director of BI Analytics and Integration),
- Ed Montes (Havas Digital - Executive Vice President),
- Tim Hanlon (Publicis - Executive Vice President/Ventures for Denuo)

The discussion was kicked off by Brian Pitz, Principal of Equity Research for Bank of America.  Brian set the stage for a spirited discussion regarding the continuing fragmentation of online media, along with research on the issues this poses.  The panel touched upon many issues, including fears about ad placement around unknown user-generated content, agencies’ lack of the skill sets needed to address this medium, and a lack of standards.  However, what surprised me most was the unanimous consensus that there is more value further out on “The Tail” of the online publisher spectrum due to the targeted nature of the content.  Yet the online media buying statistics conflict with this opinion (over 77% of online ad spending still flows to the top 10 sites).

When asked why their sentiment contrasts with the stats, the panel revealed a level of uncertainty rooted in a lack of transparency into “The Tail”.  Despite the 300+ ad networks that have emerged to address this very challenge, the value chain lacks the data to confidently invest the dollars.  In addition, there was a rather cathartic moment when John Donahue professed that agencies should “Take Back Your Data From Those that Hold It Hostage”.

It is our belief that the opinions expressed by the panel serve as evidence of a shift towards a new era in media, where evidential data, rather than sampling-based ratings, will act as the currency that drives valuation across media.  No one will be immune to this:

- Agencies need it to confidently invest their clients’ dollars and show demonstrable ROI of their services
- Ad networks need it to earn their constituencies’ share of marketing budgets
- Ad networks need it to defend the targeted value and the appropriateness of their collective content
- 3rd-party measurement firms (comScore, Nielsen Online, ValueClick) need it to maintain the value of their objective measurements
- Advertisers need it to support the logic behind budget allocation decisions
- BIG MEDIA needs it to defend their 77% stake

You might be thinking, “The need for data is no great epiphany”.  However, I submit that the amount of data, and the mere fact that all participants should have their own copy of it, is a shift in thinking.  Gone are the days when:

- The value chain is driven solely by 3rd parties and their audience samples
- Ad Servers/Ad Networks are the only keepers of the data
- Service Providers can offer data for a fee

Posted on November 6th, 2008 by Mayank Bawa

I was at Defrag 2008 yesterday and it was a wonderful, refreshing experience. A diverse group of Web 2.0 veterans and newcomers came together to accelerate the “Aha!” moment in today’s online world. The conference was very well organized and there were interesting conversations on and off the stage.

The key observation was that individuals, groups and organizations are struggling to discover, assemble, organize, act on, and gather feedback from data. Data itself is growing and fragmenting at an exponential pace. We as individuals feel overwhelmed by the slew of data (messages, emails, news, posts) in the microcosm, and we as organizations feel overwhelmed in the macrocosm.

The very real danger is that an individual or organization’s feeling of being constantly overwhelmed could result in the reduction of their “Aha!” moments - our resources will be so focused on merely keeping pace with new information that we won’t have the time or energy to connect the dots.

The goal then is to find tools and best practices to enable the “Aha!” moments - to connect the dots even as information piles up at our fingertips.

My thought going into the conference was that we need to understand what causes these “Aha!” moments. If we understand the cause, we can accelerate the “Aha!” even at scale.

Earlier this year, Janet Rae-Dupree published an insightful piece in the International Herald Tribune on Reassessing the Aha! Moment. Her thesis is that creativity and innovation - “Aha! Moments” - do not come in flashes of pure brilliance. Rather, innovation is a slow process of accretion, building small insight upon interesting fact upon tried-and-true process.

Building on this thesis, I focused my talk on using frontline data warehousing as an infrastructure piece that allows organizations to collect, store, analyze and act on market events. The incremental fresh data loads in a frontline data warehouse add up over time to build a stable historical context. At the same time, applications can contrast fresh data with historical data to build the small contrasts gradually until the contrasts become meaningful to act upon.
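
As a rough sketch of that “contrast fresh with historical” idea - the tables and thresholds here are hypothetical, not from the talk - an application might compare today’s hourly activity against a trailing 30-day baseline and surface only the meaningful deviations:

SELECT f.metric_hour, f.page_views, h.avg_views,
       f.page_views / NULLIF(h.avg_views, 0) AS lift
FROM fresh_hourly f
JOIN (
  -- trailing 30-day baseline, averaged by hour of day
  SELECT EXTRACT(HOUR FROM metric_hour) AS hour_of_day,
         AVG(page_views) AS avg_views
  FROM history_hourly
  WHERE metric_hour >= NOW() - INTERVAL '30 days'
  GROUP BY 1
) h ON EXTRACT(HOUR FROM f.metric_hour) = h.hour_of_day
WHERE f.page_views > 1.5 * h.avg_views;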

I’d love to hear back from you on how massive data can accelerate, rather than impede, the “Aha!” moment.

Slides: “Aster Defrag 2008” (presentation available on SlideShare).

Posted on October 26th, 2008 by Steve Wooledge

In a down economy, marketing and advertising are some of the first budgets to get cut. However, recessions are also a great time to gain market share from your competitors. If you take a pragmatic, data-driven approach to your marketing, you can be sure you’re getting the most ROI from every penny spent. It is not a coincidence that in the last recession, Google and Advertising.com came out stronger since they provided channels that were driven by performance metrics.

That’s why I’m excited that Aster Data Systems will be at Net.Finance East in New York City next week. Given the backdrop of the global credit crisis, we will learn first-hand how events in the financial landscape are playing out.  I am sure marketing executives are thinking of ways to take advantage of the change, whether it’s multi-variate testing, more granular customer segmentation, or simply lowering the data infrastructure costs associated with their data warehouse or Web analytics.

Look us up if you’re at Net.Finance East - we’d love to learn from you and vice-versa.

Posted on October 6th, 2008 by Steve Wooledge

Aster announced the general availability of our nCluster 3.0 database, complete with new feature sets. We’re thrilled with the adoption we saw before GA of the product, and it’s always a pleasure to speak directly with someone who is using nCluster to enable their frontline decision-making.

Lenin Gali, Director of BI, ShareThis


Lenin Gali, director of business intelligence for the online sharing platform ShareThis, is one such person. He recently sat down with us to discuss how Internet and social networking companies can successfully grow their business by rapidly analyzing and acting on their massive data.

You can read the full details of our conversation on the Aster Website.

Posted on October 6th, 2008 by Mayank Bawa

It is really remarkable how many companies today view data analytics as the cornerstone of their businesses.

aCerno is an advertising network that uses powerful analytics to predict which advertisements to deliver to which person at what time.  Their analytics are performed on completely anonymous consumer shopping data of 140M users obtained from an association of 450+ product manufacturers and multi-channel retailers.  There is a strong appetite at aCerno to perform analytics that they have not done before because each 1% uplift in the click-through rates is a significant revenue stream for them and their customers.

Aggregate Knowledge powers a discovery network (The Pique Discovery™ Network) that delivers recommendations of products and content based on what an individual previously purchased and viewed, using the collective behavior of crowds that behaved similarly in the past. Again, each 1% increase in engagement is a significant revenue stream for them and their customers.

ShareThis provides a sharing network via a widget that makes it simple for people to share things they find online with their friends. In the short time since its launch, ShareThis has reached over 150M unique monthly users. The amazing insight is that ShareThis knows which content users actually engage with and want to tell their friends about! And in a stroke of sheer genius, ShareThis gives its service away to publishers and consumers for free, relying on targeted advertising for its revenue: it can deliver relevant ad messages because it knows the characteristics of the thing being shared. Again, the better their analytics, the better their revenue.

Which brings me to my point: data analytics is a direct contributor to revenue gains in these companies.

Traditionally, we think of data warehousing as a back-office task. The data warehouse can be loaded in separate load windows; loads can run late (the net effect is that business users get their reports late); loads, backups, and scale-up can take data warehouses offline - which is OK, since these tasks can be done during non-business hours (nights/weekends).

But these companies rely on data analytics for their revenue.

- A separate exclusive load window implies that their service is not leveraging analytics during that window;
- A late-running load implies that the service is getting stale data;
- An offline warehouse implies that the service is missing fresh trends.

Any such planned or unplanned outage results in lower revenues.

On the flip side, a faster load/query provides the service a competitive edge – a chance to do more with their data than anyone else in the market. A nimbler data model, a faster scale-out, or a more agile ETL process helps them implement their “Aha!” insights faster and gain revenue from a reduced time-to-market advantage.

These companies have moved data warehousing from the back-office to the frontlines of business: a competitive weapon to increase their revenues or to reduce their risks.

In response, the requirements of a data warehouse that supports these frontline applications go up a few notches: the warehouse has to be available for querying and loading 24x7x365; the warehouse has to be fast and nimble; the warehouse has to allow “Aha!” queries to be phrased.

We call these use cases “frontline data warehousing”. And today we released a new version of Aster nCluster that rises those few notches to meet the demands of frontline applications.

Posted on October 6th, 2008 by Tasso Argyros

Back in the days when Mayank, George, and I were still students at Stanford, working hard to create Aster, we had a pretty clear vision of what we wanted to achieve: allow the world to do more analytics on more data. Aster has grown tremendously since those days, but that vision hasn’t changed. And one can see this very clearly in the new release of our software, Aster nCluster 3.0, which is all about doing more analytics with more data. Because 3.0 introduces so many important features, we tried to categorize them into three big buckets: Always Parallel, Always On, and In-Database MapReduce.

Always Parallel has to do with the “Big Data” part of our vision. We want to build systems that can handle 10x – 100x more data than any other system today. But this is too much data for any single “commodity server” (that is, a server with reasonable cost) that one can buy. So we put a lot of R&D effort into parallelizing every single function of the system – not only querying, but also loading, data export, backup, and upgrades. Plus, we allow our users to choose how much they want to parallelize all these functions, without having to scale up the whole system.

Always On also stems from the need to handle “Big Data”, but in a different way. In order for someone to store and analyze anything from a terabyte to a petabyte, she needs to use a system with more than a single server. But then availability and management can become a huge problem. What if a server fails? How do I keep going, and how do I recover from the failure (either by reintroducing the same server or a new replacement server) with no downtime? How can I seamlessly expand the system, to realize the great promise of horizontal scaling, without taking the system down? And, finally, how do I back up all these oceans of data without disrupting my system’s operation? All these issues are handled in our new 3.0 release.

We introduced In-Database MapReduce in a previous post, so I won’t spend too much time here. But I want to point out how this fits our overall vision. Having a database that is always parallel and always on allows you to handle Big Data with high performance, low cost, and high availability. But once you have all this data, you want to do more analytics - to extract more value and insights. In-Database MapReduce is meant to do exactly that: push the limits of what insights you can extract by providing the first-ever system that tightly integrates MapReduce (a powerful analytical paradigm) with a widespread standard like SQL.

These are the big features in nCluster 3.0, and in the majority of our marketing materials we stop here. But I also want to talk about the other great things we have in there; things too subtle or technical for the headlines, but still very important. We’ve added table compression features that offer online, multi-level compression for cost savings. With table compression, you can choose your compression ratio and algorithm and have different tables compressed differently. This paves the way for data life-cycle management that compresses data differently depending on its age.

We’ve also implemented richer workload management to offer quality of service, with fine-grained mixed-workload prioritization via priority- and fair-share-based resource queues.  You can even allocate resource weights based on transaction count or time (useful when both big and small jobs occur).

3.0 also has Network Aggregation (NIC “bonding”) for performance and fault tolerance. This is a one-click configuration that automates network setup - usually a tedious, error-prone sysadmin task. And that’s not the end of it - we are also introducing an Upgrade Manager that automates upgrades from one version of nCluster to another, including what most frequently breaks upgrades: the operating system components. This is another building block of the low cost of ongoing administration that we’re so proud of achieving with nCluster. I could go on and on (new SQL enhancements, new data validation tools, heterogeneous hardware support, LDAP authentication, …), but since blog space is supposed to be limited, I’ll stop here. (Check out our new resource library if you want to dig deeper.)

Overall, I am delighted to see how our product has evolved towards the vision we laid out years back. I’m also thrilled that we’re building a solid ecosystem around Aster nCluster – we now support all the major BI platforms – and are establishing quite a network of systems integrators to help customers with implementation of their frontline data warehouses. In a knowledge-based economy full of uncertainty, opportunities, and threats, doing more analytics on more data will drive the competitiveness of successful corporations – and Aster nCluster 3.0 will help you deliver just that for your own company.

Posted on October 1st, 2008 by Tasso Argyros

Perhaps if you’ve got a hammer, everything looks like a nail, but it does strike us here at Aster that one of the underlying causes of the liquidity crisis is a problem with data and analytics. After all, the reason that banks don’t want to lend to each other anymore is that they don’t trust that the other banks really know the value of their mortgage-backed securities (MBS) – mainly because they themselves don’t trust the value of their own.

So why is this? Why is it hard for banks to understand the value of a particular mortgage-backed security? Simple - a bank holding those MBSs doesn’t have access to the granular data on the underlying individual assets (the mortgages, with their underlying properties and payees) that make up the pooled asset. Simple questions such as “In what zip codes are the properties backing these mortgages, and by what percentage have median property prices changed since the mortgage was issued?” can’t easily be answered. In other words, unlike stocks, where a company’s full financial statements are readily available for valuation, the underlying value of an MBS is difficult to figure out because the data needed to price the underlying assets simply is unavailable.

The reason for this is that traditional MBS modeling has focused much more on the behavior of those securities under varying interest rate scenarios because, historically, the ratio of pre-payment vs. held-to-term of those mortgages has been the big swing factor in how much those securities were worth. When interest rates fall, people are more likely to pay off or refinance their existing mortgage, and when interest rates rise, people are more likely to hold onto their existing (now lower interest rate) mortgage. The default rate on the underlying mortgage was traditionally taken as a given, and not predictively modeled at all.

People sometimes ask us - what is the price of not implementing a world-class analytics capability? I think now we can answer, “$700 billion, and counting.” What has emerged as crucial in this crisis are two pieces of data that are not tracked by standard models: 1) Is the current property value above or below the value of the mortgage(s) on the property? And 2) Can the mortgage holder afford his or her mortgage payments?

This has to change. If mortgages are going to continue to be securitized in the future, it will be necessary to have valuation models that track details on the underlying real assets (specifically, the property values and the owners’ ability to pay) on at least a monthly, if not more frequent, basis.  We would propose a national centralized registry for the mortgage securitization industry that matches property locations to mortgage pools, and provides a number of valuation models for those properties (analogous to Zillow’s property valuation model). It would also provide ongoing tracking of either a FICO score or more granular credit quality information for those mortgage holders.  In addition - and this is crucial - a centralized repository of the detailed data disclosed by the mortgage applicant in the original approved mortgage application needs to be retained in this registry. On top of this shared database, financial services companies would be free to create valuation and pricing models based on whatever predictive drivers they believe influence the default and payment rates.

Obviously such a centralized repository would require bullet-proof security and privacy handling, but such a registry needs to be established if we’re not going to fall into this mess all over again. Similar registries would be appropriate for student loan pools and other securitized loan categories such as auto-loans. Is anyone else thinking that this is all a matter of data?

Posted on August 26th, 2008 by Mayank Bawa

Pardon the tongue-in-cheek analogy to Oldsmobile when describing user-defined functions (UDFs), but I want to draw out some distinctions between traditional UDFs and the new class of functions that In-Database MapReduce enables.

Not Your Granddaddy's Oldsmobile

While similar on the surface, in practice there are stark differences between Aster In-Database MapReduce and traditional UDFs.

MapReduce is a framework that parallelizes procedural programs, taking on the work of traditional cluster programming. UDFs are simple database functions, and while there are some syntactic similarities, that’s where the similarity ends. Several major differences between In-Database MapReduce and traditional UDFs include:

Performance: UDFs have limited or no parallelization capabilities in traditional databases (even MPP ones).  Even where UDFs are executed in parallel in an MPP database, they are limited to accessing local node data, have byzantine memory-management requirements, and require multiple passes and costly materialization.  In contrast, In-Database MapReduce automatically executes SQL/MR functions in parallel across potentially hundreds or even thousands of server nodes in a cluster, all in a single-pass (pipelined) fashion.

Flexibility: UDFs are not polymorphic. Some variation in input/output schema may be allowed by capabilities like function overloading or permissive data-type handling, but that tends to greatly increase the burden on the programmer to write compliant code.  In contrast, In-Database MapReduce SQL/MR functions are evaluated at run-time to offer dynamic type inference, an attribute of polymorphism that offers tremendous adaptive flexibility previously found only in mid-tier object-oriented programming.

Manageability: UDFs are generally not sandboxed in production deployments. Most UDFs are executed in-process by the core database engine, which means bad UDF code can crash the database. SQL/MR functions execute in their own process for full fault isolation (bad SQL/MR code results in an aborted query, leaving other jobs uncompromised). A strong process-management framework also ensures proper resource management for consistent performance and progress visibility.
