Archive for April, 2010

Design Patterns - The (Iterative) Analytical Data Warehouse
By Mayank Bawa in Analytics on April 26, 2010

I’ve remarked in an earlier post that the usage of data is changing and new applications are on the horizon. In the last couple of years, we’ve observed interesting design patterns for business processes that use data.

In a previous post, I outlined a design pattern that we call “The Automated Feedback Loop.” In this post, I want to outline a design pattern that we call “The (Iterative) Analytical Data Warehouse.”

The traditional well-understood design pattern of a data warehouse is a central (for Enterprise Data Warehouse) or departmental (for Data Marts) repository of data. Data is fed into the warehouse from ETL processes that pull data from a variety of sources. The data is organized in a data model that caters to 3 use-cases of the warehouse:

  1. Reports - A set of BI queries are run with regular frequency to monitor the state of the business. The target of the reports is business users who want to understand what happened. The goal is to keep them in touch with the pulse of the business.
  2. Exports - A set of export jobs are run on an ad hoc basis to provide data sets for further analysis. The target of the exports is business analysts who want to optimize business practices. The goal is to provide them with true, quality-stamped data so that they can make confident optimization recommendations.
  3. Ad hoc - A set of queries are run on an ad hoc basis to detect or verify patterns that influence business events. The source of the queries is data scientists who want to understand and optimize business practices. The goal is to provide them with computation capabilities (good query interfaces; sufficient processing, memory, and storage resources) to allow them to interact with the data.

The export and ad hoc tasks are transient. Once the business analysts or data scientists find a pattern valuable to the business, that pattern is incorporated into a report so that business users can monitor it on a frequent, repeatable basis.

In a typical data warehouse, the bulk of tasks (~80%) come from [1] Reports. The remaining ~20% come from [2] Exports and [3] Ad hoc.

Since Reports are frequent and generate known queries, the data warehouse is designed to cater to reporting. This includes data models, indexes, materialized views or derived tables - and other optimizations - to make the known Reporting queries go fast.
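The effect of such an optimization can be sketched with a toy example. The snippet below uses Python’s sqlite3 module and an invented sales schema (all table and column names are hypothetical) to show how a derived summary table, built once by an ETL job, lets a known, frequently run report read a small pre-aggregated table instead of scanning the raw fact table on every run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical fact table of raw sales events (names are illustrative).
cur.execute("CREATE TABLE sales (sale_day TEXT, region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2010-04-01", "east", 100.0),
     ("2010-04-01", "west", 250.0),
     ("2010-04-02", "east", 75.0)],
)

# A derived (summary) table, refreshed by the ETL process, pre-aggregates
# the fact table along the dimensions the known report needs.
cur.execute(
    """CREATE TABLE daily_revenue AS
       SELECT sale_day, region, SUM(amount) AS revenue
       FROM sales
       GROUP BY sale_day, region"""
)

# The recurring report query is now a cheap scan of the summary table.
report = cur.execute(
    """SELECT sale_day, SUM(revenue)
       FROM daily_revenue
       GROUP BY sale_day
       ORDER BY sale_day"""
).fetchall()
print(report)  # [('2010-04-01', 350.0), ('2010-04-02', 75.0)]
```

In a real warehouse the same idea is expressed with materialized views or ETL-maintained aggregate tables, but the trade-off is identical: precompute for the known report queries, at the cost of refresh work on load.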

Dell and Aster Data: A Powerful Combination for Large-Scale Customers
By dkloc in Cloud Computing on April 26, 2010

If you read this blog, you’ve probably seen the news about the partnership between Aster Data and Dell on their new PowerEdge C-Series servers. Together we have enabled some really successful customers such as MySpace, and proven that Dell hardware with Aster Data software easily scales to support large-scale data warehousing and advanced analytics.

Aster Data CEO Mayank Bawa explains this combination in more detail, as well as Aster Data’s history and the distinct advantages offered by the partnership with Dell, including Online Precision Scaling, out-of-the-box advanced in-database analytics, and always-on availability.

By Tasso Argyros in Analytics, Blogroll on April 16, 2010

This Monday we announced a new web destination for MapReduce. At a high level, this site is the first consolidated source of information & education around MapReduce, the groundbreaking programming model that is rapidly revolutionizing the way people deal with big data. Our vision is to make this site the one-stop shop for anyone looking to learn how MapReduce can help analyze large amounts of data.

There were a couple of reasons why we thought the world of big data analytics needed a resource like this. First, MapReduce is a relatively new technology, and we are constantly getting questions from people in the industry wanting to learn more about it, from basic facts to using MapReduce for complex data analytics at petabyte scale. By placing our knowledge and references in one public destination, we hope to build a valuable self-serve resource that educates many more people than we could ever reach directly. In addition, we were motivated by the fact that most MapReduce resources out there focus on specific implementations of MapReduce, which fragments the available knowledge and reduces its value. In this new effort we hope to create a multi-vendor & multi-tool resource that will benefit anyone interested in MapReduce.
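For readers new to the model, here is a minimal, single-process sketch of the MapReduce pattern in Python, using the canonical word-count example. A real framework would run the map and reduce phases in parallel across many machines; this sketch only shows the shape of the programming model:

```python
from collections import defaultdict

def map_phase(doc):
    # The map function emits (key, value) pairs: one ('word', 1) per occurrence.
    for word in doc.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Between phases, the framework groups all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # The reduce function combines the values for one key into a result.
    return (key, sum(values))

docs = ["big data big analytics", "data warehouse"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'analytics': 1, 'warehouse': 1}
```

The appeal of the model is that the programmer writes only `map_phase` and `reduce_phase`; the framework handles distribution, shuffling, and fault tolerance.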

We’re already working with analysts such as Curt Monash, Merv Adrian, Colin White and James Kobielus to syndicate their MapReduce-related posts. Going forward, we expect even more analysts, bloggers, practitioners, vendors, and academics to contribute. If traffic grows like we expect, we may eventually add a community forum to aid in interaction and sharing of knowledge and best practices.

I hope you enjoy surfing this new site! Feel free to email me with any suggestions as we work to make it more useful for you.

2010 and Beyond - Data Clouds and Next-generation Analytics
By Mayank Bawa in Analytics, Cloud Computing, MapReduce on April 15, 2010

In the last few years there has been a significant amount of market pickup, from users and vendors, on data clouds and advanced analytics - specifically a new class of data-driven applications run in a data cloud or on-premise. What’s different about this from past approaches is the frequency and speed at which these applications are accessed, the depth of the analysis, the number of data sources involved and the volume of data mined by these applications - terabytes to petabytes. In the midst of this cacophony of dialogue, recent announcements from vendors in this space are helping to clarify different visions and approaches to the big data challenge.

Both Aster Data and Greenplum made announcements this week that illustrated different approaches. At the same time that Aster Data announced the Aster Analytics Center, Greenplum announced an upcoming product named Chorus. I wanted to take a moment to compare and contrast what these announcements say about the direction of the two companies.

Greenplum’s approach speaks to two traditional problem areas: i) access to data, from provisioning of data marts to connectivity across marts, and ii) some level of collaboration among developers and analysts. Their approach is to create a tool for provisioning, unified data access, and sharing of annotations and data among different developers and analysts. Interestingly, this is not an entirely new concept; these are well-known problems for which a number of companies and tools have already developed best-of-breed solutions over the last 15 years. For example, the capabilities for data access are another version of the Export/Copy primitives that already exist in all databases, and that common ETL and EII tools have built upon for cases in which richer support than Export & Copy is needed - for instance, when data has to be transformed, correlated, or cleaned while being moved from one context (mart) to another.

This approach is indicative of a product direction in which the primary focus is on adding another option to the list of tools available to customers for these problems. It is not a ground-breaking innovation that evolves the world of analytics. New types of analytics, or ‘data-driven applications,’ are where the enormous opportunity lies. The Greenplum approach of data collaboration is interesting in a test environment or sandbox. When it comes to real production value, however, it effectively increases the functions available to the end user, but at a big cost: significant increases in complexity, security issues, and extra administrative overhead. What does this mean exactly?

  • The spin-up of marts and moving data around can result in “data sprawl” which ultimately increases administrative overhead and is dangerous in these days of compliance and sensitivity to privacy and data leaks.
  • Adding a new toolset into the data processing stack creates difficult and painful work to either manage and administer multiple tool sets for similar purposes or to eliminate and transition away from investments in existing toolsets.
  • To enable effective communication and sharing, users need strong processes and features for source identification of data, data collection, data transformation, rule administration, error detection & correction, data governance and security. The quality and security policies around meta-data are especially important as free-form annotations can lead to propagation of errors or leaks in the absence of strong oversight.

In contrast, Aster Data’s recent announcements support our long-standing investments in our unique advanced in-database architecture, where applications run fully inside Aster Data’s platform with the complete application services essential for complex analytic applications. The announcements highlight that our vision is not to create a new set of tools and layers in the data stack that recreate capabilities already available from a number of leading vendors. Rather, it is to deliver a new Analytics Platform - a Data-Application Server - that uniquely enables analytics professionals to create data-rich applications that were impossible or impractical before: to create and use advanced analytics for rich, rapid, and scalable insights into their data. This focus is complemented by our partners, who offer proven best-of-breed solutions for collaboration and data transformation.


Advanced Analytics Demystified
By Steve Wooledge in Uncategorized on April 13, 2010

It looks like Aster Data isn’t the only company hoping to demystify advanced analytics as a growing number of organizations try to decide whether to dip their toe in the water or dive in completely. There was an interesting article from Doug Henschen of Intelligent Enterprise last week on their launch of the “Advantage in Analytics” TechCenter.

The main impediments to growth for the analytics market have been “lack of awareness, skills and a demand for fact-based decision-making within organizations,” Doug writes. This observation preceded our announcement of the Aster Analytics Center earlier this week, which supports those very goals.

The Aster Analytics Center provides best practices and ready-to-use analytics solutions to jump-start development of advanced in-database analytics. Further, it provides valuable information to companies interested in understanding MapReduce-based analytics and learning how other companies have built rich data-driven applications using MapReduce and SQL-MapReduce. We expect these resources, along with Intelligent Enterprise’s “Advantage in Analytics” TechCenter, to be integral in increasing awareness and skills. And as more organizations learn the benefits, demand will become nearly universal. Hats off to Intelligent Enterprise for supporting these goals.