By Mayank Bawa in Analytics on April 26, 2010

I’ve remarked in an earlier post that the usage of data is changing and new applications are on the horizon. In the last couple of years, we’ve observed interesting design patterns for business processes that use data.

In a previous post, I outlined a design pattern that we call “The Automated Feedback Loop.” In this post, I want to outline a design pattern that we call “The (Iterative) Analytics Data Warehouse”.

The traditional, well-understood design pattern of a data warehouse is a central (for an Enterprise Data Warehouse) or departmental (for Data Marts) repository of data. Data is fed into the warehouse by ETL processes that pull it from a variety of sources. The data is organized in a data model that caters to three use cases of the warehouse (a SQL sketch of each follows the list):

  1. Reports - A set of BI queries are run at a regular frequency to monitor the state of the business. The target of the reports is business users who want to understand what happened. The goal is to keep them in touch with the pulse of the business.
  2. Exports - A set of export jobs are run at an adhoc frequency to provide data sets for further analysis. The target of the exports is business analysts who want to optimize business practices. The goal is to provide them with true, quality-stamped data so that they can make confident optimization recommendations.
  3. Adhoc - A set of queries are run at an adhoc frequency to detect or verify patterns that influence business events. The source of the queries is data scientists who want to understand and optimize business practices. The goal is to provide them with computation capabilities (good query interfaces; enough processing, memory and storage resources) to allow them to interact with the data.
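
As a rough illustration, here is what each of the three use cases might look like in SQL. The orders and customers tables and their columns are assumed purely for this sketch; they are not from any particular warehouse.

    -- 1. Report: a canned BI query, run on a schedule
    SELECT order_date, SUM(amount) AS daily_revenue
    FROM   orders
    GROUP  BY order_date;

    -- 2. Export: an extract handed off to an analyst's own tools
    CREATE TABLE export_q2_customers AS
    SELECT c.customer_id, c.region, o.order_date, o.amount
    FROM   orders o JOIN customers c ON o.customer_id = c.customer_id
    WHERE  o.order_date BETWEEN DATE '2010-04-01' AND DATE '2010-06-30';

    -- 3. Adhoc: an exploratory query probing for a pattern
    SELECT c.region,
           SUM(CASE WHEN o.amount > 100 THEN 1 ELSE 0 END) * 1.0 / COUNT(*)
             AS big_order_ratio
    FROM   orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP  BY c.region;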

The exports and adhoc tasks are transient tasks. Once the business analysts or data scientists find a pattern valuable to the business, that pattern is incorporated into a report so that business users can monitor it on a frequent, repeatable basis.

In a typical data warehouse, the bulk of the tasks (~80%) come from [1] Reports. The remaining ~20% come from [2] Exports and [3] Adhoc.

Since Reports are frequent and generate known queries, the data warehouse is designed to cater to reporting. This includes data models, indexes, materialized views or derived tables - and other optimizations - that make the known Reporting queries go fast.
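
For example, a Reporting Data Warehouse might serve the canned daily-revenue report from a pre-built summary table that the nightly ETL maintains. The table and index names below are illustrative only:

    -- Summary table maintained by the nightly ETL for the known report
    CREATE TABLE daily_revenue_summary AS
    SELECT order_date, SUM(amount) AS daily_revenue, COUNT(*) AS order_count
    FROM   orders
    GROUP  BY order_date;

    CREATE INDEX idx_daily_revenue_date ON daily_revenue_summary (order_date);

    -- The scheduled report now scans a small summary table
    -- instead of the full orders fact table
    SELECT order_date, daily_revenue
    FROM   daily_revenue_summary
    WHERE  order_date >= DATE '2010-04-01';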

Since exports and adhoc tasks are infrequent and generate unknown queries, the data warehouse cannot be designed to cater to them upfront. This means that exports and adhoc tasks generate queries that are harder to satisfy (since they have to fight the data modeling decisions made for reporting) and therefore impose more load on the data warehouse.
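
Continuing the illustrative schema above: an adhoc question that needs detail the summary rolled away has to go back to the raw fact table and work around the reporting-oriented model. (Date arithmetic syntax varies by database; this is only a sketch.)

    -- The summary table has no customer_id, so finding the shortest gap
    -- between any two orders a customer has placed must scan and
    -- self-join the full orders fact table
    SELECT o1.customer_id,
           MIN(o2.order_date - o1.order_date) AS shortest_reorder_gap_days
    FROM   orders o1
    JOIN   orders o2
      ON   o2.customer_id = o1.customer_id
     AND   o2.order_date  > o1.order_date
    GROUP  BY o1.customer_id;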

The net result is that reports run fast while exports and adhocs are slow. In fact, exports and adhocs consume so many resources that reporting starts running slower. And that is not good - reports are distributed widely and reach a wide variety of business users. These users are unhappy and put pressure on the data warehousing team to “get the reports in time.” At the same time, the export and adhoc users are unhappy because they can’t get to the data fast enough to benefit the business.

The poor data warehouse team now looks for solutions, invariably prioritizing reports. Historically, the answer was to prioritize via workload management - constraining adhoc and export usage to devote resources to reporting. If workload management didn’t work, the answer was to create walled gardens by enforcing rules -

“thou shalt not report when data is loading; thou shalt not adhoc query when reports are being generated; thou shalt not export except in the evening”.

Let’s call this design pattern of a data warehouse a “Reporting Data Warehouse”.

The first-class citizens of a “Reporting Data Warehouse” are reports. Exports and adhoc are second-class citizens - they do not get dedicated data models, they do not get large chunks of resources, and their protests are answered by asking them to constrain their requirements (use samples; use rolled-up aggregates that were built to make reports faster; use smaller timeframes of history that were retained just to satisfy reporting requirements; phrase simpler queries even though they may compromise the pattern sought; …).

This motivates the definition of a different design pattern: a data warehouse whose purpose is to be an “Analytical (Iterative) Data Warehouse”.

The first-class citizens of an “Analytical Data Warehouse” are exports and adhoc analytics, and its primary users are business analysts and data scientists. The data models are built to support their adhoc usage: fine-granularity data is retained; rich dimension tables are frequently imported; derived views and tables are created promptly; interfaces are opened up so that patterns can be expressed in a computationally simple and natural manner; scale-out is used to provide enough resources for tasks to finish interactively; and enough storage is allocated for several exports to proceed simultaneously.
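
To make that concrete, here is a minimal sketch of what “retain fine-granularity data” and “create derived views promptly” can look like; the event table and view are hypothetical, not taken from any specific deployment:

    -- Fine-granularity events are kept as loaded, not pre-aggregated away
    CREATE TABLE order_events (
        event_id    BIGINT,
        customer_id BIGINT,
        product_id  BIGINT,
        event_ts    TIMESTAMP,
        event_type  VARCHAR(32),    -- e.g., 'viewed', 'carted', 'purchased'
        amount      DECIMAL(12,2)
    );

    -- A derived view is created promptly for one analyst's line of inquiry,
    -- without waiting for a formal data-modeling cycle
    CREATE VIEW cart_abandonment AS
    SELECT customer_id, product_id,
           MIN(CASE WHEN event_type = 'carted' THEN event_ts END) AS first_carted
    FROM   order_events
    GROUP  BY customer_id, product_id
    HAVING SUM(CASE WHEN event_type = 'carted'    THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN event_type = 'purchased' THEN 1 ELSE 0 END) = 0;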

In other words, the infrastructure and the team treat export and adhoc users as their primary customers.

Once an insight is confirmed, it can be added to the Reporting Data Warehouse with careful design - carefully defined data models, indexes and materialized views, and support to maintain them during the ETL process.

The infrastructure can play a significant role in enabling Analytical Data Warehouses.

  1. Query interface to support in-database computations: The export and adhoc queries want to manipulate data in rich ways, and often SQL is not enough. The infrastructure should support an easy-to-express interface for rich computations (e.g., SQL/MapReduce; a sketch follows this list). This is important because the correct perspective of data, amenable to downstream manipulation, cannot be defined upfront in the ETL process.
  2. Incremental scale-out MPP capabilities: The infrastructure should make it possible to scale out both storage and computing resources incrementally and easily (i.e., without months of planning). This is important because temporary storage requirements (as data is transformed for analysis) or temporary processing requirements (as several models are generated to validate insights) at the peak of the analysis can be much higher than in normal use.
  3. Cheap hardware: The size of data demanded by exports and adhoc users may be large and the computations may be rich. The infrastructure must enable analysis of data at a cheap operating point.
  4. Workload management across both exports and analysis: Some data analysts are comfortable manipulating data in their preferred tools (e.g., Excel, SAS, Matlab); others are comfortable writing in-database queries (SQL, Java, C++, Perl, Python). The infrastructure should elegantly manage all of these tasks, and not require a “walled garden” that favors queries over exports, or vice versa.
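
As an illustration of point [1], sessionizing a clickstream is awkward to express in plain SQL but natural through a SQL/MapReduce-style table function. The function name and argument clauses below are meant to convey the style only; they are not the exact syntax of any particular product:

    -- A table function partitions the clickstream by user, orders it by
    -- time, and emits a session id for each click
    SELECT user_id, page_url, click_ts, session_id
    FROM   sessionize(
             ON clicks
             PARTITION BY user_id
             ORDER BY click_ts
             TIMEOUT (1800)   -- a 30-minute gap starts a new session
           );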

My observation has been that the design methodology of an Analytical Data Warehouse is substantially different from that of a Reporting Data Warehouse. Understanding the primary customer of a data warehouse can often help simplify its operations and substantially lower its operating costs by making priorities clearer.

