Archive for August, 2008

 
Posted by Mayank in MapReduce on August 26, 2008

Pardon the tongue-in-cheek analogy to Oldsmobile when describing user-defined functions (UDFs), but I want to draw out some distinctions between traditional UDFs and the new class of functions that In-Database MapReduce enables.

Not Your Granddaddy's Oldsmobile

While similar on the surface, Aster In-Database MapReduce and traditional UDFs differ starkly in practice.

MapReduce is a framework that parallelizes procedural programs, taking the burden of traditional cluster programming off the developer. UDFs are simple database functions, and while the two share some syntax, that’s where the similarity ends. Several major differences between In-Database MapReduce and traditional UDFs include:

Performance: UDFs have limited or no parallelization capabilities in traditional databases (even MPP ones). Even where UDFs are executed in parallel in an MPP database, they’re limited to accessing local node data, have byzantine memory management requirements, and require multiple passes and costly materialization. In contrast, In-Database MapReduce automatically executes SQL/MR functions in parallel across potentially hundreds or even thousands of server nodes in a cluster, all in a single-pass (pipelined) fashion.

Flexibility: UDFs are not polymorphic. Some variation in input/output schema may be allowed by capabilities like function overloading or permissive data-type handling, but that tends to greatly increase the burden on the programmer to write compliant code. In contrast, In-Database MapReduce SQL/MR functions are evaluated at run-time to offer dynamic type inference, an attribute of polymorphism that offers tremendous adaptive flexibility previously only found in mid-tier object-oriented programming.

Manageability: UDFs are generally not sandboxed in production deployments. Most UDFs are executed in-process by the core database engine, which means bad UDF code can crash a database. SQL/MR functions execute in their own process for full fault isolation (bad SQL/MR code results in an aborted query, leaving other jobs uncompromised). A strong process management framework also ensures proper resource management for consistent performance and progress visibility.
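To make the fault-isolation point concrete, here is a minimal Python sketch of the general idea: run user-supplied code in a child process so that a crash fails only that one call. This is not how nCluster is implemented; the helper `run_isolated` and the two sample "functions" are hypothetical names invented for illustration.

```python
import os
import subprocess
import sys
from typing import Tuple


def run_isolated(source: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run user-supplied code in a separate Python process.

    Returns (ok, output). If the child crashes or hangs, only this
    call fails; the calling process keeps running.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if proc.returncode != 0:
        return False, "aborted with exit code %d" % proc.returncode
    return True, proc.stdout.strip()


# Hypothetical user functions: one well-behaved, one that hard-crashes.
GOOD_FUNCTION = "print(sum(range(10)))"
BAD_FUNCTION = "import os; os.abort()"  # simulates a crash in user code

good = run_isolated(GOOD_FUNCTION)
bad = run_isolated(BAD_FUNCTION)

print(good)     # the well-behaved function returns its result
print(bad[0])   # the crashing function is reported as failed
print(os.getpid() > 0)  # the host process is still alive
```

Had the bad function run in-process, the abort would have taken the host down with it; here it surfaces as a failed call, which is the analogue of an aborted query.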



 
Posted by Mayank in MapReduce on August 25, 2008

I’m unbelievably excited about our new In-Database MapReduce feature!

Google has used MapReduce and GFS for PageRank analysis, but the sky is really the limit for anyone building powerful analytic apps. Curt Monash has posted an excellent compendium of applications that are successfully leveraging the MapReduce paradigm today.

A few examples of SQL/MapReduce functions that we’ve collaborated with our customers on so far:

1. Path Sequencing: SQL/MR functions can be used to develop regular-expression matching of complex path sequences (e.g., time-series financial analysis or clickstream behavioral recommendations). They can also be extended to discover Golden Paths that reveal interesting behavioral patterns useful for segmentation, issue resolution, and risk optimization.

2. Graph Analysis: SQL/MR functions can address many interesting graph problems that depend on graph traversal, such as BFS (breadth-first search), SSSP (single-source shortest path), APSP (all-pairs shortest path), and PageRank.

3. Machine Learning: several statistical algorithms, such as linear regression, clustering, collaborative filtering, naive Bayes, support vector machines, and neural networks, can be used to solve hard problems like pattern recognition, recommendations/market basket analysis, and classification/segmentation.

4. Data Transformations and Preparation: Large-scale transformations can be parameterized as SQL/MR functions for data cleansing and standardization, unleashing the true potential of Extract-Load-Transform pipelines and making large-scale data model normalization feasible. Push-down also enables rapid discovery and data pre-processing to create analytical data sets for advanced analytics tools such as SAS and SPSS.
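As a rough illustration of the path sequencing idea in item 1, the sketch below groups clickstream events by user, orders each user's events by time, encodes the path as a string of symbols, and matches it against a regular expression. It is plain Python rather than SQL/MR, and the page names, the `SYMBOLS` encoding, the pattern, and the function `matching_users` are all assumptions made up for this example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical encoding of page types as single characters.
SYMBOLS = {"home": "H", "search": "S", "product": "P", "cart": "C", "checkout": "K"}

# Path of interest: a search, one or more product views, then cart and checkout.
PATTERN = re.compile(r"SP+CK")

Event = Tuple[str, int, str]  # (user_id, timestamp, page)


def matching_users(events: Iterable[Event]) -> List[str]:
    """Return the users whose time-ordered click path contains PATTERN."""
    # Partition the events by user (the per-key grouping step).
    by_user: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for user, ts, page in events:
        by_user[user].append((ts, page))

    matches = []
    for user, rows in by_user.items():
        rows.sort()  # order each user's events by timestamp
        path = "".join(SYMBOLS[page] for _, page in rows)
        if PATTERN.search(path):
            matches.append(user)
    return sorted(matches)


events = [
    ("u1", 1, "home"), ("u1", 2, "search"), ("u1", 3, "product"),
    ("u1", 4, "product"), ("u1", 5, "cart"), ("u1", 6, "checkout"),
    ("u2", 1, "home"), ("u2", 2, "product"), ("u2", 3, "cart"),
    ("u3", 3, "product"), ("u3", 1, "search"), ("u3", 5, "checkout"),
    ("u3", 4, "cart"),
]

result = matching_users(events)
print(result)  # ['u1', 'u3']
```

Because each user's path is matched independently, the per-user work is the kind of step that can be spread across nodes once the data is partitioned by user.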




 
Posted by Mayank in MapReduce on August 25, 2008

I am very pleased to announce today that Aster nCluster now brings together the expressive power of a MapReduce framework with the strengths of a Relational Database!

Jeff Dean and Sanjay Ghemawat at Google invented the MapReduce framework in 2004 for processing large volumes of unstructured data on clusters of commodity nodes. Jeff and Sanjay’s goal was to provide a trivially parallelizable framework so that even novice developers (a.k.a. interns) could write programs in a variety of languages (Java/C/C++/Perl/Python) to analyze data independent of scale. And they have certainly succeeded.

Once implemented, the same MapReduce framework has been used successfully within Google (and outside, via the Yahoo!-sponsored Apache Hadoop project) to analyze structured data as well.
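For readers new to the model, here is a minimal single-machine sketch of the map, shuffle, and reduce phases, using the classic word-count example. It only illustrates the programming model; it is neither Google's implementation nor Aster's, and the function names are chosen for this example. A real framework runs the map and reduce calls in parallel across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # Each map call sees one record and each reduce call sees one key,
    # which is what lets a framework distribute them freely.
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["the quick fox", "the lazy dog", "The end"])
print(counts["the"])  # 3
```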




 
Posted by Mayank in Business Intelligence on August 5, 2008

Today we are pleased to welcome Pentaho as a partner to Aster Data Systems. What this means is that our customers can now use Pentaho open-source BI products for reporting and analysis on top of Aster nCluster.

We have been working with Pentaho for some time on testing the integration between their BI products and our analytic database. We’ve been impressed with Pentaho’s technical team and the capabilities of the product they’ve built together with the open source community. Pentaho recently announced a new iPhone application, which is darn cool!

I guess, by induction, Aster results can be seen on the iPhone too.

:-)