By Mayank Bawa in Analytics, Blogroll, Business analytics, MapReduce on August 25, 2008

I’m unbelievably excited about our new In-Database MapReduce feature!

Google has used MapReduce and GFS for PageRank analysis, but the sky is really the limit for anyone who wants to build powerful analytic apps. Curt Monash has posted an excellent compendium of applications that are successfully leveraging the MapReduce paradigm today.

A few examples of SQL/MapReduce functions that we’ve collaborated with our customers on so far:

1. Path Sequencing: SQL/MR functions can be used for regular expression matching of complex path sequences (e.g. time series financial analysis or clickstream behavioral recommendations). They can also be extended to discover Golden Paths that reveal interesting behavioral patterns useful for segmentation, issue resolution, and risk optimization.

2. Graph Analysis: SQL/MR functions can express many interesting graph problems that depend on graph traversal, such as BFS (breadth-first search), SSSP (single-source shortest path), APSP (all-pairs shortest path), and PageRank.

3. Machine Learning: statistical algorithms such as linear regression, clustering, collaborative filtering, naive Bayes, support vector machines, and neural networks can be used to solve hard problems like pattern recognition, recommendations/market basket analysis, and classification/segmentation.

4. Data Transformations and Preparation: Large-scale transformations can be parameterized as SQL/MR functions for data cleansing and standardization, unleashing the true potential of Extract-Load-Transform pipelines and making large-scale data model normalization feasible. Push-down also enables rapid discovery and data pre-processing to create the analytical data sets used by advanced analytics tools such as SAS and SPSS.
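To make item 1 a little more concrete, here is a minimal sketch in plain Python (not SQL/MR and not Aster's implementation) of regular expression matching over clickstream paths. Each user's time-ordered page views are encoded as a string of one-letter symbols, and a regex then picks out the users whose path contains a given sequence. The `SYMBOLS` mapping, the function names, and the sample events are all hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical mapping from page type to a one-letter symbol.
SYMBOLS = {"home": "H", "search": "S", "product": "P", "cart": "C", "checkout": "K"}


def paths_by_user(events):
    """Build one symbol string per user from (timestamp, userid, page) events."""
    per_user = defaultdict(list)
    # Sorting the tuples orders events by timestamp first.
    for _ts, userid, page in sorted(events):
        per_user[userid].append(SYMBOLS[page])
    return {userid: "".join(path) for userid, path in per_user.items()}


def match_users(events, pattern):
    """Return the sorted user ids whose path contains a match for `pattern`."""
    rx = re.compile(pattern)
    return sorted(u for u, path in paths_by_user(events).items() if rx.search(path))


events = [
    (1, "u1", "home"), (2, "u1", "search"), (3, "u1", "product"),
    (4, "u1", "cart"), (5, "u1", "checkout"),
    (1, "u2", "home"), (2, "u2", "search"), (3, "u2", "cart"),
]

# One or more searches, one or more product views, then cart and checkout.
print(match_users(events, r"S+P+CK"))
```

In a parallel setting, the per-user grouping would correspond to partitioning the clickstream by user so that each partition can be matched independently.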

These are just a few simple examples Aster has developed for our customers and partners via Aster’s In-Database MapReduce to help them with rich analysis and transformations of large data.

I’d like to finish with a short code snippet for a simple yet powerful SQL/MR function we’ve developed called “Sessionization”.

Our Internet customers have told us that defining a user session can’t be done easily (if at all) using standard SQL. One possibility is to use cookies, but users frequently remove them or they expire.


Aster developed a simple “Sessionization” SQL/MR function via our standard Java API library to easily parameterize the discovery of a user session. A session is defined by a timeout value (e.g. in seconds). If the elapsed time between consecutive click events is greater than the timeout, a new session has begun for that user.
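The timeout rule above can be sketched in a few lines of plain Python. This is only an illustration of the rule, not Aster's Java implementation: the function signature, the tuple layout, and the choice to number sessions per user starting at 0 are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(clicks, timeout):
    """Assign a session number to each (timestamp, userid) click.

    A new session starts for a user when the gap since that user's
    previous click is greater than `timeout` (same units as timestamp).
    Session numbers restart at 0 for every user (an assumption), so
    the pair (userid, session) identifies a session.
    """
    out = []
    # Partition by user, then order each user's clicks by time.
    ordered = sorted(clicks, key=itemgetter(1, 0))
    for userid, group in groupby(ordered, key=itemgetter(1)):
        session = 0
        prev_ts = None
        for ts, _ in group:
            if prev_ts is not None and ts - prev_ts > timeout:
                session += 1
            out.append((ts, userid, session))
            prev_ts = ts
    return out


clicks = [(0, "a"), (100, "a"), (800, "a"), (50, "b"), (5000, "b")]
for row in sessionize(clicks, 600):
    print(row)
```

With a 600-second timeout, user "a" has a 700-second gap between the clicks at 100 and 800, so the click at 800 opens that user's second session.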

From a user perspective, the input is user clicks (e.g. timestamp, userid). The output associates each click with a unique session identifier based on the Java procedure noted above. Here’s the simple syntax:

SELECT timestamp, userid, sessionid
FROM sessionize("timestamp", 600) ON clickstream
SEQUENCE BY timestamp

Indeed, it is that simple.

So simple that we have reduced a complex multi-hour Extract-Load-Transform task into a toy example. That is the power of In-Database MapReduce!


Three approaches to parallelizing data transformation | DBMS2 -- DataBase Management System Services on August 26th, 2008 at 1:03 pm #

[...] Of The Week: MapReduce. When I posted a list of canonical MapReduce applications, my friends at Aster Data offered one pushback — I left out the area of data transformation. As CEO Mayank Bawa puts [...]

Vikram Shukla on August 26th, 2008 at 9:12 pm #

Oracle supports table functions which can do what sessionize() does here and potentially parallelize it too.

Microsoft supports table-valued functions too. http://msdn.microsoft.com/en-us/library/aa964138(SQL.90).aspx

More generally, one can potentially embed map-reduce sequences in table functions.

[...] introduced In-Database MapReduce in a previous post so I won’t spend too much time here. But I want to point out how this fits our overall vision. [...]

Infology.Ru » Blog Archive » ??? ??????? ? ????????????????? ???????? ?????????????? ?????? on October 9th, 2008 at 1:36 pm #

[...] ???????????? ?????????? MapReduce, ??? ?????? ?? ???????? Aster Data ?????????? ??? ??? ???? – ? ??????? ????? [...]

Aster Data nPath | DBMS2 -- DataBase Management System Services on February 21st, 2010 at 1:05 am #

[...] it.  (Steve Wooledge’s blog post about nPath outlines why that might be needed.  Point 1 in Mayank Bawa’s August, 2008 post is much more concise. )  Now, that might seem to contradict the syntax, which is all about [...]
