I’m unbelievably excited about our new In-Database MapReduce feature!
Google has used MapReduce and GFS for PageRank analysis, but the sky is really the limit for anyone who wants to build powerful analytic apps. Curt Monash has posted an excellent compendium of applications that are successfully leveraging the MapReduce paradigm today.
A few examples of SQL/MapReduce functions that we’ve collaborated with our customers on so far:
1. Path Sequencing: SQL/MR functions can be used to develop regular expression matching of complex path sequences (e.g. time series financial analysis or clickstream behavioral recommendations). They can also be extended to discover Golden Paths that reveal interesting behavioral patterns useful for segmentation, issue resolution, and risk optimization.
2. Graph Analysis: many interesting graph problems that depend on graph traversal, such as BFS (breadth-first search), SSSP (single-source shortest path), APSP (all-pairs shortest path), and PageRank, can be expressed as SQL/MR functions.
3. Machine Learning: statistical algorithms such as linear regression, clustering, collaborative filtering, naive Bayes, support vector machines, and neural networks can be used to solve hard problems like pattern recognition, recommendations/market basket analysis, and classification/segmentation.
4. Data Transformations and Preparation: large-scale transformations can be parameterized as SQL/MR functions for data cleansing and standardization, unleashing the true potential of Extract-Load-Transform pipelines and making large-scale data model normalization feasible. Pushing this work down into the database also enables rapid discovery and data pre-processing to create the analytical data sets used by advanced analytics tools such as SAS and SPSS.
These are just a few simple examples Aster has developed for our customers and partners via Aster’s In-Database MapReduce to help them with rich analysis and transformations of large data.
I’d like to finish with a short code snippet for a simple yet powerful SQL/MR function we’ve developed called “Sessionization”.
Aster developed the “Sessionization” SQL/MR function via our standard Java API library to easily parameterize the discovery of a user session. A session is defined by a timeout value (e.g. in seconds). If the elapsed time between consecutive click events is greater than the timeout, a new session has begun for that user.
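To make the timeout rule concrete, here is a minimal sketch in plain Java. It is not Aster's implementation and does not use the SQL/MR Java API; the class name SessionizeSketch and the method assignSessions are hypothetical, chosen only for illustration. It assumes the clicks of a single user have already been sorted by timestamp, with timestamps expressed in seconds.

```java
import java.util.Arrays;

// Hypothetical illustration of the timeout rule only; NOT Aster's SQL/MR API.
class SessionizeSketch {

    /**
     * Assigns a session id to each click of ONE user.
     * Assumes clickTimes (in seconds) is already sorted in ascending order.
     */
    static int[] assignSessions(long[] clickTimes, long timeoutSeconds) {
        int[] sessionIds = new int[clickTimes.length];
        int currentSession = 0;
        for (int i = 0; i < clickTimes.length; i++) {
            // A gap strictly greater than the timeout starts a new session.
            if (i > 0 && clickTimes[i] - clickTimes[i - 1] > timeoutSeconds) {
                currentSession++;
            }
            sessionIds[i] = currentSession;
        }
        return sessionIds;
    }

    public static void main(String[] args) {
        long[] clicks = {0, 100, 650, 1300, 1350}; // one user's clicks, in seconds
        // Gaps are 100, 550, 650, 50; only the 650-second gap exceeds the 600-second timeout.
        System.out.println(Arrays.toString(assignSessions(clicks, 600))); // prints [0, 0, 0, 1, 1]
    }
}
```

In this sketch session ids restart at 0 for each user, so a globally unique identifier would need to combine the user id with the per-user session number. That is a design choice of the sketch, not a statement about how the Aster function numbers its sessions.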
From a user perspective, the input is user clicks (e.g. timestamp, userid). The output associates each click with a unique session identifier, based on the Sessionization function described above. Here’s the simple syntax:
SELECT timestamp, userid, sessionid
FROM sessionize("timestamp", 600) ON clickstream
SEQUENCE BY timestamp
PARTITION BY userid;
Indeed, it is that simple.
So simple that we have reduced a complex, multi-hour Extract-Load-Transform task to a toy example. That is the power of In-Database MapReduce!