This post was co-authored by John Cieslewicz, Eric Friedman, and Peter Pawlowski of Aster Data Systems
One year ago we introduced SQL/MapReduce for the Aster nCluster database, which integrates MapReduce and SQL to enable deep analytics within the database. Pushing computation inside the database and close to the data is increasingly important as data sizes grow exponentially. As SQL/MR turns one year old, we are happy to announce that we will be presenting our SQL/MR innovations next week at the 35th International Conference on Very Large Data Bases (VLDB), the premier international forum for database research.
The title of our conference paper is SQL/MapReduce: A Practical Approach to Self-describing, Polymorphic, and Parallelizable User-defined Functions. The title of the paper may be large, but so are the advanced analytics possibilities created by the invention of SQL/MR!
We developed SQL/MR because we saw a growing gap between the deep analytics and application needs of very large data and the capabilities provided by SQL and traditional relational-only data processing. We call this gap, the “SQL Gap.”
SQL and the relational query processing model are well suited for many, but not all data processing tasks. Some queries are cumbersome, non-intuitive or impossible to express in SQL (note: now that SQL is turing complete, nothing is strictly impossible, but it can be very painful and perform very badly) – check out our paper for some examples. Moreover, query optimizers have a limited number of algorithms at their disposal to process data, which leads to convoluted data processing in situations where applying a little domain knowledge can yield a much more straightforward algorithm.
We found traditional user-defined functions (UDFs) to fall short in bridging this gap between SQL and the answers to challenging analytic problems that need to be solved. UDFs are often user-unfriendly, inflexible, and not easily parallelized. SQL/MR functions, in contrast, are designed to be easy to develop, easy to install, and easy to use – providing developers and analysts with a powerful tool to tackle the challenges posed by very large data.
To do this, we integrated the MapReduce programming model with SQL. MapReduce is a well known paradigm for parallel, fault-tolerant data processing that allows developers to write procedural code that is then applied to data in parallel. Pure MapReduce, however, misses out on aspects of SQL and relational data processing that are great – such as query optimizations, managed data, and transactions. By integrating SQL and MapReduce on top of Aster nCluster’s hardware management and fault tolerance, we leverage the strengths of each, resulting in a system that is much more powerful than either in isolation.
SQL/MapReduce At VLDB
SQL/MR is much more than a user-defined function. As our paper title states, a SQL/MR function is self-describing, polymorphic, and parallelizable. Let’s explore each of these characteristics and see why we hope researchers at VLDB will be as excited as we are by SQL/MR.
Self-Describing and Polymorphic
The behavior and output characteristics of a SQL/MR function are determined dynamically at query-time instead of statically at install-time as is the case with traditional user-defined functions. These characteristics allow SQL/MR functions to behave much more like general purpose library functions than single-use, specific application user-defined functions. When a SQL/MR function is used in a query, the nCluster query planner negotiates a contract with the SQL/MR function, providing the function’s input schema and optional user-supplied parameters. In return, the SQL/MR function agrees to a contract that specifies its output schema for the duration of the query. This contract negotiation is what makes SQL/MR functions self-describing, and their ability to be invoked on different input with different optional parameters makes them polymorphic as well. To summarize, by invoking a SQL/MR function on different input, with different optional parameters, the SQL/MR function may export a different output schema and perform different computation – the possibilities are entirely up to the developer!
The SQL/MR programming model, like that of MapReduce, is inherently parallelizable. Developers write procedural code in the language of their choice, but at runtime that code will run in parallel across hundreds of nodes within an nCluster database. Advanced analytics and application code can now be executed in parallel, directly on data stored within nCluster making nCluster an application-friendly, high performance data warehouse and data application system. Advanced analytics capabilities that we have already pushed inside nCluster using SQL/MR include click-stream sessionization, general purpose time series path matching, dynamic and massively parallel data loading from heterogeneous sources, and genetic sequence analysis.
Bridging the Gap
SQL/MapReduce has matured greatly over the past year and is used by our customers in creative ways we never imagined – proof positive that SQL/MapReduce is an effective way to bridge the “SQL Gap” to deeper analytics on very large data. Check out the other application examples and sample code videos on www.asterdata.com, as well as the MapReduce category of this blog
We’d love to hear from you about any type of analysis you’re doing where stand-alone SQL is becoming overly-complex … and please look us up at the VLDB 2009 show if you’re making the trip to Lyon, France!
(Apart from the authors, Brent Chun, Mohit Aron, Abhishek Marwah, Raghu Venkat, Vinay Bondhugula, and Prasan Roy of Aster Data Systems are notable contributors to the overall SQL/MR effort.)