27
Aug
By Tasso Argyros in Blogroll, Database, MapReduce on August 27, 2008
   

Building on Mayank’s post, let me dig deeper into a few of the most important differences between Aster’s In-Database MapReduce and User Defined Functions (UDFs):

Feature User Defined Functions Aster SQL/MR Functions What does it mean?
———— ———— ———— ————
Dynamic Polymorphism No. Requires changing the code of the function and static declarations Yes SQL/MR functions work just like SQL extensions – no need to change function code
———— ———— ———— ————
Parallelism Only in some cases and for few number of nodes Yes, across 100s of nodes Huge performance increases even for the most complex functions
———— ———— ———— ————
Availability Ensured No. In most cases UDFs run inside the database Always. Functions run outside the database Even if functions have bugs, the system remains resilient to failures
———— ———— ———— ————
Data Flow Control No. Requires changing the UDF code or writing complex SQL subselects Yes. “PARTITION BY” and “SEQUENCE BY” natively control the flow of data in and out of the SQL/MR functions Input/output of SQL/MR functions can be redistributed across the database cluster in different ways with no actual function code change

In this blog post we’ll focus on Polymorphism – what it is and why it’s so critically important for building real SQL extensions using MapReduce.

Polymorphism allows Aster SQL/MR functions to be coded once (by a person that understands a programming or scripting language) and then used many times through standard SQL by analysts. In this context, comparing Aster SQL/MR functions and UDFs is like comparing SQL with the C language. The former is flexible, declarative and dynamic, the latter requires customization and recompilation even for the slightest change in usage.

For instance, take a SQL/MR function that performs sessionization. Let us assume that we have a webclicks(userId int, timestampValue timestamp, URL varchar, referrerURL varchar); table that contains a record of clicks for each user on our website. The same function, with no additional declarations, can be used in all the following ways:


SELECT sessionId, userId, timestampValue
FROM Sessionize( 'timestamp', 60 ) ON webclicks;


SELECT  sessionId, userId, timestampValue
FROM Sessionize( 'timestamp', 60 ) ON
(SELECT userid, timestampValue FROM webclicks WHERE userid = 50 );

[Note how the number of input arguments changed (going from all columns of webclicks; to just two columns of webclicks) in the above clause but the same function can be used. This is not possible with a plain UDF without writing additional declarations and UDF code]


SELECT  sessionId, UID, TS
FROM Sessionize( 'ts', 60 ) ON
(SELECT userid as UID, timestampValue as TS FROM webclicks WHERE userid = 50 );

[Note how the names of the arguments changed but the Sessionize() function does the right thing]

In other words, Aster SQL/MR functions are real SQL extensions – once they’ve been implemented there is zero need to change their code or write additional declarations – there is perfect separation between implementation and usage. This is an extremely powerful concept since in many cases the people that implement UDFs (engineers) and the people that use them (analysts) have different skills. Requiring a lot of back-and-forth can simply kill the usefulness of UDFs – but not so with SQL/MR functions.

How can we do that? There’s no magic, just technology. Our SQL/MR functions are dynamically polymorphic. To do this, our SQL/MR implementation (the sessionize.o file) includes not only code but also logic to determine its output schema based on its input, which is invoked at every query. This means that there is no need for a static signature as is the case with UDFs!

In-Database MapReduce Flow

Polymorphism also makes it trivial to nest different functions arbitrarily. Consider a simple example with two functions, Sessionize() and FindBots(). FindBots() can filter the input from any users that seem to act as bots, e.g. users whom interactions are very frequent (who could click on 10 links per second? probably not a human). To use these two functions in combination, one would simply write:


SELECT  sessionId, UserId, ts, URL
FROM Sessionize( 'ts', 60 ) ON FindBots( 'userid', 'ts' ) ON webclicks;

Using UDFs instead of SQL/MR functions would mean that this statement would require multiple subselects and special UDF declarations to accommodate the different inputs that come out of the different stages of the query.

So what is it that we have created? SQL? Or MapReduce? It really doesn’t matter. We just combined the best of both worlds. And it’s unlike anything else the world has seen!


Comments:

[...] Data has actually posted quite a bit about MapReduce, including in its excellent blog, which has joined Vertica’s more academic Database Column as being among the best vendor [...]

[...] How Aster In-Database MapReduce Takes UDF’s to the Next Level Differences Between Aster and Hadoop Posted on September 6th, 2008 by Tasso [...]

[...] However, In-Database MapReduce is new and often loaded with a lot of technical discussion on how it’s different from PL/SQL or UDF’s, whether it’s suitable for business aanalysts or developers, and [...]

Post a comment

Name: 
Email: 
URL: 
Comments: