Building on Mayank’s post, let me dig deeper into a few of the most important differences between Aster’s In-Database MapReduce and User Defined Functions (UDFs):
|Feature||User Defined Functions||Aster SQL/MR Functions||What does it mean?|
|Dynamic Polymorphism||No. Requires changing the code of the function and static declarations||Yes||SQL/MR functions work just like SQL extensions – no need to change function code|
|Parallelism||Only in some cases and for few number of nodes||Yes, across 100s of nodes||Huge performance increases even for the most complex functions|
|Availability Ensured||No. In most cases UDFs run inside the database||Always. Functions run outside the database||Even if functions have bugs, the system remains resilient to failures|
|Data Flow Control||No. Requires changing the UDF code or writing complex SQL subselects||Yes. “PARTITION BY” and “SEQUENCE BY” natively control the flow of data in and out of the SQL/MR functions||Input/output of SQL/MR functions can be redistributed across the database cluster in different ways with no actual function code change|
In this blog post we’ll focus on Polymorphism – what it is and why it’s so critically important for building real SQL extensions using MapReduce.
Polymorphism allows Aster SQL/MR functions to be coded once (by a person that understands a programming or scripting language) and then used many times through standard SQL by analysts. In this context, comparing Aster SQL/MR functions and UDFs is like comparing SQL with the C language. The former is flexible, declarative and dynamic, the latter requires customization and recompilation even for the slightest change in usage.
For instance, take a SQL/MR function that performs sessionization. Let us assume that we have a
webclicks(userId int, timestampValue timestamp, URL varchar, referrerURL varchar); table that contains a record of clicks for each user on our website. The same function, with no additional declarations, can be used in all the following ways:
SELECT sessionId, userId, timestampValue
FROM Sessionize( 'timestamp', 60 ) ON webclicks;
SELECT sessionId, userId, timestampValue
FROM Sessionize( 'timestamp', 60 ) ON
(SELECT userid, timestampValue FROM webclicks WHERE userid = 50 );
[Note how the number of input arguments changed (going from all columns of
webclicks; to just two columns of
webclicks) in the above clause but the same function can be used. This is not possible with a plain UDF without writing additional declarations and UDF code]
SELECT sessionId, UID, TS
FROM Sessionize( 'ts', 60 ) ON
(SELECT userid as UID, timestampValue as TS FROM webclicks WHERE userid = 50 );
[Note how the names of the arguments changed but the Sessionize() function does the right thing]
In other words, Aster SQL/MR functions are real SQL extensions – once they’ve been implemented there is zero need to change their code or write additional declarations – there is perfect separation between implementation and usage. This is an extremely powerful concept since in many cases the people that implement UDFs (engineers) and the people that use them (analysts) have different skills. Requiring a lot of back-and-forth can simply kill the usefulness of UDFs – but not so with SQL/MR functions.
How can we do that? There’s no magic, just technology. Our SQL/MR functions are dynamically polymorphic. To do this, our SQL/MR implementation (the
sessionize.o file) includes not only code but also logic to determine its output schema based on its input, which is invoked at every query. This means that there is no need for a static signature as is the case with UDFs!
Polymorphism also makes it trivial to nest different functions arbitrarily. Consider a simple example with two functions, Sessionize() and FindBots(). FindBots() can filter the input from any users that seem to act as bots, e.g. users whom interactions are very frequent (who could click on 10 links per second? probably not a human). To use these two functions in combination, one would simply write:
SELECT sessionId, UserId, ts, URL
FROM Sessionize( 'ts', 60 ) ON FindBots( 'userid', 'ts' ) ON webclicks;
Using UDFs instead of SQL/MR functions would mean that this statement would require multiple subselects and special UDF declarations to accommodate the different inputs that come out of the different stages of the query.
So what is it that we have created? SQL? Or MapReduce? It really doesn’t matter. We just combined the best of both worlds. And it’s unlike anything else the world has seen!