By Mayank Bawa in Analytics, MapReduce on November 9, 2010

It’s ironic how, all of a sudden, Vertica has shifted its focus from being a column-only database to claiming to be an Analytic Platform.

If you’ve used an Analytic Platform, you know it’s more than just bolting a layer of analytic functions on top of a database. Yet that is essentially how Vertica now claims to be a full-blown analytic platform, when in fact its analytics capability is rather thin. For instance, its first layer is a pair of window functions (CTE and CCE). The CCE window function is used, for example, to do sessionization. Vertica has a blog post that positions sessionization as a major advanced analytic operation. In truth, Vertica’s sessionization is not analytics. It is a basic data-preparation step that adds a session attribute to each clickstream event so that very simple session-level analytics can be performed.
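To make concrete how basic that data-preparation step is, here is a minimal time-based sessionization sketch in Python. The column names and the 30-minute timeout are illustrative assumptions, not Vertica’s or Aster Data’s actual implementation:

```python
from itertools import groupby
from operator import itemgetter

def sessionize(events, timeout=1800):
    """Assign a session_id to each clickstream event.

    events: dicts with 'user' and 'ts' (epoch seconds) keys
            (hypothetical column names).
    A new session starts whenever the gap between consecutive
    events of the same user exceeds `timeout` seconds.
    """
    out = []
    events = sorted(events, key=itemgetter('user', 'ts'))
    for user, group in groupby(events, key=itemgetter('user')):
        session_id, last_ts = 0, None
        for e in group:
            if last_ts is not None and e['ts'] - last_ts > timeout:
                session_id += 1  # gap too large: start a new session
            out.append({**e, 'session_id': session_id})
            last_ts = e['ts']
    return out
```

A dozen lines of per-user bookkeeping: hardly "advanced analytics" in itself, which is exactly the point above.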

What’s interesting is that the CCE window function is simply a pre-built function - some might say just syntactic sugar - that combines the functionality of finite-width SQL window functions (LEAD/LAG) with CASE expressions (WHEN condition THEN result). Nothing groundbreaking, to say the least!

For example, the CTE query referred to in a Vertica blog post can be rewritten very simply using SQL-99:

SELECT
  symbol, bid, timestamp,
  SUM(CASE WHEN bid > 10.6 THEN 1 ELSE 0 END)
    OVER (PARTITION BY symbol ORDER BY timestamp) AS window_id
FROM tickstore;

Layering custom pre-built functions has long been the traditional way of adding capabilities to a database. The SQL-99 and SQL-2003 analytic functions themselves follow this tradition.

This problem is not unique to Vertica; it also afflicts the giants of the market, Oracle and Microsoft for instance. In their approach, the customer is at the mercy of the database vendor: pre-built analytic functions are hard-coded into every major release of the DBMS. There is no independence between the analytics layer and the DBMS, which real, well-architected analytic platforms need to have. Simply put, if you want different sessionization semantics, you have to wait for Vertica to build a whole new function.

Analytics is a wide canvas. An analyst wants to interpret data in ways that best suit the business context. The database must give the analyst the freedom to express that context, not try to peg their needs into one of two hard-wired, rigid functions. This freedom requires a flexible, extensible architecture built from the ground up, not rigid functions bolted on top.

Let’s consider an analyst who wants to interpret user behavior on his website from clickstream logs. He may start by preparing time-based sessions (the CCE approach). He may then decide to focus on search behavior: how do users search for items? How do users refine their search? What searches did users do until they found their specific item? For example, at a bookstore, a user may search for “harry potter gift”, explore some results, then search for “harry potter rowling”, explore more results, then search for “harry potter rowling memorabilia”, explore the results, and then add an item to their cart. The intention of the user is to find “memorabilia” and not “books”. This is useful information and presents good opportunities to provide more relevant content and improve customer satisfaction. This analysis requires the analyst to prepare search-based sessions. After the sessions are prepared, the analyst may do path analysis: find terms that users search for when they are looking for “memorabilia”, find popular items that drive most “memorabilia” purchases, and find other items that co-occur with “memorabilia” purchases.

Here is a SQL-MapReduce query that performs this advanced behavior-based sessionization. Note how simple it is, and how fast it runs on massive amounts of data thanks to MapReduce parallelization. Our customers find that SQL-MapReduce improves their query performance by orders of magnitude; in one example, a clickstream analysis that took 6 minutes in SQL ran in 77 seconds in SQL-MapReduce on 4 times the amount of data. This sessionization example is similar: simpler to express and far faster:

SELECT * FROM search_sessionize(
  ON clickstream
  PARTITION BY ip
  ORDER BY ts
  searchColumn('search_term')
  timeout_span(900)
  maxEditDistance(3)
  maxNewWords(2)
  stemming(on)
);

This query is simple: simple to write, simple to read, simple to edit, and simple to extend. Its intent can be understood even by non-SQL masters! Conversely, how many SQL jocks know how to efficiently combine LAG/LEAD functions with CASE expressions, as you have to do with Vertica? In this example, the query assigns a session_id to a series of searches over the clickstream data set, allowing search terms to be edited (maxEditDistance) and refined by adding new words (maxNewWords). It also supports the traditional timeout, so that after a long enough gap a new session is assigned regardless of the next search term.
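For illustration, the semantics implied by the maxEditDistance and maxNewWords parameters can be sketched in Python. The helper below is a guess at the intent (Levenshtein distance plus a new-words check), not Aster Data’s actual implementation:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete
                           cur[j - 1] + 1,               # insert
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

def same_search_session(prev_term, term,
                        max_edit_distance=3, max_new_words=2):
    """Heuristic: a new search continues the session if it is a
    small edit of the previous term, or refines it by adding a
    few new words (both thresholds mirror the query parameters)."""
    if edit_distance(prev_term, term) <= max_edit_distance:
        return True
    prev_words, words = set(prev_term.split()), set(term.split())
    return prev_words <= words and len(words - prev_words) <= max_new_words
```

In SQL-MR this logic lives inside the operator, applied per partition in timestamp order; the point is that the analyst could write (or rewrite) exactly this kind of heuristic rather than wait for the vendor.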

Regarding speed: this query is fast. It makes one pass over the data and is fully parallelized. There are no wasted operations - no joins, no aggregates, no data movement - nothing that doesn’t directly help compute the result. Every operation is parallelized, which implies scalability. The function does not need to be rewritten whether the data size is 100 GB, 1 TB, 10 TB, or 100 TB.

The best part of this framework is that the analyst does not have to wait for Aster Data to ship this function hard-coded into an Aster Data nCluster release! The analyst can write a SQL-MR operator that processes the data for his needs now, using our simple point-and-click IDE that deploys custom SQL-MR functions to nCluster. That operator can then be re-used by all other analysts, and it keeps SQL-MR queries easy to understand, even for non-SQL masters.

This proven combination of simplicity, speed, and re-use is the power of the SQL-MapReduce framework.

Further, this search analysis requires search-based (i.e., behavior-based) sessionization, not time-based sessionization. The CCE approach is time-based and cannot track the variable number of events that mark a behavior. As a result, the analyst cannot prepare behavior-based session data: the Vertica system does not give the analyst the functional freedom to do behavior-based analytics. And it would take many development years for the Vertica product to discover, develop, QA, and deploy all these functions, one by one.

Here is another SQL-MapReduce query, using a SQL-MR operator that ships with Aster Data nCluster. This analysis cannot be done with CCE or CTE in Vertica because it does not define a finite LEAD/LAG window.

One customer wanted to find the number of behavioral paths on their website that exclusively visit pages in ‘Sports’ or ‘Business’ and ignore any other content on the website.

Count all non-overlapping sequential user paths that start at the ‘Home Page’ (pageid = 50) and pass exclusively through either ‘Sports’ pages (page_cat_id = 80) or ‘Business’ pages (page_cat_id = 90). (Note that standard SQL window functions define a fixed window - either a period of time or a number of rows - whereas SQL-MR operators like nPath let you define the window by the pattern being observed, across any number of rows or any period of time. Further, finding non-overlapping and overlapping patterns is really hard to do with a standard SQL window function.)

SELECT B_count, S_count
FROM npath(
  ON clickstream
  PARTITION BY user_id
  ORDER BY ts
  MODE(NONOVERLAPPING)
  PATTERN('H.(S|B)*')
  SYMBOLS(
    pageid = 50 AS H,
    page_cat_id = 80 AS S,
    page_cat_id = 90 AS B)
  RESULT(
    COUNT(*) OF ANY(B) AS B_count,
    COUNT(*) OF ANY(S) AS S_count)
)
ORDER BY B_count, S_count DESC;

The above is just one example of the advanced behavioral analysis an analyst may want to do in a business context where the business is interested in specific content-consumption patterns. As business contexts change, sessionization and key behavior metrics change with them. The underlying platform must be expressive enough to let the analyst express his intent.
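The core idea behind nPath-style matching can be sketched outside the database: map each event to a one-character symbol and run a regular expression over the resulting string. The sketch below is illustrative only; the column names are assumed, and the pattern is simplified to the exclusive form H(S+|B+) rather than nPath’s exact semantics:

```python
import re

def to_symbol(event):
    """Mirror the SYMBOLS clause: H = home, S = sports, B = business,
    x = anything else (assumed column names)."""
    if event['pageid'] == 50:
        return 'H'
    if event['page_cat_id'] == 80:
        return 'S'
    if event['page_cat_id'] == 90:
        return 'B'
    return 'x'

def count_sb_paths(events):
    """Count pages on non-overlapping paths that start at Home and pass
    exclusively through Sports or exclusively through Business pages.
    Returns (business_hits, sports_hits)."""
    s = ''.join(to_symbol(e) for e in sorted(events, key=lambda e: e['ts']))
    b_count = s_count = 0
    # re.finditer yields non-overlapping matches, like MODE(NONOVERLAPPING)
    for m in re.finditer(r'H(S+|B+)', s):
        if m.group(1)[0] == 'S':
            s_count += len(m.group(1))
        else:
            b_count += len(m.group(1))
    return b_count, s_count
```

The difference, of course, is that nPath does this inside the database, per partition, in parallel, in one pass, without exporting the data.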

In conclusion, the analytic platform must be:
- Expressive - allowing analysts to express their intuition
- Re-usable - allowing analysts to re-use their functions for other analytic applications
- Highly performant - allowing analysts to get their answers interactively and quickly
- Scalable - allowing analysts to leverage big data cost-effectively at scale; as Aster Data customers have proven, with linear scalability up to 900 TB in one particular case

Analytics requires a true platform architecture that is easily extensible - not a patchwork of limiting SQL-only functions.

