By Steve Wooledge in Blogroll, nPath on February 11, 2009

As a follow-on to the introductory nPath post, I wanted to share a little more depth on the nPath SQL syntax and a more sophisticated example which can be applied in click-stream or Web analytics. I’ll try to keep it concise for my colleagues who don’t want the pretty marketing bow . ;-)

SEO and SEM are critical traffic drivers for just about any consumer-facing website. Third party analytics offerings such as Google Analytics or Omniture can provide great turn-key package of canned reports. However, certain deep analytics on sequential events are simply out of the reach of not only these outsourced analytics services, but also in-house Web analytics data warehouses implemented on traditional solutions such as Oracle or SQL Server.

For example, suppose we are interested in the optimization of our website flow in order to retain and engage visitors driven to us by SEO/SEM. We want to answer the question: for SEO/SEM-driven traffic that stay on our site only for 5 or less pageviews and then leave our site and never return in the same session, what are the top referring search queries and what are the top path of navigated pages on our site? In traditional data warehouse solutions, this problem would require a five-way self-join of granular weblog data, which is simply unfeasible for large sites such as Myspace.

With the Aster nPath SQL/MR function, this problem can be expressed in a straightforward query that is executed in a very efficient manner in just a single pass over the granular data. The query below returns the top combinations of referral query string (of the entry page of the visit to our site) and on-site navigation path of up to 5 pages before leaving the site:

SELECT entry_refquerystring, entry_page || “,” || onsite_pagepath as onsite_pagepath, count(*) as session_count FROM nPath(
ON ( select * from clicks where year = 2009 )
PARTITION BY customerid, sessionid
ORDER BY timestamp
PATTERN ( ‘Entry.Onsite+.OffSite+$’ )
domain ilike “mysite.com” and refdomain ~* “yahoo.com|google.com|msn.com|live.com” as Entry,
domain ilike “mysite.com” as OnSite,
domain not ilike “mysite.com” as OffSite
first(page of Entry) as entry_page,
first(refquerystring of Entry) as entry_refquerystring,
accumulate(page of Onsite) as onsite_pagepath,
count(* of Onsite) as onsitecount_minus1
WHERE onsitecount_minus1 < 4
LIMIT 1000;

Bookmark and Share

An example of Aster Data’s nPath/MapReduce syntax | DBMS2 -- DataBase Management System Services on February 11th, 2009 at 9:33 pm #

[...] post on Aster Data’s introduction of MapReduce-based nPath, Steve Wooledge of Aster offers a more detailed example.  The particular case he works through is: … the question: for SEO/SEM-driven traffic that [...]

COP on February 26th, 2009 at 7:10 pm #

Why can’t I do this with a Perl script or any off-the-shelf Log Analysis software? I mean why cloud n mapreduce n all the baggage?

bottomline: do we really need this at this point?

[...] referred to how our customers use In-Database MapReduce (and nPath) for click-stream analytics . In our “MapReduce for Data Warehousing and [...]

Steve Wooledge on March 30th, 2009 at 4:18 pm #


Thanks for the comment. Yes, you could certainly write a Perl script or other custom script/tool to do this type of analysis. The challenge our customer run into when they do this is:

1) It’s not reusable - it was written for only one data type. SQL/MR functions are polymorphic and therefore more efficient.

2) The amount of data being collected breaks most databases. Clustering MySQL or traditional SMP databases gets difficult to manage without a lot of tuning. Add complex functions to the mix and it’s a recipe for pain and suffering.

3) There are many other basic reports and ad-hoc analysis that companies want to do on the same data set. Having a SQL-compliant database which can *also* enable massive parallelism for custom functions *in* the database is a requirement. That’s what Aster nCluster provides - the benefits of MapReduce with the standard SQL support and manageability of an RDBMS.

Do these points address the use-case that I’m sure you had in mind?

Post a comment