10
Feb
By Steve Wooledge in Analytics, Blogroll, nPath on February 10, 2009
   

It may not sound sexy, but analyzing a sequence of events over time is non-trivial in standard SQL. When Aster started providing customers ways to express time-series analysis more elegantly and have it return result sets in 90 seconds as opposed to 90 minutes, we started getting comments like,  “That’s beautiful … I would make out with my screen if I could!”.

Crazy, I know.

Time-series, or sequential pattern analysis, is useful in a number of industries and applications such as:

- Price changes over time in financial market data

- Path analysis within click-stream data to define anything from top referring sites to the “golden” paths customers navigate before purchasing

- Patterns which detect deviant activity such as spamming or insurance claims fraud

- Sessionization (mapping each event in a clickstream to a human user session)

More specifically, one customer wanted to improve the effectiveness of their advertising and drive more visitors to their site. They asked us to help determine the top paths people take to get to their site and top paths people take after leaving the site. Knowing the “before” path gave them insight into what sites to place advertisements on to drive traffic to their site. Additionally, pathing helps the customer understand behavior/preferences of users who visit their site (e.g., if espn.com is a top site that is in the path of the 5 pages leading up to the customer site, they know that many visitors like sports).

However, discovering relationships between rows of data is difficult to express in SQL, which must invoke multiple self-joins of the data. These joins dramatically expand the amount of data involved in the query and slow down query performance – not to mention complexity in developing and parsing these expressions.

What’s the solution? There are a few, but Aster’s approach has been to develop extensions to SQL which is executed in-database in a single-pass over the data in a massively-parallel fashion utilizing nPath, which is a SQL-MapReduce (SQL/MR) function used to perform regular expression pattern matching over a sequence of rows. It allows users to:

- Specify any pattern in an ordered collection – a sequence – of rows with symbols;

- Specify additional conditions on the rows matching these symbols; and

- Extract useful information from these row sequences.

I’ll share the syntax of nPath here to give you more context of how the query operates:

SELECT…
FROM nPath (
ON {table|query}
…various parameters…
)
WHERE…
GROUP BY…
HAVING…
ORDER BY…
etc.

nPath performs pattern matching and computes aggregates. The results of this computation are the output rows of nPath. The rows from nPath can subsequently be used like any other table in a SQL query: rows from nPath may be filtered with the WHERE clause, joined to rows from other tables, grouped with the GROUP BY clause, and so on.

The result? Incredibly powerful insight into a series of events which indicates a pattern or segment can be expressed in SQL, run in parallel on massive clusters of compute-power in an extremely efficient manner via a single pass over the data, and made accessible to business analysts through traditional BI tools.

What do you think? What other methods are people using to tackle these types of problem?


Comments:
Richard Hackathorn on February 10th, 2009 at 5:00 am #

More explanation of ‘various parameters’ in the syntax above. A simple example would be appreciated.

Julian Hyde on February 10th, 2009 at 12:08 pm #

Another approach to the problem is to use windowed aggregation, e.g. Sum(x) over (order by time partition by ticker rows 10 preceding).

This syntax is standard (SQL:2003) and supported by several vendors. At SQLstream we are applying the same operator to streaming data.

I would urge you to stick to standard SQL wherever possible. Innovate in implementation, not specification!

[...] sequential data via MapReduce so that you can then do ordinary SQL on it.  (Steve Wooledge’s blog post about nPath outlines why that might be needed.  Point 1 in Mayank Bawa’s August, 2008 post is much more [...]

Qi Su on February 10th, 2009 at 5:08 pm #

Julian – Good thinking. Traditional SQL window functions would certainly work to avoid inefficient self-joins for many simpler sequential event analysis use cases, for example, suppose we are looking for user time on page on the five pages prior to landing on my SEM landing page.

However, there are high-value complex analytics use-cases that cannot be expressed in traditional SQL or SQL window function extensions.

For example, web page abandonment is an important CRM use case, especially for e-commerce sites that want to find out where and why are customers abandoning their cart during the checkout process. In this case, we want to look for user sessions where the user enters the checkout process (a sequence of specific pages), never reaches the final order completion page, and then never returns to this site. This cannot be expressed by window functions as there is no limit on number of rows or time range to define a window. In contrast, this would be a simple pattern in nPath of “A+.B*$� where we describe A as pageviews on pages in the checkout flow excluding the order completion page, and describe B as pageviews not on this site.

Steve Wooledge on February 11th, 2009 at 5:49 pm #

Hi Richard – I blogged with a much more “real” example if you’d like to check it out. http://www.asterdata.com/blog/index.php/2009/02/11/more-npath-examples-web-analytics/

Interfacing Hadoop With MySQL « Semantic Void on March 5th, 2009 at 5:11 pm #

[...] well, but getting MySQL to do mapreduce-like processing is not supported unless you have access to nPath. The other solution is to get this data into an existing mapreduce framework like Hadoop. In a [...]

Abdulrazak on October 15th, 2013 at 9:40 am #

Hi All,
Please send across golden path examples if you have any.
Regards
Abdul

Post a comment

Name: 
Email: 
URL: 
Comments: