It may not sound sexy, but analyzing a sequence of events over time is non-trivial in standard SQL. When Aster started providing customers ways to express time-series analysis more elegantly and have it return result sets in 90 seconds as opposed to 90 minutes, we started getting comments like, “That’s beautiful … I would make out with my screen if I could!”.
Crazy, I know.
Time-series, or sequential pattern analysis, is useful in a number of industries and applications such as:
- Price changes over time in financial market data
- Path analysis within click-stream data to define anything from top referring sites to the “golden” paths customers navigate before purchasing
- Patterns which detect deviant activity such as spamming or insurance claims fraud
- Sessionization (mapping each event in a clickstream to a human user session)
More specifically, one customer wanted to improve the effectiveness of their advertising and drive more visitors to their site. They asked us to help determine the top paths people take to get to their site and top paths people take after leaving the site. Knowing the “before” path gave them insight into what sites to place advertisements on to drive traffic to their site. Additionally, pathing helps the customer understand behavior/preferences of users who visit their site (e.g., if espn.com is a top site that is in the path of the 5 pages leading up to the customer site, they know that many visitors like sports).
However, discovering relationships between rows of data is difficult to express in SQL, which must invoke multiple self-joins of the data. These joins dramatically expand the amount of data involved in the query and slow down query performance – not to mention complexity in developing and parsing these expressions.
What’s the solution? There are a few, but Aster’s approach has been to develop extensions to SQL which is executed in-database in a single-pass over the data in a massively-parallel fashion utilizing nPath, which is a SQL-MapReduce (SQL/MR) function used to perform regular expression pattern matching over a sequence of rows. It allows users to:
- Specify any pattern in an ordered collection – a sequence – of rows with symbols;
- Specify additional conditions on the rows matching these symbols; and
- Extract useful information from these row sequences.
I’ll share the syntax of nPath here to give you more context of how the query operates:
FROM nPath (
nPath performs pattern matching and computes aggregates. The results of this computation are the output rows of nPath. The rows from nPath can subsequently be used like any other table in a SQL query: rows from nPath may be filtered with the WHERE clause, joined to rows from other tables, grouped with the GROUP BY clause, and so on.
The result? Incredibly powerful insight into a series of events which indicates a pattern or segment can be expressed in SQL, run in parallel on massive clusters of compute-power in an extremely efficient manner via a single pass over the data, and made accessible to business analysts through traditional BI tools.
What do you think? What other methods are people using to tackle these types of problem?