Teradata Aster nPath™ Guide
By Michael Riordan, Teradata Aster
As we've seen in our introduction to the power of Aster's nPath (Using Aster Express: Act 3, On the Road to Aster nPath), the nPath function allows you to perform regular pattern matching over a sequence of rows. With it, you can find sequences of rows that match a pattern you’ve specified and easily extract information from these matched PATTERNs using SYMBOLs that represent the matched rows in the pattern.
Aster nPath uses regular expressions because they are simple, widely understood, and flexible enough to express most search criteria. While most uses of regular expressions focus on matching patterns in strings of text; Aster nPath enables matching patterns in sequences of rows.
Aster nPath is powerful and for simple examples is very easy to use. But as we go through our Aster nPath tutorials and start looking at more complex examples, I thought a quick detour would be helpful to take a deeper dive into the syntax options.
nPath, wrapped within a SQL expression, has the following syntax:
As you can see, there's quite a bit of power and flexibility in the nPath syntax, both in the parameters for nPath itself, as well as the SQL statement wrapper. Let's take a closer look at the nPath parameters.
The ON clause specifies the input relation, which can be a table, view, or query. If you provide a query, you must enclose it in parentheses.
The PARTITION BY expression defines the scope of a partition of input rows over which nPath searches for pattern matches.
The ORDER BY expression specifies the sort-order of the input rows.
The MODE clause indicates whether matched PATTERNs may overlap.
- In OVERLAPPING match mode, nPath finds every occurrence of the pattern in the partition, regardless of whether it might have been part of a previously found match.
- In NONOVERLAPPING match mode, nPath begins the next pattern search at the row that follows the last PATTERN match.
The PATTERN clause defines the sequence of rows nPath uses for its searches.
You express the pattern using symbols and operators.
The following are examples of operators may be used in a pattern.
The precedence of operators is, from highest to lowest:
1. Cascade operator (“.”)
2. Alternative operator (“|”)
3. Frequency operators (“?”, “*”, “+”)
Operators with equal precedence associate left to right.
The SYMBOLS clause defines the row-elements in the pattern, expressed as a comma-separated list of symbol definitions.
Each symbol definition is written in the form “symbol_predicate AS symbol” where symbol_predicate is an SQL predicate and symbol is a case-insensitive string you’ll use to represent rows that match this predicate.
It’s common to define each symbol as just one or two uppercase letters, since short symbols are easy to read when assembled into a pattern expression.
For example, a SYMBOLS clause for analyzing website visits might look like this:
A symbol is applied to a row only if the row satisfies the symbol’s predicate. If a null value is encountered when trying to match the symbol predicate, it’s treated as a non-match.
A symbol may be associated with the predicate “true”, meaning that the symbol can match any row.
Note that the predicates for different symbols may overlap, and therefore multiple symbols may match the same row.
In your symbol predicate, you can compare the current row to a preceding row to determine if it is considered a match for the symbol. These are called “LAG expressions”, which take the following 2 formats:
• expression-current is the name of the column from the the current row, or an expression operating on this column.
• the operator can be >, >=, <, <=, =, or !=
• expression-prev is the name of the column from the the current row, or an expression operating on this column.
• lag-rows is the number of rows to count back from the current row to reach the row we designate as the earlier row. For example, to compare with the immediately preceding row, use “1”.
• default-value is the optional value to be used when there are no rows which can be designated as “earlier row,” in such a case the default-value will be evaluated on the current row and used in place of the expression-prev.
The RESULT clause defines the output columns of this nPath query as a comma-separated list of expressions.
The RESULT clause is evaluated once for each matched PATTERN in the partition. In other words, nPath generates one output row per PATTERN match.
In the RESULT clause, each expression operates on one or more symbols, and each expression is followed by the alias to be applied to this column of output.
The form of an output column definition in the RESULTS clause is:
Since each symbol represents all the rows that matched that symbol’s predicate in this particular matched PATTERN, you must specify an expression (often just the column name) to state what values you want to retrieve from the matched rows, and then apply an aggregate function to the results of that expression, in order to generate a single, useful value from the set of matched rows in the symbol.
Aggregate functions can be COUNT, SUM, MAX, MIN, AVG, and the special nPath sequence aggregates FIRST, LAST, FIRST_NOTNULL, LAST_ NOTNULL, MAX_CHOOSE, and MIN_CHOOSE over each matched pattern.
For example, imagine that we want to count how many product pages a web visitor viewed during a visit to our website. To do this, our output column definition in the RESULT clause might look like the following.
Let’s assume we’ve defined the symbol PP to represent rows that record a user’s views of our product pages on the website:
Full nPath Syntax Example
To complete this quick overview, here's an nPath Query that pulls together much of the syntax options that we've covered:
We'll continue adding more nPath examples in the tutorials to follow. This syntax overview will hopefully be a valuable reference to help you understand the example queries and also guide you as you create your own nPath analytics on your own data.
And as always, have fun!