Aster nPath Functionality Volume I

By Mark Ott, Teradata Aster

This presentation will cover some of the finer points on Aster’s nPath function.  Here’s the overview:

  • Working with ACCUMULATE
  • Understanding MODE (overlapping | nonoverlapping)
  • Experimenting with Operator precedence and parentheses
  • Number of occurrences options
  • Using Multiple WHERE clauses and composite PATTERNs

If you haven’t read Mike Riordan’s Aster nPath Guide first, now is a good time.  It provides a great foundation of these concepts and is compulsory before we dig into the more advanced stuff.  However, at the risk of redundancy, here’s the nPath Operators we'll be using in our labs.

It worth noting the Operators above are 'greedy' in nPath.  This means they are matched from left-to-right and as many rows as possible that can fit in the wildcards will be.  You will see examples of this below.

Working with ACCUMULATE

We will be using the ACCUMULATE aggregate in our RESULTS clause so it’s important we understand how it works (Note this keyword can be used in a number of Aster functions so it’s not just limited to nPath).

ACCUMULATE provides a list of input columns that will be passed as-is to the result set when conditions are True for that Pattern.  It will be in the form of a comma delimited string of sequenced values enclosed in brackets.  I use it for troubleshooting when I want to understand the nPath logic.  Suppose you have the following table:

You run the following nPath code:

The result set is:
 ct_any_ab  |  accumulate
              2   |    [a, b]
              2   |    [a, b]

Basically nPath starts walking the partitioned ordered rows in top-to-bottom fashion looking for True conditions to return in the answer set.  In our example, the nPath mechanics works as follows:
Row 3 –  False (no match between row3 and row4).  Move on to row 4
Row 4 –  True (found match between row 4 and row 5) with RESULTS = 2  [a, b].  Move on to row 6
Row 6 –  False (no match between row 6 and row 7). Move on to row 7 ......

I think you get the idea by now.  Eventually row  10 and row 11 would result in another True which would return the second 2  [a, b].

So that’s a quick overview of how nPATH’s ACCUMULATE works.  I don’t want to muddy the waters, but I would be amiss without mentioning a few things in the above example. 

  1. Unlike Teradata, Aster is case-specific.  If I would have replaced the SYMBOLS line with this:  symbols(c2='A' as A, c2='B' as B)  then the result set would have been empty since there would have been no matches.  So be extremely careful with case in your code; and for that matter when INSERTING data 
  2. Notice in the sequence above that once a True was found in Row 2, the next row number to be checked was Row 4, and not Row 3.  That was due to the mode(nonoverlapping)clause.  We delve into this topic next but just wanted to point it out

OK, we’ve got our feet wet.  It’s time to dive into some more detailed examples.  Let’s go.  


The MODE clause indicates whether the matched PATTERNs may overlap. You get to choose between OVERLAPPING or NONOVERLAPPING as shown below.

In NONOVERLAP match mode, nPath begins the next pattern search at the row that follows the last PATTERN match (in our case, next pattern search would start at row 8).

In OVERLAPPING match mode, nPath finds every occurrence of the pattern in the partition, regardless of whether it might have been part of a previously found match. This means that, in OVERLAPPING mode, one row can match multiple symbols in a given matched PATTERN (in this case the next pattern match starts at row 4).

But don’t take my word for it (I’m from Missouri so you have to ‘Show Me’).   Let’s do the lab using the same 12-row data set we used earlier.   Since my pattern is ('A+.B.B.C'), this means we are looking for a PATTERN of at least 1 A, followed by 2 B’s, followed by a C.

With the code running mode (NONOVERLAPPING) , here’s the output.
  first     |    accumulate
      2     |  [a, a, b, b, c]
      1     |  [a, a, b, b, c]

Repeat, only this time using mode (OVERLAPPING)  you get 4 rows for the output.
  first     |    accumulate

      2      |   [a,  b,  b, c]
      2      |   [a , a,  b,  b, c]
      1      |   [a,  b,  b, c]
      1      |   [a , a,  b,  b, c]

Of course, OVERLAPPING will not guarantee you have more output rows than NONOVERLAPPING.  It all depends on the data.  Suppose I were to run the above code with a slightly different pattern as follows:
pattern ('A.B.B.C') then I would get the same answer regardless of mode type since there’s  only one match in sequence.
    [a,  b,  b, c]

Experimenting with Operator precedence and Parentheses

It is crucial to know the precedence of Operators from highest to lowest.  Here it is:

  1. Cascade ( . )
  2. Alternate ( | )
  3. Frequency ( ?, *, + )

If you wish to change order of execution, parenthesis is the way to do it.
Here’s the table we’ll be running our queries against:

I want to create a query that finds all combinations of  A,B or A,C.  Looking at the table above, I figured I would get 3 hits (Row 1-2, Row 4-5, Row 6-7).  Here's the code:

Hmmm.  I only had 2 hits (A,B) and was missing the (A,C) hit.  I could not understand why I did not have the A,C combination in my result set.  After going back to the User’s Guide, I figured it out.  I had forgotten that ( . )  takes precedence over  ( | ) .  So my logic was looking for PATTERNS of A.B rows or C rows as follows:

Row 1-2 is A.B                                   Match (A.B)
Row 3 is C                                         Match (C)
Row 4-5 is A.B                                   Match (A.B)
Row 6 is A (with no following B)        No Match
Row 7 is C                                         Match (C)

… when I was really wanted PATTERNS of A.B or A.C.  To fix the problem, I added paretheses to the code:

Sure enough I got my 3 hits.  2 hits for A,B pattern and 1 for A,C pattern.

 Ahh, that’s more like it.  Here you can see the 3 hits of 2 A,B and 1 A,C.  The bottom line; it is a best practice to use parentheses in your PATTERN string.  It takes away any ambiguity that may exist and makes it more readable for others looking at your code.

Number of Occurrence options

If you wish to search for an exact number of occurrences that totals more than one, the nPath coding can become tedious.  There are 3 options you can use:

  1. ' (x){a} '         Exactly A number of occurrences of X
  2. ' (x){a,} '       At least A number of occurrences of X
  3. ' (x){a,b} '      A to B occurrences of X

For example, suppose you wish to search for sub-sequence of (A.B|C) at least 4 times.  You could code it as:  PATTERN ( 'X. (Y.Z) . (A.B|C) . (A.B|C) . (A.B|C) . (A.B|C) * ' )  but that’s pretty messy.
An alternative is the following syntax: PATTERN ( 'X. (Y.Z) . (A.B|C) {4,} ' )

The only thing to be aware of is the need for your occurring PATTERN to be in enclosed in parentheses with the single tics on the outside of the entire PATTERN.  Here’s an example of doing it the right way:  PATTERN ('X.(A.(B|C)){1,3}') 

Using multiple WHERE clauses and Composite PATTERNs

It is possible to have multiple WHERE clauses in your nPath code.  Their position in the statement determines if they filter INPUT rows or OUTPUT rows. 

The below WHERE clause in the second SELECT filters INPUT rows where c1 = 1 or c1 =2.  That’s simple enough.

I'm a big believer in using WHERE to filter an many Input rows as possible.  It's a much better performer than scanning the entire Table's rows.

If you wanted to later filter Output rows, you can do that with a 2nd WHERE clause as demonstrated.  

Note I could have accomplished the same thing by having composite PATTERN search instead of using the 2nd WHERE clause.  Following code produces the same output as above.

Ensure you have space after the comma ( ie:  [a,  b] ).  If you write it as [a,b] , it will return 0 rows.
Be extremely careful when writing composite PATTERNS as I have notice (what I think) is a bug in the code. 

If you were to write  as:  PATTERN('A.B' | 'B.C'), you get back a message along the lines 'The connection to the server was lost.'  You will not get a result set until you fix the code.  I believe this is because you must have only one set of single tics in PATTERN.

In Conclusion

As you can see, there’s lot of functionality baked into the  nPath.  I consider it one of Aster’s premier functions since it has so many uses and can provide deep insights into customer’s behavior such as buying trends and patterns. 

Keep your eyes peeled for more nPath articles.  In particular, we’ll be talking about the LAG function next which can be embedded in SYMBOLS clause.  It can answer such questions as: “Find customers who view a product and then buy it with 1 minute on my web page.”   That would be a valuable piece of information if I were a retailer.

And if you are interested in running these labs now, here's the CREATE TABLE statment and INSERTs to help you along:
create table overlap (c1 int, c2 text) distribute by hash(c1);
insert into overlap values (1, 'a');
insert into overlap values (1, 'a');
insert into overlap values (1, 'b');
insert into overlap values (1, 'b');
insert into overlap values (1, 'c');
insert into overlap values (1, 'c');
insert into overlap values (2, 'a');
insert into overlap values (2, 'a');
insert into overlap values (2, 'b');
insert into overlap values (2, 'b');
insert into overlap values (2, 'c');
insert into overlap values (2, 'c');