Recognizing patterns in a sequence of rows has been a capability that was widely desired, but not really possible with SQL until now. There were many workarounds, but these were difficult to write, hard to understand, and inefficient to execute.
The aim of this tutorial is to explain the difference greedy and reluctant quantifiers. Pattern quantifiers that attempt to match as many instances as possible of the token to which they are applied are referred to as greedy, i.e. they will gobble up as many rows as possible. This is the default mode of operation.
First step is to setup our data table and then populate it with data
CREATE TABLE ticker (SYMBOL VARCHAR2(10), tstamp DATE, price NUMBER);
BEGIN
INSERT INTO ticker VALUES('ACME', '01-Apr-11', 12);
INSERT INTO ticker VALUES('ACME', '02-Apr-11', 17);
INSERT INTO ticker VALUES('ACME', '03-Apr-11', 19);
INSERT INTO ticker VALUES('ACME', '04-Apr-11', 21);
INSERT INTO ticker VALUES('ACME', '05-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '06-Apr-11', 12);
INSERT INTO ticker VALUES('ACME', '07-Apr-11', 15);
INSERT INTO ticker VALUES('ACME', '08-Apr-11', 20);
INSERT INTO ticker VALUES('ACME', '09-Apr-11', 24);
INSERT INTO ticker VALUES('ACME', '10-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '11-Apr-11', 19);
INSERT INTO ticker VALUES('ACME', '12-Apr-11', 15);
INSERT INTO ticker VALUES('ACME', '13-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '14-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '15-Apr-11', 14);
INSERT INTO ticker VALUES('ACME', '16-Apr-11', 14);
INSERT INTO ticker VALUES('ACME', '17-Apr-11', 14);
INSERT INTO ticker VALUES('ACME', '18-Apr-11', 24);
INSERT INTO ticker VALUES('ACME', '19-Apr-11', 23);
INSERT INTO ticker VALUES('ACME', '20-Apr-11', 22);
INSERT INTO ticker VALUES('GLOBEX', '01-Apr-11', 11);
INSERT INTO ticker VALUES('GLOBEX', '02-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '03-Apr-11', 13);
INSERT INTO ticker VALUES('GLOBEX', '04-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '05-Apr-11', 11);
INSERT INTO ticker VALUES('GLOBEX', '06-Apr-11', 10);
INSERT INTO ticker VALUES('GLOBEX', '07-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '08-Apr-11', 8);
INSERT INTO ticker VALUES('GLOBEX', '09-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '10-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '11-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '12-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '13-Apr-11', 10);
INSERT INTO ticker VALUES('GLOBEX', '14-Apr-11', 11);
INSERT INTO ticker VALUES('GLOBEX', '15-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '16-Apr-11', 11);
INSERT INTO ticker VALUES('GLOBEX', '17-Apr-11', 8);
INSERT INTO ticker VALUES('GLOBEX', '18-Apr-11', 7);
INSERT INTO ticker VALUES('GLOBEX', '19-Apr-11', 5);
INSERT INTO ticker VALUES('GLOBEX', '20-Apr-11', 3);
INSERT INTO ticker VALUES('OSCORP', '01-Apr-11', 22);
INSERT INTO ticker VALUES('OSCORP', '02-Apr-11', 22);
INSERT INTO ticker VALUES('OSCORP', '03-Apr-11', 19);
INSERT INTO ticker VALUES('OSCORP', '04-Apr-11', 18);
INSERT INTO ticker VALUES('OSCORP', '05-Apr-11', 17);
INSERT INTO ticker VALUES('OSCORP', '06-Apr-11', 20);
INSERT INTO ticker VALUES('OSCORP', '07-Apr-11', 17);
INSERT INTO ticker VALUES('OSCORP', '08-Apr-11', 20);
INSERT INTO ticker VALUES('OSCORP', '09-Apr-11', 16);
INSERT INTO ticker VALUES('OSCORP', '10-Apr-11', 15);
INSERT INTO ticker VALUES('OSCORP', '11-Apr-11', 15);
INSERT INTO ticker VALUES('OSCORP', '12-Apr-11', 12);
INSERT INTO ticker VALUES('OSCORP', '13-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '14-Apr-11', 15);
INSERT INTO ticker VALUES('OSCORP', '15-Apr-11', 12);
INSERT INTO ticker VALUES('OSCORP', '16-Apr-11', 16);
INSERT INTO ticker VALUES('OSCORP', '17-Apr-11', 14);
INSERT INTO ticker VALUES('OSCORP', '18-Apr-11', 12);
INSERT INTO ticker VALUES('OSCORP', '19-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '20-Apr-11', 9);
commit;
END;
Now let's check to see how many rows are in our dataset
SELECT count(*) FROM ticker;
SELECT symbol, min(tstamp), max(tstamp), count(*) FROM ticker GROUP BY symbol;
You should have 60 rows of data spread across three symbols (ACME, GLOBEX, OSCORP) with 20 rows of data for each ticker symbol. Our ticker data for each symbol starts on April 1 and ends on April 20.
You can view the full data set using the following code:
SELECT * FROM ticker ORDER BY symbol, tstamp;
Compared to previous MATCH_RECOGNIZE tutorials, here there is a small change to the data for ticker symbol ACME. The price on Apr-16 has been changed to 14 so that there are now three consecutive rows (15-Apr, 16-Apr and 17-Apr) with the same value (14) for price.
SELECT * FROM ticker
WHERE symbol = 'ACME'
ORDER BY tstamp;
If we now search for V-shaped patterns and define the down and up variables so that there is potential for ties by using <= and >= in their respective definitions then we can explore this idea of greedy quantifiers by using the following code:
SELECT symbol, tstamp, price, mn, pattern,
first_down, first_price, last_up, last_price
FROM ticker MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES MATCH_NUMBER() AS mn,
CLASSIFIER() as pattern,
FIRST(strt.tstamp) AS first_down,
FIRST(strt.price) as first_price,
LAST(up.tstamp) AS last_up,
LAST(up.price) as last_price
ALL ROWS PER MATCH
PATTERN (strt down+ up+)
DEFINE
down AS (price <= PREV(price)),
up AS (price >= PREV(price))
)
WHERE symbol = 'ACME'
ORDER BY symbol, tstamp;
Note that on April 17 we match the row to the down variable because that variable is being greedy. It could have been matched to the 'up' variable because the price is actually equal to the previous row but 'down' took precedence.
What if we change down and make it reluctant by using the ? quantifier:
SELECT symbol, tstamp, price, mn, pattern,
first_down, first_price, last_up, last_price
FROM ticker MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES MATCH_NUMBER() AS mn,
CLASSIFIER() as pattern,
FIRST(strt.tstamp) AS first_down,
FIRST(strt.price) as first_price,
LAST(up.tstamp) AS last_up,
LAST(up.price) as last_price
ALL ROWS PER MATCH
PATTERN (strt down+? up+)
DEFINE
down AS (price <= PREV(price)),
up AS (price >= PREV(price))
)
WHERE symbol = 'ACME'
ORDER BY symbol, tstamp;
Now you can see that because down is reluctant it matches as few occurrences as possible and where a row could be mapped to either down or up it is up that takes precedence.
Obviously this can impact the way that rows are matched to your pattern. Therefore, you need to think carefully about how you are going to manage the situation where rows could be matched to more than one variable - do you have preference for which variable wins?
If you have measures that are tied to specific variables then these will be impacted by whether the variable is greedy or reluctant. Obvious examples are where you are performing some sort of calculation such as averages, sums or counts.
Therefore, when you are constructing your pattern please think very carefully about how greedy you want your matching process to be as it processes your data set.