1 / 13

Pattern Matching in Sequences of Rows March 2, 2007 Change Proposal (for SQL standards)

Pattern Matching in Sequences of Rows March 2, 2007 Change Proposal (for SQL standards). Authors: Fred Zemke (Oracle), Andrew Witkowski (Oracle), Mitch Cherniak (Streambase),Latha Colby (IBM). ISO/IEC JTC1/SC32 WG3:URC-nnn ANSI NCITS H2-2006-nnn. CS240B Notes by: Carlo Zaniolo

jud
Download Presentation

Pattern Matching in Sequences of Rows March 2, 2007 Change Proposal (for SQL standards)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Pattern Matching in Sequences of RowsMarch 2, 2007Change Proposal(for SQL standards) Authors: Fred Zemke (Oracle), Andrew Witkowski (Oracle), Mitch Cherniak (Streambase),Latha Colby (IBM) ISO/IEC JTC1/SC32 WG3:URC-nnn ANSI NCITS H2-2006-nnn CS240B Notes by: Carlo Zaniolo Computer Science Department UCLA

  2. Match_Recognize Inspired by SQL-TS, but more verbose and more options. For instance: * — 0 or more matches + — 1 or more matches ? — 0 or 1 match { n } — exactly n matches { n, m } — between n and m (inclusive) matches Alternation: indicated by a vertical bar ( | ). More ...

  3. Example • Let Ticker (Symbol, Tstamp, Price) be a table with three columns representing historical stock prices. • Symbol is a character column, • Tstamp is a timestamp column (for simplicity shown as increasing integers) and • Price is a numeric column. • We want to partition the data by Symbol, sort it into increasing Tstamp order, and then detect the following pattern in Price: • a falling price, followed by • a rise in price that goes higher than the price was when the fall began. • After finding such patterns, it is desired to report the starting time, starting price, inflection time (last time duringthe decline phase), low price, end time, and end price.

  4. FROM Ticker MATCH_RECOGNIZE ( PARTITION BY Symbol ORDER BY Tstamp MEASURES A.Symbol AS a_symbol, A.Tstamp AS a_tstamp, A.Price AS a_price, MAX (C.Tstamp) AS max_c_tstamp, LAST (C.Price) AS last_c_price MAX (F.Tstamp) AS max_f_tstamp MATCH_NUMBER AS matchno ONE ROW PER MATCH AFTER MATCH SKIP PAST LAST ROW MAXIMAL MATCH PATTERN (A B C* D E* F+) DEFINE /* A defaults to True, matches any row */ B AS (B.price < PREV(B.price)), C AS (C.price <= PREV(C.price)), D AS D.Price > PREV(D.price)), E AS (E.Price >= PREV(E.Price)), F AS (F.Price >= PREV(F.price) AND F.price > A.price)) Example SELECT a_symbol, a_tstamp, /* start time */ a_price,/* start price */ max_c_tstamp, /* inflection time */ last_c_price, /* low price */ max_f_tstamp, /* end time */ last_c_price, /* end price */ Matchno

  5. FROM Ticker MATCH_RECOGNIZE ( PARTITION BY Symbol ORDER BY Tstamp MEASURESA.Symbol AS a_symbol, A.Tstamp AS a_tstamp, A.Price AS a_price, MAX (C.Tstamp) AS max_c_tstamp, LAST (C.Price) AS last_c_price MAX (F.Tstamp) AS max_f_tstamp MATCH_NUMBER AS matchno ONE ROW PER MATCH AFTER MATCH SKIP PAST LAST ROW MAXIMAL MATCH PATTERN (A B C* D E* F+) DEFINE /* A defaults to True, matches any row */ B AS (B.price < PREV(B.price)), C AS (C.price <= PREV(C.price)), D AS D.Price > PREV(D.price)), E AS (E.Price >= PREV(E.Price)), F AS (F.Price >= PREV(F.price) AND F.price > A.price)) SELECT a_symbol, a_tstamp, /* start time */ a_price, /* start price */ max_c_tstamp, /* inflection time */ last_c_price, /* low price */ max_f_tstamp, /* end time */ last_c_price, /* end price */ Matchno Measures: Naming and renaming

  6. FROM Ticker MATCH_RECOGNIZE ( PARTITION BY Symbol ORDER BY Tstamp MEASURES A.Symbol AS a_symbol, A.Tstamp AS a_tstamp, A.Price AS a_price, MAX (C.Tstamp) AS max_c_tstamp, LAST (C.Price) AS last_c_price MAX (F.Tstamp) AS max_f_tstamp MATCH_NUMBER AS matchno ONE ROW PER MATCH AFTER MATCH SKIP PAST LAST ROW MAXIMAL MATCH PATTERN (A B C* D E* F+) DEFINE /* A defaults to True, matches any row */ B AS (B.price < PREV(B.price)), C AS (C.price <= PREV(C.price)), D AS D.Price > PREV(D.price)), E AS (E.Price >= PREV(E.Price)), F AS (F.Price >= PREV(F.price) AND F.price > A.price)) SELECT a_symbol, a_tstamp, /* start time */ a_price, /* start price */ max_c_tstamp, /* inflection time */ last_c_price, /* low price */ max_f_tstamp, /* end time */ last_c_price, /* end price */ Matchno Define the pattern and te conditions which must be satisfied in each state of the pattern No condition on A

  7. FROM Ticker MATCH_RECOGNIZE ( PARTITION BY Symbol ORDER BY Tstamp MEASURES A.Symbol AS a_symbol, A.Tstamp AS a_tstamp, A.Price AS a_price, MAX (C.Tstamp) AS max_c_tstamp, LAST (C.Price) AS last_c_price MAX (F.Tstamp) AS max_f_tstamp MATCH_NUMBER AS matchno ONE ROW PER MATCH AFTER MATCH SKIP PAST LAST ROW MAXIMAL MATCH PATTERN (A B C* D E* F+) DEFINE /* A defaults to True, matches any row */ B AS (B.price < PREV(B.price)), C AS (C.price <= PREV(C.price)), D AS D.Price > PREV(D.price)), E AS (E.Price >= PREV(E.Price)), F AS (F.Price >= PREV(F.price) AND F.price > A.price)) SELECT a_symbol, a_tstamp, /* start time */ a_price, /* start price */ max_c_tstamp, /* inflection time */ last_c_price, /* low price */ max_f_tstamp, /* end time */ last_c_price, /* end price */ Matchno { ONE ROW | ALL ROWS } PER MATCH { MAXIMAL | INCREMENTAL } MATCH AFTER MATCH SKIP { TO NEXT ROW | PAST LAST ROW | TO LAST<variable> | TO FIRST <variable> }

  8. FROM Ticker MATCH_RECOGNIZE ( PARTITION BY Symbol ORDER BY Tstamp MEASURES A.Symbol AS a_symbol, A.Tstamp AS a_tstamp, A.Price AS a_price, MAX (C.Tstamp) AS max_c_tstamp, LAST (C.Price) AS last_c_price MAX (F.Tstamp) AS max_f_tstamp MATCH_NUMBER AS matchno CLASSIFIER AS Classy ALL ROWS PER MATCH AFTER MATCH SKIP PAST LAST ROW MAXIMAL MATCH PATTERN (A B C* D E* F+) DEFINE /* A defaults to True, matches any row */ B AS (B.price < PREV(B.price)), C AS (C.price <= PREV(C.price)), D AS D.Price > PREV(D.price)), E AS (E.Price >= PREV(E.Price)), F AS (F.Price >= PREV(F.price) AND F.price > A.price) ) T All Rows per Match SELECT T.Symbol, /* row’s symbol/ * T.Tstamp, /* row’s time */ T.Price, /* row’s price */ T.classy /* row’s classifier */ T.a_tstamp, /* start time */ T.a_price, /* start price */ T.max_c_tstamp, /*inflection time*/ T.last_c_price, /* low price */ T.max_f_tstamp, /* end time */ end price */ • ALL ROWS PER MATCH :one row for each row in the pattern. • In addition to partitioning, ordering and measure columns we can reference other columns. (via T) • CLASSIFIER component that may be used to declare a character result column whose contents on each row is the variable name that the row matched with.

  9. Syntactic Sugar • Variables can be repeated in the pattern clause • SUBSET: to rename a set of variables • Portion of the pattern can be excluded (when returning all rows) • Special construct to define alternations obtained as permutations of variables

  10. Singletons and group variables FROM Ticker MATCH_RECOGNIZE ( PARTITION BY symbol ORDER BY tstamp MEASURES FIRST(a.time) a_firsttime, LAST(d.time) d_lasttime, AVG(b.price) b_avgprice, AVG(d.price) d_avgprice PATTERN ( A B+ C+ D ) DEFINE A AS A.price > 100, B AS B.price > A.price, C AS C.price < AVG (B.price), D AS D.price > PREV(D.price) ) • If a variable is a singleton, then only individual columns may be referenced, not aggregates. • If the variable is used in an aggregate, then the aggregate is performed over all rows that have matched the variable so far. If desired, we can construe this as providing running aggregates with no special syntax, when a variable is referenced in an aggregate in its own definition, or we can continue to require special syntax to highlight that a running aggregate is meant.

  11. More ALL ROWS PER MATCH—only • CLASSIFIER is used to specify the name of a character string column, called the classifier column. In each row of output, the classifier column is set to the variable name in the PATTERN that the row matched. • MATCH_NUMBERMatches within a partition are numbered sequentially starting with 1 in the order they are chosenin the previous section. The MATCH_NUMBER component is used to specify a column name for an extra column of output from the MATCH_RECOGNIZE construct. The extra column is an exact numeric with scale 0, and provides the MATCH_NUMBER within a partition, starting with 1 for the first match, 2 for the second, etc. • FIRST and LAST special aggregates for group variables

  12. Windows SELECT sum_yprice OVER W, x_time OVER W, AVG(Y.Price) FROM T WINDOW W AS (PARTITION BY .. ORDER BY.. MEASURES SUM(Y.price) AS sum_yprice x.time AS x_time (PATTERN (X Y+ Z)...) )

  13. Conclusions • Specs proposed by 2 DBMS vendors (Oracle & IBM) and 2 DSMS startups (Coral8 and Streambase) • Very powerful: capabilities of SQL-TS plus several new constructs of convenience—particularly in controlling output. • Optimization techniques developed for SQL-TS could also be critical here.

More Related