1 / 34

Database 12c Row Pattern Matching

Database 12c Row Pattern Matching. Beating the Best Pre-12c Solutions [CON3450]. Stew ASHTON Oracle OpenWorld 2014. Photo Opportunity. Presentation available on http:// www.slideshare.net / stewashton /row-patternmatching12coow14 For exact link: See @ StewAshton on Twitter

lise
Download Presentation

Database 12c Row Pattern Matching

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Database 12c Row Pattern Matching Beating the Best Pre-12c Solutions [CON3450] Stew ASHTON Oracle OpenWorld 2014

  2. Photo Opportunity • Presentation available on http://www.slideshare.net/stewashton/row-patternmatching12coow14 • For exact link: • See @StewAshton on Twitter • Or see http://stewashton.wordpress.com

  3. Agenda • Who am I? • Pre-12c solutions compared to row pattern matching with MATCH_RECOGNIZE • For all sizes of data • Thinking in patterns • Watch out for “catastrophic backtracking” • Other things to keep in mind (time permitting)

  4. Who am I? • 33 years in IT • Developer, Technical Sales Engineer, Technical Architect • Aeronautics, IBM, Finance • Mainframe, client-server, Web apps • 25 years as an American in Paris • 9 years using Oracle database • Performance analysis • Replace Java with SQL • 2 years as internal “Oracle Development Expert”

  5. 1) “Fixed Difference” • Identify and group rows with consecutive values • My presentation: print slides to keep • Math: subtract known consecutives • If A-1 = B-2 then A = B-1 • Else A <> B-1 • Consecutive becomes equality,non-consecutive becomes inequality • “Consecutive” = fixed difference of 1

  6. 1) Pre-12c select min(page) firstpage, max(page) lastpage, count(*) cnt FROM ( SELECT page, page – Row_Number() over(order by page) as grp_id FROM t ) GROUP BY grp_id;

  7. Think “match a row pattern” • PATTERN • Uninterrupted series of input rows • Described as a list of conditions (“regular expressions”) PATTERN (A B*) "A" : 1 row, "B" : 0 or more rows, as many as possible • DEFINE each row condition [A undefined = TRUE] B AS page = PREV(page)+1 • Each series that matches the pattern is a “match” • "A" and "B" identify the rows that meet their conditions

  8. Input, Processing, Output SELECT * FROM t MATCH_RECOGNIZE ( ORDER BY page MEASURES A.pagefirstpage, LAST(page) lastpage, COUNT(*) cnt ONE ROW PER MATCH AFTER MATCH SKIP PAST LAST ROW PATTERN (A B*) DEFINE B AS page = PREV(page)+1 ); • Define input • Order input • Process pattern • using defined conditions • Output: rows per match • Output: columns per row • Go where after match? SELECT * FROM t MATCH_RECOGNIZE ( ORDER BY page PATTERN (A B*) DEFINE B AS page = PREV(page)+1 ONE ROW PER MATCH MEASURES A.pagefirstpage, LAST(page) lastpage, COUNT(*) cnt AFTER MATCH SKIP PAST LAST ROW );

  9. 1) Run_Stats comparison For one million rows: “Latches” are serialization devices: fewer means more scalable

  10. 1) Execution Plans

  11. 2) “Start of Group” • Identify group boundaries, often using LAG() • 3 steps instead of 2: • For each row: if start of group, assign 1Else assign 0 • Running total of 1s and 0s produces a group identifier • Group by the group identifier

  12. 2) Requirement Merge contiguous date ranges in same group

  13. with grp_starts as ( select a.*, case when start_ts= lag(end_ts) over( partition by group_name order by start_ts ) then 0 else 1 end grp_start from t a ), grps as ( select b.*, sum(grp_start) over( partition by group_name order by start_ts ) grp_id from grp_startsb) select group_name, min(start_ts) start_ts, max(end_ts) end_ts from grps group by group_name, grp_id;

  14. 2) Match_Recognize SELECT * FROM t MATCH_RECOGNIZE( PARTITION BY group_name ORDER BY start_ts MEASURES A.start_tsstart_ts, end_tsend_ts, next(start_ts) - end_ts gap PATTERN(A B*) DEFINE B AS start_ts = prev(end_ts) ); • New this time: • Added PARTITION BY • MEASURESadded gap using row outside the match! • ONE ROW PER MATCHandSKIP PAST LAST ROWare the defaults One solution replaces two methods: simple!

  15. Which row do we mean?

  16. 2) Run_Stats comparison For 500,000 rows:

  17. 2) Execution Plans

  18. 2) Predicate pushing Select * from <view> where group_name = 'X'

  19. 3) “Bin fitting”: fixed size • Requirement • Order by study_site • Put in “bins” with size = 65,000 max

  20. SELECT s first_site, MAX(e) last_site, MAX(sm) sum_cnt FROM ( SELECT s, e, cnt, sm FROM t MODEL MEASURES (study_site s, study_site e, cnt, cntsm) RULES ( sm[ > 1] = CASE WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000 THEN cnt[cv()] ELSE sm[cv() - 1] + cnt[cv()] END, s[ > 1] = CASE WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000 THEN s[cv()] ELSE s[cv() - 1] END ) ) GROUP BY s; DIMENSION BY (row_number() over(order by study_site) rn) rn [cv() – 1] [cv()] [cv()] [cv()] [cv() – 1] [cv()] rn [cv() - 1] [cv()] [cv()] [cv()] [cv() – 1] • DIMENSION with row_numberorders data and processing • rn can be used like a subscript • cv() means current row • cv()-1 means previous row

  21. New this time: • PATTERN(A+) replaces (A B*)means 1 or more rows • Why?In previous examples I used PREV(), which returns NULL on the first row. SELECT * FROM t MATCH_RECOGNIZE ( ORDER BY study_site MEASURES FIRST(study_site) first_site, LAST(study_site) last_site, SUM(cnt) sum_cnt PATTERN (A+) DEFINE A AS SUM(cnt) <= 65000 ); One solution replaces 3 methods: simpler!

  22. 3) Run_Stats comparison For one million rows:

  23. 3) Execution Plans

  24. 4) “Bin fitting”: fixed number • Requirement • Distribute values in 3 “bins” as equally as possible • “Best fit decreasing” • Sort values in decreasing order • Put each value in least full bin

  25. 4) Brilliant pre 12c solution SELECT bin, Max (bin_value) bin_value FROM ( SELECT * FROM items MODEL DIMENSION BY (Row_Number() OVER (ORDER BY item_value DESC) rn) MEASURES ( item_name, item_value, Row_Number() OVER (ORDER BY item_value DESC) bin, item_valuebin_value, Row_Number() OVER (ORDER BY item_value DESC) rn_m, 0 min_bin, Count(*) OVER () - 3 - 1 n_iters ) RULES ITERATE(100000) UNTIL (ITERATION_NUMBER >= n_iters[1]) ( min_bin[1] = Min(rn_m) KEEP (DENSE_RANK FIRST ORDER BY bin_value)[rn<= 3], bin[ITERATION_NUMBER + 3 + 1] = min_bin[1], bin_value[min_bin[1]] = bin_value[CV()] + Nvl(item_value[ITERATION_NUMBER+4], 0)) ) WHERE item_name IS NOT NULL group by bin;

  26. SELECT * from items MATCH_RECOGNIZE ( ORDER BY item_valuedesc MEASURES sum(bin1.item_value) bin1, sum(bin2.item_value) bin2, sum(bin3.item_value) bin3 PATTERN ((bin1|bin2|bin3)+) DEFINE bin1 AS count(bin1.*) = 1 OR sum(bin1.item_value)-bin1.item_value <= least( sum(bin2.item_value), sum(bin3.item_value) ), bin2 AS count(bin2.*) = 1 OR sum(bin2.item_value)-bin2.item_value <= sum(bin3.item_value) ); PATTERN ((bin1|bin2|bin3)+) bin1 AS count(bin1.*) = 1 OR sum(bin1.item_value)-bin1.item_value <= least( sum(bin2.item_value), sum(bin3.item_value) ), bin2 AS count(bin2.*) = 1 OR sum(bin2.item_value)-bin2.item_value <= sum(bin3.item_value) • ()+ = 1 or more of whatever is inside • '|' = alternatives, “preferred in the order specified” • Bin1 condition: • No rows here yet, • Or this bin least full • Bin2 condition • No rows here yet, or • This bin less full than 3

  27. 4) Run_Stats comparison For 10,000 rows:

  28. 4) Execution Plans

  29. Backtracking • What happens when there is no match??? • “Greedy” quantifiers - * + {2,} • are not that greedy • Take all the rows they can, BUTgive rows back if necessary – one at a time • Regular expression engines will test all possible combinations to find a match

  30. Repeating conditions select 'match' from ( select level n from dual connect by level <= 100 ) match_recognize( pattern(a b* c) define b as n > prev(n) , c as n = 0 ); Runs in 0.005 secs select 'match' from ( select level n from dual connect by level <= 100 ) match_recognize( pattern(a b* b* b* c) define b as n > prev(n) , c as n = 0 ); Runs in 5.4 secs

  31. Imprecise Conditions SELECT * FROM Ticker MATCH_RECOGNIZE ( PARTITION BY symbol ORDER BY tstamp MEASURES FIRST(tstamp) AS start_tstamp, LAST(tstamp) AS end_tstamp AFTER MATCH SKIP TO LAST UP PATTERN (STRT DOWN+ UP+ DOWN+ UP+) DEFINE DOWN AS price < PREV(price), UP AS price > PREV(price), STRT AS price >= nvl(PREV(PRICE),0) ); Runs in 0.02 seconds CREATE TABLE Ticker ( SYMBOL VARCHAR2(10), tstamp DATE, price NUMBER ); insert into ticker select 'ACME', sysdate + level/24/60/60, 10000-level from dual connect by level <= 5000; SELECT * FROM Ticker MATCH_RECOGNIZE ( PARTITION BY symbol ORDER BY tstamp MEASURES FIRST(tstamp) AS start_tstamp, LAST(tstamp) AS end_tstamp AFTER MATCH SKIP TO LAST UP PATTERN (STRT DOWN+ UP+ DOWN+ UP+) DEFINE DOWN AS price < PREV(price), UP AS price > PREV(price) ); Runs in 24 seconds INMEMORY: 13 seconds

  32. Keep in Mind • Backtracking • Precise conditions • Test data with no matches • To debug:Measures classifier() cl, match_number() mnAll rows per match with unmatched rows • No DISTINCT, no LISTAGG • MEASURES columns must have aliases • “Reluctant quantifier” = ? = JDBC bind variable • “Pattern variables” are range variables, not bind variables

  33. Output Row “shape” ORA-00918, anyone?

  34. Questions? More details at:stewashton.wordpress.com

More Related