High-performance Pattern Detection and Discovery for Databases and Data Streams

UCLA Computer Science Department High-performancePattern Detection and Discoveryfor Databases and Data Streams Barzan Mozafari Adviser: Prof. Carlo Zaniolo Committee Members: Prof. Junghoo Cho, Prof. D. Stott Parker, and Prof. Mark Hansen Winter 2011

Big Picture • Query Languages that allow for the expression of complex patterns • Scalable Systems that support such languages and can handle massive, high-arrival data • Efficient, One-pass Algorithms that can mine large amounts of stored or streaming data and extract useful patterns Query Patterns Matches Data Mining Data

Overview • Introduction • Query Languages for Pattern Detection • Kleene-* Constructs in SQL • Nested Words[SIGMOD’10, VLDB’10] • Optimization [Work in progress] • XSeq [Work in progress] • Conclusion

Complex Event Patterns • Sequences in DBs and CEP over data streams • Academic and industrial interest: • SQL-TS [PODS ‘01] • SASE [2006], SASE+ [2008] • SQL Change proposal, 2007 (by Oracle, IBM and Streambase) • Other industrial and academic languages: • Cayuga & CEL • CEDR • Microsoft CEP & LINQ

Our Contribution: K*SQL • A powerful language for: i. Expressing more complex patterns on relational streams and sequences ii. Querying data with more complex structures, e.g, XML and genomic data • A unifying engine for sequence patterns and XML • New optimization techniques • pattern search over nested words • Efficient query execution backend for other languages • XSeq: An XPath-resembling language to bring Kleene-* to XML applications

Regular Expressions in SQL rfid_readings (Time, SensorType, ensorId, ItemId)

Nested Kleene-*: K*SQL

Employees who spend >1 hour in the lab but leave without going to decontamination room Lab Lab Room2 Room12 Room7 Lab Room2 Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN Exit

Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L Lab Room2 Room12 Room7 Lab Room2 Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN ( L ) WHEREL.room = ‘Lab’ Exit

Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L+ L Lab Room2 Room12 Room7 Lab Room2 Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN( L+ ) WHERE L.room = ‘Lab’ Exit

Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L+ L Lab O Room2 O+ O Room12 O Room7 Lab Room2 Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN ( L+ O+ ) WHERE L.room = ‘Lab’ ANDO.room != ‘Decontamination’ Exit

Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L+ L Lab R R Room2 O+ R Room12 R Room7 L+ L Lab R R Room2 O+ R Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN ( (R: L+ O*) ) WHERE L.room = ‘Lab’ AND O.room != ‘Decontamination’ Exit

Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L+ L Lab R R Room2 O+ R Room12 R Room7 L+ R+ L Lab R R Room2 O+ R Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN ( (R: L+ O*)+ ) WHERE L.room = ‘Lab’ AND O.room != ‘Decontamination’ Exit

Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L+ L Lab R R Room2 O+ R Room12 R Room7 L+ R+ L Lab R R Room2 O+ R Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN ( (R: L+ O*)+ X) WHERE L.room = ‘Lab’ AND O.room != ‘Decontamination’ ANDX.room = ‘Exit’ X Exit

Employees who spend >1 hour in the lab but leave without going to decontamination room L Lab L+ L Lab R R Room2 O+ R Room12 R Room7 L+ R+ L Lab R R Room2 O+ R Room7 SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN( (R: L+ O*)+ X) WHERE L.room = ‘Lab’ AND O.room != ‘Decontamination’ AND X.room = ‘Exit’ ANDsum(R.Last(L).timestamp – R.First(L).timestamp) > 3600 X Exit

Strictly More Expressive, through:(i)Nested Kleene-*, (ii) Labels, i.e. Aliases SELECT badgeID FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN( (R: L+ O*)+ X) WHERE L.room = ‘Lab’ AND O.room != ‘Decontamination’ AND X.room = ‘Exit’ AND sum(R.Last(L).timestamp – R.First(L).timestamp) > 3600

Strictly More Expressive, through:(i)Nested Kleene-*, (ii) Labels, i.e. Aliases L Lab L+ L Lab R R Room2 O+ R Room12 R Room7 L+ R+ L Lab R R Room2 O+ R Room7 SELECT badgeID, Last(R).Last(L).timestamp – First(R).First(L).timestamp) FROM rfid PARTITIONBY badgeID ORDERBY timestamp ASPATTERN( (R: L+ O*)+ X) WHERE L.room = ‘Lab’ AND O.room != ‘Decontamination’ AND X.room = ‘Exit’ AND sum(R.Last(L).timestamp – R.First(L).timestamp) > 3600 X Exit

K*SQL Checkpoint • A powerful language with a very efficient implementation based on FSA • Subsumes SQL-MR, SASE+, Cayuga, SQL-TS • Many interesting applications • including queries on semistructured documents Very natural question: Can we handle full XML?

Automata and XML • Word Automata (FSA): only linear structure is explicit, cannot model parenthesis languages • Ordered Tree Automata (OTA): only hierarchical structure is explicit, exponentially less succinct for word queries • Pushdown Automata (PDA): Many problems are undecidable; expensive complexity

Advances in the Automata World Nested Words [Alur’06] • Linear sequence + well-nested edges • Positions labeled with symbols in S a2 a1 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 Positions classified as: • Call positions: both linear and hierarchical successors • Return positions: both linear and hierarchical predecessors • Internal positions: otherwise 20

Nested Word Applications U A U A U A G G G G G C C C C C XML Document Program RNA Sequence <conference> <name> CAV 2006 </name> <location> <city> Seattle </city> <hotel> Sheraton </hotel> </location> <sponsor> MSR </sponsor> <sponsor> Cadence </sponsor> </conference> global int x; bool P() { … x = 3; if Q x = 1 ; … } bool Q () { local int y; … x = y; return (x==0); } Primary structure: Linear sequence of nucleotides (A, C, G, U) Secondary structure: Hydrogen bonds between nucleotides

Odious Comparison NWA is exponentially more succinct than Tree Automata No query language has been proposed for NW

XML Sigmod Record:SAX-3 <!ELEMENTSigmodRecord(issue)* > <!ELEMENTissue (volume,number,articles) > <!ELEMENTvolume (#PCDATA)> <!ELEMENTnumber (#PCDATA)> <!ELEMENTarticles (article)* > <!ELEMENTarticle(title,initPage,endPage,authors) > <!ELEMENTtitle (#PCDATA)> <!ELEMENTinitPage (#PCDATA)> <!ELEMENTendPage (#PCDATA)> <!ELEMENTauthors (author)* > <!ELEMENTauthor (#PCDATA)> <!ATTLISTauthorposition CDATA #IMPLIED>

XPath <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. Find articles of Carlo Zanioloas the 2nd co-author //article[authors/author [@position = "01" and text()="Carlo Zaniolo"] ]/title/text()

K*SQL Question: Can we query nested words in K*SQL? In particular: can we express traditional XML queries • i.e. those often expressed via XPath/XQuery:

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <aut hors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN ( ) WHERE

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt ) WHEREOpArt.value = ‘<article>’

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt ) WHEREOpArt = open(‘article’)

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’)

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.tokenAS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’)

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’)

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) ANDisElement(E)

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’)

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDClArt = close(‘article’)

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’)

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) ANDpos.type = ‘attr’ AND pos.value = ’01’ AND pos.token = ‘position’

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) ANDpos = attribute (‘position’, ’01’)

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos Author ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) AND pos = attribute(‘position’, ‘01’) ANDauthor.token = `Carlo Zaniolo’

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos Author ClAu ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) AND pos = attribute(‘position’, ‘01’) AND author.value = `Carlo Zaniolo’ ANDClAu = close(‘author’)

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos Author ClAu E* ) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) AND pos = attribute(‘position’, ‘01’) AND author.value = `Carlo Zaniolo’ ANDClAu = close(‘author’)

Find articles of Carlo Zaniolo as the 2nd co-author <SigmodRecord> <issue> … <article> <title> Implementation of GEM </title> <initPage> 45 </initPage> … <authors> … <author position="01"> Carlo Zaniolo </author> … </authors> </article> …. SELECTTitle.token AS articleName FROM sigmod_record AS PATTERN (OpArt OpTitl Title ClTitl E* OpAuths E* OpAu Pos Author ClAu E* ClAuths ClArt) WHERE OpArt = open(‘article’) ANDOpTitl = open(‘title’) ANDClTitl = close(‘title’) AND isElement(E) ANDOpAuths = open(‘authors’) ANDOpAu = open(‘author’) AND pos = attribute(‘position’, ‘01’) AND author.token = `Carlo Zaniolo’ ANDClAu = close(‘author’) ANDClAuths = close(‘authors’) ANDClArt = close(‘article’)

Sequence Queries over XML: ‘W’-Patterns in Stocks <!ELEMENTStocks(Stock)* > <!ELEMENTStock(symbol, date, price, volume)> <!ELEMENTsymbol (#PCDATA)> <!ELEMENTdate (#PCDATA)> <!ELEMENTprice (#PCDATA)> <!ELEMENTvolume (#PCDATA)>

W-patterns in NASDAQ transactions with volume>1000 <Stock symbol=“YHOO” date=“01-01-2010 23:10:00”> <price> 18.50 </price> <volume> 21 </volume> </Stock> <Stock symbol=“YHOO” date=“01-01-2010 23:16:00”> <price> 18.70 </price> <volume> 11 </volume> </Stock> … SELECT FIRST(Z).FIRST(X).Sym.token FROM Nasdaq PARTITION BY Y.X.Sym.token AS PATTERN (Z: (X: OpSt Sym Date OP Price1 CP OpV Volume ClV ClSt)* (Y: OpSt Sym Date OP Price2 CP OpV Volume ClV ClSt)* )^2 WHERE OpSt = open(‘Stock’) AND ClSt = open(‘Stock’) AND OP = open(‘price’) AND CP = close(‘price’) AND OpV = open(‘volume’) AND ClV = close(‘volume’) AND INT(volume.token) >= 100 AND Z.X.price1.token < Z.PREV(X).price1.token AND Z.Y.price2.token > Z.PREV(Y).price2.token

W-patterns in NASDAQ transactions with volume>1000 Y* Y* X* X* <Stock symbol=“YHOO” date=“01-01-2010 23:10:00”> <price> 18.50 </price> <volume> 21 </volume> </Stock> <Stock symbol=“YHOO” date=“01-01-2010 23:16:00”> <price> 18.70 </price> <volume> 11 </volume> </Stock> … SELECT FIRST(Z).FIRST(X).Sym.token FROM Nasdaq PARTITION BY Y.X.Sym.token AS PATTERN (Z: (X: OpSt Sym Date OP Price1 CP OpV Volume ClV ClSt)* (Y: OpSt Sym Date OP Price2 CP OpV Volume ClV ClSt)* )^2 WHERE OpSt = open(‘Stock’) AND ClSt = open(‘Stock’) AND OP = open(‘price’) AND CP = close(‘price’) AND OpV = open(‘volume’) AND ClV = close(‘volume’) AND INT(volume.token) >= 100 AND Z.X.price1.token < Z.PREV(X).price1.token AND Z.Y.price2.token > Z.PREV(Y).price2.token

Optimization in K*SQL • Compile-Time: • Inferring inter-predicate implications • Query re-writing, e.g. adding more constrainst • Greedy predicate assignment • Run-Time: Avoiding unnecessary backtracks • VPSearch: Extending KMP search algorithm to nested words and visibly pushdown words • Optimizing non-determinisitc queries • i.e. all-match query modes

K*SQL vs. XML Engines

References • [1] Data mining: Staking a claim on your privacy. Information and Privacy Commissioner, Ontario, Jan. 1998. • [2] Directive on privacy protection. European Union, Oct. 1998. • [3] The end of privacy. The Economist, May 1999. • [4] Daniel J. Abadi, Donald Carney, Ugur C etintemel, Mitch Cherniack, Christian Convey, C. Erwin, Eduardo F. Galvez, M. Hatoun, Anurag Maskey, Alex Rasin, A. Singer, Michael Stonebraker, Nesime Tatbul, Ying Xing, R. Yan, and Stanley B. Zdonik. Aurora: A data stream management system. In SIGMOD Conference, page 666, 2003. • [5] Mads Sig Ager, Olivier Danvy, and Henning Korsholm Rohde. Fast partial evaluation of pattern matching in strings. In PEPM, 2003. • [6] Jagrati Agrawal, Yanlei Diao, Daniel Gyllstrom, and Neil Immerman. Ecient pattern matching over event streams. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 147{160, New York, NY, USA, 2008. ACM.

High-performance Pattern Detection and Discovery for Databases and Data Streams

High-performance Pattern Detection and Discovery for Databases and Data Streams

Presentation Transcript

Data and Databases

High Performance Recording and Manipulation of Distributed Streams

Data-Streams and Histograms

Intrusion Prevention and Detection for Web Databases

Data, Databases, and Discovery

Frequent Pattern Mining in Data Streams

Algorithms for pattern matching and pattern discovery in music

High Performance Cyberinfrastructure Discovery Tools for Data Intensive Research

Data Mining and Access Pattern Discovery

Review Spam Detection via Temporal Pattern Discovery

Data Analysis and High Performance Computing

Pattern Discovery and Recognition for Genetic Regulation

Pattern Databases

Basis for Pattern Detection

Data-Streams and Histograms

Data-Streams and Histograms

Review Spam Detection via Temporal Pattern Discovery

High Performance Recording and Manipulation of Distributed Streams