UCLA Computer Science Department. High-performance Pattern Detection and Discovery for Databases and Data Streams. Barzan Mozafari. Adviser: Prof. Carlo Zaniolo. Committee Members: Prof. Junghoo Cho, Prof. D. Stott Parker, and Prof. Mark Hansen. Winter 2011.


UCLA

Computer Science Department

High-performance Pattern Detection and Discovery for Databases and Data Streams

Barzan Mozafari

Adviser: Prof. Carlo Zaniolo

Committee Members:

Prof. Junghoo Cho,

Prof. D. Stott Parker, and

Prof. Mark Hansen

Winter 2011

Big Picture
  • Query Languages that allow for the expression of complex patterns
  • Scalable Systems that support such languages and can handle massive, high-arrival data
  • Efficient, One-pass Algorithms that can mine large amounts of stored or streaming data and extract useful patterns

[Figure: diagram relating Query, Patterns, Matches, Data Mining, and Data]

Overview
  • Introduction
  • Query Languages for Pattern Detection
    • Kleene-* Constructs in SQL
    • Nested Words [SIGMOD’10, VLDB’10]
    • Optimization [Work in progress]
    • XSeq [Work in progress]
  • Conclusion
Complex Event Patterns
  • Sequences in DBs and CEP over data streams
  • Academic and industrial interest:
    • SQL-TS [PODS ’01]
    • SASE [2006], SASE+ [2008]
    • SQL Change proposal, 2007 (by Oracle, IBM and Streambase)
    • Other industrial and academic languages:
      • Cayuga & CEL
      • CEDR
      • Microsoft CEP & LINQ
Our Contribution: K*SQL
  • A powerful language for:
    i. Expressing more complex patterns on relational streams and sequences
    ii. Querying data with more complex structures, e.g., XML and genomic data

  • A unifying engine for sequence patterns and XML
  • New optimization techniques
    • pattern search over nested words
  • Efficient query execution backend for other languages
  • XSeq: An XPath-resembling language to bring Kleene-* to XML applications
Regular Expressions in SQL

rfid_readings (Time, SensorType, SensorId, ItemId)

Employees who spend >1 hour in the lab but leave without going to decontamination room

[Figure: one badge's readings in time order: Lab, Lab, Room2, Room12, Room7, Lab, Room2, Room7, Exit]

SELECT badgeID
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN

Employees who spend >1 hour in the lab but leave without going to decontamination room

[Figure: the badge's reading sequence with each Lab reading labeled L]

SELECT badgeID
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( L )
WHERE L.room = 'Lab'

Employees who spend >1 hour in the lab but leave without going to decontamination room

[Figure: the reading sequence with the run of consecutive Lab readings grouped as L+]

SELECT badgeID
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( L+ )
WHERE L.room = 'Lab'

Employees who spend >1 hour in the lab but leave without going to decontamination room

[Figure: the Lab run grouped as L+, and the following readings in other rooms (Room2, Room12, Room7) labeled O and grouped as O+]

SELECT badgeID
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( L+ O+ )
WHERE L.room = 'Lab'
  AND O.room != 'Decontamination'

Employees who spend >1 hour in the lab but leave without going to decontamination room

[Figure: each Lab run (L+) together with the other-room readings that follow it is labeled as one group R]

SELECT badgeID
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( (R: L+ O*) )
WHERE L.room = 'Lab'
  AND O.room != 'Decontamination'

Employees who spend >1 hour in the lab but leave without going to decontamination room

[Figure: the consecutive R groups in the reading sequence are collected as R+]

SELECT badgeID
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( (R: L+ O*)+ )
WHERE L.room = 'Lab'
  AND O.room != 'Decontamination'

Employees who spend >1 hour in the lab but leave without going to decontamination room

[Figure: the R+ groups are followed by the final Exit reading, labeled X]

SELECT badgeID
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( (R: L+ O*)+ X )
WHERE L.room = 'Lab'
  AND O.room != 'Decontamination'
  AND X.room = 'Exit'

Employees who spend >1 hour in the lab but leave without going to decontamination room

[Figure: the reading sequence labeled with the L+, O+, and R+ groups and the final Exit reading X]

SELECT badgeID
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( (R: L+ O*)+ X )
WHERE L.room = 'Lab'
  AND O.room != 'Decontamination'
  AND X.room = 'Exit'
  AND sum(R.Last(L).timestamp - R.First(L).timestamp) > 3600

Strictly More Expressive, through: (i) Nested Kleene-*, (ii) Labels, i.e. Aliases

SELECT badgeID
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( (R: L+ O*)+ X )
WHERE L.room = 'Lab'
  AND O.room != 'Decontamination'
  AND X.room = 'Exit'
  AND sum(R.Last(L).timestamp - R.First(L).timestamp) > 3600

Strictly More Expressive, through: (i) Nested Kleene-*, (ii) Labels, i.e. Aliases

[Figure: the reading sequence labeled with the L+, O+, and R+ groups and the final Exit reading X]

SELECT badgeID,
       (Last(R).Last(L).timestamp - First(R).First(L).timestamp)
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( (R: L+ O*)+ X )
WHERE L.room = 'Lab'
  AND O.room != 'Decontamination'
  AND X.room = 'Exit'
  AND sum(R.Last(L).timestamp - R.First(L).timestamp) > 3600
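
To make the semantics of the pattern ( (R: L+ O*)+ X ) concrete, here is a minimal Python sketch that evaluates it over an in-memory list of readings. It is not the K*SQL engine: the function name lab_violations, the tuple layout (badge, timestamp in seconds, room), and the greedy single-pass matching are assumptions made for illustration. It also assumes that O only matches rooms other than Lab, Decontamination, and Exit, which is narrower than the predicate O.room != 'Decontamination'.

```python
from itertools import groupby


def lab_violations(readings, min_seconds=3600):
    """Return badge ids whose readings match (R: L+ O*)+ X with total Lab time > min_seconds.

    readings: iterable of (badge_id, timestamp_seconds, room) tuples (hypothetical layout).
    """
    flagged = []
    # PARTITION BY badgeID, ORDER BY timestamp
    ordered = sorted(readings, key=lambda r: (r[0], r[1]))
    for badge, rows in groupby(ordered, key=lambda r: r[0]):
        total = 0          # running sum of Last(L) - First(L) over the R groups
        run_start = None   # timestamp of the first reading of the current L+ run
        run_end = None     # timestamp of the last reading of the current L+ run
        started = False    # True once at least one Lab reading has been matched
        for _, ts, room in rows:
            if room == "Lab":
                if run_start is None:
                    run_start = ts
                run_end = ts
                started = True
                continue
            if run_start is not None:
                # A Lab run just ended: add its duration and close it.
                total += run_end - run_start
                run_start = run_end = None
            if room == "Exit":
                if started and total > min_seconds and badge not in flagged:
                    flagged.append(badge)
                total, started = 0, False
            elif room == "Decontamination":
                # The pattern is broken; start over.
                total, started = 0, False
            # Any other room plays the role of O and keeps the match alive.
    return flagged
```

Because each Lab run contributes Last(L) - First(L) separately, time spent in other rooms between two Lab runs is not counted, which is what the nested Kleene group R with its labels expresses in the query.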

K*SQL Checkpoint
  • A powerful language with a very efficient implementation based on FSA
  • Subsumes SQL-MR, SASE+, Cayuga, SQL-TS
  • Many interesting applications
    • including queries on semistructured documents

Very natural question:

Can we handle full XML?

Automata and XML
  • Word Automata (FSA): only linear structure is explicit, cannot model parenthesis languages
  • Ordered Tree Automata (OTA): only hierarchical structure is explicit, exponentially less succinct for word queries
  • Pushdown Automata (PDA): many problems are undecidable, and the decidable ones have high complexity
Advances in the Automata World

Nested Words [Alur’06]

  • Linear sequence + well-nested edges
  • Positions labeled with symbols in Σ

[Figure: a nested word with positions a1 through a12 and well-nested edges between them]

Positions classified as:

  • Call positions: both linear and hierarchical successors
  • Return positions: both linear and hierarchical predecessors
  • Internal positions: otherwise
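
The classification above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name nested_word and the token convention (strings starting with "</" are returns, other strings starting with "<" are calls, everything else is internal) are hypothetical, and unmatched calls or returns are simply left without a nesting edge.

```python
def nested_word(tokens):
    """Classify each position and recover the nesting edges.

    tokens: list of strings; '<x>' is a call, '</x>' is a return,
    anything else is an internal position (assumed convention).
    Returns (kinds, edges), where edges pairs each matched call index
    with its return index.
    """
    kinds, edges, stack = [], [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("</"):
            kinds.append("return")
            if stack:                      # matched return: add a nesting edge
                edges.append((stack.pop(), i))
        elif tok.startswith("<"):
            kinds.append("call")
            stack.append(i)                # remember the pending call
        else:
            kinds.append("internal")
    return kinds, edges
```

For example, nested_word(["<a>", "x", "<b>", "y", "</b>", "</a>"]) labels positions 0 and 2 as calls, 4 and 5 as returns, and yields the edges (2, 4) and (0, 5); the linear order is the list order itself.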


Nested Word Applications

XML Document:

<conference>
  <name>CAV 2006</name>
  <location>
    <city>Seattle</city>
    <hotel>Sheraton</hotel>
  </location>
  <sponsor>MSR</sponsor>
  <sponsor>Cadence</sponsor>
</conference>

Program:

global int x;
bool P() {
  x = 3;
  if Q x = 1;
}
bool Q() {
  local int y;
  x = y;
  return (x==0);
}

RNA Sequence:

[Figure: an RNA strand of nucleotides (A, C, G, U) with bonds drawn between paired nucleotides]

Primary structure: Linear sequence of nucleotides (A, C, G, U)

Secondary structure: Hydrogen bonds between nucleotides

Odious Comparison

NWA is exponentially more succinct than Tree Automata

No query language has been proposed for NW

XML Sigmod Record: SAX-3

<!ELEMENT SigmodRecord (issue)*>
<!ELEMENT issue (volume, number, articles)>
<!ELEMENT volume (#PCDATA)>
<!ELEMENT number (#PCDATA)>
<!ELEMENT articles (article)*>
<!ELEMENT article (title, initPage, endPage, authors)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT initPage (#PCDATA)>
<!ELEMENT endPage (#PCDATA)>
<!ELEMENT authors (author)*>
<!ELEMENT author (#PCDATA)>
<!ATTLIST author position CDATA #IMPLIED>

XPath

<SigmodRecord>
  <issue>
    <article>
      <title>Implementation of GEM</title>
      <initPage>45</initPage>
      <authors>
        <author position="01">Carlo Zaniolo</author>
      </authors>
    </article>
    ….

Find articles of Carlo Zaniolo as the 2nd co-author

//article[authors/author[@position = "01" and text() = "Carlo Zaniolo"]]/title/text()
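
As a point of comparison, the same lookup can be sketched with Python's standard xml.etree.ElementTree module. ElementTree supports only a subset of XPath (its predicates cannot combine conditions with "and"), so the two conditions are checked in plain Python. The function name titles_by_author_position is an assumption made for this example, and text is compared after stripping surrounding whitespace, which is slightly more lenient than the XPath expression.

```python
import xml.etree.ElementTree as ET


def titles_by_author_position(xml_text, name, position="01"):
    """Titles of articles where `name` appears as the author at `position`."""
    root = ET.fromstring(xml_text)
    titles = []
    for article in root.iter("article"):
        for author in article.findall("authors/author"):
            same_position = author.get("position") == position
            same_name = (author.text or "").strip() == name
            if same_position and same_name:
                titles.append((article.findtext("title") or "").strip())
                break  # one hit per article is enough
    return titles
```

The whole document is parsed into memory first, which is the cost that a one-pass, stream-oriented evaluation over the token sequence aims to avoid.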

K*SQL

Question: Can we query nested words in K*SQL?

In particular:

Can we express traditional XML queries?

  • i.e. those often expressed via XPath/XQuery
Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN ( )
WHERE

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt)
WHERE OpArt.value = '<article>'

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt)
WHERE OpArt = open('article')

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl)
WHERE OpArt = open('article')
  AND OpTitl = open('title')

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl Title)
WHERE OpArt = open('article')
  AND OpTitl = open('title')

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl Title ClTitl)
WHERE OpArt = open('article')
  AND OpTitl = open('title') AND ClTitl = close('title')

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl Title ClTitl E*)
WHERE OpArt = open('article')
  AND OpTitl = open('title') AND ClTitl = close('title')
  AND isElement(E)

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl Title ClTitl E*
            OpAuths)
WHERE OpArt = open('article')
  AND OpTitl = open('title') AND ClTitl = close('title')
  AND isElement(E)
  AND OpAuths = open('authors')

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl Title ClTitl E*
            OpAuths E*)
WHERE OpArt = open('article')
  AND OpTitl = open('title') AND ClTitl = close('title')
  AND isElement(E)
  AND OpAuths = open('authors')

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl Title ClTitl E*
            OpAuths E* OpAu)
WHERE OpArt = open('article')
  AND OpTitl = open('title') AND ClTitl = close('title')
  AND isElement(E)
  AND OpAuths = open('authors')
  AND OpAu = open('author')

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl Title ClTitl E*
            OpAuths E* OpAu Pos)
WHERE OpArt = open('article')
  AND OpTitl = open('title') AND ClTitl = close('title')
  AND isElement(E)
  AND OpAuths = open('authors')
  AND OpAu = open('author')
  AND Pos.type = 'attr' AND Pos.value = '01'
  AND Pos.token = 'position'

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl Title ClTitl E*
            OpAuths E* OpAu Pos)
WHERE OpArt = open('article')
  AND OpTitl = open('title') AND ClTitl = close('title')
  AND isElement(E)
  AND OpAuths = open('authors')
  AND OpAu = open('author')
  AND Pos = attribute('position', '01')

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl Title ClTitl E*
            OpAuths E* OpAu Pos Author)
WHERE OpArt = open('article')
  AND OpTitl = open('title') AND ClTitl = close('title')
  AND isElement(E)
  AND OpAuths = open('authors')
  AND OpAu = open('author')
  AND Pos = attribute('position', '01')
  AND Author.token = 'Carlo Zaniolo'

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl Title ClTitl E*
            OpAuths E* OpAu Pos Author ClAu)
WHERE OpArt = open('article')
  AND OpTitl = open('title') AND ClTitl = close('title')
  AND isElement(E)
  AND OpAuths = open('authors')
  AND OpAu = open('author')
  AND Pos = attribute('position', '01')
  AND Author.token = 'Carlo Zaniolo'
  AND ClAu = close('author')

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl Title ClTitl E*
            OpAuths E* OpAu Pos Author ClAu E*)
WHERE OpArt = open('article')
  AND OpTitl = open('title') AND ClTitl = close('title')
  AND isElement(E)
  AND OpAuths = open('authors')
  AND OpAu = open('author')
  AND Pos = attribute('position', '01')
  AND Author.token = 'Carlo Zaniolo'
  AND ClAu = close('author')

Find articles of Carlo Zaniolo as the 2nd co-author

SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN (OpArt OpTitl Title ClTitl E*
            OpAuths E* OpAu Pos Author ClAu E*
            ClAuths ClArt)
WHERE OpArt = open('article')
  AND OpTitl = open('title') AND ClTitl = close('title')
  AND isElement(E)
  AND OpAuths = open('authors')
  AND OpAu = open('author')
  AND Pos = attribute('position', '01')
  AND Author.token = 'Carlo Zaniolo'
  AND ClAu = close('author')
  AND ClAuths = close('authors')
  AND ClArt = close('article')
Sequence Queries over XML: ‘W’-Patterns in Stocks

<!ELEMENT Stocks (Stock)*>
<!ELEMENT Stock (symbol, date, price, volume)>
<!ELEMENT symbol (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT volume (#PCDATA)>

W-patterns in NASDAQ transactions with volume > 1000

<Stock symbol="YHOO" date="01-01-2010 23:10:00">
  <price> 18.50 </price>
  <volume> 21 </volume>
</Stock>
<Stock symbol="YHOO" date="01-01-2010 23:16:00">
  <price> 18.70 </price>
  <volume> 11 </volume>
</Stock>

SELECT FIRST(Z).FIRST(X).Sym.token
FROM Nasdaq PARTITION BY Y.X.Sym.token
AS PATTERN
  (Z: (X: OpSt Sym Date OP Price1 CP
          OpV Volume ClV ClSt)*
      (Y: OpSt Sym Date OP Price2 CP
          OpV Volume ClV ClSt)*
  )^2
WHERE OpSt = open('Stock') AND ClSt = close('Stock')
  AND OP = open('price') AND CP = close('price')
  AND OpV = open('volume') AND ClV = close('volume')
  AND INT(Volume.token) >= 100
  AND Z.X.Price1.token < Z.PREV(X).Price1.token
  AND Z.Y.Price2.token > Z.PREV(Y).Price2.token
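
The shape that this query looks for (a falling run X followed by a rising run Y, repeated twice) can be illustrated on a plain list of prices. The sketch below is a simplification, not the K*SQL evaluation: the function name find_w_patterns is hypothetical, it requires at least one falling and one rising step per leg (+ instead of *), and it ignores the symbol partitioning and the volume filter.

```python
import re


def find_w_patterns(prices):
    """Return (start, end) price-index pairs, inclusive, of W-shaped runs.

    Each consecutive price change is encoded as 'd' (down), 'u' (up) or
    '-' (unchanged); a W is two repetitions of "downs then ups".
    """
    moves = "".join(
        "d" if later < earlier else "u" if later > earlier else "-"
        for earlier, later in zip(prices, prices[1:])
    )
    # Move i sits between prices[i] and prices[i + 1], so a match over
    # moves[s:e] covers prices[s] through prices[e].
    return [(m.start(), m.end()) for m in re.finditer(r"(?:d+u+){2}", moves)]
```

For the series 10, 9, 8, 9, 10, 9, 10, 11 the encoded moves are "dduuduu", so the whole series is reported as one W spanning indices 0 to 7.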

[Figure: W-shaped price curve for the query above; the falling runs are labeled X* and the rising runs Y*]

Optimization in K*SQL
  • Compile-Time:
    • Inferring inter-predicate implications
    • Query re-writing, e.g. adding more constraints
    • Greedy predicate assignment
  • Run-Time: Avoiding unnecessary backtracks
    • VPSearch: Extending KMP search algorithm to nested words and visibly pushdown words
    • Optimizing non-deterministic queries
      • i.e. all-match query modes
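
For readers unfamiliar with KMP, the following is a minimal Python sketch of the classic algorithm over ordinary (flat) sequences, showing how a precomputed failure table avoids re-reading input after a mismatch. It is plain KMP, not VPSearch: the extension to nested words and visibly pushdown words is not reproduced here, and the function names are chosen for this example.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(sequence, pattern):
    """Return the start index of every occurrence of pattern in sequence."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, k = [], 0
    for i, symbol in enumerate(sequence):
        while k and symbol != pattern[k]:
            k = fail[k - 1]            # fall back without moving i backwards
        if symbol == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]            # keep going to find overlapping matches
    return hits
```

The input index i only moves forward, so each element of the stream is read once; this is the property that makes the approach attractive for one-pass processing.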