Buffering in Query Evaluation over XML Streams

Ziv Bar-Yossef

Technion

Marcus Fontoura

Vanja Josifovski

IBM Almaden Research Center

XML Document

 1: <department>
 2:   <name>
 3:     Software Testing
 4:   </name>
 5:   <employee id="1">
 6:     <name>
 7:       Alice
 8:     </name>
 9:     <position>
10:       engineer
11:     </position>
12:   </employee>
13:   <employee id="2">
14:     <name>
15:       Bob
16:     </name>
17:     <position>
18:       engineer
19:     </position>
20:   </employee>
21:   <employee id="3">
22:     <name>
23:       Carole
24:     </name>
25:     <position>
26:       assistant
27:     </position>
28:   </employee>
29:   <manager id="4">
30:     <name>
31:       John
32:     </name>
33:   </manager>
34: </department>

XML Document Tree

[Figure: tree view of the XML document. The root has one department child; department has a name (“Software Testing”), three employee children (each with @id, name, and position), and a manager child (with @id and name).]

XPath Queries

/department[manager/name = "John"]/employee[position = "engineer"]/name

[Figure: the query overlaid on the XML document tree.]

XPath Queries

/department[employee/name = manager/name]/name

[Figure: the query overlaid on the XML document tree.]

XPath
  • XPath 2.0
  • Forward axes only
  • Eval(Q,D): nodes in D that match Q
  • Two modes of XPath evaluation:
    • Full-fledged evaluation: given Q,D, output Eval(Q,D)
    • Filtering: given Q,D, determine whether Eval(Q,D) is nonempty.
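
The difference between the two modes can be sketched in a few lines of Python. This is only an illustration on an in-memory tree (not a streaming evaluator), it uses the limited XPath subset of the standard `xml.etree.ElementTree` module, and the helper names `eval_query` and `filter_query` are made up for this example. The query is a simplified version of the running example, since that module does not accept a path inside a predicate.

```python
import xml.etree.ElementTree as ET

DOC = """<department>
  <name>Software Testing</name>
  <employee id="1"><name>Alice</name><position>engineer</position></employee>
  <employee id="2"><name>Bob</name><position>engineer</position></employee>
  <employee id="3"><name>Carole</name><position>assistant</position></employee>
  <manager id="4"><name>John</name></manager>
</department>"""


def eval_query(root, path):
    """Full-fledged evaluation: return every node that matches the path."""
    return root.findall(path)


def filter_query(root, path):
    """Filtering: report only whether at least one node matches."""
    return root.find(path) is not None


root = ET.fromstring(DOC)  # root is the <department> element
QUERY = "./employee[position='engineer']/name"

names = [node.text for node in eval_query(root, QUERY)]
has_match = filter_query(root, QUERY)
```

Filtering returns a single bit, so it never has to hold on to candidate output nodes; full-fledged evaluation must be able to produce each matching node.
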
XML Streams
  • XML stream: sequence of SAX events
    • startDocument(), endDocument(), startElement(name), endElement(name), text(str), …
  • Critical resources
    • Memory
    • Processing time
  • Why XML streams?
    • For transferring XML between systems
    • For efficient access to large XML documents
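
To make the event sequence concrete, here is a small sketch that records the SAX events of a document with Python's standard `xml.sax` module. The class name `EventLogger` and the function `sax_events` are illustrative; note that Python calls the text event `characters` and may deliver one text node in several pieces.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX event as a short string, in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument()")

    def endDocument(self):
        self.events.append("endDocument()")

    def startElement(self, name, attrs):
        self.events.append(f"startElement({name})")

    def endElement(self, name):
        self.events.append(f"endElement({name})")

    def characters(self, content):
        # Whitespace-only text between tags is skipped to keep the log short.
        if content.strip():
            self.events.append(f"text({content.strip()})")


def sax_events(xml_text):
    """Parse xml_text and return the list of recorded events."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events
```

For example, `sax_events("<name>Alice</name>")` yields a start-document event, `startElement(name)`, `text(Alice)`, `endElement(name)`, and an end-document event. The handler sees each event once and in order, which is why anything needed later has to be kept in memory.
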
Streaming XML Algorithms
  • XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
  • X-scan [Ives, Levy, and Weld 00]
  • XMLTK [Avila-Campillo et al 02]
  • XTrie [Chan et al 02]
  • SPEX [Olteanu, Kiesling, and Bry 03]
  • Lazy DFAs [Green et al 03]
  • The XPush Machine [Gupta and Suciu 03]
  • XSQ [Peng and Chawathe 03]
  • FluX [Koch et al 04]
  • TurboXPath [Josifovski, Fontoura, and Barta 05]

All of them use lots of memory on certain queries & documents

Memory Bottleneck I: Storage of Large Transition Tables
  • Framework of most algorithms:
    • Q  NFA
    • Simulate NFA by DFA
  • Caveat: exponential blowup
  • However: exponential blowup is not necessary[Bar-Yossef, Fontoura, Josifovski 04]
    • Algorithm for filtering XML streams whose space is linear in the query size
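
The blowup caveat can be reproduced with a textbook example that is unrelated to XPath: the language "the k-th symbol from the end is a" over the alphabet {a, b}. An NFA for it has k+1 states, while the subset construction reaches 2^k DFA states. The sketch below is an assumption-laden illustration (the function names and state numbering are made up here) and says nothing about how any of the cited systems build their automata.

```python
def nfa_step(states, symbol, k):
    """One NFA move. State 0 loops on every symbol and guesses, on 'a',
    that this 'a' is the k-th symbol from the end; states 1..k-1 advance."""
    nxt = set()
    for s in states:
        if s == 0:
            nxt.add(0)
            if symbol == "a":
                nxt.add(1)
        elif s < k:
            nxt.add(s + 1)
        # State k is accepting and has no outgoing transitions.
    return frozenset(nxt)


def count_dfa_states(k):
    """Number of DFA states reachable by the subset construction."""
    start = frozenset({0})
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for symbol in "ab":
            nxt = nfa_step(current, symbol, k)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)
```

Each reachable DFA state encodes which of the last k symbols were a, so the count doubles with every increment of k, while simulating the NFA directly only needs the current set of at most k+1 active states.
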
Memory Bottleneck II: Buffering of Document Fragments
  • Scenario 1: buffering nodes, which may or may not be part of the output.

/department[manager/name = "John"]/employee[position = "engineer"]/name

[Figure: the query overlaid on the XML document tree. The employee name nodes arrive before the manager node that decides whether they belong to the output.]
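
Scenario 1 can be observed with a small hand-coded evaluator for this one query. The sketch below is not the authors' algorithm and not a general XPath engine: the class `Scenario1`, the helpers `evaluate_scenario1` and `department`, and the `peak` counter are all assumptions made for illustration. It outputs a name as soon as both predicates are known to hold and otherwise buffers it.

```python
import xml.sax


class Scenario1(xml.sax.ContentHandler):
    """Hand-coded for
    /department[manager/name = "John"]/employee[position = "engineer"]/name"""

    def __init__(self):
        super().__init__()
        self.path = []           # names of the currently open elements
        self.chunks = []         # text pieces of the innermost open element
        self.pending = []        # names of the employee being read
        self.emp_ok = False      # current employee has position "engineer"
        self.manager_ok = False  # a manager named "John" has been seen
        self.buffer = []         # names waiting for the manager predicate
        self.output = []
        self.peak = 0            # most names held in memory at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

    def endElement(self, name):
        value = "".join(self.chunks).strip()
        self.chunks = []
        if self.path == ["department", "employee", "name"]:
            self.pending.append(value)
        elif self.path == ["department", "employee", "position"]:
            self.emp_ok = self.emp_ok or value == "engineer"
        elif self.path == ["department", "manager", "name"] and value == "John":
            self.manager_ok = True
            self.output += self.buffer  # buffered names can be released now
            self.buffer = []
        elif self.path == ["department", "employee"]:
            if self.emp_ok:
                target = self.output if self.manager_ok else self.buffer
                target.extend(self.pending)
            self.pending, self.emp_ok = [], False
        self.peak = max(self.peak, len(self.pending) + len(self.buffer))
        self.path.pop()


def evaluate_scenario1(xml_text):
    """Return (output names, peak number of names held)."""
    handler = Scenario1()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.output, handler.peak


def department(manager_first):
    """Build the example document with the manager first or last."""
    staff = [("Alice", "engineer"), ("Bob", "engineer"), ("Carole", "assistant")]
    employees = "".join(
        f"<employee><name>{n}</name><position>{p}</position></employee>"
        for n, p in staff
    )
    manager = "<manager><name>John</name></manager>"
    body = manager + employees if manager_first else employees + manager
    return "<department><name>Software Testing</name>" + body + "</department>"
```

With the manager last, as in the example document, the evaluator holds up to three names at once; with the manager first, it never holds more than the name of the employee currently being read. The output is the same in both cases.
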

Memory Bottleneck II: Buffering of Document Fragments
  • Scenario 2: buffering nodes needed for evaluating pending predicates.

/department[employee/name = manager/name]/name

[Figure: the query overlaid on the XML document tree. The predicate compares employee name values with manager name values.]
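
For Scenario 2, a hand-coded sketch shows why values must be kept while the predicate is pending: every employee name seen so far may still be needed for comparison with a manager name that has not arrived yet. The class `JoinPredicate`, the helpers `evaluate_scenario2` and `department_with_manager`, and the `peak` counter are hypothetical names for this illustration only, not part of the presented work.

```python
import xml.sax


class JoinPredicate(xml.sax.ContentHandler):
    """Hand-coded for /department[employee/name = manager/name]/name"""

    def __init__(self):
        super().__init__()
        self.path = []          # names of the currently open elements
        self.chunks = []        # text pieces of the innermost open element
        self.emp_names = set()  # employee names seen so far
        self.mgr_names = set()  # manager names seen so far
        self.matched = False    # has the predicate been satisfied yet?
        self.dept_names = []    # candidate output nodes
        self.peak = 0           # most predicate values buffered at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

    def endElement(self, name):
        value = "".join(self.chunks).strip()
        self.chunks = []
        if self.path == ["department", "name"]:
            self.dept_names.append(value)
        elif self.path == ["department", "employee", "name"]:
            self._see(value, self.emp_names, self.mgr_names)
        elif self.path == ["department", "manager", "name"]:
            self._see(value, self.mgr_names, self.emp_names)
        self.path.pop()

    def _see(self, value, own, other):
        if self.matched:
            return  # predicate already decided, nothing left to buffer
        if value in other:
            self.matched = True
            self.emp_names.clear()
            self.mgr_names.clear()
        else:
            own.add(value)
            held = len(self.emp_names) + len(self.mgr_names)
            self.peak = max(self.peak, held)


def evaluate_scenario2(xml_text):
    """Return (output names, peak number of buffered predicate values)."""
    handler = JoinPredicate()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    result = handler.dept_names if handler.matched else []
    return result, handler.peak


def department_with_manager(manager_name):
    """Example document with three employees followed by one manager."""
    employees = "".join(
        f"<employee><name>{n}</name></employee>" for n in ("Alice", "Bob", "Carole")
    )
    return (
        "<department><name>Software Testing</name>"
        + employees
        + f"<manager><name>{manager_name}</name></manager></department>"
    )
```

On the example document (manager John) the predicate never holds, so nothing is output, yet all four name values were buffered. If the manager were named Bob, the department name would be output after three values had been buffered.
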

Memory Bottleneck II: Buffering of Document Fragments
  • Scenario 3: buffering multiple candidate matches that are nested within each other.
  • Relevant only when document is “recursive”
  • Space required: (doc-recursion-depth) [Bar-Yossef, Fontoura, Josifovski 04]
Our Results
  • Quantitative space lower bounds for:
    • Full-fledged evaluation of queries with predicates (Scenario 1)
    • Filtering/full-fledged evaluation of queries with “multi-variate” predicates (Scenario 2)
  • Matching upper bound
    • Eager evaluation of predicates
  • In all other scenarios: no buffering required
    • Filtering non-recursive documents using queries with “univariate” predicates is possible without buffering [Bar-Yossef, Fontoura, Josifovski 04]
Related Work
  • Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
  • Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
  • Space complexity of select-project-join queries over relational data streams [Arasu et al 02]
Document Concurrency
  • Q: query
  • D = 1,…,n: document
    • Each i is an SAX event
  • t = (1,…,t)
  • Definition: x  D is alive at step t if x  t and  , s.t.
    • x  Eval(Q,t)
    • x  Eval(Q,t)
  • t-concurrency(D,Q): number of distinct nodes that are alive at step t
  • concurrency(D,Q): maxt t-concurrency(D,Q)
Lower Bound Notions
  • A “normal” lower bound:

For every algorithm A, there exist Q and D s.t. A uses $\Omega$(concurrency(D,Q)) bits of space on Q and D.

    • Q and D may be “pathological”
    • Doesn’t say much about real-world queries/documents
  • An “ideal” lower bound:

For every A, every Q, and every D, A uses $\Omega$(concurrency(D,Q)) bits of space on Q and D.

    • Too good to be true
      • A can have D and Q “hard-coded”, and then know the result a priori
      • Space of A on D and Q = minimum description length of Q and D
Our Lower Bound
  • Theorem: For every A, every Q, and every D, there exists an almost isomorphic document D’ s.t. A uses $\Omega$(concurrency(D,Q)) bits of space on Q and D’.
    • D’ is the same as D, except for a few extra empty nodes with auxiliary names.
    • Theorem holds only if:
      • Q is “star-free”
      • D is non-recursive
Why isn’t this Obvious?
  • Reason 1: we want the theorem to work for every Q and D, not only ones with high MDL.
  • Reason 2:
    • Obvious: If x is alive at step t $\Rightarrow$ A has to buffer x
      • Because: A may or may not need to output x
    • Not obvious: If x and y are alive at step t $\Rightarrow$ A has to buffer both
      • If x and y are not “independent”, maybe it’s enough to buffer just x (or just y)
Proof of Lower Bound
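  • C = t-concurrency(D,Q)
  • $x_1, \ldots, x_C$ = distinct nodes alive at step t
  • Recall: for every $x_i$ there exist $\beta_i$ and $\beta'_i$ s.t.
    • $x_i \in \mathrm{Eval}(Q, \alpha_t \beta_i)$
    • $x_i \notin \mathrm{Eval}(Q, \alpha_t \beta'_i)$
  • Lemma: there exist a single $\beta$ and a single $\beta'$ s.t. for all i,
    • $x_i \in \mathrm{Eval}(Q, \alpha_t \beta)$
    • $x_i \notin \mathrm{Eval}(Q, \alpha_t \beta')$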
Proof of Lower Bound (cont.)
  • For every $S \subseteq \{1, \ldots, C\}$ define document $D_S$:
  • $D_S$ is the same as D, except
    • For every $i \in S$, we “mark” $x_i$
    • Marking: an extra empty child with an auxiliary name
  • Note: $D_S$ is almost-isomorphic to D
  • $\alpha_t^S$ = first t events in $D_S$
Proof of Lower Bound (cont.)
  • A = any algorithm
  • Consider state of A after processing $\alpha_t^S$:
    • If suffix = $\beta'$, none of the $x_i$’s should be output
      • $\Rightarrow$ A could not have output any $x_i$ by step t
    • If suffix = $\beta$, no information in suffix about S but S can be reconstructed from output
      • $\Rightarrow$ state of A at step t must have all information about S
  • Conclusion: space ≥ $\Omega(C)$
    • Actual proof: by one-way communication complexity
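
The intuition behind the conclusion can be written as a simple counting step. This is only a sketch of the informal argument on this slide, assuming that the state of A after the first t events of $D_S$ must differ for every choice of S; the actual proof goes through one-way communication complexity, as noted above.

```latex
\begin{align*}
\#\{\text{states of } A \text{ after the first } t \text{ events of } D_S\}
  &\ge \#\{S : S \subseteq \{1,\ldots,C\}\} = 2^{C} \\
\text{space of } A
  &\ge \log_2 2^{C} = C \text{ bits}
\end{align*}
```

Distinguishing $2^C$ possible sets with a memory of m bits requires $2^m \ge 2^C$, which gives the linear dependence on C.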
Conclusions
  • Our contributions:
    • Quantitative space lower bounds
      • Full-fledged evaluation of queries with predicates
      • Filtering/full-fledged evaluation of queries with “multi-variate” predicates
    • Matching upper bound
  • Open problems:
    • Quantitative lower bounds for XQuery evaluation over streams
    • Address larger fragments of XPath
Memory Bottleneck II: Buffering of Document Fragments
  • Scenario 3: buffering multiple candidate matches that are nested within each other.

//a[b and c]

[Figure: a recursive document tree containing three a elements, two b elements, and two c elements.]

  • Relevant only when document is “recursive”
  • Space required: $\Omega$(doc-recursion-depth) [Bar-Yossef, Fontoura, Josifovski 04]
Concurrency: Example

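/department[manager/name = "John"]/employee[position = "engineer"]/name

 1: <department>
 2:   <name>
 3:     Software Testing
 4:   </name>
 5:   <employee id="1">
 6:     <name>                (alive)
 7:       Alice
 8:     </name>
 9:     <position>
10:       engineer
11:     </position>
12:   </employee>
13:   <employee id="2">
14:     <name>                (alive)
15:       Bob
16:     </name>
17:     <position>
18:       engineer
19:     </position>
20:   </employee>
21:   <employee id="3">
22:     <name>                (dead)
23:       Carole
24:     </name>
25:     <position>
26:       assistant
27:     </position>
28:   </employee>
29:   <manager id="4">
30:     <name>
31:       John
32:     </name>
33:   </manager>
34: </department>

  • After line 28 and before the manager's name (line 31) is read:
    • The name nodes of Alice and Bob are alive: both employees are engineers, so whether the nodes are output depends only on the manager's name, which has not arrived yet.
    • The name node of Carole is dead: her only position is "assistant", so no continuation of the stream can place it in the output.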