Approximate XML Query Answers

Alkis Polyzotis (UC Santa Cruz)

Minos Garofalakis (Bell Labs)

Yannis Ioannidis (U. of Athens, Hellas)

Motivation
  • XML: de facto standard for data exchange
  • Development of the “XML Warehouse”
  • Conflict between “on-line” use and query execution cost
    • Increased query response times
    • Users might wait for uninteresting results

[Figure: query Q is evaluated over the XML warehouse and returns the result R.]

Approximate Query Answers
  • Evaluate query over a concise data synopsis and obtain an approximation R’ of the true result
  • Use approximate result as timely feedback
    • User can assess the “value” of the query
  • Goal: reduce number of evaluated queries

[Figure: query Q evaluated over the synopsis returns an approximate result R’, while the warehouse returns the true result R.]

Contributions
  • TreeSketch Synopses
    • Structural summaries for XML data
    • Approximate answers for complex twig queries
    • Summarization model <--> Structural clustering of elements
    • Efficient processing and construction
  • Element Simulation Distance
    • Novel distance metric for XML data
    • Captures “approximate” similarity between two XML trees
  • Experimental Results
    • Accurate approximate answers for low space budgets
    • Low-error selectivity estimates
    • Efficient construction algorithm
Outline
  • Preliminaries
  • TreeSketches
    • Synopsis model
    • Computing approximate answers
    • Summary construction
  • Element Simulation Distance
  • Experimental Study
  • Conclusions
Data and Query Model

[Figure: an XML document tree, a twig query (nodes q0 to q3; paths //section, ./figure, .//equation), the resulting nesting tree, and the binding tuples.]

Problem Definition
  • Process twig query over a synopsis
  • Compute approximation of nesting tree

[Figure: the twig query is evaluated over the synopsis to produce an approximate nesting tree, shown next to the true nesting tree computed from the XML data.]

Graph Synopsis
  • Synopsis node <--> Set of elements of the same tag
  • Synopsis edge <--> Document edge(s)

[Figure: XML document and its graph synopsis, with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4).]

TreeSketch Synopsis
  • Augment graph-synopsis with edge counts
  • count[u,v]: mean #children in v per element in u
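
The edge-count definition can be sketched in a few lines of Python. The document, the partition into synopsis nodes, and the names `children`, `node_of`, and `edge_counts` are illustrative assumptions modeled on the running example, not taken from the slides.

```python
from collections import defaultdict

# Hypothetical document modeled on the running example: parent -> children.
children = {
    "r": ["p1"],
    "p1": ["s2", "s3"],
    "s2": ["f4", "f5"],
    "s3": ["f6", "f7"],
    "f4": ["e8", "c9"],
    "f5": ["e10", "c11"],
    "f6": ["c12"],
    "f7": ["c13"],
}

# Assumed partition of elements into synopsis nodes (one tag per node).
node_of = {
    "r": "R", "p1": "P", "s2": "S", "s3": "S",
    "f4": "F1", "f5": "F1", "f6": "F2", "f7": "F2",
    "e8": "E", "e10": "E",
    "c9": "C", "c11": "C", "c12": "C", "c13": "C",
}

def edge_counts(children, node_of):
    """Return {(u, v): mean number of children in v per element of u}."""
    extent = defaultdict(int)  # number of elements in each synopsis node
    for node in node_of.values():
        extent[node] += 1
    totals = defaultdict(int)  # document edges between two synopsis nodes
    for parent, kids in children.items():
        for kid in kids:
            totals[(node_of[parent], node_of[kid])] += 1
    return {(u, v): total / extent[u] for (u, v), total in totals.items()}

counts = edge_counts(children, node_of)
print(counts[("P", "S")])   # 2.0
print(counts[("S", "F1")])  # 1.0
```

Note that count[S,F1] is 1.0 even though, in this assumed document, one section has two children in F1 and the other has none: the count is a mean over the elements of S.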

[Figure: XML document and its TreeSketch, with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4) and edge counts on the synopsis edges.]

TreeSketch Synopsis
  • Is there a lossless synopsis?
  • What is the quality of a lossy synopsis?

Count Stability
  • (u,v) count-stable: all elements in u have the same child-count in v
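
A minimal check for count stability, assuming a small hypothetical document and partition (the names `children`, `node_of`, and `is_count_stable` are not from the slides):

```python
# Hypothetical document: parent -> children.
children = {
    "p1": ["s2", "s3"],
    "s2": ["f4", "f5"],
    "s3": ["f6", "f7"],
    "f4": ["c9"],
    "f5": ["c11"],
    "f6": ["c12"],
    "f7": ["c13"],
}

# Assumed assignment of elements to synopsis nodes.
node_of = {
    "p1": "P", "s2": "S", "s3": "S",
    "f4": "F1", "f5": "F1", "f6": "F2", "f7": "F2",
    "c9": "C", "c11": "C", "c12": "C", "c13": "C",
}

def is_count_stable(u, v, children, node_of):
    """True if every element of node u has the same number of children in node v."""
    per_element = [
        sum(1 for kid in children.get(element, []) if node_of[kid] == v)
        for element, node in node_of.items()
        if node == u
    ]
    return len(set(per_element)) <= 1

print(is_count_stable("S", "F1", children, node_of))  # False: s2 has 2, s3 has 0
print(is_count_stable("F1", "C", children, node_of))  # True: f4 and f5 have 1 each
```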

[Figure: XML document and its TreeSketch, annotated with edge counts.]

Count-Stable TreeSketch
  • A count-stable synopsis can recover the input tree
  • Efficient one-pass construction
  • Stable summary can be too large for practical use!
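
One way to obtain a count-stable partition in a single post-order traversal is to group elements by a signature made of their tag and the number of children they have in each already-formed class. The sketch below illustrates that idea under stated assumptions; it is not necessarily the construction used by the authors, and the document and the names `count_stable_partition`, `children`, and `tag` are hypothetical.

```python
def count_stable_partition(children, tag, root):
    """Assign each element a class id so that every synopsis edge is count-stable."""
    class_ids = {}  # signature -> class id
    class_of = {}   # element -> class id

    def visit(element):
        # Post-order: classify the children first, then count them per class.
        child_counts = {}
        for kid in children.get(element, []):
            c = visit(kid)
            child_counts[c] = child_counts.get(c, 0) + 1
        signature = (tag[element], tuple(sorted(child_counts.items())))
        class_of[element] = class_ids.setdefault(signature, len(class_ids))
        return class_of[element]

    visit(root)
    return class_of

# Hypothetical document modeled on the running example.
children = {
    "r": ["p1"], "p1": ["s2", "s3"],
    "s2": ["f4", "f5"], "s3": ["f6", "f7"],
    "f4": ["e8", "c9"], "f5": ["e10", "c11"],
    "f6": ["c12"], "f7": ["c13"],
}
elements = set(children) | {k for kids in children.values() for k in kids}
tag = {e: e[0] for e in elements}  # assumption: the first letter is the tag

class_of = count_stable_partition(children, tag, "r")
print(len(set(class_of.values())))         # 8
print(class_of["s2"] == class_of["s3"])    # False
```

In this example the two sections end up in separate classes because their figures have different structure, which is why a stable summary can grow large on irregular data. The recursive traversal is fine for small trees; a deep document would need an explicit stack.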

[Figure: XML document and its count-stable TreeSketch, with nodes R(1), P(1), S(1), S(1), F(2), F(2), E(2), C(4).]

Lossy TreeSketch

[Figure: XML document and a lossy TreeSketch, with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4).]

TreeSketches and Clustering
  • TreeSketch <--> Element clustering
    • All elements in a node are mapped to a “centroid”
    • Tight clusters => Accurate synopsis
  • Synopsis quality <--> Clustering error
    • Options: Manhattan Distance, Squared Error, …
    • Quality can be measured independent of a workload
    • Key for effective construction
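
As an illustration of the squared-error option, the sketch below treats each element as a vector of child counts (one entry per synopsis node) and measures how far the elements of a cluster lie from their centroid. The vectors and the name `squared_error` are assumptions made for the example.

```python
def squared_error(count_vectors):
    """Sum of squared distances between each child-count vector and the centroid."""
    n = len(count_vectors)
    dims = len(count_vectors[0])
    centroid = [sum(vec[d] for vec in count_vectors) / n for d in range(dims)]
    return sum(
        (vec[d] - centroid[d]) ** 2
        for vec in count_vectors
        for d in range(dims)
    )

# Hypothetical cluster of two elements with opposite child counts: loose cluster.
loose = [[2, 0], [0, 2]]
# Hypothetical cluster of two elements with identical child counts: tight cluster.
tight = [[1, 1], [1, 1]]

print(squared_error(loose))  # 4.0
print(squared_error(tight))  # 0.0
```

Both clusters have the same centroid, [1, 1], so they would carry the same edge counts in the synopsis; only the error separates the accurate one from the inaccurate one, and it does so without reference to any query workload.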

Computing Approximate Answers

[Figure: a query is evaluated over the TreeSketch to produce an approximate nesting tree.]

  • Compute TreeSketch of approximate answer
  • Accuracy depends on quality of clustering
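
A simplified sketch of how edge counts combine along paths: the expected number of descendants with a given tag is obtained by multiplying counts down each synopsis path and summing over paths. This covers a single descendant step only, not full twig evaluation, and the synopsis, its counts, and the function name are assumptions.

```python
# Hypothetical TreeSketch: node -> {child node: edge count}. Assumed acyclic.
sketch = {
    "R": {"P": 1.0},
    "P": {"S": 2.0},
    "S": {"F1": 1.0, "F2": 1.0},
    "F1": {"E": 1.0, "C": 1.0},
    "F2": {"C": 1.0},
    "E": {},
    "C": {},
}
tag = {"R": "r", "P": "p", "S": "s", "F1": "f", "F2": "f", "E": "e", "C": "c"}

def expected_descendants(u, target_tag):
    """Expected number of descendants with target_tag per element of synopsis node u."""
    total = 0.0
    for v, count in sketch[u].items():
        here = 1.0 if tag[v] == target_tag else 0.0
        total += count * (here + expected_descendants(v, target_tag))
    return total

print(expected_descendants("S", "c"))  # 2.0 (one caption via F1 plus one via F2)
print(expected_descendants("S", "e"))  # 1.0
print(expected_descendants("R", "c"))  # 4.0
```

The estimate of 1.0 equations per section is exact only if the edges involved are count-stable; otherwise it is the average over a cluster, which is where the clustering quality enters.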

[Figure: twig query with nodes q0 to q3 and paths //section, .//caption, .//equation.]

TreeSketch Construction
  • Given an XML tree T, build a TreeSketch of size B
  • Difficult clustering problem
    • Space dimensionality depends on the clustering itself
  • Construction based on bottom-up clustering
    • Compress perfect synopsis by merging clusters
    • Best merge determined by marginal gains
    • Heuristic to reduce number of candidate merges
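
The bottom-up strategy can be illustrated with a deliberately simplified greedy loop. It assumes fixed-dimension count vectors (in a real TreeSketch the dimensions change as clusters merge, which is what makes the problem hard), measures error as squared distance to the centroid, and treats every merge as saving the same amount of space, so the best marginal gain is simply the smallest error increase. All names and data are hypothetical.

```python
from itertools import combinations

def sse(vectors):
    """Squared error of a cluster around its centroid."""
    n, dims = len(vectors), len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / n for d in range(dims)]
    return sum((v[d] - centroid[d]) ** 2 for v in vectors for d in range(dims))

def merge_cost(a, b):
    """Increase in total squared error caused by merging clusters a and b."""
    return sse(a + b) - sse(a) - sse(b)

def greedy_merge(clusters, budget):
    """Merge clusters until at most `budget` remain, always taking the cheapest merge."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > budget:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Start from a "perfect" clustering with one element per cluster.
start = [[(2, 0)], [(2, 1)], [(0, 2)], [(0, 3)]]
result = greedy_merge(start, budget=2)
print(sorted(sorted(c) for c in result))
# [[(0, 2), (0, 3)], [(2, 0), (2, 1)]]
```

This loop examines every pair of clusters at each step, which is quadratic in the number of clusters; restricting the set of candidate merges is what keeps construction practical on larger inputs.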

[Figure: the perfect synopsis is compressed by successive merges until it fits the space budget.]

Error of Approximation
  • Error <--> Distance between R’ and R
  • Popular metric: Tree-edit distance
    • Min-cost sequence of operations that transform R’ to R
    • Measures syntactic differences between R and R’
  • Not intuitive for approximate answers!

[Figure: example trees T1, T, and T2, with the annotations “Same counts, opposite trait” and “Different counts, similar trait”.]

Element Simulation Distance
  • Capture approximate similarity between R and R’
  • u simulates v: u and v have identical structure
  • ESD(u,v): “degree” of simulation between u,v
    • How well the structure of u matches the structure of v
  • Modeled as the distance between multi-sets
  • Efficient computation using perfect summaries
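
The slides do not give the ESD formula, so the sketch below is only a generic distance between multisets (the size of their symmetric difference), shown to make the "distance between multi-sets" idea concrete. It is a stand-in under that assumption, not the authors' definition, and the inputs are hypothetical.

```python
from collections import Counter

def multiset_distance(a, b):
    """Size of the symmetric difference of two multisets."""
    ca, cb = Counter(a), Counter(b)
    return sum(((ca - cb) + (cb - ca)).values())

# Hypothetical multisets of child tags for two elements u and v.
u_children = ["f", "f", "e"]
v_children = ["f", "e", "e", "e"]

print(multiset_distance(u_children, v_children))  # 3
print(multiset_distance(u_children, u_children))  # 0
```

A distance of 0 corresponds to the case where the two child multisets are identical; larger values indicate a weaker match.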

[Figure: recursive application of ESD on trees T and T2.]

Methodology
  • Data Sets: XMark, DBLP, IMDB, SwissProt
  • Workload: 1000 random twig queries
  • Evaluation metrics:
    • Average ESD for approximate answers
    • Mean absolute relative error for selectivity estimation
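
For reference, mean absolute relative error can be computed as follows. Skipping queries whose true count is zero is an assumption of this sketch, since the slides do not say how such queries are handled, and the numbers are made up.

```python
def mean_absolute_relative_error(true_counts, estimates):
    """Average of |estimate - true| / true over queries with a non-zero true count."""
    pairs = [(t, e) for t, e in zip(true_counts, estimates) if t != 0]
    return sum(abs(e - t) / t for t, e in pairs) / len(pairs)

# Hypothetical true and estimated selectivities for three queries.
true_counts = [100, 50, 200]
estimates = [110, 45, 200]

error = mean_absolute_relative_error(true_counts, estimates)
print(round(error, 4))  # 0.0667
```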
Approximate Answers - IMDB

IMDB (~102K Elements)

Avg. Result Size: 3,477 tuples

Selectivity Estimation - SwissProt

SwissProt (~182K Elements)

Avg. Result Size: 104,592 tuples

Conclusions
  • Approximate query answering for XML databases
  • TreeSketch Synopses
    • Structural summaries for tree-structured XML
    • Approximate answers for twig-queries
    • Model: Graph Synopsis + Edge-counts
    • Efficient processing and construction
  • Element Simulation Distance
    • Capture approximate similarity between XML trees
  • Experimental Results
    • High accuracy for low space budgets
    • Efficient construction
TreeSketch Model (2/2)
  • Average number of children <--> Edge count

[Figure: XML document and its TreeSketch, with nodes R, P(1), S(2), F(2), F(2), E(2), C(4) and edge counts.]

XML Document

[Figure: example XML document tree. Tag legend: p: paper, s: section, c: caption, t: title, f: figure, e: equation.]

TreeSketch Synopsis
  • Augment graph-synopsis with edge counts
  • count[u,v]: mean #children in v per element in u

[Figure: XML document and a TreeSketch with nodes R(1), P(1), S(2), F(4), E(2), C(4) and mean edge counts, including a fractional count of 0.5.]

Depth-Guided Merging
  • Key observation: Two elements have similar structure if their children have similar structure
  • Bottom-up merging, based on depth
    • Depth: distance from the leaves of the tree
    • Build a pool of candidate merges by increasing depth
    • Replenish the pool when it falls below a given threshold
  • Reduced construction time - Accurate synopses
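
The depth used to order candidate merges can be computed in one traversal. The sketch below assumes "distance from the leaves" means the longest downward path to a leaf (the slides do not say whether the longest or shortest path is meant), and groups elements by that depth so that candidates can be drawn level by level. The document and names are hypothetical.

```python
from collections import defaultdict

def depths_from_leaves(children, root):
    """Map each element to its distance from the leaves (leaves have depth 0)."""
    depth = {}

    def visit(element):
        kids = children.get(element, [])
        depth[element] = 1 + max(map(visit, kids)) if kids else 0
        return depth[element]

    visit(root)
    return depth

# Hypothetical document: parent -> children.
children = {
    "r": ["p1"], "p1": ["s2", "s3"],
    "s2": ["f4", "f5"], "s3": ["f6", "f7"],
    "f4": ["e8", "c9"], "f5": ["e10", "c11"],
    "f6": ["c12"], "f7": ["c13"],
}

depth = depths_from_leaves(children, "r")
levels = defaultdict(list)
for element, d in depth.items():
    levels[d].append(element)

# Candidate merges would be generated from the lowest depth upward.
for d in sorted(levels):
    print(d, sorted(levels[d]))
```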
Depth-Guided Merging
  • Observation: Two elements have similar structure if their children have similar structure
  • Heuristic: If a merge of two clusters is good, then merges of the child clusters are likely to have been good as well
  • Bottom-up merging strategy
  • Savings in construction time - Accurate synopses