Efficient Creation and Incremental Maintenance of the HOPI Index for Complex XML Document Collections

Ralf Schenkel
joint work with Anja Theobald, Gerhard Weikum

Outline
  • The Problem: Connections in XML Collections
  • HOPI Basics [EDBT 2004]
  • Efficiently Building HOPI
  • Why Distances are Difficult
  • Incremental Index Maintenance
XML Basics

XML document:

<article>
  <title>XML</title>
  <sec>…</sec>
  <references>
    <entry>…</entry>
  </references>
</article>

[Figure: the element-level graph of the document: one node per element (article, title, sec, references, entry), with edges for parent-child relationships]

XML Basics

XML collection = docs + links. Link edges (e.g., XLink references) connect elements across documents:

<article>
  <title>XML</title>
  <sec>…</sec>
  <references>
    <entry>…</entry>
  </references>
</article>

<researcher>
  <name>Schenkel</name>
  <topics>…</topics>
  <pubs>
    <book>…</book>
  </pubs>
</researcher>

<book>
  <title>UML</title>
  <author>…</author>
  <content>
    <chap>…</chap>
  </content>
</book>

XML Basics

[Figure: the element-level graph of the collection: the element trees of the article, researcher, and book documents, connected by link edges]

XML Basics

[Figure: the document-level graph of the collection: one node per document, with an edge wherever a link connects two documents]

Connections in XML

[Figure: the element-level graph of the collection, with paths leading from article elements to researcher elements]

Questions:
  • Is there a path from article to researcher?
  • How long is the shortest path from article to researcher?

(Naive) answers:
  • Use the transitive closure!
  • Use any APSP algorithm! (+ store the information)

XPath(++)/NEXI(++) query:

//article[about("XML")]//researcher[about("DBS")]

Why naive is not enough

Small example from the real world: a subset of DBLP
  • 6,210 documents (publications)
  • 168,991 elements
  • 25,368 links (citations)
  • 14 megabytes (uncompressed XML)

The element-level graph has 168,991 nodes and 188,149 edges.
Its transitive closure: 344,992,370 connections (2,632.1 MB).

Complete DBLP has about 600,000 documents. The Web has …?

Goal

Find a compact representation for the transitive closure
  • whose size is comparable to the data's size
  • that supports connection tests (almost) as fast as the transitive closure
  • that can be built efficiently for large data sets
HOPI: Use Two-Hop Cover

[Figure: nodes a, c, b with a path a → c → b]

  • For each node a, maintain two sets of labels (which are nodes): Lin(a) and Lout(a)
  • For each connection (a,b):
    • choose a node c on the path from a to b (the center node)
    • add c to Lout(a) and to Lin(b)
  • Then (a,b) ∈ transitive closure T ⟺ Lout(a) ∩ Lin(b) ≠ ∅

Two-hop cover of T (Edith Cohen et al., SODA 2002):
  • Minimize the sum of the label sizes (NP-complete ⇒ approximation required)
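As a minimal sketch of the label scheme and the connection test (Python; names like Node, lin, lout are ours, not from the paper's implementation):

# Two-hop labels and the intersection-based connection test.
class Node:
    def __init__(self, name):
        self.name = name
        self.lin = set()    # Lin(a): center nodes that reach a
        self.lout = set()   # Lout(a): center nodes reachable from a

def connected(a, b):
    # (a, b) is in the transitive closure iff Lout(a) and Lin(b) intersect.
    return bool(a.lout & b.lin)

# Cover the connection (a, b) over the path a -> c -> b with center node c:
a, b, c = Node("a"), Node("b"), Node("c")
a.lout.add(c)                  # c goes into Lout(a)
b.lin.add(c)                   # c goes into Lin(b)
c.lin.add(c); c.lout.add(c)    # convention: each node is its own center

print(connected(a, b))   # True:  Lout(a) ∩ Lin(b) = {c}
print(connected(b, a))   # False: no shared center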
Approximation Algorithm

What are good center nodes? Nodes that can cover many uncovered connections.

Initial step: all connections are uncovered.

⇒ Consider the center graph of each candidate: its in-side I holds the nodes that reach the candidate, its out-side O holds the nodes the candidate reaches, and every edge is a connection the candidate could cover.

[Figure: an example graph on nodes 1–6 and the center graph of one candidate; here 8 connections can be covered with 6 cover entries, giving the initial density]

The quality measure is the density of the densest subgraph of the center graph (here: the same as the initial density).
Approximation Algorithm

What are good center nodes? Nodes that can cover many uncovered connections.

Initial step: all connections are uncovered.

⇒ Consider the center graph of each candidate.

[Figure: the center graph of candidate node 4]

Here the density of the densest subgraph equals the initial density, because the center graph is complete.

Cover the connections in the subgraph with the greatest density, using the corresponding center node.
Approximation Algorithm

What are good center nodes? Nodes that can cover many uncovered connections.

Next step: some connections are already covered.

⇒ Consider the center graph of each candidate, now restricted to the uncovered connections.

[Figure: the shrunken center graph of candidate node 2 after the first iteration]

Repeat this until all connections are covered.

Theorem: the generated cover is optimal up to a logarithmic factor.
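A deliberately simplified sketch of the greedy loop, under two stated assumptions: it works on a fully materialized transitive closure, and it scores a candidate by covered-connections-per-label-entry instead of running an exact densest-subgraph computation, so the logarithmic guarantee above does not carry over to this toy version:

from collections import defaultdict

def greedy_two_hop_cover(closure, nodes):
    # closure: set of pairs (a, b) such that b is reachable from a,
    # assumed to contain (v, v) for every node v.
    lin, lout = defaultdict(set), defaultdict(set)
    uncovered = set(closure)
    while uncovered:
        best_w, best_cov, best_density = None, set(), -1.0
        for w in nodes:
            # Still-uncovered connections that w could cover as a center:
            cov = {(a, b) for (a, b) in uncovered
                   if (a, w) in closure and (w, b) in closure}
            if not cov:
                continue
            srcs = {a for (a, _) in cov}
            tgts = {b for (_, b) in cov}
            # Connections covered per label entry added:
            density = len(cov) / (len(srcs) + len(tgts))
            if density > best_density:
                best_w, best_cov, best_density = w, cov, density
        for (a, b) in best_cov:        # record best_w as their center
            lout[a].add(best_w)
            lin[b].add(best_w)
        uncovered -= best_cov
    return lin, lout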
Optimizing Performance [EDBT04]
  • The density of the densest subgraph of a node's center graph never increases as connections get covered
  • Precompute density estimates and recompute them only on demand (using a priority queue) ⇒ ~2 computations per node
  • Initial center graphs are always their own densest subgraphs
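The monotonicity in the first bullet is what makes stale estimates safe: they are always upper bounds. A sketch of the resulting lazy evaluation (our naming; nodes are assumed to be heap-comparable values such as ints):

import heapq

def pop_best_center(heap, recompute_density):
    # heap holds pairs (-estimated_density, node).  Pop the top estimate,
    # recompute its exact density, and accept it only if it still beats
    # every remaining (over-)estimate; otherwise refresh and retry.
    while heap:
        neg_est, node = heapq.heappop(heap)
        current = recompute_density(node)            # expensive exact value
        if not heap or -current <= heap[0][0]:
            return node, current
        heapq.heappush(heap, (-current, node))
    return None, 0.0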
Is that enough?

For our example:
  • Transitive closure: 344,992,370 connections
  • Two-hop cover: 1,289,930 entries
  • ⇒ compression factor of ~267
  • ⇒ queries are still fast (~7.6 entries/node)

But: the computation took 45 hours and 80 GB of RAM!
HOPI: Divide and Conquer

Framework of an Algorithm:

  • Partition the graph such that the transitive closures of the partitions fit into memory and the weight of crossing edges is minimized
  • Compute the two-hop cover for each partition
  • Combine the two-hop covers of the partitions into the final cover
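As pseudocode, the framework might look as follows (a sketch: partition_graph, transitive_closure, two_hop_cover, and join_covers stand for the components discussed on the surrounding slides and are passed in as functions):

def build_hopi(graph, memory_budget, partition_graph,
               transitive_closure, two_hop_cover, join_covers):
    # 1. Partition so each partition's closure fits into memory,
    #    minimizing the weight of cross-partition edges.
    partitions, cross_links = partition_graph(graph, memory_budget)
    # 2. Cover each partition independently (parallelizable).
    covers = [two_hop_cover(transitive_closure(p)) for p in partitions]
    # 3. Combine the partition covers along the cross-partition links.
    return join_covers(covers, cross_links)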
Step 3: Cover Joining

Naive algorithm (from EDBT '04), using the current Lin and Lout:

For each cross-partition link s→t:
  • Choose t as the center node for all connections over s→t
  • Add t to Lin(d) of all descendants d of t, and to Lin(t) itself
  • Add t to Lout(a) of all ancestors a of s, and to Lout(s) itself

The join has to be done sequentially for all links.
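A sketch of this step (ancestors and descendants are assumed helpers returning the sets of elements that reach s, resp. are reachable from t, given the labels built so far):

def naive_join(cross_links, lin, lout, ancestors, descendants):
    # Sequential by necessity: a later link may rely on reachability
    # that only exists because an earlier link was already joined in.
    for (s, t) in cross_links:
        for d in descendants(t) | {t}:
            lin[d].add(t)    # t is the center for every connection over s -> t
        for a in ancestors(s) | {s}:
            lout[a].add(t)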
Results with Naive Join

With the best combination of algorithms:
  • Transitive closure: 344,992,370 connections
  • Two-hop cover: 15,976,677 entries
  • ⇒ compression factor of ~21.6
  • ⇒ queries are still ok (~94.5 entries/node)
  • ⇒ build time is feasible (~3 hours with 1 CPU and 1 GB RAM)

Can we do better?
Structurally Recursive Join Algorithm

Basic idea:
  • Compute a (small) graph from the partitioning
  • Compute its two-hop cover Hin, Hout
  • Combine this cover with the partition covers
Example

[Figure: an element-level graph on nodes 1–8, split into partitions with cross-partition links]

Build the partition-level skeleton graph (PSG) over the link endpoints.
Example (ctd.)

[Figure: the PSG on nodes 1, 2, 7, 8 and its two-hop cover tables Hin and Hout]

Join algorithm (see the sketch below):
  • For each link source s, add Hout(s) to Lout(a) for each ancestor a of s in s's partition
  • For each link target t, add Hin(t) to Lin(d) for each descendant d of t in t's partition

The join can be done concurrently for all links.
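A sketch of this join (Hin/Hout map PSG nodes to their label sets; the ancestor/descendant helpers are assumed to be restricted to a single partition, which is exactly what makes the per-link work independent and thus concurrent):

def psg_join(link_sources, link_targets, Hin, Hout, lin, lout,
             ancestors_in_partition, descendants_in_partition):
    # Each iteration reads only the precomputed PSG cover and touches
    # only labels within one partition, so the loops can run in parallel.
    for s in link_sources:
        for a in ancestors_in_partition(s) | {s}:
            lout[a] |= Hout[s]
    for t in link_targets:
        for d in descendants_in_partition(t) | {t}:
            lin[d] |= Hin[t]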
Example (ctd.)

[Figure: the partitioned graph after the join, e.g. with Lout = {…, 2, 7} at a link source and Lin = {…, 2} at a link target]

Lemma: it is enough to cover the connections from link sources to link targets.
Final Results for Index Creation
  • Transitive closure: 344,992,370 connections
  • Two-hop cover: 9,999,052 entries
  • ⇒ compression factor of ~34.5
  • ⇒ queries are still ok (~59.2 entries/node)
  • ⇒ build time is good (~23 minutes with 1 CPU and 1 GB RAM)

The cover is 8 times larger than the best one, but built ~118 times faster with ~1% of the memory.

Outline
  • The Problem: Connections in XML Collections
  • HOPI Basics [EDBT 2004]
  • Efficiently Building HOPI
  • Why Distances are Difficult
  • Incremental Index Maintenance
Why Distances are Difficult

[Figure: a path v → u → w with dist(v,u) = 2 and dist(u,w) = 4]

  • Distances should be simple to add: extend the labels Lout(v) = {u, …} and Lin(w) = {u, …} to Lout(v) = {(u,2), …} and Lin(w) = {(u,4), …}, so that dist(v,w) = dist(v,u) + dist(u,w) = 2 + 4 = 6
  • But the devil is in the details…
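As a sketch of that extension (labels become center-to-distance maps; names are ours):

def distance(lout_v, lin_w):
    # lout_v: {center u: dist(v, u)},  lin_w: {center u: dist(u, w)}.
    # Shortest distance through any shared center; None if not connected.
    common = lout_v.keys() & lin_w.keys()
    if not common:
        return None
    return min(lout_v[u] + lin_w[u] for u in common)

# The example above: center u with dist(v, u) = 2 and dist(u, w) = 4.
print(distance({"u": 2}, {"u": 4}))   # 6

The minimum over shared centers is only the true distance if some shared center lies on a shortest path from v to w; the next slide shows how this fails.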
Why Distances are Difficult

[Figure: the same nodes v, u, w, now with an additional direct edge v → w]

Here dist(v,w) = 1 ⇒ the center node u does not reflect the correct distance of v and w.

Solution: Distance-Aware Center Graph

[Figure: the center graph of a candidate, now containing only edges that correspond to shortest paths]

  • Add an edge to the center graph only if the corresponding connection is a shortest path
  • Correct, but two problems:
    • Expensive to build the center graph (2 additional lookups per connection)
    • Initial center graphs are no longer complete ⇒ the density bound is no longer tight

New Bound for Distance-Aware CGs

Estimating the initial density: assume we know the center graph and its number of edges E. Then a lower bound on the density of the densest subgraph can be derived. [Formula not transcribed from the slide.]

But: this precomputation takes 4 hours, while it reduces the time to build the two-hop cover by only 2 hours.

Solution: random sampling of large center graphs.

Outline
  • The Problem: Connections in XML Collections
  • HOPI Basics [EDBT 2004]
  • Efficiently Building HOPI
  • Why Distances are Difficult
  • Incremental Index Maintenance
Incremental Maintenance

How to update the two-hop cover when documents (nodes, elements) are
  • inserted into the collection (handled by the join algorithm)
  • deleted from the collection
  • updated (handled as delete + insert)

Rebuilding the complete cover should be the last resort!
Deleting "good" documents

[Figure: a document-level graph on documents 1–9]

"Good" documents separate the document-level graph: the ancestors of d and the descendants of d are connected only through d.

Example: delete document 6
  ⇒ deletions in the covers of elements in documents 3, 4, 8, 9 (+ doc 6)
Deleting "bad" documents

[Figure: the same document-level graph on documents 1–9]

"Bad" documents don't separate the doc-level graph: the ancestors of d and the descendants of d are connected through d and also via other documents.

Example: delete document 5
  • Deletions in the covers of elements in documents 1, 2, 3, 7 (+ doc 5)
  • Add a 2-hop cover for the connections starting in docs 1, 2, 3 (but not 4) and ending in 7
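As a sketch, whether a document is "good" can be tested directly on the document-level graph (networkx is used here purely for illustration; it is not what the paper uses):

import networkx as nx

def is_good_document(doc_graph: nx.DiGraph, d) -> bool:
    # d is "good" (a separator) iff removing d disconnects every
    # ancestor of d from every descendant of d.
    anc = nx.ancestors(doc_graph, d)
    desc = nx.descendants(doc_graph, d)
    pruned = doc_graph.copy()
    pruned.remove_node(d)
    return not any(nx.has_path(pruned, a, x) for a in anc for x in desc)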
Future Work
  • Applications with non-XML data
  • Length-bound connections: n-hop cover
  • A distance-aware solution for large graphs with many cycles (partitioning breaks cycles)
  • Large-scale experiments with huge data and many concurrent threads/processes:
    • Complete DBLP (~600,000 docs)
    • IMDB (>1 million docs, cycles)
    • a 64-CPU Sun server
    • 16 or 32 cluster nodes
Conclusion
  • HOPI as connection and distance index for linked XML documents
  • Efficient Divide-and-Conquer Build Algorithm
  • Efficient Insertion and (sometimes) Deletion of Documents, Elements, Edges