Efficient Creation and Incremental Maintenance of the HOPI Index for Complex XML Document Collections

Ralf Schenkel, joint work with Anja Theobald and Gerhard Weikum



Presentation Transcript



Efficient Creation and Incremental Maintenance of the HOPI Index for Complex XML Document Collections

Ralf Schenkel, joint work with Anja Theobald, Gerhard Weikum



Outline

  • The Problem: Connections in XML Collections

  • HOPI Basics [EDBT 2004]

  • Efficiently Building HOPI

  • Why Distances are Difficult

  • Incremental Index Maintenance



XML Basics

[Figure: an XML document (below) and its element-level graph: article → title, sec, references; references → entry]

<article>
  <title>XML</title>
  <sec>…</sec>
  <references>
    <entry>…</entry>
  </references>
</article>



XML Basics

[Figure: three XML documents connected by links, e.g. from the article's reference entry to the researcher document and from the researcher's pubs to the book document]

<article>
  <title>XML</title>
  <sec>…</sec>
  <references>
    <entry>…</entry>
  </references>
</article>

<researcher>
  <name>Schenkel</name>
  <topics>…</topics>
  <pubs>
    <book>…</book>
  </pubs>
</researcher>

<book>
  <title>UML</title>
  <author>…</author>
  <content>
    <chap>…</chap>
  </content>
</book>

XML collection = docs + links



XML Basics

Element-level graph of the collection

[Figure: the element trees of the article, researcher, and book documents, connected by link edges]



XML Basics

Document-level graph of the collection



Connections in XML

[Figure: the element-level graph of the collection, with a connection from an article element to a researcher element]

  • Questions:
    • Is there a path from article to researcher?
    • How long is the shortest path from article to researcher?
  • (Naive) Answers:
    • Use the transitive closure!
    • Use any APSP algorithm! (+ store the information)

XPath(++)/NEXI(++) query:

//article[about("XML")]//researcher[about("DBS")]



Why naive is not enough

Small example from the real world: a subset of DBLP
  • 6,210 documents (publications)
  • 168,991 elements
  • 25,368 links (citations)
  • 14 megabytes (uncompressed XML)

The element-level graph has 168,991 nodes and 188,149 edges.
Its transitive closure: 344,992,370 connections (2,632.1 MB).

Complete DBLP has about 600,000 documents.
The Web has …?



Goal

Find a compact representation of the transitive closure

  • whose size is comparable to the data's size

  • that supports connection tests (almost) as fast as the transitive closure

  • that can be built efficiently for large data sets



HOPI: Use Two-Hop Cover

[Figure: a path a → c → b; c serves as center node for the connection (a, b)]

  • For each node a, maintain two sets of labels (which are nodes): Lin(a) and Lout(a)

  • For each connection (a,b),

    • choose a node c on the path from a to b (center node)

    • add c to Lout(a) and to Lin(b)

  • Then (a,b) ∈ transitive closure T ⟺ Lout(a) ∩ Lin(b) ≠ ∅

Two-hop Cover of T (Edith Cohen et al., SODA 2002)

  • Minimize the sum of the label sizes (NP-complete → approximation required)
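The reachability test then reduces to a set intersection. A minimal sketch in Python, assuming Lin/Lout are in-memory sets per node (illustrative structures, not HOPI's actual storage layout):

from collections import defaultdict

Lin = defaultdict(set)   # Lin[v]: center nodes that reach v
Lout = defaultdict(set)  # Lout[v]: center nodes reachable from v

def add_connection(a, b, c):
    # Cover the connection (a, b) with a center node c lying on a path a -> c -> b.
    Lout[a].add(c)
    Lin[b].add(c)

def connected(a, b):
    # (a, b) is in the transitive closure iff Lout(a) and Lin(b) intersect.
    return bool(Lout[a] & Lin[b])

add_connection("a", "b", "c")   # example: cover (a, b) via center c
assert connected("a", "b")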



Approximation Algorithm

[Figure: example graph with nodes 1–6 and the center graph of candidate node 2, with its input set I and output set O]

What are good center nodes? Nodes that can cover many uncovered connections.

Initial step: all connections are uncovered.
→ Consider the center graph of each candidate.

Initial density = density of the densest subgraph (here they coincide):
we can cover 8 connections with 6 cover entries.



Approximation Algorithm

[Figure: the same example graph and the center graph of candidate node 4; this center graph is complete, so the density of its densest subgraph equals the initial density]

What are good center nodes? Nodes that can cover many uncovered connections.

Initial step: all connections are uncovered.
→ Consider the center graph of each candidate.
Cover the connections in the subgraph with the greatest density, using the corresponding center node.



Approximation Algorithm

[Figure: the example graph in a later iteration; a candidate's center graph now contains only the still-uncovered connections]

What are good center nodes? Nodes that can cover many uncovered connections.

Next step: some connections are already covered.
→ Consider the center graph of each candidate, restricted to uncovered connections.

Repeat this until all connections are covered (a sketch of the greedy loop follows below).

Theorem: The generated cover is optimal up to a logarithmic factor.
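A compact sketch of the greedy loop, assuming connections come with their candidate centers (every node on a path from a to b, including a and b themselves, which guarantees termination). For brevity it scores each candidate by the density of its whole center graph; the algorithm of Cohen et al. instead extracts the densest subgraph, which is what yields the logarithmic guarantee:

from collections import defaultdict

def greedy_two_hop_cover(connections, centers_of):
    # connections: iterable of (a, b) pairs of the transitive closure.
    # centers_of[(a, b)]: nodes on a path from a to b (incl. a and b).
    uncovered = set(connections)
    Lin, Lout = defaultdict(set), defaultdict(set)
    while uncovered:
        candidates = {c for pair in uncovered for c in centers_of[pair]}
        best_c, best_density = None, 0.0
        for c in candidates:
            pairs = [p for p in uncovered if c in centers_of[p]]
            # cover entries needed: one Lout entry per distinct source,
            # one Lin entry per distinct target
            entries = len({a for a, _ in pairs}) + len({b for _, b in pairs})
            if len(pairs) / entries > best_density:
                best_c, best_density = c, len(pairs) / entries
        for a, b in [p for p in uncovered if best_c in centers_of[p]]:
            Lout[a].add(best_c)
            Lin[b].add(best_c)
            uncovered.discard((a, b))
    return Lin, Lout

# Tiny usage example: the path 1 -> 2 -> 3 and its transitive closure;
# node 2 covers all three connections with four cover entries.
conns = [(1, 2), (2, 3), (1, 3)]
centers = {(1, 2): {1, 2}, (2, 3): {2, 3}, (1, 3): {1, 2, 3}}
Lin, Lout = greedy_two_hop_cover(conns, centers)
assert Lout[1] & Lin[3]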



Optimizing Performance [EDBT04]

  • The density of the densest subgraph of a node's center graph never increases as more connections are covered
  • Precompute estimates, recompute on demand (using a priority queue) → ~2 computations per node
  • Initial center graphs are always their own densest subgraphs
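The second point is the classic lazy-greedy trick: because densities only ever decrease, stale priority-queue entries can be recomputed on demand instead of after every change. A minimal sketch with illustrative names:

import heapq

def lazy_best_center(queue, current_density):
    # queue: max-heap of (-estimated_density, center); estimates may be
    # stale, but true densities only decrease as connections get covered,
    # so a recomputed value that still beats the next estimate is safely
    # the current maximum.
    while queue:
        neg_est, c = heapq.heappop(queue)
        d = current_density(c)                # recompute on demand
        if not queue or -queue[0][0] <= d:
            return c, d                       # estimate confirmed
        heapq.heappush(queue, (-d, c))        # stale: reinsert and retry
    return None, 0.0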



Is that enough?

For our example:

Transitive closure: 344,992,370 connections
Two-hop cover: 1,289,930 entries
→ compression factor of ~267
→ queries are still fast (~7.6 entries/node)

But: the computation took 45 hours and 80 GB RAM!



HOPI: Divide and Conquer

Framework of an Algorithm:

  • Partition the graph such that the transitive closures of the partitions fit into memory and the weight of crossing edges is minimized

  • Compute the two-hop cover for each partition

  • Combine the two-hop covers of the partitions into the final cover
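A skeletal sketch of the three steps, assuming the partitioning is given as a node → partition mapping and taking the per-partition cover construction and the joining step (next slide) as parameters; all names here are illustrative, not HOPI's actual interfaces:

from collections import defaultdict

def build_hopi(edges, partition_of, two_hop_cover, join_covers):
    # Step 1 (the partitioning itself is assumed done): split edges
    # into intra-partition edge lists and cross-partition links.
    parts, cross_links = defaultdict(list), []
    for s, t in edges:
        if partition_of(s) == partition_of(t):
            parts[partition_of(s)].append((s, t))
        else:
            cross_links.append((s, t))
    # Step 2: a two-hop cover per partition (each closure fits into
    # memory by choice of the partitioning).
    covers = {p: two_hop_cover(es) for p, es in parts.items()}
    # Step 3: combine the partition covers across the links.
    return join_covers(covers, cross_links)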



Step 3: Cover Joining

Naive algorithm (from EDBT '04), using the current Lin and Lout:

[Figure: a cross-partition link s → t]

For each cross-partition link s → t:
  • Choose t as the center node for all connections over s → t
  • Add t to Lin(d) for all descendants d of t, and to Lin(t) itself
  • Add t to Lout(a) for all ancestors a of s, and to Lout(s) itself

The join has to be done sequentially for all links.
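A minimal sketch of this step, assuming ancestors/descendants are set-valued lookups over the entries added so far (illustrative names):

def naive_join(cross_links, ancestors, descendants, Lin, Lout):
    # ancestors(x)/descendants(x) must reflect entries added by earlier
    # links as well, which is why the loop runs sequentially.
    for s, t in cross_links:
        for d in descendants(t) | {t}:   # t becomes center for all of them
            Lin[d].add(t)
        for a in ancestors(s) | {s}:
            Lout[a].add(t)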



Results with Naive Join

Best combination of algorithms:

Transitive closure: 344,992,370 connections
Two-hop cover: 15,976,677 entries
→ compression factor of ~21.6
→ queries are still ok (~94.5 entries/node)
→ build time is feasible (~3 hours with 1 CPU and 1 GB RAM)

Can we do better?


Structurally Recursive Join Algorithm

Basic idea:
  • Compute a (small) graph from the partitioning
  • Compute its two-hop cover Hin, Hout
  • Combine this cover with the partition covers



Example

Build the partition-level skeleton graph PSG.

[Figure: an example graph with nodes 1–8 split into two partitions; the PSG is built from the partitioning]



Example (ctd.)

[Figure: the PSG over nodes 1, 2, 7, 8 together with its two-hop cover Hin, Hout]

Join Algorithm:
  • For each link source s, add Hout(s) to Lout(a) for each ancestor a of s in s's partition
  • For each link target t, add Hin(t) to Lin(d) for each descendant d of t in t's partition

The join can be done concurrently for all links; see the sketch below.
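A minimal sketch of the distribution step, reusing the illustrative structures from the naive join; unlike there, each iteration touches only one endpoint's own partition, so they are independent:

def recursive_join(link_sources, link_targets, Hin, Hout,
                   ancestors, descendants, Lin, Lout):
    # Distribute the PSG cover into the partition covers. The loop
    # iterations do not depend on each other and could run concurrently.
    for s in link_sources:
        for a in ancestors(s) | {s}:     # within s's partition
            Lout[a] |= Hout[s]
    for t in link_targets:
        for d in descendants(t) | {t}:   # within t's partition
            Lin[d] |= Hin[t]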



Example (ctd.)

[Figure: the example after joining; partition elements now carry PSG cover entries, e.g. Lout = {…, 2, 7} for an ancestor of a link source and Lin = {…, 2} for a descendant of a link target]

Lemma: It is enough to cover connections from link sources to link targets.



Final Results for Index Creation

Transitive closure: 344,992,370 connections
Two-hop cover: 9,999,052 entries
→ compression factor of ~34.5
→ queries are still ok (~59.2 entries/node)
→ build time is good (~23 minutes with 1 CPU and 1 GB RAM)

The cover is 8 times larger than the best one, but is built ~118 times faster with ~1% of the memory.



Outline

  • The Problem: Connections in XML Collections

  • HOPI Basics [EDBT 2004]

  • Efficiently Building HOPI

  • Why Distances are Difficult

  • Incremental Index Maintenance



Why Distances are Difficult

[Figure: a path v → u → w with dist(v, u) = 2 and dist(u, w) = 4; u is the center node for the connection (v, w)]

  • Should be simple to add: instead of Lout(v) = {u, …} and Lin(w) = {u, …}, store the distances with the centers:
    Lout(v) = {(u,2), …}
    Lin(w) = {(u,4), …}
    Then dist(v,w) = dist(v,u) + dist(u,w) = 2 + 4 = 6 (see the sketch below)
  • But the devil is in the details…
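A minimal sketch of the distance lookup over such annotated labels, with labels as dicts mapping center → distance (an illustration of the idea, not HOPI's storage format):

def distance(v, w, Lout, Lin):
    # Minimize over all common centers; None means w is unreachable.
    common = Lout[v].keys() & Lin[w].keys()
    if not common:
        return None
    return min(Lout[v][u] + Lin[w][u] for u in common)

# Example from the slide: v -> u (2) and u -> w (4) give dist 6 --
# which is only correct if u lies on a *shortest* v-w path.
Lout = {"v": {"u": 2}}
Lin = {"w": {"u": 4}}
assert distance("v", "w", Lout, Lin) == 6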



Why Distances are Difficult

[Figure: the path v → u → w (distances 2 and 4) plus a direct edge v → w of length 1]

dist(v,w) = 1 → the center node u does not reflect the correct distance between v and w.


Solution: Distance-Aware Center Graph

[Figure: the example center graph, now containing only edges whose connections run along shortest paths through the candidate center]

  • Add edges to the center graph only if the corresponding connection is a shortest path (a sketch follows below)
  • Correct, but two problems:
    • Expensive to build the center graph (2 additional distance lookups per connection)
    • Initial graphs are no longer complete
      → the bound is no longer tight
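The filter itself is small. A minimal sketch, assuming dist(x, y) lookups are available (e.g. from the labels built so far):

def center_graph_edges(c, pairs, dist):
    # Keep a connection (a, b) in c's center graph only if the route
    # a -> c -> b is a shortest path; the two extra dist lookups per
    # connection are what makes this construction expensive.
    return [(a, b) for (a, b) in pairs
            if dist(a, c) + dist(c, b) == dist(a, b)]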



New Bound for Distance-Aware CGs

Estimation of the initial density: assume the center graph CG is known (E = #edges); then the density of its densest subgraph can be bounded in terms of E.

But: this precomputation takes 4 hours, while it reduces the time to build the two-hop cover by only 2 hours.

Solution: random sampling of large center graphs.



Outline

  • The Problem: Connections in XML Collections

  • HOPI Basics [EDBT 2004]

  • Efficiently Building HOPI

  • Why Distances are Difficult

  • Incremental Index Maintenance



Incremental Maintenance

How to update the two-hop cover when documents (nodes, elements) are
  • inserted into the collection (→ handled via the join)
  • deleted from the collection
  • updated (→ delete + insert)

Rebuilding the complete cover should be the last resort!


Deleting "good" documents

[Figure: a document-level graph with documents 1–9; document 6 separates its ancestors (docs 3, 4) from its descendants (docs 8, 9)]

"Good" documents separate the document-level graph: ancestors of d and descendants of d are connected only through d.

Delete document 6
→ deletions in the covers of elements in documents 3, 4, 8, 9 (+ doc 6)


Deleting "bad" documents

[Figure: the same document-level graph; document 5 connects documents 1, 2, 3 to document 7, but other connecting paths exist as well]

"Bad" documents don't separate the doc-level graph: ancestors of d and descendants of d are connected through d and also via other documents.

  • Delete document 5
  • Deletions in the covers of elements in documents 1, 2, 3, 7 (+ doc 5)
  • Add a 2-hop cover for the connections starting in docs 1, 2, 3 (but not 4) and ending in 7

A sketch of the deletion step shared by both cases follows below.
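A minimal sketch of that shared step, assuming Lin/Lout are dicts of sets as in the earlier sketches (illustrative, not HOPI's maintenance code):

def delete_document(doc_elements, Lin, Lout):
    # Drop every cover entry that mentions an element of the deleted
    # document. For a "good" document this is all that is needed; for a
    # "bad" one, surviving connections that were covered through the
    # deleted document must be re-covered afterwards (e.g. by a
    # join-style step over the affected documents).
    doomed = set(doc_elements)
    for v in list(Lin):
        Lin[v] -= doomed
    for v in list(Lout):
        Lout[v] -= doomed
    for e in doomed:                     # remove the elements themselves
        Lin.pop(e, None)
        Lout.pop(e, None)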



Future Work

  • Applications with non-XML data
  • Length-bound connections: n-hop cover
  • Distance-aware solution for large graphs with many cycles (partitioning breaks cycles)
  • Large-scale experiments with huge data
    • Complete DBLP (~600,000 docs)
    • IMDB (>1 million docs, cycles)
    • with many concurrent threads/processes, on a 64-CPU Sun server or on 16 or 32 cluster nodes



Conclusion

  • HOPI as connection and distance index for linked XML documents

  • Efficient Divide-and-Conquer Build Algorithm

  • Efficient Insertion and (sometimes) Deletion of Documents, Elements, Edges

