Efficient Creation and Incremental Maintenance of the HOPI Index for Complex XML Document Collections

Ralf Schenkel

joint work with Anja Theobald, Gerhard Weikum


Outline

  • The Problem: Connections in XML Collections

  • HOPI Basics [EDBT 2004]

  • Efficiently Building HOPI

  • Why Distances are Difficult

  • Incremental Index Maintenance


XML Basics

[Figure: an XML document and its element-level graph; graph nodes: article, title, sec, references, entry.]

<article>
  <title>XML</title>
  <sec>…</sec>
  <references>
    <entry>…</entry>
  </references>
</article>


XML Basics

Links connect elements across documents:

<article>
  <title>XML</title>
  <sec>…</sec>
  <references>
    <entry>…</entry>
  </references>
</article>

<researcher>
  <name>Schenkel</name>
  <topics>…</topics>
  <pubs>
    <book>…</book>
  </pubs>
</researcher>

<book>
  <title>UML</title>
  <author>…</author>
  <content>
    <chap>…</chap>
  </content>
</book>

XML collection = documents + links


XML Basics

[Figure: element-level graph of the collection. The three document trees (article: title, sec, references, entry; researcher: name, topics, pubs, book; book: title, author, content, chap) are joined by link edges into one graph.]


XML Basics

[Figure: document-level graph of the collection: one node per document, one edge per cross-document link.]


Connections in XML

[Figure: the element-level graph of the collection, highlighting a connection from an article element to a researcher element.]

XPath(++)/NEXI(++) query:

//article[about("XML")]//researcher[about("DBS")]

Questions:

  • Is there a path from article to researcher?

  • How long is the shortest path from article to researcher?

(Naive) answers:

  • Use the transitive closure!

  • Use any APSP algorithm! (+ store the information)


Why naive is not enough

A small example from the real world: a subset of DBLP

  • 6,210 documents (publications)

  • 168,991 elements

  • 25,368 links (citations)

  • 14 MB of uncompressed XML

The element-level graph has 168,991 nodes and 188,149 edges.

Its transitive closure: 344,992,370 connections, or 2,632.1 MB.

Complete DBLP has about 600,000 documents. The Web has …?


Goal

Find a compact representation of the transitive closure

  • whose size is comparable to the data's size

  • that supports connection tests (almost) as fast as the transitive closure

  • that can be built efficiently for large data sets


HOPI: Use Two-Hop Cover

[Figure: a connection from node a to node b over a center node c.]

  • For each node a, maintain two sets of labels (which are nodes): Lin(a) and Lout(a)

  • For each connection (a,b):

    • choose a node c on the path from a to b (the center node)

    • add c to Lout(a) and to Lin(b)

  • Then (a,b) ∈ transitive closure T ⟺ Lout(a) ∩ Lin(b) ≠ ∅

This is a two-hop cover of T (Edith Cohen et al., SODA 2002).

  • Goal: minimize the sum of the label sizes (NP-complete → approximation required)
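
To make the test concrete, here is a minimal Python sketch of the two-hop connection test for the tiny path a → c → b from the figure, with c chosen as center node for all three connections; the dictionaries and their contents are purely illustrative.

# Two-hop connection test (illustrative sketch).
# Labels for the path a -> c -> b, with c the center node
# for the connections (a,c), (c,b) and (a,b).
Lout = {"a": {"c"}, "c": {"c"}}
Lin  = {"b": {"c"}, "c": {"c"}}

def connected(a, b):
    # (a,b) is in the transitive closure T
    # iff Lout(a) and Lin(b) share a center node.
    return bool(Lout.get(a, set()) & Lin.get(b, set()))

print(connected("a", "b"))  # True: center c is in both labels
print(connected("b", "a"))  # False: no common center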


Approximation Algorithm

[Figure: an example graph with nodes 1 to 6 and the center graph of candidate node 2, with its set I of in-nodes and set O of out-nodes.]

What are good center nodes? Nodes that can cover many uncovered connections.

Initial step: all connections are uncovered.

→ Consider the center graph of each candidate.

The density of its densest subgraph (here: the same as the initial density) measures the candidate's quality: we can cover 8 connections with 6 cover entries.


Approximation Algorithm

[Figure: the same example graph and the center graph of candidate node 4, with its sets I and O; this center graph is complete, so the density of its densest subgraph equals the initial density.]

What are good center nodes? Nodes that can cover many uncovered connections.

Initial step: all connections are uncovered.

→ Consider the center graph of each candidate.

→ Cover the connections in the subgraph with the greatest density, with the corresponding center node.


Approximation Algorithm

[Figure: the example graph in a later iteration; the center graph of candidate node 2 has shrunk because some connections are already covered.]

What are good center nodes? Nodes that can cover many uncovered connections.

Next step: some connections are already covered.

→ Consider the center graph of each candidate, restricted to uncovered connections.

Repeat this until all connections are covered.

Theorem: the generated cover is optimal up to a logarithmic factor.
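
The greedy selection can be sketched as follows in Python. This is a deliberately simplified illustration: it rates each candidate by the density of its whole center graph (uncovered connections covered per cover entry) rather than extracting the densest subgraph, and it materializes the full transitive closure, which the real algorithm is designed to avoid.

from itertools import product

def transitive_closure(nodes, edges):
    # All pairs (a, b) with a path from a to b (plain DFS per node).
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
    closure = set()
    for start in nodes:
        stack, seen = [start], set()
        while stack:
            for m in adj[stack.pop()]:
                if m not in seen:
                    seen.add(m)
                    closure.add((start, m))
                    stack.append(m)
    return closure

def greedy_two_hop_cover(nodes, edges):
    T = transitive_closure(nodes, edges)
    uncovered = set(T)
    Lin = {n: set() for n in nodes}
    Lout = {n: set() for n in nodes}
    while uncovered:
        # Rate every candidate center w by uncovered connections
        # covered per cover entry, and pick the best one.
        w, S_in, S_out = max(
            ((w,
              {a for a in nodes if (a, w) in T} | {w},   # reaches w
              {d for d in nodes if (w, d) in T} | {w})   # reached from w
             for w in nodes),
            key=lambda c: sum(p in uncovered for p in product(c[1], c[2]))
                          / (len(c[1]) + len(c[2])))
        for a in S_in:
            Lout[a].add(w)
        for d in S_out:
            Lin[d].add(w)
        uncovered -= set(product(S_in, S_out))
    return Lin, Lout

# Tiny usage example on the path 1 -> 2 -> 3:
Lin, Lout = greedy_two_hop_cover([1, 2, 3], [(1, 2), (2, 3)])
print(bool(Lout[1] & Lin[3]))  # True: 1 reaches 3 (via center 2)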


Optimizing Performance [EDBT04]

  • The density of the densest subgraph of a node's center graph never increases as connections get covered

  • Precompute density estimates and recompute them only on demand, using a priority queue → only ~2 computations per node

  • The initial center graphs are always their own densest subgraphs
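
A sketch of the priority-queue trick, assuming a hypothetical density(w) callback that evaluates a candidate against the current coverage: since densities only decrease, a stale heap entry is an upper bound, so candidates only need to be re-evaluated when they surface at the top.

import heapq

def pop_best_center(heap, density):
    # heap holds (-estimated_density, node); estimates can only
    # age downwards, so we recompute lazily on demand.
    while heap:
        est, w = heapq.heappop(heap)
        fresh = density(w)
        if fresh >= -est:                    # estimate was still exact
            return w
        heapq.heappush(heap, (-fresh, w))    # refresh and retry
    return None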


Is that enough?

For our example:

Transitive closure: 344,992,370 connections

Two-hop cover: 1,289,930 entries

→ compression factor of ~267

→ queries are still fast (~7.6 entries/node)

But: the computation took 45 hours and 80 GB of RAM!


HOPI: Divide and Conquer

Framework of the algorithm:

  • Partition the graph such that the transitive closures of the partitions fit into memory and the weight of crossing edges is minimized

  • Compute the two-hop cover for each partition

  • Combine the two-hop covers of the partitions into the final cover


Step 3: Cover Joining

Naive algorithm (from EDBT '04), using the current Lin and Lout:

[Figure: a cross-partition link from element s to element t.]

For each cross-partition link s → t:

  • Choose t as the center node for all connections over s → t

  • Add t to Lin(d) of all descendants d of t, and of t itself

  • Add t to Lout(a) of all ancestors a of s, and of s itself

The join has to be done sequentially for all links.
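
A self-contained Python sketch of this naive join; ancestors and descendants are found through the cover as it grows (hence the sequential processing), and the data layout is an assumption of this sketch, not the paper's implementation.

def connected(Lin, Lout, a, b):
    return a == b or bool(Lout.get(a, set()) & Lin.get(b, set()))

def naive_join(nodes, cross_links, Lin, Lout):
    # Lin/Lout initially hold the per-partition covers.
    for s, t in cross_links:      # one cross-partition link s -> t at a time
        # t becomes the center node for every connection over s -> t.
        for a in nodes:           # ancestors of s, plus s itself
            if connected(Lin, Lout, a, s):
                Lout.setdefault(a, set()).add(t)
        for d in nodes:           # descendants of t, plus t itself
            if connected(Lin, Lout, t, d):
                Lin.setdefault(d, set()).add(t)
    return Lin, Lout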


Results with Naive Join

Best combination of algorithms:

Transitive closure: 344,992,370 connections

Two-hop cover: 15,976,677 entries

→ compression factor of ~21.6

→ queries are still OK (~94.5 entries/node)

→ build time is feasible (~3 hours with 1 CPU and 1 GB RAM)

Can we do better?


Structurally Recursive Join Algorithm

Basic Idea

  • Compute a (small) graph from the partitioning

  • Compute its two-hop cover Hin, Hout

  • Combine this cover with the partition covers


Example

[Figure: an element graph with nodes 1 to 8, split into partitions; the endpoints of cross-partition links form a smaller graph.]

Build the partition-level skeleton graph (PSG).


Example (ctd.)

[Figure: the PSG over the link endpoints 1, 2, 7, 8, together with its two-hop cover Hin, Hout; one of the labels contains the centers 2 and 7.]

Join algorithm:

  • For each link source s, add Hout(s) to Lout(a) for each ancestor a of s in s's partition

  • For each link target t, add Hin(t) to Lin(d) for each descendant d of t in t's partition

The join can be done concurrently for all links.
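
A sketch of this join in Python: Hin/Hout is the two-hop cover of the PSG, and intra_ancestors/intra_descendants are hypothetical callbacks that stay inside a single partition, which is why the loop iterations are independent and could run concurrently.

def recursive_join(Lin, Lout, Hin, Hout, intra_ancestors, intra_descendants):
    # Distribute the PSG cover into the partition covers.
    for s in Hout:                           # every link source
        for a in intra_ancestors(s) | {s}:
            Lout.setdefault(a, set()).update(Hout[s])
    for t in Hin:                            # every link target
        for d in intra_descendants(t) | {t}:
            Lin.setdefault(d, set()).update(Hin[t])
    return Lin, Lout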


Example (ctd.)

[Figure: the element graph after the join; an ancestor of a link source now has Lout = {…, 2, 7}, a descendant of a link target has Lin = {…, 2}.]

Lemma: It is enough to cover the connections from link sources to link targets.


Final Results for Index Creation

Transitive closure: 344,992,370 connections

Two-hop cover: 9,999,052 entries

→ compression factor of ~34.5

→ queries are still OK (~59.2 entries/node)

→ build time is good (~23 minutes with 1 CPU and 1 GB RAM)

The cover is 8 times larger than the best known cover, but it is built ~118 times faster with ~1% of the memory.


Outline

  • The Problem: Connections in XML Collections

  • HOPI Basics [EDBT 2004]

  • Efficiently Building HOPI

  • Why Distances are Difficult

  • Incremental Index Maintenance


Why Distances are Difficult

[Figure: a path from v over u to w, with dist(v,u) = 2 and dist(u,w) = 4.]

  • Distances should be simple to add: store them with the center nodes,

    Lout(v) = {(u,2), …}

    Lin(w) = {(u,4), …}

    and compute dist(v,w) = dist(v,u) + dist(u,w) = 2 + 4 = 6.

  • But the devil is in the details…
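
A minimal sketch of the intended distance lookup, with label entries of the form (center, distance); the values mirror the figure. As the next slide shows, this is only exact if the labels are distance-aware, i.e. every connected pair keeps a center that lies on a shortest path.

Lout = {"v": {("u", 2)}}   # v reaches center u at distance 2
Lin  = {"w": {("u", 4)}}   # center u reaches w at distance 4

def dist(v, w):
    dout = dict(Lout.get(v, ()))   # center -> distance from v
    din  = dict(Lin.get(w, ()))    # center -> distance to w
    common = dout.keys() & din.keys()
    if not common:
        return float("inf")        # not connected
    return min(dout[u] + din[u] for u in common)

print(dist("v", "w"))  # 6 = 2 + 4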


Why Distances are Difficult

[Figure: the same nodes, but with an additional direct edge from v to w.]

Here dist(v,w) = 1: the center node u does not reflect the correct distance between v and w.


Solution: Distance-Aware Center Graph

[Figure: the example graph and the center graph of node 4, now containing only edges that correspond to shortest paths.]

  • Add edges to the center graph only if the corresponding connection is a shortest path

  • Correct, but two problems:

    • Building the center graph is expensive (2 additional lookups per connection)

    • The initial center graphs are no longer complete

      → the density bound is no longer tight


New Bound for Distance-Aware CGs

Estimation of the initial density: assume we know the center graph (E = number of edges). Then …

But: this precomputation takes 4 hours, while it reduces the time to build the two-hop cover by only 2 hours.

Solution: random sampling of large center graphs.


Outline

  • The Problem: Connections in XML Collections

  • HOPI Basics [EDBT 2004]

  • Efficiently Building HOPI

  • Why Distances are Difficult

  • Incremental Index Maintenance


Incremental Maintenance

How do we update the two-hop cover when documents (nodes, elements) are

  • inserted into the collection (→ handled by the join algorithm)

  • deleted from the collection

  • updated (→ handled as delete + insert)

Rebuilding the complete cover should be the last resort!


Deleting „good“ documents

[Figure: a document-level graph with documents 1 to 9; document 6 separates documents 3 and 4 from documents 8 and 9.]

"Good" documents separate the document-level graph: the ancestors of d and the descendants of d are connected only through d.

Example: delete document 6

→ only deletions in the covers of elements in documents 3, 4, 8, 9 (and in document 6 itself)


Deleting „bad“ documents

[Figure: the same document-level graph; document 5 connects documents 1, 2, 3 to document 7, but alternative paths exist.]

"Bad" documents don't separate the document-level graph: the ancestors of d and the descendants of d are connected through d and also via other documents.

Example: delete document 5

  • deletions in the covers of elements in documents 1, 2, 3, 7 (and in document 5 itself)

  • add a two-hop cover for the connections starting in documents 1, 2, 3 (but not 4) and ending in document 7
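
A sketch of how the two cases can be told apart, assuming the document-level graph is given as forward and backward adjacency dicts: document d is "good" iff no ancestor of d can still reach a descendant of d once d is removed. All names and the tiny example are illustrative.

from collections import deque

def bfs(adj, starts, banned=None):
    # All nodes reachable from `starts` without entering `banned`.
    seen, queue = set(starts), deque(starts)
    while queue:
        for m in adj.get(queue.popleft(), ()):
            if m != banned and m not in seen:
                seen.add(m)
                queue.append(m)
    return seen

def is_good(adj, radj, d):
    ancestors   = bfs(radj, [d]) - {d}    # documents that reach d
    descendants = bfs(adj,  [d]) - {d}    # documents reached from d
    # Does any ancestor-to-descendant connection survive without d?
    still = bfs(adj, list(ancestors), banned=d)
    return not (still & descendants)

# Example: a and b reach x only through d, so d is "good".
adj  = {"a": ["d"], "b": ["d"], "d": ["x"]}
radj = {"d": ["a", "b"], "x": ["d"]}
print(is_good(adj, radj, "d"))  # True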


Future Work

  • Applications with non-XML data

  • Length-bound connections: n-hop cover

  • A distance-aware solution for large graphs with many cycles (partitioning breaks cycles)

  • Large-scale experiments with huge data sets, using many concurrent threads/processes on a 64-CPU Sun server or on 16 or 32 cluster nodes:

    • complete DBLP (~600,000 documents)

    • IMDB (>1 million documents, with cycles)


Conclusion

  • HOPI as connection and distance index for linked XML documents

  • Efficient Divide-and-Conquer Build Algorithm

  • Efficient Insertion and (sometimes) Deletion of Documents, Elements, Edges

