
Ralf Schenkel joint work with Anja Theobald, Gerhard Weikum


Efficient Creation and Incremental Maintenance of the HOPI Index for Complex XML Document Collections


- The Problem: Connections in XML Collections
- HOPI Basics [EDBT 2004]
- Efficiently Building HOPI
- Why Distances are Difficult
- Incremental Index Maintenance

XML document

<article>
  <title>XML</title>
  <sec>…</sec>
  <references>
    <entry>…</entry>
  </references>
</article>

[Figure: element-level graph of this document, with nodes article, title, sec, references, entry]

XML collection = docs + links

<article>
  <title>XML</title>
  <sec>…</sec>
  <references>
    <entry>…</entry>
  </references>
</article>

<researcher>
  <name>Schenkel</name>
  <topics>…</topics>
  <pubs>
    <book>…</book>
  </pubs>
</researcher>

<book>
  <title>UML</title>
  <author>…</author>
  <content>
    <chap>…</chap>
  </content>
</book>

[Figure: the three documents above, connected by links]

[Figure: element-level graph of the collection (nodes article, title, sec, references, entry; researcher, name, topics, pubs, book; book, title, author, content, chap) and document-level graph of the collection]

- Questions:
- Is there a path from article to researcher?
- How long is the shortest path from article to researcher?

- (Naive) Answers:
- Use Transitive Closure!
- Use any APSP algorithm! (+ store information)

XPath(++)/NEXI(++)-Query

//article[about("XML")]//researcher[about("DBS")]

Small example from real world: subset of DBLP

- 6,210 documents (publications)
- 168,991 elements
- 25,368 links (citations)
- 14 megabytes (uncompressed XML)

Element-level graph has 168,991 nodes and 188,149 edges

Its transitive closure: 344,992,370 connections (2,632.1 MB)

Complete DBLP has about 600,000 documents

The Web has …?

Find a compact representation for the transitive closure

- whose size is comparable to the data's size
- that supports connection tests (almost) as fast as the transitive closure
- that can be built efficiently for large data sets

[Figure: path from a to b through the center node c]

- For each node a, maintain two sets of labels (which are nodes): Lin(a) and Lout(a)
- For each connection (a,b),
  - choose a node c on the path from a to b (center node)
  - add c to Lout(a) and to Lin(b)
- Then $(a,b) \in T \iff L_{out}(a) \cap L_{in}(b) \neq \emptyset$, where T is the transitive closure

Two-hop Cover of T (Edith Cohen et al., SODA 2002)

- Minimize the sum of the label sizes (NP-complete, so approximation is required)
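The connection test described above can be sketched in a few lines of Python. This is only an illustration on a hypothetical three-node path a → c → b; the names `cover`, `connected`, `l_in` and `l_out` are made up for the example and are not taken from the HOPI implementation.

```python
from collections import defaultdict

# Label sets Lout(x) and Lin(x), keyed by node (hypothetical toy data).
l_out = defaultdict(set)
l_in = defaultdict(set)

def cover(a, b, center):
    """Record that the connection (a, b) is covered by `center`."""
    l_out[a].add(center)
    l_in[b].add(center)

def connected(a, b):
    """(a, b) is in the transitive closure iff Lout(a) and Lin(b) intersect."""
    return not l_out[a].isdisjoint(l_in[b])

# Path a -> c -> b, with c chosen as center node for all three connections.
cover("a", "b", "c")
cover("a", "c", "c")
cover("c", "b", "c")

print(connected("a", "b"))  # True
print(connected("b", "a"))  # False
```

Three connections are stored with four label entries here; the savings only become visible on larger graphs where one center covers many connections at once.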

What are good center nodes?

Nodes that can cover many uncovered connections.

Initial step: All connections are uncovered

Consider the center graph of candidates (here: node 2)

[Figure: example graph with nodes 1-6 and the center graph of node 2 (sides I and O), annotated with its initial density]

- density of densest subgraph (here: same as initial density)
- (We can cover 8 connections with 6 cover entries)

What are good center nodes?

Nodes that can cover many uncovered connections.

Initial step: All connections are uncovered

Consider the center graph of candidates (here: node 4)

[Figure: example graph with nodes 1-6 and the center graph of node 4 (sides I and O), annotated with its initial density]

- density of densest subgraph = initial density (graph is complete)
- Cover connections in subgraph with greatest density with corresponding center node

What are good center nodes?

Nodes that can cover many uncovered connections.

Next step: Some connections already covered

Consider the center graph of candidates (here: node 2)

[Figure: example graph with nodes 1-6 and the remaining center graph of node 2 (sides I and O)]

Repeat this algorithm until all connections are covered

Theorem: Generated Cover is optimal up to a logarithmic factor
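To make the notion of a center graph concrete, here is a minimal Python sketch. It assumes that connections include the trivial ones (a, a) and that the density is the number of uncovered connections divided by the number of center-graph nodes. The six-node graph is hypothetical (it is not claimed to be the graph on the slides), and all function names are illustrative.

```python
def reflexive_closure(nodes, edges):
    """All connections (a, b) with a path from a to b, including (a, a)."""
    succ = {n: set() for n in nodes}
    for a, b in edges:
        succ[a].add(b)
    closure = set()
    for start in nodes:
        stack, seen = [start], {start}
        while stack:
            n = stack.pop()
            for m in succ[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        closure.update((start, reached) for reached in seen)
    return closure

def center_graph(w, closure, uncovered):
    """Ancestors (I side), descendants (O side) and coverable connections of w."""
    ancestors = {a for (a, b) in closure if b == w}
    descendants = {b for (a, b) in closure if a == w}
    edges = {(a, d) for a in ancestors for d in descendants if (a, d) in uncovered}
    return ancestors, descendants, edges

def density(w, closure, uncovered):
    """Connections covered per cover entry if w is used as center for all of them."""
    ancestors, descendants, edges = center_graph(w, closure, uncovered)
    return len(edges) / (len(ancestors) + len(descendants))

# Hypothetical example graph.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 4), (2, 5), (5, 6), (3, 4)]
closure = reflexive_closure(nodes, edges)

# Initial step: every connection is still uncovered.
print(density(2, closure, set(closure)))  # 8 connections / 6 entries
```

In this toy graph node 2 has two ancestors and four descendants (counting itself on both sides), so choosing it covers 8 connections with 6 cover entries. Once some connections are covered, the densest subgraph can be smaller than the whole center graph; finding it is not shown here.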

- Density of densest subgraph of a node's center graph never increases when connections are covered
- Precompute estimates, recompute on demand (using a priority queue): ~2 computations per node
- Initial center graphs are always their densest subgraphs
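The priority-queue idea can be sketched as a lazy greedy loop: because a candidate's score never increases, a stale score is an upper bound and only needs to be recomputed when the candidate reaches the top of the queue. The sketch below makes a loudly simplifying assumption: instead of the densest subgraph it scores a candidate by its uncovered connections divided by the size of its initial center graph, which is also non-increasing. The function name `greedy_cover` and the toy closure are hypothetical.

```python
import heapq

def greedy_cover(nodes, closure):
    """Greedy two-hop cover with lazily recomputed scores (simplified scoring)."""
    anc = {w: {a for (a, b) in closure if b == w} for w in nodes}
    desc = {w: {b for (a, b) in closure if a == w} for w in nodes}
    uncovered = set(closure)
    l_in = {n: set() for n in nodes}
    l_out = {n: set() for n in nodes}

    def remaining(w):
        """Uncovered connections coverable by w, and the simplified score."""
        edges = {(a, d) for a in anc[w] for d in desc[w] if (a, d) in uncovered}
        return edges, len(edges) / (len(anc[w]) + len(desc[w]))

    # Max-heap via negated scores; stored scores may be stale (too high).
    heap = [(-remaining(w)[1], w) for w in nodes]
    heapq.heapify(heap)
    recomputations = 0
    while uncovered:
        _, w = heapq.heappop(heap)
        edges, score = remaining(w)
        recomputations += 1
        if heap and score < -heap[0][0]:
            # The estimate was stale and another candidate may be better.
            heapq.heappush(heap, (-score, w))
            continue
        for a, d in edges:
            l_out[a].add(w)
            l_in[d].add(w)
        uncovered -= edges
    return l_in, l_out, recomputations

# Reflexive transitive closure of a hypothetical graph with edges
# 1->2, 2->4, 2->5, 5->6, 3->4.
nodes = [1, 2, 3, 4, 5, 6]
closure = {(1, 1), (1, 2), (1, 4), (1, 5), (1, 6), (2, 2), (2, 4), (2, 5),
           (2, 6), (3, 3), (3, 4), (4, 4), (5, 5), (5, 6), (6, 6)}
l_in, l_out, recomputations = greedy_cover(nodes, closure)

# The labels must answer exactly the connections of the closure.
answered = {(a, b) for a in nodes for b in nodes if l_out[a] & l_in[b]}
print(answered == closure)
```

The returned `recomputations` counter shows how often a score had to be refreshed; the "~2 computations per node" figure on the slide is a measurement from the original experiments and is not reproduced by this toy run.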

For our example:

Transitive Closure: 344,992,370 connections

Two-Hop Cover: 1,289,930 entries

compression factor of ~267

queries are still fast (~7.6 entries/node)

But:

Computation took 45 hours and 80 GB RAM!

Framework of an Algorithm:

- Partition the graph such that the transitive closures of the partitions fit into memory and the weight of crossing edges is minimized
- Compute the two-hop cover for each partition
- Combine the two-hop covers of the partitions into the final cover

Naive Algorithm (from EDBT '04)

For each cross-partition link s→t:

- Choose t as center node for all connections over s→t
- Add t to Lin(d) of all descendants d of t and t itself
- Add t to Lout(a) of all ancestors a of s and s itself

(Ancestors and descendants are determined using the current Lin and Lout.)

Join has to be done sequentially for all links
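The naive join can be sketched as follows. The sketch assumes that every node implicitly belongs to its own Lin and Lout, and it finds ancestors and descendants by probing the current labels of all nodes, which is simple but not efficient. The partitions, links and function names are hypothetical.

```python
def connected(a, b, l_in, l_out):
    """Two-hop test, treating each node as an implicit member of its own labels."""
    out_a = l_out.get(a, set()) | {a}
    in_b = l_in.get(b, set()) | {b}
    return not out_a.isdisjoint(in_b)

def naive_join(nodes, links, l_in, l_out):
    """Process cross-partition links one after another, using t as center node."""
    for s, t in links:
        # Look up ancestors of s and descendants of t with the CURRENT labels,
        # so that connections created by earlier links are taken into account.
        ancestors = [a for a in nodes if connected(a, s, l_in, l_out)]
        descendants = [d for d in nodes if connected(t, d, l_in, l_out)]
        for a in ancestors:      # includes s itself
            l_out.setdefault(a, set()).add(t)
        for d in descendants:    # includes t itself
            l_in.setdefault(d, set()).add(t)

# Three hypothetical partitions: {1 -> 2}, {3 -> 4}, {5}, with their covers.
nodes = [1, 2, 3, 4, 5]
l_out = {1: {2}}
l_in = {4: {3}}
naive_join(nodes, [(2, 3), (4, 5)], l_in, l_out)

print(connected(1, 5, l_in, l_out))  # True: 1 -> 2 -> 3 -> 4 -> 5
print(connected(5, 1, l_in, l_out))  # False
```

The second link only finds node 1 as an ancestor of 4 because the first link has already updated the labels, which illustrates why the links have to be processed sequentially.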

Best combination of algorithms:

Transitive Closure: 344,992,370 connections

Two-Hop Cover: 15,976,677 entries

compression factor of ~21.6

queries are still ok (~94.5 entries/node)

build time is feasible (~3 hours with 1 CPU and 1 GB RAM)

Can we do better?

Basic Idea

- Compute (small) graph from partitioning
- Compute its two-hop cover Hin,Hout
- Combine this cover with the partition covers

Build partition-level skeleton graph PSG

[Figure: partitioned example graph with nodes 1-8, its partition-level skeleton graph, and the labels Hin and Hout of the skeleton nodes 1, 2, 7, 8]

Join Algorithm:

- For each link source s, add Hout(s) to Lout(a) for each ancestor a of s in s's partition
- For each link target t, add Hin(t) to Lin(d) for each descendant d of t in t's partition

Join can be done concurrently for all links

Lemma: It is enough to cover connections from link sources to link targets

[Figure: example graph with nodes 1-8 after the join, with labels such as Lout={…,2,7} and Lin={…,2}]

Transitive Closure: 344,992,370 connections

Two-Hop Cover: 9,999,052 entries

compression factor of ~34.5

queries are still ok (~59.2 entries/node)

build time is good (~23 minutes with 1 CPU and 1GB RAM)

Cover size 8 times larger than best, but ~118 times faster with ~1% memory
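The ratios quoted on the result slides follow directly from the reported counts. The short Python check below recomputes them; the dictionary keys are descriptive labels chosen for this example, not names used in the talk.

```python
closure_size = 344_992_370   # connections in the transitive closure
element_count = 168_991      # nodes of the element-level graph

# Reported cover sizes (entries); the labels are illustrative only.
cover_sizes = {
    "unpartitioned cover": 1_289_930,
    "partitioned, naive join": 15_976_677,
    "partitioned, skeleton-graph join": 9_999_052,
}

for label, entries in cover_sizes.items():
    compression = closure_size / entries
    per_node = entries / element_count
    print(f"{label}: compression ~{compression:.1f}, ~{per_node:.1f} entries/node")

# Comparison of the skeleton-graph variant with the unpartitioned build.
size_ratio = cover_sizes["partitioned, skeleton-graph join"] / cover_sizes["unpartitioned cover"]
speedup = (45 * 60) / 23     # 45 hours versus ~23 minutes
memory_share = 1 / 80        # 1 GB versus 80 GB RAM
print(f"size x{size_ratio:.2f}, speedup x{speedup:.1f}, memory {memory_share:.2%}")
```

The rounded build times give a speedup slightly below the quoted ~118, which suggests that the slide's figure was computed from unrounded timings.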

Why Distances are Difficult

[Figure: path v → u → w with dist(v,u)=2 and dist(u,w)=4]

- Should be simple to add: extend the labels Lout(v)={u, …} and Lin(w)={u, …} with distances

Lout(v) = {(u,2), …}

Lin(w) = {(u,4), …}

dist(v,w) = dist(v,u) + dist(u,w) = 2 + 4 = 6

- But the devil is in the details…

[Figure: v → u → w with dist(v,u)=2 and dist(u,w)=4, plus a direct edge from v to w]

dist(v,w)=1: center node u does not reflect the correct distance of v and w
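A distance-aware lookup can be sketched as a minimum over all common center nodes. The example assumes that labels are stored as dictionaries mapping a center node to its distance; the function name `distance` and the label contents are illustrative.

```python
import math

def distance(v, w, l_out, l_in):
    """Smallest dist(v,u) + dist(u,w) over common centers u; inf if there is none."""
    best = math.inf
    for u, d_vu in l_out.get(v, {}).items():
        d_uw = l_in.get(w, {}).get(u)
        if d_uw is not None:
            best = min(best, d_vu + d_uw)
    return best

# Labels from the slide: u is the only center covering (v, w).
l_out = {"v": {"u": 2}}
l_in = {"w": {"u": 4}}
print(distance("v", "w", l_out, l_in))  # 6

# With a direct edge v -> w the true distance is 1, so u alone gives a wrong
# answer. The connection also needs a center on a shortest path, here w itself.
l_out["v"]["w"] = 1
l_in["w"]["w"] = 0
print(distance("v", "w", l_out, l_in))  # 1
```

The lookup is only correct if, for every connection, at least one common center lies on a shortest path, which is exactly the extra requirement that makes building the cover harder.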

[Figure: example graph with nodes 1-6 and the center graph of node 4 (sides I and O)]

- Add edges to the center graph only if the corresponding connection is a shortest path

- Correct, but two problems:
  - Expensive to build the center graph (2 additional lookups per connection)
  - Initial graphs are no longer complete, so the bound is no longer tight

Estimation for Initial Density

Assume we know the CG (E=#edges). Then

But: precomputation takes 4h

Reduces time to build two-hop cover by 2 hours

Solution: random sampling of large center graphs

Incremental Index Maintenance

How to update the two-hop cover when documents (nodes, elements) are

- inserted in the collection (join)
- deleted from the collection
- updated (delete+insert)

Rebuilding the complete cover should be the last resort!

[Figure: document-level graph with documents 1-9]

"Good" documents separate the document-level graph:

Ancestors of d and descendants of d are connected only through d

- Delete document 6
- Deletions in covers of elements in documents 3,4,8,9 (+ doc 6)

[Figure: document-level graph with documents 1-9]

"Bad" documents don't separate the doc-level graph:

Ancestors of d and descendants of d are connected through d and by other docs

- Delete document 5
- Deletions in covers of elements in documents 1,2,3,7 (+ doc 5)
- Add 2-hop cover for connections starting in docs 1,2,3 (but not 4) and ending in 7
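Whether a document is "good" or "bad" in this sense can be checked directly on the document-level graph: d separates the graph if, with d removed, no ancestor of d can still reach a descendant of d. The following Python sketch uses a small hypothetical document graph (not the nine-document example from the slides), and the function names are illustrative.

```python
def reachable(start, succ, removed=None):
    """Nodes reachable from `start` over at least one edge, never entering `removed`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def separates(d, succ):
    """True if ancestors and descendants of d are connected only through d."""
    pred = {}
    for node, targets in succ.items():
        for target in targets:
            pred.setdefault(target, []).append(node)
    ancestors = reachable(d, pred) - {d}
    descendants = reachable(d, succ) - {d}
    return all(
        not ((reachable(a, succ, removed=d) | {a}) & descendants)
        for a in ancestors
    )

# Hypothetical document-level graph: 1 -> 2 -> 3 -> 6 and 4 -> 2, 4 -> 5 -> 3.
succ = {1: [2], 2: [3], 3: [6], 4: [2, 5], 5: [3]}

print(separates(3, succ))  # True: every path into document 6 passes through 3
print(separates(2, succ))  # False: 4 still reaches 3 via 5 when 2 is removed
```

For a separating document, deletion only requires removing cover entries; for a non-separating one, the connections that survive through other documents have to be covered again.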

- Applications with non-XML data
- Length-Bound Connections: n-Hop-Cover
- Distance-Aware Solution for Large Graphs with many cycles (partitioning breaks cycles)
- Large-scale experiments with huge data
  - Complete DBLP (~600,000 docs)
  - IMDB (>1 million docs, cycles)
- Experiments with many concurrent threads/processes
  - 64 CPU Sun server
  - 16 or 32 cluster nodes

- HOPI as connection and distance index for linked XML documents
- Efficient Divide-and-Conquer Build Algorithm
- Efficient Insertion and (sometimes) Deletion of Documents, Elements, Edges