Efficient Creation and Incremental Maintenance of the HOPI Index for Complex XML Document Collections

Ralf Schenkel

joint work with

Anja Theobald, Gerhard Weikum

- The Problem: Connections in XML Collections
- HOPI Basics [EDBT 2004]
- Efficiently Building HOPI
- Why Distances are Difficult
- Incremental Index Maintenance

XML document and its element-level graph

[Figure: element-level graph with nodes article, title, sec, references, entry]

<article>

<title>XML</title>

<sec>…</sec>

<references>

<entry>…</entry>

</references>

</article>

<researcher>

<name>Schenkel</name>

<topics>…</topics>

<pubs>

<book>…</book>

</pubs>

</researcher>

<book>

<title>UML</title>

<author>…</author>

<content>

<chap>…</chap>

</content>

</book>

XML collection = docs + links

[Figure: element-level graph of the collection, with the elements of the article, researcher, and book documents connected by links]

[Figure: document-level graph of the collection, with one node per document]

- Questions:
- Is there a path from article to researcher?
- How long is the shortest path from article to researcher?

- (Naive) Answers:
- Use the transitive closure!
- Use any all-pairs shortest path (APSP) algorithm (+ store the information)!
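The naive answers can be sketched in a few lines of Python. This is only an illustration: the edge list below is an assumption (in particular the link from entry to researcher and the link from pubs to book), not the graph from the slides. One breadth-first search per node yields both the transitive closure (which nodes appear) and the shortest distances.

```python
from collections import deque

def bfs_distances(adj, start):
    """Shortest edge counts from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        n = queue.popleft()
        for m in adj.get(n, ()):
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return dist

def all_pairs(adj):
    """One BFS per node: reachability and distances for all pairs."""
    nodes = set(adj) | {m for ms in adj.values() for m in ms}
    return {n: bfs_distances(adj, n) for n in nodes}

# Hypothetical element-level graph (edges are assumptions for illustration).
graph = {
    "article": ["title", "sec", "references"],
    "references": ["entry"],
    "entry": ["researcher"],          # assumed link between documents
    "researcher": ["name", "topics", "pubs"],
    "pubs": ["book"],                 # assumed link between documents
}

apsp = all_pairs(graph)
print("researcher" in apsp["article"])   # is there a path?
print(apsp["article"]["researcher"])     # how long is the shortest path?
```

Storing `apsp` for every node is exactly what becomes too expensive for large collections.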

XPath(++)/NEXI(++)-Query

//article[about("XML")]//researcher[about("DBS")]

Small example from real world: subset of DBLP

6,210 documents (publications)

168,991 elements

25,368 links (citations)

14 megabytes (uncompressed XML)

Element-level graph has 168,991 nodes and 188,149 edges

Its transitive closure: 344,992,370 connections (2,632.1 MB)

Complete DBLP has about 600,000 documents

The Web has …?

Find a compact representation for the transitive closure

- whose size is comparable to the data's size
- that supports connection tests (almost) as fast as the transitive closure
- that can be built efficiently for large data sets

[Figure: path from a over center node c to b]

- For each node a, maintain two sets of labels (which are nodes): Lin(a) and Lout(a)
- For each connection (a,b),
- choose a node c on the path from a to b (center node)
- add c to Lout(a) and to Lin(b)

- Then (a,b) ∈ T ⟺ Lout(a) ∩ Lin(b) ≠ ∅, where T is the transitive closure

Two-hop Cover of T (Edith Cohen et al., SODA 2002)

- Minimize the sum of the label sizes (NP-complete, so an approximation is required)
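A minimal Python sketch of the labeling and the connection test, assuming labels are stored as plain sets in dictionaries. The helper names `add_connection` and `connected` are illustrative and not part of HOPI.

```python
def add_connection(lout, lin, a, b, center):
    """Cover connection (a, b) with a center node on the path from a to b."""
    lout.setdefault(a, set()).add(center)
    lin.setdefault(b, set()).add(center)

def connected(lout, lin, a, b):
    """(a, b) is in the transitive closure iff Lout(a) and Lin(b) intersect."""
    return bool(lout.get(a, set()) & lin.get(b, set()))

# Tiny path graph 1 -> 2 -> 3; every connection is covered with center node 2.
lout, lin = {}, {}
for a, b in [(1, 2), (2, 3), (1, 3)]:
    add_connection(lout, lin, a, b, center=2)

print(connected(lout, lin, 1, 3))  # True
print(connected(lout, lin, 3, 1))  # False
```

The test costs one set intersection, which is why small labels keep queries fast.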

[Figure: example graph with nodes 1-6 and the center graph of candidate node 2, with node sets I and O]

What are good center nodes?

Nodes that can cover many uncovered connections.

Initial step: all connections are uncovered.

Consider the center graph of each candidate, here node 2. The density of its densest subgraph is the same as its initial density: we can cover 8 connections with 6 cover entries.

[Figure: the same example graph and the center graph of candidate node 4, with node sets I and O]

Initial step, candidate node 4: the density of the densest subgraph equals the initial density, because the center graph is complete.

Cover the connections in the subgraph with the greatest density, using the corresponding center node.

[Figure: the example graph and the center graph of candidate node 2 after the first step, with node sets I and O]

Next step: some connections are already covered.

Consider the center graphs of the candidates again, here node 2.

Repeat this algorithm until all connections are covered.

Theorem: The generated cover is optimal up to a logarithmic factor

- The density of the densest subgraph of a node's center graph never increases when connections are covered
- Precompute estimates and recompute them on demand (using a priority queue): ~2 computations per node
- Initial center graphs are always their own densest subgraphs
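The priority-queue idea can be sketched as follows. This is a simplified stand-in, not the algorithm from the slides: it scores a candidate center by the number of still-uncovered connections it could cover, instead of the density of the densest subgraph of its center graph. It shares the key property that scores never increase, so a stale score in the queue is a safe upper bound and only the top candidate needs to be recomputed.

```python
import heapq
from itertools import product

def transitive_closure(adj):
    """Map each node to the set of nodes reachable via one or more edges."""
    closure = {}
    for start in adj:
        seen, stack = set(), list(adj[start])
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n])
        closure[start] = seen
    return closure

def greedy_cover(adj):
    closure = transitive_closure(adj)
    uncovered = {(a, b) for a in closure for b in closure[a]}
    # A center w can cover pairs (ancestor-or-self, descendant-or-self).
    anc = {w: {a for a in closure if w in closure[a]} | {w} for w in adj}
    desc = {w: closure[w] | {w} for w in adj}

    def gain(w):
        return sum(1 for p in product(anc[w], desc[w]) if p in uncovered)

    lout = {n: set() for n in adj}
    lin = {n: set() for n in adj}
    heap = [(-gain(w), w) for w in adj]      # precomputed estimates
    heapq.heapify(heap)
    while uncovered:
        _, w = heapq.heappop(heap)
        g = gain(w)                          # recompute on demand
        if heap and g < -heap[0][0]:
            heapq.heappush(heap, (-g, w))    # estimate was stale: requeue
            continue
        for a, b in product(anc[w], desc[w]):
            if (a, b) in uncovered:
                uncovered.discard((a, b))
                lout[a].add(w)
                lin[b].add(w)
    return lout, lin

# Hypothetical example graph (adjacency lists; every node must be a key).
adj = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: [], 6: [4]}
lout, lin = greedy_cover(adj)
```

For any two distinct nodes a and b, `lout[a] & lin[b]` is non-empty exactly when b is reachable from a.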

For our example:

Transitive Closure: 344,992,370 connections

Two-Hop Cover: 1,289,930 entries

compression factor of ~267

queries are still fast (~7.6 entries/node)

But:

Computation took 45 hours and 80 GB RAM!

Framework of an Algorithm:

- Partition the graph such that the transitive closures of the partitions fit into memory and the weight of crossing edges is minimized
- Compute the two-hop cover for each partition
- Combine the two-hop covers of the partitions into the final cover

Naive Algorithm (from EDBT ’04)

For each cross-partition link s→t:

- Choose t as center node for all connections over s→t
- Add t to Lin(d) of all descendants d of t and t itself
- Add t to Lout(a) of all ancestors a of s and s itself

Ancestors and descendants are determined using the current Lin and Lout, so the join has to be done sequentially for all links.
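A minimal sketch of this naive join, assuming labels are sets in dictionaries and that ancestors and descendants are found by scanning all nodes with the current cover (a real index would use a lookup structure instead). The two partitions and their covers are hypothetical.

```python
def connected(lout, lin, a, b):
    """Connection test on the current cover; a node counts as connected to itself."""
    return a == b or bool(lout[a] & lin[b])

def naive_join(nodes, lout, lin, cross_links):
    """Process cross-partition links one by one, using the current labels."""
    for s, t in cross_links:
        ancestors = [a for a in nodes if connected(lout, lin, a, s)]
        descendants = [d for d in nodes if connected(lout, lin, t, d)]
        for a in ancestors:      # ancestors of s and s itself
            lout[a].add(t)
        for d in descendants:    # descendants of t and t itself
            lin[d].add(t)

# Hypothetical partitions {1, 2} with edge 1 -> 2 and {3, 4} with edge 3 -> 4.
nodes = [1, 2, 3, 4]
lout = {1: {2}, 2: set(), 3: {3}, 4: set()}   # partition covers
lin = {1: set(), 2: {2}, 3: set(), 4: {3}}
naive_join(nodes, lout, lin, cross_links=[(2, 3)])

print(connected(lout, lin, 1, 4))  # True
print(connected(lout, lin, 4, 1))  # False
```

Because each link reads labels that earlier links may have extended, the loop over links cannot be reordered or parallelized freely.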

Best combination of algorithms:

Transitive Closure: 344,992,370 connections

Two-Hop Cover: 15,976,677 entries

compression factor of ~21.6

queries are still ok (~94.5 entries/node)

build time is feasible (~3 hours with 1 CPU and 1 GB RAM)

Can we do better?

Basic Idea

- Compute (small) graph from partitioning
- Compute its two-hop cover Hin,Hout
- Combine this cover with the partition covers

Build the partition-level skeleton graph (PSG)

[Figure: example graph with nodes 1-8, its PSG, and the label sets Hin and Hout of the PSG's two-hop cover]

Join Algorithm:

- For each link source s, add Hout(s) to Lout(a) for each ancestor a of s in s's partition
- For each link target t, add Hin(t) to Lin(d) for each descendant d of t in t's partition

Join can be done concurrently for all links
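A minimal sketch of this join, assuming the skeleton cover (Hin, Hout) is already given and that ancestors and descendants inside a partition are found by a local traversal. The graph, partitions, and covers below are hypothetical, and the sketch includes s and t themselves among their ancestors and descendants.

```python
def local_reach(adj, start, partition):
    """Nodes of `partition` reachable from start inside the partition, incl. start."""
    seen, stack = {start}, [start]
    while stack:
        n = stack.pop()
        for m in adj.get(n, ()):
            if m in partition and m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

def psg_join(adj, part_of, lout, lin, hout, hin):
    """Push skeleton labels to partition-local ancestors and descendants."""
    rev = {}
    for n, ms in adj.items():
        for m in ms:
            rev.setdefault(m, []).append(n)
    for s, labels in hout.items():            # link sources
        for a in local_reach(rev, s, part_of[s]):
            lout[a] |= labels
    for t, labels in hin.items():             # link targets
        for d in local_reach(adj, t, part_of[t]):
            lin[d] |= labels

# Hypothetical example: partitions {1, 2} and {3, 4}, cross-partition link 2 -> 3.
adj = {1: [2], 2: [3], 3: [4], 4: []}
p1, p2 = {1, 2}, {3, 4}
part_of = {1: p1, 2: p1, 3: p2, 4: p2}
lout = {1: {2}, 2: set(), 3: {3}, 4: set()}   # partition covers
lin = {1: set(), 2: {2}, 3: set(), 4: {3}}
hout, hin = {2: {3}}, {3: {3}}                # assumed skeleton cover

psg_join(adj, part_of, lout, lin, hout, hin)
print(bool(lout[1] & lin[4]))  # True: 1 reaches 4 via the link
```

Each link source or target only reads the fixed skeleton labels and its own partition, so the iterations are independent of each other.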

Lemma: It is enough to cover connections from link sources to link targets

[Figure: example graph with nodes 1-8 after the join, with example labels Lout={…,2,7} and Lin={…,2}]

Transitive Closure: 344,992,370 connections

Two-Hop Cover: 9,999,052 entries

compression factor of ~34.5

queries are still ok (~59.2 entries/node)

build time is good (~23 minutes with 1 CPU and 1GB RAM)

Cover size is ~8 times larger than the best cover, but the build is ~118 times faster and needs ~1% of the memory
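The reported ratios can be re-derived from the figures on the slides. The short check below only uses numbers stated above; the dictionary keys are labels chosen here for the three builds.

```python
CLOSURE = 344_992_370   # connections in the transitive closure
NODES = 168_991         # nodes of the element-level graph

covers = {
    "first build (45 hours, 80 GB)": 1_289_930,
    "second build (~3 hours, 1 GB)": 15_976_677,
    "third build (~23 minutes, 1 GB)": 9_999_052,
}

def summarize(entries):
    """Return (compression factor, average cover entries per node)."""
    return CLOSURE / entries, entries / NODES

for name, entries in covers.items():
    factor, per_node = summarize(entries)
    print(f"{name}: compression ~{factor:.1f}, ~{per_node:.1f} entries/node")

size_ratio = 9_999_052 / 1_289_930   # third cover vs. smallest cover
speedup = 45 * 60 / 23               # 45 hours vs. 23 minutes
memory_share = 1 / 80                # 1 GB vs. 80 GB
print(f"size x{size_ratio:.2f}, speedup x{speedup:.0f}, memory {memory_share:.2%}")
```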

Why Distances are Difficult

[Figure: path from v over center node u to w, with dist(v,u)=2 and dist(u,w)=4]

- Should be simple to add: store the distance with each label entry

Lout(v)={u, …} becomes Lout(v)={(u,2), …}

Lin(w)={u, …} becomes Lin(w)={(u,4), …}

dist(v,w)=dist(v,u)+dist(u,w)=2+4=6

- But the devil is in the details…

[Figure: the same nodes v, u, w with dist(v,u)=2 and dist(u,w)=4, plus a direct edge from v to w]

dist(v,w)=1: center node u does not reflect the correct distance between v and w
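A small sketch of the distance lookup makes the problem concrete. It assumes labels are dictionaries mapping a center node to a distance and that the query takes the minimum over all common centers; the function name is illustrative.

```python
import math

def label_distance(lout, lin, v, w):
    """Smallest dist(v,c) + dist(c,w) over common centers c; inf if there is none."""
    common = lout.get(v, {}).keys() & lin.get(w, {}).keys()
    return min((lout[v][c] + lin[w][c] for c in common), default=math.inf)

# Only center u covers (v, w): the labels yield 2 + 4 = 6.
lout = {"v": {"u": 2}}
lin = {"w": {"u": 4}}
via_u = label_distance(lout, lin, "v", "w")
print(via_u)  # 6

# With a direct edge v -> w the true distance is 1. The labels give the right
# answer only if some center on a shortest path is present, here w itself.
lout["v"]["w"] = 1
lin["w"]["w"] = 0
fixed = label_distance(lout, lin, "v", "w")
print(fixed)  # 1
```

A distance-aware cover therefore has to cover every connection with a center on one of its shortest paths, not just on any path.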

[Figure: example graph with nodes 1-6 and the center graph of candidate node 4, with node sets I and O]

- Add edges to the center graph only if the corresponding connection is a shortest path

- Correct, but two problems:
- Expensive to build the center graph (2 additional lookups per connection)
- Initial graphs are no longer complete, so the bound is no longer tight

Estimation for Initial Density

Assume we know the CG (E=#edges). Then

But: precomputation takes 4h

Reduces time to build two-hop cover by 2 hours

Solution: random sampling of large center graphs

Incremental Index Maintenance

How to update the two-hop cover when documents (nodes, elements) are

- inserted in the collection (join)
- deleted from the collection
- updated (delete+insert)

Rebuilding the complete cover should be the last resort!

[Figure: document-level graph with documents 1-9]

"Good" documents separate the document-level graph:

Ancestors of d and descendants of d are connected only through d

- Delete document 6
- Deletions in covers of elements in documents 3, 4, 8, 9 (+ doc 6)

[Figure: document-level graph with documents 1-9]

"Bad" documents don't separate the document-level graph:

Ancestors of d and descendants of d are connected through d and by other docs

- Delete document 5
- Deletions in covers of elements in documents 1, 2, 3, 7 (+ doc 5)
- Add 2-hop cover for connections starting in docs 1, 2, 3 (but not 4) and ending in 7
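Whether a document is "good" in this sense can be tested directly on the document-level graph. The sketch below is one possible check under the assumption that the graph is given as adjacency lists; the two small graphs are hypothetical and not the ones from the figures.

```python
def reach(adj, start, removed=None):
    """Nodes reachable from start via one or more edges, never entering `removed`."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        for m in adj.get(n, ()):
            if m != removed and m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

def separates(adj, d):
    """True if every path from an ancestor of d to a descendant of d passes through d."""
    rev = {}
    for n, ms in adj.items():
        for m in ms:
            rev.setdefault(m, []).append(n)
    ancestors = reach(rev, d) - {d}
    descendants = reach(adj, d) - {d}
    return all(not (reach(adj, a, removed=d) & descendants) for a in ancestors)

chain = {1: [2], 2: [3]}          # 1 -> 2 -> 3
bypass = {1: [2, 3], 2: [3]}      # additional edge 1 -> 3 bypasses document 2

print(separates(chain, 2))   # True: document 2 is "good"
print(separates(bypass, 2))  # False: document 2 is "bad"
```

For a "good" document, deletion only removes cover entries; for a "bad" one, the connections that survive via other documents have to be covered again.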

- Applications with non-XML data
- Length-Bound Connections: n-Hop-Cover
- Distance-Aware Solution for Large Graphs with many cycles (partitioning breaks cycles)
- Large-scale experiments with huge data
- Complete DBLP (~600,000 docs)
- IMDB (>1 million docs, cycles)
- Experiments with many concurrent threads/processes

- 64 CPU Sun server
- 16 or 32 cluster nodes

- HOPI as connection and distance index for linked XML documents
- Efficient Divide-and-Conquer Build Algorithm
- Efficient Insertion and (sometimes) Deletion of Documents, Elements, Edges