XML Compression and Indexing


The Future of Web Search Barcelona, May 2006. XML Compression and Indexing. Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan]. Under patenting by Pisa-Rutgers Univ. Compressed Permuterm Index.

Presentation Transcript

The Future of Web Search

Barcelona, May 2006

XML Compression and Indexing

Paolo Ferragina

Dipartimento di Informatica, Università di Pisa

[Joint with F. Luccio, G. Manzini, S. Muthukrishnan]

Under patenting by Pisa-Rutgers Univ.

Paolo Ferragina, Università di Pisa


Compressed Permuterm Index

Paolo Ferragina, Rossano Venturini

Dipartimento di Informatica, Università di Pisa

Under Y!-patenting

A basic problem

Given a dictionary D of strings, having variable length, design a compressed data structure that supports

  • string ↔ id
  • Prefix(a): find all strings in D that are prefixed by a
  • Suffix(b): find all strings in D that are suffixed by b
  • Substring(g): find all strings in D that contain g
  • PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)

This is the Tolerant Retrieval Problem (wildcards) in the IR book of Manning-Raghavan-Schütze:

Prefix(a) = a*

Suffix(b) = *b

Substring(g) = *g*

PrefixSuffix(a,b) = a*b
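
As a point of reference, the four queries can be written as brute-force scans over D. This is only a sketch of their semantics (linear time, no compression), not the data structure the talk is about; the function names are chosen here for illustration.

```python
# Brute-force semantics of the four dictionary queries (illustrative names).
# Each scan touches every string in D; the talk is about avoiding exactly that.

def prefix(d, a):
    """Strings of d that start with a, i.e. the wildcard query a*."""
    return [s for s in d if s.startswith(a)]

def suffix(d, b):
    """Strings of d that end with b, i.e. the wildcard query *b."""
    return [s for s in d if s.endswith(b)]

def substring(d, g):
    """Strings of d that contain g, i.e. the wildcard query *g*."""
    return [s for s in d if g in s]

def prefix_suffix(d, a, b):
    """Prefix(a) intersected with Suffix(b), following the slide's definition."""
    ending_with_b = set(suffix(d, b))
    return [s for s in prefix(d, a) if s in ending_with_b]

if __name__ == "__main__":
    D = ["yahoo", "google", "yelp"]
    print(prefix(D, "y"), suffix(D, "oo"), substring(D, "oo"), prefix_suffix(D, "y", "o"))
```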

A basic problem

Given a dictionary D of strings, having variable length, design a compressed data structure that supports

  • string ↔ id
  • Prefix(a): find all s in D that are prefixed by a
  • Suffix(b): find all s in D that are suffixed by b
  • Substring(g): find all s in D that contain g
  • PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
  • Hashing
  • → Supports exact searches only, not the non-exact (wildcard) searches above

A basic problem

Given a dictionary D of strings, having variable length, design a compressed data structure that supports

  • string ↔ id
  • Prefix(a): find all s in D that are prefixed by a
  • Suffix(b): find all s in D that are suffixed by b
  • Substring(g): find all s in D that contain g
  • PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
  • (Compacted) Trie
  • → Two versions: one for D and one for D^R (the reversed strings), then intersect the answers
  • → No substring search (unless using a Suffix Trie)
  • → Need to store D for resolving edge-labels
A basic problem

Given a dictionary D of strings, having variable length, design a compressed data structure that supports

  • string ↔ id
  • Prefix(a): find all s in D that are prefixed by a
  • Suffix(b): find all s in D that are suffixed by b
  • Substring(g): find all s in D that contain g
  • PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
  • Front coding...


Front-coding

Sorted dictionary:

http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...

Front-coded dictionary (length of the prefix shared with the previous string, then the remaining suffix):

0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html...

uk-2002 crawl ≈250Mb: Front-coding 30-35%, bzip ≈ 10%

Be back on this, later on!

  • Two versions: one for D and one for D^R (the reversed strings), then intersect the answers
  • Need some extra data structures for bucket identification
  • No substring search
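
A minimal sketch of front coding may help here: each string of the sorted dictionary is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. The function names are illustrative, and the sketch encodes the whole list as one run; the bucketing hinted at by the "0" entries on the slide (restarting with a full string every so often) is left out.

```python
# Front coding of a lexicographically sorted list of strings (illustrative sketch).

def front_encode(sorted_strings):
    """Return (shared_prefix_length, remaining_suffix) pairs."""
    pairs, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        pairs.append((lcp, s[lcp:]))
        prev = s
    return pairs

def front_decode(pairs):
    """Rebuild the original strings by re-using the previous string's prefix."""
    strings, prev = [], ""
    for lcp, tail in pairs:
        prev = prev[:lcp] + tail
        strings.append(prev)
    return strings

if __name__ == "__main__":
    urls = [
        "http://checkmate.com/All_Natural/",
        "http://checkmate.com/All_Natural/Applied.html",
        "http://checkmate.com/All_Natural/Aroma.html",
        "http://checkmate.com/All_Natural/Aroma1.html",
    ]
    for lcp, tail in front_encode(urls):
        print(lcp, tail)
```

Decoding the i-th string requires scanning from the start of its run, which is why practical variants restart periodically and need the extra structures for bucket identification mentioned above.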

A basic problem

Given a dictionary D of strings, having variable length, compress them in a way that we can efficiently support

  • string ↔ id
  • Prefix(a): find all s in D that are prefixed by a
  • Suffix(b): find all s in D that are suffixed by b
  • Substring(g): find all s in D that contain g
  • PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
  • Permuterm Index (Garfield, 76)
  • Reduce any query to a “prefix query” over a larger dictionary

Permuterm Index [Garfield, 1976]
  • Take a dictionary D={yahoo,google}
  • Append a special char $ to the end of each string
  • Generate all rotations of these strings
    • yahoo$
    • ahoo$y
    • hoo$ya
    • oo$yah
    • o$yaho
    • $yahoo
    • google$
    • oogle$g
    • ogle$go
    • gle$goo
    • le$goog
    • e$googl
    • $google

Prefix(ya) = Prefix($ya)

Suffix(oo) = Prefix(oo$)

Substring(oo) = Prefix(oo)

PrefixSuffix(y,o)= Prefix(o$y)

Permuterm dictionary P[D]: space problems

Any query on D reduces to a prefix-query on P[D]
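
The reduction can be sketched with a plain sorted list of rotations standing in for P[D]. This is an uncompressed illustration of the idea, with function names chosen here; it stores every rotation explicitly, which is exactly the space problem noted above.

```python
# Uncompressed permuterm dictionary: every rotation of s$ points back to s.
from bisect import bisect_left

def build_permuterm(words):
    """Return the sorted list of (rotation, original word) pairs."""
    rotations = []
    for w in words:
        t = w + "$"
        for i in range(len(t)):
            rotations.append((t[i:] + t[:i], w))
    rotations.sort()
    return rotations

def prefix_query(index, p):
    """All words having some rotation that starts with p."""
    keys = [rot for rot, _ in index]  # rebuilt on each call; fine for a sketch
    i = bisect_left(keys, p)
    found = set()
    while i < len(index) and index[i][0].startswith(p):
        found.add(index[i][1])
        i += 1
    return sorted(found)

def prefix(index, a):
    return prefix_query(index, "$" + a)

def suffix(index, b):
    return prefix_query(index, b + "$")

def substring(index, g):
    return prefix_query(index, g)

def prefix_suffix(index, a, b):
    return prefix_query(index, b + "$" + a)

if __name__ == "__main__":
    P = build_permuterm(["yahoo", "google"])
    print(prefix(P, "ya"), suffix(P, "oo"), substring(P, "oo"), prefix_suffix(P, "y", "o"))
```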


Compressed Permuterm Index [SIGIR ‘07]

It deploys two ingredients:

  • Permuterm index
  • Compressed full-text index

Theoretically:

  • Query ops take optimal time: proportional to pattern length
  • Space occupancy is |D| Hk(D) + o(|D| log |Σ|) bits

Technically:

A simple reduction step: Permuterm → Compressed index

  • Re-use known machinery on compressed indexes
  • Achieve bzip-compression at Front-coding speed


The Burrows-Wheeler Transform (1994)

Take the text T = mississippi#

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F             L
# mississipp  i
i #mississip  p
i ppi#missis  s
i ssippi#mis  s
i ssissippi#  m
m ississippi  #
p i#mississi  p
p pi#mississ  i
s ippi#missi  s
s issippi#mi  s
s sippi#miss  i
s sissippi#m  i
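
The construction can be reproduced with a few lines that sort all rotations explicitly. This is a quadratic-space sketch for small inputs only (real implementations go through suffix sorting), and it assumes the end marker # is smaller than every other character, which holds for Python's default string ordering on this example.

```python
# Naive Burrows-Wheeler Transform by sorting all rotations (small inputs only).

def bwt(text):
    """Return the first column F and last column L of the sorted rotation matrix."""
    rows = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rows)
    last = "".join(row[-1] for row in rows)
    return first, last

if __name__ == "__main__":
    F, L = bwt("mississippi#")
    print(F)
    print(L)
```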


Compressing L is effective

L is highly compressible

Key observation:

  • L is locally homogeneous
  • Bzip vs. Gzip: 20% vs. 33%, but Bzip is slower in (de)compression!


The FM-index [Ferragina-Manzini, JACM ‘05]

The main idea is to reduce substring search to some basic operations over arrays of symbols.

The survey of Navarro-Mäkinen contains many other indexes.

The result:

  • Count(P): O(p) time
  • Locate(P): O(occ * polylog(|T|)) time
  • Display( T[i,i+L] ): O( L + polylog(|T|) ) time
  • Space occupancy: |T| Hk(T) + o(|T| log |Σ|) bits

New concept: the FM-index is an opportunistic data structure.

The Compressed Permuterm index builds upon the best two features of the FM-index.


First ingredient: L → F mapping

How do we map L’s chars onto F’s chars? ... Need to distinguish equal chars in F...

F  (unknown)   L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i

Take two equal L’s chars

Rotate rightward their rows

Same relative order !!


First ingredient: L → F mapping

The oracle: Rank( s , 9 ) = 3, the number of occurrences of s in L[1,9]

The FM-index is actually a Rank data structure over the BWT: O(1) time and Hk-space
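
To make the role of the oracle concrete, here is a small sketch of the LF-mapping computed from Rank and from the usual array C, where C[c] counts the characters of L that are smaller than c. Rows are 1-based to match the slide; `rank` is a linear scan standing in for the O(1)-time compressed rank structure, and the function names are illustrative.

```python
# LF-mapping from Rank queries over L (1-based rows, as on the slide).

def rank(L, c, i):
    """Number of occurrences of c in L[1..i] (naive scan, not O(1))."""
    return L[:i].count(c)

def build_c(L):
    """C[c] = number of characters in L that are strictly smaller than c."""
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    return C

def lf(L, C, i):
    """Row of F holding the same text character as L[i]."""
    c = L[i - 1]
    return C[c] + rank(L, c, i)

if __name__ == "__main__":
    L = "ipssm#pissii"  # BWT of mississippi#
    C = build_c(L)
    print(rank(L, "s", 9))  # the oracle query shown on the slide
    print(lf(L, C, 9))      # the third s of L maps to the third s-row of F
```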


Second ingredient: Backward step

Backward step(i):

  • Return LF[i], in O(1) time

T is scanned backward by using the LF-mapping
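
Scanning T backward with repeated backward steps can be sketched as a BWT inversion. The sketch assumes the end marker occurs once and is the smallest character, so that row 0 of the sorted matrix starts with it; the LF array is precomputed naively here rather than answered by a compressed rank structure.

```python
# Recover T from L by repeated backward steps (0-based rows in this sketch).

def inverse_bwt(L):
    # C[c] = number of characters in L smaller than c
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    # lf[i] = row whose first character is the same text character as L[i]
    seen, lf = {}, []
    for c in L:
        lf.append(C[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    # Row 0 starts with the end marker, so L[0] is the last real character of T.
    row, chars = 0, []
    for _ in range(len(L) - 1):
        chars.append(L[row])
        row = lf[row]
    return "".join(reversed(chars)) + min(L)

if __name__ == "__main__":
    print(inverse_bwt("ipssm#pissii"))
```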


Third ingredient: substring search

P = si

Count(P[1,p]):

  • Finds <fr,lr> in O(p) time

(unknown)     L
#mississipp   i
i#mississip   p
ippi#missis   s
issippi#mis   s
ississippi#   m
mississippi   #
pi#mississi   p
ppi#mississ   i
sippi#missi   s   ← fr
sissippi#mi   s   ← lr
ssippi#miss   i
ssissippi#m   i

occ = lr-fr+1 = 2
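
A sketch of Count as a backward search over L follows, using 1-based inclusive row ranges <fr,lr> as on the slide. The `rank` helper is again a linear scan used in place of the constant-time structure, so this illustrates the logic rather than the O(p) bound.

```python
# Backward search: find the rows <fr, lr> prefixed by P, processing P right to left.

def backward_search(L, P):
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)

    def rank(c, i):
        # occurrences of c in L[1..i]
        return L[:i].count(c)

    fr, lr = 1, len(L)
    for c in reversed(P):
        if c not in C:
            return None
        fr = C[c] + rank(c, fr - 1) + 1
        lr = C[c] + rank(c, lr)
        if fr > lr:
            return None
    return fr, lr

if __name__ == "__main__":
    rows = backward_search("ipssm#pissii", "si")
    print(rows, "occ =", rows[1] - rows[0] + 1)
```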


The Compressed Permuterm

Z = $hat$hip$hop$hot$# (the dictionary strings are lexicographically sorted)

Build the FM-index of Z to support substring searches

Some queries are trivial...

  • Prefix(a) = Substring search($a) within Z
  • Suffix(b) = Substring search(b$) within Z
  • Substr(g) = Substring search(g) within Z
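
The three trivial reductions can be tried out on the uncompressed string Z. In this sketch Python's `str.find` stands in for the FM-index substring search, and each occurrence is mapped back to its dictionary string through the positions of the $ separators; the helper names are illustrative.

```python
# Prefix, Suffix and Substr as substring searches within Z (str.find replaces the FM-index).
from bisect import bisect_right

def build_z(words):
    words = sorted(words)
    return words, "$" + "$".join(words) + "$#"

def substring_search(words, z, pattern):
    """Dictionary strings owning at least one occurrence of pattern in z."""
    starts = [i for i, ch in enumerate(z) if ch == "$"]
    hits = set()
    pos = z.find(pattern)
    while pos != -1:
        k = bisect_right(starts, pos) - 1  # index of the string covering pos
        if 0 <= k < len(words):
            hits.add(k)
        pos = z.find(pattern, pos + 1)
    return [words[k] for k in sorted(hits)]

def prefix(words, z, a):
    return substring_search(words, z, "$" + a)

def suffix(words, z, b):
    return substring_search(words, z, b + "$")

def substr(words, z, g):
    return substring_search(words, z, g)

if __name__ == "__main__":
    words, z = build_z(["hot", "hip", "hat", "hop"])
    print(z)
    print(prefix(words, z, "hi"), suffix(words, z, "t"), substr(words, z, "o"))
```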


PrefixSuffix search

Key property: the last char of s_i is at L[i+1]

Cyclic-LF[i]:

If (i > #D) return LF[i]

else return LF[i+1]

[Figure: example with i=3, comparing LF[3] and CLF[3]]


PrefixSuffix(P):

Search the FM-index of Z using Cyclic-LF instead of LF

[Figure: PrefixSuffix(ho,p), showing the rows of $ho and the LF and CLF steps]

No change in time/space bounds of compressed indexes
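
The cyclic search can be sketched on an uncompressed stand-in for the index. Several choices below are assumptions made for this illustration and are not taken from the slides: rows are 0-based with half-open ranges, the symbol order is $ < dictionary characters < #, so that the rows starting with $s_1 ... $s_m come first and the last character of the i-th string sits in the row right after the one starting with $s_i, an explicit suffix array is kept to report the matching strings, and no check is made that a and b do not overlap inside a short string.

```python
# PrefixSuffix(a, b): backward search for b$a over Z with a cyclic jump on the $-rows.
from bisect import bisect_right

def build(words):
    words = sorted(words)
    z = "$" + "$".join(words) + "$#"
    # Assumed order: '$' smallest, '#' largest (dictionary chars in between).
    key = lambda ch: 0 if ch == "$" else 0x110000 if ch == "#" else ord(ch)
    sa = sorted(range(len(z)), key=lambda i: [key(ch) for ch in z[i:] + z[:i]])
    L = "".join(z[i - 1] for i in sa)
    C = {}
    for row, i in enumerate(sa):
        C.setdefault(z[i], row)  # first row starting with each character
    return words, z, sa, L, C

def prefix_suffix(index, a, b):
    words, z, sa, L, C = index
    m = len(words)
    pattern = b + "$" + a
    if "#" in pattern or any(ch not in C for ch in pattern):
        return []
    last = pattern[-1]
    sp, ep = C[last], C[last] + L.count(last)
    for ch in reversed(pattern[:-1]):
        # Cyclic step: a row starting with $s_k moves to the next row,
        # whose L entry is the last character of s_k itself.
        if sp < m:
            sp += 1
        if ep <= m:
            ep += 1
        sp = C[ch] + L.count(ch, 0, sp)
        ep = C[ch] + L.count(ch, 0, ep)
        if sp >= ep:
            return []
    starts = [i for i, ch in enumerate(z) if ch == "$"]
    hits = {bisect_right(starts, sa[r]) - 1 for r in range(sp, ep)}
    return [words[k] for k in sorted(hits) if k < m]

if __name__ == "__main__":
    idx = build(["hat", "hip", "hop", "hot"])
    print(prefix_suffix(idx, "ho", "p"))
```

With an empty b the pattern is $a, so the same routine also answers Prefix(a); with an empty a it answers Suffix(b).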

Rank and Select of strings

Z = $hat$hip$hop$hot$#

Other queries...

  • Rank(s) = row of $s$
  • Select(i) = scan backward from L[i+1]
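
Both operations can be sketched on the BWT of Z. The sketch assumes the symbol order $ < dictionary characters < # (so that the first rows start with $s_1 ... $s_m), uses 0-based rows internally while returning 1-based string ids, and replaces the compressed rank structure with linear scans; these choices and the function names are made here for illustration.

```python
# Rank(s): id of string s.  Select(i): the i-th string.  Both work on the BWT of Z.

def build(words):
    words = sorted(words)
    z = "$" + "$".join(words) + "$#"
    key = lambda ch: 0 if ch == "$" else 0x110000 if ch == "#" else ord(ch)
    sa = sorted(range(len(z)), key=lambda i: [key(ch) for ch in z[i:] + z[:i]])
    L = "".join(z[i - 1] for i in sa)
    C = {}
    for row, i in enumerate(sa):
        C.setdefault(z[i], row)
    return L, C, len(words)

def rank_string(L, C, s):
    """1-based id of s, found as the single row prefixed by $s$ (None if absent)."""
    sp, ep = 0, len(L)
    for ch in reversed("$" + s + "$"):
        if ch not in C:
            return None
        sp = C[ch] + L.count(ch, 0, sp)
        ep = C[ch] + L.count(ch, 0, ep)
        if sp >= ep:
            return None
    return sp + 1

def select_string(L, C, m, i):
    """The i-th string (1-based), read backward starting from the slide's L[i+1]."""
    if not 1 <= i <= m:
        raise IndexError("string id out of range")
    row, chars = i, []  # 0-based row i is the slide's 1-based row i+1
    while L[row] != "$":
        chars.append(L[row])
        row = C[L[row]] + L.count(L[row], 0, row)  # one LF step
    return "".join(reversed(chars))

if __name__ == "__main__":
    L, C, m = build(["hat", "hip", "hop", "hot"])
    print(rank_string(L, C, "hop"), select_string(L, C, m, 3))
```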

Experiments

Three dictionaries:

  • Term dictionary: Trec WT10G
  • Host dictionary (reversed): UK-2005
  • Url dictionary (host reversed): first 190Mb of UK-2005

PrefixSuffix search needs *2


A test on URLs

MRS book says: “one disadvantage of the PI is that its dictionary becomes quite large, including as it does all rotations of each term”.

Now, they mention CPI

[Figure: plot with axis "% dict-size". Choose your trade-off]

Trade-off

  • Time of 20-60 msec/char, and space close to bzip
  • Time close to Front-Coding (4 msec/char), but <50% of its space


We proposed an approach for dictionary storage:

  + Theory: optimal time and entropy-bounds for space

  + Practice: trades time vs space, thus fitting user needs
