
XML Compression and Indexing

The Future of Web Search Barcelona, May 2006. XML Compression and Indexing. Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan]. Under patenting by Pisa-Rutgers Univ. Compressed Permuterm Index.


Presentation Transcript


  1. The Future of Web Search, Barcelona, May 2006. XML Compression and Indexing. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan]. Under patenting by Pisa-Rutgers Univ. Paolo Ferragina, Università di Pisa

  2. Compressed Permuterm Index. Paolo Ferragina, Rossano Venturini, Dipartimento di Informatica, Università di Pisa. Under Y!-patenting.

  3. A basic problem
     Given a dictionary D of strings, having variable length, design a compressed data structure that supports:
     • string ↔ id
     • Prefix(a): find all strings in D that are prefixed by a
     • Suffix(b): find all strings in D that are suffixed by b
     • Substring(g): find all strings in D that contain g
     • PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
     IR book of Manning-Raghavan-Schütze: this is the Tolerant Retrieval Problem (wildcards):
     Prefix(a) = a*, Suffix(b) = *b, Substring(g) = *g*, PrefixSuffix(a,b) = a*b

  4. A basic problem
     Given a dictionary D of strings, having variable length, design a compressed data structure that supports:
     • string ↔ id
     • Prefix(a): find all s in D that are prefixed by a
     • Suffix(b): find all s in D that are suffixed by b
     • Substring(g): find all s in D that contain g
     • PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
     Hashing:
     • No support for non-exact searches

  5. A basic problem
     Given a dictionary D of strings, having variable length, design a compressed data structure that supports:
     • string ↔ id
     • Prefix(a): find all s in D that are prefixed by a
     • Suffix(b): find all s in D that are suffixed by b
     • Substring(g): find all s in D that contain g
     • PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
     (Compacted) Trie:
     • Two versions: for D and for D^R (the reversed strings) + Intersect answers
     • No substring search (unless using Suffix Trie)
     • Need to store D for resolving edge-labels

  6. A basic problem
     Given a dictionary D of strings, having variable length, design a compressed data structure that supports:
     • string ↔ id
     • Prefix(a): find all s in D that are prefixed by a
     • Suffix(b): find all s in D that are suffixed by b
     • Substring(g): find all s in D that contain g
     • PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
     Front coding...

  7. Front-coding
     Encoded form (length of the prefix shared with the previous string, then the remaining suffix):
     0 http://checkmate.com/All_Natural/
     33 Applied.html
     34 roma.html
     38 1.html
     38 tic_Art.html
     34 yate.html
     35 er_Soap.html
     35 urvedic_Soap.html
     33 Bath_Salt_Bulk.html
     42 s.html
     25 Essence_Oils.html
     25 Mineral_Bath_Crystals.html
     38 Salt.html
     33 Cream.html
     0 http://checkmate.com/All/Natural/Washcloth.html
     ...
     Original strings:
     http://checkmate.com/All_Natural/
     http://checkmate.com/All_Natural/Applied.html
     http://checkmate.com/All_Natural/Aroma.html
     http://checkmate.com/All_Natural/Aroma1.html
     http://checkmate.com/All_Natural/Aromatic_Art.html
     http://checkmate.com/All_Natural/Ayate.html
     http://checkmate.com/All_Natural/Ayer_Soap.html
     http://checkmate.com/All_Natural/Ayurvedic_Soap.html
     http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
     http://checkmate.com/All_Natural/Bath_Salts.html
     http://checkmate.com/All/Essence_Oils.html
     http://checkmate.com/All/Mineral_Bath_Crystals.html
     http://checkmate.com/All/Mineral_Bath_Salt.html
     http://checkmate.com/All/Mineral_Cream.html
     http://checkmate.com/All/Natural/Washcloth.html
     ...
     Front-coding: 30-35%. uk-2002 crawl ≈ 250Mb. bzip ≈ 10%. Be back on this, later on!
     • Two versions: for D and for D^R (the reversed strings) + Intersect answers
     • Need some extra data structures for bucket identification
     • No substring search
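The encoding on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the speaker's implementation: the function names `front_encode` and `front_decode` are illustrative, and the sketch ignores the bucketing that the slide mentions (restarting with a 0-length prefix every so often).

```python
def front_encode(sorted_strings):
    """Encode each string as (length of prefix shared with the previous string, rest)."""
    encoded, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(s), len(prev)) and s[lcp] == prev[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded


def front_decode(encoded):
    """Rebuild the strings by copying the shared prefix from the previous one."""
    decoded, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        decoded.append(prev)
    return decoded


urls = [
    "http://checkmate.com/All_Natural/",
    "http://checkmate.com/All_Natural/Applied.html",
    "http://checkmate.com/All_Natural/Aroma.html",
    "http://checkmate.com/All_Natural/Aroma1.html",
]
encoded = front_encode(urls)
for lcp, suffix in encoded:
    print(lcp, suffix)
```

Decoding a string requires scanning from the start of its bucket, which is why the slide lists extra data structures for bucket identification as a cost.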

  8. A basic problem
     Given a dictionary D of strings, having variable length, compress them in a way that we can efficiently support:
     • string ↔ id
     • Prefix(a): find all s in D that are prefixed by a
     • Suffix(b): find all s in D that are suffixed by b
     • Substring(g): find all s in D that contain g
     • PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
     Permuterm Index (Garfield, 76):
     • Reduce any query to a "prefix query" over a larger dictionary

  9. Permuterm Index [Garfield, 1976]
     • Take a dictionary D = {yahoo, google}
     • Append a special char $ to the end of each string
     • Generate all rotations of these strings
     Permuterm Dictionary P[D]:
     yahoo$, ahoo$y, hoo$ya, oo$yah, o$yaho, $yahoo
     google$, oogle$g, ogle$go, gle$goo, le$goog, e$googl, $google
     Any query on D reduces to a prefix-query on P[D]:
     Prefix(ya) = Prefix($ya)
     Suffix(oo) = Prefix(oo$)
     Substring(oo) = Prefix(oo)
     PrefixSuffix(y,o) = Prefix(o$y)
     Space problems
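The reduction on this slide can be tried out directly. The following is a minimal, uncompressed sketch: it stores every rotation explicitly (exactly the space problem the slide points out) and answers each query with a binary search for a prefix. The function names are illustrative, not taken from the talk.

```python
from bisect import bisect_left


def build_permuterm(dictionary, end="$"):
    """Return the sorted list of (rotation, original word) pairs."""
    rotations = []
    for word in dictionary:
        w = word + end
        for i in range(len(w)):
            rotations.append((w[i:] + w[:i], word))
    rotations.sort()
    return rotations


def prefix_query(permuterm, pattern):
    """All words owning a rotation that starts with `pattern`."""
    i = bisect_left(permuterm, (pattern,))
    found = set()
    while i < len(permuterm) and permuterm[i][0].startswith(pattern):
        found.add(permuterm[i][1])
        i += 1
    return sorted(found)


def prefix(pt, a):
    return prefix_query(pt, "$" + a)


def suffix(pt, b):
    return prefix_query(pt, b + "$")


def substring(pt, g):
    return prefix_query(pt, g)


def prefix_suffix(pt, a, b):
    return prefix_query(pt, b + "$" + a)


pt = build_permuterm(["yahoo", "google"])
print(prefix(pt, "ya"))              # ['yahoo']
print(suffix(pt, "oo"))              # ['yahoo']
print(substring(pt, "oo"))           # ['google', 'yahoo']
print(prefix_suffix(pt, "y", "o"))   # ['yahoo']
```

The sketch assumes that `$` does not occur inside any dictionary string.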

  10. Compressed Permuterm Index [SIGIR '07]
      It deploys two ingredients:
      • Permuterm index
      • Compressed full-text index
      Theoretically:
      • Query ops take optimal time: proportional to pattern length
      • Space occupancy is |D| H_k(D) + o(|D| log |Σ|) bits, where Σ is the alphabet
      Technically:
      • A simple reduction step: Permuterm → Compressed index
      • Re-use known machinery on compressed indexes
      • Achieve bzip-compression at Front-coding speed

  11. The Burrows-Wheeler Transform (1994)
      Take the text T = mississippi#
      All rotations of T:
      mississippi#
      ississippi#m
      ssissippi#mi
      sissippi#mis
      issippi#miss
      ssippi#missi
      sippi#missis
      ippi#mississ
      ppi#mississi
      pi#mississip
      i#mississipp
      #mississippi
      Sort the rows (F = first column, L = last column):
      F            L
      # mississipp i
      i #mississip p
      i ppi#missis s
      i ssippi#mis s
      i ssissippi# m
      m ississippi #
      p i#mississi p
      p pi#mississ i
      s ippi#missi s
      s issippi#mi s
      s sippi#miss i
      s sissippi#m i
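The construction on this slide (list all rotations, sort them, read the first and last columns) is short enough to write down literally. This is a naive sketch for illustration only: it materializes every rotation, which real implementations avoid by using suffix sorting. It relies on `#` sorting before the lowercase letters, which holds for Python's default string order.

```python
def bwt(text):
    """Return the first (F) and last (L) columns of the sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


first_column, last_column = bwt("mississippi#")
print(first_column)  # #iiiimppssss
print(last_column)   # ipssm#pissii
```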

  12. L is highly compressible
      Key observation:
      • L is locally homogeneous
      Compressing L is effective:
      • Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression!

  13. The FM-index [Ferragina-Manzini, JACM '05]
      The main idea is to reduce substring search to some basic operations over arrays of symbols.
      The survey of Navarro-Mäkinen contains many other indexes.
      The result:
      • Count(P): O(p) time
      • Locate(P): O(occ * polylog(|T|)) time
      • Display(T[i,i+L]): O(L + polylog(|T|)) time
      • Space occupancy: |T| H_k(T) + o(|T| log |Σ|) bits, where Σ is the alphabet
      New concept: the FM-index is an opportunistic data structure.
      The Compressed Permuterm index builds upon the best two features of the FM-index.

  14. First ingredient: L → F mapping
      How do we map L's chars onto F's chars? Need to distinguish equal chars in F...
      • Take two equal L's chars
      • Rotate rightward their rows
      • Same relative order!!
      [Figure: sorted rotation matrix of T = mississippi#, showing only columns F and L; the middle of each row is marked "unknown"]

  15. First ingredient: L → F mapping
      The oracle: Rank(s, 9) = 3, i.e. the number of occurrences of s in L[1,9]
      FM-index is actually a Rank data structure over the BWT: O(1) time and H_k-space
      [Figure: sorted rotation matrix of T = mississippi#, showing only columns F and L; the middle of each row is marked "unknown"]

  16. Second ingredient: Backward step
      Backward step(i): Return LF[i], in O(1) time
      T is scanned backward by using the LF-mapping: ...s s i...
      [Figure: sorted rotation matrix of T = mississippi#, showing only columns F and L; the middle of each row is marked "unknown"]
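The L → F mapping and the backward step can be sketched as follows. This is an illustrative version with 0-based rows and a naive linear-time counter in place of the O(1)-time Rank structure named on slide 15; the function names are not from the talk. `LF[i]` is computed as (number of characters in L smaller than `L[i]`) + (number of occurrences of `L[i]` in `L[0..i-1]`), which is the "same relative order" property of slide 14.

```python
from collections import Counter

L = "ipssm#pissii"  # last column of the sorted rotations of mississippi#


def lf_mapping(last):
    """lf[i] = row whose first character is the same text character as last[i]."""
    counts = Counter(last)
    first_row, total = {}, 0       # first_row[c]: first row starting with c
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    seen = Counter()               # occurrences of each character seen so far in last
    lf = []
    for c in last:
        lf.append(first_row[c] + seen[c])
        seen[c] += 1
    return lf


def invert_bwt(last, end="#"):
    """Scan T backward: start from the row equal to T and repeat the backward step."""
    lf = lf_mapping(last)
    row = last.index(end)          # the row that ends with the end marker is T itself
    chars = []
    for _ in range(len(last)):
        chars.append(last[row])
        row = lf[row]
    return "".join(reversed(chars))


print(invert_bwt(L))        # mississippi#
print(L[:9].count("s"))     # 3, the slide's Rank(s, 9) with 1-based positions
```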

  17. Third ingredient: substring search
      P = si
      Count(P[1,p]): Finds <fr,lr> in O(p) time
      occ = 2 [lr-fr+1]
      L = i p s s m # p i s s i i
      [Figure: sorted rotation matrix of T = mississippi#, with the rows fr..lr prefixed by P highlighted]
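The counting procedure can be sketched as a backward search over L. This is a minimal illustration, not the FM-index itself: `rank` is a naive linear scan standing in for the constant-time Rank structure, rows are 0-based, and the interval is kept half-open as `[fr, lr)`, so `occ = lr - fr` rather than the slide's `lr-fr+1`.

```python
from collections import Counter

L = "ipssm#pissii"  # last column of the sorted rotations of mississippi#


def backward_count(last, pattern):
    """Number of occurrences of `pattern` in the text whose BWT is `last`."""
    counts = Counter(last)
    first_row, total = {}, 0           # first_row[c]: first row starting with c
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    def rank(c, i):
        return last[:i].count(c)       # occurrences of c in last[0..i-1]

    fr, lr = 0, len(last)              # all rows match the empty pattern
    for c in reversed(pattern):        # extend the match one character to the left
        if c not in first_row:
            return 0
        fr = first_row[c] + rank(c, fr)
        lr = first_row[c] + rank(c, lr)
        if fr >= lr:
            return 0
    return lr - fr


print(backward_count(L, "si"))  # 2
```

Each pattern character costs two rank queries, which is where the O(p) bound for Count comes from once rank takes constant time.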

  18. The Compressed Permuterm
      Z = $hat$hip$hop$hot$# (the dictionary strings are lexicographically sorted)
      Build the FM-index of Z to support substring searches
      Some queries are trivial...
      • Prefix(a) = Substring search($a) within Z
      • Suffix(b) = Substring search(b$) within Z
      • Substr(g) = Substring search(g) within Z

  19. PrefixSuffix search
      Key property: the last char of s_i is at L[i+1]
      Cyclic-LF[i]: if (i > #D) return LF[i] else return LF[i+1]
      [Figure: example with i=3, comparing LF[3] and CLF[3] on the sorted rotation matrix of Z]

  20. PrefixSuffix(P): Search the FM-index of Z using Cyclic-LF instead of LF
      No change in time/space bounds of compressed indexes
      [Figure: example PrefixSuffix(ho,p), where the rows prefixed by $ho are extended with CLF in place of LF]
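The Cyclic-LF idea can be sketched by running the backward search for the pattern b$a over the BWT of Z, shifting any interval row that starts with `$s_i` down by one before each step. The sketch rests on several assumptions that are not stated on the slides: rows are 0-based (so the jump applies to rows `i < #D`), `$` is taken to sort before and `#` after every dictionary character (an ordering under which the slide's "last char of s_i is at L[i+1]" property holds), the dictionary strings are distinct, non-empty and contain neither symbol, and `rank` is a naive linear scan. All names are illustrative.

```python
from collections import Counter

SEP, END = "$", "#"


def sort_key(s):
    """Assumed symbol order: SEP < every dictionary character < END."""
    return [0 if c == SEP else 0x110001 if c == END else ord(c) + 1 for c in s]


def build_index(dictionary):
    words = sorted(set(dictionary))            # distinct, lexicographically sorted
    z = SEP + SEP.join(words) + SEP + END      # e.g. $hat$hip$hop$hot$#
    rotations = sorted((z[i:] + z[:i] for i in range(len(z))), key=sort_key)
    last = "".join(row[-1] for row in rotations)
    counts = Counter(last)
    first_row, total = {}, 0                   # first_row[c]: first row starting with c
    for c in sorted(counts, key=sort_key):
        first_row[c] = total
        total += counts[c]
    return last, first_row, counts, len(words)


def prefix_suffix_count(index, a, b):
    """Count the strings that start with a and end with b."""
    last, first_row, counts, m = index
    pattern = b + SEP + a

    def rank(c, i):
        return last[:i].count(c)               # naive stand-in for O(1) rank

    def jump(i):
        return i + 1 if i < m else i           # rows 0..m-1 start with $s_i

    c = pattern[-1]
    if c not in first_row:
        return 0
    fr, lr = first_row[c], first_row[c] + counts[c]   # half-open interval [fr, lr)
    for c in reversed(pattern[:-1]):
        if c not in first_row:
            return 0
        # Cyclic-LF: apply the jump to both interval ends, then the usual LF step.
        fr, lr = (first_row[c] + rank(c, jump(fr)),
                  first_row[c] + rank(c, jump(lr - 1) + 1))
        if fr >= lr:
            return 0
    return lr - fr


index = build_index(["hat", "hip", "hop", "hot"])
print(prefix_suffix_count(index, "ho", "p"))  # 1 (hop)
print(prefix_suffix_count(index, "h", "t"))   # 2 (hat, hot)
```

Because the jump only changes which positions of L are inspected, each step still costs two rank queries, in line with the slide's claim that the time and space bounds of the underlying compressed index are unchanged.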

  21. Rank and Select of strings
      Z = $hat$hip$hop$hot$#
      Other queries...
      • Rank(s) = row of $s$
      • Select(i) = scan backward from L[i+1]

  22. Experiments
      Three dictionaries:
      • Term dictionary: Trec WT10G
      • Host dictionary (reversed): UK-2005
      • Url dictionary (host reversed): first 190Mb of UK-2005
      PrefixSuffix search needs *2

  23. Paolo Ferragina, Università di Pisa

  24. Choose your trade-off
      MRS book says: “one disadvantage of the PI is that its dictionary becomes quite large, including as it does all rotations of each term”. Now, they mention CPI.
      Trade-off:
      • Time of 2060 msec/char, and space close to bzip
      • Time close to Front-Coding (4 msec/char), but <50% of its space
      [Plot: a test on URLs; axis: % dict-size]

  25. We proposed an approach for dictionary storage:
      + Theory: optimal time and entropy-bounds for space
      + Practice: trades time vs space, thus fitting user needs
