The Power of Prefix Search (with a nice open problem)

Talk at ADS 2007 in Bertinoro, October 3rd. Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany.

Presentation Transcript


  1. The Power of Prefix Search (with a nice open problem) • Talk at ADS 2007 in Bertinoro, October 3rd • Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany

  2. Overview • Part 1 • Definition of our prefix search problem • Applications • Demos of our search engine • Part 2 • Problem definition again • One way to solve it • Another way to solve it • Your way to solve it

  3. Part 1 Definition, Applications, Demos

  4. Problem Definition: Formal • Context-Sensitive Prefix Search • Preprocess • a given collection of text documents such that queries of the following kind can be processed efficiently • Given • an arbitrary set of documents D • and a range of words W • Compute • all word-in-document pairs (w, d) such that w ∈ W and d ∈ D

  5. Problem Definition: Visual • Data is given as • documents containing words • documents have ids (D1, D2, …) • words have ids (A, B, C, …) • Query • given a sorted list of doc ids • and a range of word ids • [Figure: a collection of documents, each shown with its id and its words, e.g. D13: A O E W H, D17: B W U K A, D88: P A E G Q; the example query is the doc id list D13, D17, D88, … together with the word range C D E F G H]

  6. Problem Definition: Visual • Data is given as • documents containing words • documents have ids (D1, D2, …) • words have ids (A, B, C, …) • Query • given a sorted list of doc ids • and a range of word ids • Answer • all matching word-in-doc pairs • with scores • and positions • [Figure: the same document collection with the query documents highlighted, D13: A O E W H, D17: B W U K A, D88: P A E G Q; the example query is the doc id list D13, D17, D88, … together with the word range C D E F G H]

  8. Application 1: Autocompletion • After each keystroke • display completions of the last query word that lead to the best hits, together with the best such hits • e.g., for the query probabilistic alg display algorithm and algebra and show hits for both

  9. Application 2: Error Correction • As before, but also … • … display spelling variants of completions that would lead to a hit • e.g., for the query probabilistic algorithm also consider a document containing probalistic aigorithm • Implementation • if, say, aigorithm occurs as a misspelling of algorithm, then for every occurrence of aigorithm in the index (aigorithm → Doc. 17) also add the entry algorithm::aigorithm → Doc. 17

  10. Application 3: Query Expansion • As before, but also … • … display words related to completions that would lead to a hit • e.g., for the query russia metal also consider documents containing russia aluminium • Implementation • for, say, every occurrence of aluminium in the index (aluminium → Doc. 17) also add, once for every occurrence, the entry s:67:aluminium → Doc. 17 and, only once for the whole collection, the entry s:aluminium:67 → Doc. 00

  11. Application 4: Faceted Search • As before, but also … • … along with the completions and hits, display a breakdown of the result set by various categories • e.g., for the query algorithm show (prominent) authors of articles containing these words • Implementation • for, say, an article by Camil Demetrescu that appeared in SODA 2006, add the entries author:Camil_Demetrescu → Doc. 17, venue:SODA → Doc. 17, year:2006 → Doc. 17 • also add camil:author:Camil_Demetrescu → Doc. 17, demetrescu:author:Camil_Demetrescu → Doc. 17, etc.
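
The artificial index entries described on this slide can be generated mechanically. The following is a minimal sketch, not the talk's implementation: the function name `facet_entries` and its parameters are hypothetical, and only the entry formats (`author:…`, `venue:…`, `year:…`, `camil:author:…`) are taken from the slide.

```python
def facet_entries(doc_id, authors, venue, year):
    """Return the artificial (word, doc_id) index entries for one article.

    Hypothetical helper: only the word formats follow the slide.
    """
    entries = []
    for author in authors:
        facet_word = "author:" + author.replace(" ", "_")
        entries.append((facet_word, doc_id))
        # One extra entry per name part, so that typing a prefix of
        # "camil" or "demetrescu" already reaches the author facet.
        for part in author.lower().split():
            entries.append((part + ":" + facet_word, doc_id))
    entries.append(("venue:" + venue, doc_id))
    entries.append(("year:" + str(year), doc_id))
    return entries


entries = facet_entries(17, ["Camil Demetrescu"], "SODA", 2006)
for word, doc in entries:
    print(word, "-> Doc.", doc)
```

Because every facet value is just another word in the index, a breakdown by author becomes a prefix search for `author:` restricted to the current hit set.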

  12. Application 5: Semantic Search • As before, but also … • … display “semantic” completions • e.g., for the query beatles musician display instances of the class musician that occur together with the word beatles • Implementation • cannot simply duplicate index entries of an entity for each category it belongs to, e.g. John Lennon is a singer, songwriter, person, human being, organism, guitarist, pacifist, vegetarian, entertainer, musician, … • tricky combination of completions and joins → SIGIR’07 • and still more applications …

  13. Part 2 Solutions and Open Problem

  14. Solution 1: Inverted Index • For example, probab* alg* • given the documents D13, D17, D88, … (ids of hits for probab*) • and the word range C, D, E, F, G (ids for alg*) • Iterate over all words from the given range • C (algae): D8, D23, D291, … • D (algarve): D24, D36, D165, … • E (algebra): D13, D24, D88, … • F (algol): D56, D129, D251, … • G (algorithm): D3, D15, D88, … • Intersect each list with the given one and merge the results • (D13, E), (D88, E), (D88, G), … • Running time: |D| ∙ |W| + log |W| ∙ merge volume
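
The inverted-index procedure on this slide can be sketched in a few lines. This is an illustrative sketch using the slide's example lists, with in-memory Python lists standing in for compressed on-disk lists; the function names are assumptions.

```python
import heapq


def intersect_sorted(a, b):
    """Intersect two sorted lists of doc ids with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def prefix_search_inv(inverted, docs, word_range):
    """Return all (doc, word) pairs with doc in docs and word in word_range.

    One intersection per word of the range, followed by a |W|-way merge
    of the per-word results (sorted by doc id).
    """
    per_word = []
    for word in word_range:
        hits = intersect_sorted(docs, inverted.get(word, []))
        per_word.append([(d, word) for d in hits])
    return list(heapq.merge(*per_word))


# Example lists from the slide (doc ids without the "D" prefix).
inverted = {
    "algae": [8, 23, 291],
    "algarve": [24, 36, 165],
    "algebra": [13, 24, 88],
    "algol": [56, 129, 251],
    "algorithm": [3, 15, 88],
}
docs = [13, 17, 88]  # hits for probab*
result = prefix_search_inv(inverted, docs, sorted(inverted))
print(result)
```

Each of the |W| intersections touches the whole list D, which is where the |D| ∙ |W| term comes from; the heap-based merge contributes the log |W| factor per output element.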

  15. A General Idea • Precompute inverted lists for ranges of words, e.g. one list for the range A-D • Note • each prefix corresponds to a word range • ideally precompute list for each possible prefix • too much space • but lots of redundancy
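
That each prefix corresponds to a word range follows from keeping the vocabulary sorted. A minimal sketch, assuming a sorted in-memory vocabulary and assuming that no word contains the largest Unicode code point (used here as an upper sentinel); `prefix_to_range` is a hypothetical name.

```python
from bisect import bisect_left


def prefix_to_range(vocabulary, prefix):
    """Return (lo, hi) such that vocabulary[lo:hi] are exactly the
    words starting with prefix. vocabulary must be sorted."""
    lo = bisect_left(vocabulary, prefix)
    # Assumption: no word contains U+10FFFF, so this string sorts after
    # every word that starts with the prefix.
    hi = bisect_left(vocabulary, prefix + "\U0010FFFF")
    return lo, hi


vocabulary = ["aachen", "algae", "algarve", "algebra", "algol", "algorithm", "art"]
lo, hi = prefix_to_range(vocabulary, "alg")
print(vocabulary[lo:hi])
```

With word ids assigned in sorted order, the pair `(lo, hi)` is the range W of the problem definition.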

  16. Solution 2: AutoTree (SPIRE’06 / JIR’07) • Trick 1: Relative bit vectors • the i-th bit of the root node corresponds to the i-th doc • the i-th bit of any other node corresponds to the i-th set bit of its parent node • Example (word range and bit vector per node, from the root downwards) • aachen-zyskowski: 1111111111111…, marked bit corresponds to doc 5 • maakeb-zyskowski: 1001000111101…, marked bit corresponds to doc 5 • maakeb-stream: 1001110…, marked bit corresponds to doc 10
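
The relative addressing can be made concrete by resolving a bit position back to a document id. The sketch below assumes that the marked bit in the slide's example is the fifth bit of each node and truncates the root to 13 bits; it uses a linear-time `select`, whereas a real implementation would presumably use constant-time rank/select structures. All names are illustrative.

```python
def select(bits, i):
    """Return the 1-based position of the i-th set bit in a bit string."""
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    raise ValueError("bit vector has fewer than i set bits")


def doc_of_bit(path, i):
    """Resolve the i-th bit of the last node on path to a document id.

    path lists the bit vectors from the root down to the node.
    """
    # Walk upwards: the i-th bit of a node is the i-th set bit of its parent.
    for parent in reversed(path[:-1]):
        i = select(parent, i)
    return i  # at the root, the i-th bit is the i-th doc


root = "1" * 13                 # aachen-zyskowski (truncated)
child = "1001000111101"         # maakeb-zyskowski
grandchild = "1001110"          # maakeb-stream

print(doc_of_bit([root], 5))                     # doc 5
print(doc_of_bit([root, child], 5))              # doc 5
print(doc_of_bit([root, child, grandchild], 5))  # doc 10
```

The grandchild needs only 7 bits because its parent has 7 set bits, which is the source of the space saving.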

  17. Solution 2: AutoTree (SPIRE’06 / JIR’07) • Trick 2: Push up the words • For each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node • [Figure: three tree levels for the example query D = 5, 7, 10 and W = max* • top level: bit vector 1 1 1 1 1 1 1 1 1 1 …, stored words include advance, aachen, algol, art, algorithm • second level: bit vector 1 0 0 0 1 0 0 1 1 1 …, stored words maximum, manning, maximal, manner; D = 5, 10 (→ 2, 5); report: maximum • third level: bit vector 1 0 0 1 1 …, stored words mazza, middle, maple; D = 5; report: Ø → STOP]

  18. Solution 2: AutoTree (SPIRE’06 / JIR’07) • Trick 3: Divide into blocks • and build a tree over each block as shown before

  20. Solution 2: AutoTree (SPIRE’06 / JIR’07) • Trick 3: Divide into blocks • and build a tree over each block as shown before • Theorem: • query processing time O(|D| + |output|) • uses no more space than an inverted index • AutoTree Summary: + output-sensitive − not IO-efficient (heavy use of bit-rank operations) − compression not optimal

  21. Parenthesis • Despite its quadratic worst-case complexity, the inverted index is hard to beat in practice • very simple code • lists are highly compressible • perfect locality of access • Number of operations is a deceptive measure • 100 disk seeks take about half a second • in that time one can read 200 MB of contiguous data (if stored compressed) • main memory: 100 non-local accesses ≈ reading a 10 KB block of contiguous data
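
As a back-of-the-envelope check, the disk figures quoted on this slide imply the following rates (derived only from the slide's own numbers, not measured):

```latex
\[
t_{\text{seek}} \approx \frac{0.5\,\text{s}}{100} = 5\,\text{ms},
\qquad
\frac{200\,\text{MB}}{0.5\,\text{s}} = 400\,\text{MB/s},
\qquad
\frac{200\,\text{MB}}{100\ \text{seeks}} = 2\,\text{MB per seek}.
\]
```

In other words, under these numbers a single seek costs about as much as scanning 2 MB of contiguous (compressed) data, which is why counting operations alone is misleading.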

  22. Solution 3: HYB (SIGIR’06 / IR’07) • Flat division of word range into blocks • [Figure: one list per block of the word range: list for A-D, list for E-J, list for K-N]

  23. Solution 3: HYB (SIGIR’06 / IR’07) • Flat division of word range into blocks • Replace doc ids by gaps and words by frequency ranks • Encode both gaps and ranks such that x takes about $\log_2 x$ bits • gaps: +0 → 0, +1 → 10, +2 → 110 • ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110 • [Figure: an actual block of HYB]
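
The two substitutions on this slide (doc ids to gaps, words to frequency ranks) can be sketched as follows. The example block is made up so that the ranks come out as on the slide (A, C, D, B). The `unary` helper merely reproduces the three gap code words shown there; the actual HYB code is presumably a different, entropy-oriented code, so treat that helper as an assumption.

```python
from collections import Counter


def to_gaps_and_ranks(block):
    """block: list of (doc_id, word) pairs sorted by doc_id.

    Returns (gaps, ranks, words_by_rank). The first gap is the first
    doc id itself; rank 1 is the most frequent word in the block.
    """
    freq = Counter(word for _, word in block)
    words_by_rank = sorted(freq, key=lambda w: (-freq[w], w))
    rank = {w: r for r, w in enumerate(words_by_rank, start=1)}
    gaps, ranks, prev = [], [], 0
    for doc, word in block:
        gaps.append(doc - prev)
        ranks.append(rank[word])
        prev = doc
    return gaps, ranks, words_by_rank


def unary(gap):
    """Illustrative code only: +0 -> 0, +1 -> 10, +2 -> 110."""
    return "1" * gap + "0"


block = [(1, "A"), (1, "D"), (2, "A"), (2, "C"), (3, "B"),
         (3, "A"), (5, "C"), (5, "D"), (6, "A"), (6, "C")]
gaps, ranks, words_by_rank = to_gaps_and_ranks(block)
print(gaps)
print(ranks)
print("".join(unary(g) for g in gaps))
```

Gaps of 0 are common because a document usually contributes several words to the same block, and small ranks are common by construction; both are what makes the block compress well.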

  24. Solution 3: HYB (SIGIR’06 / IR’07) • Flat division of word range into blocks • Theorem: • Let n = number of documents, m = number of words • If blocks are chosen of equal volume $\varepsilon \cdot n$ • Then query time ~ $\varepsilon \cdot n$ and empirical entropy $H_{\mathrm{HYB}} \sim (1+\varepsilon) \cdot H_{\mathrm{INV}}$ • HYB Summary: + IO-efficient (mere scans of data) + very good compression − not output-sensitive
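
Query processing over a flat block division amounts to scanning the blocks that overlap the word range. The sketch below works on uncompressed blocks and assumes that each block stores its (doc id, word id) pairs sorted by doc id, then word id; the block layout and all names are illustrative, not the actual HYB format.

```python
import heapq


def scan_block(pairs, docs, lo, hi):
    """One linear scan of a block against the sorted doc id list docs,
    keeping pairs whose word id lies in the inclusive range [lo, hi]."""
    hits, i = [], 0
    for doc, word in pairs:
        while i < len(docs) and docs[i] < doc:
            i += 1
        if i == len(docs):
            break
        if docs[i] == doc and lo <= word <= hi:
            hits.append((doc, word))
    return hits


def prefix_search_hyb(blocks, docs, lo, hi):
    """blocks: list of (first_word_id, last_word_id, pairs).
    A merge is only needed when the range spans more than one block."""
    per_block = [scan_block(pairs, docs, lo, hi)
                 for first, last, pairs in blocks
                 if first <= hi and last >= lo]
    return list(heapq.merge(*per_block))


# Word ids: A=0, B=1, …; two blocks, A-D and E-J.
blocks = [
    (0, 3, [(8, 2), (13, 0), (23, 2), (24, 3), (88, 1)]),
    (4, 9, [(3, 6), (13, 4), (15, 6), (24, 4), (56, 5), (88, 4), (88, 6)]),
]
hits = prefix_search_hyb(blocks, [13, 17, 88], 2, 6)  # word range C..G
print(hits)
```

The scan reads every pair of an overlapping block regardless of how many match, which illustrates both the IO-efficiency and the lack of output-sensitivity noted in the summary.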

  25. Open Problem • A solution for context-sensitive prefix search which is both output-sensitive and IO-efficient • Note: the interesting queries are those with large D and W but small result set • Similar situation for substring search / suffix arrays • all algorithms with good compression have poor locality of access • But prefix search is easier … • … and more relevant for text search Thank you!

  26. INV vs. HYB: Space Consumption • Theorem: The empirical entropy of INV is $\sum_i n_i \cdot \left(1/\ln 2 + \log_2(n/n_i)\right)$ • Theorem: The empirical entropy of HYB with block size $\varepsilon \cdot n$ is $\sum_i n_i \cdot \left((1+\varepsilon)/\ln 2 + \log_2(n/n_i)\right)$ • $n_i$ = number of documents containing the i-th word, $n$ = number of documents • Nice match of theory and practice

  27. INV vs. HYB: Query Time • Experiment: type ordinary queries from left to right: db, dbl, dblp, dblp un, dblp uni, dblp univ, dblp unive, ... • [Figure: query times of INV and HYB] • HYB beats INV by an order of magnitude

  28. Engineering • With HYB, every query is essentially one block scan • perfect locality of access, no sorting or merging, etc. • balanced ratio of read, decompression, processing, etc. • Careful implementation in C++ • Experiment: sum over array of 10 million 4-byte integers (on a Linux PC with an approx. 2 GB/sec memory bandwidth)
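
The kind of micro-benchmark mentioned here is easy to reproduce in spirit. The following is a rough Python sketch of a scan-and-sum over an integer array, not the talk's benchmark (whose results are not part of this transcript); the function name is an assumption, and the item size is read from the array rather than assumed to be 4 bytes.

```python
import time
from array import array


def scan_sum(n):
    """Sum an array of n machine integers and report the scan rate.

    Returns (total, seconds, megabytes_per_second).
    """
    data = array("i", range(n))  # signed ints, typically 4 bytes each
    start = time.perf_counter()
    total = sum(data)
    elapsed = time.perf_counter() - start
    megabytes = n * data.itemsize / 1e6
    return total, elapsed, megabytes / max(elapsed, 1e-9)


if __name__ == "__main__":
    total, seconds, rate = scan_sum(10_000_000)
    print(f"sum={total}  time={seconds:.3f}s  rate={rate:.0f} MB/s")
```

Comparing the reported rate with the machine's memory bandwidth gives a feel for how far a given language or implementation is from a pure scan.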

  33. System Design: High-Level View • Compute Server: C++ • Web Server: PHP • User Client: JavaScript • Debugging such an application is hell!

  34. Basic Problem Definition • Definition: Context-sensitive prefix search and completion • Given a query consisting of • sorted list D of doc ids, e.g. Doc15, Doc183, Doc185, Doc17351, … • range W of word ids, e.g. Word1893 – Word7329 • Compute as a result • all (w, d) with w ∈ W, d ∈ D, sorted by doc id, e.g. (Word7014, Doc15), (Word5112, Doc15), (Word2011, Doc17351), … • Refinements • positions, e.g. Pos12, Pos73, Pos44, … • scores, e.g. 0.7, 0.3, 0.5, …

  35. Basic Problem Definition • For example, dblp uni • set D = document ids from result for dblp • range W = word ids of all words starting with uni → multi-dimensional query processed as sequence of 1½-dimensional queries • For example, intersect completions of results for conf:sigir author: and conf:sigmod author:

  36. Basic Problem Definition • For example, dblp uni • set D = document ids from result for dblp • range W = word ids of all words starting with uni → multi-dimensional query processed as sequence of 1½-dimensional queries • For example, intersect completions of results for conf:sigir author: and conf:sigmod author: → efficient, because completions are from small range
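
Processing a multi-word query as a sequence of prefix searches can be sketched as below. This is a toy in-memory version with a made-up four-word index; it treats every query word as a prefix and uses a plain dictionary of inverted lists instead of a HYB index. All names and data are illustrative.

```python
from bisect import bisect_left


def complete(index, vocabulary, docs, prefix):
    """One context-sensitive prefix search: all (doc, word) pairs with
    doc in docs and word starting with prefix, sorted by doc id."""
    lo = bisect_left(vocabulary, prefix)
    # Assumption: no word contains U+10FFFF (upper sentinel for the range).
    hi = bisect_left(vocabulary, prefix + "\U0010FFFF")
    wanted = set(docs)
    return sorted((d, w) for w in vocabulary[lo:hi]
                  for d in index[w] if d in wanted)


def process_query(index, vocabulary, all_docs, query):
    """Feed the hit set of each step into the next step as its D."""
    docs, pairs = all_docs, []
    for prefix in query.split():
        pairs = complete(index, vocabulary, docs, prefix)
        docs = sorted({d for d, _ in pairs})
    return docs, pairs


index = {"dblp": [1, 2, 5], "universe": [2, 7],
         "university": [1, 5, 9], "unix": [3]}
vocabulary = sorted(index)
all_docs = list(range(1, 10))

docs, pairs = process_query(index, vocabulary, all_docs, "dblp uni")
print(docs)
print(pairs)
```

When the user types the next character, only the last step has to be recomputed, since the set D from the earlier query words is unchanged.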

  37. Conclusions • Context-sensitive prefix search and completion • is a fundamental operation • supports autocompletion search, semantic search, faceted search, DB-style selects and joins, ontology search, … • efficient support via HYB index • very good compression properties • perfect locality of access • Some open issues • integrate top-k query processing • what else can we do with it? • very short prefixes
