
K-tree/forest: Efficient Indexes for Boolean Queries



Presentation Transcript


  1. K-tree/forest: Efficient Indexes for Boolean Queries. Rakesh M. Verma and Sanjiv Behl, University of Houston, www.cs.uh.edu/~rmverma

  2. Boolean queries
     • Alice and Bob: retrieve documents containing both Alice and Bob
     • Alice or Bob: retrieve documents containing Alice, Bob, or both
     • Alice and not Bob, …

  3. Existing solutions: inverted file. Query: Bob and Alice
     • Retrieve the inverted list (on disk) for Bob
     • Retrieve the inverted list for Alice
     • Merge the lists to compute the intersection, or
     • For "and" queries only: retrieve the shorter list and scan its documents (disk I/Os "saved?" at the expense of CPU time)
     • Google times for query: Bob 0.11 s, Alice 0.1 s, Bob and Alice 0.2 s
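The two inverted-file strategies on this slide can be sketched as follows. This is a minimal in-memory illustration, not the authors' implementation: the names `intersect_sorted`, `scan_shorter`, and the toy `docs` collection are assumptions, and posting lists are assumed to be sorted lists of document ids.

```python
def intersect_sorted(a, b):
    """Merge two sorted posting lists, keeping ids that occur in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def scan_shorter(shorter, docs, other_term):
    """Fetch only the shorter list, then scan each of its documents for the other term."""
    return [d for d in shorter if other_term in docs[d]]


# Hypothetical toy collection: document id -> set of keywords it contains.
docs = {
    1: {"alice", "bob"},
    2: {"alice"},
    3: {"bob"},
    4: {"alice", "bob", "carol"},
}
alice_list = [1, 2, 4]  # inverted list for "alice"
bob_list = [1, 3, 4]    # inverted list for "bob"

merged = intersect_sorted(alice_list, bob_list)
scanned = scan_shorter(bob_list, docs, "alice")
print(merged, scanned)
```

Both calls return the same documents; the merge variant reads two lists, while the scan variant reads one list and pays for inspecting each candidate document.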

  4. Existing solutions: secondary index on inverted lists. Query: Bob and Alice
     • Retrieve the secondary index on Bob's list from disk (assuming the secondary index on Bob's list is the smaller one)
     • Search for Alice in the secondary index
     • Retrieve the documents

  5. K-tree (leaves point to lists on disk)
     [Figure: binary tree with root Alice; its 0 and 1 branches each lead to a Bob node, and each Bob node has 0 and 1 branches to the leaves.]
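One way to read the figure is as a complete binary tree with one level per keyword, where the path of 0/1 branches to a leaf records which keywords are absent or present. The sketch below follows that reading and is an assumption, not the authors' code: the tree is stored implicitly as a dictionary keyed by the bit pattern, and `build_k_tree`, `lookup`, and the toy `docs` are hypothetical names.

```python
from collections import defaultdict


def build_k_tree(keywords, docs):
    """Map each presence pattern (one bit per keyword) to the list of matching doc ids."""
    leaves = defaultdict(list)
    for doc_id in sorted(docs):
        # Bit i is 1 if keyword i occurs in the document, else 0.
        pattern = tuple(int(k in docs[doc_id]) for k in keywords)
        leaves[pattern].append(doc_id)
    return leaves


def lookup(leaves, pattern):
    """Follow one root-to-leaf path; an empty leaf yields an empty list."""
    return leaves.get(pattern, [])


# Hypothetical toy collection: document id -> set of keywords it contains.
docs = {
    1: {"alice", "bob"},
    2: {"alice"},
    3: {"bob"},
    4: {"alice", "bob", "carol"},
}
tree = build_k_tree(("alice", "bob"), docs)

alice_and_bob = lookup(tree, (1, 1))      # Alice and Bob
alice_not_bob = lookup(tree, (1, 0))      # Alice and not Bob
print(alice_and_bob, alice_not_bob)
```

With this layout an and or and-not query over all indexed keywords touches exactly one leaf, so no list merging is needed.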

  6. Experiments
     • Data
       • 1 million word documents divided into pages of 100 words each
       • Pages indexed by keywords contained
     • Methods
       • BST-based inverted file using merge or scan technique
       • K-tree
     • Queries of type:
       • Single keyword
       • Two keywords "and/and-not"

  7. Results for single-word query
     Method                                       I/Os
     BST-based inverted file                      31.26
     K-tree (parallel)                            25.36
     K-tree (sequential)                          37.05
     K-tree (sequential with no fragmentation)    31.26
     Note: index in memory, inverted lists on disk for all methods. Results are averages for all possible queries of the type listed before.

  8. Results for two-word "and" query
     Method                                       I/Os
     BST-based inverted file (merge)              62.52
     BST-based inverted file (scan)               10.13
     K-tree (parallel)                            0.57
     K-tree (sequential)                          0.77
     K-tree (sequential with no fragmentation)    0.61
     Note: index in memory, inverted lists on disk for all methods. Results are averages for all possible queries of the type listed before.

  9. K-forest
     • Tradeoff: size of K-forest vs. post-processing
     • In general, choose the size of subset, s, such that C(K,s)·2^s <= available memory. K can be reduced by standard techniques and by considering frequency.
     [Figure: index on subsets of size 3; K-trees for 3 keywords]
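The sizing rule can be checked numerically. The sketch below assumes memory is counted in leaf slots (one slot per leaf of each K-tree in the forest); that unit and the names `forest_leaves` and `feasible_subset_sizes` are illustrative assumptions rather than anything stated on the slide.

```python
from math import comb


def forest_leaves(K, s):
    """Leaves in a forest with one 2^s-leaf tree per s-subset of K keywords."""
    return comb(K, s) * 2 ** s


def feasible_subset_sizes(K, budget):
    """All subset sizes s in 1..K whose forest fits in the given leaf budget."""
    return [s for s in range(1, K + 1) if forest_leaves(K, s) <= budget]


sizes = feasible_subset_sizes(10, 1000)
print(sizes, forest_leaves(10, 3))
```

Note that C(K,s)·2^s is not monotone in s: for K = 10 it peaks at s = 7 and falls back to 2^10 = 1024 at s = 10, so every candidate s is tested rather than stopping at the first size that fails.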

  10. K-tree highlights
      • Advantages:
        • And/But queries: no post-processing
        • Or queries: require some K-tree traversal
        • Easy to implement
        • Easy to parallelize, especially for shorter and/and-not queries and all or queries
      • Disadvantage:
        • Size 2^K for K keywords, but this is overkill since user queries are typically short (over 90% of queries contain at most 5 keywords). Queries with 10 or more keywords are very rare.
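The contrast between and-type and or-type queries can be illustrated with a short sketch. It assumes the leaves are stored in a dictionary keyed by the 0/1 presence pattern, which is one possible representation and not necessarily the authors'; `or_query` and the sample `leaves` are hypothetical.

```python
from itertools import product


def or_query(leaves, keywords, terms):
    """Union the leaves whose pattern has a 1 for at least one queried term."""
    idx = [keywords.index(t) for t in terms]
    hits = []
    for pattern in product((0, 1), repeat=len(keywords)):
        if any(pattern[i] for i in idx):
            # Each document sits in exactly one leaf, so no duplicates arise.
            hits.extend(leaves.get(pattern, []))
    return sorted(hits)


# Hypothetical leaves of a K-tree over ("alice", "bob"): pattern -> doc ids.
keywords = ("alice", "bob")
leaves = {(1, 1): [1, 4], (1, 0): [2], (0, 1): [3], (0, 0): [5]}

alice_or_bob = or_query(leaves, keywords, ["alice", "bob"])
alice_only_term = or_query(leaves, keywords, ["alice"])
print(alice_or_bob, alice_only_term)
```

An or query visits several of the 2^K leaves, which is the traversal cost mentioned above; because the visited leaves are independent, they could be read in parallel.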

  11. Conclusions and Future Work
      • We have presented efficient structures (K-tree/forest) for Boolean queries
      • One direction is to run more experiments, for example using TREC collections
      • Another direction is to study how document characteristics can help in choosing the "right set of keywords" to include in these structures
