Dynamic Faceted Search for Discovery-driven Analysis

Dynamic Faceted Search for Discovery-driven Analysis. Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman. CIKM’08. Speaker: Li, Huei-Jyun. Advisor: Dr. Koh, Jia-Ling. Date: 2008/12/18.


Presentation Transcript


  1. Dynamic Faceted Search for Discovery-driven Analysis • Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman • CIKM’08 • Speaker: Li, Huei-Jyun • Advisor: Dr. Koh, Jia-Ling • Date: 2008/12/18

  2. Outline • Introduction • Terminology and Problem Statement • Measure of “Interestingness” • Implementing Dynamic Faceted Search • Evaluation • Conclusion and Future Work

  3. Introduction • Today’s faceted search systems are designed for browsing catalog data and are not directly suitable for discovery-driven exploration • To preserve browsing consistency, facets selected for navigation tend to be “static” • When browsing online catalogs, the navigational facets are single-dimensional only

  4. Introduction • The paper proposes a dynamic faceted search system for the kind of discovery-driven analysis that is often performed in On-Line Analytical Processing (OLAP) systems • From a potentially large search result, the goal is to automatically and dynamically discover a small set of facets and values that are deemed most “interesting” to a user

  5. Terminology and Problem Statement • Defn 1. • A repository D is a collection of documents • Each document is composed of some free text and one or more <facet: value> pairs • Given a value f in facet F, we call <F:f> an instance of F • All unique values associated with a facet F form the domain of F

  6. Terminology and Problem Statement • Defn 2. • Organize the domain of these facets into a facet hierarchy • Each node in the hierarchy stores a <facet: value> pair • A node <F1: f1> is the parent of another node <F2: f2> if for each document, F2 = f2 implies F1 = f1

  7. Terminology and Problem Statement • Defn 3. • Assume a query q on the repository has the form “keywords && F1 = f1 && F2 = f2…” • The result of q is denoted by Dq • Dq is the set of documents that contain the specified keywords and satisfy all constraints on the selected facets

  8. Terminology and Problem Statement • Defn 4. • Given a query q, define a facet summary for a facet set F1, …, Fm as a list of tuples <f1, …, fm, A(f1, …, fm)> over Dq • fi is an instance of facet Fi • A(f1, …, fm) is an aggregate of documents in Dq that contain all these facet instances
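
A minimal sketch of Defn 4, assuming the aggregate A is a document count and that each document in Dq is a plain dictionary mapping a facet name to a list of values. The function name `facet_summary` and the sample data are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import product


def facet_summary(docs, facets):
    """Return tuples (f1, ..., fm, count) over the documents in Dq.

    docs   -- iterable of dicts: facet name -> list of values (assumed layout)
    facets -- the facet set F1, ..., Fm as a list of facet names
    """
    counts = Counter()
    for doc in docs:
        # A document may carry several values per facet, so every
        # combination of its values contributes one count.
        value_lists = [doc.get(facet, []) for facet in facets]
        for combo in product(*value_lists):
            counts[combo] += 1
    # Highest counts first; each tuple ends with the aggregate A(f1, ..., fm).
    return [combo + (n,) for combo, n in counts.most_common()]


Dq = [
    {"venue": ["VLDB"], "year": [2005]},
    {"venue": ["VLDB"], "year": [2005]},
    {"venue": ["SIGMOD"], "year": [2006]},
]
print(facet_summary(Dq, ["venue", "year"]))
# [('VLDB', 2005, 2), ('SIGMOD', 2006, 1)]
```

A document that lacks one of the requested facets contributes nothing to the summary, because no combination of values can be formed for it.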

  9. Terminology and Problem Statement • Problem Definition: • Given a repository of documents with n facets, a query q, and two integers K1 and K2 • Select K1 facet sets, and for each a facet summary with up to K2 tuples, that are the most “interesting” to a user

  10. Measure of “Interestingness” • Interestingness: how surprising an actual aggregated value is, given a certain expectation

  11. Measure of “Interestingness”: Setting the Expectation • For a given set of facet values f1, …, fm from F1, …, Fm: • CD(f1, …, fm): the number of documents in D with all those facet values • Cq(f1, …, fm): the number of documents in Dq with all those facet values • E[Cq(f1, …, fm)]: an “expected” value for Cq(f1, …, fm) • Three ways to set the expectation: natural, navigational, ad hoc

  12. Measure of “Interestingness”: Setting the Expectation • Natural: • For an individual facet instance <F:f>, the expectation is set under a uniformity assumption • For an instance f1, …, fm of a facet set, the expectation is set under an independence assumption
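
A sketch of how these two assumptions are commonly formalized. The exact formulas used by the authors are an assumption here. Under uniformity, a facet value is expected to be as frequent in Dq as in the whole repository D. Under independence, the expected joint count is the product of the individual fractions within Dq.

```latex
% Uniformity assumption (assumed form): same relative frequency in D_q as in D
E[C_q(f)] = C_D(f) \cdot \frac{|D_q|}{|D|}

% Independence assumption (assumed form): facets are independent within D_q
E[C_q(f_1, \dots, f_m)] = |D_q| \cdot \prod_{i=1}^{m} \frac{C_q(f_i)}{|D_q|}
```

For example, if a venue covers 10% of D and the query returns 200 documents, the first formula expects 20 of them to come from that venue.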

  13. Measure of “Interestingness”: Setting the Expectation • Navigational: • Ad hoc: • The user can tell the system to set the expectation based on an arbitrary query q of the user’s choice • The expected count for each facet value is set proportionally to the distribution in the result of q

  14. Measure of “Interestingness”: Measuring Degree of Interestingness • Single facet instance: • Evaluate it against a scenario in which its associated count is generated by random sampling • The smaller the probability of observing the count under random sampling, the more interesting the facet instance

  15. Measure of “Interestingness”: Measuring Degree of Interestingness • p-value: • Suppose that a certain facet value occurs in r out of R documents in the repository and in q out of Q documents in the output of a certain query • Also suppose • The interestingness of that facet value vis-à-vis the query: the probability that a random sample of size Q contains at least q documents with that facet value • This count follows a hypergeometric distribution, which can be approximated by a normal or Poisson distribution
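
The p-value described above can be sketched as an exact hypergeometric tail sum. This is an illustration with a hypothetical function name `p_value`, not the authors' implementation. It skips the normal or Poisson approximation, which would matter for large repositories where exact binomial coefficients become expensive.

```python
from math import comb


def p_value(r, R, q, Q):
    """Probability that a random sample of Q documents, drawn without
    replacement from R documents of which r carry the facet value,
    contains at least q documents with that value."""
    total = comb(R, Q)
    upper = min(r, Q)  # the sample cannot hold more than r or Q such documents
    tail = sum(comb(r, x) * comb(R - r, Q - x) for x in range(q, upper + 1))
    return tail / total


# A value present in 5 of 10 documents that fills an entire 3-document result:
print(p_value(r=5, R=10, q=3, Q=3))
```

A smaller result means the observed count is less likely under random sampling, so the facet value is more interesting.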

  16. Measure of “Interestingness”: Measuring Degree of Interestingness • The whole facet: • For each facet F, we consider the p-values of only the k most interesting values in F • , replace  • The final measure: • MaxWeight: assign 1 to w1 and 0 to the rest • AvgWeight: assign each wi an equal weight • HybridWeight: average the interestingness computed by MaxWeight and AvgWeight
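
A minimal sketch of the three weighting schemes. It assumes each facet value's score is the negative logarithm of its p-value, so that a larger score means more interesting. That transformation, the function name `facet_score`, and the scheme labels are assumptions made for illustration.

```python
from math import log


def facet_score(p_values, k, scheme="hybrid"):
    """Aggregate the k most interesting values of one facet into one score.

    p_values -- p-values of the facet's values, each in (0, 1]
    scheme   -- "max", "avg", or "hybrid"
    """
    # Assumed transformation: score = -log(p); keep the k largest scores.
    scores = sorted((-log(p) for p in p_values), reverse=True)[:k]
    if not scores:
        return 0.0
    max_weight = scores[0]                   # w1 = 1, all other weights 0
    avg_weight = sum(scores) / len(scores)   # equal weight for every value
    if scheme == "max":
        return max_weight
    if scheme == "avg":
        return avg_weight
    if scheme == "hybrid":
        return (max_weight + avg_weight) / 2
    raise ValueError(f"unknown scheme: {scheme}")


print(facet_score([0.001, 0.1, 0.5], k=2, scheme="hybrid"))
```

MaxWeight rewards a facet with a single standout value, AvgWeight rewards a facet whose top values are all unusual, and HybridWeight sits between the two.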

  17. Implementing Dynamic Faceted Search • Solr: indexes facets without storing them • Enumerates every facet instance <F: f> from the index and intersects its posting list with Dq • From the intersected set, it derives the count on facet value f • Caches each posting list to a bitset • If the bitset is dense: bitmap • Otherwise: a hash map of document IDs
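
The baseline counting strategy described above can be sketched as one intersection per facet instance. Plain Python sets stand in for the cached bitsets, and `count_facet_values` is a hypothetical name. This does not use or reproduce Solr's actual API.

```python
def count_facet_values(posting_lists, dq):
    """Intersect every facet instance's posting list with Dq.

    posting_lists -- dict: facet value -> set of document IDs (assumed layout)
    dq            -- set of document IDs matching the query
    """
    counts = {}
    for value, postings in posting_lists.items():
        matched = postings & dq  # one intersection per facet instance
        if matched:
            counts[value] = len(matched)
    return counts


venue_postings = {"VLDB": {1, 2, 5}, "SIGMOD": {3}, "TODS": {4, 6}}
print(count_facet_values(venue_postings, dq={1, 3, 5}))
# {'VLDB': 2, 'SIGMOD': 1}
```

The cost grows with the number of facet instances, even when most of them share no documents with Dq.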

  18. Implementing Dynamic Faceted Search • Improving Solr: • Solr limitation 1: it has to choose a threshold that decides the representation of the bitset • Improvement: represent each bitset as a compressed bitmap using Word-Aligned Hybrid (WAH) code

  19. Implementing Dynamic Faceted Search • WAH • There are 2 types of words: • Literal words: a verbatim representation of 31 bits • Fill words: encode the length of a run of all 0’s or all 1’s in 30 bits • A bitmap is first broken into groups of 31 bits and then converted into a sequence of literal and fill words • Operations on bitmaps such as intersection can be performed on WAH code directly without decoding
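
A simplified sketch of the encoding described above, using 32-bit words. It assumes the top bit marks a fill word, the next bit holds the repeated value, and the low 30 bits hold the run length in groups. The names are hypothetical. The sketch covers encoding and decoding only, not intersection performed directly on the compressed words.

```python
GROUP = 31                    # payload bits per word
FILL_FLAG = 1 << 31           # top bit set: fill word
FILL_BIT = 1 << 30            # value repeated by a fill word
MAX_RUN = (1 << 30) - 1       # run length (in groups) fits in 30 bits
ALL_ONES = (1 << GROUP) - 1


def wah_encode(bits):
    """Encode a list of 0/1 integers into a list of WAH words."""
    words = []
    for start in range(0, len(bits), GROUP):
        chunk = bits[start:start + GROUP]
        chunk = chunk + [0] * (GROUP - len(chunk))  # zero-pad the last group
        value = 0
        for b in chunk:
            value = (value << 1) | b
        if value in (0, ALL_ONES):
            fill_bit = FILL_BIT if value else 0
            last = words[-1] if words else None
            # Extend the previous fill word if it repeats the same bit value.
            if (last is not None and last & FILL_FLAG
                    and (last & FILL_BIT) == fill_bit
                    and (last & MAX_RUN) < MAX_RUN):
                words[-1] = last + 1
            else:
                words.append(FILL_FLAG | fill_bit | 1)
        else:
            words.append(value)  # literal word: top bit stays 0
    return words


def wah_decode(words):
    """Expand WAH words back into a list of 0/1 integers."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            bit = 1 if w & FILL_BIT else 0
            bits.extend([bit] * (GROUP * (w & MAX_RUN)))
        else:
            bits.extend((w >> i) & 1 for i in range(GROUP - 1, -1, -1))
    return bits


bitmap = [0] * 62 + [1] + [0] * 30 + [1] * 31
encoded = wah_encode(bitmap)
print(len(bitmap), "bits ->", len(encoded), "words")
```

Long runs of identical bits collapse into a single fill word, which is why sparse and dense bitsets can share one representation without a threshold.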

  20. Implementing Dynamic Faceted Search • Improving Solr: • Solr limitation 2: it has to intersect the matching document set Dq with the bitset of every facet instance • Improvement: reduce the number of intersections by building a directory structure, called a bitset tree, on top of the bitsets of a facet

  21. Implementing Dynamic Faceted Search • Building and Using a Bitset Tree • Starting with the leaf nodes, for each bitset b corresponding to facet instance <F: f>, we create an entry <b, null> • Then divide all entries into groups of size s • For each group, we generate a leaf node holding all entries in that group
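
A minimal sketch of a bitset tree. The leaf construction follows the steps above. The upper levels are an assumption: each parent entry is taken to hold the union of the bitsets in its child node, so a whole subtree can be skipped when that union does not intersect Dq. Python integers stand in for bitsets, the facet value is stored in the slot the slide leaves as null, and all names are hypothetical.

```python
def build_bitset_tree(bitsets, s):
    """Build a tree over per-value bitsets with up to s entries per node.

    bitsets -- dict: facet value -> int used as a bitmask of document IDs
    Returns a node (is_leaf, entries); entries are (bitset, value) in a
    leaf and (union_bitset, child_node) in an inner node.
    """
    if s < 2:
        raise ValueError("group size s must be at least 2")
    entries = [(b, value) for value, b in bitsets.items()]
    is_leaf = True
    while True:
        nodes = [(is_leaf, entries[i:i + s]) for i in range(0, len(entries), s)]
        if len(nodes) <= 1:
            return nodes[0] if nodes else (True, [])
        # Assumed upper level: one entry per node, holding the OR of its bitsets.
        entries = []
        for node in nodes:
            union = 0
            for b, _ in node[1]:
                union |= b
            entries.append((union, node))
        is_leaf = False


def facet_counts(node, dq):
    """Count documents of Dq per facet value, pruning empty subtrees."""
    is_leaf, entries = node
    counts = {}
    for b, child in entries:
        hit = b & dq
        if hit == 0:
            continue  # nothing below this entry intersects Dq
        if is_leaf:
            counts[child] = bin(hit).count("1")
        else:
            counts.update(facet_counts(child, dq))
    return counts


venues = {"VLDB": 0b0011, "SIGMOD": 0b0100, "TODS": 0b1000, "ICDE": 0b0001}
tree = build_bitset_tree(venues, s=2)
print(facet_counts(tree, dq=0b0101))
# {'VLDB': 1, 'SIGMOD': 1, 'ICDE': 1}
```

When Dq is small relative to the repository, most upper-level intersections come back empty, so far fewer per-instance intersections are performed than in the baseline.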

  22. Evaluation: Setup • DBLP • Contains about 13,000 papers published in 26 venues (e.g., SIGMOD, VLDB, TODS) in the past 30 years • It has 14 facets organized in 6 hierarchies, including author, venue, time (e.g., decade, year), location (e.g., country, city), number of authors per paper, number of citations per paper • The title of each paper is used as the text for keyword searches • Used for the user survey

  23. Evaluation: Setup • Patent • Has about 1.8 million U.S. patents from the past 30 years • 16 facets organized into 10 hierarchies • Used for performance evaluation

  24. Evaluation: Result from a User Survey • Tests were performed on 3 keyword queries • 2 were provided by the authors: “distributed”, “mining” • Users picked the third keyword • 1 based on natural expectation • 2 based on navigational expectation • 1 used the complete repository • 1 used the previous query

  25. Evaluation: Result from a User Survey

  26. Evaluation: Result from a User Survey • Our dynamic approach also received some negative feedback • Overall, the feedback for the natural expectation is neutral • Different ways of aggregating the degree of interestingness • HybridWeight(7) > MaxWeight(6) > AvgWeight(2)

  27. Evaluation: Performance Results • Environment: • Implemented in Java • 3GHz P4 desktop machine with 1GB memory • A single disk drive, running Linux • Versions compared: • simple: inverted index • Solr • compressed: improves Solr with WAH code • tree: improves Solr with bitset trees • compressed-tree: both WAH and bitset trees on Solr

  28. Evaluation: Performance Results • Scaling with Data Size • Run a query that matches 25,000 documents using the tree version • Break the total time into search time and summary computation time

  29. Evaluation: Performance Results

  30. Evaluation: Performance Results

  31. Conclusion and Future Work • Develop a novel dynamic faceted search system • Supports OLAP-style discovery-driven analysis • On a large set of structured and unstructured data • Propose an intuitive and effective way of measuring “interestingness” • Propose a novel navigational method of setting a user’s expectation

  32. Conclusion and Future Work • Incorporate user feedback in facet selection • How to extend the aggregates to functions other than count • Sum, average on some numerical measures • How to support dynamic faceted search in a distributed environment
