
Faceted Searching and Browsing Over Large Collections


Presentation Transcript


  1. Faceted Searching and Browsing Over Large Collections Wisam Dakka, Columbia University

  2. Search Beyond Navigational Queries • Data grows as user needs become more complex, moving beyond mere navigation to discovery • [Digital video camera], [energy-efficient cars] • Challenges for major search engines • Discovery or research queries • Limited user activity • Several dimensions of relevance in results, but no structure • Prices, stores, reviews, locations, and recent news • Google Views: Faceted search with structure for discovery queries

  3. xRank: Pushing Structure for Special Queries • Search • Learn • Explore • Relate • Scan • Track

  4. [Digital Video Camera] on Yahoo!

  5. Large Collections and Lengthy Results • Most users examine only first or second page of query results • Relevant results not only on first page, but on subsequent pages

  6. Weaknesses of “Plain” Search • Search is often unsatisfactory • Poor ranking • Large number of relevant items • Broad-scope queries • Search is sometimes insufficient • Why do we go to a movie rental store or bookstore? • Not effective for curious users and users with little knowledge of the collections

  7. Alternatives for Search: The Topic Facet Our contribution: Summarization-aware topic faceted searching and browsing of news articles

  8. Alternatives for Search: The Time Facet Our contribution: General strategy to naturally impose time in the retrieval task

  9. Alternatives for Search: Multiple Facets Our contribution: Automatically building faceted hierarchies

  10. Agenda: Alternatives Alongside Search • Searching and browsing with the topic facet • Searching and browsing with the time facet • Searching and browsing with multiple facets • Extracting useful facets • Automatically constructing faceted hierarchies • Conclusion and future work

  11. Part 2: The Time Facet • Time-Faceted Searching and Browsing • Example queries: [Barack Obama], [Google IPO]

  12. Time in News Archives • Topic-relevance ranking may not be sufficiently powerful • Consider the query [Madrid bombing] • [Madrid bombing prefer:03/11/2004−04/30/2004] • Searchers often do not know the exact time or date a given event occurred

  13. What to Do When Relevant Time Periods Are Unknown? • Identify relevant time periods using the query terms • Restrict query results to these time periods • Diversify the top-10 results • Alternatively, redefine the relevance of a document as a combination of topic relevance and time relevance • Improve query reformulations using the relevant time periods

  14. General Time-Sensitive Queries • Examples: [Mad Cow], [Hurricane Florida], [Abu Ghraib], [American beheading], [Barack Obama], [Google IPO] • Time-sensitive results: prioritize relevant documents from relevant time periods and rank those documents first • Temporal relevance: the likelihood that a day is relevant to the query, estimated using the distribution of relevant documents in the archive

  15. Temporal Relevance • The probability that we see relevant documents at time t • Given [Madrid Bombing], what is the probability that today is relevant vs. 04/13/2004? • Simple to compute if the relevant documents are known • Use estimation when the relevant documents are unknown
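When the relevant documents are known, temporal relevance can be computed directly from their publication dates. A minimal sketch in Python; the day encoding and the normalization over all relevant documents are my assumptions:

```python
from collections import Counter

def temporal_relevance(relevant_doc_dates):
    """Estimate the temporal relevance of each day for a query from the
    publication dates of its known relevant documents.

    relevant_doc_dates: list of dates (any hashable, e.g. 'YYYY-MM-DD').
    Returns a dict mapping each day to its share of the relevant documents,
    i.e. the probability that a relevant document was published that day.
    """
    counts = Counter(relevant_doc_dates)
    total = sum(counts.values())
    return {day: n / total for day, n in counts.items()}
```

With three of four relevant [Madrid Bombing] documents dated 03/11/2004, that day gets temporal relevance 0.75; the estimation techniques on the next slides approximate this distribution when relevance judgments are unavailable.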

  16. Estimating Techniques for Temporal Relevance • SUM: Compute the value as a normalized weighted sum of the relevance scores of the top-k matching documents published on 04/13/2004 [Diaz and Jones] • BINNING: Compute the value as F(bin(04/13/2004)) • Choose a distribution function F • Arrange days in bins and order the bins based on their priority • Let bin(04/13/2004) be the priority value of the bin of 04/13/2004 • WORD: Compute the value using the frequency of the query words on 04/13/2004, with smoothing applied • Keep track of word frequency for each day in a special index

  17. Binning for Estimating Temporal Relevance • Select a distribution function F • Arrange days in bins and order the bins based on their priority • Daily frequency, past frequency, moving window, accumulated mean, bump shapes • Let bin(t) be the priority value of the bin of time t • Return F(bin(t))
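The BINNING recipe can be sketched as follows; the one-day-per-bin layout, the daily-frequency priority criterion, and the linear distribution function F are all illustrative assumptions, not the talk's exact choices:

```python
def binning_estimate(day_frequencies, day, F=lambda rank, n: (n - rank) / n):
    """BINNING sketch: order days into bins by priority, then map the
    queried day's bin position through a distribution function F.

    day_frequencies: dict day -> frequency of query activity that day
                     (the priority criterion; here, daily frequency).
    Days are placed one per bin, ordered by descending frequency;
    bin(t) is the day's rank, and F turns that rank into a score.
    """
    ordered = sorted(day_frequencies, key=day_frequencies.get, reverse=True)
    n = len(ordered)
    rank = ordered.index(day)  # bin(t): priority position of day t
    return F(rank, n)
```

The highest-priority bin thus receives the largest estimate, and the score decays linearly with bin rank under this assumed F.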

  18. Answering Queries: Background • q = [Madrid Bombing]; d = a document in the collection; R = the documents relevant to [Madrid Bombing] • To answer q, score each d based on the content of d and q • LM: Rank d based on the likelihood of generating q from d • BM25: Rank d based on the odds of d being observed in R

  19. Answering Time-Sensitive Queries • Related work: answering recency queries • [Barack Obama Speech] or [Myanmar cyclone] • “Boost” the topic relevance scores of the most recent documents, to promote recent articles • Modify the document prior in language models • Does not work for other time-sensitive queries • Goal: a general framework for all queries • A document has two components: content and time • Combine traditional relevance (content) with temporal relevance
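One generic way to combine the two components is a log-space mixture of the content score and the temporal relevance. This is a sketch only: the interpolation weight `lam` is an assumption of mine, and the talk's actual integrations into Indri and Lemur may combine the factors differently:

```python
import math

def time_sensitive_score(query_likelihood, temporal_relevance, lam=0.5):
    """Combine content relevance (e.g., the LM query likelihood p(q|d))
    with the temporal relevance of the document's date, in log space:

        score(d, q) = log p(q | d) + lam * log p(t_d | q)

    lam is an assumed mixing weight; both inputs must be positive
    (in practice the temporal estimate is smoothed to avoid zeros).
    """
    return math.log(query_likelihood) + lam * math.log(temporal_relevance)
```

Under this mixture, two documents with equal content scores are separated by their dates: the one from a more relevant time period ranks higher.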

  20. LM for Time-Sensitive Queries [figure: query q scored against a document's two components, content and time] • Implemented as part of Indri • Developed an analogous integration with BM25 (also implemented as part of Lemur)

  21. BM25 for Time-Sensitive Queries [figure: query q scored against a document's two components, content and time] • We showed two ways to approximate this factor • Implemented as part of Lemur

  22. Evaluating LM and BM25 • Data collections and queries • TREC News Archive • Portion of TREC volumes 4 and 5, 1991-94 • Three sets of time-sensitive queries with relevance judgments • Newsblaster Archive • Six years of news crawled daily from multiple sources • Amazon Mechanical Turk relevance judgments for 76 queries • LM and BM25 with temporal relevance • SUM, BINNING, and WORD • TREC evaluation metrics • P@k and MAP

  23. Performance Over Newsblaster • BUMP- and SUM-based techniques improve precision significantly at the top recall cutoff levels • Precision of our techniques drops at higher recall cutoff levels

  24. Contributions • Identify the “most important” time period(s) for queries without user input • Estimate temporal relevance using different techniques • Combine temporal relevance and topic relevance for all time-sensitive queries using several state-of-the-art retrieval models • Evaluate our proposed methods extensively to investigate the implications of adding time to the retrieval task

  25. Part 3: Searching and Browsing with Multiple Facets* • A. Extracting Useful Facets • B. Automatic Construction of Hierarchies • * Work published in CIKM 2005, SIGIR 2006 Workshop, ICDE 2007 Demo, and ICDE 2008

  26. Facets for Searching and Browsing • A facet is a “clearly defined, mutually exclusive, and collectively exhaustive aspect, property, or characteristic of a class or specific subject” [S. R. Ranganathan] • Useful facets for large collections (New York Times, Corbis, Flickr, YouTube): Location, People, Time, Topic, Actor, Animal

  27. Beyond Topic and Time Facets • Objective • Automatically generate a faceted interface over a large collection • e.g., The New York Times or YouTube • Challenges • We do not know what facets appear in the collection • We need to build the hierarchy for each facet • We need to associate items with facets • e.g., what terms describe the facet in a picture (dog -> animal) • Approaches • Supervised and unsupervised extraction of facet terms • A hierarchy construction algorithm for each facet

  28. Extraction of Facet Terms • Goal: For each new item in the collection, extract descriptive terms and a set of useful facets [figure: photos of a cat and a dog with user tags such as “orange, fish, tail, cute”; “cat” expands to feline, carnivore, mammal, animal, living being, object, entity] • General idea: • Identify important terms within each item • Corbis and YouTube user-provided tags • Derive context for each important term from external resources • e.g., Wikipedia, WordNet, … • Associate terms with facets • Supervised: Group terms with a predefined facet, as in Corbis • Unsupervised: Cluster terms

  29. Supervised Extraction: Results Using SVM and Ripper • SVM baseline: 10% (F1), slightly above random classification • Adding hypernyms: 71% (F1) • Best results: SVM with both hypernyms and associated keywords • Ripper: Investigate whether rule-based assignments are sufficient • High-level WordNet hypernyms: 55% (F1), significantly worse than SVM • Some classes (facets) work well with simple, rule-based assignment of terms to facets • Generic Animals (93.3%) • Action Process Activity (35.9%) • F1 = harmonic mean of Precision and Recall

  30. Identifying Important Terms for News • Named Entities using LingPipe named entity recognizer • Output: named entities (e.g., Elizabeth II) • Wikipedia Terms using Wikipedia titles, redirects, and anchor text • Output: Wikipedia-listed entities • Yahoo Terms using Yahoo term extractor • Output: significant words or phrases

  31. Extracting Context for News • Document terms too specific for facet hierarchies • Solution: Expand terms by querying external resources • Wikipedia • WordNet

  32. Comparative Term Frequency Analysis • Compare term frequencies in the expanded text DB against the original text DB • Context expansion introduces many noisy terms • However: Facet terms are infrequent in the original collection, yet frequent in the expanded one • Frequency-based shifting • Rank-based shifting • Log-likelihood statistic • Use the identified terms to build the facet hierarchies
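The log-likelihood statistic for this comparison can be sketched with Dunning's formulation, scoring how surprisingly often a term appears in the expanded collection relative to the original one. The variable names and two-term form are my assumptions:

```python
import math

def log_likelihood(a, b, c, d):
    """Dunning log-likelihood statistic for comparative frequency analysis.

    a: term count in the expanded text DB
    b: term count in the original text DB
    c: total size of the expanded text DB
    d: total size of the original text DB
    Higher values flag terms whose frequency shifts between the two DBs,
    e.g. facet terms that are rare originally but common after expansion.
    """
    e1 = c * (a + b) / (c + d)  # expected count in the expanded DB
    e2 = d * (a + b) / (c + d)  # expected count in the original DB
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll
```

A term occurring at the same rate in both collections scores 0; the score grows as the expanded-collection count outpaces the original one.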

  33. Recall and Precision • Data sets: 24 sources (SNB); a single day of Newsblaster; a month and a single day of NYT • Recall: • 5 users per story • Keep terms listed by >2 users • Measure overlap • Precision: • Is the hierarchy term useful? • Is it correctly placed? • A term is precise if >4 users say yes

  34. Efficient Hierarchy Construction • After identifying facets, we need to navigate within each facet • Subsumption algorithm (Sanderson and Croft, SIGIR 1999) • Improved version of the subsumption algorithm • For the best parameter values, three times faster than the original subsumption algorithm • Good integration with relational databases • Extensive experiments
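The core test of Sanderson and Croft's subsumption algorithm can be sketched from document co-occurrence: a parent term subsumes a child when the parent appears in (almost) every document the child appears in, but not vice versa. The 0.8 threshold is the value commonly associated with their paper, taken here as an assumption:

```python
def subsumes(docs_x, docs_y, threshold=0.8):
    """Subsumption test: term x subsumes term y (x becomes y's parent in
    the hierarchy) when P(x | y) >= threshold and P(y | x) < 1, with both
    probabilities estimated from the sets of documents containing each term.
    """
    docs_x, docs_y = set(docs_x), set(docs_y)
    both = len(docs_x & docs_y)
    p_x_given_y = both / len(docs_y)  # how often y's documents contain x
    p_y_given_x = both / len(docs_x)  # how often x's documents contain y
    return p_x_given_y >= threshold and p_y_given_x < 1
```

For example, a broad term occurring in documents {1, 2, 3, 4} subsumes a narrower term occurring only in {1, 2}; applying this test to all term pairs yields the parent-child edges of a facet hierarchy.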

  35. Ranking Methods: Maximize Coverage • Ranking categories is important and difficult • Important: users have limited cognitive ability to understand the presented information • Difficult: users lack explicit goals while browsing • Frequency-based Ranking (Baseline) • Users first see the categories with the greatest wealth of information • Set-cover Ranking • Maximize the number of distinct items covered by the top-k ranked categories • Merit-based Ranking • Ranks higher the categories that enable users to access their contents with the smallest average cost
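The set-cover ranking can be sketched with the standard greedy approximation: at each step, pick the category that contributes the most not-yet-covered items. The dict-of-sets data layout is my assumption:

```python
def set_cover_rank(categories, k):
    """Greedy set-cover ranking sketch for the top-k category slots.

    categories: dict category name -> set of item ids in that category.
    Returns up to k category names, chosen so that each successive
    category adds the largest number of items not already covered.
    """
    covered, ranking = set(), []
    remaining = dict(categories)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda c: len(remaining[c] - covered))
        ranking.append(best)
        covered |= remaining.pop(best)
    return ranking
```

Unlike the frequency baseline, this avoids showing two large but heavily overlapping categories in the top slots.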

  36. Evaluation Results • The generation algorithm runs three times faster than the original subsumption algorithm • Merit-based ranking performs well and offers fast access to the contents of the collection • Merit-based rankings are efficient to implement on top of relational database systems, while set-cover rankings typically take longer to compute

  37. Task-based User Study Over News Articles • Five users, “locate news items of interest” • Search interface augmented with our facet hierarchies • Repeated 5 times (different topics) • Initially keyword search, then facet hierarchies • “War in Iraq”, then refinements • Later, users turned to the facet hierarchies directly, with keywords afterwards • Keyword search use gradually dropped by up to 50% • Time required to complete each task dropped by 25% (compared to search only) • Satisfaction remained statistically steady

  38. Summary of Contributions • Supervised extraction of facets for collections like Corbis • Unsupervised discovery of useful facet terms for news • Identifying important terms in a document using Wikipedia • Deriving important context, useful for facet navigation, from multiple external resources • Evaluating the quality and usefulness of the generated facets through extensive user studies with the Amazon Mechanical Turk service • Efficient hierarchy construction algorithm • Ranking alternatives • Extensive evaluation • Human evaluation to examine the usefulness and effectiveness of hierarchies for free-text collections

  39. Conclusions • Developed efficient summarization-aware search for Newsblaster • Integrated time into state-of-the-art retrieval models • Time-sensitive queries • Temporal relevance • Developed extraction techniques for useful facets • News collections • Corbis • Created an efficient hierarchy construction algorithm with ranking alternatives • Performed extensive evaluations

  40. Future Work • Complex user needs • Detecting discovery queries • Introducing structure and facets into Web search results for such queries • Using structured data for QA • Manually or automatically extracted • Using informative and authoritative sources • Integrating smart views and hierarchies for data representation • Enhancing snippet generation • Temporal summaries • Search for less tech-savvy users • The elderly or newcomers

  41. Part 1. The Topic Facet* Summarization-Aware Search and Browsing * Work published in JCDL 2007

  42. Topical Hierarchy of News Events With Machine Summaries

  43. What Makes Search Effective in Newsblaster? Summarization-Aware Search and Browsing • Informative snippets: Summaries highlight the essence of news to help users navigate • Browsing ability: Users should be able to navigate articles in a format similar to browsing Newsblaster • Speed: Users should not have to wait 12 hours for query results; they should not even wait 12 minutes! • Quality: Users should get relevant results

  44. Summarization-Aware Search and Browsing • Offline summarization • Summaries are query-independent • Irrelevant documents and relevant documents might be mixed • Sensitive to summary quality and coverage/coherence • Online summarization • Unacceptably high running time • Hybrid alternative • Some offline clusters might be relevant (no summarization) • Some documents in irrelevant clusters might be relevant

  45. A Hybrid Search Alternative: Reusing Offline Summaries and Clusters When Possible • Select an initial set of offline clusters • Identify relevant offline clusters using a supervised machine learning classifier (more details soon) • Build online clusters using relevant documents from irrelevant clusters • Rank offline and online clusters • Generate summaries for online clusters in the top-k clusters • Return the top-k clusters and their summaries

  46. Identifying Relevant Offline Clusters • Classification task: Given a query and a set of clusters, identify clusters that are relevant to the query • Cluster-level features: • (aggregate) Okapi similarity of cluster documents and query • (aggregate) Okapi similarity of cluster document titles and query • Okapi similarity of cluster summary and query • “recall”: fraction of overall matching documents in cluster • “precision”: fraction of cluster documents that match query • … • Query-level features: • number of “matching” documents in collection • number of “retrieved” clusters • average size of retrieved clusters • (aggregate) Okapi similarity of query and summaries of retrieved clusters • … Further details are omitted from this talk
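The quoted "recall" and "precision" cluster-level features can be sketched as simple set overlaps between a cluster's documents and the documents matching the query (the function and variable names are mine, for illustration):

```python
def cluster_query_features(cluster_doc_ids, matching_doc_ids):
    """Two of the cluster-level classifier features, as set overlaps.

    'recall': fraction of all documents matching the query that fall
              inside this cluster.
    'precision': fraction of the cluster's documents that match the query.
    """
    cluster, matching = set(cluster_doc_ids), set(matching_doc_ids)
    overlap = len(cluster & matching)
    return {
        "recall": overlap / len(matching) if matching else 0.0,
        "precision": overlap / len(cluster) if cluster else 0.0,
    }
```

A cluster scoring high on both features is a strong candidate to be reused offline, summary and all, rather than re-clustered online.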

  47. Step 3: Ranking All Clusters (New and Old) • Not specific to Hybrid Search, but an essential part of it • Only top few clusters returned to users • Need to summarize online only new clusters among top clusters for query • Alternate ranking strategies: • By average Okapi score of matching documents in cluster • By maximum Okapi score of matching documents in cluster • By distance of document with highest Okapi score to cluster “centroid”

  48. Evaluation Questions • Result Quality: How accurate are the documents and summaries? • Document P@k and Summary P@k • Usefulness: How helpful are summaries for leading readers to relevant documents? • NDCG (Normalized Discounted Cumulative Gain) • Efficiency: How efficient are our techniques? • Response time • Evaluation Settings • Data set: Several days of Newsblaster • Labeling: Amazon Mechanical Turk • A service for distributing small tasks to a large number of users, paying a few cents per micro-task
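NDCG can be sketched as the DCG of the observed ranking normalized by the DCG of the ideal ordering of the same relevance gains, so a perfect ranking scores 1:

```python
import math

def ndcg(gains, ideal_gains=None):
    """Normalized Discounted Cumulative Gain of a ranked result list.

    gains: relevance gain of each result, in ranked order.
    ideal_gains: the full pool of gains to build the ideal ranking from;
                 defaults to the observed gains themselves.
    """
    def dcg(gs):
        # Each position's gain is discounted by log2(rank + 1).
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs))

    ideal = sorted(ideal_gains if ideal_gains is not None else gains,
                   reverse=True)
    return dcg(gains) / dcg(ideal)
```

Placing a low-gain summary above a high-gain one pushes the score below 1, which is what the annotator study on the next slides measures.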

  49. Quality of Documents and Summaries in Results [charts: P@20 for documents; P@k for summaries] • HybridOkapi: At least as good as the state-of-the-art flat-list search • Careful use of offline clusters does not damage overall accuracy • HybridOkapi or OnOkapi: On average, returned more relevant summaries than OffDocOkapi

  50. Usefulness of Summaries in Results • Can MTURK annotators use the summaries to predict the perfect ranking? • Top-3 summaries of each technique shown to 5 annotators • Use NDCG to measure the quality of the ranking • NDCG=1 means a perfect ranking • HybridOkapi and OnOkapi summaries substantially outperform OffDocOkapi summaries • OffDocOkapi summaries are computed in a query-independent fashion
