
Phrase Identification from Queries and Its Use for Web Search


Presentation Transcript


  1. Phrase Identification from Queries and Its Use for Web Search Fuchun Peng Microsoft Bing 7/23/2010

  2. Motivation • A query is often treated as a bag of words • But when people formulate queries, they use “concepts” as building blocks, e.g. [simmons college]’s [sports psychology] (a course) • Q: simmons college sports psychology • A1: “simmons college”, “sports psychology” • A2: “college sports” • Can we automatically segment the query to recover the concepts?

  3. Outline • Summary of Segmentation approaches • Use for Improving Search Relevance • Query rewriting • Ranking features • Conclusions

  4. Supervised Segmentation • Supervised learning (Bergsma et al., EMNLP-CoNLL 2007) • Binary decision at each possible segmentation point, e.g. w1 w2 w3 w4 w5 with decisions Y N N Y between adjacent words • Features: POS tags, web counts, “the”, “and”, … • Problems: • Limited-range context • Features specifically designed for noun phrases
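To make the boundary-classification view concrete, here is a minimal Python sketch (not Bergsma et al.'s actual system): a binary classifier decides Y/N at each gap between adjacent tokens using only local features. The feature set and the `classify` callable are placeholders.

```python
def boundary_features(tokens, i):
    """Features for the gap between tokens[i] and tokens[i+1]."""
    left, right = tokens[i], tokens[i + 1]
    return {
        "left=" + left: 1.0,
        "right=" + right: 1.0,
        "pair=" + left + "_" + right: 1.0,
        # The talk also mentions POS tags and web counts of patterns
        # like "left right", "left the right", "left and right".
    }

def segment(tokens, classify):
    """classify(features) -> True means: insert a boundary at this gap."""
    segments, current = [], [tokens[0]]
    for i in range(len(tokens) - 1):
        if classify(boundary_features(tokens, i)):
            segments.append(current)
            current = []
        current.append(tokens[i + 1])
    segments.append(current)
    return segments

# Example with a toy "classifier" that splits between "college" and "sports":
split_after_college = lambda f: "left=college" in f
print(segment("simmons college sports psychology".split(), split_after_college))
# -> [['simmons', 'college'], ['sports', 'psychology']]
```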

  5. Training Data Annotation • Manual data preparation • Linguistically driven: [san jose international airport] • Relevance driven: [san jose] [international airport]

  6. Mutual-Information Based Approach (Risvik et al., WWW 2003) • Compute MI between adjacent words: MI(w1, w2) = P(w1 w2) / (P(w1) P(w2)) • If the MI of an adjacent pair falls below a threshold, insert a segment boundary there, e.g. w1 w2 | w3 w4 w5 • Update iteratively • Problems: • Only captures short-range correlation (between adjacent words) • What about “my heart will go on”?
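A minimal sketch of this idea, assuming unigram and bigram counts are available from a web corpus; the threshold value is a placeholder, and the score uses the ratio form shown on the slide rather than a log form.

```python
def mi_segment(tokens, unigram_count, bigram_count, total_tokens, threshold=10.0):
    """Insert a boundary wherever adjacent words are only weakly correlated."""
    segments, current = [], [tokens[0]]
    for w1, w2 in zip(tokens, tokens[1:]):
        p1 = unigram_count.get(w1, 0) / total_tokens
        p2 = unigram_count.get(w2, 0) / total_tokens
        p12 = bigram_count.get((w1, w2), 0) / total_tokens
        score = p12 / (p1 * p2) if p1 > 0 and p2 > 0 else 0.0
        if score < threshold:           # weak correlation -> segment boundary
            segments.append(current)
            current = []
        current.append(w2)
    segments.append(current)
    return segments
```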

  7. Frequency-Based Approach (Hagen et al., SIGIR 2010)

  8. LM-Based Approach (Tan & Peng, WWW 2008) • Assume the query is generated by independent sampling from a probability distribution over concepts (a unigram model of concepts) • Segmentation 1: [simmons college] [sports psychology], P = 0.000016 × 0.000002, with P(simmons college) = 0.000016 and P(sports psychology) = 0.000002 • Segmentation 2: [simmons] [college sports] [psychology], P = 0.000007 × 0.000006 × 0.000024, with P(simmons) = 0.000007, P(college sports) = 0.000006, P(psychology) = 0.000024 • Segmentation 1 has the higher probability • Enumerate all possible segmentations; rank them by the probability of being generated by the unigram model • How do we estimate the parameters P(w) of the unigram model?
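The enumeration described above can be organized as dynamic programming over segmentation break points. The sketch below assumes the concept probabilities P are given as a plain dictionary and reproduces the slide's numbers; concepts not in the dictionary are treated as unseen.

```python
from functools import lru_cache

def best_segmentation(tokens, P, max_len=5):
    """Highest-probability segmentation under a unigram concept model P."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def best(i):
        # Returns (probability, segmentation) of the best split of tokens[i:].
        if i == len(tokens):
            return 1.0, []
        best_p, best_segs = 0.0, None
        for j in range(i + 1, min(len(tokens), i + max_len) + 1):
            concept = " ".join(tokens[i:j])
            p = P.get(concept, 0.0)
            if p == 0.0:
                continue
            rest_p, rest_segs = best(j)
            if rest_segs is None:
                continue
            if p * rest_p > best_p:
                best_p, best_segs = p * rest_p, [concept] + rest_segs
        return best_p, best_segs

    return best(0)

# Probabilities from the slide (all other concepts treated as unseen):
P = {"simmons college": 0.000016, "sports psychology": 0.000002,
     "simmons": 0.000007, "college sports": 0.000006, "psychology": 0.000024}
print(best_segmentation("simmons college sports psychology".split(), P))
# -> approximately (3.2e-11, ['simmons college', 'sports psychology'])
```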

  9. Parameter (Concept Prob.) Estimation I • We have ngram (n = 1..5) counts from a web corpus • 464M documents; L = 33B tokens • Approximate counts for longer ngrams are often computable, e.g. #(harry potter and the goblet of fire) is in [5783, 6399] • Lower bound: #(ABC) = #(AB) + #(BC) - #(AB OR BC) >= #(AB) + #(BC) - #(B) • Applied recursively; solved by DP
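One way to realize the "solved by DP" step is to memoize lower and upper bounds over spans: the inequality above gives a lower bound, and the fact that an ngram can occur no more often than any of its sub-ngrams gives an upper bound. The sketch below is an illustration under these assumptions, with `ngram_count` standing in for the web ngram table (a dict keyed by word tuples of length up to 5).

```python
from functools import lru_cache

def approx_count(words, ngram_count, order=5):
    """Lower/upper bounds on the corpus count of an ngram longer than `order`."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def upper(i, j):
        # An ngram occurs no more often than any of its sub-ngrams.
        if j - i <= order:
            return ngram_count.get(words[i:j], 0)
        return min(upper(i, j - 1), upper(i + 1, j))

    @lru_cache(maxsize=None)
    def lower(i, j):
        # Slide's inequality: #(ABC) >= #(AB) + #(BC) - #(B), with
        # AB = prefix, BC = suffix, B = middle span; an upper bound is
        # used for #(B) so the result remains a valid lower bound.
        if j - i <= order:
            return ngram_count.get(words[i:j], 0)
        return max(0, lower(i, j - 1) + lower(i + 1, j) - upper(i + 1, j - 1))

    return lower(0, len(words)), upper(0, len(words))
```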

  10. Parameter Estimation • Maximum likelihood estimate: P_MLE(t) = #(t) / N • Problem: • #(potter and the goblet of) = 6765 • So P(potter and the goblet of) > P(harry potter and the goblet of fire)? Wrong! • We want not the probability of seeing t in text, but the probability of seeing t as a self-contained concept in text

  11. Parameter Estimation • Choose parameters to maximize the posterior probability given the query-relevant web corpus (equivalently, minimize the total description length) • Notation: t: a query substring; C(t): longest-matching count of t; D = {(t, C(t))}: query-relevant corpus; s(t): a segmentation of t; θ: unigram model parameters (ngram probabilities) • θ = argmax P(D|θ) P(θ) = argmax [ log P(D|θ) + log P(θ) ] (posterior prob. = DL of corpus + DL of parameters) • log P(D|θ) = Σ_t C(t) log P(t|θ) • P(t|θ) = Σ_{s(t)} P(s(t)|θ)
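A simplified reading of this objective is an EM-style loop: segment each query-relevant string under the current parameters, then re-estimate concept probabilities from the resulting segment counts weighted by C(t). The sketch below uses hard (Viterbi) EM and omits the description-length prior log P(θ), so it is an approximation of the slide's objective rather than the exact procedure described in the talk.

```python
from collections import defaultdict

def estimate_theta(corpus, init_theta, segmenter, iterations=10):
    """corpus: dict t -> C(t); init_theta: dict concept -> probability."""
    theta = dict(init_theta)
    for _ in range(iterations):
        counts = defaultdict(float)
        for t, c_t in corpus.items():
            # "E-step" (hard): best segmentation of t under the current theta.
            _, segs = segmenter(t.split(), theta)
            if not segs:
                continue
            for concept in segs:
                counts[concept] += c_t
        # "M-step": renormalize weighted segment counts into probabilities.
        total = sum(counts.values())
        if total == 0:
            break
        theta = {concept: n / total for concept, n in counts.items()}
    return theta

# `segmenter` can be the best_segmentation function sketched earlier.
```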

  12. System Architecture

  13. Evaluation – Data Sets • Three human-segmented data sets: training, validation, and testing, with 500 queries each • Each query segmented by three editors A, B, C

  14. Evaluation – Metrics • Boundary classification accuracy: each gap between adjacent words (e.g. w1 w2 w3 w4 w5 with decisions Y N N Y) is one binary classification • Whole-query accuracy: the percentage of queries whose boundaries are all classified correctly • Segment accuracy: the percentage of segments correctly recovered • Example: truth [abc] [de] [fg]; prediction [abc] [de fg] gives segment precision 1/2
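For concreteness, the metrics on a single query can be computed from boundary positions and segment spans, as in the sketch below (gold and pred are lists of segments). The slide's example yields a segment precision of 1/2.

```python
def boundaries(segments):
    """Positions (token index) after which a segment boundary occurs."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def spans(segments):
    """(start, end) token spans of each segment."""
    out, pos = [], 0
    for seg in segments:
        out.append((pos, pos + len(seg)))
        pos += len(seg)
    return out

def evaluate(gold, pred):
    n_tokens = sum(len(s) for s in gold)
    gaps = n_tokens - 1
    g_cuts, p_cuts = boundaries(gold), boundaries(pred)
    boundary_acc = (gaps - len(g_cuts ^ p_cuts)) / gaps if gaps else 1.0
    query_acc = 1.0 if g_cuts == p_cuts else 0.0
    g_spans, p_spans = set(spans(gold)), set(spans(pred))
    seg_precision = len(g_spans & p_spans) / len(p_spans)
    seg_recall = len(g_spans & p_spans) / len(g_spans)
    return boundary_acc, query_acc, seg_precision, seg_recall

# Slide example: truth [abc] [de] [fg], prediction [abc] [de fg]
gold = [["a", "b", "c"], ["d", "e"], ["f", "g"]]
pred = [["a", "b", "c"], ["d", "e", "f", "g"]]
print(evaluate(gold, pred))  # boundary acc 5/6, query acc 0, precision 1/2, recall 1/3
```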

  15. Results

  16. Results I

  17. Outline • Summary of Segmentation approaches • Use for Improving Search Relevance • Query rewriting • Ranking features • Conclusions

  18. Use for Improving Relevance • Phrase Proximity Boosting • Phrase Level Query Expansion

  19. Phrase Proximity Boosting • Classify each segment into one of three categories • Strong concept: no word reordering, no word insertion/deletion • Treat the whole segment as a single unit in matching and ranking • Weak concept: word reordering and insertion/deletion are allowed • Boost documents matching the weak concepts • Not a concept • Do nothing

  20. Phrase Proximity Boosting • Concept-based BM25, weighted by the confidence of the concepts • Concept-based minimum coverage, weighted by the confidence of the concepts
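The talk does not spell out the exact scoring functions, but the concept-based BM25 idea can be sketched as follows: each identified concept is matched in the document as a single unit, and its BM25 contribution is weighted by the segmenter's confidence in that concept. This is an illustration, not the production ranking formula; the parameter values and the `df` document-frequency map are placeholders.

```python
import math

def concept_bm25(concepts, doc_text, df, n_docs, avg_doc_len, k1=1.2, b=0.75):
    """concepts: list of (phrase, confidence); df: phrase -> document frequency."""
    doc_len = len(doc_text.split())
    score = 0.0
    for phrase, confidence in concepts:
        tf = doc_text.count(phrase)            # exact phrase matches only
        if tf == 0:
            continue
        idf = math.log((n_docs - df.get(phrase, 0) + 0.5) /
                       (df.get(phrase, 0) + 0.5) + 1.0)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += confidence * idf * norm       # confidence-weighted contribution
    return score
```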

  21. Phrase-Based Expansion • Phrase-level replacement • [San Francisco] -> [sf] • [red eye flight] -> [late night flight]

  22. Relevance Results • Significant relevance improvement • Affects 40% of query traffic • Significant DCG gain (1.5% on affected queries) • Significant online CTR gain (0.5% overall)

  23. Outline • Summary of Segmentation approaches • Use for Improving Search Relevance • Query rewriting • Ranking features • Conclusions

  24. Conclusions • Data annotation is important for query segmentation • Phrases are important for improving relevance

  25. References • Bergsma et al., EMNLP-CoNLL 2007 • Risvik et al., WWW 2003 • Hagen et al., SIGIR 2010 • Tan & Peng, WWW 2008

  26. Thank you!

  27. Parameter Estimation II • Solution 1 (offline): segment the web corpus, then collect counts only for ngrams that appear as segments • Example segmentation: ... | Harry Potter and the Goblet of Fire | is | the | fourth | novel | in | the | Harry Potter series | written by | J.K. Rowling | ... • harry potter and the goblet of fire += 1 • potter and the goblet of += 0 • See C. G. de Marcken, Unsupervised Language Acquisition, 1996; Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA 2001 • Technical difficulties

  28. Parameter Estimation III • Solution 2 (online): only consider the parts of the web corpus that overlap with the query (longest matches) • Q = harry potter and the goblet of fire • Corpus: ... Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ... • harry potter and the goblet of fire += 1 • the += 2 • harry potter += 1

  29. Parameter Estimation III (cont.) • Q = potter and the goblet • Corpus: ... Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ... • potter and the goblet += 1 • the += 2 • potter += 1 • Longest-matching counts can be computed directly from raw ngram frequencies in O(|Q|^2)
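One straightforward reading of the O(|Q|^2) direct computation is inclusion-exclusion over raw ngram counts: an occurrence of a query substring t counts as a longest match only if it cannot be extended in the corpus by the query word to its left or right. The exact bookkeeping in the talk may differ; the sketch below illustrates this reading with a placeholder `count` lookup function.

```python
def longest_match_counts(query_tokens, count):
    """count(tuple_of_words) -> raw corpus frequency of that ngram."""
    q = tuple(query_tokens)
    C = {}
    for i in range(len(q)):
        for j in range(i + 1, len(q) + 1):
            t = q[i:j]
            c = count(t)
            if i > 0:                  # remove matches extendable to the left
                c -= count(q[i - 1:j])
            if j < len(q):             # remove matches extendable to the right
                c -= count(q[i:j + 1])
            if i > 0 and j < len(q):   # add back the doubly-removed matches
                c += count(q[i - 1:j + 1])
            C[" ".join(t)] = max(c, 0)
    return C
```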
