
Phrase Identification from Queries and Its Use for Web Search


Presentation Transcript


  1. Phrase Identification from Queries and Its Use for Web Search Fuchun Peng Microsoft Bing 7/23/2010

  2. Motivation • A query is often treated as a bag of words • But when people formulate queries, they use “concepts” as building blocks, e.g. [simmons college]’s [sports psychology] (a course) • Q: simmons college sports psychology • A1: “simmons college”, “sports psychology” • A2: “college sports” • Can we automatically segment the query to recover the concepts?

  3. Outline • Summary of Segmentation approaches • Use for Improving Search Relevance • Query rewriting • Ranking features • Conclusions

  4. Supervised Segmentation • Supervised learning (Bergsma et al., EMNLP-CoNLL 2007) • Binary decision at each possible segmentation point, e.g. w1 w2 w3 w4 w5 with decisions Y N N Y between adjacent words • Features: POS tags, web counts, “the”, “and”, … • Problems: • Limited-range context • Features specifically designed for noun phrases
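To make the boundary-classification view concrete, here is a minimal Python sketch (not Bergsma et al.'s actual system): a binary classifier decides Y/N at each gap between adjacent tokens using only local features. The feature set and the `classify` callable are placeholders.

```python
def boundary_features(tokens, i):
    """Features for the gap between tokens[i] and tokens[i+1]."""
    left, right = tokens[i], tokens[i + 1]
    return {
        "left=" + left: 1.0,
        "right=" + right: 1.0,
        "pair=" + left + "_" + right: 1.0,
        # The talk also mentions POS tags and web counts of patterns
        # like "left right", "left the right", "left and right".
    }

def segment(tokens, classify):
    """classify(features) -> True means: insert a boundary at this gap."""
    segments, current = [], [tokens[0]]
    for i in range(len(tokens) - 1):
        if classify(boundary_features(tokens, i)):
            segments.append(current)
            current = []
        current.append(tokens[i + 1])
    segments.append(current)
    return segments

# Example with a toy "classifier" that splits between "college" and "sports":
split_after_college = lambda f: "left=college" in f
print(segment("simmons college sports psychology".split(), split_after_college))
# -> [['simmons', 'college'], ['sports', 'psychology']]
```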

  5. Training Data Annotation • Manual data preparation • Linguistically driven: [san jose international airport] • Relevance driven: [san jose] [international airport]

  6. Mutual-Information Based Approach (Risvik et al., WWW 2003) • Compute MI between adjacent words: MI(w1, w2) = P(w1 w2) / (P(w1) P(w2)) • If the MI of an adjacent pair falls below a threshold, insert a segment boundary there, e.g. w1 w2 | w3 w4 w5 • Update iteratively • Problems: • Only captures short-range correlation (between adjacent words) • What about “my heart will go on”?
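A minimal sketch of this idea, assuming unigram and bigram counts are available from a web corpus; the threshold value is a placeholder, and the score uses the ratio form shown on the slide rather than a log form.

```python
def mi_segment(tokens, unigram_count, bigram_count, total_tokens, threshold=10.0):
    """Insert a boundary wherever adjacent words are only weakly correlated."""
    segments, current = [], [tokens[0]]
    for w1, w2 in zip(tokens, tokens[1:]):
        p1 = unigram_count.get(w1, 0) / total_tokens
        p2 = unigram_count.get(w2, 0) / total_tokens
        p12 = bigram_count.get((w1, w2), 0) / total_tokens
        score = p12 / (p1 * p2) if p1 > 0 and p2 > 0 else 0.0
        if score < threshold:           # weak correlation -> segment boundary
            segments.append(current)
            current = []
        current.append(w2)
    segments.append(current)
    return segments
```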

  7. Frequency-Based Approach (Hagen et al., SIGIR 2010)

  8. LM-Based Approach (Tan & Peng, WWW 2008) • Assume the query is generated by independent sampling from a probability distribution over concepts (a unigram model of concepts) • Segmentation 1: [simmons college] [sports psychology], P = 0.000016 × 0.000002, with P(simmons college) = 0.000016 and P(sports psychology) = 0.000002 • Segmentation 2: [simmons] [college sports] [psychology], P = 0.000007 × 0.000006 × 0.000024, with P(simmons) = 0.000007, P(college sports) = 0.000006, P(psychology) = 0.000024 • Segmentation 1 has the higher probability • Enumerate all possible segmentations; rank them by the probability of being generated by the unigram model • How do we estimate the parameters P(w) of the unigram model?
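The enumeration described above can be organized as dynamic programming over segmentation break points. The sketch below assumes the concept probabilities P are given as a plain dictionary and reproduces the slide's numbers; concepts not in the dictionary are treated as unseen.

```python
from functools import lru_cache

def best_segmentation(tokens, P, max_len=5):
    """Highest-probability segmentation under a unigram concept model P."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def best(i):
        # Returns (probability, segmentation) of the best split of tokens[i:].
        if i == len(tokens):
            return 1.0, []
        best_p, best_segs = 0.0, None
        for j in range(i + 1, min(len(tokens), i + max_len) + 1):
            concept = " ".join(tokens[i:j])
            p = P.get(concept, 0.0)
            if p == 0.0:
                continue
            rest_p, rest_segs = best(j)
            if rest_segs is None:
                continue
            if p * rest_p > best_p:
                best_p, best_segs = p * rest_p, [concept] + rest_segs
        return best_p, best_segs

    return best(0)

# Probabilities from the slide (all other concepts treated as unseen):
P = {"simmons college": 0.000016, "sports psychology": 0.000002,
     "simmons": 0.000007, "college sports": 0.000006, "psychology": 0.000024}
print(best_segmentation("simmons college sports psychology".split(), P))
# -> approximately (3.2e-11, ['simmons college', 'sports psychology'])
```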

  9. Parameter (Concept Prob.) Estimation I • We have ngram (n = 1..5) counts from a web corpus • 464M documents; L = 33B tokens • Approximate counts for longer ngrams are often computable, e.g. #(harry potter and the goblet of fire) is in [5783, 6399] • Lower bound: #(ABC) = #(AB) + #(BC) - #(AB OR BC) >= #(AB) + #(BC) - #(B) • Applied recursively; solved by DP
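One way to realize the "solved by DP" step is to memoize lower and upper bounds over spans: the inequality above gives a lower bound, and the fact that an ngram can occur no more often than any of its sub-ngrams gives an upper bound. The sketch below is an illustration under these assumptions, with `ngram_count` standing in for the web ngram table (a dict keyed by word tuples of length up to 5).

```python
from functools import lru_cache

def approx_count(words, ngram_count, order=5):
    """Lower/upper bounds on the corpus count of an ngram longer than `order`."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def upper(i, j):
        # An ngram occurs no more often than any of its sub-ngrams.
        if j - i <= order:
            return ngram_count.get(words[i:j], 0)
        return min(upper(i, j - 1), upper(i + 1, j))

    @lru_cache(maxsize=None)
    def lower(i, j):
        # Slide's inequality: #(ABC) >= #(AB) + #(BC) - #(B), with
        # AB = prefix, BC = suffix, B = middle span; an upper bound is
        # used for #(B) so the result remains a valid lower bound.
        if j - i <= order:
            return ngram_count.get(words[i:j], 0)
        return max(0, lower(i, j - 1) + lower(i + 1, j) - upper(i + 1, j - 1))

    return lower(0, len(words)), upper(0, len(words))
```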

  10. Parameter Estimation • Maximum likelihood estimate: P_MLE(t) = #(t) / N • Problem: • #(potter and the goblet of) = 6765 • So P(potter and the goblet of) > P(harry potter and the goblet of fire)? Wrong! • We want not the probability of seeing t in text, but the probability of seeing t as a self-contained concept in text

  11. Parameter Estimation • Choose parameters to maximize the posterior probability given the query-relevant web corpus (equivalently, minimize the total description length) • Notation: t: a query substring; C(t): longest-matching count of t; D = {(t, C(t))}: query-relevant corpus; s(t): a segmentation of t; θ: unigram model parameters (ngram probabilities) • θ = argmax P(D|θ) P(θ) = argmax [ log P(D|θ) + log P(θ) ] (posterior prob. = DL of corpus + DL of parameters) • log P(D|θ) = Σ_t C(t) log P(t|θ) • P(t|θ) = Σ_{s(t)} P(s(t)|θ)
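A simplified reading of this objective is an EM-style loop: segment each query-relevant string under the current parameters, then re-estimate concept probabilities from the resulting segment counts weighted by C(t). The sketch below uses hard (Viterbi) EM and omits the description-length prior log P(θ), so it is an approximation of the slide's objective rather than the exact procedure described in the talk.

```python
from collections import defaultdict

def estimate_theta(corpus, init_theta, segmenter, iterations=10):
    """corpus: dict t -> C(t); init_theta: dict concept -> probability."""
    theta = dict(init_theta)
    for _ in range(iterations):
        counts = defaultdict(float)
        for t, c_t in corpus.items():
            # "E-step" (hard): best segmentation of t under the current theta.
            _, segs = segmenter(t.split(), theta)
            if not segs:
                continue
            for concept in segs:
                counts[concept] += c_t
        # "M-step": renormalize weighted segment counts into probabilities.
        total = sum(counts.values())
        if total == 0:
            break
        theta = {concept: n / total for concept, n in counts.items()}
    return theta

# `segmenter` can be the best_segmentation function sketched earlier.
```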

  12. System Architecture

  13. Evaluation – Data Sets • Three human-segmented data sets: training, validation, and testing, with 500 queries each • Each query segmented by three editors A, B, C

  14. Evaluation – Metrics • Boundary classification accuracy: each gap between adjacent words (e.g. w1 w2 w3 w4 w5 with decisions Y N N Y) is one binary classification • Whole-query accuracy: the percentage of queries whose boundaries are all classified correctly • Segment accuracy: the percentage of segments correctly recovered • Example: truth [abc] [de] [fg]; prediction [abc] [de fg] gives segment precision 1/2
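For concreteness, the metrics on a single query can be computed from boundary positions and segment spans, as in the sketch below (gold and pred are lists of segments). The slide's example yields a segment precision of 1/2.

```python
def boundaries(segments):
    """Positions (token index) after which a segment boundary occurs."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def spans(segments):
    """(start, end) token spans of each segment."""
    out, pos = [], 0
    for seg in segments:
        out.append((pos, pos + len(seg)))
        pos += len(seg)
    return out

def evaluate(gold, pred):
    n_tokens = sum(len(s) for s in gold)
    gaps = n_tokens - 1
    g_cuts, p_cuts = boundaries(gold), boundaries(pred)
    boundary_acc = (gaps - len(g_cuts ^ p_cuts)) / gaps if gaps else 1.0
    query_acc = 1.0 if g_cuts == p_cuts else 0.0
    g_spans, p_spans = set(spans(gold)), set(spans(pred))
    seg_precision = len(g_spans & p_spans) / len(p_spans)
    seg_recall = len(g_spans & p_spans) / len(g_spans)
    return boundary_acc, query_acc, seg_precision, seg_recall

# Slide example: truth [abc] [de] [fg], prediction [abc] [de fg]
gold = [["a", "b", "c"], ["d", "e"], ["f", "g"]]
pred = [["a", "b", "c"], ["d", "e", "f", "g"]]
print(evaluate(gold, pred))  # boundary acc 5/6, query acc 0, precision 1/2, recall 1/3
```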

  15. Results

  16. Results I

  17. Outline • Summary of Segmentation approaches • Use for Improving Search Relevance • Query rewriting • Ranking features • Conclusions

  18. Use for Improving Relevance • Phrase Proximity Boosting • Phrase Level Query Expansion

  19. Phrase Proximity Boosting • Classify each segment into one of three categories • Strong concept: no word reordering, no word insertion/deletion • Treat the whole segment as a single unit in matching and ranking • Weak concept: word reordering and insertion/deletion are allowed • Boost documents matching the weak concepts • Not a concept • Do nothing

  20. Phrase Proximity Boosting • Concept-based BM25, weighted by the confidence of the concepts • Concept-based minimum coverage, weighted by the confidence of the concepts
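The talk does not spell out the exact scoring functions, but the concept-based BM25 idea can be sketched as follows: each identified concept is matched in the document as a single unit, and its BM25 contribution is weighted by the segmenter's confidence in that concept. This is an illustration, not the production ranking formula; the parameter values and the `df` document-frequency map are placeholders.

```python
import math

def concept_bm25(concepts, doc_text, df, n_docs, avg_doc_len, k1=1.2, b=0.75):
    """concepts: list of (phrase, confidence); df: phrase -> document frequency."""
    doc_len = len(doc_text.split())
    score = 0.0
    for phrase, confidence in concepts:
        tf = doc_text.count(phrase)            # exact phrase matches only
        if tf == 0:
            continue
        idf = math.log((n_docs - df.get(phrase, 0) + 0.5) /
                       (df.get(phrase, 0) + 0.5) + 1.0)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += confidence * idf * norm       # confidence-weighted contribution
    return score
```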

  21. Phrase-Based Expansion • Phrase-level replacement • [San Francisco] -> [sf] • [red eye flight] -> [late night flight]

  22. Relevance Results • Significant relevance improvement • Affects 40% of query traffic • Significant DCG gain (1.5% on affected queries) • Significant online CTR gain (0.5% overall)

  23. Outline • Summary of Segmentation approaches • Use for Improving Search Relevance • Query rewriting • Ranking features • Conclusions

  24. Conclusions • Data annotation is important for query segmentation • Phrases are important for improving relevance

  25. References • Bergsma et al., EMNLP-CoNLL 2007 • Risvik et al., WWW 2003 • Hagen et al., SIGIR 2010 • Tan & Peng, WWW 2008

  26. Thank you!

  27. Parameter Estimation II • Solution 1 (offline): segment the web corpus, then collect counts only for ngrams that appear as segments • Example segmentation: ... | Harry Potter and the Goblet of Fire | is | the | fourth | novel | in | the | Harry Potter series | written by | J.K. Rowling | ... • harry potter and the goblet of fire += 1 • potter and the goblet of += 0 • See C. G. de Marcken, Unsupervised Language Acquisition, 1996; Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA 2001 • Technical difficulties

  28. Parameter Estimation III • Solution 2 (online): only consider the parts of the web corpus that overlap with the query (longest matches) • Q = harry potter and the goblet of fire • Corpus: ... Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ... • harry potter and the goblet of fire += 1 • the += 2 • harry potter += 1

  29. Parameter Estimation III (cont.) • Q = potter and the goblet • Corpus: ... Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ... • potter and the goblet += 1 • the += 2 • potter += 1 • Longest-matching counts can be computed directly from raw ngram frequencies in O(|Q|^2)
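One straightforward reading of the O(|Q|^2) direct computation is inclusion-exclusion over raw ngram counts: an occurrence of a query substring t counts as a longest match only if it cannot be extended in the corpus by the query word to its left or right. The exact bookkeeping in the talk may differ; the sketch below illustrates this reading with a placeholder `count` lookup function.

```python
def longest_match_counts(query_tokens, count):
    """count(tuple_of_words) -> raw corpus frequency of that ngram."""
    q = tuple(query_tokens)
    C = {}
    for i in range(len(q)):
        for j in range(i + 1, len(q) + 1):
            t = q[i:j]
            c = count(t)
            if i > 0:                  # remove matches extendable to the left
                c -= count(q[i - 1:j])
            if j < len(q):             # remove matches extendable to the right
                c -= count(q[i:j + 1])
            if i > 0 and j < len(q):   # add back the doubly-removed matches
                c += count(q[i - 1:j + 1])
            C[" ".join(t)] = max(c, 0)
    return C
```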
