Statistical Learning Methods for Information Retrieval

Presentation Transcript


1. NUS, April 12, 2006. Statistical Learning Methods for Information Retrieval. Hang Li, Microsoft Research Asia

  2. Talk Outline • Expert Search: Two-Stage Model • Relevance Ranking: Ranking SVM for IR

  3. Two Stage Model for Expert Search Yunbo Cao, Jingjing Liu, Shenghua Bao, Hang Li, Nick Craswell

4. Expert Search • Who knows about X? (figure: a query is issued and a ranked list of people is returned)

5. Expert Search -- Example • Who knows about digital ink? (figure: the query and the persons returned)

6. Expert Search -- Example (cont'd) • Who knows about digital ink? (figure: the query and the persons returned)

7. Related Work • Profile-based approach [Craswell] • Uses co-occurrences between keywords and personal names

8. Two Stage Model for Expert Search • Rank people using two probability models • Relevance model • Co-occurrence model (diagram: co-occurrence, prior, and relevance components)

9. Two Stage Model for Expert Search (diagram: query q connects to documents d1, d2, d3, which connect to experts e1, e2)

10. Two-Stage Model • Document Relevance Model: a language model • Co-occurrence Model: a mixture of sub-models
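Reading slides 8-10 together, the natural combination is to marginalize over documents: an expert's score for a query sums, over documents, the document's relevance times the expert-document co-occurrence strength. A minimal sketch, assuming hypothetical relevance() and cooccurrence() scoring functions standing in for the language model and the sub-model mixture:

```python
# Sketch of the two-stage score: score(e, q) = sum over documents d of
# relevance(d, q) * cooccurrence(e, d, q). The two callables are hypothetical
# stand-ins for the document relevance model (a language model) and the
# co-occurrence model (a mixture of sub-models).

def rank_experts(query, documents, candidates, relevance, cooccurrence):
    """Return candidate experts sorted by their two-stage score."""
    scores = {
        e: sum(relevance(d, query) * cooccurrence(e, d, query)
               for d in documents)
        for e in candidates
    }
    return sorted(candidates, key=scores.get, reverse=True)
```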

  11. Document Relevance Model • Who knows about timed text?

12. Window-based Sub-model (figure: a query term co-occurs with a relevant person inside a text window; an irrelevant person lies outside the window)
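The figure is not preserved, but the idea can be sketched: a person mention counts as co-occurring with the query only when query terms appear within a fixed token window around the mention. The window size and the raw-count scoring below are illustrative assumptions, not the talk's exact sub-model:

```python
# Hedged sketch of window-based co-occurrence: count query terms that fall
# within `window` tokens of each mention of the person. The window size and
# the raw-count scoring are assumptions for illustration.

def window_score(tokens, person, query_terms, window=20):
    score = 0
    for i, tok in enumerate(tokens):
        if tok == person:
            context = tokens[max(0, i - window): i + window + 1]
            score += sum(1 for t in context if t in query_terms)
    return score
```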

13. Title-Author Sub-model (figure: a query matches a document title; the document's author is taken as the expert candidate)

14. Block-based Sub-model • Co-occurrences appear in the tree structure of sections (heading tags <H1> <H2> <H3> <H4> <H5> <H6>) (figure: query "W3C Management Team" matched against an <H1>/<H2> heading tree with persons listed under the headings)
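A rough sketch of the block-based idea, assuming a page parsed into a tree of heading blocks: a person listed inside a block co-occurs with query terms found in that block's heading or any ancestor heading. The dict-based tree representation is an assumption:

```python
# Hedged sketch of the block-based sub-model on a heading tree built from
# <H1>-<H6> tags. A person inside a block is credited with query terms that
# match the block's heading or any ancestor heading.
# query_terms: a set of lowercased terms.

def block_score(block, query_terms, person, inherited=frozenset()):
    heading_terms = set(block["heading"].lower().split()) | inherited
    score = 0
    if person in block.get("persons", []):
        score += len(heading_terms & query_terms)
    for child in block.get("children", []):
        score += block_score(child, query_terms, person, heading_terms)
    return score
```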

15. Neighbor-based Sub-model (figure: a query co-occurs with a relevant person; a neighboring irrelevant person does not match)

16. Cluster-based Sub-model • People who often co-occur tend to share the same expertise areas • Cluster people, then apply a cluster-based model

17. Expert Search -- Implementation (diagram: system pipeline feeding the query to the co-occurrence model)

18. TREC Expert Search • Document collection • A crawl of the W3C site (http://w3c.org) from June 2004 • 331,307 web pages • Ground truth • W3C working groups, with group names as query topics and group members as experts (10 training topics and 50 test topics)

  19. Experimental Results

  20. Experimental Results

  21. Ranking SVM for IR Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, Hsiao-Wuen Hon

22. General Model for Ranking (diagram: a query (or question) and documents (information) are mapped to relevance scores for ranking)

23. Learning to Rank (Herbrich et al., 2000; Burges et al., 2005) • Learn a function that assigns relevance scores for ranking documents given a query (or question) • Labels come in multiple ranks • Methods: Ranking SVM, RankNet

  24. Learning Model

  25. Evaluation Measures • MRR (Mean Reciprocal Rank) • WTA (Winners Take All) • MAP (Mean Average Precision) • NDCG (Normalized Discounted Cumulative Gain)
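For reference, minimal implementations of two of the listed measures, MRR and MAP, for binary relevance (NDCG is worked out on the next slide). This is a sketch, assuming each query's judgments arrive as a 0/1 list in ranked order:

```python
# MRR: mean over queries of 1 / rank of the first relevant result.
def mrr(runs):
    total = 0.0
    for rels in runs:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(runs)

# Average precision for one query: mean of precision@k over relevant ranks
# (normalized by the number of relevant documents in the list).
def average_precision(rels):
    hits, ap = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += hits / rank
    return ap / max(hits, 1)

# MAP: mean of average precision over queries.
def mean_average_precision(runs):
    return sum(average_precision(r) for r in runs) / len(runs)
```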

26. NDCG • DCG at position m: DCG@m = sum over r = 1..m of (2^g(r) - 1) / log2(1 + r), where g(r) is the relevance grade at rank r, 2^g(r) - 1 is the gain, and 1/log2(1 + r) is the discount • NDCG at position m: DCG@m divided by the DCG@m of the ideal ranking, then averaged over queries • Example • grades g(r): (3, 3, 2, 2, 1, 1, 1) • gains 2^g(r) - 1: (7, 7, 3, 3, 1, 1, 1) • discounts 1/log2(1 + r): (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) • cumulative DCG: (7, 11.41, 12.91, 14.2, 14.59, 14.95, 15.28)
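The example above can be verified directly; this snippet reproduces the cumulative DCG sequence (up to rounding, since the slide rounds the discounts first):

```python
import math

# DCG@m = sum_{r=1..m} (2^g(r) - 1) / log2(1 + r), as defined on the slide.
def dcg(grades, m=None):
    m = len(grades) if m is None else m
    return sum((2 ** g - 1) / math.log2(1 + r)
               for r, g in enumerate(grades[:m], start=1))

grades = [3, 3, 2, 2, 1, 1, 1]
# Cumulative DCG at each rank: 7.0, 11.42, 12.92, 14.21, 14.6, 14.95, 15.28
print([round(dcg(grades, m), 2) for m in range(1, len(grades) + 1)])

# NDCG@m divides by the DCG@m of the ideal (grade-sorted) ranking;
# this list is already ideally ordered, so NDCG = 1 here.
ideal = sorted(grades, reverse=True)
print(dcg(grades) / dcg(ideal))
```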

27. Ranking SVM • Given instance pairs (x_i, x_j) in which x_i should be ranked above x_j • We learn a function f such that f(x_i) > f(x_j) for every such pair • Consider a linear function f(x) = <w, x> • Transforming to classification: f(x_i) > f(x_j) iff <w, x_i - x_j> > 0, so each pair becomes a classification instance with feature vector x_i - x_j

28. Ranking SVM (cont'd) • Ranking Model: f(x) = <w, x>; at retrieval time, documents are sorted by f(x)
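A minimal sketch of the classification transform on toy data, using scikit-learn's LinearSVC as a stand-in SVM solver; the features and relevance grades below are synthetic:

```python
import numpy as np
from sklearn.svm import LinearSVC

# For each pair with y[i] > y[j], the difference x[i] - x[j] is a positive
# example and x[j] - x[i] a negative one; a linear classifier on these
# differences yields w, and f(x) = <w, x> is the ranking function.

def pairwise_transform(X, y):
    diffs, labels = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                diffs.append(X[i] - X[j]); labels.append(1)
                diffs.append(X[j] - X[i]); labels.append(-1)
    return np.array(diffs), np.array(labels)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))        # toy feature vectors
y = rng.integers(0, 3, size=20)         # toy relevance grades 0..2
Xp, yp = pairwise_transform(X, y)
svm = LinearSVC(C=1.0).fit(Xp, yp)
scores = X @ svm.coef_.ravel()          # f(x) = <w, x>, used to sort documents
```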

29. Direct Application of Ranking SVM to Document Retrieval • Query-document pair → feature vector • Combine instance pairs from all queries

30. Problems with Direct Application • Cost sensitivity: errors at the top of the ranking do more damage (d: definitely relevant, p: partially relevant, n: not relevant) • ranking 1: p d p n n n n • ranking 2: d p n p n n n • Query normalization: the number of instance pairs varies widely across queries • q1: d p p n n n n • q2: d d p p p n n n n n • q1 pairs: 2*(d, p) + 4*(d, n) + 8*(p, n) = 14 • q2 pairs: 6*(d, p) + 10*(d, n) + 15*(p, n) = 31
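The pair counts on this slide are easy to verify, and the quadratic growth they illustrate is what motivates the per-query normalization:

```python
from itertools import combinations

# Count instance pairs (documents with different relevance labels) per query,
# reproducing the q1/q2 numbers on the slide.
RANK = {'d': 2, 'p': 1, 'n': 0}   # definitely / partially / not relevant

def count_pairs(labels):
    return sum(1 for a, b in combinations(labels, 2) if RANK[a] != RANK[b])

print(count_pairs('dppnnnn'))      # q1: 14 pairs
print(count_pairs('ddpppnnnnn'))   # q2: 31 pairs
```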

  31. Rank Pair Discrepancy

  32. Query Normalization

33. New Loss Function

  34. Optimization (Gradient Descent)
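The slide body is not preserved; as one plausible reading, here is a sketch of subgradient descent on a weighted pairwise hinge loss, where each pair's weight combines a rank-pair cost and a per-query normalization in the spirit of slides 30-33. The specific weighting and hyperparameters are assumptions:

```python
import numpy as np

# Minimize  L(w) = sum_i c_i * max(0, 1 - <w, x1_i - x2_i>) + lam * ||w||^2
# by subgradient descent. Each pair (x1_i, x2_i) has x1_i ranked above x2_i;
# c_i is an assumed weight combining rank-pair cost and query normalization.

def train_ranker(pairs, costs, dim, lam=0.01, lr=0.1, epochs=100):
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = 2 * lam * w
        for (x1, x2), c in zip(pairs, costs):
            d = x1 - x2
            if 1.0 - w @ d > 0:       # pair still violates the margin
                grad -= c * d
        w -= lr * grad
    return w
```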

  35. Optimization (Quadratic Programming)

  36. Experimental Results (OHSUMED)

  37. Experimental Results (MSN)

  38. Thank You!
