
Formal Retrieval Frameworks



  1. Formal Retrieval Frameworks ChengXiang Zhai (翟成祥) Department of Computer Science Graduate School of Library & Information Science Institute for Genomic Biology, Statistics University of Illinois, Urbana-Champaign http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu

  2. Outline • Risk Minimization Framework [Lafferty & Zhai 01, Zhai & Lafferty 06] • Axiomatic Retrieval Framework [Fang et al. 04, Fang & Zhai 05, Fang & Zhai 06]

  3. Risk Minimization Framework

  4. Risk Minimization: Motivation • Long-standing IR Challenges • Improve IR theory • Develop theoretically sound and empirically effective models • Go beyond the limited traditional notion of relevance (independent, topical relevance) • Improve IR practice • Optimize retrieval parameters automatically • Statistical language models (SLMs) are very promising tools … • How can we systematically exploit SLMs in IR? • Can SLMs offer anything hard/impossible to achieve in traditional IR?

  5. Long-Standing IR Challenges • Limitations of traditional IR models • Strong assumptions on “relevance” • Independent relevance • Topical relevance • Can we go beyond this traditional notion of relevance? • Difficulty in IR practice • Ad hoc parameter tuning • Can’t go beyond “retrieval” to support info. access in general

  6. More Than “Relevance” • The desired ranking depends on more than topical relevance: redundancy and readability matter too • Pure relevance ranking is therefore not enough

  7. Retrieval Parameters • Retrieval parameters are needed to • model different user preferences • customize a retrieval model according to different queries and documents • So far, parameters have been set through empirical experimentation • Can we set parameters automatically?

  8. Systematic Applications of Language Models to IR • Many different variants of language models have been developed, but are there many more models to be studied? • Can we establish a road map for exploring language models in IR?

  9. Two Main Ideas of the Risk Minimization Framework • Retrieval as a decision process • Systematic language modeling

  10. Idea 1: Retrieval as Decision-Making (A More General Notion of Relevance) • Given a query, the system must decide: - Which documents should be selected? (D) - How should these docs be presented to the user? (π) • Possible presentation strategies π include a ranked list, an unordered subset, or a clustering • The retrieval decision is the choice of the pair (D, π)

  11. Idea 2: Systematic Language Modeling • The retrieval decision is driven by three modeling components: • QUERY MODELING: a query language model estimated from the query • USER MODELING: a loss function reflecting the user • DOC MODELING: document language models estimated from the documents

  12. Generative Model of Document & Query [Lafferty & Zhai 01b] • A user U (partially observed) generates a query q (observed) • A document source S (partially observed) generates a document d (observed) • Relevance R is inferred from the query and document models

  13. Applying Bayesian Decision Theory [Lafferty & Zhai 01b, Zhai 02, Zhai & Lafferty 06] • Observed: query q (from user U) and doc set C (from source S) • Hidden: the underlying models θ • Each possible choice (D1, π1), (D2, π2), …, (Dn, πn) incurs a loss L • RISK MINIMIZATION: select the choice (D, π) that minimizes the Bayes risk, i.e., the expected loss under the posterior over the hidden models
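In sketch form, the Bayes risk of a choice (D, π) is its expected loss under the posterior over the hidden models θ (notation varies across [Lafferty & Zhai 01b] and [Zhai & Lafferty 06]; this is one common way to write it):

    r(D, \pi \mid q, U, C, S) = \int_{\Theta} L(D, \pi, \theta)\, p(\theta \mid q, U, C, S)\, d\theta

    (D^*, \pi^*) = \arg\min_{(D, \pi)} r(D, \pi \mid q, U, C, S)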

  14. Benefits of the Framework • Systematic exploration of retrieval models (covering almost all the existing retrieval models as special cases) • Derive general retrieval principles (risk ranking principle) • Automatic parameter setting • Go beyond independent-relevance (subtopic retrieval)

  15. Special Cases of Risk Minimization • Set-based models (choose D): Boolean model • Ranking models (choose π): • Independent loss • Relevance-based loss: probabilistic relevance model, generative relevance theory • Distance-based loss: vector-space model, two-stage LM, KL-divergence model • Dependent loss • MMR loss, MDR loss: subtopic retrieval model

  16. Case 1: Two-stage Language Models • Stage 1: compute the document model from d and the source S using Dirichlet prior smoothing • Stage 2: mix the smoothed document model with a query/user background model (mixture model) • The loss function instantiates to a risk ranking formula equivalent to query likelihood with two-stage smoothing
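The two stages can be written down concretely. Below is a minimal Python sketch of two-stage smoothing in the spirit of [Zhai & Lafferty 06]; the toy counts and the parameter values mu and lam are illustrative assumptions, not tuned settings.

    import math

    def two_stage_score(query, doc, coll_prob, mu=1000.0, lam=0.5, user_prob=None):
        """Score a document by query likelihood with two-stage smoothing.

        Stage 1: Dirichlet prior smoothing of the document model with the
                 collection model p(w|C).
        Stage 2: linear interpolation (mixture) with a user/background
                 model p(w|U); falls back to p(w|C) if no user model is given.
        """
        doc_len = sum(doc.values())
        score = 0.0
        for w in query:
            p_c = coll_prob.get(w, 1e-9)                             # p(w|C)
            p_stage1 = (doc.get(w, 0) + mu * p_c) / (doc_len + mu)   # Dirichlet
            p_u = user_prob.get(w, p_c) if user_prob else p_c        # p(w|U)
            p_stage2 = (1 - lam) * p_stage1 + lam * p_u              # mixture
            score += math.log(p_stage2)                              # log-likelihood
        return score

    # Toy example (illustrative counts only)
    doc = {"robot": 3, "arm": 1}
    coll_prob = {"robot": 0.001, "arm": 0.002, "welding": 0.0005}
    print(two_stage_score(["robot", "welding"], doc, coll_prob))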

  17. Case 2: KL-divergence Retrieval Models • Estimate a query model from q (and the user U) and a document model from d (and the source S) • The risk ranking formula scores each document by the negative KL divergence between the query model and the document model
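A minimal sketch of KL-divergence ranking: since the query-model entropy is constant across documents, ranking by -KL(theta_Q || theta_D) is equivalent to ranking by the cross-entropy term. The Jelinek-Mercer smoothing used for the document model here is one choice among several.

    import math

    def kl_score(query_model, doc, coll_prob, lam=0.1):
        """Rank-equivalent KL-divergence score: -KL(theta_Q || theta_D).

        Dropping the query-model entropy (constant across documents),
        ranking by -KL reduces to sum_w p(w|theta_Q) * log p(w|theta_D).
        """
        doc_len = sum(doc.values()) or 1
        score = 0.0
        for w, p_q in query_model.items():
            p_ml = doc.get(w, 0) / doc_len
            p_d = (1 - lam) * p_ml + lam * coll_prob.get(w, 1e-9)  # JM smoothing
            score += p_q * math.log(p_d)
        return score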

  18. Case 3: Aspect Generative Model of Document & Query • The user U generates the query q, and the source S generates the document d, from a shared set of latent aspect models θ = (θ1, …, θk) • The aspect model can be instantiated with PLSI or LDA

  19. Optimal Ranking for Independent Loss: “Risk Ranking Principle” [Zhai 02] • Decision space = {rankings} • Assume sequential browsing and an independent loss (the loss of a ranking is a sum of per-document losses) • Then independent risk = independent scoring: the optimal ranking sorts documents by their individual risks
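A sketch of the decomposition behind the principle: write the loss of a ranking as a weighted sum of per-document losses, with s_i the (assumed) browsing/stopping weights at rank i; the Bayes risk then separates:

    L(\pi, \theta) = \sum_i s_i\, l(d_i, \theta)
    \;\Rightarrow\;
    r(\pi) = \sum_i s_i \int_{\Theta} l(d_i, \theta)\, p(\theta \mid q, U, C, S)\, d\theta = \sum_i s_i\, r(d_i)

Since each r(d_i) depends only on d_i, minimizing r(π) amounts to sorting documents by their individual risks r(d_i).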

  20. Automatic Parameter Tuning • Retrieval parameters are needed to • model different user preferences • customize a retrieval model to specific queries and documents • Retrieval parameters in traditional models • EXTERNAL to the model, hard to interpret • Parameters are introduced heuristically to implement “intuition” • No principles to quantify them, must set empirically through many experiments • Still no guarantee for new queries/documents • Language models make it possible to estimate parameters…

  21. The Way to Automatic Tuning ... • Parameters must be PART of the model! • Query modeling (explain difference in query) • Document modeling (explain difference in doc) • De-couple the influence of a query on parameter setting from that of documents • To achieve stable setting of parameters • To pre-compute query-independent parameters

  22. Parameter Setting in Risk Minimization • Query model parameters: estimated from the query (query language model) • Doc model parameters: estimated from the documents (document language models) • User model parameters (in the loss function): set according to the user

  23. Generative Relevance Hypothesis [Lavrenko 04] • Generative Relevance Hypothesis: • For a given information need, queries expressing that need and documents relevant to that need can be viewed as independent random samples from the same underlying generative model • A special case of risk minimization when document models and query models are in the same space • Implications for retrieval models: “the same underlying generative model” makes it possible to • Match queries and documents even if they are in different languages or media • Estimate/improve a relevant document model based on example queries or vice versa

  24. Risk minimization can easily go beyond independent relevance…

  25. Aspect Retrieval • Query: What are the applications of robotics in the world today? Find as many DIFFERENT applications as possible. • Example Aspects: A1: spot-welding robotics; A2: controlling inventory; A3: pipe-laying robots; A4: talking robot; A5: robots for loading & unloading memory tapes; A6: robot [telephone] operators; A7: robot cranes; … • Aspect judgments (binary matrix: does document di cover aspect Aj?):

           A1 A2 A3 A4 ... Ak-1 Ak
      d1    1  1  0  0 ...  0    0
      d2    0  1  1  1 ...  0    0
      d3    0  0  0  0 ...  1    0
      ...
      dk    1  0  1  0 ...  0    1

  • Must go beyond independent relevance!

  26. Evaluation Measures • Aspect Coverage (AC): measures per-doc coverage • #distinct-aspects / #docs • Maximizing AC is equivalent to the “set cover” problem, NP-hard • Aspect Uniqueness (AU): measures redundancy • #distinct-aspects / #aspects • Maximizing AU is equivalent to the “volume cover” problem, NP-hard • Example (accumulated counts after each of d1, d2, d3):

      #doc:        1        2        3
      #asp:        2        5        8
      #uniq-asp:   2        4        5
      AC:    2/1=2.0  4/2=2.0  5/3=1.67
      AU:    2/2=1.0  4/5=0.8  5/8=0.625
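A small Python sketch of these accumulated measures; the toy judgments are made up to reproduce the numbers in the example above.

    def aspect_measures(ranking, judgments):
        """Compute accumulated Aspect Coverage (AC) and Aspect Uniqueness (AU).

        judgments: dict doc_id -> set of aspect ids covered by that doc.
        Returns a list of (AC, AU) after each rank position.
        """
        seen = set()        # distinct aspects covered so far
        total = 0           # total (non-distinct) aspect occurrences so far
        results = []
        for i, doc_id in enumerate(ranking, start=1):
            aspects = judgments.get(doc_id, set())
            total += len(aspects)
            seen |= aspects
            ac = len(seen) / i              # #distinct-aspects / #docs
            au = len(seen) / max(total, 1)  # #distinct-aspects / #aspects
            results.append((ac, au))
        return results

    # Toy judgments matching the example table
    judg = {"d1": {"A1", "A2"}, "d2": {"A2", "A3", "A4"}, "d3": {"A1", "A3", "A5"}}
    print(aspect_measures(["d1", "d2", "d3"], judg))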

  27. Dependent Relevance Ranking • In general, the computation of the optimal ranking is NP-hard • A general greedy algorithm • Pick the first document according to INDEPENDENT relevance • Given that we have picked k documents, evaluate the CONDITIONAL relevance of each candidate document • Choose the document that has the highest conditional relevance value
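A minimal sketch of this greedy procedure; rel and cond_rel are placeholders for whatever independent and conditional relevance functions (MMR- or MDR-style) are plugged in.

    def greedy_rerank(candidates, rel, cond_rel, k):
        """Greedy approximation to the (NP-hard) optimal dependent ranking.

        rel(d)              -> independent relevance score of d
        cond_rel(d, picked) -> relevance of d conditioned on already-picked docs
        Assumes a non-empty candidate list.
        """
        remaining = list(candidates)
        picked = [max(remaining, key=rel)]   # first doc: independent relevance
        remaining.remove(picked[0])
        while remaining and len(picked) < k:
            best = max(remaining, key=lambda d: cond_rel(d, picked))
            picked.append(best)
            remaining.remove(best)
        return picked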

  28. Loss Function L(θk+1 | θ1 … θk) • Given already-selected documents d1 … dk with models θ1 … θk, score a new candidate dk+1 with model θk+1 • Maximal Marginal Relevance (MMR): the best dk+1 is novel & relevant, combining novelty Nov(θk+1 | θ1 … θk) with relevance Rel(θk+1) • Maximal Diverse Relevance (MDR): the best dk+1 is complementary in coverage, based on the aspect coverage distributions p(a|θi)

  29. Maximal Marginal Relevance (MMR) Models • Maximizing aspect coverage indirectly through redundancy elimination • Conditional-Rel. = novel + relevant • Elements • Redundancy/Novelty measure • Combination of novelty and relevance

  30. A Mixture Model for Redundancy • Model each word w in the new document as drawn from a two-component mixture: with probability λ from the reference (“old”) document model P(w|Old), and with probability 1−λ from the collection background model P(w|Background) • Estimate λ = ? by Maximum Likelihood, using Expectation-Maximization • A large λ indicates the document is largely redundant with what the user has already seen
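A minimal EM sketch for estimating the mixing weight, assuming the two component models are fixed and only λ is re-estimated; the distributions and smoothing constant are toy assumptions.

    def em_mixing_weight(doc, p_old, p_bg, lam=0.5, iters=50):
        """Estimate lambda in p(w) = lam*p(w|Old) + (1-lam)*p(w|Background)
        by EM, maximizing the likelihood of the word counts in `doc`.
        A high lambda suggests the document is redundant w.r.t. old docs."""
        total = sum(doc.values())
        for _ in range(iters):
            # E-step: posterior probability each occurrence came from "Old"
            expected_old = 0.0
            for w, c in doc.items():
                po = lam * p_old.get(w, 1e-9)
                pb = (1 - lam) * p_bg.get(w, 1e-9)
                expected_old += c * po / (po + pb)
            # M-step: new mixing weight = expected fraction drawn from "Old"
            lam = expected_old / total
        return lam

    # Toy example: a new document heavily overlapping with the "old" model
    p_old = {"robot": 0.5, "arm": 0.5}
    p_bg = {"robot": 0.01, "arm": 0.01, "pipe": 0.98}
    print(em_mixing_weight({"robot": 5, "arm": 3, "pipe": 2}, p_old, p_bg))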

  31. Cost-based Combination of Relevance and Novelty • The final ranking score combines a relevance score and a novelty score, with cost parameters weighting the two components

  32. Maximal Diverse Relevance (MDR) Models • Maximizing aspect coverage directly through aspect modeling • Conditional-rel. = complementary coverage • Elements • Aspect loss function • Generative Aspect Model

  33. Aspect Generative Model of Document & Query • (Same model as slide 18) The user U generates the query q, and the source S generates the document d, from a shared set of latent aspect models θ = (θ1, …, θk), instantiated with PLSI or LDA

  34. Aspect Loss Function • Defined over the aspect models of the query q (from user U) and the document d (from source S)

  35. Aspect Loss Function: Illustration • “Already covered”: the aspect distributions p(a|θ1) … p(a|θk−1) of the documents selected so far • New candidate: p(a|θk) • The loss compares the combined coverage of the k documents against the desired coverage p(a|Q) • A new candidate can be perfect (fills the gap), redundant (repeats covered aspects), or non-relevant

  36. Risk Minimization: Summary • Risk minimization is a general probabilistic retrieval framework • Retrieval as a decision problem (= risk minimization) • Separate/flexible language models for queries and docs • Advantages • A unified framework for existing models • Automatic parameter tuning due to LMs • Allows for modeling complex retrieval tasks • Lots of potential for exploring LMs… • For more information, see [Zhai 02]

  37. Future Research Directions • Modeling latent structures of documents • Introduce source structures (naturally suggest structure-based smoothing methods) • Modeling multiple queries and clickthroughs of the same user • Let the observation include multiple queries and clickthroughs • Collaborative search • Introduce latent interest variables to tie similar users together • Modeling interactive search

  38. Axiomatic Retrieval Framework Most of the following slides are from Hui Fang’s presentation

  39. Traditional Way of Modeling Relevance • A query is mapped to a representation QRep, a document to a representation DRep, and relevance between them is modeled in one of two main ways: • Vector Space Models: Rel ≈ Sim(DRep, QRep) [Salton et al. 75, Salton et al. 83, Salton et al. 89, Singhal 96] • Probabilistic Models: Rel ≈ P(R=1 | DRep, QRep) [Robertson et al. 76, van Rijsbergen 77, Fuhr et al. 92, Turtle et al. 91, Ponte et al. 98, Lafferty et al. 03] • Models are then validated empirically on a test collection • Problems: • No way to predict the performance and identify the weaknesses • Sophisticated parameter tuning

  40. No Way to Predict the Performance

  41. Sophisticated Parameter Tuning “k1, b and k3 are parameters which depend on the nature of the queries and possibly on the database; k1 and b default to 1.2 and 0.75 respectively, but smaller values of b are sometimes advantageous; in long queries k3 is often set to 7 or 1000.” [Robertson et al. 1999]
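For concreteness, here is a sketch of the Okapi BM25 scoring function these parameters live in, using the quoted defaults; the formula follows the standard Okapi form, and the toy statistics are made up (a real system would read them from the index).

    import math

    def bm25(query_tf, doc_tf, doc_len, avdl, N, df, k1=1.2, b=0.75, k3=1000.0):
        """Okapi BM25 with the parameters k1 (TF saturation), b (length
        normalization), and k3 (query TF saturation) from [Robertson et al. 1999]."""
        score = 0.0
        for w, qtf in query_tf.items():
            if w not in doc_tf or w not in df:
                continue
            idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5))
            tf = ((k1 + 1) * doc_tf[w]) / (k1 * ((1 - b) + b * doc_len / avdl) + doc_tf[w])
            qtf_w = ((k3 + 1) * qtf) / (k3 + qtf)
            score += idf * tf * qtf_w
        return score

    # Toy usage (made-up statistics)
    print(bm25({"robot": 1}, {"robot": 3}, doc_len=100, avdl=120, N=10000, df={"robot": 50}))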

  42. High Parameter Sensitivity

  43. Hui Fang’s Thesis Work [Fang 07] Propose a novel axiomatic framework, where relevance is directly modeled with term-based constraints • Predict the performance of a function analytically [Fang et al., SIGIR04] • Derive more robust and effective retrieval functions [Fang & Zhai, SIGIR05, Fang & Zhai, SIGIR06] • Diagnose weaknesses and strengths of retrieval functions [Fang & Zhai, under review]

  44. Traditional Way of Modeling Relevance (Recap) • Query → QRep, Document → DRep, validated on a test collection • Vector Space Models: Rel ≈ Sim(DRep, QRep) • Probabilistic Models: Rel ≈ P(R=1 | DRep, QRep)

  45. Axiomatic Approach to Relevance Modeling • Relevance Rel(Q, D) is modeled directly with term-based constraints: Constraint 1, Constraint 2, …, Constraint m • Each constraint can be checked analytically, or tested empirically on a collection built for it (Collection (constraint 1), …, Collection (constraint m)) • This makes it possible to (1) predict performance, (2) develop more robust functions, and (3) diagnose weaknesses

  46. Part 1: Define Retrieval Constraints [Fang et al., SIGIR 2004]

  47. Empirical Observations in IR (Cont.) • Effective retrieval functions such as the Pivoted Normalization Method, the Dirichlet Prior Method, and the Okapi Method all implement the same heuristics: • Term Frequency, usually with a sub-linear TF transformation such as 1+ln(c(w,d)) • Inverse Document Frequency • Document Length Normalization • All of them also exhibit parameter sensitivity; a sketch of the formulas follows below
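A Python sketch of the first two scoring formulas as they are usually written (an Okapi/BM25 sketch appears after slide 41 above); the defaults s=0.2 and mu=2000 are common illustrative settings, not claims about these slides.

    import math

    def pivoted_score(query_tf, doc_tf, doc_len, avdl, N, df, s=0.2):
        """Pivoted normalization: sub-linear TF 1+ln(1+ln(c(w,d))),
        pivoted document length normalization, and IDF ln((N+1)/df(w))."""
        score = 0.0
        for w, qtf in query_tf.items():
            c = doc_tf.get(w, 0)
            if c == 0 or w not in df:
                continue
            tf = (1 + math.log(1 + math.log(c))) / (1 - s + s * doc_len / avdl)
            score += tf * qtf * math.log((N + 1) / df[w])
        return score

    def dirichlet_score(query_tf, doc_tf, doc_len, coll_prob, mu=2000.0):
        """Dirichlet prior smoothing in its rank-equivalent form:
        sum_w c(w,q)*ln(1 + c(w,d)/(mu*p(w|C))) + |q|*ln(mu/(mu+|d|))."""
        qlen = sum(query_tf.values())
        score = qlen * math.log(mu / (mu + doc_len))
        for w, qtf in query_tf.items():
            c = doc_tf.get(w, 0)
            if c > 0:
                score += qtf * math.log(1 + c / (mu * coll_prob.get(w, 1e-9)))
        return score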

  48. Research Questions • How can we formally characterize these necessary retrieval heuristics? • Can we predict the empirical behavior of a method without experimentation?

  49. Term Frequency Constraints (TFC1) • TF weighting heuristic I: Give a higher score to a document with more occurrences of a query term. • TFC1: Let q be a query with only one term w. If |d1| = |d2| and c(w, d1) > c(w, d2), then S(q, d1) > S(q, d2).

  50. Term Frequency Constraints (TFC2) • TF weighting heuristic II: Favor a document with more distinct query terms. • TFC2: Let q be a query and w1, w2 be two query terms. Assume |d1| = |d2| and idf(w1) = idf(w2). If c(w1, d2) = c(w1, d1) + c(w2, d1) and c(w2, d2) = 0, with c(w1, d1) > 0 and c(w2, d1) > 0, then S(q, d1) > S(q, d2). (A numeric check of both constraints follows below.)
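A minimal numeric check of the two constraints against a hypothetical linear TF-IDF function (the function, collection statistics, and documents are all made up for illustration). A linear-TF function like this satisfies TFC1 but fails TFC2's strict preference, which is exactly the kind of diagnosis the constraints enable.

    import math

    N, DF = 1000, {"w1": 10, "w2": 10}   # toy collection stats, idf(w1)=idf(w2)

    def score(query, doc):
        """Hypothetical linear TF-IDF scoring function to be diagnosed."""
        return sum(doc.get(w, 0) * math.log(N / DF[w]) for w in query)

    # TFC1: same length, more occurrences of w -> strictly higher score
    d1, d2 = {"w1": 2, "x": 3}, {"w1": 1, "x": 4}
    print("TFC1:", score(["w1"], d1) > score(["w1"], d2))              # True: satisfied

    # TFC2: same length and total occurrences, but d1 covers both query terms
    d1, d2 = {"w1": 1, "w2": 1}, {"w1": 2}
    print("TFC2:", score(["w1", "w2"], d1) > score(["w1", "w2"], d2))  # False: violated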
