
A Formal Study of Information Retrieval Heuristics


Presentation Transcript


  1. A Formal Study of Information Retrieval Heuristics Hui Fang, Tao Tao, ChengXiang Zhai University of Illinois at Urbana-Champaign SIGIR 2004 Presented by CHU Huei-Ming 2004/01/17

  2. Outline • Formal Definitions of Heuristic Retrieval Constraints • Analysis of Three Representative Retrieval Formulas • Pivoted Normalization Method • Okapi Method • Dirichlet Prior Method • Experiments • Conclusion and Future Work

  3. Formal Definitions of Heuristic Retrieval Constraints • Six intuitive and desirable constraints that any reasonable retrieval formula should satisfy • Term Frequency Constraints (TFCs) • Term Discrimination Constraint (TDC) • Length Normalization Constraints (LNCs) • TF-Length Constraint (TF-LNC)

  4. Formal Definitions of Heuristic Retrieval Constraints • Term Frequency Constraints (TFCs) • TFC1: Let q = {w} and assume |d1| = |d2|. If c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q) • TFC2: Let q = {w} and assume |d1| = |d2| = |d3| and c(w,d1) > 0. If c(w,d2) - c(w,d1) = 1 and c(w,d3) - c(w,d2) = 1, then f(d2,q) - f(d1,q) > f(d3,q) - f(d2,q)
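
A minimal sketch of how TFC1 and TFC2 can be checked numerically. The simple ln(1 + tf) · idf scorer below is a hypothetical stand-in, not one of the paper's formulas; it only illustrates the shape of the check (an increasing, concave TF component satisfies both constraints).

```python
# Hypothetical toy scorer: increasing and concave in tf, so TFC1 and TFC2 hold.
import math

def toy_score(tf, idf=2.0):
    return math.log(1 + tf) * idf

def check_tfc1(max_tf=20):
    # TFC1: more occurrences of the single query term => strictly higher score
    return all(toy_score(tf + 1) > toy_score(tf) for tf in range(0, max_tf))

def check_tfc2(max_tf=20):
    # TFC2: the gain from each extra occurrence shrinks as tf grows
    deltas = [toy_score(tf + 1) - toy_score(tf) for tf in range(1, max_tf)]
    return all(a > b for a, b in zip(deltas, deltas[1:]))

print(check_tfc1(), check_tfc2())  # True True
```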

  5. Formal Definitions of Heuristic Retrieval Constraints • Term Discrimination Constraint (TDC) • Let q be a query and w1, w2 ∈ q be two query terms • Assume |d1| = |d2| and c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2) • If idf(w1) ≥ idf(w2) and c(w1,d1) ≥ c(w1,d2), then f(d1,q) ≥ f(d2,q)

  6. Formal Definitions of Heuristic Retrieval Constraints • Length Normalization Constraints (LNCs) • LNC1: Let q be a query and d1, d2 be two documents. If for some word w' ∉ q, c(w',d2) = c(w',d1) + 1, but for every query term w, c(w,d2) = c(w,d1), then f(d1,q) ≥ f(d2,q) • LNC2: Let q be a query, k > 1, and d1, d2 be two documents. If |d1| = k · |d2| and for all terms w, c(w,d1) = k · c(w,d2), then f(d1,q) ≥ f(d2,q)

  7. Formal Definitions of Heuristic Retrieval Constraints • TF-Length Constraint (TF-LNC) • Let q = {w} and d1, d2 be two documents • If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) - c(w,d2), then f(d1,q) > f(d2,q)

  8. Formal Definitions of Heuristic Retrieval Constraints

  9. Analysis of Three Representative Retrieval Formulas • Pivoted Normalization Method • Okapi Method • Dirichlet Prior Method

  10. Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method • Retrieval function • Analysis
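
The retrieval function itself appeared as a formula image on the original slide. A sketch of the standard pivoted normalization scoring formula analyzed in the paper is below; the code layout and variable names (tf = c(w,d), qtf = c(w,q), doc_len = |d|, avdl, N, df, slope s) are my assumptions.

```python
# Sketch of the pivoted normalization scoring formula analyzed in the paper:
# sum over query terms of  (1 + ln(1 + ln tf)) / ((1-s) + s*|d|/avdl) * qtf * ln((N+1)/df)
import math

def pivoted_score(query_terms, doc_terms, doc_len, avdl, N, df, s=0.2):
    score = 0.0
    for w, qtf in query_terms.items():
        tf = doc_terms.get(w, 0)
        if tf == 0:
            continue
        tf_part = (1 + math.log(1 + math.log(tf))) / ((1 - s) + s * doc_len / avdl)
        idf_part = math.log((N + 1) / df[w])
        score += tf_part * qtf * idf_part
    return score
```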

  11. Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method • Check the TF-LNC constraint: when |d1| = avdl, the condition reduces to an upper bound on s • TF-LNC is satisfied only if s is below a certain upper bound

  12. Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method • Check the LNC2 constraint

  13. Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method • Consider the common case where |d2| = avdl • Performance can be bad for a large s
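
To make the LNC2 point concrete, the sketch below scores a single-term document d2 with |d2| = avdl against d1 built from k concatenated copies of d2, using the TF/length part of the pivoted sketch above. The specific numbers are illustrative assumptions; the behavior (small s satisfies LNC2, large s violates it) matches the slide's claim.

```python
# LNC2 check for pivoted normalization in the common case |d2| = avdl:
# d1 is k copies of d2, so its term count and length both scale by k.
import math

def check_lnc2(s, tf=1, avdl=100, idf=2.0, max_k=5):
    def tf_part(tf, doc_len):
        return (1 + math.log(1 + math.log(tf))) / ((1 - s) + s * doc_len / avdl)
    base = tf_part(tf, avdl) * idf                      # score of d2 (|d2| = avdl)
    return all(tf_part(k * tf, k * avdl) * idf >= base  # score of d1 for each k
               for k in range(2, max_k + 1))

print(check_lnc2(s=0.1))   # True: a small s satisfies LNC2 for these k
print(check_lnc2(s=0.5))   # False: a large s violates LNC2
```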

  14. Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method • Check the TDC constraint • It is equivalent to c(w2,d1) ≥ c(w1,d2), so TDC is only conditionally satisfied

  15. Analysis of Three Representative Retrieval Formulas: Okapi Method • Retrieval function • Typical parameter settings: k1 between 1.0 and 2.0, b usually 0.75, k3 between 0 and 1000
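
As with the pivoted slide, the Okapi (BM25) formula was shown as an image. A sketch of the form analyzed in the paper follows; the parameter defaults come from the slide, while the code layout and names are my assumptions.

```python
# Sketch of the Okapi (BM25) retrieval formula in the form analyzed in the paper.
import math

def okapi_score(query_terms, doc_terms, doc_len, avdl, N, df,
                k1=1.2, b=0.75, k3=1000):
    score = 0.0
    for w, qtf in query_terms.items():
        tf = doc_terms.get(w, 0)
        if tf == 0:
            continue
        idf_part = math.log((N - df[w] + 0.5) / (df[w] + 0.5))  # can go negative
        tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * doc_len / avdl) + tf)
        qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf_part * tf_part * qtf_part
    return score
```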

  16. Analysis of Three Representative Retrieval Formulas: Okapi Method • Analysis • When df(w) > N/2, the IDF part of the formula becomes negative • When the IDF part is positive (mostly true for keyword queries), the TFCs and LNCs are satisfied • TF-LNC constraint: in the common case where |d2| = avdl, the constraint is equivalent to b ≤ avdl / c(w,d2) • TDC is equivalent to c(w2,d1) ≥ c(w1,d2), the same condition as for pivoted normalization

  17. Analysis of Three Representative Retrieval Formulas: Okapi Method • Modified Okapi method • Solves the problem of negative IDF • Replaces the original IDF in Okapi with the regular IDF from the pivoted normalization formula (see the numeric sketch below) • Performance is better on verbose queries • Analysis results
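
A quick numeric illustration of why the original Okapi IDF turns negative when df(w) > N/2, and of the swap the slide describes. The collection size here is made up purely for illustration.

```python
# Compare the original Okapi IDF with the regular (pivoted-style) IDF.
import math

N = 1000  # hypothetical collection size
for df in (100, 499, 501, 900):
    okapi_idf = math.log((N - df + 0.5) / (df + 0.5))  # original Okapi IDF
    regular_idf = math.log((N + 1) / df)               # IDF used in pivoted normalization
    print(df, round(okapi_idf, 3), round(regular_idf, 3))
# For df > N/2 (df=501, df=900) the Okapi IDF is negative, while the regular
# IDF stays positive; the modified Okapi uses the latter.
```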

  18. Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method • Retrieval function • Uses Dirichlet prior smoothing to smooth the document language model • Ranks documents by the likelihood of the query under each document's estimated language model
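
A sketch of the Dirichlet prior retrieval function in its usual ranking-equivalent form, as analyzed in the paper; the names p_wC (for p(w|C)) and mu, and the default value, are my assumptions.

```python
# Sketch of the Dirichlet prior retrieval function:
# sum over matched query terms of qtf * ln(1 + tf / (mu * p(w|C)))
# plus |q| * ln(mu / (|d| + mu)).
import math

def dirichlet_score(query_terms, doc_terms, doc_len, p_wC, mu=2000):
    query_len = sum(query_terms.values())
    score = query_len * math.log(mu / (doc_len + mu))   # length/smoothing term
    for w, qtf in query_terms.items():
        tf = doc_terms.get(w, 0)
        if tf > 0:
            score += qtf * math.log(1 + tf / (mu * p_wC[w]))
    return score
```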

  19. Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method • Analysis • The LNC2 constraint is equivalent to c(w,d2) ≥ |d2| · p(w|C), which is usually satisfied for content-carrying words • The TDC constraint leads to a lower bound on the smoothing parameter μ

  20. Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method • Analysis • TDC: consider the common case where p(w2|C) = 1/avdl • This means that for discriminative words with a high term frequency in a document, μ needs to be sufficiently large in order to balance TF and IDF appropriately

  21. Experiments: Setup • Document sets • AP: news articles, DOE: technical reports, FR: government documents • ADF: combination of AP, DOE, and FR • Web: web data used in TREC8 • Trec7: ad hoc data used in TREC7 • Trec8: ad hoc data used in TREC8

  22. Experiments: Setup • Query types • Short-keyword (SK, keyword title) • Short-verbose (SV, one-sentence description) • Long-keyword (LK, keyword list) • Long-verbose (LV, multiple sentences) • Preprocessing • Only stemming with the Porter stemmer • No stop words were removed
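
The preprocessing described on this slide (Porter stemming only, stop words kept) might look roughly like the sketch below. Using NLTK's PorterStemmer and whitespace tokenization is my assumption; the paper only states that Porter stemming was applied and no stop words were removed.

```python
# Sketch of the slide's preprocessing: lowercase, tokenize, Porter-stem,
# and deliberately keep stop words.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text):
    tokens = text.lower().split()                 # simple whitespace tokenization
    return [stemmer.stem(t) for t in tokens]      # stem everything, keep stop words

print(preprocess("A formal study of information retrieval heuristics"))
```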

  23. Experiments: Parameter Sensitivity • Pivoted normalization method • The analysis of the LNC2 constraint for the pivoted normalization method suggests that s should be smaller than 0.4

  24. Experiments: Parameter Sensitivity • Okapi method: k1 = 1.2, k3 = 1000, b varies from 0.1 to 1.0 (see the sweep sketch below)
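
The sensitivity experiment on this slide amounts to sweeping b with k1 and k3 fixed and measuring retrieval performance at each setting. A hedged sketch of such a sweep is below; `retrieve` and `mean_average_precision` are hypothetical placeholders for an actual retrieval run and a TREC-style evaluation, not real APIs.

```python
# Hypothetical sketch of the b-sensitivity sweep: k1 and k3 fixed at the
# slide's values, b varied from 0.1 to 1.0. The callables passed in are
# placeholders supplied by the caller.
def sweep_b(queries, collection, qrels, retrieve, mean_average_precision):
    results = {}
    for step in range(1, 11):                      # b = 0.1, 0.2, ..., 1.0
        b = step / 10
        ranking = retrieve(queries, collection, k1=1.2, b=b, k3=1000)
        results[b] = mean_average_precision(ranking, qrels)
    return results
```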

  25. Experiments: Parameter Sensitivity • Dirichlet prior method

  26. Experiments: Parameter Sensitivity • Dirichlet prior method

  27. Experiments: Performance Comparison

  28. Experiments: Performance Comparison • For any query type, the performance of the Dirichlet prior method is comparable to that of the pivoted normalization method • For keyword queries, the performance of Okapi is comparable to the other two retrieval formulas • For verbose queries, the performance of Okapi may be worse than the others due to the possibly negative IDF part of the formula

  29. Experiments: Performance Comparison • Average precision comparison

  30. Conclusion and Future Work • Defined six basic constraints that any reasonable retrieval function should satisfy • When a constraint is not satisfied, it often indicates non-optimality of the method
