
Lecture 8: Probabilistic IR and Relevance Feedback



  1. SIMS 202: Information Organization and Retrieval. Lecture 8: Probabilistic IR and Relevance Feedback. Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS. Tuesday and Thursday, 10:30 am - 12:00 pm, Fall 2004. http://www.sims.berkeley.edu/academics/courses/is202/f04/

  2. Lecture Overview • Review • Vector Representation • Term Weights • Vector Matching • Clustering • Probabilistic Models of IR • Relevance Feedback Credit for some of the slides in this lecture goes to Marti Hearst

  3. Lecture Overview • Review • Vector Representation • Term Weights • Vector Matching • Clustering • Probabilistic Models of IR • Relevance Feedback Credit for some of the slides in this lecture goes to Marti Hearst

  4. Document Vectors

  5. Vector Space Documents and Queries. [Figure: documents D1-D11 and a query Q plotted in a space with term axes t1, t2, and t3. Q is a query, also represented as a vector, in contrast to Boolean term combinations.]

  6. Documents in Vector Space. [Figure: documents D1-D11 plotted against term axes t1, t2, and t3.]

  7. Binary Weights • Only the presence (1) or absence (0) of a term is included in the vector

  8. Raw Term Weights • The frequency of occurrence for the term in each document is included in the vector
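A minimal Python illustration of both weighting schemes from slides 7 and 8, using a toy two-document corpus (the corpus and variable names are invented for the example):

    docs = ["cats chase mice", "mice eat cheese and mice squeak"]
    vocab = sorted({w for d in docs for w in d.split()})

    # Raw term weights: frequency of each vocabulary term in each document
    raw = [[d.split().count(t) for t in vocab] for d in docs]

    # Binary weights: 1 if the term occurs at all, 0 otherwise
    binary = [[1 if f > 0 else 0 for f in row] for row in raw]

    print(vocab)
    print(raw)     # "mice" gets weight 2 in the second document
    print(binary)  # the same cell is just 1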

  9. tf*idf weights
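The weighting equation on this slide is an image that did not survive transcription; the standard tf*idf weight it describes is

    w_{ik} = tf_{ik} \cdot \log(N / n_k)

where tf_ik is the frequency of term k in document i, N is the number of documents in the collection, and n_k is the number of documents containing term k.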

  10. Inverse Document Frequency • IDF provides high values for rare words and low values for common words. For a collection of 10000 documents (N = 10000):
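The slide's table of example values is lost, but it is easy to reconstruct; assuming base-10 logarithms and idf_k = log(N / n_k) with N = 10000: a term occurring in a single document gets idf = log(10000/1) = 4; in 100 documents, idf = log(10000/100) = 2; and in all 10000 documents, idf = log(10000/10000) = 0.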

  11. tf*idf Normalization • Normalize the term weights (so longer vectors are not unfairly given more weight) • To normalize usually means to force all values to fall within a certain range, usually between 0 and 1 inclusive
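The normalization equation is an image on the original slide; a standard choice, consistent with the cosine matching that follows, is to divide each tf*idf weight by the length of the document's weight vector:

    w_{ik} = \frac{tf_{ik} \cdot \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 \, [\log(N / n_k)]^2}}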

  12. Vector Space Similarity • Now, the similarity of two documents is given by the formula below • This is also called the cosine, or normalized inner product • The normalization was done when weighting the terms • Note that the w_ik weights can be stored in the vectors/inverted files for the documents
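The similarity equation is also an image on the slide; the standard cosine measure it names is

    sim(Q, D_i) = \frac{\sum_{k=1}^{t} w_{qk} \, w_{ik}}{\sqrt{\sum_{k=1}^{t} w_{qk}^2} \cdot \sqrt{\sum_{k=1}^{t} w_{ik}^2}}

and when the weights have already been normalized, the denominator is 1, leaving a plain inner product.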

  13. Vector Space Matching. Di = (di1, wdi1; di2, wdi2; …; dit, wdit); Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit). [Figure: two-dimensional example with axes Term A and Term B: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7).]
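A minimal Python sketch of this two-term example (the vectors come from the slide; the cosine function is generic):

    import math

    def cosine(u, v):
        # Inner product divided by the product of the vector lengths
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    Q  = (0.4, 0.8)   # query weights on (Term A, Term B)
    D1 = (0.8, 0.3)
    D2 = (0.2, 0.7)

    print(cosine(Q, D1))  # ~0.733
    print(cosine(Q, D2))  # ~0.983 -- D2 outranks D1, as the figure suggests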

  14. Vector Space Visualization

  15. Document/Document Matrix

  16. Text Clustering. Clustering is "the art of finding groups in data." (Kaufman and Rousseeuw) [Figure: document clusters plotted against axes Term 1 and Term 2.]

  17. Problems with Vector Space • There is no real theoretical basis for the assumption of a term space • It is more for visualization than having any real basis • Most similarity measures work about the same regardless of model • Terms are not really orthogonal dimensions • Terms are not independent of all other terms • Retrieval efficiency vs. indexing and update efficiency for stored pre-calculated weights

  18. Lecture Overview • Review • Vector Representation • Term Weights • Vector Matching • Clustering • Probabilistic Models of IR • Relevance Feedback Credit for some of the slides in this lecture goes to Marti Hearst

  19. Probabilistic Models • Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query • Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle) • Relies on accurate estimates of probabilities

  20. Probability Ranking Principle • “If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.” Stephen E. Robertson, J. Documentation 1977

  21. Model 1 – Maron and Kuhns • Concerned with estimating probabilities of relevance at the point of indexing: • If a patron came with a request using term ti, what is the probability that she/he would be satisfied with document Dj?

  22. Model 1 • A patron submits a query (call it Q) consisting of some specification of her/his information need. Different patrons submitting the same stated query may differ as to whether or not they judge a specific document to be relevant. The function of the retrieval system is to compute for each individual document the probability that it will be judged relevant by a patron who has submitted query Q. Robertson, Maron & Cooper, 1982

  23. Model 1 – Bayes • A is the class of events of using the library • Di is the class of events of Document i being judged relevant • Ij is the class of queries consisting of the single term Ij • P(Di|A,Ij) = the probability that, if the single-term query Ij is submitted to the system, document Di will be judged relevant
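The Bayesian decomposition on the slide is an image; in Maron and Kuhns' formulation it inverts the conditional so the quantities can be estimated at indexing time:

    P(D_i \mid A, I_j) = \frac{P(D_i \mid A) \, P(I_j \mid D_i, A)}{P(I_j \mid A)}

where P(I_j | D_i, A), the probability that a patron who would be satisfied by document D_i uses term I_j, is what the indexer estimates when assigning terms.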

  24. Model 2 • Documents have many different properties; some documents have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties. Robertson, Maron & Cooper, 1982

  25. Model 2 – Robertson & Sparck Jones. Given a term t and a query q, the N documents in the collection (R of them relevant) cross-classify by indexing and relevance as follows:

                              Relevant    Not relevant    Total
    Indexed with t (+)        r           n - r           n
    Not indexed with t (-)    R - r       N - n - R + r   N - n
    Total                     R           N - R           N

  26. Robertson-Sparck Jones Weights • Retrospective formulation
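The retrospective weight (an image on the slide) follows directly from the contingency table on slide 25, using the observed counts:

    w = \log \frac{r / (R - r)}{(n - r) / (N - n - R + r)}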

  27. Robertson-Sparck Jones Weights • Predictive formulation
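The predictive form (also an image on the slide) adds 0.5 to each cell so the estimate is defined even when counts are zero; this is the w(1) weight that reappears in BM25 later:

    w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}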

  28. Probabilistic Models: Some Unifying Notation • D = All present and future documents • Q = All present and future queries • (Di,Qj) = A document-query pair • x = class of similar documents • y = class of similar queries • Relevance (R) is a relation: the subset of D × Q containing the pairs (Di,Qj) for which document Di is relevant to query Qj

  29. Probabilistic Models • Model 1 -- Probabilistic Indexing, P(R|y,Di) • Model 2 -- Probabilistic Querying, P(R|Qj,x) • Model 3 -- Merged Model, P(R| Qj, Di) • Model 0 -- P(R|y,x) • Probabilities are estimated based on prior usage or relevance estimation

  30. Probabilistic Models. [Figure: the sets Q (queries) and D (documents), with query Qj inside a class y of similar queries and document Di inside a class x of similar documents.]

  31. Logistic Regression • Another approach to estimating the probability of relevance • Based on work by William Cooper, Fred Gey, and Daniel Dabney • Builds a regression model for relevance prediction based on a set of training data • Uses a less restrictive independence assumption than Model 2: linked dependence

  32. So What’s Regression? • A method for fitting a curve (not necessarily a straight line) through a set of points using some goodness-of-fit criterion • The most common type of regression is linear regression

  33. What’s Regression? • Least Squares Fitting is a mathematical procedure for finding the best fitting curve to a given set of points by minimizing the sum of the squares of the offsets ("the residuals") of the points from the curve • The sum of the squares of the offsets is used instead of the offset absolute values because this allows the residuals to be treated as a continuous differentiable quantity

  34. Logistic Regression. [Figure: an S-shaped logistic curve fitting relevance (0-100) as a function of term frequency in the document (0-60).]

  35. Probabilistic Models: Logistic Regression • Estimates of relevance are based on a log-linear model with various statistical measures of document content as independent variables • The log odds of relevance is a linear function of the attributes • Term contributions are summed • The probability of relevance is obtained by inverting the log odds (see the equations below)
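The equations on this slide are images; the standard form they express, for a query Q and document D with attribute values X_1, ..., X_6 (per-term contributions summed under the linked dependence assumption), is

    \log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i

with the probability recovered by inverting the log odds:

    P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}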

  36. Logistic Regression Attributes • Average absolute query frequency • Query length • Average absolute document frequency • Document length • Average inverse document frequency • Inverse document frequency • Number of terms in common between query and document (logged)

  37. Logistic Regression • The probability of relevance is based on a logistic regression fitted to a sample set of documents to determine the values of the coefficients • At retrieval time the probability estimate is obtained by applying the fitted equation to the 6 X attribute measures shown previously (a sketch follows)
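A minimal Python sketch of retrieval-time scoring under this model; the coefficient values here are purely hypothetical (the real c_0, ..., c_6 come from fitting the regression on training data such as TREC):

    import math

    # Hypothetical coefficients c0..c6; real values are fitted from training data
    C = [-3.7, 1.27, -0.19, 0.93, -0.08, 0.41, 0.84]

    def p_relevance(x):
        # x: the six X attribute measures for one query/document pair
        log_odds = C[0] + sum(c * xi for c, xi in zip(C[1:], x))
        return 1.0 / (1.0 + math.exp(-log_odds))  # invert the log odds

    print(p_relevance([0.5, 2.0, 1.1, 6.2, 3.4, 1.9]))  # example attribute values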

  38. Probabilistic Models: Advantages and Disadvantages • Advantages: strong theoretical basis; in principle should supply the best predictions of relevance given the available information; can be implemented similarly to the vector model • Disadvantages: relevance information is required, or must be "guestimated"; important indicators of relevance may not be terms, though usually only terms are used; optimally requires ongoing collection of relevance information

  39. Vector and Probabilistic Models • Support “natural language” queries • Treat documents and queries the same • Support relevance feedback searching • Support ranked retrieval • Differ primarily in theoretical basis and in how the ranking is calculated • Vector assumes relevance • Probabilistic relies on relevance judgments or estimates

  40. Current Use of Probabilistic Models • Virtually all the major systems in TREC now use the “Okapi BM25 formula” which incorporates the Robertson-Sparck Jones weights…

  41. Okapi BM25 (see the formula below) • Where: • Q is a query containing terms T • K is k1((1 - b) + b · dl/avdl) • k1, b, and k3 are parameters, usually set to 1.2, 0.75, and 7-1000 respectively • tf is the frequency of the term in a specific document • qtf is the frequency of the term in a topic from which Q was derived • dl and avdl are the document length and the average document length, measured in some convenient unit • w(1) is the Robertson-Sparck Jones weight
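The BM25 formula itself is an image on the slide; the standard Okapi form matching the definitions above is

    \sum_{T \in Q} w^{(1)} \cdot \frac{(k_1 + 1) \, tf}{K + tf} \cdot \frac{(k_3 + 1) \, qtf}{k_3 + qtf}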

  42. Language Models • A recent addition to the probabilistic models is "language modeling", which estimates the probability that a query could have been produced by a given document • This is a slight variation on the other probabilistic models that has led to some modest improvements in performance
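A common query-likelihood formulation (not spelled out on the slide): rank documents by the probability that the query's terms were generated from the document's language model, smoothed against the collection model C so unseen terms do not zero out the product:

    P(Q \mid D) = \prod_{t \in Q} \left[ \lambda P(t \mid D) + (1 - \lambda) P(t \mid C) \right]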

  43. Logistic Regression and Cheshire II • The Cheshire II system (see readings) uses Logistic Regression equations estimated from TREC full-text data • Used for a number of production level systems here and in the U.K.

  44. Lecture Overview • Review • Vector Representation • Term Weights • Vector Matching • Clustering • Probabilistic Models of IR • Relevance Feedback Credit for some of the slides in this lecture goes to Marti Hearst

  45. Querying in an IR System. [Diagram: interest profiles and queries, and documents and data, enter the information storage and retrieval system on the storage line. The "rules of the game" (rules for subject indexing plus a thesaurus, which consists of a lead-in vocabulary and an indexing language) govern both formulating the query in terms of descriptors and indexing (descriptive and subject). Profiles/search requests go to Store 1, document representations go to Store 2, and comparison/matching between the two stores yields potentially relevant documents.]

  46. Relevance Feedback in an IR System. [Diagram: the same system as in slide 45, with one addition: selected relevant documents from the output are fed back into query formulation.]

  47. Query Modification • Problem: How to reformulate the query? • Thesaurus expansion: • Suggest terms similar to query terms • Relevance feedback: • Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant

  48. Relevance Feedback • Main idea: modify the existing query based on relevance judgments • Extract terms from relevant documents and add them to the query • And/or re-weight the terms already in the query • Two main approaches: • Automatic (pseudo-relevance feedback) • Interactive: users select relevant documents, and users/the system select terms from an automatically-generated list

  49. Relevance Feedback • Usually do both: expand the query with new terms and re-weight the terms in the query • There are many variations • Usually positive weights for terms from relevant docs • Sometimes negative weights for terms from non-relevant docs • Remove terms that appear ONLY in non-relevant documents (see the sketch below)
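The slides do not name a specific formula, but the classic way to combine expansion and re-weighting is Rocchio's method; a minimal Python sketch, assuming term vectors stored as dicts (the parameter values are conventional defaults, not from the lecture):

    def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
        # Move the query vector toward the centroid of the relevant
        # documents and away from the centroid of the non-relevant ones
        new_q = {t: alpha * w for t, w in query.items()}
        for doc in rel_docs:
            for t, w in doc.items():
                new_q[t] = new_q.get(t, 0.0) + beta * w / len(rel_docs)
        for doc in nonrel_docs:
            for t, w in doc.items():
                new_q[t] = new_q.get(t, 0.0) - gamma * w / len(nonrel_docs)
        # Terms driven negative (e.g. those appearing only in non-relevant
        # documents) are removed rather than kept with negative weights
        return {t: w for t, w in new_q.items() if w > 0}

    q = {"probabilistic": 1.0, "retrieval": 1.0}
    rel = [{"retrieval": 0.8, "okapi": 0.6}]
    nonrel = [{"probabilistic": 0.2, "poker": 0.9}]
    print(rocchio(q, rel, nonrel))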
