
IR Models: Review Vector Model and Probabilistic


  1. IR Models: Review Vector Model and Probabilistic

  2. IR Model Taxonomy • User task: Retrieval (ad hoc, filtering) and Browsing • Classic Models: Boolean, Vector, Probabilistic • Set Theoretic: Fuzzy, Extended Boolean • Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks • Probabilistic: Inference Network, Belief Network • Structured Models: Non-Overlapping Lists, Proximal Nodes • Browsing: Flat, Structure Guided, Hypertext

  3. Classic IR Models - Basic Concepts • Each document is represented by a set of representative keywords or index terms • An index term is a document word useful for recalling the document's main themes • The importance of an index term is represented by a weight associated with it • Let ki be an index term, dj be a document, and wij be the weight associated with the pair (ki, dj) • The weight wij quantifies the importance of index term ki for describing the contents of document dj
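
As a concrete illustration of these (ki, dj, wij) triples, a minimal Python sketch assuming an in-memory index; the document names, terms, and weight values are invented for the example:

```python
# Hypothetical toy index holding the (ki, dj, wij) triples described above:
# weights[dj][ki] = wij. All names and values here are made up.
weights = {
    "d1": {"information": 0.8, "retrieval": 0.6},
    "d2": {"retrieval": 0.4, "model": 0.9},
}

# wij is taken to be 0 when the index term ki does not occur in document dj.
print(weights["d2"].get("information", 0.0))  # -> 0.0
```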

  4. Vector Model Similarity (figure: query q and document dj drawn as vectors separated by angle θ) • Sim(q,dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| * |q|) = [Σi wij * wiq] / (|dj| * |q|) • TF-IDF term-weighting scheme for documents: • wij = (freq(i,j) / maxl freq(l,j)) * log(N / ni) • Default query term weights: • wiq = (0.5 + 0.5 * freq(i,q) / maxl freq(l,q)) * log(N / ni) • A runnable sketch follows below
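
A minimal Python sketch of the weighting and ranking scheme above, assuming a toy in-memory collection; the function names (doc_weights, query_weights, cosine) and the sample data are invented for illustration:

```python
import math

def doc_weights(freqs, df, N):
    # wij = (freq(i,j) / max_l freq(l,j)) * log(N / ni)
    max_f = max(freqs.values())
    return {t: (f / max_f) * math.log(N / df[t]) for t, f in freqs.items()}

def query_weights(freqs, df, N):
    # wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / ni)
    max_f = max(freqs.values())
    return {t: (0.5 + 0.5 * f / max_f) * math.log(N / df[t])
            for t, f in freqs.items()}

def cosine(dw, qw):
    # Sim(q,dj) = sum_i wij * wiq / (|dj| * |q|)
    dot = sum(w * qw[t] for t, w in dw.items() if t in qw)
    nd = math.sqrt(sum(w * w for w in dw.values()))
    nq = math.sqrt(sum(w * w for w in qw.values()))
    return dot / (nd * nq) if nd and nq else 0.0

# Toy collection: N docs in total; df[t] = nt, the number of docs containing t.
N = 4
df = {"vector": 2, "model": 3, "query": 1}
d = doc_weights({"vector": 3, "model": 1}, df, N)
q = query_weights({"vector": 1, "query": 1}, df, N)
print(cosine(d, q))  # ~0.44 for this toy data
```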

  5. Example 1: no weights (figure: documents d1–d7 plotted in the space of terms k1, k2, k3)

  6. Example 2: query weights (figure: the same documents d1–d7 in the term space k1, k2, k3, now with a weighted query)

  7. Example 3: query and document weights (figure: documents d1–d7 in the term space k1, k2, k3, with both query and document weights)

  8. Summary of Vector Space Model • Advantages: • term weighting improves the quality of the answer set • partial matching allows retrieval of docs that approximate the query conditions • the cosine ranking formula sorts documents according to their degree of similarity to the query • Disadvantages: • assumes the index terms are mutually independent; it is not clear, though, that this assumption is actually harmful

  9. Probabilistic Model • Objective: capture the IR problem within a probabilistic framework • Given a user query, there is an ideal answer set • Querying is then the specification of the properties of this ideal answer set (a clustering view) • But what are these properties? • Guess what they could be at the beginning (i.e., guess an initial description of the ideal answer set) • Improve the description by iteration

  10. Probabilistic Model • An initial set of documents is retrieved somehow • User inspects these docs looking for the relevant ones (in truth, only top 10-20 need to be inspected) • IR system uses this information to refine description of ideal answer set • By repeating this process, it is expected that the description of the ideal answer set will improve • Description of ideal answer set is modeled in probabilistic terms

  11. Probabilistic Ranking Principle • Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj relevant. • The model assumes that this probability of relevance depends on the query and the document representations only. • Ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant. • But, • how to compute probabilities?

  12. The Ranking • Probabilistic ranking is computed as: • sim(q,dj) = P(dj relevant to q) / P(dj non-relevant to q) • This is the odds of the document dj being relevant • Taking the odds minimizes the probability of an erroneous judgement • Definitions: • wij ∈ {0,1}, wiq ∈ {0,1} (term weights are binary) • P(R | vec(dj)): probability that the given doc is relevant • P(R̄ | vec(dj)): probability that the given doc is not relevant

  13. The Ranking • sim(dj,q) = P(R | vec(dj)) / P(R̄ | vec(dj)) • = [P(vec(dj) | R) * P(R)] / [P(vec(dj) | R̄) * P(R̄)] (by Bayes' rule) • ~ P(vec(dj) | R) / P(vec(dj) | R̄) (P(R) and P(R̄) are the same for all documents) • P(vec(dj) | R): probability of randomly selecting the document dj from the set R of relevant documents

  14. The Ranking • sim(dj,q) ~ P(vec(dj) | R) / P(vec(dj) | R̄) ~ [Π ki∈dj P(ki | R) × Π ki∉dj P(k̄i | R)] / [Π ki∈dj P(ki | R̄) × Π ki∉dj P(k̄i | R̄)] • P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents • P(k̄i | R): probability that ki is absent from such a document • Assumes the independence of index terms.

  15. The Ranking • sim(dj,q) ~ [Π ki∈dj P(ki | R) × Π ki∉dj P(k̄i | R)] / [Π ki∈dj P(ki | R̄) × Π ki∉dj P(k̄i | R̄)] • math happens (sketched below) ... • ~ Σi wiq * wij * ( log [P(ki | R) / (1 − P(ki | R))] + log [(1 − P(ki | R̄)) / P(ki | R̄)] ) • where P(k̄i | R) = 1 − P(ki | R) and P(k̄i | R̄) = 1 − P(ki | R̄)
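
The "math happens" step is not shown on the slide; a sketch of the standard manipulation (take logs of the product form, substitute P(k̄i | R) = 1 − P(ki | R), and drop factors that are identical for every document, since they do not affect the ranking), written here in LaTeX:

```latex
\begin{align*}
\log \mathrm{sim}(d_j, q)
  &\sim \sum_{k_i \in d_j} \log \frac{P(k_i \mid R)}{P(k_i \mid \bar{R})}
      + \sum_{k_i \notin d_j} \log \frac{1 - P(k_i \mid R)}{1 - P(k_i \mid \bar{R})} \\
  % subtracting the document-independent constant
  % \sum_i \log [(1 - P(k_i \mid R)) / (1 - P(k_i \mid \bar{R}))]
  % turns the second sum into a correction on the first:
  &\sim \sum_{i} w_{iq}\, w_{ij}
      \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
           + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
\end{align*}
```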

  16. The Initial Ranking • sim(dj,q) ~ Σi wiq * wij * ( log [P(ki | R) / (1 − P(ki | R))] + log [(1 − P(ki | R̄)) / P(ki | R̄)] ) • How to get the probabilities P(ki | R) and P(ki | R̄) before any feedback? • Estimates based on simple assumptions: • P(ki | R) = 0.5 • P(ki | R̄) = ni / N, where ni is the number of docs that contain ki • Use this initial guess to retrieve an initial ranking, then improve upon it (see the sketch below)
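
A small sketch of the resulting initial term weight, under exactly these assumptions; the helper name initial_weight is invented for illustration:

```python
import math

def initial_weight(ni, N):
    # Initial estimates from the slide: P(ki|R) = 0.5, P(ki|R_bar) = ni / N.
    p_rel, p_nonrel = 0.5, ni / N
    return (math.log(p_rel / (1 - p_rel))
            + math.log((1 - p_nonrel) / p_nonrel))

# With P(ki|R) = 0.5 the first log term vanishes, so the initial weight
# reduces to log((N - ni) / ni), an idf-like quantity.
print(initial_weight(ni=10, N=1000))  # log(99) ~ 4.6
```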

  17. Improving the Initial Ranking • sim(dj,q) ~ Σi wiq * wij * ( log [P(ki | R) / (1 − P(ki | R))] + log [(1 − P(ki | R̄)) / P(ki | R̄)] ) • Let • V: the number of docs initially retrieved (these are treated as relevant, even though we cannot be sure) • Vi: the number of retrieved docs that contain ki • Re-estimate: • P(ki | R) = Vi / V • P(ki | R̄) = (ni − Vi) / (N − V) • Repeat recursively

  18. Improving the Initial Ranking • sim(dj,q) ~ Σi wiq * wij * ( log [P(ki | R) / (1 − P(ki | R))] + log [(1 − P(ki | R̄)) / P(ki | R̄)] ) • Need to avoid problems with small retrieved sets (e.g. when V = 1 and Vi = 0), so add a small smoothing factor: • P(ki | R) = (Vi + 0.5) / (V + 1) • P(ki | R̄) = (ni − Vi + 0.5) / (N − V + 1) • Alternatively, use the term's relative frequency in the corpus instead of 0.5: • P(ki | R) = (Vi + ni/N) / (V + 1) • P(ki | R̄) = (ni − Vi + ni/N) / (N − V + 1) • These estimates are sketched in code below
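
A minimal sketch of the smoothed re-estimation step, assuming V and Vi are counts read off the inspected ranking; the names reestimate and term_weight are invented for illustration:

```python
import math

def reestimate(V, Vi, ni, N):
    # Smoothed feedback estimates from the slide (the 0.5 variant), where V
    # docs were retrieved and inspected and Vi of them contain the term ki:
    p_rel = (Vi + 0.5) / (V + 1)              # P(ki | R)
    p_nonrel = (ni - Vi + 0.5) / (N - V + 1)  # P(ki | R_bar)
    return p_rel, p_nonrel

def term_weight(p_rel, p_nonrel):
    # Log-odds term weight used in the ranking formula above.
    return (math.log(p_rel / (1 - p_rel))
            + math.log((1 - p_nonrel) / p_nonrel))

# e.g. 10 docs retrieved, 4 contain ki; ki occurs in 50 of the N = 1000 docs.
print(term_weight(*reestimate(V=10, Vi=4, ni=50, N=1000)))  # ~2.6
```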

  19. Probabilistic Model • Advantages: • documents are ranked in decreasing order of their probability of relevance • Disadvantages: • need to guess the initial estimates for P(ki | R) and P(ki | R̄) • the method does not take tf and idf factors into account • New use for the model: • meta search • use probabilities to model the value of different search engines for different topics

  20. Brief Comparison of Classic Models • The Boolean model does not provide for partial matches and is considered the weakest classic model • Salton and Buckley ran a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections • This also seems to be the view of the research community
