
Probabilistic Information Retrieval



Presentation Transcript


  1. Probabilistic Information Retrieval CSE6392 - Database Exploration Gautam Das Thursday, March 29 2006 Z.M. Joseph Spring 2006, CSE, UTA

  2. Basic Rules of Probability • Recall the product rule: • Bayes' Theorem:
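The slide's equations did not survive extraction; the standard statements they refer to are:

  P(A, B) = P(A \mid B)\, P(B)                      (product rule)
  P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}     (Bayes' theorem)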

  3. Basic Assumptions • Assume a database D consisting of a set of objects: documents, tuples, etc. • Q : Query • R : 'Relevant Set' of objects for Q • Goal is to find an R for each Q, given D. • Instead of a deterministic relevant set, consider a probabilistic ordering • The ranking/scoring function should measure the degree of relevance of a document • Thus, given a document d: Score(d) = P(R|d) [1] Under this definition, if R were known exactly, its members would have probability 1 (the maximum score) and all other documents would have probability 0.

  4. Simplification • From [1]: • Take the ratio of the probability that the document is in R to the probability that it is not in R: • This preserves the original ordering while also factoring in the elements of D that lie outside R.
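The slide's equation is missing from the transcript; the odds form it describes is:

  Score(d) = \frac{P(R \mid d)}{P(\bar{R} \mid d)}

where \bar{R} denotes the non-relevant set D \setminus R.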

  5. Applying Bayes' Theorem • Simplify as follows:
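The simplification on this slide is not in the transcript; reconstructed from the surrounding text, it applies Bayes' theorem to the numerator and denominator of the odds form:

  Score(d) = \frac{P(R \mid d)}{P(\bar{R} \mid d)} = \frac{P(d \mid R)\, P(R)}{P(d \mid \bar{R})\, P(\bar{R})} \;\propto\; \frac{P(d \mid R)}{P(d \mid \bar{R})}

The factor P(R)/P(\bar{R}) is the same for every document, so dropping it does not change the ordering.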

  6. Observations • This forms the scoring function • The equation still involves R, which we do not know • Using this equation as the scoring function still produces the same ordering

  7. Derivation for Keyword Queries • Now assume that a query is a vector of words, with zero probability assigned to a word that does not occur. • Then, applying the previous equation to each word w (instead of to the whole document) and combining all the words of the query gives:
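The combined formula is missing from the transcript; under the usual assumption that words occur independently, it takes the form

  Score(d) \;\propto\; \prod_{w \in Q} \frac{P(w \mid R)}{P(w \mid \bar{R})}

with a factor contributed only by the query words that actually appear in d.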

  8. Search for "Microsoft Corporation" • Thus the expression would be: • Assume you have two documents: • D1 : Contains 'Microsoft' but not 'Corporation' • D2 : Contains 'Corporation' but not 'Microsoft' • Thus:
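The expressions on this slide are missing from the transcript; a plausible reconstruction, using the database-wide word probability P(w|D) in place of P(w|\bar{R}) as slide 9 does, is

  Score(D1) \approx \frac{P(\text{Microsoft} \mid R)}{P(\text{Microsoft} \mid D)}, \qquad Score(D2) \approx \frac{P(\text{Corporation} \mid R)}{P(\text{Corporation} \mid D)}

since each document contains only one of the two query words.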

  9. Search for "Microsoft Corporation" • Because 'Corporation' is more common in the database D, P(Corporation|D) will be far higher than P(Microsoft|D). • Thus Score(D1) will be higher than Score(D2). • The document containing 'Microsoft' therefore gets the higher ranking, since 'Microsoft' is a more specific (rarer) word than 'Corporation'. • Similar to vector-space ranking by relevance
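A minimal sketch of this ranking behavior. The corpus statistics below are made up for illustration, and P(w|R) is fixed at 0.5, a common simplification when no relevance information is available, so the ranking is driven entirely by how rare each word is in D; none of these values come from the slides.

```python
# Per-word ratio scoring for the query "Microsoft Corporation" (sketch).

# hypothetical fraction of documents in D containing each word
p_w_given_D = {"microsoft": 0.01, "corporation": 0.20}

P_W_GIVEN_R = 0.5  # assumed constant because R is unknown

def score(doc_words, query_words):
    """Product of P(w|R) / P(w|D) over query words that occur in the document."""
    s = 1.0
    for w in query_words:
        if w in doc_words:
            s *= P_W_GIVEN_R / p_w_given_D[w]
    return s

d1 = {"microsoft"}    # D1: contains 'Microsoft' but not 'Corporation'
d2 = {"corporation"}  # D2: contains 'Corporation' but not 'Microsoft'
query = ["microsoft", "corporation"]

print(score(d1, query))  # 50.0 -> the rarer word yields the higher score
print(score(d2, query))  # 2.5
```

The rarer a word is in the database, the smaller its P(w|D) and the larger its contribution, which is exactly the IDF-like behavior the slide compares to vector-space ranking.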

  10. Relevance Feedback • Can keep fine-tuning R by getting user feedback on the initial rankings. • Once a better estimate of R is known, better scoring and ranking of matches is possible.
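The slide does not give an update formula; one common smoothed way to re-estimate the term probabilities from the documents the user judged relevant, shown here only as an illustration, is

  P(w \mid R) \approx \frac{r_w + 0.5}{|R| + 1}, \qquad P(w \mid \bar{R}) \approx \frac{n_w - r_w + 0.5}{N - |R| + 1}

where N is the number of documents in D, n_w of them contain w, |R| is the number of judged-relevant documents, and r_w of those contain w.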

  11. PIR Applied to Databases • Originally PIR was applied to documents, not to databases • Applying PIR to databases is not easy, as it is difficult to capture various aspects • These include: • Different values of an attribute • PIR is based on the words in a document; in a database, the fact that a car is blue, black, etc. is not as easily captured • Would you assign each color as a keyword? • What to sacrifice in ranking is also not easy to capture: if a user prefers black cars, how should PIR be applied when listing results that do not match entirely?
