This document explores the concepts of Probabilistic Information Retrieval (PIR) as applied to database exploration. It discusses the fundamentals of probability, including Bayes' Theorem, and their implications for ranking documents by relevance to user queries. Working mechanisms such as scoring functions, keyword queries, and relevance feedback are highlighted, along with the challenges of applying PIR to complex databases compared to document-based information retrieval. Key insights on ranking, user preferences, and the complexities of data attributes are presented.
Probabilistic Information Retrieval
CSE6392 - Database Exploration, Gautam Das
Thursday, March 29, 2006
Z.M. Joseph, Spring 2006, CSE, UTA
Basic Rules of Probability
• Recall the product rule: P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
• Bayes' Theorem: P(A|B) = P(B|A) P(A) / P(B)
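As a quick numeric illustration of these two rules, here is a small check with made-up probabilities (the values are arbitrary, not from the slides):

```python
# Tiny numeric check of the product rule and Bayes' Theorem with made-up values.

p_A = 0.3           # P(A)
p_B_given_A = 0.8   # P(B|A)
p_B = 0.5           # P(B)

# Product rule: P(A and B) = P(B|A) * P(A)
p_A_and_B = p_B_given_A * p_A

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B

print(p_A_and_B)    # ~0.24
print(p_A_given_B)  # ~0.48
```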
Basic Assumptions
• Assume a database D consisting of a set of objects: documents, tuples, etc.
• Q: a query
• R: the 'relevant set' of objects for Q
• The goal is to find R for each Q, given D.
• Instead of a deterministic answer set, consider a probabilistic ordering: a ranking/scoring function should reflect the degree of relevance of each document.
• Thus, given a document d: Score(d) = P(R|d)   [1]
• According to this, if the relevance set were known exactly, the members of R would get probability 1 (the maximum score) and all other documents would get probability 0.
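As a minimal sketch of what ranking by such a score looks like, assuming the relevance probabilities P(R|d) were somehow known (the values below are placeholders):

```python
# Minimal sketch: ranking documents by an assumed-known relevance probability P(R|d).
# In practice P(R|d) must be estimated; these values are placeholders.

docs = {"d1": 0.9, "d2": 0.2, "d3": 0.6}   # hypothetical P(R|d) values

ranking = sorted(docs, key=docs.get, reverse=True)
print(ranking)  # ['d1', 'd3', 'd2']
```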
Simplification
• From [1], take the ratio of the probability that the document is in R to the probability that it is not:
  Score(d) = P(R|d) / P(¬R|d)
• This retains the original ordering, while also factoring in the elements of D that lie outside R.
Applying Bayes' Theorem
• Applying Bayes' Theorem to the numerator and the denominator:
  Score(d) = P(R|d) / P(¬R|d) = [P(d|R) P(R)] / [P(d|¬R) P(¬R)]
• Since P(R) and P(¬R) are the same for every document, dropping them does not change the ordering, leaving P(d|R) / P(d|¬R).
Observations
• This forms the scoring function.
• The equation still involves R, which we do not know.
• Nevertheless, the ordering of documents remains the same when this equation is used as a scoring function (a small numeric check follows).
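The observation that the constant factor P(R)/P(¬R) drops out without affecting the ranking can be checked numerically. A minimal sketch, with made-up likelihood ratios and prior odds that are not from the slides:

```python
# Numeric check: a constant factor such as P(R)/P(not R) does not change the ordering.
# The likelihood ratios below are made-up values for three documents.

likelihood_ratio = {"d1": 4.0, "d2": 0.5, "d3": 2.0}   # P(d|R) / P(d|not R)
prior_odds = 0.25                                      # hypothetical P(R) / P(not R)

with_prior = sorted(likelihood_ratio, key=lambda d: likelihood_ratio[d] * prior_odds, reverse=True)
without_prior = sorted(likelihood_ratio, key=likelihood_ratio.get, reverse=True)

print(with_prior == without_prior)  # True: the ranking is identical either way
```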
Derivation for Keyword Queries
• Now assume the query is a vector of words; a word that does not occur in a document contributes nothing to that document's score.
• Then, applying the previous equation to each word w (instead of to the document as a whole), and treating the words as independent, combining all the words of the query gives:
  Score(d) = ∏ over query words w occurring in d of P(w|R) / P(w|¬R)
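A minimal code sketch of this per-word scoring, assuming per-word estimates of P(w|R) and P(w|¬R) are available; the dictionaries, function name, and numbers are illustrative, not statistics from the course:

```python
# Sketch of the keyword-query scoring derived above, assuming independent words.
# p_rel[w] estimates P(w|R); p_nrel[w] estimates P(w|not R). All values are made up.

def score(doc_words, query_words, p_rel, p_nrel):
    s = 1.0
    for w in query_words:
        if w in doc_words:          # words absent from the document contribute nothing
            s *= p_rel[w] / p_nrel[w]
    return s

p_rel = {"database": 0.6, "exploration": 0.4}
p_nrel = {"database": 0.2, "exploration": 0.1}
print(score({"database", "exploration", "systems"}, ["database", "exploration"], p_rel, p_nrel))
# ~12.0 (= 0.6/0.2 * 0.4/0.1)
```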
Search for "Microsoft Corporation"
• The scoring expression for this query would be:
  Score(d) = [P(Microsoft|R) / P(Microsoft|¬R)] × [P(Corporation|R) / P(Corporation|¬R)]
  with each factor included only if the word occurs in d.
• Assume you had two documents:
• D1: contains 'Microsoft' but not 'Corporation'
• D2: contains 'Corporation' but not 'Microsoft'
• Thus Score(D1) is determined by the 'Microsoft' factor alone, and Score(D2) by the 'Corporation' factor alone.
Search for "Microsoft Corporation"
• Since R is unknown, P(w|¬R) is approximated by P(w|D), the word's overall frequency in the database (most of D is not relevant).
• Because 'Corporation' is more common in the database D, P(Corporation|D) will be far higher than P(Microsoft|D).
• Thus Score(D1) will be higher than Score(D2).
• The document containing 'Microsoft' therefore gets the higher ranking, since 'Microsoft' is more specific (rarer) than 'Corporation'.
• This behavior is similar to Vector Space ranking by relevance.
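A worked numeric version of this example, as a sketch assuming the approximation above (P(w|¬R) ≈ P(w|D)) and made-up word frequencies; none of the numbers come from the slides:

```python
# Worked "Microsoft Corporation" example with made-up statistics.
# P(w|R) is taken as a constant for query words; P(w|not R) is approximated
# by the word's overall database frequency P(w|D).

p_rel = {"microsoft": 0.5, "corporation": 0.5}     # assumed constant for query words
p_db = {"microsoft": 0.01, "corporation": 0.30}    # hypothetical collection frequencies

def score(doc_words):
    s = 1.0
    for w in ("microsoft", "corporation"):
        if w in doc_words:
            s *= p_rel[w] / p_db[w]
    return s

d1 = {"microsoft"}      # contains 'Microsoft' but not 'Corporation'
d2 = {"corporation"}    # contains 'Corporation' but not 'Microsoft'

print(score(d1))  # ~50.0 -> the rarer, more specific word yields the higher score
print(score(d2))  # ~1.67
```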
Relevance Feedback
• R can be fine-tuned iteratively by getting user feedback on the initial rankings.
• Once a better estimate of R is known, better scoring and ranking of matches is possible.
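One simple way such feedback could be folded back into the word probabilities is to re-estimate P(w|R) from the documents the user marks as relevant. This is a minimal sketch with an arbitrary add-one smoothing choice, not the exact estimator from the lecture:

```python
# Sketch of relevance feedback: re-estimate P(w|R) from user-marked relevant documents.
# The add-one smoothing below is an illustrative choice, not the course's prescribed method.

def estimate_p_rel(relevant_docs, vocabulary):
    n = len(relevant_docs)
    return {w: (sum(w in d for d in relevant_docs) + 1) / (n + 2) for w in vocabulary}

feedback = [{"microsoft", "windows"}, {"microsoft", "office"}]   # documents marked relevant
print(estimate_p_rel(feedback, {"microsoft", "corporation"}))
# {'microsoft': 0.75, 'corporation': 0.25}  (key order may vary)
```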
PIR Applied to Databases
• PIR was originally applied to documents, not to databases.
• Applying PIR to databases is not easy, because various aspects are difficult to capture. These include:
• Different values of an attribute: PIR is based on words in a document; in a database, the fact that a car is blue, black, etc. is not as easily captured. Would you assign each color value as a keyword? (a small sketch of this idea follows)
• What to sacrifice in the ranking is also not easy to capture: if a user's preference is black cars, how is PIR applied when listing results that do not match entirely?
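One hypothetical way to make attribute values look like keywords, purely to illustrate the question raised above (not a method proposed in the course), is to encode each attribute/value pair of a tuple as a pseudo-keyword:

```python
# Illustrative sketch: represent a database tuple as a bag of pseudo-keywords
# so that document-style PIR scoring could, in principle, be reused.

def tuple_to_keywords(row):
    # Each attribute=value pair becomes one keyword, e.g. "color=black".
    return {f"{attr}={value}" for attr, value in row.items()}

car = {"make": "Honda", "color": "black", "year": 2004}
print(tuple_to_keywords(car))
# {'make=Honda', 'color=black', 'year=2004'}  (set order may vary)
```

Even with such an encoding, the slide's second concern remains: a pseudo-keyword match is all-or-nothing, so partial matches (a user who prefers black cars looking at a blue one) still need some notion of graded relevance.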