
CS 430 / INFO 430 Information Retrieval


Lecture 12

Probabilistic Information Retrieval

Discussion Class 6

This is a long paper. The discussion class will be on Sections 1 and 2 (up to page 16).

Assignment 2

Remember to check the FAQ for hints and answers to questions that people have raised.

Midterm Examination

•See the Examinations page on the Web site

•On Wednesday October 11, 7:30-9:00 p.m., Phillips 203, instead of the discussion class.

•The topics to be examined are all lectures and discussion class readings before the midterm break. See the Web site for a sample paper.

•Laptops may be used to store course materials and your notes, but for no other purposes. Hand calculators are allowed. No communication. No other electronic devices.

Many authors divide the classical methods of information retrieval into three categories:

Boolean (based on set theory)

Vector space (based on linear algebra)

Probabilistic (based on Bayesian statistics)

In practice, the latter two have considerable overlap.

Let a, b be two events, with probability P(a) and P(b).

Independent events

The events a and b are independent if and only if:

P(a b) = P(b) P(a)

Conditional probability

P(a | b) is the probability of a given b, also called the conditional probability of a given b.

Conditional independence

The events a1, ..., an are conditionally independent if and only if:

P(ai | aj) = P(ai) for all i ≠ j

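where ā is the event not a

[Diagram: the sample space is divided into four regions with probabilities x = P(a b), y = P(a b̄), w = P(ā b), z = P(ā b̄).]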

P(a) = x + y

P(b) = w + x

P(a | b) = x / (w + x)

P(a | b) P(b) = P(a b) = P(b | a) P(a)

Independent

a and b are the results of throwing two dice

P(a=5 | b=3) = P(a=5) = 1/6

Not independent

a and b are the results of throwing two dice

t is the sum of the two dice

t = a + b

P(t=8 | a=2) = 1/6

P(t=8 | a=1) = 0
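The two dice examples can be checked by enumerating all 36 outcomes. The following is a minimal sketch, assuming fair dice; the helper names `prob` and `cond_prob` are illustrative and not part of the lecture.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (a, b) of throwing two fair dice.
OUTCOMES = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(event): fraction of outcomes for which event(a, b) is true."""
    return Fraction(sum(1 for a, b in OUTCOMES if event(a, b)), len(OUTCOMES))


def cond_prob(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    joint = prob(lambda a, b: event(a, b) and given(a, b))
    return joint / prob(given)


# Independent: the value of b tells us nothing about a.
p_a5 = prob(lambda a, b: a == 5)
p_a5_given_b3 = cond_prob(lambda a, b: a == 5, lambda a, b: b == 3)

# Not independent: the sum t = a + b depends on a.
p_t8 = prob(lambda a, b: a + b == 8)
p_t8_given_a2 = cond_prob(lambda a, b: a + b == 8, lambda a, b: a == 2)
p_t8_given_a1 = cond_prob(lambda a, b: a + b == 8, lambda a, b: a == 1)

print(p_a5, p_a5_given_b3)                  # 1/6 1/6
print(p_t8, p_t8_given_a2, p_t8_given_a1)   # 5/36 1/6 0
```

Conditioning on b leaves the probability of a = 5 unchanged, whereas conditioning on a changes the probability of t = 8 from 5/36 to 1/6 or to 0.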

Notation

Let a, b be two events.

P(a | b) is the probability of a given b

Bayes Theorem

$$P(a \mid b) = \frac{P(b \mid a)\,P(a)}{P(b)}$$

Derivation

P(a | b) P(b) = P(a b) = P(b | a) P(a)

Terminology used with Bayes Theorem

P(a) is called the prior probability of a

P(a | b) is called the posterior probability of a given b

Example

a Weight over 200 lb.

b Height over 6 ft.

P(a | b) = x / (w+x) = x / P(b)

P(b | a) = x / (x+y) = x / P(a)

x is P(a b)

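[Venn diagram: two overlapping regions, "Over 200 lb" (a) and "Over 6 ft" (b). x is the overlap, y is over 200 lb only, w is over 6 ft only, and z is neither.]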
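As a numeric check of the weight and height example, the sketch below assigns hypothetical probabilities to the four regions x, y, w, z (the values are illustrative only and do not come from the lecture) and confirms that Bayes Theorem reproduces P(a | b).

```python
from fractions import Fraction

# Hypothetical region probabilities (illustrative only, not from the lecture).
x = Fraction(5, 100)   # P(a b): over 200 lb and over 6 ft
y = Fraction(10, 100)  # over 200 lb only
w = Fraction(15, 100)  # over 6 ft only
z = Fraction(70, 100)  # neither
assert x + y + w + z == 1

p_a = x + y            # P(a)
p_b = w + x            # P(b)
p_a_given_b = x / p_b  # P(a | b) = x / (w + x)
p_b_given_a = x / p_a  # P(b | a) = x / (x + y)

# Bayes Theorem: P(a | b) = P(b | a) P(a) / P(b)
bayes = p_b_given_a * p_a / p_b
print(p_a_given_b, bayes)  # 1/4 1/4
```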

"If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data is made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."

W.S. Cooper

Basic concept:

"For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents.

"By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically."

Van Rijsbergen

Basic concept:

The probability that a document is relevant to a query is assumed to depend on the terms in the query and the terms used to index the document, only.

Given a user query q, the ideal answer set, R, is the set of all relevant documents.

Given a user query q and a document d in the collection, the probabilistic model estimates the probability that the user will find d relevant, i.e., that d is a member of R.

Initial probabilities:

Given a query q and a document d, the model needs an estimate of the probability that the user finds d relevant, i.e., P(R | d).

Similarity measure:

S(d, q), the similarity of d to q, is the ratio:

$$S(d, q) = \frac{\text{probability that } d \text{ is relevant to } q}{\text{probability that } d \text{ is not relevant to } q}$$

This measure runs from near zero, if the probability is small that the document is relevant, to large as the probability of relevance approaches one.

In practice it is often convenient to use S' = log(S).
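The behavior of this ratio and of its logarithm can be seen with a few sample probabilities. This is a small sketch; the function name `similarity` and the sample values are assumptions made for illustration.

```python
import math


def similarity(p_relevant):
    """S = P(relevant) / P(not relevant), for 0 <= p_relevant < 1."""
    return p_relevant / (1.0 - p_relevant)


# S is near zero for small probabilities, 1 at p = 0.5, and large near p = 1.
# S' = log(S) is negative, zero, and positive respectively.
for p in (0.01, 0.5, 0.99):
    s = similarity(p)
    print(p, s, math.log(s))
```

Because the logarithm is monotonic, ranking documents by S' gives the same order as ranking by S.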

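$$S(d, q) = \frac{P(R \mid d)}{P(\bar{R} \mid d)} = \frac{P(d \mid R)\,P(R)}{P(d \mid \bar{R})\,P(\bar{R})} \quad \text{by Bayes Theorem}$$

$$= \frac{P(d \mid R)}{P(d \mid \bar{R})} \times k \quad \text{where } k \text{ is constant}$$

Here R̄ is the complement of R, the set of documents that are not relevant to q.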

P(d | R) is the probability of randomly selecting d from R.

Let x = (x1, x2, ... xn) be the term incidence vector for d.

xi = 1 if term i is in the document and 0 otherwise.

We estimate P(d | R) by P(x | R)

If the index terms are independent

P(x | R) = P(x1, x2, ..., xn | R)

= P(x1 | R) P(x2 | R) ... P(xn | R)

= ∏ P(xi | R)

{This is known as the Naive Bayes probabilistic model.}
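Under the independence assumption, P(x | R) is a product with one factor per term. The sketch below computes it for a small incidence vector; the function name and the probabilities are hypothetical values chosen for illustration.

```python
import math


def prob_vector_given_class(x, p):
    """P(x | R) under term independence.

    x: term incidence vector, x[i] is 1 if term i is in the document, else 0.
    p: p[i] = P(xi = 1 | R), so P(xi = 0 | R) = 1 - p[i].
    """
    return math.prod(pi if xi == 1 else 1.0 - pi for xi, pi in zip(x, p))


# Hypothetical values for a three-term vocabulary.
p_relevant = [0.8, 0.5, 0.1]
print(prob_vector_given_class([1, 0, 1], p_relevant))  # 0.8 * 0.5 * 0.1
```

Note that `math.prod` requires Python 3.8 or later.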

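$$S(d, q) = k\,\frac{\prod_i P(x_i \mid R)}{\prod_i P(x_i \mid \bar{R})}$$

Since the xi are either 0 or 1, this can be written:

$$S = k \prod_{x_i = 1} \frac{P(x_i = 1 \mid R)}{P(x_i = 1 \mid \bar{R})} \prod_{x_i = 0} \frac{P(x_i = 0 \mid R)}{P(x_i = 0 \mid \bar{R})}$$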

For terms that appear in the query let

pi = P(xi = 1 | R)

ri = P(xi = 1 | R̄)

For terms that do not appear in the query assume

pi = ri

so that the factors for terms with qi = 0 are equal to 1. Then:

$$S = k \prod_{x_i = q_i = 1} \frac{p_i}{r_i} \prod_{x_i = 0,\ q_i = 1} \frac{1 - p_i}{1 - r_i}$$

$$= k \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}$$

The second product is constant for a given query.
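The rearrangement of S into a document-dependent product and a query-only product uses one intermediate step: multiply and divide by the product of (1 - pi)/(1 - ri) over the terms with xi = qi = 1. A worked version, using only the symbols defined above:

```latex
\begin{align*}
S &= k \prod_{x_i = q_i = 1} \frac{p_i}{r_i}
       \prod_{x_i = 0,\; q_i = 1} \frac{1 - p_i}{1 - r_i} \\
  &= k \prod_{x_i = q_i = 1} \frac{p_i}{r_i} \cdot \frac{1 - r_i}{1 - p_i}
       \prod_{x_i = q_i = 1} \frac{1 - p_i}{1 - r_i}
       \prod_{x_i = 0,\; q_i = 1} \frac{1 - p_i}{1 - r_i} \\
  &= k \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)}
       \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}
\end{align*}
```

The last two products in the middle line together cover every query term, whether or not it occurs in the document, which is why the final product depends only on the query.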

Taking logs and ignoring factors that are constant for a given query, we have:

$$S' = \log(S) = \sum \log\left\{ \frac{p_i (1 - r_i)}{(1 - p_i)\, r_i} \right\}$$

where the summation is taken over those terms that appear in both the query and the document.
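A minimal sketch of this score, assuming the per-term probabilities pi and ri are already known. The function name `bir_score`, the dictionaries `p` and `r`, and all numeric values are hypothetical and chosen only to illustrate the sum of log weights over terms shared by the query and the document.

```python
import math


def bir_score(doc_terms, query_terms, p, r):
    """S' = sum over terms in both query and document of
    log( p_i (1 - r_i) / ((1 - p_i) r_i) ).

    p[t] = P(term t present | relevant)
    r[t] = P(term t present | not relevant)
    """
    score = 0.0
    for t in set(doc_terms) & set(query_terms):
        score += math.log(p[t] * (1 - r[t]) / ((1 - p[t]) * r[t]))
    return score


# Hypothetical probabilities for two query terms.
p = {"retrieval": 0.8, "probabilistic": 0.6}
r = {"retrieval": 0.2, "probabilistic": 0.1}
query = ["probabilistic", "retrieval"]

d1 = ["probabilistic", "retrieval", "model"]
d2 = ["retrieval", "system"]
print(bir_score(d1, query, p, r) > bir_score(d2, query, p, r))  # True
```

Document d1 matches both query terms and so ranks above d2, which matches only one.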

Suppose that, in the term vector space, document d is represented by a vector that has component in dimension i of:

$$\log\left\{ \frac{p_i (1 - r_i)}{(1 - p_i)\, r_i} \right\}$$

for each term i that occurs in d (and 0 otherwise), and the query q is represented by a vector with value 1 in each dimension that corresponds to a term in the query.

Then the Binary Independence Retrieval similarity, S', is the inner product of these two vectors.

Thus this approach can be considered as a probabilistic way of determining term weights in the vector space model.
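The vector space view can be sketched as follows: the document vector holds the log weight for each term it contains, the query vector is binary, and S' is their inner product. The vocabulary, the probabilities, and the helper names are all hypothetical.

```python
import math

vocabulary = ["model", "probabilistic", "retrieval", "system"]

# Hypothetical term probabilities:
# p = P(term present | relevant), r = P(term present | not relevant).
p = {"model": 0.5, "probabilistic": 0.6, "retrieval": 0.8, "system": 0.5}
r = {"model": 0.5, "probabilistic": 0.1, "retrieval": 0.2, "system": 0.5}


def term_weight(t):
    """log( p (1 - r) / ((1 - p) r) ) for term t."""
    return math.log(p[t] * (1 - r[t]) / ((1 - p[t]) * r[t]))


def document_vector(doc_terms):
    """Weight for each vocabulary term in the document, 0 for absent terms."""
    return [term_weight(t) if t in doc_terms else 0.0 for t in vocabulary]


def query_vector(query_terms):
    """1 in each dimension that corresponds to a query term, else 0."""
    return [1.0 if t in query_terms else 0.0 for t in vocabulary]


def inner_product(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


d = document_vector({"probabilistic", "retrieval", "model"})
q = query_vector({"probabilistic", "retrieval"})
print(inner_product(d, q))  # log(13.5) + log(16)
```

Only dimensions where the query has a 1 and the document has a nonzero weight contribute, which matches the sum over terms that appear in both.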

The probabilistic model is an alternative to the term vector space model.

The Binary Independence Retrieval similarity measure is used instead of the cosine similarity measure to rank all documents against the query q.

Techniques such as stoplists and stemming can be used with either model.

Variations to the model result in slightly different expressions for the similarity measure.

Early uses of probabilistic information retrieval were based on relevance feedback.

R is a set of documents that are guessed to be relevant and R̄ is the complement of R.

1. Guess a preliminary probabilistic description of R and use it to retrieve a first set of documents.

2. Interact with the user to refine the description of R (relevance feedback).

Repeat, thus generating a succession of approximations to R.

Initial guess, with no information to work from:

pi = P(xi | R) = c

ri = P(xi | R̄) = ni / N

where:

c is an arbitrary constant, e.g., 0.5

ni is the number of documents that contain xi

N is the total number of documents in the collection

With these assumptions, and taking c = 0.5:

$$S'(d, q) = \sum \log\left\{ \frac{p_i (1 - r_i)}{(1 - p_i)\, r_i} \right\} = \sum \log\left\{ \frac{N - n_i}{n_i} \right\}$$

where the summation is taken over those terms that appear in both the query and the document.
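With pi = 0.5 and ri = ni / N, each term weight reduces to log((N - ni) / ni), which needs only collection statistics. A small sketch under those assumptions; the function names and the document frequencies are hypothetical.

```python
import math


def initial_weight(n_i, N):
    """log((N - n_i) / n_i): the term weight when p_i = 0.5 and r_i = n_i / N."""
    return math.log((N - n_i) / n_i)


def initial_score(doc_terms, query_terms, doc_freq, N):
    """Sum the initial weights over terms in both the query and the document."""
    shared = set(doc_terms) & set(query_terms)
    return sum(initial_weight(doc_freq[t], N) for t in shared)


# Hypothetical collection statistics.
N = 1000
doc_freq = {"retrieval": 100, "the": 900}
print(initial_weight(doc_freq["retrieval"], N))  # log(9), about 2.20
print(initial_weight(doc_freq["the"], N))        # log(1/9), about -2.20
```

Rare terms receive large positive weights, while a term that occurs in more than half of the documents receives a negative weight.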

Human feedback -- relevance feedback

Automatically

(a) Run query q using initial values. Consider the t top ranked documents. Let si be the number of these documents that contain the term xi.

(b) The new estimates are:

pi = P(xi | R) = si / t

ri = P(xi | R̄) = (ni - si) / (N - t)
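The automatic re-estimation step can be sketched as below, treating the t top-ranked documents as if they were relevant. The function name `reestimate` and the sample data are hypothetical.

```python
def reestimate(top_docs, doc_freq, N):
    """Re-estimate p_i and r_i from the t top-ranked documents.

    top_docs: list of sets of terms, one set per top-ranked document.
    doc_freq: doc_freq[term] = n_i, number of documents containing the term.
    N: total number of documents in the collection.
    """
    t = len(top_docs)
    p, r = {}, {}
    for term, n_i in doc_freq.items():
        s_i = sum(1 for doc in top_docs if term in doc)
        p[term] = s_i / t                 # p_i = s_i / t
        r[term] = (n_i - s_i) / (N - t)   # r_i = (n_i - s_i) / (N - t)
    return p, r


# Hypothetical data: 4 top-ranked documents from a collection of 1000.
top_docs = [{"retrieval", "model"}, {"retrieval"}, {"retrieval", "index"}, {"index"}]
doc_freq = {"retrieval": 100, "index": 50, "model": 20}
p, r = reestimate(top_docs, doc_freq, N=1000)
print(p["retrieval"], r["retrieval"])  # 0.75 and 97/996
```

These raw estimates can be exactly 0 or 1, in which case the log weight is undefined; a smoothing adjustment is commonly applied in practice and is not shown here.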

Advantages

•Has a firm theoretical basis

Disadvantages

•Initial definition of R has to be guessed.

•Weights ignore term frequency

•Assumes independent index terms (as does vector model)