- By
**fiona** - Follow User

- 103 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Information Retrieval' - fiona

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Information Retrieval

CSE 8337

Spring 2003

Modeling

Material for these slides obtained from:

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://www.sims.berkeley.edu/~hearst/irbook/

Introduction to Modern Information Retrieval by Gerard Salton and Michael J. McGill, McGraw-Hill, 1983.

Modeling TOC

- Introduction
- Classic IR Models
- Boolean Model
- Vector Model
- Probabilistic Model
- Set Theoretic Models
- Fuzzy Set Model
- Extended Boolean Model
- Generalized Vector Model
- Latent Semantic Indexing
- Neural Network Model
- Alternative Probabilistic Models
- Inference Network
- Belief Network

Introduction

- IR systems usually adopt index terms to process queries
- Index term:
- a keyword or group of selected words
- any word (more general)
- Stemming might be used:
- connect: connecting, connection, connections
- An inverted file is built for the chosen index terms

- Matching at index term level is quite imprecise
- No surprise that users get frequently unsatisfied
- Since most users have no training in query formation, problem is even worst
- Frequent dissatisfaction of Web users
- Issue of deciding relevance is critical for IR systems: ranking

- A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the query
- A ranking is based on fundamental premisses regarding the notion of relevance, such as:
- common sets of index terms
- sharing of weighted terms
- likelihood of relevance
- Each set of premisses leads to a distinct IR model

Set Theoretic

Generalized Vector

Lat. Semantic Index

Neural Networks

Structured Models

Fuzzy

Extended Boolean

Non-Overlapping Lists

Proximal Nodes

Classic Models

Probabilistic

boolean

vector

probabilistic

Inference Network

Belief Network

Browsing

Flat

Structure Guided

Hypertext

IR Models

U

s

e

r

T

a

s

k

Retrieval:

Adhoc

Filtering

Browsing

Classic IR Models - Basic Concepts

- Each document represented by a set of representative keywords or index terms
- An index term is a document word useful for remembering the document main themes
- Usually, index terms are nouns because nouns have meaning by themselves
- However, search engines assume that all words are index terms (full text representation)

Classic IR Models - Basic Concepts

- The importance of the index terms is represented by weights associated to them
- ki- an index term
- dj- a document
- wij - a weight associated with (ki,dj)
- The weight wijquantifies the importance of the index term for describing the document contents

Classic IR Models - Basic Concepts

- t is the total number of index terms
- K = {k1, k2, …, kt} is the set of all index terms
- wij >= 0 is a weight associated with (ki,dj)
- wij = 0 indicates that term does not belong to doc
- dj= (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
- gi(dj) = wij is a function which returns the weight associated with pair (ki,dj)

- Simple model based on set theory
- Queries specified as boolean expressions
- precise semantics and neat formalism
- Terms are either present or absent. Thus, wij {0,1}
- Consider
- q = ka (kb kc)
- qdnf = (1,1,1) (1,1,0) (1,0,0)
- qcc= (1,1,0) is a conjunctive component

Kb

(1,1,0)

(1,0,0)

(1,1,1)

Kc

The Boolean Model

- q = ka (kb kc)
- sim(q,dj) =

1 if qcc| (qcc qdnf) (ki, gi(dj)= gi(qcc))

0 otherwise

Drawbacks of the Boolean Model

- Retrieval based on binary decision criteria with no notion of partial matching
- No ranking of the documents is provided
- Information need has to be translated into a Boolean expression
- The Boolean queries formulated by the users are most often too simplistic
- As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

- Use of binary weights is too limiting
- Non-binary weights provide consideration for partial matches
- These term weights are used to compute a degree of similarity between a query and each document
- Ranked set of documents provides for better matching

- wij > 0 whenever ki appears in dj
- wiq >= 0 associated with the pair (ki,q)
- dj = (w1j, w2j, ..., wtj)
- q = (w1q, w2q, ..., wtq)
- To each term ki is associated a unitary vector i
- The unitary vectors i and j are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
- The t unitary vectors i form an orthonormal basis for a t-dimensional space where queries and documents are represented as weighted vectors

j

dj

q

i

- Sim(q,dj) = cos()

= [dj q] / |dj| * |q|

= [ wij * wiq] / |dj| * |q|

- Since wij > 0 and wiq > 0, 0 <= sim(q,dj) <=1
- A document is retrieved even if it matches the query terms only partially

- One approach is to examine the frequency of the occurence of a word in a document:
- Absolute frequency:
- tf factor, the term frequency within a document
- freqi,j - raw frequency of ki within dj
- Both high-frequency and low-frequency terms may not actually be significant
- Relative frequency: tf divided by number of words in document
- Normalized frequency:

fi,j = (freqi,j)/(maxl freql,j)

Inverse Document Frequency

- Importance of term may depend more on how it can distinguish between documents.
- Quantification of inter-documents separation
- Dissimilarity not similarity
- idf factor, the inverse document frequency

- N be the total number of docs in the collection
- ni be the number of docs which contain ki
- The idf factor is computed as
- idfi = log (N/ni)
- the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
- IDF Ex:
- N=1000, n1=100, n2=500, n3=800
- idf1= 3 - 2 = 1
- idf2= 3 – 2.7 = 0.3
- idf3 = 3 – 2.9 = 0.1

- The best term-weighting schemes take both into account.
- wij = fi,j * log(N/ni)
- This strategy is called a tf-idf weighting scheme

- For the query term weights, a suggestion is
- wiq = (0.5 + [0.5 * freqi,q / max(freql,q]) * log(N/ni)
- The vector model with tf-idf weights is a good ranking strategy with general collections
- The vector model is usually as good as any known ranking alternatives.
- It is also simple and fast to compute.

- Advantages:
- term-weighting improves quality of the answer set
- partial matching allows retrieval of docs that approximate the query conditions
- cosine ranking formula sorts documents according to degree of similarity to the query
- Disadvantages:
- Assumes independence of index terms (??); not clear that this is bad though

Probabilistic Model

- Objective: to capture the IR problem using a probabilistic framework
- Given a user query, there is an ideal answer set
- Querying as specification of the properties of this ideal answer set (clustering)
- But, what are these properties?
- Guess at the beginning what they could be (i.e., guess initial description of ideal answer set)
- Improve by iteration

- An initial set of documents is retrieved somehow
- User inspects these docs looking for the relevant ones (in truth, only top 10-20 need to be inspected)
- IR system uses this information to refine description of ideal answer set
- By repeting this process, it is expected that the description of the ideal answer set will improve
- Have always in mind the need to guess at the very beginning the description of the ideal answer set
- Description of ideal answer set is modeled in probabilistic terms

Probabilistic Ranking Principle

- Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). Ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
- But,
- how to compute probabilities?
- what is the sample space?

- Probabilistic ranking computed as:
- sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)
- This is the odds of the document dj being relevant
- Taking the odds minimize the probability of an erroneous judgement
- Definition:
- wij {0,1}
- P(R | dj) :probability that given doc is relevant
- P(R | dj) : probability doc is not relevant

- sim(dj,q) = P(R | dj) / P(R | dj) = [P(dj | R) * P(R)] [P(dj | R) * P(R)]

~ P(dj | R) P(dj | R)

- P(dj | R) : probability of randomly selecting the document dj from the set R of relevant documents

- sim(dj,q) ~ P(dj | R) P(dj | R) ~ [ P(ki | R)] * [ P(ki | R)]

[ P(ki | R)] * [ P(ki | R)]

- P(ki | R) : probability that the index term ki is present in a document randomly selected from the set R of relevant documents

- sim(dj,q)

~ log [ P(ki | R)] * [ P(kj | R)]

[ P(ki |R)] * [ P(ki | R)]

~ K * [ log P(ki | R) + log P(ki | R) ] P(ki | R) P(ki | R)

where P(ki | R) = 1 - P(ki | R) P(ki | R) = 1 - P(ki | R)

- sim(dj,q)

~ wiq * wij * (log P(ki | R) + log P(ki | R) )

P(ki | R) P(ki | R)

- Probabilities P(ki | R) and P(ki | R) ?
- Estimates based on assumptions:
- P(ki | R) = 0.5
- P(ki | R) = ni N
- Use this initial guess to retrieve an initial ranking
- Improve upon this initial ranking

- Let
- V : set of docs initially retrieved
- Vi : subset of docs retrieved that contain ki
- Reevaluate estimates:
- P(ki | R) = Vi V
- P(ki | R) = ni - Vi N - V
- Repeat recursively

- To avoid problems with V=1 and Vi=0:
- P(ki | R) = Vi + 0.5 V + 1
- P(ki | R) = ni - Vi + 0.5 N - V + 1
- Also,
- P(ki | R) = Vi + ni/N V + 1
- P(ki | R) = ni - Vi + ni/N N - V + 1

- Advantages:
- Docs ranked in decreasing order of probability of relevance
- Disadvantages:
- need to guess initial estimates for P(ki | R)
- method does not take into account tf and idf factors

Brief Comparison of Classic Models

- Boolean model does not provide for partial matches and is considered to be the weakest classic model
- Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections
- This seems also to be the view of the research community

Set Theoretic Models

- The Boolean model imposes a binary criterion for deciding relevance
- The question of how to extend the Boolean model to accomodate partial matching and a ranking has attracted considerable attention in the past
- We discuss now two set theoretic models for this:
- Fuzzy Set Model
- Extended Boolean Model

- This vagueness of document/query matching can be modeled using a fuzzy framework, as follows:
- with each term is associated a fuzzy set
- each doc has a degree of membership in this fuzzy set
- Here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)

- A fuzzy subset A of U is characterized by a membership function (A,u) : U [0,1] which associates with each element u of U a number (u) in the interval [0,1]
- Definition
- Let A and B be two fuzzy subsets of U. Also, let ¬A be the complement of A. Then,
- (¬A,u) = 1 - (A,u)
- (AB,u) = max((A,u), (B,u))
- (AB,u) = min((A,u), (B,u))

- Fuzzy sets are modeled based on a thesaurus
- This thesaurus is built as follows:
- Let c be a term-term correlation matrix
- Let ci,l be a normalized correlation factor for (ki,kl): ci,l = ni,l ni + nl - ni,l
- ni: number of docs which contain ki
- nl: number of docs which contain kl
- ni,l: number of docs which contain both ki and kl
- We now have the notion of proximity among index terms.

- The correlation factor ci,l can be used to define fuzzy set membership for a document dj as follows: i,j = 1 - (1 - ci,l) ki dj
- i,j : membership of doc dj in fuzzy subset associated with ki
- The above expression computes an algebraic sum over all terms in the doc dj

- A doc dj belongs to the fuzzy set for ki, if its own terms are associated with ki
- If doc dj contains a term kl which is closely related to ki, we have
- ci,l ~ 1
- i,j ~ 1
- index ki is a good fuzzy index for doc

Kb

cc2

cc3

cc1

Kc

Fuzzy IR: An Example

- q = ka (kb kc)
- qdnf

= (1,1,1) + (1,1,0) + (1,0,0)

= cc1 + cc2 + cc3

- q,dj = cc1+cc2+cc3,j = 1 - (1 - a,j b,j) c,j) * (1 - a,j b,j (1-c,j)) * (1 - a,j (1-b,j) (1-c,j))

- Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
- Experiments with standard test collections are not available
- Difficult to compare at this time

Extended Boolean Model

- Boolean model is simple and elegant.
- But, no provision for a ranking
- As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
- Extend the Boolean model with the notions of partial matching and term weighting
- Combine characteristics of the Vector model with properties of Boolean algebra

- The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
- Let,
- q = kx ky
- wxj = fxj * idfx associated with [kx,dj] max(idfi)
- Further, wxj = x and wyj = y

ky

dj+1

y = wyj

dj

(0,0)

x = wxj

kx

2

2

sim(qand,dj) = 1 - sqrt( (1-x) + (1-y) )

2

The Idea:

qand = kx ky; wxj = x and wyj = y

AND

dj+1

dj

y = wyj

(0,0)

x = wxj

kx

2

2

sim(qor,dj) = sqrt( x + y )

2

The Idea:

qor = kx ky; wxj = x and wyj = y

(1,1)

OR

- We can extend the previous model to consider Euclidean distances in a t-dimensional space
- This can be done using p-norms which extend the notion of distance to include p-distances, where 1 p is a new parameter

1

1

p

p

p

p

p

p

p

p

p

p

p

- sim(qor,dj) = (x1 + x2 + . . . + xm ) m

p

p

p

- sim(qand,dj)=1 - ((1-x1) + (1-x2) + . . . + (1-xm) ) m

- A generalized disjunctive query is given by
- qor = k1 k2 . . . kt
- A generalized conjunctive query is given by
- qand = k1 k2 . . . kt

- If p = 1 then (Vector like)
- sim(qor,dj) = sim(qand,dj) = x1 + . . . + xm m
- If p = then (Fuzzy like)
- sim(qor,dj) = max (wxj)
- sim(qand,dj) = min (wxj)
- By varying p, we can make the model behave as a vector, as a fuzzy, or as an intermediary model

2

- This is quite powerful and is a good argument in favor of the extended Boolean model
- q = (k1 k2) k3

k1 and k2 are to be used as in a vector retrieval while the presence of k3 is required.

- sim(q,dj) = ( (1 - ( (1-x1) + (1-x2) ) ) + x3 ) 2 ______ 2

Conclusions

- Model is quite powerful
- Properties are interesting and might be useful
- Computation is somewhat complex
- However, distributivity operation does not hold for ranking computation:
- q1 = (k1 k2) k3
- q2 = (k1 k3) (k2 k3)
- sim(q1,dj) sim(q2,dj)

Download Presentation

Connecting to Server..