
Information Retrieval

CSE 8337

Spring 2003

Modeling

Material for these slides obtained from:

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://www.sims.berkeley.edu/~hearst/irbook/

Introduction to Modern Information Retrieval by Gerard Salton and Michael J. McGill, McGraw-Hill, 1983.

Modeling TOC
  • Introduction
  • Classic IR Models
    • Boolean Model
    • Vector Model
    • Probabilistic Model
  • Set Theoretic Models
    • Fuzzy Set Model
    • Extended Boolean Model
  • Generalized Vector Model
  • Latent Semantic Indexing
  • Neural Network Model
  • Alternative Probabilistic Models
    • Inference Network
    • Belief Network
Introduction
  • IR systems usually adopt index terms to process queries
  • Index term:
    • a keyword or group of selected words
    • any word (more general)
  • Stemming might be used:
    • connect: connecting, connection, connections
  • An inverted file is built for the chosen index terms
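Below is a minimal sketch of such an inverted file in Python. The whitespace tokenization and toy documents are illustrative assumptions; a real system would also handle punctuation, stopwords, and stemming:

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each index term to the set of document ids containing it.

    docs: dict of doc_id -> text. Tokenization here is a plain
    whitespace split, purely for illustration.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "information retrieval models", 2: "boolean retrieval"}
index = build_inverted_file(docs)
print(sorted(index["retrieval"]))  # [1, 2]
```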
Introduction

[Figure: documents are represented by index terms and the information need by a query; retrieval matches docs against the query and produces a ranking]
Introduction

  • Matching at the index term level is quite imprecise
  • No surprise that users are frequently unsatisfied
  • Since most users have no training in query formulation, the problem is even worse
  • Hence the frequent dissatisfaction of Web users
  • The issue of deciding relevance is critical for IR systems: ranking
Introduction

  • A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the query
  • A ranking is based on fundamental premises regarding the notion of relevance, such as:
    • common sets of index terms
    • sharing of weighted terms
    • likelihood of relevance
  • Each set of premises leads to a distinct IR model
IR Models

[Figure: taxonomy of IR models by user task. Retrieval (ad hoc, filtering): Classic Models (Boolean, vector, probabilistic), Set Theoretic (Fuzzy, Extended Boolean), Algebraic (Generalized Vector, Latent Semantic Indexing, Neural Networks), Probabilistic (Inference Network, Belief Network), Structured Models (Non-Overlapping Lists, Proximal Nodes). Browsing: Flat, Structure Guided, Hypertext]

Classic IR Models - Basic Concepts

  • Each document is represented by a set of representative keywords or index terms
  • An index term is a document word useful for capturing the document's main themes
  • Usually, index terms are nouns, because nouns carry meaning by themselves
  • However, search engines assume that all words are index terms (full text representation)
Classic IR Models - Basic Concepts

  • The importance of the index terms is represented by weights associated with them
  • ki: an index term
  • dj: a document
  • wij: a weight associated with the pair (ki, dj)
  • The weight wij quantifies the importance of the index term for describing the document contents
Classic IR Models - Basic Concepts

  • t is the total number of index terms
  • K = {k1, k2, ..., kt} is the set of all index terms
  • wij >= 0 is a weight associated with the pair (ki, dj)
  • wij = 0 indicates that the term does not belong to the doc
  • dj = (w1j, w2j, ..., wtj) is the weighted vector associated with the document dj
  • gi(dj) = wij is a function which returns the weight associated with the pair (ki, dj)
The Boolean Model

  • Simple model based on set theory
  • Queries specified as Boolean expressions
    • precise semantics and neat formalism
  • Terms are either present or absent. Thus, wij ∈ {0,1}
  • Consider
    • q = ka ∧ (kb ∨ kc)
    • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
    • qcc = (1,1,0) is a conjunctive component
The Boolean Model

[Figure: Venn diagram over the terms Ka, Kb, Kc highlighting the conjunctive components (1,1,0), (1,0,0), and (1,1,1)]

  • q = ka ∧ (kb ∨ kc)
  • sim(q,dj) = 1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ ki, gi(dj) = gi(qcc)), and 0 otherwise
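A minimal sketch of this matching rule in Python, assuming the query has already been converted to disjunctive normal form and documents are reduced to binary weight tuples over the query terms (both hypothetical inputs, for illustration):

```python
def sim_boolean(doc, q_dnf):
    """Boolean similarity: 1 if the document's binary weight pattern over
    the query terms equals some conjunctive component of the query's DNF,
    0 otherwise."""
    return 1 if doc in q_dnf else 0

# q = ka AND (kb OR kc)  =>  qdnf = (1,1,1) OR (1,1,0) OR (1,0,0)
q_dnf = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}
print(sim_boolean((1, 1, 0), q_dnf))  # 1: matches a conjunctive component
print(sim_boolean((0, 1, 1), q_dnf))  # 0: ka is absent
```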


Drawbacks of the Boolean Model

  • Retrieval based on binary decision criteria with no notion of partial matching
  • No ranking of the documents is provided
  • Information need has to be translated into a Boolean expression
  • The Boolean queries formulated by the users are most often too simplistic
  • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
The Vector Model

  • Use of binary weights is too limiting
  • Non-binary weights allow for partial matches
  • These term weights are used to compute a degree of similarity between a query and each document
  • The ranked set of documents provides for better matching
The Vector Model

  • wij > 0 whenever ki appears in dj
  • wiq >= 0 is associated with the pair (ki, q)
  • dj = (w1j, w2j, ..., wtj)
  • q = (w1q, w2q, ..., wtq)
  • With each term ki is associated a unit vector ui
  • The unit vectors ui and ul are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
  • The t unit vectors ui form an orthonormal basis for a t-dimensional space in which queries and documents are represented as weighted vectors
The Vector Model

[Figure: document vector dj and query vector q in term space, separated by the angle θ]

  • sim(q,dj) = cos(θ) = (dj • q) / (|dj| * |q|) = (Σi wij * wiq) / (|dj| * |q|)
  • Since wij >= 0 and wiq >= 0, we have 0 <= sim(q,dj) <= 1
  • A document is retrieved even if it matches the query terms only partially
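A minimal sketch of the cosine computation in Python, assuming the document and query weight vectors have already been built (the example weights are made up):

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between document and query weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

d1 = [0.5, 0.8, 0.0]   # weights w1j, w2j, w3j
q  = [1.0, 0.0, 0.7]   # weights w1q, w2q, w3q
print(round(cosine_sim(d1, q), 3))  # approximately 0.434
```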
Weights wij and wiq?

  • One approach is to examine the frequency of occurrence of a word in a document:
  • Absolute frequency:
    • tf factor, the term frequency within a document
    • freqi,j: raw frequency of ki within dj
    • Both high-frequency and low-frequency terms may not actually be significant
  • Relative frequency: tf divided by the number of words in the document
  • Normalized frequency:

fi,j = freqi,j / (maxl freql,j)

Inverse Document Frequency

  • The importance of a term may depend more on how well it distinguishes between documents
  • Quantification of inter-document separation
  • Dissimilarity, not similarity
  • idf factor, the inverse document frequency
IDF

  • Let N be the total number of docs in the collection
  • Let ni be the number of docs which contain ki
  • The idf factor is computed as
    • idfi = log(N/ni)
    • the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
  • IDF example (base-10 logs):
    • N = 1000, n1 = 100, n2 = 500, n3 = 800
    • idf1 = log(1000/100) = 3 - 2 = 1
    • idf2 = log(1000/500) = 3 - 2.7 = 0.3
    • idf3 = log(1000/800) = 3 - 2.9 = 0.1
The Vector Model

  • The best term-weighting schemes take both tf and idf into account:
    • wij = fi,j * log(N/ni)
  • This strategy is called a tf-idf weighting scheme
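A minimal Python sketch of this weighting scheme, combining the normalized tf factor with the base-10 idf used in the example above (the toy counts are illustrative):

```python
import math

def tf_idf(freq, max_freq, N, n_i):
    """w_ij = f_ij * log(N / n_i), with f_ij = freq_ij / max_l freq_lj."""
    f_ij = freq / max_freq
    return f_ij * math.log10(N / n_i)

# The term occurs 3 times in the doc; the doc's most frequent term occurs
# 10 times. The collection has N = 1000 docs, 100 of which contain the term.
print(round(tf_idf(3, 10, 1000, 100), 3))  # 0.3 * log10(10) = 0.3
```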
The Vector Model

  • For the query term weights, a suggestion is
    • wiq = (0.5 + 0.5 * freqi,q / (maxl freql,q)) * log(N/ni)
  • The vector model with tf-idf weights is a good ranking strategy for general collections
  • The vector model is usually as good as any known ranking alternative
  • It is also simple and fast to compute
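The suggested query weighting, sketched in Python under the same assumptions:

```python
import math

def query_weight(freq_iq, max_freq_q, N, n_i):
    """w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i)."""
    return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log10(N / n_i)

# The term appears twice in the query, as does the most frequent query term.
print(round(query_weight(2, 2, 1000, 100), 3))  # (0.5 + 0.5) * 1 = 1.0
```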
The Vector Model

  • Advantages:
    • term-weighting improves quality of the answer set
    • partial matching allows retrieval of docs that approximate the query conditions
    • the cosine ranking formula sorts documents according to their degree of similarity to the query
  • Disadvantages:
    • assumes independence of index terms; it is not clear that this is a bad assumption in practice, though
The Vector Model: Examples I, II, and III

[Figures: three example configurations of the documents d1 through d7 in the three-dimensional term space spanned by k1, k2, and k3]
Probabilistic Model
  • Objective: to capture the IR problem using a probabilistic framework
  • Given a user query, there is an ideal answer set
  • Querying as specification of the properties of this ideal answer set (clustering)
  • But, what are these properties?
  • Guess at the beginning what they could be (i.e., guess initial description of ideal answer set)
  • Improve by iteration
Probabilistic Model

  • An initial set of documents is retrieved somehow
  • The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
  • The IR system uses this information to refine the description of the ideal answer set
  • By repeating this process, it is expected that the description of the ideal answer set will improve
  • Keep in mind that we always need to guess the description of the ideal answer set at the very beginning
  • The description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking Principle

  • Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
  • But,
    • how do we compute these probabilities?
    • what is the sample space?
The Ranking

  • Probabilistic ranking computed as:
    • sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)
    • This is the odds of the document dj being relevant
    • Taking the odds minimizes the probability of an erroneous judgement
  • Definitions:
    • wij ∈ {0,1}
    • P(R | dj): probability that the given doc is relevant
    • P(¬R | dj): probability that the given doc is not relevant
The Ranking

  • sim(dj,q) = P(R | dj) / P(¬R | dj)
             = [P(dj | R) * P(R)] / [P(dj | ¬R) * P(¬R)]   (by Bayes' rule)
             ~ P(dj | R) / P(dj | ¬R)
    since P(R) and P(¬R) are the same for every document and do not affect the ranking
  • P(dj | R): probability of randomly selecting the document dj from the set R of relevant documents
The Ranking

  • Assuming independence of index terms:

    sim(dj,q) ~ [Π(gi(dj)=1) P(ki | R) * Π(gi(dj)=0) P(¬ki | R)] / [Π(gi(dj)=1) P(ki | ¬R) * Π(gi(dj)=0) P(¬ki | ¬R)]

  • P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents
The Ranking

  • Taking the log and dropping terms that are constant for all documents:

    sim(dj,q) ~ Σ(gi(dj)=1) [ log( P(ki | R) / P(¬ki | R) ) + log( P(¬ki | ¬R) / P(ki | ¬R) ) ]

    where P(¬ki | R) = 1 - P(ki | R) and P(¬ki | ¬R) = 1 - P(ki | ¬R)

The Initial Ranking

  • sim(dj,q) ~ Σ wiq * wij * [ log( P(ki | R) / (1 - P(ki | R)) ) + log( (1 - P(ki | ¬R)) / P(ki | ¬R) ) ]
  • How do we estimate the probabilities P(ki | R) and P(ki | ¬R)?
  • Initial estimates based on assumptions:
    • P(ki | R) = 0.5
    • P(ki | ¬R) = ni / N
  • Use this initial guess to retrieve an initial ranking
  • Improve upon this initial ranking
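A minimal Python sketch of this initial ranking, plugging the two initial estimates into the log-odds sum above (binary weights for shared terms; the toy collection statistics are made up):

```python
import math

def initial_sim(doc_terms, query_terms, n, N):
    """Initial probabilistic score with P(ki|R) = 0.5 and P(ki|~R) = ni/N.

    doc_terms, query_terms: sets of terms
    n: dict term -> ni (number of docs containing the term)
    N: total number of docs in the collection
    """
    score = 0.0
    for ki in query_terms & doc_terms:   # wiq = wij = 1 for shared terms
        p_r = 0.5                        # P(ki | R)
        p_nr = n[ki] / N                 # P(ki | not R)
        score += math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)
    return score

n = {"retrieval": 100, "model": 500}
print(round(initial_sim({"retrieval", "model"}, {"retrieval"}, n, 1000), 3))
```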
Improving the Initial Ranking

  • Let
    • V: set of docs initially retrieved (by abuse of notation, also its size below)
    • Vi: subset of the retrieved docs that contain ki
  • Reevaluate the estimates:
    • P(ki | R) = Vi / V
    • P(ki | ¬R) = (ni - Vi) / (N - V)
  • Repeat recursively
Improving the Initial Ranking

  • To avoid problems with small values such as V = 1 and Vi = 0, add a small adjustment factor:
    • P(ki | R) = (Vi + 0.5) / (V + 1)
    • P(ki | ¬R) = (ni - Vi + 0.5) / (N - V + 1)
  • Alternatively, use ni/N as the adjustment factor:
    • P(ki | R) = (Vi + ni/N) / (V + 1)
    • P(ki | ¬R) = (ni - Vi + ni/N) / (N - V + 1)
Pluses and Minuses

  • Advantages:
    • docs are ranked in decreasing order of their probability of relevance
  • Disadvantages:
    • need to guess the initial estimates for P(ki | R)
    • method does not take into account tf and idf factors
Brief Comparison of Classic Models

  • The Boolean model does not provide for partial matches and is considered to be the weakest classic model
  • Salton and Buckley did a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections
  • This also seems to be the view of the research community
Set Theoretic Models

  • The Boolean model imposes a binary criterion for deciding relevance
  • The question of how to extend the Boolean model to accommodate partial matching and ranking has attracted considerable attention in the past
  • We now discuss two set theoretic models for this:
    • Fuzzy Set Model
    • Extended Boolean Model
Fuzzy Set Model

  • Document/query matching is approximate, and this vagueness can be modeled using a fuzzy framework, as follows:
    • with each term is associated a fuzzy set
    • each doc has a degree of membership in this fuzzy set
  • Here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)
Fuzzy Set Theory

  • A fuzzy subset A of U is characterized by a membership function µA : U → [0,1] which associates with each element u of U a number µA(u) in the interval [0,1]
  • Definition
    • Let A and B be two fuzzy subsets of U. Also, let ¬A be the complement of A. Then,
      • µ¬A(u) = 1 - µA(u)
      • µA∪B(u) = max(µA(u), µB(u))
      • µA∩B(u) = min(µA(u), µB(u))
Fuzzy Information Retrieval

  • Fuzzy sets are modeled based on a thesaurus
  • This thesaurus is built as follows:
    • Let c be a term-term correlation matrix
    • Let ci,l be a normalized correlation factor for (ki, kl):

      ci,l = ni,l / (ni + nl - ni,l)

      • ni: number of docs which contain ki
      • nl: number of docs which contain kl
      • ni,l: number of docs which contain both ki and kl
  • We now have a notion of proximity among index terms
Fuzzy Information Retrieval

  • The correlation factor ci,l can be used to define fuzzy set membership for a document dj as follows:

    µi,j = 1 - Π(kl ∈ dj) (1 - ci,l)

  • µi,j: membership of doc dj in the fuzzy subset associated with ki
  • The above expression computes an algebraic sum over all terms in the doc dj, implemented as the complement of a product of complements
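A minimal Python sketch of both steps: the correlation factors computed from document sets, then the membership µi,j computed as the complement of a product of complements (the toy postings are illustrative):

```python
def correlation(docs_i, docs_l):
    """c_il = n_il / (n_i + n_l - n_il), from sets of doc ids."""
    n_il = len(docs_i & docs_l)
    return n_il / (len(docs_i) + len(docs_l) - n_il)

def membership(ki, doc_terms, postings):
    """mu_ij = 1 - product over kl in dj of (1 - c_il)."""
    prod = 1.0
    for kl in doc_terms:
        prod *= 1 - correlation(postings[ki], postings[kl])
    return 1 - prod

postings = {            # term -> set of doc ids containing it
    "ka": {1, 2, 3},
    "kb": {2, 3},
    "kc": {4},
}
# Membership of a doc containing {kb, kc} in the fuzzy set of ka
print(round(membership("ka", {"kb", "kc"}, postings), 3))  # 0.667
```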
Fuzzy Information Retrieval

  • A doc dj belongs to the fuzzy set associated with ki if its own terms are associated with ki
  • If doc dj contains a term kl which is closely related to ki, we have
    • ci,l ~ 1
    • hence µi,j ~ 1, and the index ki is a good fuzzy index for the doc
Fuzzy IR: An Example

[Figure: Venn diagram over the terms Ka, Kb, Kc showing the conjunctive components cc1, cc2, cc3]

  • q = ka ∧ (kb ∨ kc)
  • qdnf = (1,1,1) + (1,1,0) + (1,0,0) = cc1 + cc2 + cc3
  • µq,j = µcc1+cc2+cc3,j
        = 1 - (1 - µa,j µb,j µc,j) * (1 - µa,j µb,j (1 - µc,j)) * (1 - µa,j (1 - µb,j) (1 - µc,j))
Fuzzy Information Retrieval

  • Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
  • Experiments with standard test collections are not available
  • This makes them difficult to compare with other models at this time
Extended Boolean Model

  • The Boolean model is simple and elegant
  • But it makes no provision for a ranking
  • As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
  • Idea: extend the Boolean model with the notions of partial matching and term weighting
  • Combine characteristics of the vector model with properties of Boolean algebra
The Idea

  • The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
  • Let,
    • q = kx ∧ ky
    • wxj = fx,j * (idfx / maxi idfi) be the weight associated with [kx, dj]
    • Further, let wxj = x and wyj = y
The Idea: AND

[Figure: document dj plotted at the point (x, y) = (wxj, wyj) in the kx-ky plane; for an AND query, (1,1) is the most desirable point]

  • qand = kx ∧ ky; wxj = x and wyj = y
  • sim(qand,dj) = 1 - sqrt( ((1-x)² + (1-y)²) / 2 )

The Idea: OR

[Figure: document dj plotted at the point (x, y) = (wxj, wyj) in the kx-ky plane; for an OR query, (0,0) is the spot to avoid]

  • qor = kx ∨ ky; wxj = x and wyj = y
  • sim(qor,dj) = sqrt( (x² + y²) / 2 )

Generalizing the Idea

  • We can extend the previous model to consider Euclidean distances in a t-dimensional space
  • This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 <= p <= ∞ is a newly introduced parameter
Generalizing the Idea

  • A generalized disjunctive query is given by
    • qor = k1 ∨p k2 ∨p ... ∨p kt
  • A generalized conjunctive query is given by
    • qand = k1 ∧p k2 ∧p ... ∧p kt
  • sim(qor,dj) = ( (x1^p + x2^p + ... + xm^p) / m )^(1/p)
  • sim(qand,dj) = 1 - ( ((1-x1)^p + (1-x2)^p + ... + (1-xm)^p) / m )^(1/p)
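A minimal Python sketch of the two p-norm similarities, where x is the list of term weights x1..xm for a document (illustrative values only):

```python
def sim_or(x, p):
    """Generalized OR: ((x1^p + ... + xm^p) / m) ^ (1/p)."""
    m = len(x)
    return (sum(xi ** p for xi in x) / m) ** (1.0 / p)

def sim_and(x, p):
    """Generalized AND: 1 - (((1-x1)^p + ... + (1-xm)^p) / m) ^ (1/p)."""
    m = len(x)
    return 1 - (sum((1 - xi) ** p for xi in x) / m) ** (1.0 / p)

x = [0.9, 0.2]
print(round(sim_or(x, 1), 3), round(sim_and(x, 1), 3))    # both 0.55 (vector-like)
print(round(sim_or(x, 50), 3), round(sim_and(x, 50), 3))  # approach max/min (fuzzy-like)
```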
Properties

  • If p = 1 (vector-like):
    • sim(qor,dj) = sim(qand,dj) = (x1 + ... + xm) / m
  • If p = ∞ (fuzzy-like):
    • sim(qor,dj) = max(xi)
    • sim(qand,dj) = min(xi)
  • By varying p, we can make the model behave as a vector model, as a fuzzy model, or as an intermediate model
Properties

  • This is quite powerful and is a good argument in favor of the extended Boolean model
  • Example, with p = 2: q = (k1 ∧ k2) ∨ k3

    k1 and k2 are to be used as in vector retrieval, while the presence of k3 is required

  • sim(q,dj) = sqrt( ( (1 - sqrt( ((1-x1)² + (1-x2)²) / 2 ))² + x3² ) / 2 )
Conclusions

  • Model is quite powerful
  • Properties are interesting and might be useful
  • Computation is somewhat complex
  • However, distributivity does not hold for the ranking computation:
    • q1 = (k1 ∨ k2) ∧ k3
    • q2 = (k1 ∧ k3) ∨ (k2 ∧ k3)
    • sim(q1,dj) ≠ sim(q2,dj)