
Information Retrieval

CSE 8337 (Part B)

Spring 2009

Some Material for these slides obtained from:

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto

Data Mining Introductory and Advanced Topics by Margaret H. Dunham

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze

CSE 8337 Outline

  • Introduction

  • Simple Text Processing

  • Boolean Queries

  • Web Searching/Crawling

  • Indexes

  • Vector Space Model

  • Matching

  • Evaluation


Modeling TOC (Vector Space and Other Models)

  • Introduction

  • Classic IR Models

    • Boolean Model

    • Vector Model

    • Probabilistic Model

  • Extended Boolean Model

  • Vector Space Scoring

  • Vector Model and Web Search


[Figure: a taxonomy of IR models (after Baeza-Yates and Ribeiro-Neto). IR Models divide into Classic Models (Boolean, Vector, Probabilistic), Set Theoretic models (Fuzzy, Extended Boolean), Algebraic models (Generalized Vector, Latent Semantic Indexing, Neural Networks), Probabilistic models (Inference Network, Belief Network), and Structured Models (Non-Overlapping Lists, Proximal Nodes), plus browsing models (Flat, Structure Guided, Hypertext).]

The Boolean Model

  • Simple model based on set theory

  • Queries specified as boolean expressions

    • precise semantics and neat formalism

  • Terms are either present or absent. Thus, wij ∈ {0,1}

  • Consider

    • q = ka ∧ (kb ∨ ¬kc)

    • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)

    • qcc = (1,1,0) is a conjunctive component







The Boolean Model

  • q = ka ∧ (kb ∨ ¬kc)

  • sim(q,dj) =

    1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ ki, gi(dj) = gi(qcc))

    0 otherwise
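A minimal Python sketch of this rule for the running example q = ka ∧ (kb ∨ ¬kc); the term names and the helper function are illustrative only:

```python
# DNF of q = ka AND (kb OR NOT kc) over binary weights (ka, kb, kc).
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_terms):
    """1 if the doc's binary weight vector equals some conjunctive
    component of the query's DNF, else 0."""
    g = tuple(int(t in doc_terms) for t in ("ka", "kb", "kc"))
    return int(g in Q_DNF)

print(sim({"ka", "kb"}))  # (1,1,0) is in the DNF -> 1
print(sim({"kb", "kc"}))  # (0,1,1) is not        -> 0
```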

Drawbacks of the Boolean Model

  • Retrieval based on binary decision criteria with no notion of partial matching

  • No ranking of the documents is provided

  • Information need has to be translated into a Boolean expression

  • The Boolean queries formulated by the users are most often too simplistic

  • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

The Vector Model

  • Use of binary weights is too limiting

  • Non-binary weights provide consideration for partial matches

  • These term weights are used to compute a degree of similarity between a query and each document

  • Ranked set of documents provides for better matching

The Vector Model

  • wij > 0 whenever ki appears in dj

  • wiq >= 0 associated with the pair (ki,q)

  • dj = (w1j, w2j, ..., wtj)

  • q = (w1q, w2q, ..., wtq)

  • To each term ki is associated a unit vector vi

  • The unit vectors vi and vj are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)

  • The t unit vectors vi form an orthonormal basis for a t-dimensional space in which queries and documents are represented as weighted vectors

The Vector Model





  • Sim(q,dj) = cos()

    = [dj  q] / |dj| * |q|

    = [ wij * wiq] / |dj| * |q|

  • Since wij > 0 and wiq > 0, 0 <= sim(q,dj) <=1

  • A document is retrieved even if it matches the query terms only partially
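A minimal sketch of the cosine computation; the example vectors are made up:

```python
import math

def cosine(d, q):
    """sim(q,dj) = (dj . q) / (|dj| * |q|) for nonnegative weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

# A document matching only some query terms still gets a positive score.
print(cosine([0.5, 0.8, 0.0], [0.4, 0.6, 0.2]))  # ~0.96
```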

Weights wij and wiq ?

  • One approach is to examine the frequency of occurrence of a word in a document:

  • Absolute frequency:

    • tf factor, the term frequency within a document

    • freqi,j - raw frequency of ki within dj

    • Both high-frequency and low-frequency terms may not actually be significant

  • Relative frequency: tf divided by number of words in document

  • Normalized frequency:

    fi,j = (freqi,j)/(maxl freql,j)

Inverse Document Frequency

  • Importance of term may depend more on how it can distinguish between documents.

  • Quantification of inter-documents separation

  • Dissimilarity not similarity

  • idf factor, the inverse document frequency


  • N be the total number of docs in the collection

  • ni be the number of docs which contain ki

  • The idf factor is computed as

    • idfi = log (N/ni)

    • the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.

  • IDF example (using log base 10):

    • N=1000, n1=100, n2=500, n3=800

    • idf1= 3 - 2 = 1

    • idf2= 3 – 2.7 = 0.3

    • idf3 = 3 – 2.9 = 0.1
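These values can be checked in a couple of lines (the arithmetic above implies log base 10):

```python
import math

N = 1000
for i, n_i in enumerate((100, 500, 800), start=1):
    print(f"idf{i} = {math.log10(N / n_i):.1f}")
# idf1 = 1.0, idf2 = 0.3, idf3 = 0.1
```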

The Vector Model

  • The best term-weighting schemes take both into account.

  • wij = fi,j * log(N/ni)

  • This strategy is called a tf-idf weighting scheme
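A sketch of tf-idf weighting in Python, combining the max-normalized tf from above with idf; the function name, the df dictionary, and the sample numbers are hypothetical:

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, df, N):
    """w_ij = f_ij * log(N / n_i), with f_ij the max-normalized term
    frequency of ki in dj and n_i the document frequency of ki."""
    freq = Counter(doc_tokens)
    max_freq = max(freq.values())
    return {t: (f / max_freq) * math.log10(N / df[t])
            for t, f in freq.items() if t in df}

df = {"car": 10_000, "insurance": 1_000, "auto": 5_000}  # hypothetical
print(tfidf_weights("car insurance auto insurance".split(), df, N=1_000_000))
```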

The Vector Model

  • For the query term weights, a suggestion is

    • wiq = (0.5 + 0.5 × freqi,q / maxl freql,q) × log(N/ni) (see the sketch after this list)

  • The vector model with tf-idf weights is a good ranking strategy with general collections

  • The vector model is usually as good as any known ranking alternatives.

  • It is also simple and fast to compute.
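The suggested query weight as a one-function Python sketch; log base 10 and the sample numbers are assumptions:

```python
import math

def query_weight(freq_iq, max_freq_q, n_i, N):
    """w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i):
    query tf is smoothed so every query term keeps at least half weight."""
    return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log10(N / n_i)

print(query_weight(freq_iq=1, max_freq_q=2, n_i=1_000, N=1_000_000))  # 2.25
```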

The Vector Model

  • Advantages:

    • term-weighting improves quality of the answer set

    • partial matching allows retrieval of docs that approximate the query conditions

    • cosine ranking formula sorts documents according to degree of similarity to the query

  • Disadvantages:

    • Assumes independence of index terms (??); not clear that this is bad though











The Vector Model: Examples I, II, III

[Figures: three worked examples, shown as tables, ranking the documents of a small collection against a query under vector-model weighting.]

Probabilistic Model

  • Objective: to capture the IR problem using a probabilistic framework

  • Given a user query, there is an ideal answer set

  • Querying as specification of the properties of this ideal answer set (clustering)

  • But, what are these properties?

  • Guess at the beginning what they could be (i.e., guess initial description of ideal answer set)

  • Improve by iteration

Probabilistic Model

  • An initial set of documents is retrieved somehow

  • User inspects these docs looking for the relevant ones (in truth, only top 10-20 need to be inspected)

  • IR system uses this information to refine description of ideal answer set

  • By repeating this process, it is expected that the description of the ideal answer set will improve

  • Have always in mind the need to guess at the very beginning the description of the ideal answer set

  • Description of ideal answer set is modeled in probabilistic terms

Probabilistic Ranking Principle

  • Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find dj interesting (i.e., relevant). The ideal answer set is referred to as R and should maximize the probability of relevance; documents in R are predicted to be relevant.

  • But,

    • how to compute probabilities?

    • what is the sample space?

The Ranking

  • Probabilistic ranking computed as:

    • sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)

    • This is the odds of the document dj being relevant

    • Taking the odds minimizes the probability of an erroneous judgement

  • Definition:

    • wij {0,1}

    • P(R | dj) :probability that given doc is relevant

    • P(R | dj) : probability doc is not relevant

The Ranking

  • sim(dj,q) = P(R | dj) / P(R̄ | dj) = [P(dj | R) × P(R)] / [P(dj | R̄) × P(R̄)]

    ~ P(dj | R) / P(dj | R̄)

  • P(dj | R) : probability of randomly selecting the document dj from the set R of relevant documents

The Ranking

  • sim(dj,q) ~ P(dj | R) / P(dj | R̄)

    ~ [ ∏ki∈dj P(ki | R) × ∏ki∉dj P(k̄i | R) ] / [ ∏ki∈dj P(ki | R̄) × ∏ki∉dj P(k̄i | R̄) ]

  • P(ki | R) : probability that the index term ki is present in a document randomly selected from the set R of relevant documents

The Ranking

  • sim(dj,q)

    ~ log { [ ∏ P(ki | R) × ∏ P(k̄i | R) ] / [ ∏ P(ki | R̄) × ∏ P(k̄i | R̄) ] }

    ~ K × [ Σi log( P(ki | R) / (1 − P(ki | R)) ) + Σi log( (1 − P(ki | R̄)) / P(ki | R̄) ) ]

    where P(k̄i | R) = 1 − P(ki | R) and P(k̄i | R̄) = 1 − P(ki | R̄)

The Initial Ranking

  • sim(dj,q)

    ~ Σi wiq × wij × [ log( P(ki | R) / (1 − P(ki | R)) ) + log( (1 − P(ki | R̄)) / P(ki | R̄) ) ]

  • Probabilities P(ki | R) and P(ki | R̄)?

  • Estimates based on assumptions:

    • P(ki | R) = 0.5

    • P(ki | R̄) = ni / N

    • Use this initial guess to retrieve an initial ranking

    • Improve upon this initial ranking
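Sketching the initial ranking in Python: with P(ki | R) = 0.5 the first log term vanishes, so only the idf-like second term contributes. All names and numbers here are hypothetical:

```python
import math

def initial_score(doc_terms, query_terms, df, N):
    """Initial probabilistic score with P(ki|R) = 0.5 and P(ki|R-bar) = ni/N;
    each term shared by query and doc contributes log((N - ni) / ni)."""
    score = 0.0
    for t in doc_terms & query_terms:
        p_rel, p_non = 0.5, df[t] / N
        score += math.log(p_rel / (1 - p_rel))   # = 0 under the initial guess
        score += math.log((1 - p_non) / p_non)   # idf-like contribution
    return score

df = {"catcher": 100, "rye": 50}  # hypothetical document frequencies
print(initial_score({"catcher", "rye"}, {"catcher", "rye"}, df, N=10_000))
```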

Improving the Initial Ranking

  • Let

    • V : set of docs initially retrieved

    • Vi : subset of docs retrieved that contain ki

  • Reevaluate estimates:

    • P(ki | R) = Vi / V

    • P(ki | R̄) = (ni − Vi) / (N − V)

  • Repeat recursively

Improving the Initial Ranking

  • To avoid problems with V=1 and Vi=0:

    • P(ki | R) = (Vi + 0.5) / (V + 1)

    • P(ki | R̄) = (ni − Vi + 0.5) / (N − V + 1)

  • Also,

    • P(ki | R) = (Vi + ni/N) / (V + 1)

    • P(ki | R̄) = (ni − Vi + ni/N) / (N − V + 1)

Pluses and Minuses

  • Advantages:

    • Docs ranked in decreasing order of probability of relevance

  • Disadvantages:

    • need to guess initial estimates for P(ki | R)

    • method does not take into account tf and idf factors

Brief Comparison of Classic Models

  • Boolean model does not provide for partial matches and is considered to be the weakest classic model

  • Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections

  • This seems also to be the view of the research community

Extended Boolean Model

  • Boolean model is simple and elegant.

  • But, no provision for a ranking

  • As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership

  • Extend the Boolean model with the notions of partial matching and term weighting

  • Combine characteristics of the Vector model with properties of Boolean algebra

The Idea

  • The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra

  • Let,

    • q = kx ky

    • wxj = fxj * idfx associated with [kx,dj] max(idfi)

    • Further, wxj = x and wyj = y



sim(qand,dj) = 1 − sqrt( ((1 − x)² + (1 − y)²) / 2 )

The Idea:

qand = kx ∧ ky; wxj = x and wyj = y

[Figure: the document plotted at the point (x, y) = (wxj, wyj) in the unit square; sim(qand,dj) grows as (x, y) approaches the corner (1,1).]


sim(qor,dj) = sqrt( (x² + y²) / 2 )

The Idea:

qor = kx ∨ ky; wxj = x and wyj = y

[Figure: sim(qor,dj) grows with the distance of the point (x, y) from the origin (0,0).]


Generalizing the Idea

  • We can extend the previous model to consider Euclidean distances in a t-dimensional space

  • This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 ≤ p ≤ ∞ is a new parameter

Generalizing the Idea














  • sim(qor,dj) = ( (x1^p + x2^p + . . . + xm^p) / m )^(1/p)

  • sim(qand,dj) = 1 − ( ((1 − x1)^p + (1 − x2)^p + . . . + (1 − xm)^p) / m )^(1/p)

  • A generalized disjunctive query is given by

    • qor = k1 ∨p k2 ∨p . . . ∨p kt

  • A generalized conjunctive query is given by

    • qand = k1 ∧p k2 ∧p . . . ∧p kt


  • If p = 1 then (vector-like)

    • sim(qor,dj) = sim(qand,dj) = (x1 + . . . + xm) / m

  • If p = ∞ then (fuzzy-like)

    • sim(qor,dj) = max(xi)

    • sim(qand,dj) = min(xi)

  • By varying p, we can make the model behave as a vector, as a fuzzy, or as an intermediary model



  • This is quite powerful and is a good argument in favor of the extended Boolean model

  • q = (k1 ∧p k2) ∨p k3

    k1 and k2 are to be used as in vector retrieval, while the presence of k3 is required.

  • sim(q,dj) = ( ( (1 − ( ((1−x1)^p + (1−x2)^p) / 2 )^(1/p) )^p + x3^p ) / 2 )^(1/p)
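A Python sketch of the p-norm operators and of this combined query; function names and the sample weights are made up:

```python
def sim_or(xs, p):
    """p-norm disjunction: ((x1^p + ... + xm^p) / m)^(1/p)."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def sim_and(xs, p):
    """p-norm conjunction: 1 - (((1-x1)^p + ... + (1-xm)^p) / m)^(1/p)."""
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

def sim_combined(x1, x2, x3, p=2):
    """q = (k1 AND_p k2) OR_p k3."""
    return sim_or([sim_and([x1, x2], p), x3], p)

print(sim_combined(0.8, 0.6, 0.0))  # ~0.48: k3 absent, conjunct still counts
# With p = 1 both operators reduce to the average (vector-like);
# as p grows they approach max/min (fuzzy-like).
```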


  • Model is quite powerful

  • Properties are interesting and might be useful

  • Computation is somewhat complex

  • However, the distributive law does not hold for the ranking computation:

    • q1 = (k1 ∨ k2) ∧ k3

    • q2 = (k1 ∧ k3) ∨ (k2 ∧ k3)

    • sim(q1,dj) ≠ sim(q2,dj)

Vector Space Scoring

  • First cut: distance between two points

    • ( = distance between the end points of the two vectors)

  • Euclidean distance?

  • Euclidean distance is a bad idea . . .

  • . . . because Euclidean distance is large for vectors of different lengths.

Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Use angle instead of distance

  • Thought experiment: take a document d and append it to itself. Call this document d′.

  • “Semantically” d and d′ have the same content

  • The Euclidean distance between the two documents can be quite large

  • The angle between the two documents is 0, corresponding to maximal similarity.

  • Key idea: Rank documents according to angle with query.

From angles to cosines

  • The following two notions are equivalent.

    • Rank documents in decreasing order of the angle between query and document

    • Rank documents in increasing order of cosine(query,document)

  • Cosine is a monotonically decreasing function on the interval [0°, 180°]

Length normalization

  • A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm: ‖x‖2 = sqrt( Σi xi² )

  • Dividing a vector by its L2 norm makes it a unit (length) vector

  • Effect on the two documents d and d′ (d appended to itself) from earlier slide: they have identical vectors after length-normalization.
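A small illustration of L2 normalization and of the d / d′ thought experiment:

```python
import math

def l2_normalize(v):
    """Divide each component by the vector's L2 norm, giving a unit vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

d  = [3.0, 4.0]
dd = [6.0, 8.0]  # d appended to itself: every component doubles
print(l2_normalize(d))   # [0.6, 0.8]
print(l2_normalize(dd))  # [0.6, 0.8] -- identical after normalization
```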


Dot product and unit vectors

cos(q,d) = (q • d) / (|q| |d|) = Σi qi di / ( sqrt(Σi qi²) × sqrt(Σi di²) )

For length-normalized (unit) vectors, this simplifies to cos(q,d) = q • d = Σi qi di.

qi is the tf-idf weight of term i in the query; di is the tf-idf weight of term i in the document. cos(q,d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.

Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

| term      | SaS | PaP | WH |
| --------- | --- | --- | -- |
| affection | 115 | 58  | 20 |
| jealous   | 10  | 7   | 11 |
| gossip    | 2   | 0   | 6  |
| wuthering | 0   | 0   | 38 |

3 documents example contd.

Log frequency weighting (1 + log10 tf, 0 if tf = 0):

| term      | SaS  | PaP  | WH   |
| --------- | ---- | ---- | ---- |
| affection | 3.06 | 2.76 | 2.30 |
| jealous   | 2.00 | 1.85 | 2.04 |
| gossip    | 1.30 | 0    | 1.78 |
| wuthering | 0    | 0    | 2.58 |

After length normalization:

| term      | SaS   | PaP   | WH    |
| --------- | ----- | ----- | ----- |
| affection | 0.789 | 0.832 | 0.524 |
| jealous   | 0.515 | 0.555 | 0.465 |
| gossip    | 0.335 | 0     | 0.405 |
| wuthering | 0     | 0     | 0.588 |

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94

cos(SaS,WH) ≈ 0.79

cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
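Since the normalized vectors above are unit length, cosine reduces to a plain dot product; a quick check in Python:

```python
# Length-normalized log-tf vectors over (affection, jealous, gossip, wuthering).
sas = [0.789, 0.515, 0.335, 0.0]
pap = [0.832, 0.555, 0.0,   0.0]
wh  = [0.524, 0.465, 0.405, 0.588]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))  # cosine for unit vectors
print(round(dot(sas, pap), 2))  # 0.94
print(round(dot(sas, wh), 2))   # 0.79
print(round(dot(pap, wh), 2))   # 0.69
```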

tf-idf weighting has many variants

[Table: SMART notation for tf-idf variant combinations. Term frequency: n (natural), l (logarithm), a (augmented), b (boolean). Document frequency: n (no idf), t (idf), p (prob idf). Normalization: n (none), c (cosine), u (pivoted unique), b (byte size). Columns headed 'n' are acronyms for weight schemes.]

Why is the base of the log in idf immaterial?

Weighting may differ in Queries vs Documents

  • Many search engines allow for different weightings for queries vs documents

  • To denote the combination in use in an engine, we use the notation qqq.ddd with the acronyms from the previous table

  • Example: ltn.lnc means:

  • Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization …

  • Document: logarithmic tf, no idf, and cosine normalization

Is this a bad idea?

tf-idf example: ltn.lnc

Document: car insurance auto insurance

Query: best car insurance

| term      | query tf | df    | idf | query weight (ltn) | doc tf | doc weight | doc weight, normalized (lnc) | product |
| --------- | -------- | ----- | --- | ------------------ | ------ | ---------- | ---------------------------- | ------- |
| auto      | 0        | 5000  | 2.3 | 0                  | 1      | 1          | 0.52                         | 0       |
| best      | 1        | 50000 | 1.3 | 1.3                | 0      | 0          | 0                            | 0       |
| car       | 1        | 10000 | 2.0 | 2.0                | 1      | 1          | 0.52                         | 1.04    |
| insurance | 1        | 1000  | 3.0 | 3.0                | 2      | 1.3        | 0.68                         | 2.04    |

Exercise: what is N, the number of docs?

Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92

Score = 0 + 0 + 1.04 + 2.04 = 3.08

Summary – vector space ranking

  • Represent the query as a weighted tf-idf vector

  • Represent each document as a weighted tf-idf vector

  • Compute the cosine similarity score for the query vector and each document vector

  • Rank documents with respect to the query by score

  • Return the top K (e.g., K = 10) to the user

Vector Model and Web Search

  • Speeding up vector space ranking

  • Putting together a complete search system

    • Will require learning about a number of miscellaneous topics and heuristics

Efficient cosine ranking

  • Find the K docs in the collection “nearest” to the query ⇒ the K largest query-doc cosines.

  • Efficient ranking:

    • Computing a single cosine efficiently.

    • Choosing the K largest cosine values efficiently.

      • Can we do this without computing all N cosines?

Efficient cosine ranking

  • What we’re doing in effect: solving the K-nearest neighbor problem for a query vector

  • In general, we do not know how to do this efficiently for high-dimensional spaces

  • But it is solvable for short queries, and standard indexes support this well

Special case – unweighted queries

  • No weighting on query terms

    • Assume each query term occurs only once

  • Then for ranking, don’t need to normalize query vector

Faster cosine: unweighted query

Computing the K largest cosines: selection vs. sorting

  • Typically we want to retrieve the top K docs (in the cosine ranking for the query)

    • not to totally order all docs in the collection

  • Can we pick off docs with K highest cosines?

  • Let J = number of docs with nonzero cosines

    • We seek the K best of these J

Use heap for selecting top K

  • Binary tree in which each node’s value > the values of children

  • Takes 2J operations to construct; then each of the K “winners” is read off in 2 log J steps.

  • For J=1M, K=100, this is about 10% of the cost of sorting.
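In Python this selection is a one-liner over a heap; the doc scores below are made up:

```python
import heapq

def top_k(scores, k):
    """Select the K highest-scoring docs among J nonzero cosines without a
    full sort: heapq.nlargest heapifies once, then pops K winners."""
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

scores = {1: 0.3, 8: 0.9, 16: 0.7, 32: 0.5, 44: 0.1}  # hypothetical cosines
print(top_k(scores, 3))  # [(8, 0.9), (16, 0.7), (32, 0.5)]
```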









  • Primary computational bottleneck in scoring: cosine computation

  • Can we avoid all this computation?

  • Yes, but may sometimes get it wrong

    • a doc not in the top K may creep into the list of K output docs

    • Is this such a bad thing?

Cosine similarity is only a proxy

  • User has a task and a query formulation

  • Cosine matches docs to query

  • Thus cosine is itself only a proxy for user happiness

  • If we get a list of K docs “close” to the top K by cosine measure, should be ok

Generic approach

  • Find a set A of contenders, with K < |A| << N

    • A does not necessarily contain the top K, but has many docs from among the top K

    • Return the top K docs in A

  • Think of A as pruning non-contenders

  • The same approach is also used for other (non-cosine) scoring functions

  • Will look at several schemes following this approach

Index elimination

  • Basic algorithm of Fig 7.1 only considers docs containing at least one query term

  • Take this further:

    • Only consider high-idf query terms

    • Only consider docs containing many query terms

High-idf query terms only

  • For a query such as catcher in the rye

  • Only accumulate scores from catcher and rye

  • Intuition: “in” and “the” contribute little to the scores and don’t alter rank-ordering much

  • Benefit:

    • Postings of low-idf terms have many docs ⇒ these (many) docs get eliminated from A

Docs containing many query terms

  • Any doc with at least one query term is a candidate for the top K output list

  • For multi-term queries, only compute scores for docs containing several of the query terms

    • Say, at least 3 out of 4

    • Imposes a “soft conjunction” on queries seen on web search engines (early Google)

  • Easy to implement in postings traversal























3 of 4 query terms

[Figure: four postings lists traversed in parallel; docIDs 8, 16, and 32 are the only ones appearing in at least 3 of the 4 lists.]

Scores only computed for 8, 16 and 32.
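A sketch of this “soft conjunction” in Python; the postings are hypothetical, chosen so that docs 8, 16, and 32 pass a 3-of-4 threshold, as in the figure:

```python
from collections import Counter

def candidates(postings, query_terms, min_match):
    """Keep only docs containing at least min_match distinct query terms."""
    counts = Counter(doc for t in query_terms for doc in postings.get(t, ()))
    return {doc for doc, c in counts.items() if c >= min_match}

postings = {  # hypothetical postings lists
    "antony":    {1, 2, 8, 16, 32},
    "brutus":    {2, 4, 8, 16, 32},
    "caesar":    {8, 16, 32, 64},
    "calpurnia": {16, 32},
}
print(sorted(candidates(postings, postings.keys(), 3)))  # [8, 16, 32]
```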

Champion lists

  • Precompute for each dictionary term t, the r docs of highest weight in t’s postings

    • Call this the champion list for t

    • (aka fancy list or top docs for t)

  • Note that r has to be chosen at index time

  • At query time, only compute scores for docs in the champion list of some query term

    • Pick the K top-scoring docs from amongst these
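A Python sketch of champion lists; all names and weights are invented for illustration:

```python
def build_champion_lists(term_weights, r):
    """Index time: for each term t, keep only the r docs of highest weight
    in t's postings (t's champion list)."""
    return {t: sorted(w, key=w.get, reverse=True)[:r]
            for t, w in term_weights.items()}

def candidate_set(champions, query_terms):
    """Query time: score only docs on some query term's champion list."""
    return {d for t in query_terms for d in champions.get(t, [])}

weights = {"car": {1: 0.9, 2: 0.1, 3: 0.5}, "auto": {2: 0.7, 3: 0.2}}
champs = build_champion_lists(weights, r=2)
print(champs)                                  # {'car': [1, 3], 'auto': [2, 3]}
print(candidate_set(champs, ["car", "auto"]))  # {1, 2, 3}
```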

Static quality scores

  • We want top-ranking documents to be both relevant and authoritative

  • Relevance is being modeled by cosine scores

  • Authority is typically a query-independent property of a document

  • Examples of authority signals

    • Wikipedia among websites

    • Articles in certain newspapers

    • A paper with many citations

    • Many diggs, Y!buzzes or marks

    • (Pagerank)


Modeling authority

  • Assign to each document d a query-independent quality score in [0,1]

    • Denote this by g(d)

  • Thus, a quantity like the number of citations is scaled into [0,1]

    • Exercise: suggest a formula for this.

Net score

  • Consider a simple total score combining cosine relevance and authority

  • net-score(q,d) = g(d) + cosine(q,d)

    • Can use some other linear combination than an equal weighting

    • Indeed, any function of the two “signals” of user happiness – more later

  • Now we seek the top K docs by net score
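As a sketch, with an explicit mixing parameter (the alpha parameterization is an assumption; the slide’s g(d) + cosine(q,d) is the equal-weight case up to scaling):

```python
def net_score(g_d, cos_qd, alpha=0.5):
    """Linear combination of static quality g(d) in [0,1] and cosine relevance."""
    return alpha * g_d + (1 - alpha) * cos_qd

print(net_score(g_d=0.8, cos_qd=0.6))  # 0.7
```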

Top K by net score – fast methods

  • First idea: Order all postings by g(d)

  • Key: this is a common ordering for all postings

  • Thus, can concurrently traverse query terms’ postings for

    • Postings intersection

    • Cosine score computation

  • Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)

Why order postings by g(d)?

  • Under g(d)-ordering, top-scoring docs likely to appear early in postings traversal

  • In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early

    • Short of computing scores for all docs in postings

Champion lists in g(d)-ordering

  • Can combine champion lists with g(d)-ordering

  • Maintain for each term a champion list of the r docs with highest g(d) + tf-idf(t,d)

  • Seek top-K results from only the docs in these champion lists

High and low lists

  • For each term, we maintain two postings lists called high and low

    • Think of high as the champion list

  • When traversing postings on a query, only traverse high lists first

    • If we get more than K docs, select the top K and stop

    • Else proceed to get docs from the low lists

  • Can be used even for simple cosine scores, without global quality g(d)

  • A means for segmenting index into two tiers

Impact-ordered postings

  • We only want to compute scores for docs for which wf(t,d) is high enough

  • We sort each postings list by wf(t,d)

  • Now: not all postings in a common order!

  • How do we compute scores in order to pick off top K?

    • Two ideas follow

1. Early termination

  • When traversing t’s postings, stop early after either

    • a fixed number r of docs

    • wft,d drops below some threshold

  • Take the union of the resulting sets of docs

    • One from the postings of each query term

  • Compute only the scores for docs in this union

2. idf-ordered terms

  • When considering the postings of query terms

  • Look at them in order of decreasing idf

    • High idf terms likely to contribute most to score

  • As we update score contribution from each query term

    • Stop if doc scores relatively unchanged

  • Can apply to cosine or some other net scores

Cluster pruning: preprocessing

  • Pick N docs at random: call these leaders

  • For every other doc, pre-compute nearest leader

    • Docs attached to a leader: its followers;

    • Likely: each leader has ~ N followers.

Cluster pruning: query processing

  • Process a query as follows:

    • Given query Q, find its nearest leader L.

    • Seek K nearest docs from among L’s followers.
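A compact Python sketch of both phases; cosine is used as the proximity measure, and all names are illustrative:

```python
import math, random

def cos_sim(u, v):
    """Cosine similarity; assumes nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def preprocess(docs):
    """Pick ~sqrt(N) random leaders; attach every doc to its nearest leader."""
    leaders = random.sample(sorted(docs), max(1, round(math.sqrt(len(docs)))))
    followers = {l: [] for l in leaders}
    for d, vec in docs.items():
        nearest = max(leaders, key=lambda l: cos_sim(vec, docs[l]))
        followers[nearest].append(d)
    return leaders, followers

def query_top_k(q, docs, leaders, followers, k):
    """Find the query's nearest leader L, then the K nearest of L's followers."""
    L = max(leaders, key=lambda l: cos_sim(q, docs[l]))
    return sorted(followers[L], key=lambda d: cos_sim(q, docs[d]),
                  reverse=True)[:k]
```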





Why use random sampling

  • Fast

  • Leaders reflect data distribution

General variants

  • Have each follower attached to b1=3 (say) nearest leaders.

  • From query, find b2=4 (say) nearest leaders and their followers.

  • Can recur on leader/follower construction.

Putting it all together

[Figure: architecture of a complete search system, combining the parsing, indexing, and ranking components discussed above.]