
Information Retrieval

CSE 8337 (Part B)

Spring 2009

Some Material for these slides obtained from:

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/

Data Mining Introductory and Advanced Topics by Margaret H. Dunham

http://www.engr.smu.edu/~mhd/book

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze

http://informationretrieval.org



CSE 8337 Outline

  • Introduction

  • Simple Text Processing

  • Boolean Queries

  • Web Searching/Crawling

  • Indexes

  • Vector Space Model

  • Matching

  • Evaluation




Modeling TOC (Vector Space and Other Models)

  • Introduction

  • Classic IR Models

    • Boolean Model

    • Vector Model

    • Probabilistic Model

  • Extended Boolean Model

  • Vector Space Scoring

  • Vector Model and Web Search


IR Models

[Figure: taxonomy of IR models, organized by user task.]

  • User task

    • Retrieval: Ad hoc, Filtering

    • Browsing

  • IR models

    • Classic Models: Boolean, Vector, Probabilistic

    • Set Theoretic: Fuzzy, Extended Boolean

    • Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks

    • Probabilistic: Inference Network, Belief Network

    • Structured Models: Non-Overlapping Lists, Proximal Nodes

    • Browsing: Flat, Structure Guided, Hypertext



The Boolean Model

  • Simple model based on set theory

  • Queries specified as boolean expressions

    • precise semantics and neat formalism

  • Terms are either present or absent. Thus, wij ∈ {0,1}

  • Consider

    • q = ka ∧ (kb ∨ ¬kc)

    • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)

    • qcc = (1,1,0) is a conjunctive component
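A minimal sketch (not from the slides) of how this query and its disjunctive normal form can be matched against a document under the Boolean model: a document is represented simply as the set of index terms it contains, and it is retrieved if its 0/1 weights over (ka, kb, kc) equal some conjunctive component of qdnf. The function and variable names are illustrative.

```python
# Sketch only: Boolean-model matching against the DNF of q = ka AND (kb OR NOT kc).
# g_i(dj) is 1 iff term ki appears in document dj.

Q_TERMS = ("ka", "kb", "kc")
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}   # conjunctive components from the slide

def sim_boolean(doc_terms, q_terms=Q_TERMS, qdnf=Q_DNF):
    """Return 1 if the document's binary weights over the query terms
    coincide with some conjunctive component of qdnf, else 0."""
    g = tuple(1 if t in doc_terms else 0 for t in q_terms)
    return 1 if g in qdnf else 0

print(sim_boolean({"ka", "kb"}))   # 1 -> matches component (1, 1, 0)
print(sim_boolean({"kb", "kc"}))   # 0 -> ka is missing, so no component matches
```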


The Boolean Model

[Figure: Venn diagram over the index terms ka, kb, kc; the regions labeled (1,1,1), (1,1,0) and (1,0,0) are the conjunctive components that satisfy the query.]

  • q = ka ∧ (kb ∨ ¬kc)

  • sim(q,dj) =

    1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ki, gi(dj) = gi(qcc))

    0 otherwise



Drawbacks of the Boolean Model

  • Retrieval based on binary decision criteria with no notion of partial matching

  • No ranking of the documents is provided

  • Information need has to be translated into a Boolean expression

  • The Boolean queries formulated by the users are most often too simplistic

  • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query



The Vector Model

  • Use of binary weights is too limiting

  • Non-binary weights provide consideration for partial matches

  • These term weights are used to compute a degree of similarity between a query and each document

  • Ranked set of documents provides for better matching



The Vector Model

  • wij > 0 whenever ki appears in dj

  • wiq >= 0 associated with the pair (ki,q)

  • dj = (w1j, w2j, ..., wtj)

  • q = (w1q, w2q, ..., wtq)

  • Each index term ki is associated with a unit vector i

  • The unit vectors i and j are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)

  • The t unit vectors form an orthonormal basis for a t-dimensional space in which queries and documents are represented as weighted vectors



The Vector Model

[Figure: the document vector dj and the query vector q, with angle θ between them.]

  • sim(q,dj) = cos(θ) = (dj · q) / (|dj| × |q|) = Σi wij × wiq / (|dj| × |q|)

  • Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q,dj) ≤ 1

  • A document is retrieved even if it matches the query terms only partially



Weights wij and wiq ?

  • One approach is to examine the frequency of occurrence of a word in a document:

  • Absolute frequency:

    • tf factor, the term frequency within a document

    • freqi,j - raw frequency of ki within dj

    • Both high-frequency and low-frequency terms may not actually be significant

  • Relative frequency: tf divided by number of words in document

  • Normalized frequency:

    fi,j = (freqi,j)/(maxl freql,j)



Inverse Document Frequency

  • Importance of term may depend more on how it can distinguish between documents.

  • Quantification of inter-document separation

  • Dissimilarity not similarity

  • idf factor, the inverse document frequency



IDF

  • Let N be the total number of docs in the collection

  • Let ni be the number of docs which contain ki

  • The idf factor is computed as

    • idfi = log (N/ni)

    • the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.

  • IDF Example (using log base 10):

    • N=1000, n1=100, n2=500, n3=800

    • idf1 = log(1000/100) = 3 − 2 = 1

    • idf2 = log(1000/500) = 3 − 2.7 = 0.3

    • idf3 = log(1000/800) = 3 − 2.9 = 0.1



The Vector Model

  • The best term-weighting schemes take both into account.

  • wij = fi,j * log(N/ni)

  • This strategy is called a tf-idf weighting scheme
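A minimal sketch of this tf-idf weighting, with fi,j computed as the raw frequency normalized by the document's most frequent term and idf taken as log10(N/ni). The toy collection statistics and the function name are illustrative, not from the slides.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, doc_freq, N):
    """w_ij = f_ij * log10(N / n_i), with f_ij = freq_ij / max_l freq_lj."""
    counts = Counter(doc_tokens)
    max_freq = max(counts.values())
    return {term: (freq / max_freq) * math.log10(N / doc_freq[term])
            for term, freq in counts.items()}

# Illustrative collection statistics (n_i per term), echoing the idf example above.
doc_freq = {"information": 100, "retrieval": 500, "model": 800}
print(tf_idf(["information", "retrieval", "retrieval", "model"], doc_freq, N=1000))
# "retrieval" has tf = 1.0 but idf ≈ 0.3; "information" has tf = 0.5 and idf = 1.0
```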



The Vector Model

  • For the query term weights, a suggestion is

    • wiq = (0.5 + 0.5 × freqi,q / maxl(freql,q)) × log(N/ni)

  • The vector model with tf-idf weights is a good ranking strategy with general collections

  • The vector model is usually as good as any known ranking alternatives.

  • It is also simple and fast to compute.



The Vector Model

  • Advantages:

    • term-weighting improves quality of the answer set

    • partial matching allows retrieval of docs that approximate the query conditions

    • cosine ranking formula sorts documents according to degree of similarity to the query

  • Disadvantages:

    • Assumes independence of index terms (??); not clear that this is bad though


The Vector Model: Examples I, II and III

[Figures: three worked examples showing seven documents d1–d7 in the three-dimensional term space spanned by k1, k2 and k3; the accompanying weight tables and query are not reproduced in this transcript.]



Probabilistic Model

  • Objective: to capture the IR problem using a probabilistic framework

  • Given a user query, there is an ideal answer set

  • Querying as specification of the properties of this ideal answer set (clustering)

  • But, what are these properties?

  • Guess at the beginning what they could be (i.e., guess initial description of ideal answer set)

  • Improve by iteration



Probabilistic Model

  • An initial set of documents is retrieved somehow

  • User inspects these docs looking for the relevant ones (in truth, only top 10-20 need to be inspected)

  • IR system uses this information to refine description of ideal answer set

  • By repeating this process, it is expected that the description of the ideal answer set will improve

  • Keep in mind that the description of the ideal answer set must be guessed at the very beginning

  • Description of ideal answer set is modeled in probabilistic terms



Probabilistic Ranking Principle

  • Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). Ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.

  • But,

    • how to compute probabilities?

    • what is the sample space?



The Ranking

  • Probabilistic ranking computed as:

    • sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)

    • This is the odds of the document dj being relevant

    • Taking the odds minimizes the probability of an erroneous judgement

  • Definition:

    • wij ∈ {0,1}

    • P(R | dj): probability that the given doc is relevant

    • P(¬R | dj): probability that the given doc is not relevant



The Ranking

  • sim(dj,q) = P(R | dj) / P(¬R | dj) = [P(dj | R) × P(R)] / [P(dj | ¬R) × P(¬R)] ~ P(dj | R) / P(dj | ¬R)

  • P(dj | R): probability of randomly selecting the document dj from the set R of relevant documents



The Ranking

  • sim(dj,q) ~ P(dj | R) / P(dj | ¬R) ~ [ ∏gi(dj)=1 P(ki | R) × ∏gi(dj)=0 P(¬ki | R) ] / [ ∏gi(dj)=1 P(ki | ¬R) × ∏gi(dj)=0 P(¬ki | ¬R) ]

  • P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents



The Ranking

  • sim(dj,q)

    ~ log { [ ∏ P(ki | R) × ∏ P(¬ki | R) ] / [ ∏ P(ki | ¬R) × ∏ P(¬ki | ¬R) ] }

    ~ K × Σi [ log( P(ki | R) / (1 − P(ki | R)) ) + log( (1 − P(ki | ¬R)) / P(ki | ¬R) ) ]

    where P(¬ki | R) = 1 − P(ki | R) and P(¬ki | ¬R) = 1 − P(ki | ¬R)



The Initial Ranking

  • sim(dj,q) ~ Σi wiq × wij × ( log( P(ki | R) / (1 − P(ki | R)) ) + log( (1 − P(ki | ¬R)) / P(ki | ¬R) ) )

  • How do we estimate the probabilities P(ki | R) and P(ki | ¬R)?

  • Estimates based on assumptions:

    • P(ki | R) = 0.5

    • P(ki | ¬R) = ni / N

    • Use this initial guess to retrieve an initial ranking

    • Improve upon this initial ranking



Improving the Initial Ranking

  • Let

    • V : set of docs initially retrieved

    • Vi : subset of docs retrieved that contain ki

  • Reevaluate estimates:

    • P(ki | R) = Vi / V

    • P(ki | ¬R) = (ni − Vi) / (N − V)

  • Repeat recursively



Improving the Initial Ranking

  • To avoid problems with V=1 and Vi=0:

    • P(ki | R) = (Vi + 0.5) / (V + 1)

    • P(ki | ¬R) = (ni − Vi + 0.5) / (N − V + 1)

  • Also,

    • P(ki | R) = (Vi + ni/N) / (V + 1)

    • P(ki | ¬R) = (ni − Vi + ni/N) / (N − V + 1)



Pluses and Minuses

  • Advantages:

    • Docs ranked in decreasing order of probability of relevance

  • Disadvantages:

    • need to guess initial estimates for P(ki | R)

    • method does not take into account tf and idf factors



Brief Comparison of Classic Models

  • Boolean model does not provide for partial matches and is considered to be the weakest classic model

  • Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections

  • This seems also to be the view of the research community



Extended Boolean Model

  • Boolean model is simple and elegant.

  • But, no provision for a ranking

  • As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership

  • Extend the Boolean model with the notions of partial matching and term weighting

  • Combine characteristics of the Vector model with properties of Boolean algebra



The Idea

  • The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra

  • Let,

    • q = kx ∧ ky

    • wxj = (fxj × idfx) / maxi(idfi) be the weight associated with the pair [kx, dj]

    • Further, wxj = x and wyj = y


The Idea:

  • qand = kx ∧ ky; wxj = x and wyj = y

  • sim(qand,dj) = 1 − sqrt( ((1 − x)² + (1 − y)²) / 2 )

[Figure: dj plotted at (x, y) in the kx–ky plane; for the AND query, similarity grows as dj approaches the point (1,1).]

The Idea:

  • qor = kx ∨ ky; wxj = x and wyj = y

  • sim(qor,dj) = sqrt( (x² + y²) / 2 )

[Figure: dj plotted at (x, y) in the kx–ky plane; for the OR query, similarity grows as dj moves away from the point (0,0).]



Generalizing the Idea

  • We can extend the previous model to consider Euclidean distances in a t-dimensional space

  • This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 ≤ p ≤ ∞ is a new parameter


Generalizing the Idea

  • A generalized disjunctive query is given by

    • qor = k1 ∨ k2 ∨ . . . ∨ km

  • A generalized conjunctive query is given by

    • qand = k1 ∧ k2 ∧ . . . ∧ km

  • sim(qor,dj) = ( (x1^p + x2^p + . . . + xm^p) / m )^(1/p)

  • sim(qand,dj) = 1 − ( ((1 − x1)^p + (1 − x2)^p + . . . + (1 − xm)^p) / m )^(1/p)



Properties

  • If p = 1 then (Vector like)

    • sim(qor,dj) = sim(qand,dj) = (x1 + . . . + xm) / m

  • If p = ∞ then (Fuzzy like)

    • sim(qor,dj) = max(xi)

    • sim(qand,dj) = min(xi)

  • By varying p, we can make the model behave as a vector, as a fuzzy, or as an intermediary model



Properties

  • This is quite powerful and is a good argument in favor of the extended Boolean model

  • q = (k1 ∧ k2) ∨ k3

    k1 and k2 are to be used as in a vector retrieval while the presence of k3 is required.

  • sim(q,dj) = sqrt( ( (1 − sqrt( ((1 − x1)² + (1 − x2)²) / 2 ))² + x3² ) / 2 )
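A minimal sketch of the p-norm similarities and of the composite query above, assuming the term weights x1, x2, x3 are already computed in [0,1]; the function names are illustrative, not from the slides.

```python
def sim_or(xs, p=2.0):
    """p-norm OR:  ((x1^p + ... + xm^p) / m) ** (1/p)."""
    m = len(xs)
    return (sum(x ** p for x in xs) / m) ** (1.0 / p)

def sim_and(xs, p=2.0):
    """p-norm AND: 1 - (((1-x1)^p + ... + (1-xm)^p) / m) ** (1/p)."""
    m = len(xs)
    return 1.0 - (sum((1.0 - x) ** p for x in xs) / m) ** (1.0 / p)

def sim_q(x1, x2, x3, p=2.0):
    """q = (k1 AND k2) OR k3, evaluated bottom-up as on the slide."""
    return sim_or([sim_and([x1, x2], p), x3], p)

print(round(sim_q(0.9, 0.8, 1.0), 3))   # all three terms strong
print(round(sim_q(0.9, 0.8, 0.0), 3))   # k3 absent: score drops but is still nonzero
```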



Conclusions

  • Model is quite powerful

  • Properties are interesting and might be useful

  • Computation is somewhat complex

  • However, distributivity does not hold for the ranking computation:

    • q1 = (k1 ∨ k2) ∧ k3

    • q2 = (k1 ∧ k3) ∨ (k2 ∧ k3)

    • sim(q1,dj) ≠ sim(q2,dj)


Vector space scoring

Vector Space Scoring

  • First cut: distance between two points

    • ( = distance between the end points of the two vectors)

  • Euclidean distance?

  • Euclidean distance is a bad idea . . .

  • . . . because Euclidean distance is large for vectors of different lengths.



Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.



Use angle instead of distance

  • Thought experiment: take a document d and append it to itself. Call this document d′.

  • “Semantically” d and d′ have the same content

  • The Euclidean distance between the two documents can be quite large

  • The angle between the two documents is 0, corresponding to maximal similarity.

  • Key idea: Rank documents according to angle with query.



From angles to cosines

  • The following two notions are equivalent.

    • Rank documents in decreasing order of the angle between query and document

    • Rank documents in increasing order of cosine(query,document)

  • Cosine is a monotonically decreasing function for the interval [0°, 180°]



Length normalization

  • A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm: ‖x‖2 = sqrt( Σi xi² )

  • Dividing a vector by its L2 norm makes it a unit (length) vector

  • Effect on the two documents d and d′ (d appended to itself) from earlier slide: they have identical vectors after length-normalization.


cosine(query,document)

  • cos(q,d) = (q · d) / (|q| |d|) = Σi qi di / ( sqrt(Σi qi²) × sqrt(Σi di²) ), i.e., the dot product of the two unit (length-normalized) vectors

  • qi is the tf-idf weight of term i in the query

  • di is the tf-idf weight of term i in the document

  • cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
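A minimal sketch of this cosine computation over sparse tf-idf weight vectors, including the L2 length normalization from the previous slide; the example weights are illustrative, not the slides' data.

```python
import math

def l2_normalize(vec):
    """Divide each component by the vector's L2 length, yielding a unit vector."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

def cosine(q_vec, d_vec):
    """cos(q,d) as the dot product of the two length-normalized weight vectors."""
    q, d = l2_normalize(q_vec), l2_normalize(d_vec)
    return sum(w * d.get(t, 0.0) for t, w in q.items())

# Illustrative sparse tf-idf vectors (not the slides' data).
q = {"best": 1.3, "car": 2.0, "insurance": 3.0}
d = {"car": 1.0, "insurance": 1.3, "auto": 1.0}
print(round(cosine(q, d), 3))   # ≈ 0.801
```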



Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

[Table of term frequencies (counts) for the three novels not reproduced.]



3 documents example contd.

[Tables of log-frequency weights and of the weights after normalization not reproduced.]

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94

cos(SaS,WH) ≈ 0.79

cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?



tf-idf weighting has many variants

[Table of tf-idf weighting variants not reproduced.] Columns headed ‘n’ are acronyms for weight schemes.

Why is the base of the log in idf immaterial?



Weighting may differ in Queries vs Documents

  • Many search engines allow for different weightings for queries vs documents

  • To denote the combination in use in an engine, we use the notation qqq.ddd with the acronyms from the previous table

  • Example: ltn.lnc means:

    • Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization

    • Document: logarithmic tf, no idf and cosine normalization

Is this a bad idea?



tf-idf example: ltn.lnc

Document: car insurance auto insurance

Query: best car insurance

Exercise: what is N, the number of docs?

Doc length = sqrt(1² + 1² + 1.3²) ≈ 1.92

Score = 0+0+1.04+2.04 = 3.08
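A minimal sketch of the ltn.lnc computation for this example. The collection size N and the document-frequency values for the four terms are assumed for illustration, since the slide's table is not reproduced in this transcript; with these assumptions the score comes out close to the 3.08 above.

```python
import math

N = 1_000_000                                   # assumed collection size
df = {"auto": 5_000, "best": 50_000,            # assumed document frequencies
      "car": 10_000, "insurance": 1_000}

def log_tf(tf):
    return 0.0 if tf == 0 else 1.0 + math.log10(tf)

def ltn_query(q_tf):
    """Query weights: logarithmic tf, idf, no normalization."""
    return {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in q_tf.items()}

def lnc_doc(d_tf):
    """Document weights: logarithmic tf, no idf, cosine normalization."""
    w = {t: log_tf(tf) for t, tf in d_tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))   # doc length ≈ 1.92 here
    return {t: x / norm for t, x in w.items()}

query = ltn_query({"best": 1, "car": 1, "insurance": 1})
doc = lnc_doc({"car": 1, "insurance": 2, "auto": 1})    # "car insurance auto insurance"
print(round(sum(w * doc.get(t, 0.0) for t, w in query.items()), 2))
# ≈ 3.07 with these assumed df values (the slide rounds per-term products to get 3.08)
```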



Summary – vector space ranking

  • Represent the query as a weighted tf-idf vector

  • Represent each document as a weighted tf-idf vector

  • Compute the cosine similarity score for the query vector and each document vector

  • Rank documents with respect to the query by score

  • Return the top K (e.g., K = 10) to the user



Vector Model and Web Search

  • Speeding up vector space ranking

  • Putting together a complete search system

    • Will require learning about a number of miscellaneous topics and heuristics



Efficient cosine ranking

  • Find the K docs in the collection “nearest” to the query, i.e., the K largest query-doc cosines.

  • Efficient ranking:

    • Computing a single cosine efficiently.

    • Choosing the K largest cosine values efficiently.

      • Can we do this without computing all N cosines?



Efficient cosine ranking

  • What we’re doing in effect: solving the K-nearest neighbor problem for a query vector

  • In general, we do not know how to do this efficiently for high-dimensional spaces

  • But it is solvable for short queries, and standard indexes support this well



Special case – unweighted queries

  • No weighting on query terms

    • Assume each query term occurs only once

  • Then for ranking, don’t need to normalize query vector



Faster cosine: unweighted query
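The algorithm figure for this slide is not reproduced. Below is a minimal term-at-a-time sketch under the stated assumptions (each query term occurs once and the query vector is never normalized), with an assumed in-memory postings layout; names and data are illustrative.

```python
import heapq

# Assumed in-memory index: term -> list of (doc_id, length-normalized doc weight).
postings = {
    "catcher": [(1, 0.7), (4, 0.2)],
    "rye":     [(1, 0.5), (2, 0.9), (4, 0.4)],
}

def fast_cosine_unweighted(query_terms, postings, k):
    """Term-at-a-time accumulation: each query term adds its doc weight once;
    the unweighted query never needs to be length-normalized for ranking."""
    scores = {}
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

print(fast_cosine_unweighted({"catcher", "rye"}, postings, k=2))
# -> [(1, 1.2), (2, 0.9)]
```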



Computing the K largest cosines: selection vs. sorting

  • Typically we want to retrieve the top K docs (in the cosine ranking for the query)

    • not to totally order all docs in the collection

  • Can we pick off docs with K highest cosines?

  • Let J = number of docs with nonzero cosines

    • We seek the K best of these J



Use heap for selecting top K

  • Binary tree in which each node’s value > the values of children

  • Takes 2J operations to construct, then each of K “winners” read off in 2log J steps.

  • For J=1M, K=100, this is about 10% of the cost of sorting.

[Figure: example binary heap with node values 1, .9, .3, .3, .8, .1, .1, where each parent is ≥ its children.]
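A minimal sketch of this heap-based selection using Python's heapq: build the heap over all J nonzero scores, then pop only the K winners instead of sorting everything; the score values mirror the example figure.

```python
import heapq

# Scores for the J docs with nonzero cosines (values from the example heap above).
scores = {1: 1.0, 2: 0.9, 3: 0.3, 4: 0.3, 5: 0.8, 6: 0.1, 7: 0.1}
K = 3

# Build a max-heap in O(J) by negating scores, then pop the K winners,
# each pop costing O(log J) -- no need to sort all J docs.
heap = [(-score, doc) for doc, score in scores.items()]
heapq.heapify(heap)
top_k = [(doc, -neg) for neg, doc in (heapq.heappop(heap) for _ in range(K))]
print(top_k)   # [(1, 1.0), (2, 0.9), (5, 0.8)]
```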



Bottlenecks

  • Primary computational bottleneck in scoring: cosine computation

  • Can we avoid all this computation?

  • Yes, but may sometimes get it wrong

    • a doc not in the top K may creep into the list of K output docs

    • Is this such a bad thing?



Cosine similarity is only a proxy

  • User has a task and a query formulation

  • Cosine matches docs to query

  • Thus cosine is anyway a proxy for user happiness

  • If we get a list of K docs “close” to the top K by cosine measure, should be ok



Generic approach

  • Find a set A of contenders, with K < |A| << N

    • A does not necessarily contain the top K, but has many docs from among the top K

    • Return the top K docs in A

  • Think of A as pruning non-contenders

  • The same approach is also used for other (non-cosine) scoring functions

  • Will look at several schemes following this approach



Index elimination

  • Basic algorithm of Fig 7.1 only considers docs containing at least one query term

  • Take this further:

    • Only consider high-idf query terms

    • Only consider docs containing many query terms



High-idf query terms only

  • For a query such as catcher in the rye

  • Only accumulate scores from catcher and rye

  • Intuition: “in” and “the” contribute little to the scores and don’t alter rank-ordering much

  • Benefit:

    • Postings of low-idf terms have many docs, so these (many) docs get eliminated from A



Docs containing many query terms

  • Any doc with at least one query term is a candidate for the top K output list

  • For multi-term queries, only compute scores for docs containing several of the query terms

    • Say, at least 3 out of 4

    • Imposes a “soft conjunction” on queries seen on web search engines (early Google)

  • Easy to implement in postings traversal
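A minimal sketch of this “soft conjunction” filter, using the four postings lists reconstructed on the next slide; only docs appearing in at least 3 of the 4 lists become scoring candidates. The data layout is illustrative.

```python
from collections import Counter

# Postings (doc IDs only) for the four query terms, as on the next slide.
postings = {
    "antony":    [3, 4, 8, 16, 32, 64, 128],
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16, 32],
}

def soft_conjunction_candidates(postings, min_terms=3):
    """Keep only docs that contain at least `min_terms` of the query terms."""
    hits = Counter(doc for plist in postings.values() for doc in plist)
    return sorted(doc for doc, count in hits.items() if count >= min_terms)

print(soft_conjunction_candidates(postings))   # [8, 16, 32]
```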


3 of 4 query terms

  • Antony: 3 → 4 → 8 → 16 → 32 → 64 → 128

  • Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128

  • Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34

  • Calpurnia: 13 → 16 → 32

Scores only computed for 8, 16 and 32.



Champion lists

  • Precompute for each dictionary term t, the r docs of highest weight in t’s postings

    • Call this the champion list for t

    • (aka fancy list or top docs for t)

  • Note that r has to be chosen at index time

  • At query time, only compute scores for docs in the champion list of some query term

    • Pick the K top-scoring docs from amongst these
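A minimal sketch of champion lists: build them once at index time, then restrict query-time scoring to the union of the query terms' champion lists; the postings layout and weights are illustrative, not from the slides.

```python
# Assumed postings with weights: term -> {doc_id: tf-idf weight}.
postings = {
    "car":       {1: 0.9, 2: 0.1, 3: 0.4, 4: 0.7},
    "insurance": {2: 0.85, 3: 0.6, 5: 0.2},
}

def build_champion_lists(postings, r):
    """Index time: for each term, keep the r docs of highest weight."""
    return {t: sorted(w, key=w.get, reverse=True)[:r] for t, w in postings.items()}

def champion_search(query_terms, postings, champions, k):
    """Query time: score only docs found in some query term's champion list."""
    candidates = {d for t in query_terms for d in champions.get(t, [])}
    scored = [(d, sum(postings[t].get(d, 0.0) for t in query_terms)) for d in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

champions = build_champion_lists(postings, r=2)
print(champion_search(["car", "insurance"], postings, champions, k=3))
# top-3 ≈ [(3, 1.0), (2, 0.95), (1, 0.9)]; doc 5 never becomes a candidate
```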



Static quality scores

  • We want top-ranking documents to be both relevant and authoritative

  • Relevance is being modeled by cosine scores

  • Authority is typically a query-independent property of a document

  • Examples of authority signals

    • Wikipedia among websites

    • Articles in certain newspapers

    • A paper with many citations

    • Many diggs, Y!buzzes or del.icio.us marks

    • (Pagerank)

(All of the above are quantitative signals.)



Modeling authority

  • Assign a query-independent quality score in [0,1] to each document d

    • Denote this by g(d)

  • Thus, a quantity like the number of citations is scaled into [0,1]

    • Exercise: suggest a formula for this.



Net score

  • Consider a simple total score combining cosine relevance and authority

  • net-score(q,d) = g(d) + cosine(q,d)

    • Can use some other linear combination than an equal weighting

    • Indeed, any function of the two “signals” of user happiness – more later

  • Now we seek the top K docs by net score



Top K by net score – fast methods

  • First idea: Order all postings by g(d)

  • Key: this is a common ordering for all postings

  • Thus, can concurrently traverse query terms’ postings for

    • Postings intersection

    • Cosine score computation

  • Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)



Why order postings by g(d)?

  • Under g(d)-ordering, top-scoring docs likely to appear early in postings traversal

  • In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early

    • Short of computing scores for all docs in postings



Champion lists in g(d)-ordering

  • Can combine champion lists with g(d)-ordering

  • Maintain for each term a champion list of the r docs with highest g(d) + tf-idft,d

  • Seek top-K results from only the docs in these champion lists



High and low lists

  • For each term, we maintain two postings lists called high and low

    • Think of high as the champion list

  • When traversing postings on a query, only traverse high lists first

    • If we get more than K docs, select the top K and stop

    • Else proceed to get docs from the low lists

  • Can be used even for simple cosine scores, without global quality g(d)

  • A means for segmenting index into two tiers



Impact-ordered postings

  • We only want to compute scores for docs for which wft,d is high enough

  • We sort each postings list by wft,d

  • Now: not all postings in a common order!

  • How do we compute scores in order to pick off top K?

    • Two ideas follow



1. Early termination

  • When traversing t’s postings, stop early after either

    • a fixed number r of docs

    • wft,d drops below some threshold

  • Take the union of the resulting sets of docs

    • One from the postings of each query term

  • Compute only the scores for docs in this union



2. idf-ordered terms

  • When considering the postings of query terms

  • Look at them in order of decreasing idf

    • High idf terms likely to contribute most to score

  • As we update score contribution from each query term

    • Stop if doc scores relatively unchanged

  • Can apply to cosine or some other net scores



Cluster pruning: preprocessing

  • Pick N docs at random: call these leaders

  • For every other doc, pre-compute nearest leader

    • Docs attached to a leader: its followers;

    • Likely: each leader has ~ √N followers.



Cluster pruning: query processing

  • Process a query as follows:

    • Given query Q, find its nearest leader L.

    • Seek K nearest docs from among L’s followers.
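A minimal sketch of cluster pruning (preprocessing plus query routing), assuming dense document vectors and a small cosine helper; the random data and function names are illustrative rather than the slides' pseudocode.

```python
import math, random

random.seed(0)

def cos(u, v):
    """Cosine between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def preprocess(docs):
    """Pick ~sqrt(N) random leaders and attach every doc to its nearest leader."""
    leaders = random.sample(list(docs), max(1, int(math.sqrt(len(docs)))))
    followers = {l: [] for l in leaders}
    for d, vec in docs.items():
        nearest = max(leaders, key=lambda l: cos(vec, docs[l]))
        followers[nearest].append(d)
    return followers

def search(q_vec, docs, followers, k):
    """Route the query to its nearest leader, then rank only that leader's followers."""
    leader = max(followers, key=lambda l: cos(q_vec, docs[l]))
    return sorted(followers[leader], key=lambda d: cos(q_vec, docs[d]), reverse=True)[:k]

docs = {f"d{i}": [random.random() for _ in range(3)] for i in range(16)}
followers = preprocess(docs)
print(search([1.0, 0.2, 0.0], docs, followers, k=3))
```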



Visualization

[Figure: query point, leader points, and their attached follower points in the document space.]



Why use random sampling

  • Fast

  • Leaders reflect data distribution



General variants

  • Have each follower attached to b1=3 (say) nearest leaders.

  • From query, find b2=4 (say) nearest leaders and their followers.

  • Can recur on leader/follower construction.



Putting it all together

