

Information Retrieval

CSE 8337 (Part B)

Spring 2009

Some Material for these slides obtained from:

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/

Data Mining Introductory and Advanced Topics by Margaret H. Dunham

http://www.engr.smu.edu/~mhd/book

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze

http://informationretrieval.org


CSE 8337 Outline

  • Introduction

  • Simple Text Processing

  • Boolean Queries

  • Web Searching/Crawling

  • Indexes

  • Vector Space Model

  • Matching

  • Evaluation




Modeling TOC (Vector Space and Other Models)

  • Introduction

  • Classic IR Models

    • Boolean Model

    • Vector Model

    • Probabilistic Model

  • Extended Boolean Model

  • Vector Space Scoring

  • Vector Model and Web Search


IR Models

[Figure: a taxonomy of IR models, organized by user task]

  • User Task: Retrieval (Ad hoc, Filtering) and Browsing

  • Classic Models: Boolean, Vector, Probabilistic

  • Set Theoretic: Fuzzy, Extended Boolean

  • Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks

  • Probabilistic: Inference Network, Belief Network

  • Structured Models: Non-Overlapping Lists, Proximal Nodes

  • Browsing: Flat, Structure Guided, Hypertext


The Boolean Model

  • Simple model based on set theory

  • Queries specified as boolean expressions

    • precise semantics and neat formalism

  • Terms are either present or absent. Thus, wij ∈ {0,1}

  • Consider

    • q = ka ∧ (kb ∨ ¬kc)

    • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)

    • qcc = (1,1,0) is a conjunctive component


[Figure: Venn diagram of the term sets Ka, Kb, Kc, with the conjunctive components (1,1,1), (1,1,0), and (1,0,0) marked]

The Boolean Model

  • q = ka ∧ (kb ∨ ¬kc)

  • sim(q,dj) =

    1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ ki, gi(dj) = gi(qcc))

    0 otherwise
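A minimal sketch of this matching rule in Python; the three-term vocabulary (ka, kb, kc), the documents, and the precomputed DNF are all illustrative:

```python
# Boolean retrieval for q = ka AND (kb OR NOT kc), represented by the
# conjunctive components of its disjunctive normal form (illustrative data).

QDNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}  # qdnf from the slide above

def sim(doc_vector):
    """1 if the doc's binary term vector equals some conjunctive
    component of the query DNF, 0 otherwise."""
    return 1 if tuple(doc_vector) in QDNF else 0

docs = {
    "d1": (1, 1, 0),  # has ka and kb, lacks kc -> matches (1,1,0)
    "d2": (0, 1, 1),  # lacks ka -> no match
}
for name, vec in docs.items():
    print(name, sim(vec))  # d1 1, d2 0
```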


Drawbacks of the Boolean Model

  • Retrieval based on binary decision criteria with no notion of partial matching

  • No ranking of the documents is provided

  • Information need has to be translated into a Boolean expression

  • The Boolean queries formulated by the users are most often too simplistic

  • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query


The Vector Model

  • Use of binary weights is too limiting

  • Non-binary weights provide consideration for partial matches

  • These term weights are used to compute a degree of similarity between a query and each document

  • Ranked set of documents provides for better matching


The Vector Model

  • wij > 0 whenever ki appears in dj

  • wiq >= 0 associated with the pair (ki,q)

  • dj = (w1j, w2j, ..., wtj)

  • q = (w1q, w2q, ..., wtq)

  • To each term ki is associated a unit vector k̂i

  • The unit vectors k̂i and k̂j are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)

  • The t unit vectors k̂i form an orthonormal basis for a t-dimensional space where queries and documents are represented as weighted vectors


The Vector Model

[Figure: vectors dj and q in term space, separated by angle θ]

  • sim(q,dj) = cos(θ)

    = (dj • q) / (|dj| × |q|)

    = Σi (wij × wiq) / (|dj| × |q|)

  • Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q,dj) ≤ 1

  • A document is retrieved even if it matches the query terms only partially


Weights wij and wiq ?

  • One approach is to examine the frequency of occurrence of a word in a document:

  • Absolute frequency:

    • tf factor, the term frequency within a document

    • freqi,j - raw frequency of ki within dj

    • Both high-frequency and low-frequency terms may not actually be significant

  • Relative frequency: tf divided by number of words in document

  • Normalized frequency:

    fi,j = (freqi,j)/(maxl freql,j)


Inverse Document Frequency

  • Importance of term may depend more on how it can distinguish between documents.

  • Quantification of inter-document separation

  • Dissimilarity not similarity

  • idf factor, the inverse document frequency


IDF

  • Let N be the total number of docs in the collection

  • Let ni be the number of docs which contain ki

  • The idf factor is computed as

    • idfi = log (N/ni)

    • the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.

  • IDF example (log base 10):

    • N=1000, n1=100, n2=500, n3=800

    • idf1 = log(1000/100) = 3 − 2 = 1

    • idf2 = log(1000/500) = 3 − 2.7 = 0.3

    • idf3 = log(1000/800) = 3 − 2.9 = 0.1


The Vector Model

  • The best term-weighting schemes take both into account.

  • wij = fi,j * log(N/ni)

  • This strategy is called a tf-idf weighting scheme
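A small sketch of this weighting in Python, over an illustrative two-document corpus (all names and data are made up):

```python
import math

# tf-idf with max-normalized term frequency: w_ij = f_ij * log10(N / n_i)

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat".split(),
}
N = len(docs)

def tf_idf(term, doc_terms):
    freq = doc_terms.count(term)                      # raw freq_ij
    max_freq = max(doc_terms.count(t) for t in set(doc_terms))
    f = freq / max_freq                               # normalized tf
    n_i = sum(term in d for d in docs.values())       # docs containing term
    return f * math.log10(N / n_i) if n_i else 0.0

print(tf_idf("cat", docs["d1"]))  # > 0: "cat" appears only in d1
print(tf_idf("the", docs["d1"]))  # 0.0: "the" is in every doc, idf = 0
```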


The Vector Model

  • For the query term weights, a suggestion is

    • wiq = (0.5 + 0.5 × freqi,q / maxl freql,q) × log(N/ni)

  • The vector model with tf-idf weights is a good ranking strategy with general collections

  • The vector model is usually as good as any known ranking alternatives.

  • It is also simple and fast to compute.


The Vector Model

  • Advantages:

    • term-weighting improves quality of the answer set

    • partial matching allows retrieval of docs that approximate the query conditions

    • cosine ranking formula sorts documents according to degree of similarity to the query

  • Disadvantages:

    • Assumes independence of index terms (??); not clear that this is bad though


The Vector Model: Example I

[Figure: documents d1–d7 plotted in the space of terms k1, k2, k3]


The Vector Model: Example II

[Figure: documents d1–d7 plotted in the space of terms k1, k2, k3]


The Vector Model: Example III

[Figure: documents d1–d7 plotted in the space of terms k1, k2, k3]


Probabilistic Model

  • Objective: to capture the IR problem using a probabilistic framework

  • Given a user query, there is an ideal answer set

  • Querying as specification of the properties of this ideal answer set (clustering)

  • But, what are these properties?

  • Guess at the beginning what they could be (i.e., guess initial description of ideal answer set)

  • Improve by iteration


Probabilistic Model

  • An initial set of documents is retrieved somehow

  • User inspects these docs looking for the relevant ones (in truth, only top 10-20 need to be inspected)

  • IR system uses this information to refine description of ideal answer set

  • By repeating this process, it is expected that the description of the ideal answer set will improve

  • Have always in mind the need to guess at the very beginning the description of the ideal answer set

  • Description of ideal answer set is modeled in probabilistic terms


Probabilistic Ranking Principle

  • Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). Ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.

  • But,

    • how to compute probabilities?

    • what is the sample space?


The Ranking

  • Probabilistic ranking computed as:

    • sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)

    • This is the odds of the document dj being relevant

    • Taking the odds minimizes the probability of an erroneous judgement

  • Definition:

    • wij {0,1}

    • P(R | dj) :probability that given doc is relevant

    • P(R | dj) : probability doc is not relevant


The Ranking

  • sim(dj,q) = P(R | dj) / P(R̄ | dj) = [P(dj | R) × P(R)] / [P(dj | R̄) × P(R̄)] ~ P(dj | R) / P(dj | R̄)

  • P(dj | R): probability of randomly selecting the document dj from the set R of relevant documents


The Ranking

  • sim(dj,q) ~ P(dj | R) / P(dj | R̄)

    ~ [ ∏gi(dj)=1 P(ki | R) × ∏gi(dj)=0 P(k̄i | R) ] / [ ∏gi(dj)=1 P(ki | R̄) × ∏gi(dj)=0 P(k̄i | R̄) ]

  • P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents


The Ranking

  • sim(dj,q)

    ~ log { [ ∏ P(ki | R) × ∏ P(k̄i | R) ] / [ ∏ P(ki | R̄) × ∏ P(k̄i | R̄) ] }

    ~ K × Σi [ log ( P(ki | R) / P(k̄i | R) ) + log ( P(k̄i | R̄) / P(ki | R̄) ) ]

    where P(k̄i | R) = 1 − P(ki | R) and P(k̄i | R̄) = 1 − P(ki | R̄)


The Initial Ranking

  • sim(dj,q) ~ Σi wiq × wij × ( log ( P(ki | R) / (1 − P(ki | R)) ) + log ( (1 − P(ki | R̄)) / P(ki | R̄) ) )

  • Probabilities P(ki | R) and P(ki | R̄)?

  • Estimates based on assumptions:

    • P(ki | R) = 0.5

    • P(ki | R̄) = ni / N

    • Use this initial guess to retrieve an initial ranking

    • Improve upon this initial ranking


Improving the Initial Ranking

  • Let

    • V : set of docs initially retrieved

    • Vi : subset of docs retrieved that contain ki

  • Reevaluate estimates:

    • P(ki | R) = Vi / V

    • P(ki | R̄) = (ni − Vi) / (N − V)

  • Repeat recursively


Improving the Initial Ranking

  • To avoid problems with V=1 and Vi=0:

    • P(ki | R) = (Vi + 0.5) / (V + 1)

    • P(ki | R̄) = (ni − Vi + 0.5) / (N − V + 1)

  • Also,

    • P(ki | R) = (Vi + ni/N) / (V + 1)

    • P(ki | R̄) = (ni − Vi + ni/N) / (N − V + 1)


Pluses and Minuses

  • Advantages:

    • Docs ranked in decreasing order of probability of relevance

  • Disadvantages:

    • need to guess initial estimates for P(ki | R)

    • method does not take into account tf and idf factors


Brief Comparison of Classic Models

  • Boolean model does not provide for partial matches and is considered to be the weakest classic model

  • Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections

  • This seems also to be the view of the research community


Extended Boolean Model

  • Boolean model is simple and elegant.

  • But, no provision for a ranking

  • As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership

  • Extend the Boolean model with the notions of partial matching and term weighting

  • Combine characteristics of the Vector model with properties of Boolean algebra


The Idea

  • The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra

  • Let,

    • q = kx ∧ ky

    • wxj = (fx,j × idfx) / maxi idfi, associated with [kx, dj]

    • Further, wxj = x and wyj = y


The Idea: qand = kx ∧ ky; wxj = x and wyj = y

  • sim(qand,dj) = 1 − sqrt( ((1−x)² + (1−y)²) / 2 )

[Figure: dj plotted in the (kx, ky) plane at (x, y); for an AND query, (1,1) is the most desirable point, and similarity grows as dj approaches it]


The Idea: qor = kx ∨ ky; wxj = x and wyj = y

  • sim(qor,dj) = sqrt( (x² + y²) / 2 )

[Figure: for an OR query, (0,0) is the least desirable point, and similarity grows as dj moves away from it]


Generalizing the Idea

  • We can extend the previous model to consider Euclidean distances in a t-dimensional space

  • This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 ≤ p ≤ ∞ is a new parameter


Generalizing the Idea

  • A generalized disjunctive query is given by

    • qor = k1 ∨p k2 ∨p . . . ∨p kt

  • A generalized conjunctive query is given by

    • qand = k1 ∧p k2 ∧p . . . ∧p kt

  • sim(qor,dj) = ( (x1^p + x2^p + . . . + xm^p) / m )^(1/p)

  • sim(qand,dj) = 1 − ( ((1−x1)^p + (1−x2)^p + . . . + (1−xm)^p) / m )^(1/p)
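A sketch of the two p-norm similarities, with illustrative weights, showing the limiting behaviors discussed next:

```python
# p-norm extended Boolean similarities; the weights xs are illustrative.

def sim_or(xs, p):
    m = len(xs)
    return (sum(x ** p for x in xs) / m) ** (1 / p)

def sim_and(xs, p):
    m = len(xs)
    return 1 - (sum((1 - x) ** p for x in xs) / m) ** (1 / p)

xs = [0.8, 0.4]
print(sim_or(xs, 1), sim_and(xs, 1))      # p=1: both equal the mean, 0.6
print(sim_or(xs, 100), sim_and(xs, 100))  # large p: approach max and min
```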


Properties

  • If p = 1 then (vector-like)

    • sim(qor,dj) = sim(qand,dj) = (x1 + . . . + xm) / m

  • If p = ∞ then (fuzzy-like)

    • sim(qor,dj) = max(xi)

    • sim(qand,dj) = min(xi)

  • By varying p, we can make the model behave as a vector, as a fuzzy, or as an intermediary model


Properties

  • This is quite powerful and is a good argument in favor of the extended Boolean model

  • q = (k1 ∧ k2) ∨ k3

    k1 and k2 are to be used as in vector retrieval, while the presence of k3 is required.

  • sim(q,dj) = sqrt( ( (1 − sqrt( ((1−x1)² + (1−x2)²) / 2 ))² + x3² ) / 2 )


Conclusions

  • Model is quite powerful

  • Properties are interesting and might be useful

  • Computation is somewhat complex

  • However, distributivity does not hold for the ranking computation:

    • q1 = (k1 ∨ k2) ∧ k3

    • q2 = (k1 ∧ k3) ∨ (k2 ∧ k3)

    • sim(q1,dj) ≠ sim(q2,dj)


Vector Space Scoring

  • First cut: distance between two points

    • ( = distance between the end points of the two vectors)

  • Euclidean distance?

  • Euclidean distance is a bad idea . . .

  • . . . because Euclidean distance is large for vectors of different lengths.


Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.


Use angle instead of distance

  • Thought experiment: take a document d and append it to itself. Call this document d′.

  • “Semantically” d and d′ have the same content

  • The Euclidean distance between the two documents can be quite large

  • The angle between the two documents is 0, corresponding to maximal similarity.

  • Key idea: Rank documents according to angle with query.


From angles to cosines

  • The following two notions are equivalent.

    • Rank documents in decreasing order of the angle between query and document

    • Rank documents in increasing order of cosine(query,document)

  • Cosine is a monotonically decreasing function on the interval [0°, 180°]


Length normalization

  • A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm: ‖x‖2 = sqrt( Σi xi² )

  • Dividing a vector by its L2 norm makes it a unit (length) vector

  • Effect on the two documents d and d′ (d appended to itself) from earlier slide: they have identical vectors after length-normalization.
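A quick sketch of L2 normalization on an illustrative vector and its doubled copy:

```python
import math

# L2-normalize a vector; d and d' (d appended to itself) come out identical.

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

d = [3.0, 4.0]        # illustrative term-frequency vector
d_prime = [6.0, 8.0]  # appending d to itself doubles every component
print(l2_normalize(d))        # [0.6, 0.8]
print(l2_normalize(d_prime))  # [0.6, 0.8] -- identical unit vectors
```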


cosine(query,document)

  • cos(q,d) = (q • d) / (|q| × |d|) = Σi qi di / ( sqrt(Σi qi²) × sqrt(Σi di²) )

  • For length-normalized (unit) vectors, cosine similarity is simply the dot product: cos(q,d) = q • d

  • qi is the tf-idf weight of term i in the query

  • di is the tf-idf weight of term i in the document

  • cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.


Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS    PaP    WH
affection   115     58    20
jealous      10      7    11
gossip        2      0     6
wuthering     0      0    38


3 documents example contd.

Log frequency weighting (1 + log10 tf):

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS    PaP    WH
affection   0.789  0.832  0.524
jealous     0.515  0.555  0.465
gossip      0.335  0      0.405
wuthering   0      0      0.588

cos(SaS,PaP) ≈ 0.789 ∗ 0.832 + 0.515 ∗ 0.555 + 0.335 ∗ 0.0 + 0.0 ∗ 0.0 ≈ 0.94

cos(SaS,WH) ≈ 0.79

cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
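These numbers can be reproduced with a short script; the counts are the ones in the table above:

```python
import math

# Log tf weights (1 + log10 tf), L2 normalization, then dot products.

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_weight(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def unit_vector(doc):
    w = {t: log_weight(tf) for t, tf in doc.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

def cos(a, b):
    va, vb = unit_vector(counts[a]), unit_vector(counts[b])
    return sum(va[t] * vb[t] for t in va)

print(round(cos("SaS", "PaP"), 2))  # ~0.94
print(round(cos("SaS", "WH"), 2))   # ~0.79
print(round(cos("PaP", "WH"), 2))   # ~0.69
```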


tf-idf weighting has many variants

A weighting scheme is denoted by one acronym letter per component (term frequency, document frequency, normalization):

Term frequency:
  n (natural): tf
  l (logarithm): 1 + log(tf)
  a (augmented): 0.5 + 0.5 × tf / maxt tf
  b (boolean): 1 if tf > 0, else 0

Document frequency:
  n (no): 1
  t (idf): log(N/df)
  p (prob idf): max{0, log((N − df)/df)}

Normalization:
  n (none): 1
  c (cosine): 1 / sqrt(w1² + w2² + . . . + wM²)

Why is the base of the log in idf immaterial?


Weighting may differ in Queries vs Documents

  • Many search engines allow for different weightings for queries vs documents

  • To denote the combination in use in an engine, we use the notation qqq.ddd with the acronyms from the previous table

  • Example: ltn.lnc means:

  • Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization (n) …

  • Document: logarithmic tf, no idf, and cosine normalization

Is this a bad idea?


tf-idf example: ltn.lnc

Document: car insurance auto insurance
Query: best car insurance

term        query tf   df      idf   q-wt    doc tf   wt    n'lized   product
auto        0           5000   2.3   0       1        1     0.52      0
best        1          50000   1.3   1.3     0        0     0         0
car         1          10000   2.0   2.0     1        1     0.52      1.04
insurance   1           1000   3.0   3.0     2        1.3   0.68      2.04

Exercise: what is N, the number of docs?

Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92

Score = 0 + 0 + 1.04 + 2.04 = 3.08
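A sketch reproducing this computation, taking N = 1,000,000, which is consistent with the df and idf columns above:

```python
import math

N = 1_000_000
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Query: ltn = log tf * idf, no normalization
q = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}

# Document: lnc = log tf, no idf, cosine-normalized
w = {t: log_tf(tf) for t, tf in doc_tf.items()}
length = math.sqrt(sum(x * x for x in w.values()))
d = {t: x / length for t, x in w.items()}

score = sum(q[t] * d.get(t, 0.0) for t in q)
print(round(score, 2))  # ~3.07 (the slide rounds intermediates to get 3.08)
```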


Summary – vector space ranking

  • Represent the query as a weighted tf-idf vector

  • Represent each document as a weighted tf-idf vector

  • Compute the cosine similarity score for the query vector and each document vector

  • Rank documents with respect to the query by score

  • Return the top K (e.g., K = 10) to the user


Vector Model and Web Search

  • Speeding up vector space ranking

  • Putting together a complete search system

    • Will require learning about a number of miscellaneous topics and heuristics


Efficient cosine ranking

  • Find the K docs in the collection “nearest” to the query ⇒ the K largest query-doc cosines.

  • Efficient ranking:

    • Computing a single cosine efficiently.

    • Choosing the K largest cosine values efficiently.

      • Can we do this without computing all N cosines?


Efficient cosine ranking

  • What we’re doing in effect: solving the K-nearest neighbor problem for a query vector

  • In general, we do not know how to do this efficiently for high-dimensional spaces

  • But it is solvable for short queries, and standard indexes support this well


Special case – unweighted queries

  • No weighting on query terms

    • Assume each query term occurs only once

  • Then for ranking, don’t need to normalize query vector


Faster cosine: unweighted query
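A minimal sketch of this scoring loop, assuming an in-memory index of (docID, length-normalized weight) postings; with an unweighted query, each term simply adds its document weights into an accumulator:

```python
import heapq

# Cosine scoring for an unweighted query: one pass over each query term's
# postings, accumulating normalized doc weights, then top-K selection.

postings = {  # term -> list of (doc_id, length-normalized weight w_td)
    "car": [(1, 0.52), (3, 0.31)],
    "insurance": [(1, 0.68), (2, 0.44)],
}

def top_k(query_terms, k):
    scores = {}
    for t in query_terms:                      # each term contributes w_td
        for doc_id, w_td in postings.get(t, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

print(top_k(["car", "insurance"], 2))  # [(1, ~1.2), (2, 0.44)]
```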


Computing the K largest cosines: selection vs. sorting

  • Typically we want to retrieve the top K docs (in the cosine ranking for the query)

    • not to totally order all docs in the collection

  • Can we pick off docs with K highest cosines?

  • Let J = number of docs with nonzero cosines

    • We seek the K best of these J


Use heap for selecting top K

  • Binary tree in which each node’s value > the values of children

  • Takes 2J operations to construct, then each of K “winners” read off in 2 log J steps.

  • For J=1M, K=100, this is about 10% of the cost of sorting.

[Figure: a binary max-heap of cosine scores, root 1 with children .9 and .3, then .3, .8, .1, .1]
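A sketch of this selection with Python's heapq (a min-heap, so scores are negated to simulate a max-heap):

```python
import heapq

# Mirroring the analysis above: heapify builds the heap in ~2J operations,
# then each of the K winners is read off in ~2 log J steps.

scores = [0.9, 0.3, 0.8, 0.1, 0.3, 0.1]  # J = 6 nonzero cosines
heap = [-s for s in scores]
heapq.heapify(heap)                                # O(J) construction
top_k = [-heapq.heappop(heap) for _ in range(2)]   # K = 2 pops
print(top_k)                                       # [0.9, 0.8]
```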


Bottlenecks

  • Primary computational bottleneck in scoring: cosine computation

  • Can we avoid all this computation?

  • Yes, but may sometimes get it wrong

    • a doc not in the top K may creep into the list of K output docs

    • Is this such a bad thing?


Cosine similarity is only a proxy

  • User has a task and a query formulation

  • Cosine matches docs to query

  • Thus cosine is anyway a proxy for user happiness

  • If we get a list of K docs “close” to the top K by cosine measure, should be ok


Generic approach

  • Find a set A of contenders, with K < |A| << N

    • A does not necessarily contain the top K, but has many docs from among the top K

    • Return the top K docs in A

  • Think of A as pruning non-contenders

  • The same approach is also used for other (non-cosine) scoring functions

  • Will look at several schemes following this approach


Index elimination

  • Basic algorithm of Fig 7.1 only considers docs containing at least one query term

  • Take this further:

    • Only consider high-idf query terms

    • Only consider docs containing many query terms


High-idf query terms only

  • For a query such as catcher in the rye

  • Only accumulate scores from catcher and rye

  • Intuition: in and the contribute little to the scores and don’t alter rank-ordering much

  • Benefit:

    • Postings of low-idf terms have many docs ⇒ these (many) docs get eliminated from A


Docs containing many query terms

  • Any doc with at least one query term is a candidate for the top K output list

  • For multi-term queries, only compute scores for docs containing several of the query terms

    • Say, at least 3 out of 4

    • Imposes a “soft conjunction” on queries seen on web search engines (early Google)

  • Easy to implement in postings traversal


3 of 4 query terms

Antony:    3 → 4 → 8 → 16 → 32 → 64 → 128

Brutus:    2 → 4 → 8 → 16 → 32 → 64 → 128

Caesar:    1 → 2 → 3 → 5 → 8 → 13 → 21 → 34

Calpurnia: 13 → 16 → 32

Scores only computed for docs 8, 16 and 32.
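A sketch of this soft-conjunction filter over the postings above; only the surviving docs would then be fully scored:

```python
from collections import Counter

# Keep only docs appearing in at least min_hits of the query terms' postings.

postings = {
    "antony":    [3, 4, 8, 16, 32, 64, 128],
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16, 32],
}

def candidates(terms, min_hits):
    # Count, for every doc, how many query terms' postings contain it
    hits = Counter(doc for t in terms for doc in postings[t])
    return sorted(doc for doc, c in hits.items() if c >= min_hits)

print(candidates(postings.keys(), 3))  # [8, 16, 32]
```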


Champion lists

  • Precompute for each dictionary term t, the r docs of highest weight in t’s postings

    • Call this the champion list for t

    • (aka fancy list or top docs for t)

  • Note that r has to be chosen at index time

  • At query time, only compute scores for docs in the champion list of some query term

    • Pick the K top-scoring docs from amongst these
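A sketch of champion-list construction and use, over an illustrative weighted index with r = 2:

```python
# At index time keep, for each term, only the r docs of highest weight;
# at query time, score only the union of the query terms' champion lists.

r = 2
postings = {  # term -> list of (doc_id, weight), illustrative
    "car": [(1, 0.9), (2, 0.2), (3, 0.7)],
    "insurance": [(1, 0.4), (4, 0.8)],
}

champions = {t: sorted(pl, key=lambda x: -x[1])[:r]
             for t, pl in postings.items()}

def candidate_docs(query_terms):
    return {doc for t in query_terms for doc, _ in champions.get(t, [])}

print(champions["car"])                      # [(1, 0.9), (3, 0.7)] -- doc 2 pruned
print(candidate_docs(["car", "insurance"]))  # {1, 3, 4}
```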


Static quality scores

  • We want top-ranking documents to be both relevant and authoritative

  • Relevance is being modeled by cosine scores

  • Authority is typically a query-independent property of a document

  • Examples of authority signals (all of these are quantitative)

    • Wikipedia among websites

    • Articles in certain newspapers

    • A paper with many citations

    • Many diggs, Y!buzzes or del.icio.us marks

    • (Pagerank)



Modeling authority

  • Assign to each document d a query-independent quality score in [0,1]

    • Denote this by g(d)

  • Thus, a quantity like the number of citations is scaled into [0,1]

    • Exercise: suggest a formula for this.


Net score

  • Consider a simple total score combining cosine relevance and authority

  • net-score(q,d) = g(d) + cosine(q,d)

    • Can use some other linear combination than an equal weighting

    • Indeed, any function of the two “signals” of user happiness – more later

  • Now we seek the top K docs by net score


Top K by net score – fast methods

  • First idea: Order all postings by g(d)

  • Key: this is a common ordering for all postings

  • Thus, can concurrently traverse query terms’ postings for

    • Postings intersection

    • Cosine score computation

  • Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)


Why order postings by g(d)?

  • Under g(d)-ordering, top-scoring docs likely to appear early in postings traversal

  • In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early

    • Short of computing scores for all docs in postings


Champion lists in g(d)-ordering

  • Can combine champion lists with g(d)-ordering

  • Maintain for each term a champion list of the r docs with highest g(d) + tf-idf(t,d)

  • Seek top-K results from only the docs in these champion lists


High and low lists

  • For each term, we maintain two postings lists called high and low

    • Think of high as the champion list

  • When traversing postings on a query, only traverse high lists first

    • If we get more than K docs, select the top K and stop

    • Else proceed to get docs from the low lists

  • Can be used even for simple cosine scores, without global quality g(d)

  • A means for segmenting index into two tiers


Impact-ordered postings

  • We only want to compute scores for docs for which wft,d is high enough

  • We sort each postings list by wft,d

  • Now: not all postings in a common order!

  • How do we compute scores in order to pick off top K?

    • Two ideas follow


1. Early termination

  • When traversing t’s postings, stop early after either

    • a fixed number r of docs

    • wft,d drops below some threshold

  • Take the union of the resulting sets of docs

    • One from the postings of each query term

  • Compute only the scores for docs in this union


2. idf-ordered terms

  • When considering the postings of query terms

  • Look at them in order of decreasing idf

    • High idf terms likely to contribute most to score

  • As we update score contribution from each query term

    • Stop if doc scores relatively unchanged

  • Can apply to cosine or some other net scores


Cluster pruning: preprocessing

  • Pick N docs at random: call these leaders

  • For every other doc, pre-compute nearest leader

    • Docs attached to a leader: its followers;

    • Likely: each leader has ~ N followers.


Cluster pruning: query processing

  • Process a query as follows:

    • Given query Q, find its nearest leader L.

    • Seek K nearest docs from among L’s followers.
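A sketch of both phases, under the assumption that similarity is plain cosine and the corpus is a list of dense vectors (all data randomly generated for illustration):

```python
import math, random

# Cluster pruning: sample sqrt(N) leaders, attach every doc to its
# nearest leader, and answer queries from one leader's followers.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def build(docs):
    leaders = random.sample(range(len(docs)), int(math.sqrt(len(docs))))
    followers = {l: [] for l in leaders}
    for i, d in enumerate(docs):
        nearest = max(leaders, key=lambda l: cosine(d, docs[l]))
        followers[nearest].append(i)
    return leaders, followers

def query(q, docs, leaders, followers, k):
    l = max(leaders, key=lambda i: cosine(q, docs[i]))   # nearest leader
    return sorted(followers[l], key=lambda i: -cosine(q, docs[i]))[:k]

docs = [[random.random() for _ in range(3)] for _ in range(100)]
leaders, followers = build(docs)
print(query([1.0, 0.0, 0.5], docs, leaders, followers, k=3))
```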


Visualization

[Figure: query point, leaders, and their followers in document space]

Why use random sampling

  • Fast

  • Leaders reflect data distribution


General variants

  • Have each follower attached to b1=3 (say) nearest leaders.

  • From query, find b2=4 (say) nearest leaders and their followers.

  • Can recur on leader/follower construction.


Putting it all together

[Figure: architecture of a complete search system, from document parsing and indexing through scoring and ranking to the results page]