Information Retrieval

CSE 8337 (Part B)

Spring 2009

Some Material for these slides obtained from:

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto

http://www.sims.berkeley.edu/~hearst/irbook/

Data Mining Introductory and Advanced Topics by Margaret H. Dunham

http://www.engr.smu.edu/~mhd/book

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze

http://informationretrieval.org

- Introduction
- Simple Text Processing
- Boolean Queries
- Web Searching/Crawling
- Indexes
- Vector Space Model
- Matching
- Evaluation

- Introduction
- Classic IR Models
- Boolean Model
- Vector Model
- Probabilistic Model

- Extended Boolean Model
- Vector Space Scoring
- Vector Model and Web Search

IR Models

[Figure: taxonomy of IR models by user task. Retrieval (ad hoc, filtering): classic models (Boolean, vector, probabilistic); set theoretic (fuzzy, extended Boolean); algebraic (generalized vector, latent semantic indexing, neural networks); probabilistic (inference network, belief network); structured models (non-overlapping lists, proximal nodes). Browsing: flat, structure guided, hypertext.]

The Boolean Model

- Simple model based on set theory
- Queries specified as boolean expressions
- precise semantics and neat formalism

- Terms are either present or absent. Thus, $w_{ij} \in \{0,1\}$
- Consider
- $q = k_a \wedge (k_b \vee \neg k_c)$
- $q_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$
- $q_{cc} = (1,1,0)$ is a conjunctive component

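[Figure: Venn diagram of the term sets Ka, Kb, and Kc, with the regions (1,1,1), (1,1,0), and (1,0,0) marked as the conjunctive components of the query.]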

The Boolean Model

- $q = k_a \wedge (k_b \vee \neg k_c)$
- $sim(q, d_j) = 1$ if $\exists\, q_{cc} \mid (q_{cc} \in q_{dnf}) \wedge (\forall k_i,\ g_i(d_j) = g_i(q_{cc}))$

0 otherwise
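The matching rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `query` and `sim` and the tuple encoding of a document as (ka, kb, kc) bits are invented for the example.

```python
from itertools import product


def query(ka: bool, kb: bool, kc: bool) -> bool:
    """The example query q = ka AND (kb OR NOT kc)."""
    return ka and (kb or not kc)


# The DNF is the set of all binary term patterns (ka, kb, kc)
# for which the query evaluates to true.
q_dnf = {bits for bits in product((1, 0), repeat=3)
         if query(*map(bool, bits))}


def sim(doc_bits: tuple) -> int:
    """Return 1 if the document's term pattern equals some
    conjunctive component of the query's DNF, else 0."""
    return 1 if doc_bits in q_dnf else 0


print(sorted(q_dnf, reverse=True))  # the three conjunctive components
print(sim((1, 1, 0)), sim((1, 0, 1)))
```

Note that the result is strictly 0 or 1: a document matching two of three conditions scores the same as one matching none.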

Drawbacks of the Boolean Model

- Retrieval based on binary decision criteria with no notion of partial matching
- No ranking of the documents is provided
- Information need has to be translated into a Boolean expression
- The Boolean queries formulated by the users are most often too simplistic
- As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

The Vector Model

- Use of binary weights is too limiting
- Non-binary weights provide consideration for partial matches
- These term weights are used to compute a degree of similarity between a query and each document
- Ranked set of documents provides for better matching

The Vector Model

- $w_{ij} > 0$ whenever $k_i$ appears in $d_j$
- $w_{iq} \ge 0$ is associated with the pair $(k_i, q)$
- $\vec{d_j} = (w_{1j}, w_{2j}, \dots, w_{tj})$
- $\vec{q} = (w_{1q}, w_{2q}, \dots, w_{tq})$
- To each term $k_i$ is associated a unit vector $\vec{i}$
- The unit vectors $\vec{i}$ and $\vec{j}$ are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
- The t unit vectors $\vec{i}$ form an orthonormal basis for a t-dimensional space where queries and documents are represented as weighted vectors

The Vector Model

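[Figure: document vector dj and query vector q drawn against the term axes i and j, separated by the angle θ.]

- $sim(q, d_j) = \cos\theta = \dfrac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \dfrac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{|\vec{d_j}|\,|\vec{q}|}$

- Since $w_{ij} \ge 0$ and $w_{iq} \ge 0$, $0 \le sim(q, d_j) \le 1$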
- A document is retrieved even if it matches the query terms only partially
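A minimal sketch of the cosine ranking formula, assuming documents and queries are given as equal-length lists of non-negative term weights. The function name `cosine` and the sample vectors are illustrative only.

```python
import math


def cosine(d, q):
    """Cosine of the angle between weight vectors d and q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_d * norm_q)


# Hypothetical weights over three index terms.
d1 = [0.5, 0.0, 0.8]
q = [1.0, 1.0, 0.0]
print(cosine(d1, q))  # partial match: strictly between 0 and 1
```

The document shares only one of the two query terms, yet it still receives a nonzero score, which is the partial-matching behavior the Boolean model lacks.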

Weights wij and wiq ?

- One approach is to examine the frequency of the occurrence of a word in a document:
- Absolute frequency:
- tf factor, the term frequency within a document
- $freq_{i,j}$ - raw frequency of $k_i$ within $d_j$
- Both high-frequency and low-frequency terms may not actually be significant

- Relative frequency: tf divided by number of words in document
- Normalized frequency:
$f_{i,j} = freq_{i,j} / \max_l freq_{l,j}$

- Importance of term may depend more on how it can distinguish between documents.
- Quantification of inter-documents separation
- Dissimilarity not similarity
- idf factor, the inverse document frequency

IDF

- Let N be the total number of docs in the collection
- Let $n_i$ be the number of docs which contain $k_i$
- The idf factor is computed as
- $idf_i = \log (N/n_i)$
- The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term $k_i$.

- IDF example (base-10 logs):
- N = 1000, $n_1$ = 100, $n_2$ = 500, $n_3$ = 800
- $idf_1$ = 3 - 2 = 1
- $idf_2$ = 3 - 2.7 = 0.3
- $idf_3$ = 3 - 2.9 = 0.1
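The example values can be checked with a short script. This is a sketch that assumes base-10 logarithms, which is what the arithmetic above implies; the function name `idf` is chosen for illustration.

```python
import math


def idf(N: int, n_i: int) -> float:
    """Inverse document frequency log10(N / n_i)."""
    return math.log10(N / n_i)


N = 1000
for n_i in (100, 500, 800):
    # Rounded to one decimal, as in the example above.
    print(n_i, round(idf(N, n_i), 1))
```

A term that appears in 800 of 1000 documents gets a weight ten times smaller than one that appears in 100, reflecting how poorly it separates documents.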

The Vector Model

- The best term-weighting schemes take both into account.
- $w_{ij} = f_{i,j} \times \log(N/n_i)$
- This strategy is called a tf-idf weighting scheme
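As a rough sketch of tf-idf weighting, the snippet below combines the max-normalized term frequency with the idf factor. The function name `tf_idf_weights`, the token list, and the document frequencies are all hypothetical, and base-10 logs and a non-empty document are assumed.

```python
import math
from collections import Counter


def tf_idf_weights(doc_tokens, doc_freq, N):
    """Weight each term by (freq / max freq in doc) * log10(N / n_i).

    doc_freq maps a term to n_i, the number of docs containing it.
    Assumes doc_tokens is non-empty.
    """
    counts = Counter(doc_tokens)
    max_freq = max(counts.values())
    return {term: (freq / max_freq) * math.log10(N / doc_freq[term])
            for term, freq in counts.items()}


# Hypothetical collection statistics.
doc_freq = {"car": 100, "insurance": 10, "auto": 500}
weights = tf_idf_weights(["car", "insurance", "auto", "insurance"],
                         doc_freq, N=1000)
print(weights)
```

Here "insurance" ends up with the largest weight because it is both the most frequent term in the document and the rarest in the collection.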

The Vector Model

- For the query term weights, a suggestion is
- $w_{iq} = \left(0.5 + \dfrac{0.5 \times freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log(N/n_i)$

- The vector model with tf-idf weights is a good ranking strategy with general collections
- The vector model is usually as good as any known ranking alternatives.
- It is also simple and fast to compute.

The Vector Model

- Advantages:
- term-weighting improves quality of the answer set
- partial matching allows retrieval of docs that approximate the query conditions
- cosine ranking formula sorts documents according to degree of similarity to the query

- Disadvantages:
- Assumes independence of index terms (??); not clear that this is bad though

The Vector Model: Examples I, II, and III

[Figures: three examples, each showing documents d1 through d7 placed among the index terms k1, k2, and k3.]

- Objective: to capture the IR problem using a probabilistic framework
- Given a user query, there is an ideal answer set
- Querying as specification of the properties of this ideal answer set (clustering)
- But, what are these properties?
- Guess at the beginning what they could be (i.e., guess initial description of ideal answer set)
- Improve by iteration

Probabilistic Model

- An initial set of documents is retrieved somehow
- User inspects these docs looking for the relevant ones (in truth, only top 10-20 need to be inspected)
- IR system uses this information to refine description of ideal answer set
- By repeating this process, it is expected that the description of the ideal answer set will improve
- Always keep in mind that an initial description of the ideal answer set has to be guessed at the very beginning
- Description of ideal answer set is modeled in probabilistic terms

Probabilistic Ranking Principle

- Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). Ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
- But,
- how to compute probabilities?
- what is the sample space?

The Ranking

- Probabilistic ranking computed as:
- sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)
- This is the odds of the document dj being relevant
- Taking the odds minimizes the probability of an erroneous judgement

- Definition:
- $w_{ij} \in \{0,1\}$
- $P(R \mid d_j)$: probability that the given doc is relevant
- $P(\bar{R} \mid d_j)$: probability that the doc is not relevant

The Ranking

- $sim(d_j, q) = \dfrac{P(R \mid d_j)}{P(\bar{R} \mid d_j)} = \dfrac{P(d_j \mid R) \times P(R)}{P(d_j \mid \bar{R}) \times P(\bar{R})} \sim \dfrac{P(d_j \mid R)}{P(d_j \mid \bar{R})}$

- $P(d_j \mid R)$: probability of randomly selecting the document dj from the set R of relevant documents

The Ranking

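- $sim(d_j, q) \sim \dfrac{P(d_j \mid R)}{P(d_j \mid \bar{R})} \sim \dfrac{\left[\prod_{g_i(d_j)=1} P(k_i \mid R)\right] \times \left[\prod_{g_i(d_j)=0} P(\bar{k}_i \mid R)\right]}{\left[\prod_{g_i(d_j)=1} P(k_i \mid \bar{R})\right] \times \left[\prod_{g_i(d_j)=0} P(\bar{k}_i \mid \bar{R})\right]}$

- $P(k_i \mid R)$: probability that the index term ki is present in a document randomly selected from the set R of relevant documents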

The Ranking

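- $sim(d_j, q) \sim \log \dfrac{\left[\prod_{g_i(d_j)=1} P(k_i \mid R)\right] \times \left[\prod_{g_i(d_j)=0} P(\bar{k}_i \mid R)\right]}{\left[\prod_{g_i(d_j)=1} P(k_i \mid \bar{R})\right] \times \left[\prod_{g_i(d_j)=0} P(\bar{k}_i \mid \bar{R})\right]}$

$\sim K + \sum_{i:\, g_i(d_j)=1} \left[ \log \dfrac{P(k_i \mid R)}{P(\bar{k}_i \mid R)} + \log \dfrac{P(\bar{k}_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right]$

where $P(\bar{k}_i \mid R) = 1 - P(k_i \mid R)$ and $P(\bar{k}_i \mid \bar{R}) = 1 - P(k_i \mid \bar{R})$, and K is a constant that is the same for every document and so does not affect the ranking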

The Initial Ranking

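- $sim(d_j, q) \sim \sum_i w_{iq} \times w_{ij} \times \left( \log \dfrac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \dfrac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$

- How to obtain the probabilities $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$?
- Estimates based on assumptions:
- $P(k_i \mid R) = 0.5$
- $P(k_i \mid \bar{R}) = n_i / N$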
- Use this initial guess to retrieve an initial ranking
- Improve upon this initial ranking
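With the initial estimates plugged in, the first log term vanishes and each matching term contributes log((N - n_i) / n_i). The sketch below illustrates this under stated assumptions: binary weights, natural logs, 0 < n_i < N, and invented names (`initial_term_weight`, `initial_sim`) and sample statistics.

```python
import math


def initial_term_weight(n_i: int, N: int) -> float:
    """Term contribution under the initial guesses
    P(ki|R) = 0.5 and P(ki|not R) = n_i / N (assumes 0 < n_i < N)."""
    p_rel = 0.5
    p_nonrel = n_i / N
    return (math.log(p_rel / (1 - p_rel))
            + math.log((1 - p_nonrel) / p_nonrel))


def initial_sim(doc_terms, query_terms, doc_freq, N):
    """Sum the contributions of query terms present in the doc
    (binary weights w_iq = w_ij = 1 for present terms)."""
    return sum(initial_term_weight(doc_freq[t], N)
               for t in query_terms if t in doc_terms)


# Hypothetical statistics for a 1000-document collection.
doc_freq = {"catcher": 10, "rye": 100}
print(initial_sim({"catcher", "field"}, ["catcher", "rye"], doc_freq, 1000))
```

The resulting weight behaves much like an idf factor: rarer terms contribute more to the initial ranking.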

Improving the Initial Ranking

- Let
- V : set of docs initially retrieved
- Vi : subset of docs retrieved that contain ki

- Reevaluate estimates:
- $P(k_i \mid R) = V_i / V$
- $P(k_i \mid \bar{R}) = (n_i - V_i) / (N - V)$

- Repeat recursively

Improving the Initial Ranking

- To avoid problems with V=1 and Vi=0:
- $P(k_i \mid R) = \dfrac{V_i + 0.5}{V + 1}$
- $P(k_i \mid \bar{R}) = \dfrac{n_i - V_i + 0.5}{N - V + 1}$

- Also,
- $P(k_i \mid R) = \dfrac{V_i + n_i/N}{V + 1}$
- $P(k_i \mid \bar{R}) = \dfrac{n_i - V_i + n_i/N}{N - V + 1}$
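The two smoothed re-estimation variants could be coded as below. This is a sketch only: the function name `reestimate` and its `adjust` parameter are made up for the example, and V and V_i are taken to be the sizes of the retrieved set and of its subset containing the term.

```python
def reestimate(V: int, V_i: int, n_i: int, N: int, adjust: float = 0.5):
    """Smoothed estimates of P(ki|R) and P(ki|not R).

    adjust=0.5 gives the first variant above; passing
    adjust=n_i / N gives the second.
    """
    p_rel = (V_i + adjust) / (V + 1)
    p_nonrel = (n_i - V_i + adjust) / (N - V + 1)
    return p_rel, p_nonrel


# Hypothetical round: 10 docs retrieved, none of them contain the term.
print(reestimate(V=10, V_i=0, n_i=100, N=1000))
print(reestimate(V=10, V_i=0, n_i=100, N=1000, adjust=100 / 1000))
```

Without the adjustment, V_i = 0 would give P(ki|R) = 0 and make the log term undefined; the smoothing keeps both estimates strictly between 0 and 1.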

Pluses and Minuses

- Advantages:
- Docs ranked in decreasing order of probability of relevance

- Disadvantages:
- need to guess initial estimates for P(ki | R)
- method does not take into account tf and idf factors

Brief Comparison of Classic Models

- Boolean model does not provide for partial matches and is considered to be the weakest classic model
- Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections
- This seems also to be the view of the research community

- Boolean model is simple and elegant.
- But, no provision for a ranking
- As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
- Extend the Boolean model with the notions of partial matching and term weighting
- Combine characteristics of the Vector model with properties of Boolean algebra

The Idea

- The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
- Let,
- $q = k_x \wedge k_y$
- $w_{xj} = f_{xj} \times \dfrac{idf_x}{\max_i(idf_i)}$ associated with $[k_x, d_j]$
- Further, $w_{xj} = x$ and $w_{yj} = y$

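The Idea:

$q_{and} = k_x \wedge k_y$; $w_{xj} = x$ and $w_{yj} = y$

$sim(q_{and}, d_j) = 1 - \sqrt{\dfrac{(1-x)^2 + (1-y)^2}{2}}$

[Figure: unit square with axes kx (x = wxj) and ky (y = wyj) and corners (0,0) and (1,1). For AND, dj = (x, y) is scored by its distance from the point (1,1).]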

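The Idea:

$q_{or} = k_x \vee k_y$; $w_{xj} = x$ and $w_{yj} = y$

$sim(q_{or}, d_j) = \sqrt{\dfrac{x^2 + y^2}{2}}$

[Figure: unit square with axes kx (x = wxj) and ky (y = wyj) and corners (0,0) and (1,1). For OR, dj = (x, y) is scored by its distance from the point (0,0).]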

Generalizing the Idea

- We can extend the previous model to consider Euclidean distances in a t-dimensional space
- This can be done using p-norms, which extend the notion of distance to include p-distances, where $1 \le p \le \infty$ is a new parameter

Generalizing the Idea

- A generalized disjunctive query is given by
- $q_{or} = k_1 \vee^p k_2 \vee^p \dots \vee^p k_m$
- $sim(q_{or}, d_j) = \left( \dfrac{x_1^p + x_2^p + \dots + x_m^p}{m} \right)^{1/p}$

- A generalized conjunctive query is given by
- $q_{and} = k_1 \wedge^p k_2 \wedge^p \dots \wedge^p k_m$
- $sim(q_{and}, d_j) = 1 - \left( \dfrac{(1-x_1)^p + (1-x_2)^p + \dots + (1-x_m)^p}{m} \right)^{1/p}$
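A small sketch of the two p-norm similarities, assuming the term weights x_1..x_m are given as a non-empty list of values in [0, 1]. The names `sim_or` and `sim_and` are illustrative, and treating p = infinity as max/min is the standard limit of the p-norm rather than something computed from the formula directly.

```python
import math


def sim_or(xs, p):
    """p-norm similarity for a disjunctive query over weights xs."""
    if math.isinf(p):
        return max(xs)  # limit of the p-norm as p grows
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)


def sim_and(xs, p):
    """p-norm similarity for a conjunctive query over weights xs."""
    if math.isinf(p):
        return min(xs)  # limit of the p-norm as p grows
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)


xs = [0.2, 0.8]  # hypothetical weights of two query terms in a doc
for p in (1, 2, math.inf):
    print(p, sim_or(xs, p), sim_and(xs, p))
```

With p = 1 the two operators give the same average, and as p increases the OR score moves toward the largest weight while the AND score moves toward the smallest.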

Properties

- If p = 1 then (Vector like)
- $sim(q_{or}, d_j) = sim(q_{and}, d_j) = \dfrac{x_1 + \dots + x_m}{m}$

- If p = ∞ then (Fuzzy like)
- $sim(q_{or}, d_j) = \max_i (x_i)$
- $sim(q_{and}, d_j) = \min_i (x_i)$

- By varying p, we can make the model behave as a vector, as a fuzzy, or as an intermediary model

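Properties

- This is quite powerful and is a good argument in favor of the extended Boolean model
- $q = (k_1 \vee^2 k_2) \wedge^\infty k_3$
k1 and k2 are to be used as in a vector retrieval while the presence of k3 is required.

- For a query that nests the operators, such as $(k_1 \wedge^p k_2) \vee^p k_3$, the formulas are applied recursively: $sim(q, d_j) = \left( \dfrac{\left( 1 - \left( \dfrac{(1-x_1)^p + (1-x_2)^p}{2} \right)^{1/p} \right)^p + x_3^p}{2} \right)^{1/p}$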

- Model is quite powerful
- Properties are interesting and might be useful
- Computation is somewhat complex
- However, the distributive law does not hold for the ranking computation:
- $q_1 = (k_1 \vee k_2) \wedge k_3$
- $q_2 = (k_1 \wedge k_3) \vee (k_2 \wedge k_3)$
- $sim(q_1, d_j) \ne sim(q_2, d_j)$
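A quick numeric check of the distributivity claim, reading q1 as (k1 OR k2) AND k3 and q2 as (k1 AND k3) OR (k2 AND k3). The sketch assumes p = 2 throughout; the helper names `or2` and `and2` and the sample weights are invented.

```python
def or2(x, y, p=2):
    """Two-term p-norm OR."""
    return ((x ** p + y ** p) / 2) ** (1 / p)


def and2(x, y, p=2):
    """Two-term p-norm AND."""
    return 1 - (((1 - x) ** p + (1 - y) ** p) / 2) ** (1 / p)


# Hypothetical weights of k1, k2, k3 in some document.
x1, x2, x3 = 0.5, 0.2, 0.8

s1 = and2(or2(x1, x2), x3)            # q1 = (k1 OR k2) AND k3
s2 = or2(and2(x1, x3), and2(x2, x3))  # q2 = (k1 AND k3) OR (k2 AND k3)
print(s1, s2)
```

The two queries are equivalent in Boolean algebra, yet they score this document differently, so rewriting a query can change the ranking.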

- First cut: distance between two points
- ( = distance between the end points of the two vectors)

- Euclidean distance?
- Euclidean distance is a bad idea . . .
- . . . because Euclidean distance is large for vectors of different lengths.

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

- Thought experiment: take a document d and append it to itself. Call this document d′.
- “Semantically” d and d′ have the same content
- The Euclidean distance between the two documents can be quite large
- The angle between the two documents is 0, corresponding to maximal similarity.
- Key idea: Rank documents according to angle with query.
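The thought experiment is easy to reproduce. In this sketch the term counts for d are hypothetical, and appending d to itself is modeled as doubling every count.

```python
import math


def euclidean(a, b):
    """Distance between the end points of vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine of the angle between nonzero vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


d = [3.0, 1.0, 0.0, 2.0]        # hypothetical term counts for d
d_prime = [2 * w for w in d]    # d appended to itself

print(euclidean(d, d_prime))    # equals the length of d, not zero
print(cosine(d, d_prime))       # approximately 1: angle is 0
```

The Euclidean distance grows with document length, while the cosine is unaffected by it.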

- The following two notions are equivalent.
- Rank documents in increasing order of the angle between query and document
- Rank documents in decreasing order of cosine(query,document)

- Cosine is a monotonically decreasing function for the interval [0°, 180°]

- A vector can be (length-) normalized by dividing each of its components by its length. For this we use the L2 norm: $\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$
- Dividing a vector by its L2 norm makes it a unit (length) vector
- Effect on the two documents d and d′ (d appended to itself) from earlier slide: they have identical vectors after length-normalization.

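$\cos(\vec{q}, \vec{d}) = \dfrac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \dfrac{\vec{q}}{|\vec{q}|} \cdot \dfrac{\vec{d}}{|\vec{d}|} = \dfrac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}$

The numerator is the dot product of q and d; equivalently, the cosine is the dot product of the two unit vectors.

$q_i$ is the tf-idf weight of term i in the query

$d_i$ is the tf-idf weight of term i in the document

cos(q,d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.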

How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights?

[Tables: term frequencies (counts), log frequency weighting, and weights after length normalization for the three novels.]

cos(SaS,PaP) ≈ 0.789 ∗ 0.832 + 0.515 ∗ 0.555 + 0.335 ∗ 0.0 + 0.0 ∗ 0.0 ≈ 0.94

cos(SaS,WH) ≈ 0.79

cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?

Columns headed ‘n’ are acronyms for weight schemes.

Why is the base of the log in idf immaterial?

- Many search engines allow for different weightings for queries vs documents
- To denote the combination in use in an engine, we use the notation qqq.ddd with the acronyms from the previous table
- Example: ltn.lnc means:
- Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization
- Document: logarithmic tf, no idf, and cosine normalization

Is this a bad idea?

Document: car insurance auto insurance

Query: best car insurance

Exercise: what is N, the number of docs?

Doc length = $\sqrt{1^2 + 0^2 + 1^2 + 1.3^2} \approx 1.92$

Score = 0+0+1.04+2.04 = 3.08
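The document side of this example can be reproduced with a short sketch, assuming the document is weighted with logarithmic tf (1 + log10 of the count), no idf, and cosine normalization. The function name `lnc_vector` is invented; the query-side idf values are not recomputed here.

```python
import math
from collections import Counter


def lnc_vector(tokens):
    """Return (normalized weights, vector length) for a document
    using log tf, no idf, and cosine (L2) normalization."""
    counts = Counter(tokens)
    log_tf = {t: 1 + math.log10(c) for t, c in counts.items()}
    length = math.sqrt(sum(w * w for w in log_tf.values()))
    return {t: w / length for t, w in log_tf.items()}, length


vec, length = lnc_vector("car insurance auto insurance".split())
print(round(length, 2))
print({t: round(w, 2) for t, w in vec.items()})
```

"best" does not occur in the document, so its weight is 0 and it adds nothing to the length or to the score.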

- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top K (e.g., K = 10) to the user

- Speeding up vector space ranking
- Putting together a complete search system
- Will require learning about a number of miscellaneous topics and heuristics

- Find the K docs in the collection “nearest” to the query, i.e., the K largest query-doc cosines.
- Efficient ranking:
- Computing a single cosine efficiently.
- Choosing the K largest cosine values efficiently.
- Can we do this without computing all N cosines?

- What we’re doing in effect: solving the K-nearest neighbor problem for a query vector
- In general, we do not know how to do this efficiently for high-dimensional spaces
- But it is solvable for short queries, and standard indexes support this well

- No weighting on query terms
- Assume each query term occurs only once

- Then for ranking, don’t need to normalize query vector

- Typically we want to retrieve the top K docs (in the cosine ranking for the query)
- not to totally order all docs in the collection

- Can we pick off docs with K highest cosines?
- Let J = number of docs with nonzero cosines
- We seek the K best of these J

- Binary tree in which each node’s value > the values of children
- Takes 2J operations to construct, then each of the K “winners” is read off in 2 log J steps.
- For J=1M, K=100, this is about 10% of the cost of sorting.

[Figure: example max-heap holding the values 1, .9, .3, .3, .8, .1, .1, with 1 at the root.]
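A minimal sketch of heap-based top-K selection with Python's `heapq` module. `heapq` implements a min-heap, so scores are negated to get max-heap behavior; the function name `top_k` and the sample scores are illustrative.

```python
import heapq


def top_k(scores, k):
    """Return the k highest-scoring (doc, score) pairs.

    Builds a heap over all J nonzero scores in linear time,
    then pops the k winners, each in O(log J).
    """
    heap = [(-score, doc) for doc, score in scores.items()]
    heapq.heapify(heap)
    winners = []
    for _ in range(min(k, len(heap))):
        neg_score, doc = heapq.heappop(heap)
        winners.append((doc, -neg_score))
    return winners


# Hypothetical cosine scores for the docs with nonzero cosine.
scores = {"d1": 0.3, "d2": 1.0, "d3": 0.9, "d4": 0.1, "d5": 0.8}
print(top_k(scores, 3))
```

Only the K winners are ever put in order; the remaining J - K docs are never fully sorted, which is where the saving over a full sort comes from.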

- Primary computational bottleneck in scoring: cosine computation
- Can we avoid all this computation?
- Yes, but may sometimes get it wrong
- a doc not in the top K may creep into the list of K output docs
- Is this such a bad thing?

- User has a task and a query formulation
- Cosine matches docs to query
- Thus cosine is anyway a proxy for user happiness
- If we get a list of K docs “close” to the top K by cosine measure, should be ok

- Find a set A of contenders, with K < |A| << N
- A does not necessarily contain the top K, but has many docs from among the top K
- Return the top K docs in A

- Think of A as pruning non-contenders
- The same approach is also used for other (non-cosine) scoring functions
- Will look at several schemes following this approach

- Basic algorithm of Fig 7.1 only considers docs containing at least one query term
- Take this further:
- Only consider high-idf query terms
- Only consider docs containing many query terms

- For a query such as catcher in the rye
- Only accumulate scores from catcher and rye
- Intuition: in and the contribute little to the scores and don’t alter rank-ordering much
- Benefit:
- Postings of low-idf terms have many docs, so these (many) docs get eliminated from the set A of contenders

- Any doc with at least one query term is a candidate for the top K output list
- For multi-term queries, only compute scores for docs containing several of the query terms
- Say, at least 3 out of 4
- Imposes a “soft conjunction” on queries seen on web search engines (early Google)

- Easy to implement in postings traversal

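[Figure: postings lists for a four-term query, requiring at least 3 of the 4 terms.]

Antony: 3, 4, 8, 16, 32, 64, 128

Brutus: 2, 4, 8, 16, 32, 64, 128

Caesar: 1, 2, 3, 5, 8, 13, 21, 34

Calpurnia: 13, 16, 32

Scores only computed for 8, 16 and 32.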
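The "3 of 4" filter over the postings in the figure can be sketched as follows. The postings are read as Antony: 3, 4, 8, 16, 32, 64, 128; Brutus: 2, 4, 8, 16, 32, 64, 128; Caesar: 1, 2, 3, 5, 8, 13, 21, 34; Calpurnia: 13, 16, 32. The function name `candidates` is made up, and counting hits in a dictionary stands in for a real merged postings traversal.

```python
from collections import Counter

postings = {
    "antony": [3, 4, 8, 16, 32, 64, 128],
    "brutus": [2, 4, 8, 16, 32, 64, 128],
    "caesar": [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16, 32],
}


def candidates(query_terms, postings, min_terms):
    """Doc ids that appear in the postings of at least
    min_terms of the query terms."""
    hits = Counter()
    for term in query_terms:
        for doc_id in postings.get(term, []):
            hits[doc_id] += 1
    return sorted(doc for doc, count in hits.items() if count >= min_terms)


print(candidates(["antony", "brutus", "caesar", "calpurnia"], postings, 3))
```

Every other doc id appears in at most two of the four lists, so it is skipped before any score is computed.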

- Precompute for each dictionary term t, the r docs of highest weight in t’s postings
- Call this the champion list for t
- (aka fancy list or top docs for t)

- Note that r has to be chosen at index time
- At query time, only compute scores for docs in the champion list of some query term
- Pick the K top-scoring docs from amongst these
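One way champion lists might be built and used is sketched below. The data layout (a dict from term to {doc_id: weight}) and the names `build_champion_lists` and `champion_candidates` are assumptions made for the example.

```python
import heapq


def build_champion_lists(weights, r):
    """Index time: for each term keep the r docs of highest weight.

    weights maps term -> {doc_id: weight of the term in that doc}.
    """
    return {
        term: [doc for doc, _ in
               heapq.nlargest(r, docs.items(), key=lambda item: item[1])]
        for term, docs in weights.items()
    }


def champion_candidates(query_terms, champions):
    """Query time: union of the champion lists of the query terms."""
    result = set()
    for term in query_terms:
        result.update(champions.get(term, []))
    return result


# Hypothetical term weights.
weights = {
    "car": {1: 0.9, 2: 0.1, 3: 0.5},
    "insurance": {2: 0.7, 4: 0.2, 5: 0.1},
}
champions = build_champion_lists(weights, r=2)
print(champions)
print(champion_candidates(["car", "insurance"], champions))
```

Scores are then computed only for the docs in this union. Because r is fixed at index time, the union can contain fewer than K docs for some queries.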

- We want top-ranking documents to be both relevant and authoritative
- Relevance is being modeled by cosine scores
- Authority is typically a query-independent property of a document
- Examples of authority signals
- Wikipedia among websites
- Articles in certain newspapers
- A paper with many citations
- Many diggs, Y!buzzes or del.icio.us marks
- (Pagerank)

Quantitative

- Assign to each document d a query-independent quality score in [0,1]
- Denote this by g(d)

- Thus, a quantity like the number of citations is scaled into [0,1]
- Exercise: suggest a formula for this.

- Consider a simple total score combining cosine relevance and authority
- net-score(q,d) = g(d) + cosine(q,d)
- Can use some other linear combination than an equal weighting
- Indeed, any function of the two “signals” of user happiness – more later

- Now we seek the top K docs by net score

- First idea: Order all postings by g(d)
- Key: this is a common ordering for all postings
- Thus, can concurrently traverse query terms’ postings for
- Postings intersection
- Cosine score computation

- Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)

- Under g(d)-ordering, top-scoring docs likely to appear early in postings traversal
- In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early
- Short of computing scores for all docs in postings

- Can combine champion lists with g(d)-ordering
- Maintain for each term a champion list of the r docs with highest $g(d) + \text{tf-idf}_{t,d}$
- Seek top-K results from only the docs in these champion lists

- For each term, we maintain two postings lists called high and low
- Think of high as the champion list

- When traversing postings on a query, only traverse high lists first
- If we get more than K docs, select the top K and stop
- Else proceed to get docs from the low lists

- Can be used even for simple cosine scores, without global quality g(d)
- A means for segmenting index into two tiers
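The high/low traversal can be sketched as below, with the two tiers held as dictionaries from term to doc ids. The name `tiered_candidates` and the sample lists are hypothetical, and "enough docs" is read here as at least K.

```python
def tiered_candidates(query_terms, high, low, k):
    """Collect docs from the high lists first and fall back to the
    low lists only when the high lists yield fewer than k docs."""
    docs = set()
    for term in query_terms:
        docs.update(high.get(term, []))
    if len(docs) >= k:
        return docs  # enough contenders; low lists are never read
    for term in query_terms:
        docs.update(low.get(term, []))
    return docs


# Hypothetical two-tier index.
high = {"car": [1, 3], "insurance": [2]}
low = {"car": [5, 6], "insurance": [4, 7]}
print(tiered_candidates(["car", "insurance"], high, low, k=3))
print(tiered_candidates(["car", "insurance"], high, low, k=5))
```

The returned set would then be scored and the top K selected; the saving comes from skipping the low lists whenever the high lists suffice.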

- We only want to compute scores for docs for which $wf_{t,d}$ is high enough
- We sort each postings list by $wf_{t,d}$
- Now: not all postings in a common order!
- How do we compute scores in order to pick off top K?
- Two ideas follow

- When traversing t’s postings, stop early after either
- a fixed number of r docs
- $wf_{t,d}$ drops below some threshold

- Take the union of the resulting sets of docs
- One from the postings of each query term

- Compute only the scores for docs in this union

- When considering the postings of query terms
- Look at them in order of decreasing idf
- High idf terms likely to contribute most to score

- As we update score contribution from each query term
- Stop if doc scores relatively unchanged

- Can apply to cosine or some other net scores

- Pick √N docs at random: call these leaders
- For every other doc, pre-compute nearest leader
- Docs attached to a leader: its followers
- Likely: each leader has ~ √N followers.

- Process a query as follows:
- Given query Q, find its nearest leader L.
- Seek K nearest docs from among L’s followers.
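A compact sketch of this leader/follower scheme, assuming roughly √N randomly chosen leaders, dense weight vectors for the docs, and cosine as the nearness measure. The names `build_clusters` and `search`, the fixed seed, and the toy vectors are all invented; each leader is also kept in its own group so it can be returned as a result.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_clusters(docs, seed=0):
    """Preprocessing: pick ~sqrt(N) random leaders and attach
    every other doc to its nearest leader."""
    rng = random.Random(seed)
    n_leaders = max(1, round(math.sqrt(len(docs))))
    leader_ids = rng.sample(range(len(docs)), n_leaders)
    followers = {leader: [leader] for leader in leader_ids}
    for i, doc in enumerate(docs):
        if i in followers:
            continue
        nearest = max(leader_ids, key=lambda l: cosine(doc, docs[l]))
        followers[nearest].append(i)
    return followers


def search(query, docs, followers, k):
    """Query time: find the nearest leader, then rank only its group."""
    leader = max(followers, key=lambda l: cosine(query, docs[l]))
    group = followers[leader]
    return sorted(group, key=lambda i: cosine(query, docs[i]),
                  reverse=True)[:k]


docs = [[1, 0], [0.9, 0.1], [0.8, 0.3], [0, 1], [0.1, 0.9],
        [0.2, 0.8], [0.5, 0.5], [0.6, 0.4], [0.4, 0.6]]
followers = build_clusters(docs)
print(search([1.0, 0.0], docs, followers, k=2))
```

Only about √N leader comparisons plus one group are examined per query. The true nearest docs can be missed when they are attached to a different leader.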

[Figure: a query point routed to its nearest leader and that leader's followers.]

- Fast
- Leaders reflect data distribution

- Have each follower attached to b1=3 (say) nearest leaders.
- From query, find b2=4 (say) nearest leaders and their followers.
- Can recur on leader/follower construction.