Information Retrieval. CSE 8337 (Part B), Spring 2009
Some Material for these slides obtained from:
Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/
Data Mining Introductory and Advanced Topics by Margaret H. Dunham
Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze
Latent Semantic Indexing
The Boolean Model
$sim(d_j, q) = 1$ if $\exists\, q_{cc} \mid (q_{cc} \in q_{dnf}) \wedge (\forall k_i,\ g_i(d_j) = g_i(q_{cc}))$, and $0$ otherwise
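The matching rule above can be sketched as follows. This is a minimal illustration, assuming each document is represented by the set of index terms it contains and the query has already been converted to disjunctive normal form, with every conjunctive component written as a binary vector over the vocabulary. The function name `boolean_sim`, the three-term vocabulary, and the example query are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def boolean_sim(doc_terms: Set[str],
                qdnf: Iterable[Tuple[int, ...]],
                vocabulary: Sequence[str]) -> int:
    """Return 1 if the document's binary term vector equals some
    conjunctive component of the query's DNF, and 0 otherwise."""
    # g_i(d_j): 1 if term k_i occurs in the document, else 0.
    doc_vector = tuple(1 if k in doc_terms else 0 for k in vocabulary)
    return 1 if any(doc_vector == tuple(qcc) for qcc in qdnf) else 0


# Hypothetical query q = ka AND (kb OR NOT kc) over the vocabulary (ka, kb, kc).
vocabulary = ["ka", "kb", "kc"]
qdnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

print(boolean_sim({"ka", "kb"}, qdnf, vocabulary))  # vector (1, 1, 0) is a component
print(boolean_sim({"ka", "kc"}, qdnf, vocabulary))  # vector (1, 0, 1) is not
```

The result is strictly 0 or 1, which is why the Boolean model cannot rank the documents it retrieves.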
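$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|}$
$= \frac{\sum_i w_{ij} \, w_{iq}}{|\vec{d_j}| \, |\vec{q}|}$
$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$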
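A small sketch of the two formulas above: the normalized term frequency and the cosine-style similarity between a document and a query. It assumes documents arrive as token lists and, as a simplification, uses the normalized frequencies directly as the weights w_ij. The names `normalized_tf` and `vector_sim` and the sample data are illustrative, not from the slides.

```python
import math
from collections import Counter
from typing import Dict, List


def normalized_tf(doc_tokens: List[str]) -> Dict[str, float]:
    """f_ij = freq_ij / max_l freq_lj for every term i in document j."""
    counts = Counter(doc_tokens)
    if not counts:
        return {}
    max_freq = max(counts.values())
    return {term: freq / max_freq for term, freq in counts.items()}


def vector_sim(doc_weights: Dict[str, float],
               query_weights: Dict[str, float]) -> float:
    """Dot product of the weight vectors divided by the product of their norms."""
    dot = sum(w * query_weights.get(t, 0.0) for t, w in doc_weights.items())
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)


# Assumed sample data: simplified weights, no idf component.
doc = normalized_tf(["car", "insurance", "auto", "insurance"])
query = {"car": 1.0, "insurance": 1.0}
print(doc)
print(round(vector_sim(doc, query), 3))
```

Because the similarity is a real number between 0 and 1 for non-negative weights, documents can be ranked and partial matches are rewarded.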
The Vector Model: Example I
The Vector Model: Example II
The Vector Model: Example III
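$sim(d_j, q) \sim \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \bar{R})}$
$\sim \frac{\left[\prod_{g_i(d_j)=1} P(k_i \mid R)\right] \cdot \left[\prod_{g_i(d_j)=0} P(\bar{k_i} \mid R)\right]}{\left[\prod_{g_i(d_j)=1} P(k_i \mid \bar{R})\right] \cdot \left[\prod_{g_i(d_j)=0} P(\bar{k_i} \mid \bar{R})\right]}$
$\sim \log \frac{\left[\prod_{g_i(d_j)=1} P(k_i \mid R)\right] \cdot \left[\prod_{g_i(d_j)=0} P(\bar{k_i} \mid R)\right]}{\left[\prod_{g_i(d_j)=1} P(k_i \mid \bar{R})\right] \cdot \left[\prod_{g_i(d_j)=0} P(\bar{k_i} \mid \bar{R})\right]}$
$\sim K \cdot \sum_{g_i(d_j)=1} \left[ \log \frac{P(k_i \mid R)}{P(\bar{k_i} \mid R)} + \log \frac{P(\bar{k_i} \mid \bar{R})}{P(k_i \mid \bar{R})} \right]$
where $P(\bar{k_i} \mid R) = 1 - P(k_i \mid R)$ and $P(\bar{k_i} \mid \bar{R}) = 1 - P(k_i \mid \bar{R})$
$sim(d_j, q) \sim \sum_i w_{iq} \cdot w_{ij} \cdot \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$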
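The final ranking expression of this derivation can be sketched as below. This is an illustration, assuming binary weights (w_iq and w_ij are 1 when the term occurs, 0 otherwise) and assuming that estimates of P(ki | R) and P(ki | not R) are supplied by the caller. The function name `bir_score` and the probability values are hypothetical.

```python
import math
from typing import Dict, Set


def bir_score(doc_terms: Set[str],
              query_terms: Set[str],
              p_rel: Dict[str, float],
              p_nonrel: Dict[str, float]) -> float:
    """Sum, over terms shared by query and document, of
    log(p / (1 - p)) + log((1 - u) / u), where p = P(k_i | R) and
    u = P(k_i | not R). Probabilities must lie strictly between 0 and 1."""
    score = 0.0
    # With binary weights, w_iq * w_ij is 1 only for terms in both sets.
    for term in query_terms & doc_terms:
        p = p_rel[term]
        u = p_nonrel[term]
        score += math.log(p / (1.0 - p)) + math.log((1.0 - u) / u)
    return score


# Assumed probability estimates, for illustration only.
p_rel = {"car": 0.5, "insurance": 0.8}
p_nonrel = {"car": 0.1, "insurance": 0.2}
query = {"car", "insurance"}

d1 = bir_score({"car", "insurance", "auto"}, query, p_rel, p_nonrel)
d2 = bir_score({"car", "auto"}, query, p_rel, p_nonrel)
print(d1 > d2)  # the document matching both query terms ranks higher
```

In practice the probabilities are unknown at query time and must be estimated, for example from an initial guess that is refined as relevant documents are identified.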
$sim(q_{and}, d_j) = 1 - \sqrt{\frac{(1 - x)^2 + (1 - y)^2}{2}}$
$q_{and} = k_x \wedge k_y$; $w_{xj} = x$ and $w_{yj} = y$
$sim(q_{or}, d_j) = \sqrt{\frac{x^2 + y^2}{2}}$
$q_{or} = k_x \vee k_y$; $w_{xj} = x$ and $w_{yj} = y$
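The two similarity functions can be sketched as follows, assuming the usual Euclidean form in which the weights are squared and the sum is divided by 2 so that the result stays in [0, 1]. Here x and y are the weights of terms kx and ky in document dj; the function names are illustrative.

```python
import math


def sim_and(x: float, y: float) -> float:
    """Conjunctive query: one minus the normalized distance from the ideal point (1, 1)."""
    return 1.0 - math.sqrt(((1.0 - x) ** 2 + (1.0 - y) ** 2) / 2.0)


def sim_or(x: float, y: float) -> float:
    """Disjunctive query: normalized distance from the worst point (0, 0)."""
    return math.sqrt((x ** 2 + y ** 2) / 2.0)


# A document containing only one of the two terms (weights 1 and 0).
print(round(sim_and(1.0, 0.0), 3))
print(round(sim_or(1.0, 0.0), 3))
```

Unlike the strict Boolean model, a document with only one of the two terms gets a non-zero score for the AND query and less than a full score for the OR query.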
k1 and k2 are to be used as in vector retrieval, while the presence of k3 is required.
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
$\cos(q, d) = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \, \sqrt{\sum_i d_i^2}}$
qi is the tf-idf weight of term i in the query
di is the tf-idf weight of term i in the document
cos(q,d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.
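A short sketch contrasting the two measures discussed above. It assumes the query and document are dense weight vectors of equal length; the vectors `q` and `d2` are made-up values chosen so that d2 has the same term distribution as q but is four times longer.

```python
import math
from typing import Sequence


def cosine(q: Sequence[float], d: Sequence[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


def euclidean(q: Sequence[float], d: Sequence[float]) -> float:
    """Straight-line distance between the two vector endpoints."""
    return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))


# Hypothetical weights: d2 points in the same direction as q but is much longer.
q = [1.0, 2.0]
d2 = [4.0, 8.0]

print(round(euclidean(q, d2), 3))  # large distance
print(round(cosine(q, d2), 3))     # angle between the vectors is zero
```

Ranking by angle rather than by distance removes the effect of document length, which is the motivation for cosine similarity.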
How similar are the novels
SaS: Sense and Sensibility
PaP: Pride and Prejudice
WH: Wuthering Heights
Term frequencies (counts)
Log frequency weighting
cos(SaS,PaP) ≈ 0.789 ∗ 0.832 + 0.515 ∗ 0.555 + 0.335 ∗ 0.0 + 0.0 ∗ 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
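The cosine values above can be reproduced with a short sketch. The raw term counts are not given in the text here, so the four terms and their counts below are assumptions (the values commonly quoted for this textbook example). Log-frequency weighting, 1 + log10(tf), followed by length normalization yields the SaS and PaP weights that appear in the dot product above.

```python
import math
from typing import Dict

# ASSUMED term counts, not taken from the text above.
counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH": {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}


def log_weight(tf: int) -> float:
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def unit_vector(tf_by_term: Dict[str, int]) -> Dict[str, float]:
    """Log-weight every term, then divide by the vector's Euclidean length."""
    weights = {t: log_weight(tf) for t, tf in tf_by_term.items()}
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """For unit vectors the cosine is just the dot product."""
    return sum(a[t] * b[t] for t in a)


vectors = {name: unit_vector(tfs) for name, tfs in counts.items()}
for x, y in [("SaS", "PaP"), ("SaS", "WH"), ("PaP", "WH")]:
    print(x, y, round(cosine(vectors[x], vectors[y]), 2))
```

Under these assumed counts, the two Austen novels share a similar weight profile on the first two terms, while WH spreads its weight over terms the others barely use.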
Columns headed ‘n’ are acronyms for weight schemes.
Why is the base of the log in idf immaterial?
Is this a bad idea?
Document: car insurance auto insurance
Query: best car insurance
Exercise: what is N, the number of docs?
Doc length =
Score = 0+0+1.04+2.04 = 3.08
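A sketch of how a score like the one above could be computed. The weighting scheme, collection size, and document frequencies are not stated in the text here, so everything in the configuration block is an assumption: document weights use log term frequency with cosine normalization, query weights use log term frequency times idf with no normalization, and `N_DOCS` and `DOC_FREQ` are illustrative values.

```python
import math
from collections import Counter

# ASSUMED values for illustration only; the text above does not give them.
N_DOCS = 1_000_000
DOC_FREQ = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}


def log_tf(count: int) -> float:
    """Log-frequency weight: 1 + log10(count) for count > 0, else 0."""
    return 1.0 + math.log10(count) if count > 0 else 0.0


def idf(term: str) -> float:
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N_DOCS / DOC_FREQ[term])


def score(query: str, document: str) -> float:
    q_counts = Counter(query.split())
    d_counts = Counter(document.split())
    # Document side: log tf, then divide by the vector's Euclidean length.
    d_weights = {t: log_tf(c) for t, c in d_counts.items()}
    d_length = math.sqrt(sum(w * w for w in d_weights.values()))
    if d_length == 0.0:
        return 0.0
    total = 0.0
    for term, count in q_counts.items():
        q_weight = log_tf(count) * idf(term)  # query side: tf-idf, unnormalized
        total += q_weight * (d_weights.get(term, 0.0) / d_length)
    return total


result = score("best car insurance", "car insurance auto insurance")
print(round(result, 2))
```

With these assumed inputs the result comes out close to the 3.08 shown above; the small difference arises because the slide rounds the intermediate weights to two decimals before multiplying.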