CES 514 Data Mining, March 11, 2010. Lecture 5: scoring, term weighting, vector space model (Ch 6). Model for ranking documents. Topics: ranked retrieval; zones, parameters in querying; scoring documents; term frequency; collection statistics; weighting schemes; vector space scoring.
UI that supports parametric search
Zone index - choices
Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1], by computing a linear combination of zone scores, where each zone of the document contributes a Boolean value.
Consider a set of documents, each of which has $\ell$ zones. Let $g_1, \ldots, g_\ell \in [0, 1]$ such that $\sum_{i=1}^{\ell} g_i = 1$. For $1 \le i \le \ell$, let $s_i$ be the Boolean score denoting a match between q and the ith zone.
Weighted score = $\sum_{i=1}^{\ell} g_i s_i$
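The linear combination of zone scores described above can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_zone_score`, the three zones (author, title, body) and their weights are assumptions, not values from the lecture.

```python
def weighted_zone_score(weights, matches):
    """Return sum_i g_i * s_i for zone weights g_i and Boolean matches s_i."""
    if len(weights) != len(matches):
        raise ValueError("need one match flag per zone")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("zone weights must sum to 1")
    # Each zone contributes its weight only if the query matches that zone.
    return sum(g * (1 if s else 0) for g, s in zip(weights, matches))


# Hypothetical three-zone collection: author, title, body.
g = [0.2, 0.3, 0.5]
# The query matches the title and body zones but not the author zone.
s = [False, True, True]

score = weighted_zone_score(g, s)
print(score)
```

Because the weights sum to 1 and each zone score is 0 or 1, the result always lies in [0, 1].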
computation of weighted zone scores
Learning weights
Ch. 6
Sec. 6.2
Each document is represented by a binary vector ∈ {0,1}^|V|
Sec. 6.2.1
It will turn out that the base of the log is immaterial.
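A small numerical check makes the claim about the log base concrete. The sketch assumes the usual definition idf_t = log(N / df_t), which is not spelled out in the surviving slide text; the collection size and document frequencies below are hypothetical.

```python
import math


def idf(n_docs, df, base):
    """Inverse document frequency log_base(N / df); assumed standard form."""
    return math.log(n_docs / df, base)


N = 1_000_000                      # hypothetical collection size
doc_freqs = [10, 1_000, 100_000]   # hypothetical document frequencies

idf_base10 = [idf(N, df, 10) for df in doc_freqs]
idf_base2 = [idf(N, df, 2) for df in doc_freqs]

# Changing the base multiplies every idf value by the same constant,
# so the relative order of terms (and of document scores) is unchanged.
ratios = [b2 / b10 for b2, b10 in zip(idf_base2, idf_base10)]
print(ratios)
```

Every ratio equals log2(10) up to rounding, so switching bases rescales all weights uniformly and cannot change a ranking.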
Sec. 6.2.2
Sec. 6.3
Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|
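As a rough sketch of how such a vector could be built, the code below assumes the common weighting w = (1 + log10 tf) × log10(N / df), with weight 0 for absent terms. The function name `tf_idf_vector`, the vocabulary, the document frequencies and the collection size are all hypothetical.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, vocabulary, df, n_docs):
    """Map a tokenized document to one tf-idf weight per vocabulary term."""
    counts = Counter(tokens)
    vector = []
    for term in vocabulary:
        tf = counts[term]
        if tf == 0:
            vector.append(0.0)  # absent terms get weight 0
        else:
            vector.append((1 + math.log10(tf)) * math.log10(n_docs / df[term]))
    return vector


# Hypothetical collection statistics, for illustration only.
vocabulary = ["auto", "best", "car", "insurance"]
df = {"auto": 100, "best": 500, "car": 10, "insurance": 50}
n_docs = 1000

vec = tf_idf_vector("car insurance auto insurance".split(), vocabulary, df, n_docs)
print(vec)
```

The vector has one component per vocabulary term, so every document in the collection lands in the same |V|-dimensional space.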
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
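The effect is easy to reproduce with made-up numbers. In the sketch below the vectors `q` and `d2` are hypothetical (d2 is simply q scaled by 10, so both have the same term distribution); they are not the values from the slide's figure.

```python
import math


def angle_degrees(u, v):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos))


q = [1.0, 2.0]     # hypothetical query vector
d2 = [10.0, 20.0]  # hypothetical document: same direction, ten times longer

distance = math.dist(q, d2)   # Euclidean distance (Python 3.8+)
angle = angle_degrees(q, d2)
print(distance, angle)
```

The distance is large while the angle is essentially zero, which is the motivation for ranking by angle rather than by distance.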
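$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$
Dot product: the numerator $\vec{q} \cdot \vec{d}$
Unit vectors: $\vec{q}/|\vec{q}|$ and $\vec{d}/|\vec{d}|$
q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document
cos(q,d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.
For q, d length-normalized: $\cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_{i=1}^{|V|} q_i d_i$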
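A minimal sketch of cosine similarity as length normalization followed by a dot product is given below. The helper names `length_normalize` and `cosine_similarity` are illustrative choices, and the input vectors are arbitrary examples.

```python
import math


def length_normalize(v):
    """Divide a vector by its Euclidean (L2) length to get a unit vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine_similarity(q, d):
    """Cosine of the angle between q and d: dot product of the unit vectors."""
    qn = length_normalize(q)
    dn = length_normalize(d)
    return sum(qi * di for qi, di in zip(qn, dn))


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

If the document vectors are normalized once at indexing time, scoring a query reduces to the dot product alone.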
How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights?
Term frequencies (counts)
Note: To simplify this example, we don’t do idf weighting.
Log frequency weighting
After length normalization
cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
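The arithmetic for cos(SaS,PaP) can be checked directly. The sketch below uses only the length-normalized weights that appear in the sum above, in the same term order; the variable names are illustrative, and the WH vector is omitted because its weights are not given in the text.

```python
# Length-normalized log-frequency weights, copied from the sum above.
sas = [0.789, 0.515, 0.335, 0.0]
pap = [0.832, 0.555, 0.0, 0.0]

# For length-normalized vectors, cosine similarity is just the dot product.
cos_sas_pap = sum(a * b for a, b in zip(sas, pap))
print(round(cos_sas_pap, 2))  # 0.94
```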
Sec. 6.4
Columns headed ‘n’ are acronyms for weight schemes.
Why is the base of the log in idf immaterial?
Document: car insurance auto insurance
Query: best car insurance
Exercise: what is N, the number of docs?
Doc length =
Score = 0+0+0.27+0.53 = 0.8
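The document side of this example can be sketched as follows. The slide's table did not survive extraction, so the weighting used here is an assumption: log-frequency term weights (1 + log10 tf), no idf, and cosine normalization for the document. The query weights depend on N and the document frequencies, which are not given, so they are not reproduced.

```python
import math
from collections import Counter

doc = "car insurance auto insurance".split()
counts = Counter(doc)

# Assumed document weighting: log-frequency weight, no idf.
log_tf = {term: 1 + math.log10(tf) for term, tf in counts.items()}

# Euclidean length of the document vector, then cosine normalization.
doc_length = math.sqrt(sum(w * w for w in log_tf.values()))
normalized = {term: w / doc_length for term, w in log_tf.items()}

print(round(doc_length, 2))
print({term: round(w, 2) for term, w in normalized.items()})
```

Under these assumptions the normalized weights for car and insurance come out near 0.52 and 0.68; each is then multiplied by the corresponding query weight to give that term's contribution to the score.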