
Information Retrieval

This lecture introduces the concepts of Boolean search and ranking techniques in information retrieval. It explores the strengths and weaknesses of Boolean search and discusses the importance of ranking documents based on their relevance to a query. The lecture also covers term weighting techniques such as term frequency and inverse document frequency.



  1. Information Retrieval • For the MSc Computer Science Programme • Lecture 2 • Introduction to Information Retrieval (Manning et al. 2007), Chapters 6 & 7 • Dell Zhang, Birkbeck, University of London

  2. Boolean Search • Strength • Docs either match or not. • Good for expert users with precise understanding of their needs and the corpus. • Weakness • Not good for (the majority of) users with poor Boolean formulation of their needs. • Applications may consume 1000’s of results, but most users don’t want to wade through 1000’s of results – cf. use of Web search engines.

  3. Beyond Boolean Search • Solution: Ranking • We wish to return in order the documents most likely to be useful to the searcher. • How can we rank/order the docs in the corpus with respect to a query? • Assign a score – say in [0, 1] – for each doc on each query.

  4. Document Scoring • Idea: More is Better • If a document talks about a topic more, then it is a better match. • That is to say, a document is more relevant if it has more relevant terms. • This leads to the problem of term weighting.

  5. Bag-Of-Words (BOW) Model • Term-Document Count Matrix • Each document corresponds to a vector in ℕ^|V| (V is the vocabulary), i.e., a column below. The matrix element A(i,j) is the number of occurrences of the i-th term in the j-th doc.

  6. Bag-Of-Words (BOW) Model • Simplification • In the BOW model, • the doc • John is quicker than Mary. • is indistinguishable from the doc • Mary is quicker than John.
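The count representation and its order-blindness can be sketched in a few lines. The two example docs are the ones from the slide; here a `Counter` plays the role of one column of the term-document count matrix:

```python
from collections import Counter

# The two docs from the slide; each Counter is one column of the
# term-document count matrix: term -> number of occurrences in that doc.
docs = {
    "d1": "john is quicker than mary",
    "d2": "mary is quicker than john",
}
counts = {name: Counter(text.split()) for name, text in docs.items()}

print(counts["d1"]["john"])          # A(john, d1) = 1
print(counts["d1"] == counts["d2"])  # True: BOW discards word order
```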

  7. Term Frequency (TF) • Digression: Terminology • WARNING: In a lot of IR literature, “frequency” is used to mean “count”. • Thus term frequency in IR literature is used to mean the number of occurrences of a term in a document, not divided by document length (which would actually make it a frequency). • We will conform to this misnomer: in saying term frequency we mean the number of occurrences of a term in a document.

  8. Term Frequency (TF) • What is the relative importance of • 0 vs. 1 occurrence of a term in a doc, • 1 vs. 2 occurrences, • 2 vs. 3 occurrences, ……? • Can just use raw tf. • While it seems that more is better, a lot isn’t proportionally better than a few. • So another option commonly used in practice is sublinear scaling: wf(t,d) = 1 + log tf(t,d) if tf(t,d) > 0, and 0 otherwise.
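A minimal sketch of that sublinear option, assuming the usual textbook choice of a base-10 logarithm:

```python
import math

def wf(tf):
    """Sublinear TF scaling: 1 + log10(tf) for tf > 0, else 0 (a common choice)."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

print(wf(0))    # 0.0
print(wf(1))    # 1.0
print(wf(100))  # 3.0: 100 occurrences score only 3x as much as 1
```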

  9. Term Frequency (TF) • The score of a document d for a query q: score(q, d) = Σ_{t ∈ q} tf(t,d) • 0 if no query terms in document • wf can be used instead of tf in the above

  10. Term Frequency (TF) • Is TF good enough for weighting? • Ignorance of document length • Long docs are favored because they’re more likely to contain query terms. • This can be fixed to some extent by normalizing for document length. [talk later]

  11. Term Frequency (TF) • Is TF good enough for weighting? • Ignorance of term rarity in corpus • Consider the query ides of march. • Julius Caesar has 5 occurrences of ides, while no other play has ides. • march occurs in over a dozen plays. • All the plays contain of. • By this weighting scheme, the top-scoring play is likely to be the one with the most ofs.

  12. Document/Collection Frequency • Which of these tells you more about a doc? • 5 occurrences of of? • 5 occurrences of march? • 5 occurrences of ides? • We’d like to attenuate the weight of a common term. But what is “common”? • Collection Frequency (CF) • the number of occurrences of the term in the corpus • Document Frequency (DF) • the number of docs in the corpus containing the term

  13. Document/Collection Frequency • DF may be better than CF:

  Word        CF     DF
  try         10422  8760
  insurance   10440  3997

  So how do we make use of DF?

  14. Inverse Document Frequency (IDF) • Could just be the reciprocal of DF (idf_i = 1/df_i). • But by far the most commonly used version is: idf_i = log(N / df_i), where N is the total number of docs in the corpus.
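The log-scaled version as code, using the DF figures from the try/insurance slide and a hypothetical corpus size of 1,000,000 docs:

```python
import math

def idf(N, df):
    """IDF of a term: log of (corpus size / document frequency)."""
    return math.log10(N / df)

N = 1_000_000
print(idf(N, 8760))  # "try": higher df -> lower idf
print(idf(N, 3997))  # "insurance": lower df -> higher idf
```

Despite nearly identical collection frequencies, insurance gets a visibly higher idf than try, which is exactly why DF is the more useful statistic.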

  15. Inverse Document Frequency (IDF) • Prof Karen Spärck Jones, 1935-2007

  16. TFxIDF • The TFxIDF weighting scheme combines: • Term Frequency (TF) • measure of term density in a doc • Inverse Document Frequency (IDF) • measure of informativeness of a term: its rarity across the whole corpus

  17. TFxIDF • Each term i in each document d is assigned a TFxIDF weight: w(i,d) = tf(i,d) × log(N / df_i) • Increases with the number of occurrences within a doc. • Increases with the rarity of the term across the whole corpus. What is the weight of a term that occurs in all of the docs?
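Combining the two factors (a sketch; the corpus size N and the counts below are made up). It also answers the slide's closing question: a term that occurs in every doc gets weight 0.

```python
import math

def tfidf(tf, df, N):
    """TFxIDF weight: tf * log(N / df); zero when the term is absent."""
    if tf == 0:
        return 0.0
    return tf * math.log10(N / df)

N = 10_000
print(tfidf(5, 100, N))  # 10.0: rare term, several occurrences
print(tfidf(5, N, N))    # 0.0: a term in all docs carries no information
```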

  18. Term-Document Matrix (Real-Valued) The matrix element A(i,j) is the log-scaled TFxIDF weight. Note: can be > 1.

  19. Vector Space Model • Docs → Vectors • Each doc j can now be viewed as a vector of TFxIDF values, one component for each term. • So we have a vector space • Terms are axes • Docs live in this space • May have 20,000+ dimensions • even with stemming

  20. Vector Space Model • Prof Gerard Salton 1927-1995 The SMART information retrieval system

  21. Vector Space Model [Figure: docs d1–d5 as vectors on term axes t1, t2, t3, with angles θ and φ between them] • First application: Query-By-Example (QBE) • Given a doc d, find others “like” it. • Now that d is a vector, find vectors (docs) “near” it. Postulate: Documents that are “close together” in the vector space talk about the same things.

  22. Vector Space Model • Queries → Vectors • Regard a query as a (very short) document. • Return the docs ranked by the closeness of their vectors to the query, also represented as a vector.

  23. Desiderata for Proximity • If d1 is near d2, then d2 is near d1. • If d1 near d2, and d2 near d3, then d1 is not far from d3. • No doc is closer to d than d itself.

  24. Euclidean Distance • Distance between dj and dk is dist(dj, dk) = √ Σ_i (A(i,j) − A(i,k))² • Why is this not a great idea? • We still haven’t dealt with the issue of length normalization: long documents would be more similar to each other by virtue of length, not topic. • However, we can implicitly normalize by looking at angles instead.

  25. Cosine Similarity • Vector Normalization • A vector can be normalized (given a length of 1) by dividing each of its components by its length – here we use the L2 norm: ||d||₂ = √ Σ_i d_i² • This maps vectors onto the unit sphere. • Then longer documents don’t get more weight.

  26. Cosine Similarity • Cosine of angle between two vectors: sim(dj, dk) = cos θ = (dj · dk) / (||dj|| ||dk||) = Σ_i A(i,j) A(i,k) / (√ Σ_i A(i,j)² × √ Σ_i A(i,k)²) The denominator involves the lengths of the vectors. This means normalization.
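The cosine formula as a short pure-Python sketch (the example vectors are made up). Doubling a doc's length leaves its direction, and hence the cosine, unchanged:

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d = [3.0, 1.0, 0.0]
print(cosine(d, [6.0, 2.0, 0.0]))  # 1.0: same direction, twice the length
```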

  27. Cosine Similarity [Figure: vectors d1 and d2 on term axes t1, t2, t3, separated by angle θ] • The similarity between dj and dk is captured by the cosine of the angle between their vectors. No triangle inequality for similarity.

  28. Cosine Similarity - Exercise • Rank the following by decreasing cosine similarity: • Two docs that have only frequent words (the, a, an, of) in common. • Two docs that have no words in common. • Two docs that have many rare words in common (wingspan, tailfin).

  29. Cosine Similarity - Exercise • Show that, for normalized vectors, Euclidean distance measure gives the same proximity ordering as the cosine similarity measure.

  30. Cosine Similarity - Example • Docs • Austen's Sense and Sensibility (SaS) • Austen's Pride and Prejudice (PaP) • Bronte's Wuthering Heights (WH) cos(SaS, PaP) = 0.996 x 0.993 + 0.087 x 0.120 + 0.017 x 0.000 = 0.999 cos(SaS, WH) = 0.996 x 0.847 + 0.087 x 0.466 + 0.017 x 0.254 = 0.889
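Since the component values on the slide are already length-normalized, the cosines reduce to plain dot products. Reproducing the arithmetic (results match the slide's figures up to rounding of the printed components):

```python
# Length-normalized 3-component vectors from the slide.
sas = [0.996, 0.087, 0.017]  # Sense and Sensibility
pap = [0.993, 0.120, 0.000]  # Pride and Prejudice
wh = [0.847, 0.466, 0.254]   # Wuthering Heights

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot(sas, pap))  # ~0.999: the two Austen novels are very close
print(dot(sas, wh))   # ~0.889
```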

  31. Vector Space Model - Summary • What’s the real point of using vector space? • Every query can be viewed as a (very short) doc. • Every query becomes a vector in the same space as the docs. • Can measure each doc’s proximity to the query. • It provides a natural measure of scores/ranking – no longer Boolean. • Docs (and queries) are expressed as bags of words.

  32. Vector Space Model - Exercise • How would you augment the inverted index built in previous lectures to support cosine ranking computations? • Walk through the steps of serving a query using the Vector Space Model.

  33. Efficient Cosine Ranking • Computing a single cosine • For every term t, with each doc d, add tf(t,d) to the postings list. • Some tradeoffs on whether to store term count, term weight, or weighted by IDF. • At query time, accumulate component-wise sum.
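The query-time accumulation step can be sketched with a dictionary of per-doc accumulators. The postings lists and weights below are hypothetical, chosen to echo the ides of march example:

```python
from collections import defaultdict

# Hypothetical postings lists: term -> [(doc_id, precomputed weight), ...]
postings = {
    "ides": [("julius-caesar", 2.3)],
    "march": [("julius-caesar", 0.8), ("henry-v", 0.6)],
}

def score(query_terms):
    """Accumulate, per doc, the component-wise sum over the query terms."""
    acc = defaultdict(float)
    for t in query_terms:
        for doc, w in postings.get(t, []):
            acc[doc] += w
    return dict(acc)

print(score(["ides", "march"]))  # julius-caesar outscores henry-v
```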

  34. Efficient Cosine Ranking • Computing the k largest cosines • Search as a kNN problem • Find the k docs “nearest” to the query (with largest query-doc cosines) in the vector space. • Do not need to totally order all docs in the corpus. • Use a heap for selecting the top k docs • Binary tree in which each node’s value > values of children • Takes 2n operations to construct, then each of the k “winners” is read off in log n steps. • For n = 1M, k = 100, this is about 10% of the cost of complete sorting.
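Selecting the top k without fully sorting can be done with the standard library's heap routines (the doc scores here are made up):

```python
import heapq

# Hypothetical query-doc cosine scores.
scores = {"d1": 0.21, "d2": 0.87, "d3": 0.45, "d4": 0.60}

# nlargest builds a heap internally, avoiding a complete sort of all n docs.
top2 = heapq.nlargest(2, scores.items(), key=lambda kv: kv[1])
print(top2)  # [('d2', 0.87), ('d4', 0.6)]
```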

  35. Efficient Cosine Ranking • Heuristics • Avoid computing cosines from query to each of n docs, but may occasionally get an answer wrong. • For example, cluster pruning.

  36. Take Home Messages • TFxIDF • Vector Space Model • docs and queries as vectors • cosine similarity • efficient cosine ranking
