This lecture introduces assignment 2, discusses dictionary-based and successive variety stemming, and explores the vector space model for modern information retrieval. It covers topics such as accepting natural language queries, ranking documents based on vocabulary overlap, and the use of term weights for similarity calculations.
CS533 Information Retrieval • Dr. Michal Cutler • Lecture #4 • February 8, 1999
This lecture • Introduce assignment 2 • Dictionary-based and successive variety stemming • “Modern” information retrieval • The vector space model
Modern information retrieval • Accepts natural language queries • Assumes that useful documents and queries have substantial vocabulary overlap • The degree of overlap enables ranking
Natural language queries • Queries are easier to formulate • A document paragraph can be selected and used as a query • Enables clustering documents by similarity • Enables automatic creation of hyperlinks
Ranking documents • The top-ranked documents should be the best ones • Enables restricting output to the top n documents • The very top relevant (and non-relevant) documents can be used for query expansion (relevance feedback)
Comparison to Boolean retrieval • In conventional Boolean retrieval: • the output size cannot be limited • the best documents may be spread anywhere in the output
The Vector Space Model • Queries and documents are represented by vectors • Assumes document terms and query terms are independent
The vector space model • Assume a collection vocabulary of t terms • This allows documents and queries to be represented as vectors of t dimensions
The document vector • The ith document, Di, is represented by Di = (di1, …, dit), where dik, for k = 1, …, t, is the weight of term k in the document
The vector space model • Similarly, the query Q is represented by Q = (q1, …, qt), where qk, for k = 1, …, t, is the weight of term k in the query
Term weights • Binary: • w = 1 if the term is present in the document • w = 0 if not • Real number w representing the “importance” of the term: • w > 0 if the term is present in the document • w = 0 if not
The vector space model • Computes the similarity between vectors X = (x1, x2, …, xt) and Y = (y1, y2, …, yt), where • xi is the weight of term i in the document and • yi is the weight of term i in the query
The vector space model • For binary weights: • let |X| = the number of 1s in the document and • |Y| = the number of 1s in the query • The inner product then counts the terms that the document and the query share, as the sketch below shows
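A quick Python sketch (mine, not from the original slides, using a hypothetical 6-term vocabulary) confirming that for binary weights the inner product equals the number of shared terms:

```python
# Binary term vectors over a toy vocabulary of 6 terms (hypothetical data).
x = [0, 1, 0, 1, 0, 0]   # document: terms 2 and 4 are present
y = [1, 1, 0, 0, 1, 0]   # query: terms 1, 2 and 5 are present

# Inner product of the binary vectors...
inner = sum(xi * yi for xi, yi in zip(x, y))

# ...equals the size of the intersection of the two term sets.
shared = {i for i, xi in enumerate(x) if xi} & {i for i, yi in enumerate(y) if yi}

print(inner, len(shared))  # both are 1 (only term 2 is shared)
```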
Retrieval examples • Given: • the number of terms t • the number of documents N • the N×t document/term weight matrix • a query • Use the inner product to produce a ranked list of retrieved documents
Example 1: Boolean weights (t = 5072)

Term    1    2    …    17    …    456    …    693    …    5072
Doc-1   0    1         0          1            0           0
Doc-2   1    1         1          0            1           1
…
Doc-N   0    1         0          1            1           0
Query   1    1         0          0            1           0
Retrieval example 1 • Sim(Q, Doc-1) = 1, • Sim(Q, Doc-2) = 3, and • … • Sim(Q, Doc-N) = 2 • The ranked list is Doc-2 (3), Doc-N (2), Doc-1 (1)
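As a sketch (my code, not the course's), Example 1 can be reproduced in a few lines of Python; only the six columns shown in the table are included, since every other weight is zero and contributes nothing:

```python
# Binary weights for the six displayed terms (1, 2, 17, 456, 693, 5072).
docs = {
    "Doc-1": [0, 1, 0, 1, 0, 0],
    "Doc-2": [1, 1, 1, 0, 1, 1],
    "Doc-N": [0, 1, 0, 1, 1, 0],
}
query = [1, 1, 0, 0, 1, 0]

def inner_product(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

# Rank documents by decreasing similarity to the query.
for name in sorted(docs, key=lambda d: inner_product(docs[d], query), reverse=True):
    print(name, inner_product(docs[name], query))
# Doc-2 3, Doc-N 2, Doc-1 1
```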
Example 2

Term    1      2      …    17     …    456    …    693    …    5072
Doc-1   0      0.3         0           0.5         0           0
Doc-2   0.2    0.6         0.3         0           0.8         0.3
…
Doc-N   0      0.2         0           0           0.6         0
Query   0.3    0.7         0           0           0.7         0
Retrieval example 2 • Using the inner product (only the nonzero query terms 1, 2, and 693 contribute): • For Doc-1: 0.3*0 + 0.7*0.3 + 0.7*0 = 0.21 • For Doc-2: 0.3*0.2 + 0.7*0.6 + 0.7*0.8 = 1.04 • For Doc-N: 0.3*0 + 0.7*0.2 + 0.7*0.6 = 0.56
Retrieval example 2 • The important query terms are 2 and 693 • Query term 1 is less important • Terms 2 and 693 also carry high weights in Doc-2 • Doc-2 is therefore retrieved with high similarity and rank
Retrieval example 2 • To calculate the similarity for the other documents we need their term weights
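The same inner-product sketch handles the real-valued weights of Example 2; again only the six displayed columns are included:

```python
docs = {
    "Doc-1": [0.0, 0.3, 0.0, 0.5, 0.0, 0.0],
    "Doc-2": [0.2, 0.6, 0.3, 0.0, 0.8, 0.3],
    "Doc-N": [0.0, 0.2, 0.0, 0.0, 0.6, 0.0],
}
query = [0.3, 0.7, 0.0, 0.0, 0.7, 0.0]

def inner_product(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

for name in sorted(docs, key=lambda d: inner_product(docs[d], query), reverse=True):
    print(name, round(inner_product(docs[name], query), 2))
# Doc-2 1.04, Doc-N 0.56, Doc-1 0.21
```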
Vector space model • Now assume terms are dependent • Each term i is represented by a vector Ti • Let dri be the weight of term i in document Dr, and • let qsi be the weight of term i in query Qs
Calculating the similarity • To compute the similarity we need the term correlations TiTj for all pairs of terms Ti and Tj • These correlations are not easy to compute
Example 3 • The document/term weights are given • The correlations between terms are also given • Using the inner product, produce a ranked list of documents
Example 3

D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3

Term correlations TiTj:

      T1    T2    T3
T1    1     .5    0
T2    .5    1     -.2
T3    0     -.2   1
Example 3 • sim(D1, Q) = (2T1 + 3T2 + 5T3) * (0T1 + 0T2 + 2T3) = 4T1T3 + 6T2T3 + 10T3T3 = 4*0 + 6*(-0.2) + 10*1 = 8.8
Example 3 • sim(D2, Q) = (3T1 + 7T2 + 1T3) * (0T1 + 0T2 + 2T3) = 6T1T3 + 14T2T3 + 2T3T3 = 6*0 + 14*(-0.2) + 2*1 = -0.8 • The ranked list is therefore D1 (8.8), D2 (-0.8)
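A minimal sketch (mine, assuming the similarity expands exactly as on the slides, i.e. sim(X, Q) = sum over all i, j of xi * qj * TiTj) that reproduces Example 3:

```python
# Term-correlation matrix C, with C[i][j] = TiTj as given on the slide.
C = [[1.0,  0.5,  0.0],
     [0.5,  1.0, -0.2],
     [0.0, -0.2,  1.0]]

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]

def sim(x, y):
    # Generalized inner product: every pair of terms contributes
    # its weight product scaled by the terms' correlation.
    return sum(x[i] * C[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

print(round(sim(D1, Q), 2))  # 8.8
print(round(sim(D2, Q), 2))  # -0.8
```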
Term correlations • Approximating term correlations: • we use the term/document weight matrix • each term Ti is represented as a linear combination of the N document vectors • the problem now shifts to the document correlations
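One way to read this (my sketch with toy numbers, under the assumption that row i of the weight matrix gives Ti's coefficients over the document vectors): if the document vectors were orthonormal, TiTj would reduce to an inner product of matrix rows; otherwise every document correlation DrDs is also needed, which is exactly the problem the slide points to.

```python
# Hypothetical term/document weight matrix W:
# W[i][r] = weight of term i in document r, so row i gives the
# coefficients of Ti as a linear combination of the document vectors.
W = [
    [0.0, 0.3, 0.5],   # T1 over documents D1, D2, D3 (toy numbers)
    [0.7, 0.0, 0.2],   # T2
    [0.1, 0.6, 0.0],   # T3
]

def term_correlation(i, j):
    # Holds only if the document vectors are assumed orthonormal;
    # otherwise the cross terms Dr . Ds would have to be added in.
    return sum(wi * wj for wi, wj in zip(W[i], W[j]))

print(round(term_correlation(0, 1), 2))  # T1T2 = 0.1 under this assumption
```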
Term vectors • Clearly, documents are not semantically independent either • Yet the vector space model achieves good retrieval results while assuming term independence
Term weights • Good similarity values depend on good term weights • Different retrieval models use different assumptions and different weight formulas
Term weights • In the vector space model the weight of a term depends on: • a measure of recurrence • a measure of term discrimination • a normalization factor
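These three factors are most commonly instantiated as term frequency (recurrence), inverse document frequency (discrimination), and document-length normalization. A minimal sketch assuming that standard tf·idf formulation, not a formula specific to this course:

```python
import math

def tf_idf_weights(term_counts, doc_freq, N):
    # term_counts: term -> occurrences in this document (recurrence)
    # doc_freq:    term -> number of documents containing the term
    # N:           total number of documents in the collection
    # tf * idf: terms frequent in the document but rare in the
    # collection receive high weights.
    w = {t: c * math.log(N / doc_freq[t]) for t, c in term_counts.items()}
    # Normalize to unit length so long documents are not unduly favored.
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}

# Toy usage with hypothetical counts and document frequencies:
print(tf_idf_weights({"vector": 3, "model": 1}, {"vector": 5, "model": 50}, N=100))
```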