This lecture introduces assignment 2, discusses dictionary-based and successive variety stemming, and explores the vector space model for modern information retrieval. It covers topics such as accepting natural language queries, ranking documents based on vocabulary overlap, and the use of term weights for similarity calculations.
CS533 Information Retrieval • Dr. Michal Cutler • Lecture #4 • February 8, 1999
This lecture • Introduce assignment 2 • Dictionary-based and successive variety stemming • “Modern” information retrieval • The vector space model
Modern information retrieval • Accepts natural language queries • Assumes that useful documents and queries have substantial vocabulary overlap • The degree of overlap enables ranking
Natural language queries • Queries are easier to formulate • A document paragraph can be selected and used as a query • Enables clustering documents by similarity • Enables automatic creation of hyperlinks
Ranking documents • The top-ranked documents should be the best ones • Enables restricting output to the top n documents • The very top relevant (and non-relevant) documents can be used for query expansion (relevance feedback)
Comparison to Boolean retrieval • In conventional Boolean retrieval: • the output size cannot be limited • the best documents may be spread anywhere in the output
The Vector Space Model • Queries and documents are represented by vectors • Assumes document terms and query terms are independent
The vector space model • Assume a collection vocabulary of t terms • This allows documents and queries to be represented as vectors of t dimensions
The document vector • The ith document, Di, is represented by Di = (di1, …, dit), where dik, for k = 1, …, t, is the weight of term k in the document
The vector space model • Similarly, the query Q is represented by Q = (q1, …, qt), where qk, for k = 1, …, t, is the weight of term k in the query
Term weights • Binary: • w = 1 if the term is present in the document • w = 0 if not • Real number w representing the “importance” of the term: • w > 0 if the term is present in the document • w = 0 if not
The vector space model • Computes the similarity between vectors X = (x1, x2, …, xt) and Y = (y1, y2, …, yt), where • xi is the weight of term i in the document and • yi is the weight of term i in the query
The vector space model • For binary weights: • let |X| = the number of 1s in the document and • |Y| = the number of 1s in the query • The inner product then counts the terms that the document and the query share, as the sketch below shows
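A quick Python sketch (mine, not from the original slides, using a hypothetical 6-term vocabulary) confirming that for binary weights the inner product equals the number of shared terms:

```python
# Binary term vectors over a toy vocabulary of 6 terms (hypothetical data).
x = [0, 1, 0, 1, 0, 0]   # document: terms 2 and 4 are present
y = [1, 1, 0, 0, 1, 0]   # query: terms 1, 2 and 5 are present

# Inner product of the binary vectors...
inner = sum(xi * yi for xi, yi in zip(x, y))

# ...equals the size of the intersection of the two term sets.
shared = {i for i, xi in enumerate(x) if xi} & {i for i, yi in enumerate(y) if yi}

print(inner, len(shared))  # both are 1 (only term 2 is shared)
```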
Retrieval examples • Given: • the number of terms t • the number of documents N • the N×t document/term weight matrix • a query • Use the inner product to produce a ranked list of retrieved documents
Example 1: Boolean weights (t = 5072)

Term    1    2    …    17    …    456    …    693    …    5072
Doc-1   0    1         0          1            0           0
Doc-2   1    1         1          0            1           1
…
Doc-N   0    1         0          1            1           0
Query   1    1         0          0            1           0
Retrieval example 1 • Sim(Q, Doc-1) = 1, • Sim(Q, Doc-2) = 3, and • … • Sim(Q, Doc-N) = 2 • The ranked list is Doc-2 (3), Doc-N (2), Doc-1 (1)
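As a sketch (my code, not the course's), Example 1 can be reproduced in a few lines of Python; only the six columns shown in the table are included, since every other weight is zero and contributes nothing:

```python
# Binary weights for the six displayed terms (1, 2, 17, 456, 693, 5072).
docs = {
    "Doc-1": [0, 1, 0, 1, 0, 0],
    "Doc-2": [1, 1, 1, 0, 1, 1],
    "Doc-N": [0, 1, 0, 1, 1, 0],
}
query = [1, 1, 0, 0, 1, 0]

def inner_product(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

# Rank documents by decreasing similarity to the query.
for name in sorted(docs, key=lambda d: inner_product(docs[d], query), reverse=True):
    print(name, inner_product(docs[name], query))
# Doc-2 3, Doc-N 2, Doc-1 1
```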
Example 2

Term    1      2      …    17     …    456    …    693    …    5072
Doc-1   0      0.3         0           0.5         0           0
Doc-2   0.2    0.6         0.3         0           0.8         0.3
…
Doc-N   0      0.2         0           0           0.6         0
Query   0.3    0.7         0           0           0.7         0
Retrieval example 2 • Using the inner product (only the nonzero query terms 1, 2, and 693 contribute): • For Doc-1: 0.3*0 + 0.7*0.3 + 0.7*0 = 0.21 • For Doc-2: 0.3*0.2 + 0.7*0.6 + 0.7*0.8 = 1.04 • For Doc-N: 0.3*0 + 0.7*0.2 + 0.7*0.6 = 0.56
Retrieval example 2 • The important query terms are 2 and 693 • Query term 1 is less important • Terms 2 and 693 also carry high weights in Doc-2 • Doc-2 is therefore retrieved with high similarity and rank
Retrieval example 2 • To calculate the similarity for the other documents we need their term weights
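The same inner-product sketch handles the real-valued weights of Example 2; again only the six displayed columns are included:

```python
docs = {
    "Doc-1": [0.0, 0.3, 0.0, 0.5, 0.0, 0.0],
    "Doc-2": [0.2, 0.6, 0.3, 0.0, 0.8, 0.3],
    "Doc-N": [0.0, 0.2, 0.0, 0.0, 0.6, 0.0],
}
query = [0.3, 0.7, 0.0, 0.0, 0.7, 0.0]

def inner_product(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

for name in sorted(docs, key=lambda d: inner_product(docs[d], query), reverse=True):
    print(name, round(inner_product(docs[name], query), 2))
# Doc-2 1.04, Doc-N 0.56, Doc-1 0.21
```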
Vector space model • Now assume terms are dependent • Each term i is represented by a vector Ti • Let dri be the weight of term i in document Dr, and • let qsi be the weight of term i in query Qs
Calculating the similarity • To compute the similarity we need the term correlations TiTj for all pairs of terms Ti and Tj • These correlations are not easy to compute
Example 3 • The document/term weights are given • The correlations between terms are also given • Using the inner product, produce a ranked list of documents
Example 3

D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3

Term correlations TiTj:

      T1    T2    T3
T1    1     .5    0
T2    .5    1     -.2
T3    0     -.2   1
Example 3 • sim(D1, Q) = (2T1 + 3T2 + 5T3) * (0T1 + 0T2 + 2T3) = 4T1T3 + 6T2T3 + 10T3T3 = 4*0 + 6*(-0.2) + 10*1 = 8.8
Example 3 • sim(D2, Q) = (3T1 + 7T2 + 1T3) * (0T1 + 0T2 + 2T3) = 6T1T3 + 14T2T3 + 2T3T3 = 6*0 + 14*(-0.2) + 2*1 = -0.8 • The ranked list is therefore D1 (8.8), D2 (-0.8)
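A minimal sketch (mine, assuming the similarity expands exactly as on the slides, i.e. sim(X, Q) = sum over all i, j of xi * qj * TiTj) that reproduces Example 3:

```python
# Term-correlation matrix C, with C[i][j] = TiTj as given on the slide.
C = [[1.0,  0.5,  0.0],
     [0.5,  1.0, -0.2],
     [0.0, -0.2,  1.0]]

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]

def sim(x, y):
    # Generalized inner product: every pair of terms contributes
    # its weight product scaled by the terms' correlation.
    return sum(x[i] * C[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

print(round(sim(D1, Q), 2))  # 8.8
print(round(sim(D2, Q), 2))  # -0.8
```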
Term correlations • Approximating term correlations: • we use the term/document weight matrix • each term Ti is represented as a linear combination of the N document vectors • the problem now shifts to the document correlations
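One way to read this (my sketch with toy numbers, under the assumption that row i of the weight matrix gives Ti's coefficients over the document vectors): if the document vectors were orthonormal, TiTj would reduce to an inner product of matrix rows; otherwise every document correlation DrDs is also needed, which is exactly the problem the slide points to.

```python
# Hypothetical term/document weight matrix W:
# W[i][r] = weight of term i in document r, so row i gives the
# coefficients of Ti as a linear combination of the document vectors.
W = [
    [0.0, 0.3, 0.5],   # T1 over documents D1, D2, D3 (toy numbers)
    [0.7, 0.0, 0.2],   # T2
    [0.1, 0.6, 0.0],   # T3
]

def term_correlation(i, j):
    # Holds only if the document vectors are assumed orthonormal;
    # otherwise the cross terms Dr . Ds would have to be added in.
    return sum(wi * wj for wi, wj in zip(W[i], W[j]))

print(round(term_correlation(0, 1), 2))  # T1T2 = 0.1 under this assumption
```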
Term vectors • Clearly, documents are not semantically independent either • Yet the vector space model achieves good retrieval results while assuming term independence
Term weights • Good similarity values depend on good term weights • Different retrieval models use different assumptions and different weight formulas
Term weights • In the vector space model the weight of a term depends on: • a measure of recurrence • a measure of term discrimination • a normalization factor
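These three factors are most commonly instantiated as term frequency (recurrence), inverse document frequency (discrimination), and document-length normalization. A minimal sketch assuming that standard tf·idf formulation, not a formula specific to this course:

```python
import math

def tf_idf_weights(term_counts, doc_freq, N):
    # term_counts: term -> occurrences in this document (recurrence)
    # doc_freq:    term -> number of documents containing the term
    # N:           total number of documents in the collection
    # tf * idf: terms frequent in the document but rare in the
    # collection receive high weights.
    w = {t: c * math.log(N / doc_freq[t]) for t, c in term_counts.items()}
    # Normalize to unit length so long documents are not unduly favored.
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}

# Toy usage with hypothetical counts and document frequencies:
print(tf_idf_weights({"vector": 3, "model": 1}, {"vector": 5, "model": 50}, N=100))
```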