CS 430: Information Discovery


Presentation Transcript


  1. CS 430: Information Discovery Lecture 18 Ranking 2

  2. Course Administration • Assignment 3 has been posted. • Guest lecture: Paul Ginsparg, Los Alamos National Laboratory Creating a Global Knowledge Network If we were to start from scratch today to design a quality-controlled archive and distribution system for research findings, would it be realized as a set of "electronic clones" of print journals? Tuesday, April 3, 2001 at 4:30 p.m., Kimball B11

  3. Assignment 3 The company is building an archive of news articles. Your task is to design the search service: a user will search your index and then retrieve an article from the archive.

  4. Midterm Examination -- Question 1 • 1(a) Define the terms inverted file, inverted list, posting. • Inverted file: a list of the words in a set of documents and the documents in which they appear. • Inverted list: All the entries in an inverted file that apply to a specific word. • Posting: Entry in an inverted list • from Lecture 3 • 1(b) When implementing an inverted file system, what are the criteria that you would use to judge whether the system is suitable for very large-scale information retrieval?

  5. Q1 (continued) Storage: Inverted files are big, typically 10% to 100% of the size of the collection of documents. Update performance: It must be possible, with a reasonable amount of computation, to (a) add a large batch of documents and (b) add a single document. Retrieval performance: Retrieval must be fast enough to satisfy users and not use excessive resources. from Lecture 3

  6. Q1 (continued) 1(c) You are designing an inverted file system to be used with Boolean queries on a very large collection of textual documents. New documents are being continually added to the collection.   (i) What file structure(s) would you use?   (ii) How well does your design satisfy the criteria listed in Part (b)?

  7. Q1 (continued) Separate the inverted index from the lists of postings. [Figure: an inverted index of terms (ant, bee, cat, dog, elk, fox, gnu, hog), each with a pointer to that term's list of postings in the postings file.] from Lecture 3
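The structure on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not the on-disk file layout the question asks about; the names and tokenization are illustrative assumptions.

```python
# Minimal sketch of an inverted index: each term maps to a sorted
# list of postings (here, just document identifiers).
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns term -> sorted doc_id list."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "the cat sat", 2: "the dog barked", 3: "cat and dog"}
index = build_inverted_index(docs)
print(index["cat"])   # [1, 3]
print(index["dog"])   # [2, 3]
```

In a real large-scale system the index would be a B-tree (or similar) on disk and the postings would live in a separate postings file, as the answer to 1(c) discusses.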

  8. Question 1 (continued) (a) The postings file may be stored sequentially or as linked lists. (b) The index file is best stored as a tree. Binary trees provide fast searching but have problems with updating. B-trees are better, and B+-trees are best. Note: Other answers are possible for this part of the question.

  9. Question 1 (continued) 1(c)(ii) How well does your design satisfy the criteria listed in Part (b)? • A sequential list for each term is efficient for storage and for processing Boolean queries. The disadvantage is a slow update time for long inverted lists. • B-trees combine fast retrieval with moderately efficient updating. • Bottom-up updating is usually fast, but may require recursive tree climbing to the root. • The main weakness is poor storage utilization; typically buckets are only about 69% full.

  10. Question 3 (a) Define the terms recall and precision. 3(b) Q is a query. D is a collection of 1,000,000 documents. When the query Q is run, a set of 200 documents is returned. (i) How in a practical experiment would you calculate the precision? Have an expert examine each of the 200 documents and decide whether it is relevant. Precision is number judged relevant divided by 200. (ii) How in a practical experiment would you calculate the recall? It is not feasible to examine 1,000,000 records. Therefore sampling must be used ...

  11. Question 3 (continued) (c) Suppose that, by some means, it is known that 100 of the documents in D are relevant to Q. Of the 200 documents returned by the search, 50 are relevant. (i) What is the precision? 50/200 = 0.25 (ii) What is the recall? 50/100 = 0.5 (d) Explain in general terms the method used by TREC to estimate the recall.
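The arithmetic in part (c) follows directly from the definitions in part (a) and can be checked mechanically:

```python
def precision(retrieved_relevant, retrieved_total):
    """Fraction of retrieved documents that are relevant."""
    return retrieved_relevant / retrieved_total

def recall(retrieved_relevant, relevant_total):
    """Fraction of all relevant documents that were retrieved."""
    return retrieved_relevant / relevant_total

# Slide's numbers: 200 retrieved, 50 relevant; 100 relevant in total
print(precision(50, 200))  # 0.25
print(recall(50, 100))     # 0.5
```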

  12. Question 3 (continued) For each query, a pool of potentially relevant documents is assembled, using the top 100 ranked documents from each participant. The human expert who set the query looks at every document in the pool and determines whether it is relevant. Documents outside the pool are not examined. In TREC-8: 7,100 documents in the pool; 1,736 unique documents (after eliminating duplicates); 94 judged relevant. from Lecture 11
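The pooling step described above can be sketched as follows. This is a hedged illustration of the merge-and-deduplicate idea only; the ranked runs and pool depth are made-up inputs, not TREC data.

```python
def build_pool(runs, depth=100):
    """runs: one ranked list of doc ids per participant.
    Returns the deduplicated pool of top-`depth` documents,
    which is what the human assessor then judges."""
    pool = set()
    for run in runs:
        pool.update(run[:depth])
    return pool

# Three hypothetical participants' rankings, pooled to depth 2
runs = [[1, 2, 3], [2, 3, 4], [5, 1]]
print(sorted(build_pool(runs, depth=2)))  # [1, 2, 3, 5]
```

Recall is then estimated as if the relevant documents found in the pool were all the relevant documents in the collection.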

  13. Review of Weighting The objective is to measure the similarity between a document and a query using statistical (not linguistic) methods. Concept is to weight terms by some factor based on the distribution of terms within and between documents. In general: (a) Weight is proportional to the number of times that the term appears in the document (b) Weight is inversely proportional to the number of documents that contain the term (or the total number of occurrences of the term)

  14. Measures of Within Document Frequency (a) Simplest is to use fik (b) Croft's normalization: cfik = K + (1 - K) fik / maxfi (for fik > 0) fik is the frequency of term k in document i cfik is Croft's normalized frequency maxfi is the maximum frequency of any term in document i K is a constant between 0 and 1 that is adjusted for the collection
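Croft's normalization is a one-line formula; the sketch below shows how K pulls all nonzero frequencies toward a floor of K (parameter names follow the slide):

```python
def croft_tf(f_ik, max_f_i, K=0.5):
    """Croft's normalized within-document frequency:
    cf_ik = K + (1 - K) * f_ik / maxf_i, defined for f_ik > 0."""
    if f_ik <= 0:
        return 0.0
    return K + (1 - K) * f_ik / max_f_i

# Term occurs 3 times; the most frequent term in the document occurs 10 times
print(croft_tf(3, 10, K=0.5))   # 0.65
print(croft_tf(10, 10, K=0.5))  # 1.0 (the most frequent term)
```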

  15. Measures of Within Document Frequency (c) Salton and Buckley recommend using different weightings for documents and queries. Documents: fik for terms in collections of long documents; 1 for terms in collections of short documents. Queries: cfik with K = 0.5 for general use; fik for long queries (i.e., cfik with K = 0)

  16. Inverse Document Frequency (IDF) (a) Simplest to use is 1 / dk (Salton), where dk is the number of documents that contain term k. (b) Normalized forms (Sparck Jones): IDFi = log2 (N / ni) + 1 or IDFi = log2 (maxn / ni) + 1, where N is the number of documents in the collection, ni is the total number of occurrences of term i in the collection, and maxn is the maximum frequency of any term in the collection.
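The first normalized form on this slide can be computed directly; the numbers below are illustrative, chosen to give an exact power of two:

```python
import math

def idf(N, n_i):
    """IDF_i = log2(N / n_i) + 1, the first normalized form above."""
    return math.log2(N / n_i) + 1

# Collection of 1024 documents; term i occurs 8 times in total
print(idf(1024, 8))  # 8.0  (log2(128) + 1)
```

Note how rare terms (small n_i) get large weights, which is exactly the "inversely proportional" behavior described on slide 13.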

  17. Probabilistic Models The section in the book on probabilistic models is rather unsatisfactory because it relies on a mathematical foundation that has been left out. What are the basic ideas?

  18. Probabilistic Ranking Basic concept: For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes' Theorem, it is possible to derive weights theoretically. See Van Rijsbergen's article in the course readings.

  19. Probabilistic Weighting w = log [ (r / (R - r)) / ((n - r) / (N - R)) ] where: N is the number of documents in the collection; R is the number of relevant documents for query q; n is the number of documents with term t; r is the number of relevant documents with term t. The numerator is the ratio of relevant documents with term t to relevant documents without term t; the denominator is the ratio of non-relevant documents with term t to non-relevant documents in the collection.
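A small numerical sketch of this weight, assuming the formula as laid out on the slide (the counts below are made up for illustration):

```python
import math

def prob_weight(N, R, n, r):
    """w = log[(r / (R - r)) / ((n - r) / (N - R))]
    N: documents in collection, R: relevant documents for the query,
    n: documents containing term t, r: relevant documents containing t."""
    return math.log((r / (R - r)) / ((n - r) / (N - R)))

# Hypothetical counts: 1000 docs, 100 relevant; term t appears in 50 docs,
# 40 of which are relevant -- t is strongly associated with relevance
w = prob_weight(N=1000, R=100, n=50, r=40)
print(round(w, 3))  # log((40/60) / (10/900)) = log(60)
```

A term concentrated in the relevant documents gets a large positive weight; a term distributed like the rest of the collection gets a weight near zero.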

  20. Ranking -- Practical Experience 1. The basic method is the inner (dot) product with no weighting. 2. Cosine (dividing by the product of vector lengths) normalizes for vectors of different lengths. 3. Term weighting using the frequency of terms in a document usually improves ranking. 4. Term weighting using an inverse function of terms in the entire collection (e.g., IDF) improves ranking. 5. Weightings for document structure improve ranking. 6. Relevance weightings after initial retrieval improve ranking. Effectiveness of these methods depends on the characteristics of the collection. In general, there are few improvements beyond simple weighting schemes.
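Points 1 and 2 above can be sketched together. This assumes documents and queries are represented as sparse term-weight vectors (dicts); the vectors below are illustrative:

```python
import math

def dot(u, v):
    """Inner product of two sparse term-weight vectors (point 1)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def cosine(u, v):
    """Dot product divided by the product of vector lengths (point 2),
    so long and short documents are compared fairly."""
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

query = {"cat": 1.0, "dog": 1.0}
doc = {"cat": 2.0, "dog": 2.0}          # same direction, longer vector
print(dot(query, doc))                   # 4.0
print(cosine(query, doc))                # 1.0 -- length is normalized away
```

Points 3 and 4 would replace the raw weights here with tf and IDF factors such as those on slides 14 and 16.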
