CS 430: Information Discovery

Lecture 17: Probabilistic Information Retrieval

Course Administration
Midterm Examination: Kimball B11, 7:30 to 9:00 pm on Wednesday, October 31.
Assignment 3: Revised version now online:
• Clarifies requirements, e.g., precedence of operators, stemming with wild cards, etc.
• Detailed submission requirements, so that we can better grade and comment on your work.

Three Approaches to Information Retrieval
Many authors divide the methods of information retrieval into three categories:
• Boolean (based on set theory)
• Vector space (based on linear algebra)
• Probabilistic (based on Bayesian statistics)
In practice, the latter two have considerable overlap.

Probability Ranking Principle
"If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."
W.S. Cooper

Probabilistic Ranking
Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes' Theorem, it is possible to derive weights theoretically."
Van Rijsbergen

Concept
R is a set of documents that are guessed to be relevant, and $\bar{R}$ is the complement of R.
1. Guess a preliminary probabilistic description of R and use it to retrieve a first set of documents.
2. Interact with the user to refine the description.
3. Repeat, thus generating a succession of approximations to R.

Binary Independence Retrieval Model (BIR)
Suppose that the weights for term i in document $d_j$ and query q are $w_{i,j}$ and $w_{i,q}$, where all weights are 0 or 1. Let $P(k_i \mid R)$ be the probability that index term $k_i$ is present in a document randomly selected from the set R, and $P(k_i \mid \bar{R})$ the same probability for the complement $\bar{R}$.
If the index terms are independent, after some mathematical manipulation, taking logs and ignoring factors that are constant for all documents:
$$\mathrm{similarity}(d_j, q) = \sum_i w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$
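The BIR similarity sum can be sketched in a few lines. This is a minimal illustration, assuming documents and queries are represented as Python sets of terms (so both binary weights are 1 exactly when a term is in both sets) and that the two probability tables are supplied as dictionaries. The names `bir_similarity`, `p_rel`, and `p_nonrel` are hypothetical, not from the lecture.

```python
import math

def bir_similarity(doc_terms, query_terms, p_rel, p_nonrel):
    """Sum the BIR term weights over terms shared by query and document.

    p_rel[term]    is an estimate of P(k_i | R)
    p_nonrel[term] is an estimate of P(k_i | complement of R)
    Both are assumed to lie strictly between 0 and 1.
    """
    score = 0.0
    # w_iq * w_ij is 1 only for terms present in both the query and the document
    for term in query_terms & doc_terms:
        p = p_rel[term]
        q = p_nonrel[term]
        score += math.log(p / (1 - p)) + math.log((1 - q) / q)
    return score

# Illustrative values only: one shared term "a" with P = 0.5 and P-bar = 0.1
score = bir_similarity({"a", "b"}, {"a", "c"},
                       {"a": 0.5, "c": 0.5},
                       {"a": 0.1, "c": 0.2})
```

Only the shared term contributes, so the score here is log(0.5/0.5) + log(0.9/0.1) = log 9.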

Estimates of P(ki | R)
Initial guess, with no information to work from:
$$P(k_i \mid R) = c \qquad P(k_i \mid \bar{R}) = n_i / N$$
where:
c is an arbitrary constant, e.g., 0.5
$n_i$ is the number of documents that contain $k_i$
N is the total number of documents in the collection
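A short worked substitution, assuming c = 0.5, shows what this initial guess implies for the weight of a single term in the BIR similarity:

```latex
\begin{align*}
\log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log\frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})}
  &= \log\frac{0.5}{0.5} + \log\frac{1 - n_i/N}{n_i/N} \\
  &= \log\frac{N - n_i}{n_i}
\end{align*}
```

Under this assumption the first ranking depends only on how rare each query term is in the collection, which resembles an inverse document frequency weight.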

Improving the Estimates of P(ki | R)
Human feedback -- relevance feedback
Automatically:
(a) Run query q using initial values. Consider the t top-ranked documents. Let r be the number of these documents that contain the term $k_i$.
(b) The new estimates are:
$$P(k_i \mid R) = r / t \qquad P(k_i \mid \bar{R}) = (n_i - r) / (N - t)$$
Note: The ratio of these two terms, with minor changes of notation and taking logs, gives w2 on page 368 of Frakes.
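Step (b) can be sketched as follows, assuming each of the t top-ranked documents is available as a set of terms. The function name `feedback_estimates` and its parameters are illustrative, not from the lecture.

```python
def feedback_estimates(top_docs, term, n_i, N):
    """Re-estimate the two probabilities for one term from the top-ranked documents.

    top_docs: list of term sets, the t top-ranked documents (treated as relevant)
    n_i:      number of documents in the collection that contain the term
    N:        total number of documents in the collection
    """
    t = len(top_docs)
    r = sum(1 for doc in top_docs if term in doc)  # top documents containing the term
    p_rel = r / t                    # estimate of P(k_i | R)
    p_nonrel = (n_i - r) / (N - t)   # estimate of P(k_i | complement of R)
    return p_rel, p_nonrel

# Illustrative numbers: 4 top documents, 2 of which contain "a"
p_rel, p_nonrel = feedback_estimates(
    [{"a", "b"}, {"a"}, {"c"}, {"b"}], "a", n_i=10, N=104)
```

The raw estimates can be exactly 0 or 1 (for example when r = 0 or r = t), in which case the log weights are undefined. This sketch leaves that case to the caller.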

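Continuation
Substituting $P(k_i \mid R) = r/t$ and $P(k_i \mid \bar{R}) = (n_i - r)/(N - t)$:
$$\begin{aligned}
\mathrm{similarity}(d_j, q) &= \sum_i w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right) \\
&= \sum_i w_{i,q} \times w_{i,j} \times \left( \log \frac{r}{t - r} + \log \frac{N + r - t - n_i}{n_i - r} \right) \\
&= \sum_i w_{i,q} \times w_{i,j} \times \log \frac{r / (t - r)}{(n_i - r) / (N + r - t - n_i)}
\end{aligned}$$
Note: With a minor change of notation, this is w4 on page 368 of Frakes.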
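As a quick numerical check of this substitution, the sketch below computes the per-term weight two ways: from the estimates r/t and (n_i - r)/(N - t), and from the single-log form that follows algebraically from them. The function names are hypothetical, and the inputs are assumed to satisfy 0 < r < t and 0 < n_i - r < N - t so that every logarithm is defined.

```python
import math

def two_log_weight(r, t, n_i, N):
    """Term weight as the sum of two logs, using the feedback estimates."""
    p = r / t                   # P(k_i | R)
    q = (n_i - r) / (N - t)     # P(k_i | complement of R)
    return math.log(p / (1 - p)) + math.log((1 - q) / q)

def closed_form_weight(r, t, n_i, N):
    """The same weight after substituting the estimates and combining the logs."""
    return math.log((r / (t - r)) / ((n_i - r) / (N + r - t - n_i)))

# Illustrative numbers only
a = two_log_weight(r=2, t=4, n_i=10, N=104)
b = closed_form_weight(r=2, t=4, n_i=10, N=104)
```

Both forms give log 11.5 for these inputs, since (2/2) / (8/92) = 11.5.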

Probabilistic Weighting
$$w = \log \frac{r / (R - r)}{(n - r) / (N - R)}$$
N number of documents in collection
R number of relevant documents for query q
n number of documents with term t
r number of relevant documents with term t
The four quantities in the formula are:
r number of relevant documents with term t
R - r number of relevant documents without term t
n - r number of non-relevant documents with term t
N - R number of non-relevant documents in collection
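The weight on this slide translates directly into code. This is a minimal sketch using the slide's symbols as parameter names. The function name `relevance_weight` is hypothetical, and the counts are assumed to make all four quantities positive.

```python
import math

def relevance_weight(r, R, n, N):
    """w = log( (r / (R - r)) / ((n - r) / (N - R)) ), as defined on the slide.

    r: relevant documents with the term
    R: relevant documents for the query
    n: documents with the term
    N: documents in the collection
    """
    return math.log((r / (R - r)) / ((n - r) / (N - R)))

# Illustrative numbers only: (2/2) / (8/100) = 12.5
w = relevance_weight(r=2, R=4, n=10, N=104)
```

A term that is concentrated in the relevant documents gets a large positive weight, while a term that is relatively more common among non-relevant documents gets a negative one.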

Discussion of Probabilistic Model
Advantages
• Based on a firm theoretical basis
Disadvantages
• Initial definition of R has to be guessed.
• Weights ignore term frequency.
• Assumes independent index terms (as does the vector model).

Review of Weighting
The objective is to measure the similarity between a document and a query using statistical (not linguistic) methods.
The concept is to weight terms by some factor based on the distribution of terms within and between documents.
In general:
(a) Weight is an increasing function of the number of times that the term appears in the document.
(b) Weight is a decreasing function of the number of documents that contain the term (or the total number of occurrences of the term).
(c) Weight needs to be adjusted for documents that differ greatly in length.

Normalization of Within Document Frequency (Term Frequency)
Normalization to moderate the effect of high-frequency terms
Croft's normalization:
$$cf_{ij} = K + (1 - K) \frac{f_{ij}}{m_i} \qquad (f_{ij} > 0)$$
$f_{ij}$ is the frequency of term j in document i
$cf_{ij}$ is Croft's normalized frequency
$m_i$ is the maximum frequency of any term in document i
K is a constant between 0 and 1 that is adjusted for the collection
K should be set to low values (e.g., 0.3) for collections with long documents (35 or more terms).
K should be set to higher values (greater than 0.5) for collections with short documents.

Normalization of Within Document Frequency (Term Frequency): Examples
Croft's normalization: $cf_{ij} = K + (1 - K) f_{ij} / m_i$ $(f_{ij} > 0)$

document length   K     mi   weight (most frequent term)   weight (least frequent term)
20                0.3   5    1.00                          0.44
20                0.3   2    1.00                          0.65
100               0.5   25   1.00                          0.52
100               0.5   2    1.00                          0.75
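The example weights can be reproduced with a short sketch of Croft's normalization. The function name `croft_tf` is hypothetical, and the "least frequent term" values assume that term occurs exactly once in the document (f = 1), which is consistent with the figures shown.

```python
def croft_tf(f_ij, m_i, K):
    """Croft's normalized within-document frequency: K + (1 - K) * f_ij / m_i.

    f_ij: frequency of the term in the document (must be > 0)
    m_i:  maximum frequency of any term in the document
    K:    collection-dependent constant between 0 and 1
    """
    if f_ij <= 0:
        raise ValueError("the normalization is defined only for f_ij > 0")
    return K + (1 - K) * f_ij / m_i

# (K, m_i) pairs from the examples; least frequent term assumed to have f = 1
examples = [(0.3, 5), (0.3, 2), (0.5, 25), (0.5, 2)]
most = [croft_tf(m, m, K) for K, m in examples]    # most frequent term: f = m_i
least = [croft_tf(1, m, K) for K, m in examples]   # least frequent term: f = 1
```

The most frequent term always receives weight 1, and K sets the floor that every other term present in the document receives.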

Measures of Within Document Frequency
(c) Salton and Buckley recommend using different weightings for documents and queries:
documents
• $f_{ik}$ for terms in collections of long documents
• 1 for terms in collections of short documents
queries
• $cf_{ik}$ with K = 0.5 for general use
• $f_{ik}$ for long queries ($cf_{ik}$ with K = 0)

Ranking -- Practical Experience
1. Basic method is inner (dot) product with no weighting
2. Cosine (dividing by product of lengths) normalizes for vectors of different lengths
3. Term weighting using frequency of terms in document usually improves ranking
4. Term weighting using an inverse function of terms in the entire collection improves ranking (e.g., IDF)
5. Weightings for document structure improve ranking
6. Relevance weightings after initial retrieval improve ranking
Effectiveness of methods depends on characteristics of the collection. In general, there are few improvements beyond simple weighting schemes.
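Items 1 and 2 can be sketched as follows, assuming documents and queries are stored as sparse dictionaries mapping terms to weights. The representation and the function names are illustrative choices, not from the lecture.

```python
import math

def dot(u, v):
    """Inner (dot) product of two sparse term-weight vectors."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

# Illustrative vectors: `long_doc` is `short_doc` with every weight doubled
short_doc = {"a": 1.0, "b": 1.0}
long_doc = {"a": 2.0, "b": 2.0}
query = {"a": 1.0}
```

With these vectors the dot product scores `long_doc` twice as high as `short_doc` for the same query, while the cosine scores are equal, which is the length normalization described in item 2.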

Inverse Document Frequency (IDF)
(a) Simplest to use is $1 / d_k$ (Salton)
$d_k$ number of documents that contain term k
(b) Normalized forms (Sparck Jones):
$$IDF_i = \log_2 (N / n_i) + 1 \qquad \text{or} \qquad IDF_i = \log_2 (maxn / n_i) + 1$$
N number of documents in the collection
$n_i$ total number of occurrences of term i in the collection
maxn maximum frequency of any term in the collection
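The two normalized forms in (b) can be sketched directly, using the slide's definitions of N, n_i, and maxn as parameters. The function names `idf` and `idf_maxn` are hypothetical.

```python
import math

def idf(N, n_i):
    """IDF_i = log2(N / n_i) + 1, with N and n_i as defined on the slide."""
    return math.log2(N / n_i) + 1

def idf_maxn(maxn, n_i):
    """IDF_i = log2(maxn / n_i) + 1, normalizing by the most frequent term."""
    return math.log2(maxn / n_i) + 1

# Illustrative counts only
rare = idf(N=1024, n_i=1)      # log2(1024) + 1
common = idf(N=8, n_i=8)       # log2(1) + 1
scaled = idf_maxn(maxn=16, n_i=4)
```

In the second form the most frequent term (n_i = maxn) receives a weight of exactly 1, and rarer terms receive larger weights.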
