
Smoothing Methods for LM in IR


Presentation Transcript


  1. Smoothing Methods for LM in IR Alejandro Figueroa

  2. Outline • The linguistic phenomena behind the retrieval of documents. • Language Modeling Approach. • Smoothing methods. • Overview. • Methods. • Parameter setting. • Interpolation vs. Back-off. • Comparison of methods. • Combination of methods. • Personal outlook and conclusions.

  3. The Linguistic Phenomena behind IR • "Reducing Information Variation in Texts" (Agata Savary and Christian Jacquemin). • Related to work in our QA group at DFKI.

  4. Information Variation • The problem: simple keyword matching is not enough to retrieve the best documents for a query, because the same information can be expressed in several ways. For the query "When was Albert Einstein born?", all of the following snippets answer it: • "The nobel prize of physics Albert Einstein was born in 1879 in Ulm, Germany." • "Born: 14 March 1879 in Ulm, Württemberg, Germany." • "Physics nobel prize Albert Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879." • "Died 18 Apr 1955 (born 14 Mar 1879) German-American physicist."

  5. Information Variation • Kinds of variation: • Graphic: "14 March 1879" vs. "14 Mar 1879". • Morphological: "Physics nobel prize". • Syntactic: "German-American physicist". • Semantic: "Albert Einstein was born at Ulm" vs. "German-American physicist". • Appropriateness: • Precision. • Economy.

  6. Language Modeling Approach • "A Study of Smoothing Methods for Language Models Applied to Information Retrieval" (Chengxiang Zhai and John Lafferty).

  7. Language Modeling • Rank documents by the probability p(q|d) that query q was generated by a probabilistic model estimated from document d. • Under the unigram model: p(q|d) = ∏_{i=1..n} p(q_i|d).
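
The following minimal sketch (ours, not from the slides) scores a query against a document under this unigram model with the unsmoothed MLE; it also exposes the zero-probability problem that motivates smoothing.

```python
import math
from collections import Counter

def log_query_likelihood(query_tokens, doc_tokens):
    """log p(q|d) under the unigram model with the unsmoothed MLE
    p_ml(w|d) = c(w;d)/|d|. Returns -inf as soon as a query term is
    unseen in the document: the zero-probability problem."""
    counts = Counter(doc_tokens)
    score = 0.0
    for w in query_tokens:
        if counts[w] == 0:
            return float("-inf")
        score += math.log(counts[w] / len(doc_tokens))
    return score

doc = "albert einstein was born in ulm in 1879".split()
print(log_query_likelihood("einstein born".split(), doc))  # finite score
print(log_query_likelihood("einstein died".split(), doc))  # -inf: unseen term
```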

  8. Language Modeling • Smoothing methods make use of two probabilities in the model: p_s(w|d) for words seen in the document and p_u(w|d) for unseen words.

  9. Language Modeling • With p_u(w|d) = α_d·p(w|C), the log-likelihood decomposes into a sum carried out over the matched terms plus a length-related term n·log α_d. • Longer documents get less smoothing (a smaller α_d), and a smaller α_d means a greater penalty.
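
A sketch (ours) of that rank-equivalent decomposition. It assumes the general smoothed form p(w|d) = p_s(w|d) for seen words and α_d·p(w|C) for unseen ones; p_s, alpha_d, and p_coll are caller-supplied so that any of the methods below can be plugged in.

```python
import math
from collections import Counter

def retrieval_score(query_tokens, doc_tokens, p_s, alpha_d, p_coll):
    """Rank-equivalent query-likelihood score:
        sum over matched terms of c(w;q) * log(p_s(w|d) / (alpha_d * p(w|C)))
        + n * log(alpha_d),   with n = |q|.
    The document-independent part, sum of c(w;q)*log p(w|C), is dropped
    because it does not affect the ranking."""
    d_counts = Counter(doc_tokens)
    a = alpha_d(doc_tokens)
    score = len(query_tokens) * math.log(a)   # length-related penalty term
    for w, c_q in Counter(query_tokens).items():
        if d_counts[w] > 0:                   # only matched terms contribute
            score += c_q * math.log(p_s(w, doc_tokens) / (a * p_coll(w)))
    return score
```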

  10. Smoothing Methods

  11. Overview • The problem: adjust the MLE to compensate for data sparseness. • The role of smoothing: • Make the estimated LM more accurate. • Explain the non-informative words in the query. • Goals of the work: • How sensitive is retrieval performance to the smoothing of a document LM? • How should the smoothing method and its parameters be chosen?

  12. Overview • The unsmoothed model is the MLE: p_ml(w|d) = c(w;d) / Σ_{w'} c(w';d), i.e. the count of w in d divided by the total number of tokens in d.

  13. Overview • Smoothing: tackles the effect of statistical variability in small training sets. • Discounting: the relative frequencies of seen events are discounted; the gained probability mass is then distributed over the unseen words.

  14. Smoothing Methods • Based on the Good-Turing idea: estimate the total probability of unseen events by taking the count of singleton events and dividing it by the total number of events, giving a value in (0,1).

  15. Good-Turing Idea • N_tf = number of terms with frequency tf in a document; E(N_tf) = the expected value of N_tf; N_d = total number of terms occurring in d. • The probability of a term with frequency tf is given by: p_GT(tf) = ((tf+1)/N_d) · (E(N_{tf+1}) / E(N_tf)).
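
A sketch (ours) of this estimate, using the raw frequency-of-frequency counts N_tf in place of the expected values E(N_tf); real systems smooth these counts first, e.g. with the curve fitting mentioned later in the deck.

```python
from collections import Counter

def good_turing_probs(doc_tokens):
    """p_GT(tf) = (tf+1)/N_d * N_{tf+1}/N_{tf}, with raw class sizes N_tf
    standing in for the expected values E(N_tf). Also returns the total
    mass reserved for unseen terms, N_1/N_d (the singleton fraction)."""
    counts = Counter(doc_tokens)
    n_d = sum(counts.values())               # N_d: total tokens in d
    n_tf = Counter(counts.values())          # N_tf: #terms with frequency tf
    probs = {tf: (tf + 1) / n_d * n_tf.get(tf + 1, 0) / k
             for tf, k in n_tf.items()}
    p_unseen = n_tf.get(1, 0) / n_d
    return probs, p_unseen

doc = "to be or not to be that is the question".split()
print(good_turing_probs(doc))  # note tf=2 gets probability 0: N_3 is empty,
                               # which is why the raw counts need smoothing
```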

  16. Smoothing Methods • Jelinek-Mercer method: a linear interpolation of the ML model with the collection model: p_λ(w|d) = (1−λ)·p_ml(w|d) + λ·p(w|C).
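
A sketch of the interpolation; coll_prob stands in for a precomputed collection probability p(w|C), and the default λ = 0.1 echoes the title-query optimum reported later in the deck.

```python
from collections import Counter

def jelinek_mercer(word, doc_tokens, coll_prob, lam=0.1):
    """p_lambda(w|d) = (1 - lambda)*p_ml(w|d) + lambda*p(w|C)."""
    p_ml = Counter(doc_tokens)[word] / len(doc_tokens)
    return (1 - lam) * p_ml + lam * coll_prob

doc = "albert einstein was born in ulm".split()
print(jelinek_mercer("einstein", doc, coll_prob=0.001))  # seen: mostly ML mass
print(jelinek_mercer("died", doc, coll_prob=0.002))      # unseen: lam * p(w|C)
```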

  17. Smoothing Methods • Absolute discounting: decrease the probability of seen words by subtracting a constant δ from their counts: p_δ(w|d) = max(c(w;d)−δ, 0)/|d| + (δ·|d|_u/|d|)·p(w|C), where |d|_u is the number of distinct terms in d.
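
A sketch under the same conventions; |d|_u is computed as the number of distinct terms, and δ = 0.7 follows the optimum quoted later in the deck.

```python
from collections import Counter

def absolute_discounting(word, doc_tokens, coll_prob, delta=0.7):
    """p_delta(w|d) = max(c(w;d) - delta, 0)/|d| + (delta*|d|_u/|d|)*p(w|C).
    The mass freed by subtracting delta from every seen count,
    delta*|d|_u/|d|, is redistributed according to the collection model."""
    counts = Counter(doc_tokens)
    d_len = len(doc_tokens)
    sigma = delta * len(counts) / d_len      # this is alpha_d for this method
    return max(counts[word] - delta, 0) / d_len + sigma * coll_prob
```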

  18. Smoothing Methods • Bayesian smoothing using Dirichlet priors: the unigram LM is a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution: p_μ(w|d) = (c(w;d) + μ·p(w|C)) / (Σ_{w'} c(w';d) + μ). • The idea is to augment the observed counts with μ pseudo-counts distributed according to the collection model.
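
A sketch of the resulting estimate; μ = 2000 is a commonly used default and our choice here, not a value from the slides.

```python
from collections import Counter

def dirichlet_prior(word, doc_tokens, coll_prob, mu=2000.0):
    """p_mu(w|d) = (c(w;d) + mu*p(w|C)) / (|d| + mu): the observed counts
    are augmented with mu pseudo-counts spread according to p(w|C)."""
    counts = Counter(doc_tokens)
    return (counts[word] + mu * coll_prob) / (len(doc_tokens) + mu)
```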

  19. Summary: Smoothing Methods • Jelinek-Mercer: p_s(w|d) = (1−λ)·p_ml(w|d) + λ·p(w|C), α_d = λ, parameter λ. • Dirichlet priors: p_s(w|d) = (c(w;d) + μ·p(w|C))/(|d| + μ), α_d = μ/(|d| + μ), parameter μ. • Absolute discounting: p_s(w|d) = max(c(w;d)−δ, 0)/|d| + (δ·|d|_u/|d|)·p(w|C), α_d = δ·|d|_u/|d|, parameter δ.

  20. Parameter Setting • 5 databases from TREC: • Financial Times on disk 4. • FBIS on disk 5. • Los Angeles Times on disk 5. • Disks 4 and 5 minus the Congressional Record. • The TREC-8 web data. • Queries: • Topics 351-400 (TREC-7 ad hoc task). • Topics 401-450 (TREC-8 ad hoc and web tasks).

  21. Parameter Setting: Example TREC-7 Topic
  <num> Number: 384
  <title> space station moon
  <desc> Description: Identify documents that discuss the building of a space station with the intent of colonizing the moon.
  <narr> Narrative: A relevant document will discuss the purpose of a space station, initiatives towards colonizing the moon, impediments which thus far have thwarted such a project, plans currently underway or in the planning stages for such a venture; cost, countries prepared to make a commitment of men, resources, facilities and money to accomplish such a feat.
  </top>

  22. Parameter Setting: Example TREC-8 Topic
  <num> Number: 414
  <title> Cuba, sugar, exports
  <desc> Description: How much sugar does Cuba export and which countries import it?
  <narr> Narrative: A relevant document will provide information regarding Cuba's sugar trade. Sugar production statistics are not relevant unless exports are mentioned explicitly.
  </top>

  23. Parameter Setting • Interaction between query length and query type: • Two different versions of each set of queries: • Title only (2 or 3 words). • A long version (title + description + narrative). • The performance of each method is optimized by means of the non-interpolated average precision.

  24. Parameter Setting • Jelinek-Mercer smoothing: • Weight of a matched term: log(1 + ((1−λ)·p_ml(w|d)) / (λ·p(w|C))). • As λ→1, smoothing increases, i.e. the collection model dominates.

  25. Parameter Setting • Dirichlet priors: • Weight of a matched term: log(1 + c(w;d)/(μ·p(w|C))). • α_d = μ/(|d| + μ) is a document-dependent length-normalization factor that penalizes long documents.

  26. Parameter Setting • Absolute discounting: • α_d = δ·|d|_u/|d| is document-dependent: larger for a document with a flatter distribution of words. • Weight of a matched term: log(1 + max(c(w;d)−δ, 0)/(δ·|d|_u·p(w|C))).

  27. Parameter Setting • Conclusions for Jelinek-Mercer: • Precision is much more sensitive to λ for long queries than for title queries. • Long queries need more smoothing, that is, less emphasis on the relative weighting of terms. • On the web collection, performance was sensitive to smoothing for title queries too. • For title queries, retrieval performance tends to be optimized around λ = 0.1.

  28. Parameter Setting • Conclusions for Dirichlet priors: • Precision is more sensitive to μ for long queries than for title queries, especially when μ is small. • When μ is large, all long queries performed better than short queries; the opposite holds when μ is small. • The optimal value of μ tends to be larger for long queries than for title queries. • The optimal μ tends to vary from collection to collection.

  29. Parameter Setting • Conclusions for absolute discounting: • Precision is more sensitive to δ for long queries than for title queries. • The optimal value δ ≈ 0.7 does not seem to differ much between title queries and long queries. • Smoothing plays a more important role for long, verbose queries than for concise queries.

  30. Interpolation vs. Back-off

  31. Interpolation vs. Back-off • Interpolation-based methods: the counts of the seen words are discounted, and the extra probability mass is shared by both the seen words and the unseen words. • Back-off: trust the MLE for the high-count words; discount and redistribute mass only for the less common terms.

  32. Interpolation vs. Back-off • Interpolation, general form: p(w|d) = p_s(w|d) if w is seen in d, and α_d·p(w|C) otherwise, where p_s(w|d) itself already mixes in the collection model.

  33. Interpolation vs. Back-off • Back-off, general form: seen words keep their discounted estimate with no collection component; the reserved mass α_d·p(w|C) (suitably normalized over the unseen vocabulary) is given only to the unseen words.
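
A sketch (ours) contrasting the two strategies on the same counts; to keep the contrast visible, the back-off branch below is left unnormalized over the unseen vocabulary.

```python
from collections import Counter

def p_interpolated(word, counts, d_len, coll_prob, lam=0.5):
    """Interpolation: the collection model contributes to every word,
    whether or not it was seen in the document."""
    return (1 - lam) * counts[word] / d_len + lam * coll_prob

def p_backoff(word, counts, d_len, coll_prob, delta=0.7):
    """Back-off: seen words keep only their discounted ML estimate;
    the reserved mass goes exclusively to unseen words (unnormalized)."""
    if counts[word] > 0:
        return (counts[word] - delta) / d_len
    reserved = delta * len(counts) / d_len   # mass freed by discounting
    return reserved * coll_prob

doc = "albert einstein was born in ulm".split()
counts, d_len = Counter(doc), len(doc)
print(p_interpolated("einstein", counts, d_len, coll_prob=0.001))
print(p_backoff("einstein", counts, d_len, coll_prob=0.001))
```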

  34. Interpolation vs. Back-off • Results: • The performance of the back-off strategy is more sensitive to the smoothing parameters, especially for Jelinek-Mercer and Dirichlet priors. • This sensitivity is smaller for the absolute discounting method, due to the lower upper bound.

  35. Comparison of methods

  36. Comparison of methods • For title queries: • Dirichlet prior is better than absolute discounting, which is better than Jelinek-Mercer. • Dirichlet prior performed extremely well on the web collection and was insensitive to the value of μ there. • Many of its non-optimal runs were better than the other two methods.

  37. Comparison of methods • For long queries: • Jelinek-Mercer is better than Dirichlet, which is better than absolute discounting. • Jelinek-Mercer is much more effective for long, verbose queries. • All three methods perform better on long queries than on short queries.

  38. Comparison of methods • General remark: • The strong correlation between the effect of smoothing and the type of the query is unexpected: smoothing was only supposed to improve the accuracy of the unigram language model estimated from a document. • So what is the effect of verbose queries?

  39. Query Length/Verbosity • Four types of query, generated from TREC topics 1-150: • Short keyword: only the title of the topic description. • Long keyword: the concept field (28 keywords on average). • Short verbose: only the description field. • Long verbose: the title, description, and narrative fields (more than 50 words on average). • Both keyword query types behaved in a similar way, and so did both verbose types. • Retrieval performance is much less sensitive to smoothing for the keyword queries than for the verbose queries.

  40. Combining Methods • "A General Language Model for Information Retrieval" (Fei Song and W. Bruce Croft).

  41. A general LM for IR • They propose an extensible model based on: • The Good-Turing estimate. • Curve-fitting functions. • Model combinations. • The point of using n-grams is to take local context into account; unigram models assume term independence.

  42. A general LM for IR • The new model: • Smooth each document with the Good-Turing estimate. • Expand each document model with the corpus. • Consider term pairs and expand the unigram model to the bigram model.

  43. Step 1: Good-Turing Idea Revisited • N_tf = number of terms with frequency tf in a document; E(N_tf) = expected value of N_tf; N_d = total number of terms occurring in d. • The probability of a term with frequency tf is given by: p_GT(tf) = ((tf+1)/N_d) · (E(N_{tf+1}) / E(N_tf)).

  44. Step 2 • Expanding a document model with the corpus: a weighted linear interpolation p(w|d) = ω·p_GT(w|d) + (1−ω)·p(w|corpus), the same idea as Jelinek-Mercer smoothing (ω denotes the mixing weight).

  45. Step 3 • Modeling a query as a sequence of terms, so order and repetitions are preserved: p(q|d) = ∏_{i=1..m} p(q_i|d) under the unigram model.

  46. Step 4 • Combining unigrams and bigrams: p(q|d) = p(q_1|d) · ∏_{i=2..m} (λ1·p(q_i|d) + λ2·p(q_i|q_{i−1}, d)), with λ1 + λ2 = 1.
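
A sketch (ours) of the combined score, using raw ML unigram and bigram estimates; a real implementation would first smooth both models with the Good-Turing and corpus-expansion steps above. The weights l1 and l2 and the small floor constant are our assumptions.

```python
import math
from collections import Counter

def bigram_mix_loglik(query, doc, l1=0.8, l2=0.2):
    """log p(q|d) with p(q|d) = p(q_1|d) * prod over i>=2 of
    (l1*p(q_i|d) + l2*p(q_i|q_{i-1},d)), built from raw ML estimates."""
    uni = Counter(doc)
    bi = Counter(zip(doc, doc[1:]))
    n = len(doc)
    def p_uni(w):
        return uni[w] / n
    def p_bi(w, prev):
        return bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
    logp = math.log(p_uni(query[0]) or 1e-12)        # floor avoids log(0)
    for prev, w in zip(query, query[1:]):
        logp += math.log(l1 * p_uni(w) + l2 * p_bi(w, prev) or 1e-12)
    return logp

doc = "albert einstein was born in ulm einstein was a physicist".split()
print(bigram_mix_loglik("einstein was born".split(), doc))
```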

  47. Results • Two collections: • The Wall Street Journal (WSJ): 250 MB, 74,520 docs. • TREC-4: 2 GB, 567,529 docs. • Phrases of word pairs can be useful in improving retrieval performance. • The strategy can be easily extended.

  48. Personal Outlook / Conclusions

  49. Personal Outlook / Conclusions • Use a stop-list. • Use a Porter stemmer. • N-grams cannot capture long-span relationships in the language. • The performance of the n-gram model has reached a plateau. • The document prior P(d).

  50. Principal Component Analysis • A low-dimensional representation of the data. • Captures relations between features. • PCA tries to find a low-rank approximation, where the quality of the approximation depends on how close the data is to lying in a subspace of the given dimensionality.
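
A minimal PCA sketch (ours) via the SVD of the centered data matrix: projecting onto the top-k right singular vectors gives the best rank-k approximation in the least-squares sense.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (observations) onto the top-k principal
    components of the column features."""
    Xc = X - X.mean(axis=0)                  # center each feature
    # Right singular vectors of the centered matrix are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                     # low-dimensional representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(pca(X, 2).shape)                       # (100, 2)
```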
