CS590I: Information Retrieval
  • CS-590I
  • Information Retrieval
  • Retrieval Models: Language models
  • Luo Si
  • Department of Computer Science
  • Purdue University
Retrieval Model: Language Model
  • Introduction to language model
  • Unigram language model
  • Document language model estimation
  • Maximum likelihood estimation
  • Maximum a posteriori estimation
  • Jelinek-Mercer smoothing
  • Model-based feedback
Language Models: Motivation
  • Vector space model for information retrieval
    • Documents and queries are vectors in the term space
    • Relevance is measured by the similarity between document vectors and the query vector
  • Problems with the vector space model
    • Ad-hoc term weighting schemes
    • Ad-hoc similarity measurement
    • No justification for the relationship between relevance and similarity
  • We need more principled retrieval models…
Introduction to Language Models
  • A language model can be created for any language sample
    • A document
    • A collection of documents
    • A sentence, paragraph, chapter, query…
  • The size of the language sample affects the quality of the language model
    • Longer documents yield more accurate models
    • Shorter documents yield less accurate models
    • Models for a sentence, paragraph, or query may not be reliable
Introduction to Language Models
  • A document language model defines a probability distribution over indexed terms
    • E.g., the probability of generating a term
    • The probabilities sum to 1
  • A query can be seen as observed data from an unknown model
    • A query also defines a language model (more on this later)
  • How might the models be used for IR?
    • Rank documents by Pr(q | d), the probability that the document's language model generates the query
    • Rank documents by the Kullback-Leibler (KL) divergence between the query and document language models (more on this later)
Language Model for IR: Example
  • Estimate a language model for each document:
    • d1: sport, basketball, ticket, sport
    • d2: stock, finance, finance, stock
    • d3: basketball, ticket, finance, ticket, sport
  • For the query q = (sport, basketball), estimate the generation probability Pr(q | d) under each document's language model
  • Rank the documents by this probability to generate the retrieval results
Language Models
  • Three basic problems for language models:
    • What type of probability distribution can be used to construct language models?
    • How do we estimate the parameters of the distribution of the language models?
    • How do we compute the likelihood of generating queries given the language models of documents?
Multinomial/Unigram Language Models
  • A language model built from a multinomial distribution over single terms (i.e., unigrams) in the vocabulary
  • Formally, the language model of a document di is: {Pi(w) for every word w in vocabulary V}
  • Example:
    • Five words in the vocabulary (sport, basketball, ticket, finance, stock)
    • For a document di, its language model is: {Pi("sport"), Pi("basketball"), Pi("ticket"), Pi("finance"), Pi("stock")}
Multinomial/Unigram Language Models
  • Estimating a multinomial language model for each document:
    • Multinomial model for d1: sport, basketball, ticket, sport
    • Multinomial model for d2: stock, finance, finance, stock
    • Multinomial model for d3: basketball, ticket, finance, ticket, sport
Maximum Likelihood Estimation (MLE)
  • Maximum likelihood estimation: find the model parameters that make the generation likelihood reach its maximum:
    M* = argmaxM Pr(D | M)
  • There are K words in the vocabulary, w1 … wK (e.g., K = 5)
  • Data: one document di with counts tfi(w1), …, tfi(wK) and length |di|
  • Model: multinomial M with parameters {pi(wk)}
  • Likelihood: Pr(di | M)
  • M* = argmaxM Pr(di | M)
Maximum Likelihood Estimation (MLE)
  • Use the Lagrange multiplier approach: set the partial derivatives to zero and obtain the maximum likelihood estimate:
    pi(wk) = tfi(wk) / |di|
Maximum Likelihood Estimation (MLE)
  • Estimating the language model for each document:
    • d1: sport, basketball, ticket, sport → (psp, pb, pt, pf, pst) = (0.5, 0.25, 0.25, 0, 0)
    • d2: stock, finance, finance, stock → (psp, pb, pt, pf, pst) = (0, 0, 0, 0.5, 0.5)
    • d3: basketball, ticket, finance, ticket, sport → (psp, pb, pt, pf, pst) = (0.2, 0.2, 0.4, 0.2, 0)
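The MLE estimate pi(wk) = tfi(wk) / |di| can be sketched in a few lines of plain Python (names are my own); d1 is the running example from the slides:

```python
from collections import Counter

def mle_unigram(tokens):
    """MLE unigram language model: p_i(w) = tf_i(w) / |d_i|."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

# d1 from the slides: "sport, basketball, ticket, sport"
d1 = ["sport", "basketball", "ticket", "sport"]
model = mle_unigram(d1)
# model["sport"] == 0.5; words unseen in d1 (e.g. "finance") get probability 0
```

Note that any word not occurring in the document gets probability 0, which is exactly the sparse-data problem discussed next.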
Maximum Likelihood Estimation (MLE)
  • Maximum likelihood estimation assigns zero probability to words unseen in a small sample
  • A specific example:
    • Only two words in the vocabulary (w1 = sport, w2 = business), like (head, tail) for a coin; a document generates a sequence of the two words, like flipping a coin many times
    • Observing only two words (flipping the coin twice), the MLE estimators are:
      • "business sport": Pi(w1) = 0.5
      • "sport sport": Pi(w1) = 1 ?
      • "business business": Pi(w1) = 0 ?
Maximum Likelihood Estimation (MLE)
  • The same example:
    • Observing only two words (flipping the coin twice), the MLE estimators are:
      • "business sport": Pi(w1)* = 0.5
      • "sport sport": Pi(w1)* = 1 ?
      • "business business": Pi(w1)* = 0 ?
  • This is the data sparseness problem
Solutions to Sparse Data Problems
  • Maximum a posteriori (MAP) estimation
  • Shrinkage
  • Bayesian ensemble approach
Maximum A Posteriori (MAP) Estimation
  • Maximum a posteriori estimation: select the model that maximizes the probability of the model given the observed data:
    M* = argmaxM Pr(M|D) = argmaxM Pr(D|M) Pr(M)
  • Pr(M): prior belief/knowledge
  • Use the prior Pr(M) to avoid zero probabilities
  • A specific example:
    • Only two words in the vocabulary (sport, business)
    • For a document di, place a prior distribution over the model parameters
Maximum A Posteriori (MAP) Estimation
  • Maximum a posteriori estimation:
    • Introduce a prior on the multinomial distribution
    • Use the prior Pr(M) to avoid zero probabilities; most coins are more or less unbiased
    • Use a Dirichlet prior on p(w):
      Dir(p | α1, …, αK) = [Γ(α1 + … + αK) / (Γ(α1) … Γ(αK))] · p1^(α1-1) … pK^(αK-1)
    • The αk are hyper-parameters, the bracketed factor is a normalizing constant, and Γ(x) is the gamma function
Maximum A Posteriori (MAP) Estimation
  • Maximum a posteriori:
    M* = argmaxM Pr(M|D) = argmaxM Pr(D|M) Pr(M)
  • With a Dirichlet prior, the MAP estimate is pi(wk) = (tfi(wk) + αk - 1) / (|di| + Σk(αk - 1)); the αk - 1 act as pseudo counts
Maximum A Posteriori (MAP) Estimation
  • A specific example:
    • Observe only two words (flip a coin twice): "sport sport", so Pi(w1)* = 1 ?
    • The posterior is the likelihood times the prior: Pr(D|M) Pr(M) ∝ P(w1)^2 · Pr(M), so a prior that vanishes at P(w1) = 1 pulls the MAP estimate away from 1
MAP Estimation: Unigram Language Model
  • MAP estimation:
    • Use a Dirichlet prior for the multinomial distribution
    • How should the parameters of the Dirichlet prior be set?
MAP Estimation: Unigram Language Model
  • MAP estimation:
    • Use a Dirichlet prior for the multinomial distribution; with K terms in the vocabulary:
      Dir(p | α1, …, αK) = [Γ(α1 + … + αK) / (Γ(α1) … Γ(αK))] · p1^(α1-1) … pK^(αK-1)
    • The αk are hyper-parameters, the bracketed factor is a normalizing constant, and Γ(x) is the gamma function
MAP Estimation: Unigram Language Model
  • MAP estimation for the unigram language model:
    • Use a Lagrange multiplier; set the derivative to 0:
      pi(wk) = (tfi(wk) + αk - 1) / (|di| + Σk(αk - 1))
    • The pseudo counts αk - 1 are set by the hyper-parameters
MAP Estimation: Unigram Language Model
  • MAP estimation for the unigram language model:
    • Use a Lagrange multiplier; set the derivative to 0
    • How to determine appropriate values for the hyper-parameters?
    • When nothing is observed from a document, what is the most likely pi(wk) without looking at the content of the document?
MAP Estimation: Unigram Language Model
  • What is the most likely pi(wk) without looking at the content of the document?
    • The most likely pi(wk) without looking into the content of document di is the unigram probability of the whole collection:
      {p(w1|C), p(w2|C), …, p(wK|C)}
    • Without any other information, guess the behavior of one member from the behavior of the whole population
  • Set the hyper-parameters as αk - 1 = μ p(wk|C), where μ is a constant
MAP Estimation: Unigram Language Model
  • MAP estimation for the unigram language model:
    • Use a Lagrange multiplier; set the derivative to 0:
      pi(wk) = (tfi(wk) + μ p(wk|C)) / (|di| + μ)
    • The μ p(wk|C) are pseudo counts; μ acts as a pseudo document length
Maximum A Posteriori (MAP) Estimation
  • Dirichlet MAP estimation for the unigram language model:
    • Step 0: compute the word probabilities on the whole collection to obtain the collection unigram language model {p(wk|C)}
    • Step 1: for each document di, compute its smoothed unigram language model (Dirichlet smoothing) as pi(wk) = (tfi(wk) + μ p(wk|C)) / (|di| + μ)
Maximum A Posteriori (MAP) Estimation
  • Dirichlet MAP estimation for the unigram language model:
    • Step 2: for a given query q = {tfq(w1), …, tfq(wK)}, compute the likelihood Pr(q|di) = Πk pi(wk)^tfq(wk) for each document di
    • The larger the likelihood, the more relevant the document is to the query
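The three steps can be sketched in plain Python (function names are my own; μ = 2000 is a commonly used default, not a value fixed by the slides), using the running three-document example:

```python
import math
from collections import Counter

def dirichlet_model(doc, collection_counts, collection_len, mu=2000.0):
    """Dirichlet smoothing: p_i(w) = (tf_i(w) + mu * p(w|C)) / (|d_i| + mu)."""
    tf = Counter(doc)
    def p(w):
        return (tf[w] + mu * collection_counts[w] / collection_len) / (len(doc) + mu)
    return p

def query_log_likelihood(query, p):
    """log Pr(q | d_i) = sum over query terms of tf_q(w) * log p_i(w)."""
    return sum(math.log(p(w)) for w in query)

docs = {"d1": ["sport", "basketball", "ticket", "sport"],
        "d2": ["stock", "finance", "finance", "stock"],
        "d3": ["basketball", "ticket", "finance", "ticket", "sport"]}
coll = Counter(w for d in docs.values() for w in d)   # Step 0: collection model
clen = sum(coll.values())
scores = {name: query_log_likelihood(["sport", "basketball"],
                                     dirichlet_model(d, coll, clen))
          for name, d in docs.items()}
# every document gets a non-zero score; documents containing more
# query-term occurrences rank higher
```

Working in log space avoids underflow when queries are long, and smoothing guarantees the log of a strictly positive probability for every query term.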
Dirichlet Smoothing & TF-IDF
  • Dirichlet smoothing: pi(w) = (tfi(w) + μ p(w|C)) / (|di| + μ)
  • Query log-likelihood: log Pr(q|di) = Σw tfq(w) log pi(w)
  • Splitting the query words into those that occur in di and those that do not, the document-independent part of the score can be dropped as irrelevant to the ranking
  • Looking at the tf.idf part of what remains: query words that match the document contribute a tf component, rare words (small p(w|C)) are weighted more heavily as in idf, and the remaining term normalizes for document length
  • So the Dirichlet-smoothed query likelihood implicitly performs TF-IDF weighting with document length normalization
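The decomposition behind this connection can be written out explicitly (a standard derivation, using the same tfi, μ, and p(w|C) notation as above):

```latex
\log \Pr(q \mid d_i)
  = \sum_{w \in q} \mathit{tf}_q(w)\,\log p_i(w)
  = \sum_{w \in q \cap d_i} \mathit{tf}_q(w)\,
      \log\!\Big(1 + \frac{\mathit{tf}_i(w)}{\mu\, p(w \mid C)}\Big)
  \;+\; |q| \log \frac{\mu}{|d_i| + \mu}
  \;+\; \underbrace{\sum_{w \in q} \mathit{tf}_q(w)\,\log p(w \mid C)}_{\text{document independent}}
```

The first sum is the tf.idf-like part (term frequency divided by a collection-frequency weight), the second term penalizes long documents, and the last term is constant across documents and can be ignored for ranking.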
Dirichlet Smoothing Hyper-Parameter
  • Dirichlet smoothing: pi(w) = (tfi(w) + μ p(w|C)) / (|di| + μ), with hyper-parameter μ
  • When μ is very small, the estimate approaches the MLE estimator
  • When μ is very large, the estimate approaches the probability on the whole collection
  • How to set an appropriate μ?
Dirichlet Smoothing Hyper-Parameter
  • Leave-one-out validation:
    • Leave each word w1, …, wj, … of a document out in turn
    • Predict the held-out occurrence with the model estimated from the remaining words
Dirichlet Smoothing Hyper-Parameter
  • Leave-one-out validation:
    • Leave all words out one by one for a document
    • Do the procedure for all documents in the collection
    • Find the μ that maximizes the held-out (leave-one-out) likelihood
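The leave-one-out procedure can be sketched as follows (a minimal illustration; the grid of candidate μ values and the toy documents are my own, and a real implementation would optimize μ over a finer grid or with Newton's method):

```python
import math
from collections import Counter

def loo_log_likelihood(docs, mu):
    """Leave-one-out log-likelihood of a candidate mu: predict each word
    occurrence in each document from the Dirichlet-smoothed model estimated
    with that one occurrence removed."""
    coll = Counter(w for d in docs for w in d)
    clen = sum(coll.values())
    ll = 0.0
    for doc in docs:
        tf = Counter(doc)
        n = len(doc)
        for w, c in tf.items():
            # remove one occurrence of w: counts c-1, length n-1
            p = (c - 1 + mu * coll[w] / clen) / (n - 1 + mu)
            ll += c * math.log(p)
    return ll

docs = [["sport", "basketball", "ticket", "sport"],
        ["stock", "finance", "finance", "stock"],
        ["basketball", "ticket", "finance", "ticket", "sport"]]
candidates = [0.1, 1.0, 10.0, 100.0]
best_mu = max(candidates, key=lambda m: loo_log_likelihood(docs, m))
```

Because the held-out word always has non-zero collection probability, the leave-one-out likelihood stays finite even for words seen only once in a document.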
Dirichlet Smoothing Hyper-Parameter
  • What type of document/collection would get a large μ?
    • Most documents use similar vocabulary and wording patterns to the whole collection
  • What type of document/collection would get a small μ?
    • Most documents use different vocabulary and wording patterns than the whole collection
Shrinkage
  • Maximum likelihood (MLE) builds the model purely on document data to generate query words
    • The model may not be accurate when the document is short (many unseen words)
  • A shrinkage estimator builds a more reliable model by consulting more general models (e.g., the collection language model)
  • Example: estimate P(Lung_Cancer | Smoke) for West Lafayette by shrinking toward the statistics for Indiana and the U.S.
Jelinek-Mercer Smoothing
  • Assume that each word is generated with probability λ from the document language model (MLE) and with probability 1 - λ from the collection language model (MLE)
  • Linear interpolation between the document language model and the collection language model (a shrinkage estimator):
    JM smoothing: pi(w) = λ pML(w|di) + (1 - λ) p(w|C)
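The interpolation is a one-liner; as a minimal sketch (toy document and collection of my own choosing):

```python
from collections import Counter

def jm_model(doc, collection_counts, collection_len, lam=0.5):
    """Jelinek-Mercer smoothing: p_i(w) = lam * p_ML(w|d_i) + (1 - lam) * p(w|C)."""
    tf = Counter(doc)
    n = len(doc)
    def p(w):
        return lam * tf[w] / n + (1 - lam) * collection_counts[w] / collection_len
    return p

doc = ["sport", "basketball"]
coll = Counter(["sport", "sport", "sport", "basketball"])  # toy collection
p = jm_model(doc, coll, sum(coll.values()), lam=0.5)
# p("sport") = 0.5 * (1/2) + 0.5 * (3/4) = 0.625
```

With λ = 1 this reduces to the MLE estimator, and with λ = 0 it reduces to the collection model, mirroring the two extremes of the Dirichlet hyper-parameter μ.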
Shrinkage
  • Relationship between JM smoothing and Dirichlet smoothing:
    • Dirichlet smoothing is JM smoothing with a document-dependent interpolation coefficient λi = |di| / (|di| + μ), since
      (tfi(w) + μ p(w|C)) / (|di| + μ) = λi pML(w|di) + (1 - λi) p(w|C)
Model Based Feedback
  • Equivalence of retrieval based on query generation likelihood and Kullback-Leibler (KL) divergence between the query and document language models
  • KL divergence between two probability distributions:
    KL(p || q) = Σw p(w) log( p(w) / q(w) )
  • It measures the difference between the two distributions (not a true distance: it is asymmetric)
  • It is always greater than or equal to zero; how to prove it? (Hint: apply Jensen's inequality to the concave log function)
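The divergence itself is a short sum (a sketch over dictionary-based distributions; it assumes q assigns non-zero probability wherever p does, which smoothing guarantees in the retrieval setting):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_w p(w) * log(p(w) / q(w)); non-negative, and zero
    only when the two distributions are identical (Gibbs' inequality)."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

p = {"sport": 0.5, "finance": 0.5}
q = {"sport": 0.9, "finance": 0.1}
# kl_divergence(p, p) == 0.0, while kl_divergence(p, q) > 0
```

Terms with p(w) = 0 contribute nothing (the limit of x log x as x → 0 is 0), hence the `pw > 0` guard.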
Model Based Feedback
  • Equivalence of retrieval based on query generation likelihood and KL divergence between the query and document language models:
    log Pr(q|θd) = Σw tfq(w) log p(w|θd) = -|q| [ KL(θq || θd) + H(θq) ]
  • The first expression is the log-likelihood of the query generation probability; the query entropy H(θq) is a document-independent constant, so ranking by query likelihood is equivalent to ranking by -KL(θq || θd)
  • This generalizes the query representation to a distribution (fractional term weighting)
Model Based Feedback
  • Estimate a document language model for each document
  • Estimate a query language model from the query
  • Calculate the KL divergence between the query model and each document model to generate the retrieval results
Model Based Feedback
  • With no feedback, use the original query language model
  • With feedback: take feedback documents from the initial retrieval results, estimate a feedback model from them, and interpolate it with the original query model to form a new query model (no feedback: keep the original model; full feedback: use only the feedback model)
  • Recalculate the KL divergence between the new query model and each document language model to produce the final retrieval results
Model Based Feedback: Estimate qF
  • Assume there is a generative model that produces each word within the feedback document(s)
  • For each word w in the feedback document(s), flip a coin:
    • With probability 1 - λ, w is generated from the background model: pC(w)
    • With probability λ, w is generated from the topic model: qF(w)
Model Based Feedback: Estimate qF
  • The MLE estimator: maximize the likelihood of the feedback documents under the mixture model
  • For each word, there is a hidden variable telling which language model it comes from
  • Example (query "Basketball"):
    • Background model pC(w|C), weight 1 - λ = 0.8: the 0.12, to 0.05, it 0.04, a 0.02, …, sport 0.0001, basketball 0.00005
    • Unknown query topic qF(w), weight λ = 0.2: sport = ?, basketball = ?, game = ?, player = ?
  • If we knew the value of the hidden variable for each word, estimation would be easy
Model Based Feedback: Estimate qF
  • For each word, the hidden variable Zi ∈ {1 (feedback), 0 (background)}
  • Step 1 (E-step): estimate the hidden variables based on the current model parameters, e.g., the (0.1), basketball (0.7), game (0.6), is (0.2), …
  • Step 2 (M-step): update the model parameters based on the guesses from Step 1
Model Based Feedback: Estimate qF
  • Expectation-Maximization (EM) algorithm (given λ = 0.5):
    • Step 0: initialize the values of qF(w)
    • Step 1 (Expectation): Pr(Zw = 1 | w) = λ qF(w) / (λ qF(w) + (1 - λ) pC(w))
    • Step 2 (Maximization): qF(w) ∝ tf(w, F) · Pr(Zw = 1 | w), normalized to sum to 1
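The EM loop above can be sketched directly (a minimal illustration; the toy background model, iteration count, and initialization are my own choices, not prescribed by the slides):

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, p_background, lam=0.5, iters=50):
    """EM for the two-component mixture: each word occurrence in the feedback
    documents comes from the topic model qF (probability lam) or from the
    background model p_background (probability 1 - lam)."""
    tf = Counter(w for d in feedback_docs for w in d)
    total = sum(tf.values())
    qf = {w: c / total for w, c in tf.items()}  # Step 0: initialize with MLE
    for _ in range(iters):
        # E-step: posterior probability that each word came from the topic model
        post = {w: lam * qf[w] / (lam * qf[w] + (1 - lam) * p_background[w])
                for w in tf}
        # M-step: re-estimate qF from the expected topic counts, then normalize
        norm = sum(tf[w] * post[w] for w in tf)
        qf = {w: tf[w] * post[w] / norm for w in tf}
    return qf

# Toy background model: "the" is common in the collection, topic words are rare
pc = {"the": 0.5, "basketball": 0.01, "game": 0.01}
qf = estimate_feedback_model([["the", "basketball", "the", "game"]], pc)
# EM pushes the common word "the" toward the background component,
# concentrating qF on the topical words
```

This is exactly the behavior the next slide describes: the background model absorbs common words, so the estimated topic model emphasizes discriminative terms.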
Model Based Feedback: Estimate qF
  • Properties of the parameter λ:
    • If λ is close to 0, most common words can be generated by the collection language model, so more topic words remain in the query language model
    • If λ is close to 1, the query language model has to generate the most common words itself, so fewer topic words remain in the query language model
Retrieval Model: Language Model
  • Introduction to language model
  • Unigram language model
  • Document language model estimation
  • Maximum likelihood estimation
  • Maximum a posteriori estimation
  • Jelinek-Mercer smoothing
  • Model-based feedback