### Formal Multinomial and Multiple-Bernoulli Language Models

Don Metzler

### Overview

- Two formal estimation techniques
  - MAP estimates [Zaragoza, Hiemstra, Tipping, SIGIR’03]
  - Posterior expectations
- Language models considered
  - Multinomial
  - Multiple-Bernoulli (2 models)
### Bayesian Framework (MAP Estimation)

- Assume textual data X (document, query, etc.) is generated by sampling from some distribution P(X | θ) parameterized by θ
- Assume some prior over θ
- For each X, we want to find the maximum a posteriori (MAP) estimate (see the equation below)
- θX is our (language) model for data X
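
In standard notation, the MAP estimate referred to here is:

```latex
\hat{\theta}_X
  = \arg\max_{\theta} P(\theta \mid X)
  = \arg\max_{\theta} \frac{P(X \mid \theta)\,P(\theta)}{P(X)}
  = \arg\max_{\theta} P(X \mid \theta)\,P(\theta)
```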
### Multinomial

- Modeling assumptions: X is a sequence of |X| independent samples (one per word occurrence) from a multinomial distribution over the vocabulary V, with a Dirichlet(α) prior on the parameters θ (equations below)
- Why Dirichlet?
  - Conjugate prior to the multinomial
  - Easy to work with
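
The slide's equations were images; under these assumptions, with tf_{w,X} denoting the count of w in X, the likelihood and prior take the standard form:

```latex
P(X \mid \theta) \propto \prod_{w \in V} \theta_w^{\,tf_{w,X}},
\qquad
P(\theta) \propto \prod_{w \in V} \theta_w^{\,\alpha_w - 1}
```

so the MAP estimate (the mode of the Dirichlet posterior) is:

```latex
P(w \mid \hat{\theta}_X) = \frac{tf_{w,X} + \alpha_w - 1}{|X| + \sum_{v \in V} (\alpha_v - 1)}
```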
### How do we set α?

- α = 1 ⇒ uniform prior ⇒ ML estimate
- α = 2 ⇒ Laplacian smoothing
- Dirichlet-like smoothing (see below)
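
The equation for this setting is not in the transcript; a reconstruction consistent with the MAP formula above and the α = μP(w | C) label in the figure below is:

```latex
\alpha_w = \mu\, P(w \mid C) + 1
\quad\Longrightarrow\quad
P(w \mid \hat{\theta}_X) = \frac{tf_{w,X} + \mu\, P(w \mid C)}{|X| + \mu}
```

where P(w | C) is the background collection model; this recovers the usual Dirichlet-smoothed estimate.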

[Figure: MAP multinomial estimates for X = A B B B, with P(A | C) = 0.45 and P(B | C) = 0.55. Left: ML estimate (α = 1); center: Laplace (α = 2); right: α = μP(w | C) with μ = 10.]
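
A minimal sketch reproducing the figure's three panels numerically, assuming the vocabulary is just {A, B} (the function name is mine, not from the slides):

```python
from collections import Counter

def map_multinomial(text, vocab, alpha):
    """MAP estimate: (tf_w + alpha_w - 1) / (|X| + sum_v (alpha_v - 1))."""
    tf = Counter(text)
    denom = len(text) + sum(alpha[w] - 1 for w in vocab)
    return {w: (tf[w] + alpha[w] - 1) / denom for w in vocab}

vocab = ["A", "B"]              # assumed vocabulary for this toy example
X = "A B B B".split()
C = {"A": 0.45, "B": 0.55}      # background model P(w | C)
mu = 10

print(map_multinomial(X, vocab, {w: 1 for w in vocab}))              # ML: A=0.25, B=0.75
print(map_multinomial(X, vocab, {w: 2 for w in vocab}))              # Laplace: A≈0.33, B≈0.67
print(map_multinomial(X, vocab, {w: mu * C[w] + 1 for w in vocab}))  # Dirichlet: A≈0.39, B≈0.61
```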

### Multiple-Bernoulli

- Assume vocabulary V = A B C D
- How do we model text X = D B B D?
  - In the multinomial model, we represent X as the sequence D B B D
  - In multiple-Bernoulli, we represent X as the vector [0 1 0 1], denoting that terms B and D occur in X
- Each X is represented by a single binary vector
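
A quick sketch of this representation (helper name mine):

```python
vocab = ["A", "B", "C", "D"]

def binary_vector(text, vocab):
    """Represent a text as a single presence/absence vector over the vocabulary."""
    present = set(text)
    return [1 if w in present else 0 for w in vocab]

print(binary_vector("D B B D".split(), vocab))   # [0, 1, 0, 1]
```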
### Multiple-Bernoulli (Model A)

- Modeling assumptions:
  - Each X is a single sample from a multiple-Bernoulli distribution parameterized by θ (one Bernoulli per vocabulary term)
  - Use a conjugate prior (multiple-Beta)
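
The slide's equations were images; with x_w ∈ {0, 1} indicating whether term w occurs in X, the likelihood, multiple-Beta prior, and per-term MAP estimate take the standard Beta-Bernoulli form:

```latex
P(X \mid \theta) = \prod_{w \in V} \theta_w^{\,x_w} (1 - \theta_w)^{1 - x_w},
\qquad
P(\theta) \propto \prod_{w \in V} \theta_w^{\,\alpha_w - 1} (1 - \theta_w)^{\beta_w - 1}
```

```latex
\hat{\theta}_w = \frac{x_w + \alpha_w - 1}{\alpha_w + \beta_w - 1}
```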
### Problems with Model A

- Ignores document length
  - This may be desirable in some applications
- Ignores term frequencies
- How to solve this?
  - Model X as a collection of samples (one per word occurrence) from an underlying multiple-Bernoulli distribution
  - Example: V = A B C D, X = B D D B. Representation: { [0 1 0 0], [0 0 0 1], [0 0 0 1], [0 1 0 0] } (see the sketch below)
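
A sketch of the per-occurrence representation (helper name mine):

```python
vocab = ["A", "B", "C", "D"]

def indicator_multiset(text, vocab):
    """One indicator vector per word occurrence (Model B's representation)."""
    return [[1 if w == token else 0 for w in vocab] for token in text]

print(indicator_multiset("B D D B".split(), vocab))
# [[0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0]]
```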
### Multiple-Bernoulli (Model B)

- Modeling assumptions:
  - Each X is a collection (multiset) of indicator vectors sampled from a multiple-Bernoulli distribution parameterized by θ
  - Use a conjugate prior (multiple-Beta)
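
Treating X as |X| independent indicator samples, so term w is "on" in tf_{w,X} of them, the likelihood and per-term MAP estimate are:

```latex
P(X \mid \theta) = \prod_{w \in V} \theta_w^{\,tf_{w,X}} (1 - \theta_w)^{|X| - tf_{w,X}},
\qquad
\hat{\theta}_w = \frac{tf_{w,X} + \alpha_w - 1}{|X| + \alpha_w + \beta_w - 2}
```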
### How do we set α, β?

- α = β = 1 ⇒ uniform prior ⇒ ML estimate
- But we want smoothed probabilities…
- One possibility (see below)
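
The slide's equation is missing; a reconstruction consistent with the Dirichlet-like smoothing above and the "smoothed (μ = …)" panels in the figure below is:

```latex
\alpha_w = \mu\, P(w \mid C) + 1,
\qquad
\beta_w = \mu\,\bigl(1 - P(w \mid C)\bigr) + 1
\quad\Longrightarrow\quad
\hat{\theta}_w = \frac{tf_{w,X} + \mu\, P(w \mid C)}{|X| + \mu}
```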

[Figure: Multiple-Bernoulli Model B estimates for X = A B B B, with P(A | C) = 0.45 and P(B | C) = 0.55. Left: ML estimate (α = β = 1); center: smoothed (μ = 1); right: smoothed (μ = 10).]
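
A numeric sketch of these panels, inheriting the reconstructed α, β choice above:

```python
from collections import Counter

def model_b_map(text, vocab, collection, mu):
    """Model B MAP with alpha_w = mu*P(w|C) + 1, beta_w = mu*(1 - P(w|C)) + 1
    (an assumed prior choice); simplifies to (tf_w + mu*P(w|C)) / (|X| + mu)."""
    tf = Counter(text)
    return {w: (tf[w] + mu * collection[w]) / (len(text) + mu) for w in vocab}

X = "A B B B".split()
C = {"A": 0.45, "B": 0.55}
print(model_b_map(X, ["A", "B"], C, 0))    # ML (alpha = beta = 1): A=0.25, B=0.75
print(model_b_map(X, ["A", "B"], C, 1))    # smoothed, mu = 1
print(model_b_map(X, ["A", "B"], C, 10))   # smoothed, mu = 10
```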

### Another approach…

- Another way to formally estimate language models is via the expectation over the posterior (see below)
- This takes more uncertainty into account than the MAP estimate
- Because we chose to use conjugate priors, the integral can be evaluated analytically
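
For the models above, this is the posterior mean, which conjugacy makes closed form:

```latex
P(w \mid X) = \int \theta_w \, P(\theta \mid X)\, d\theta
```

For the multinomial with a Dirichlet(α) prior, and for Model B with Beta(α_w, β_w) priors:

```latex
E[\theta_w \mid X] = \frac{tf_{w,X} + \alpha_w}{|X| + \sum_{v \in V} \alpha_v},
\qquad
E[\theta_w \mid X] = \frac{tf_{w,X} + \alpha_w}{|X| + \alpha_w + \beta_w}
```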
### Multinomial / Multiple-Bernoulli Connection

- Multinomial estimate
- Multiple-Bernoulli (Model B) estimate
- Both reduce to Dirichlet smoothing (see below)
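
The equations here were images; with the prior settings reconstructed above, both MAP estimates collapse to the same Dirichlet-smoothed form:

```latex
P(w \mid \hat{\theta}_X)^{\text{multinomial}}
  = \hat{\theta}_w^{\text{Model B}}
  = \frac{tf_{w,X} + \mu\, P(w \mid C)}{|X| + \mu}
```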
### Bayesian Framework (Ranking)

- Query likelihood
  - Estimate a model θD for each document D
  - Score document D by P(Q | θD)
  - Measures the likelihood of observing query Q given model θD
- KL-divergence
  - Estimate models for both the query and the document
  - Score document D by KL(θQ || θD) (lower is better)
  - Measures the “distance” between the two models
- Predictive density (see the sketch below)
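
The predictive-density equation is missing from the transcript; the standard form integrates the query likelihood over the document's posterior:

```latex
P(Q \mid D) = \int P(Q \mid \theta)\, P(\theta \mid D)\, d\theta
```

And a minimal query-likelihood scorer, assuming the Dirichlet-smoothed multinomial from earlier (data and function names are mine):

```python
import math
from collections import Counter

def log_query_likelihood(query, doc, collection, mu):
    """log P(Q | theta_D) under a Dirichlet-smoothed multinomial document model."""
    tf = Counter(doc)
    return sum(
        math.log((tf[q] + mu * collection[q]) / (len(doc) + mu))
        for q in query
    )

docs = {"d1": "A B B B".split(), "d2": "B D D B".split()}
C = {"A": 0.45, "B": 0.55}        # toy background model P(w | C)
query = "B B".split()

# Rank documents by log query likelihood (higher is better).
ranked = sorted(docs, key=lambda d: log_query_likelihood(query, docs[d], C, mu=10),
                reverse=True)
print(ranked)   # ['d1', 'd2']: d1 has more occurrences of B
```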
### Conclusions

- Both estimation and smoothing can be achieved using Bayesian estimation techniques
- Little difference between MAP and posterior-expectation estimates; the results mostly depend on μ
- Not much difference between the multinomial and multiple-Bernoulli language models
  - Scoring the multinomial is cheaper
  - In general, there is no good reason to choose multiple-Bernoulli over the multinomial