
# Formal Multinomial and Multiple-Bernoulli Language Models

Don Metzler

### Overview

• Two formal estimation techniques

• MAP estimates [Zaragoza, Hiemstra, Tipping, SIGIR’03]

• Posterior expectations

• Language models considered

• Multinomial

• Multiple-Bernoulli (2 models)

### Bayesian Framework (MAP Estimation)

• Assume textual data X (document, query, etc.) is generated by sampling from some distribution P(X | θ) parameterized by θ

• Assume some prior P(θ) over θ

• For each X, we want to find the maximum a posteriori (MAP) estimate (see the reconstruction below)

• θX is our (language) model for data X
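The MAP objective the bullets refer to, reconstructed in the notation of the surrounding text (the slide's original equation image did not survive extraction):

```latex
\hat{\theta}_X \;=\; \arg\max_{\theta} P(\theta \mid X)
           \;=\; \arg\max_{\theta} P(X \mid \theta)\, P(\theta)
```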

### Multinomial

• Modeling assumptions:

• Each X is a sequence of word samples drawn from a multinomial distribution parameterized by θ

• Use conjugate prior (Dirichlet)

• Why Dirichlet?

• Conjugate prior to multinomial

• Easy to work with

• α = 1 => uniform prior => ML estimate

• α = 2 => Laplace smoothing

• Dirichlet-like smoothing: set α = μP(w | C), tying the prior to the collection model (see the figure and sketch below)

[Figure: MAP estimates for X = A B B B with P(A | C) = 0.45 and P(B | C) = 0.55. Left: ML estimate (α = 1); center: Laplace (α = 2); right: Dirichlet-like smoothing (α = μP(w | C), μ = 10).]
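To make the three estimates concrete, here is a minimal runnable sketch, assuming a two-word vocabulary {A, B} and the standard Dirichlet-multinomial MAP formula θw = (nw + αw − 1) / (N + Σw′ (αw′ − 1)); only X, P(w | C), and μ come from the slide, the rest is my reconstruction:

```python
# MAP estimate of a multinomial under a Dirichlet prior (reconstruction):
#   theta_w = (n_w + alpha_w - 1) / (N + sum_w' (alpha_w' - 1))
from collections import Counter

def map_multinomial(text, alpha):
    """Return the MAP estimate {w: theta_w} for Dirichlet parameters alpha."""
    counts = Counter(text)
    denom = sum(counts.values()) + sum(a - 1 for a in alpha.values())
    return {w: (counts[w] + alpha[w] - 1) / denom for w in alpha}

X = list("ABBB")                    # the slide's example: X = A B B B
p_coll = {"A": 0.45, "B": 0.55}     # collection model P(w | C)
mu = 10

print(map_multinomial(X, {w: 1 for w in p_coll}))              # ML (alpha = 1): A 0.25, B 0.75
print(map_multinomial(X, {w: 2 for w in p_coll}))              # Laplace (alpha = 2): A 0.33, B 0.67
print(map_multinomial(X, {w: mu * p_coll[w] for w in p_coll})) # alpha = mu*P(w|C): A 0.375, B 0.625
```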

### Multinomial vs. Multiple-Bernoulli

• Assume vocabulary V = A B C D

• How do we model text X = D B B D?

• In multinomial, we represent X as the sequence D B B D

• In multiple-Bernoulli we represent X as the vector [0 1 0 1] denoting terms B and D occur in X

• Each X is represented by a single binary vector
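A small sketch of the two representations (variable names are mine, not the slides'):

```python
# Multinomial vs. multiple-Bernoulli representations of the same text.
V = ["A", "B", "C", "D"]
X = ["D", "B", "B", "D"]

multinomial_repr = X                            # the sequence D B B D itself
bernoulli_repr = [int(w in set(X)) for w in V]  # [0, 1, 0, 1]: B and D occur
print(multinomial_repr, bernoulli_repr)
```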

### Multiple-Bernoulli (Model A)

• Modeling assumptions:

• Each X is a single sample from a multiple-Bernoulli distribution parameterized by θ

• Use conjugate prior (multiple-Beta)

• Ignores document length

• This may be desirable in some applications

• Ignores term frequencies

• How to solve this?

• Model X as a collection of samples (one per word occurrence) from an underlying multiple-Bernoulli distribution

• Example: V = A B C D, X = B D D B

• Representation: { [0 1 0 0], [0 0 0 1], [0 0 0 1], [0 1 0 0] }
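The same example as code (a sketch; it builds exactly the list of indicator vectors described in the bullet above):

```python
# One indicator vector per word occurrence, as in the example above.
V = ["A", "B", "C", "D"]
X = ["B", "D", "D", "B"]
occurrence_vectors = [[int(w == x) for w in V] for x in X]
print(occurrence_vectors)  # [[0,1,0,0], [0,0,0,1], [0,0,0,1], [0,1,0,0]]
```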

### Multiple-Bernoulli (Model B)

• Modeling assumptions:

• Each X is a collection (multiset) of indicator vectors sampled from a multiple-Bernoulli distribution parameterized by θ

• Use conjugate prior (multiple-Beta)

How do we set α, β?

• α = β = 1 => uniform prior => ML estimate

• But we want smoothed probabilities…

• One possibility: smooth with the collection model via a parameter μ (see the figure and sketch below)

[Figure: Model B estimates for X = A B B B with P(A | C) = 0.45 and P(B | C) = 0.55. Left: ML estimate (α = β = 1); center: smoothed (μ = 1); right: smoothed (μ = 10).]
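A hedged sketch of Model B's smoothed estimate. The slide's actual formula did not survive extraction; I assume the standard Beta-Bernoulli MAP, θw = (nw + αw − 1) / (N + αw + βw − 2), with the collection-model choice αw = μP(w | C) + 1 and βw = μ(1 − P(w | C)) + 1, which reduces to (nw + μP(w | C)) / (N + μ):

```python
# Beta-Bernoulli MAP estimate for Model B (assumed reconstruction).
def bernoulli_map(n_w, N, alpha, beta):
    """MAP of a Bernoulli parameter under a Beta(alpha, beta) prior."""
    return (n_w + alpha - 1) / (N + alpha + beta - 2)

X = list("ABBB")                 # four occurrence vectors, so N = 4
p_coll = {"A": 0.45, "B": 0.55}
N = len(X)
for mu in (0, 1, 10):            # mu = 0 recovers the ML estimate
    theta = {w: bernoulli_map(X.count(w), N,
                              mu * p_coll[w] + 1,
                              mu * (1 - p_coll[w]) + 1)
             for w in p_coll}
    print(mu, theta)             # mu = 10: A ~0.393, B ~0.607
```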

### Posterior Expectation

• Another way to formally estimate language models is to take the expectation over the posterior: θX = E[θ | X] = ∫ θ P(θ | X) dθ

• Takes more uncertainty into account than the MAP estimate

• Because we chose conjugate priors, the integral can be evaluated analytically
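The resulting closed forms, reconstructed from standard conjugate-prior results (nw is the count of w in X, N the number of samples; the slide's own equation images are missing):

```latex
% Multinomial with Dirichlet prior:
\mathbb{E}[\theta_w \mid X] = \frac{n_w + \alpha_w}{N + \sum_{w'} \alpha_{w'}}
% Multiple-Bernoulli with a per-term Beta prior:
\mathbb{E}[\theta_w \mid X] = \frac{n_w + \alpha_w}{N + \alpha_w + \beta_w}
```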

### Multinomial / Multiple-Bernoulli Connection

• Multinomial posterior expectation

• Multiple-Bernoulli posterior expectation

• Both reduce to Dirichlet smoothing (see the reconstruction below)
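A plausible reconstruction of the connection, assuming the priors are tied to the collection model (αw = μP(w | C) for the multinomial; αw = μP(w | C), βw = μ(1 − P(w | C)) for the multiple-Bernoulli): since Σw P(w | C) = 1, both posterior expectations collapse to the familiar Dirichlet-smoothed estimator

```latex
\mathbb{E}[\theta_w \mid X] = \frac{n_w + \mu\, P(w \mid C)}{N + \mu}
```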

### Bayesian Framework (Ranking)

• Query likelihood

• estimate model θD for each document D

• score document D by P(Q | θD)

• measures likelihood of observing query Q given model θD

• KL-divergence

• estimate model for both query and document

• score document D by KL(θQ || θD), ranking documents by smallest divergence

• measures “distance” between two models

• Predictive density: score D by P(Q | D) = ∫ P(Q | θ) P(θ | D) dθ
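A minimal query-likelihood ranking sketch, assuming Dirichlet-smoothed multinomial document models (the function names and toy data are mine, not the slides'):

```python
# Rank documents by log P(Q | theta_D) with Dirichlet-smoothed estimates.
import math
from collections import Counter

def query_log_likelihood(query, doc, p_coll, mu=10):
    """log P(Q | theta_D), with theta_D(w) = (n_w + mu*P(w|C)) / (|D| + mu)."""
    counts = Counter(doc)
    return sum(math.log((counts[w] + mu * p_coll[w]) / (len(doc) + mu))
               for w in query)

docs = {"d1": "a b b b".split(), "d2": "a a c b".split()}
p_coll = {"a": 0.375, "b": 0.5, "c": 0.125}   # toy collection model
query = "b b".split()

ranked = sorted(docs, key=lambda d: query_log_likelihood(query, docs[d], p_coll),
                reverse=True)
print(ranked)  # ['d1', 'd2']: d1 gives the query term more probability mass
```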

### Conclusions

• Both estimation and smoothing can be achieved using Bayesian estimation techniques

• Little difference between MAP and posterior expectation estimates – mostly depends on μ

• Not much difference between Multinomial and Multiple-Bernoulli language models

• Scoring with the multinomial model is cheaper

• No good reason to choose multiple-Bernoulli over multinomial in general