Formal Multinomial and Multiple-Bernoulli Language Models


Overview

- Two formal estimation techniques
  - MAP estimates [Zaragoza, Hiemstra, Tipping, SIGIR’03]
  - Posterior expectations
- Language models considered
  - Multinomial
  - Multiple-Bernoulli (2 models)

Bayesian Framework (MAP Estimation)

- Assume textual data X (a document, query, etc.) is generated by sampling from some distribution P(X | θ) parameterized by θ
- Assume some prior P(θ) over θ
- For each X, we want to find the maximum a posteriori (MAP) estimate: θX = argmax_θ P(θ | X) = argmax_θ P(X | θ) P(θ)
- θX is our (language) model for data X

Multinomial

- Modeling assumptions:
  - Each X is a sequence of terms drawn from a multinomial distribution over the vocabulary, parameterized by θ
  - The prior over θ is a Dirichlet(α)
- Why Dirichlet?
  - Conjugate prior to the multinomial (the posterior is again a Dirichlet)
  - Easy to work with (see the sketch below)
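
As a concrete illustration (not from the original slides), here is a minimal Python sketch of the multinomial MAP estimate under a Dirichlet(α) prior; setting every αw to 1 recovers the ML estimate:

```python
from collections import Counter

def multinomial_map(tokens, vocab, alpha):
    """MAP estimate of a multinomial under a Dirichlet(alpha) prior:
    theta_w = (tf(w) + alpha_w - 1) / (|X| + sum_w (alpha_w - 1))."""
    tf = Counter(tokens)
    denom = len(tokens) + sum(alpha[w] - 1 for w in vocab)
    return {w: (tf[w] + alpha[w] - 1) / denom for w in vocab}

vocab = ["A", "B"]
print(multinomial_map(list("ABBB"), vocab, {w: 1.0 for w in vocab}))
# alpha_w = 1 (uniform prior) -> ML estimate: {'A': 0.25, 'B': 0.75}
```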

How do we set α?

- α = 1 => uniform prior => ML estimate
- α = 2 => Laplacian smoothing
- Dirichlet-like smoothing: set αw = μ P(w | C) + 1, so the MAP estimate becomes P(w | θX) = (tf(w,X) + μ P(w | C)) / (|X| + μ)

[Figure: MAP estimates for X = A B B B with P(A | C) = 0.45 and P(B | C) = 0.55; left: ML estimate (α = 1); center: Laplace (α = 2); right: Dirichlet-like smoothing with α = μ P(w | C), μ = 10]
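
The figure's numbers can be reproduced with a short sketch, assuming the two-term vocabulary implied by the example:

```python
# Reproducing the worked example from the slides:
# X = A B B B, P(A|C) = 0.45, P(B|C) = 0.55, mu = 10.
tf = {"A": 1, "B": 3}
n = 4
p_c = {"A": 0.45, "B": 0.55}
mu = 10

ml = {w: tf[w] / n for w in tf}                                # alpha = 1
laplace = {w: (tf[w] + 1) / (n + len(tf)) for w in tf}         # alpha = 2
dirichlet = {w: (tf[w] + mu * p_c[w]) / (n + mu) for w in tf}  # alpha_w = mu*P(w|C) + 1

print(ml)         # {'A': 0.25, 'B': 0.75}
print(laplace)    # {'A': 0.333..., 'B': 0.666...}
print(dirichlet)  # {'A': 0.392..., 'B': 0.607...}
```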

Multiple-Bernoulli

- Assume vocabulary V = A B C D
- How do we model text X = D B B D?
- Under the multinomial model, we represent X as the sequence D B B D
- Under multiple-Bernoulli, we represent X as the binary vector [0 1 0 1], denoting that terms B and D occur in X
- Each X is represented by a single binary vector

Multiple-Bernoulli (Model A)

- Modeling assumptions:
  - Each X is a single sample from a multiple-Bernoulli distribution parameterized by θ: P(X | θ) = Πw θw^xw (1 − θw)^(1−xw), where xw indicates whether term w occurs in X
  - Use the conjugate prior (multiple-Beta, i.e., an independent Beta(αw, βw) per term); a sketch follows below
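
A minimal sketch of the Model A MAP estimate under those per-term Beta priors; the αw = βw = 2 values below are arbitrary illustration choices, not from the slides:

```python
def model_a_map(x, alpha, beta):
    """Model A MAP: one binary observation x_w per term under a
    Beta(alpha_w, beta_w) prior:
    theta_w = (x_w + alpha_w - 1) / (alpha_w + beta_w - 1)."""
    return {w: (x[w] + alpha[w] - 1) / (alpha[w] + beta[w] - 1) for w in x}

# X = D B B D over V = A B C D -> binary vector [0 1 0 1]
x = {"A": 0, "B": 1, "C": 0, "D": 1}
priors = {w: 2.0 for w in x}  # illustrative choice, not from the slides
print(model_a_map(x, priors, priors))
# {'A': 0.333..., 'B': 0.666..., 'C': 0.333..., 'D': 0.666...}
```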

Problems with Model A

- Ignores document length
- This may be desirable in some applications
- Ignores term frequencies
- How to solve this?
- Model X as a collection of samples (one per word occurrence) from an underlying multiple-Bernoulli distribution
- Example: V = A B C D, X = B D D B. Representation: { [0 1 0 0], [0 0 0 1], [0 0 0 1], [0 1 0 0] }

Multiple-Bernoulli (Model B)

- Modeling assumptions:
- Each X is a collection (multiset) of indicator vectors sampled from a multiple-Bernoulli distribution parameterized by θ
- Use conjugate prior (multiple-Beta)
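
A sketch of the Model B MAP estimate under the same per-term Beta priors, treating each of the |X| word occurrences as one Bernoulli trial per term; αw = βw = 1 recovers the ML estimate:

```python
from collections import Counter

def model_b_map(tokens, vocab, alpha, beta):
    """Model B MAP: term w succeeds in tf(w) of |X| Bernoulli trials
    under a Beta(alpha_w, beta_w) prior:
    theta_w = (tf(w) + alpha_w - 1) / (|X| + alpha_w + beta_w - 2)."""
    tf = Counter(tokens)
    n = len(tokens)
    return {w: (tf[w] + alpha[w] - 1) / (n + alpha[w] + beta[w] - 2)
            for w in vocab}

vocab = ["A", "B", "C", "D"]
uniform = {w: 1.0 for w in vocab}
print(model_b_map(list("BDDB"), vocab, uniform, uniform))
# alpha = beta = 1 -> ML estimate: {'A': 0.0, 'B': 0.5, 'C': 0.0, 'D': 0.5}
```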

How do we set α, β?

- α = β = 1 => uniform prior => ML estimate
- But we want smoothed probabilities…
- One possibility, mirroring the Dirichlet case: set αw = μ P(w | C) + 1 and βw = μ (1 − P(w | C)) + 1, giving the smoothed MAP estimate θw = (tf(w,X) + μ P(w | C)) / (|X| + μ)

[Figure: multiple-Bernoulli (Model B) estimates for X = A B B B with P(A | C) = 0.45 and P(B | C) = 0.55; left: ML estimate (α = β = 1); center: smoothed (μ = 1); right: smoothed (μ = 10)]
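
The smoothed values in the figure can be checked directly, assuming the μ-based parameterization above:

```python
# X = A B B B with P(A|C) = 0.45, P(B|C) = 0.55, assuming
# alpha_w = mu*P(w|C) + 1 and beta_w = mu*(1 - P(w|C)) + 1, so that
# theta_w = (tf(w) + mu * P(w|C)) / (|X| + mu).
tf = {"A": 1, "B": 3}
n = 4
p_c = {"A": 0.45, "B": 0.55}

for mu in (1, 10):
    print(mu, {w: (tf[w] + mu * p_c[w]) / (n + mu) for w in tf})
# 1  {'A': 0.29, 'B': 0.71}
# 10 {'A': 0.392..., 'B': 0.607...}
```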

Another approach…

- Another way to formally estimate a language model is to take the expectation over the posterior: θX = E[θ | X] = ∫ θ P(θ | X) dθ
- This takes more uncertainty into account than the MAP estimate
- Because we chose conjugate priors, the integral can be evaluated analytically
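
For the multinomial/Dirichlet case, the posterior expectation has the closed form E[θw | X] = (tf(w,X) + αw) / (|X| + Σw αw). A small sketch, with αw = μ P(w | C) as an illustrative choice consistent with the smoothing above:

```python
from collections import Counter

def multinomial_posterior_mean(tokens, vocab, alpha):
    """Posterior expectation under a Dirichlet(alpha) prior:
    E[theta_w | X] = (tf(w) + alpha_w) / (|X| + sum_w alpha_w)."""
    tf = Counter(tokens)
    denom = len(tokens) + sum(alpha.values())
    return {w: (tf[w] + alpha[w]) / denom for w in vocab}

vocab = ["A", "B"]
alpha = {"A": 10 * 0.45, "B": 10 * 0.55}  # alpha_w = mu * P(w|C), mu = 10
print(multinomial_posterior_mean(list("ABBB"), vocab, alpha))
# {'A': 0.392..., 'B': 0.607...} -- the Dirichlet-smoothed estimate
```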

Multinomial / Multiple-Bernoulli Connection

- Multinomial (posterior expectation with αw = μ P(w | C)): P(w | θX) = (tf(w,X) + μ P(w | C)) / (|X| + μ)
- Multiple-Bernoulli (Model B, posterior expectation with αw = μ P(w | C), βw = μ (1 − P(w | C))): θw = (tf(w,X) + μ P(w | C)) / (|X| + μ)
- Both reduce to the same Dirichlet-smoothed estimate
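
A quick numeric check of this connection on the running example (X = A B B B, μ = 10):

```python
# Both posterior expectations coincide: the denominators are both n + mu.
tf = {"A": 1, "B": 3}
n = 4
p_c = {"A": 0.45, "B": 0.55}
mu = 10

multinomial = {w: (tf[w] + mu * p_c[w]) / (n + mu) for w in tf}
model_b = {w: (tf[w] + mu * p_c[w])
              / (n + mu * p_c[w] + mu * (1 - p_c[w])) for w in tf}
for w in tf:
    assert abs(multinomial[w] - model_b[w]) < 1e-12
print(multinomial)  # {'A': 0.392..., 'B': 0.607...}
```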

Bayesian Framework (Ranking)

- Query likelihood
  - estimate a model θD for each document D
  - score document D by P(Q | θD)
  - measures the likelihood of observing query Q given model θD
- KL-divergence
  - estimate models for both the query and the document
  - score document D by KL(θQ || θD) (lower divergence = better match)
  - measures the “distance” between the two models
- Predictive density
  - score document D by the query's predictive probability under the posterior: P(Q | D) = ∫ P(Q | θ) P(θ | D) dθ
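
A toy query-likelihood ranker, sketched under the Dirichlet-smoothing setup above; the two-document collection is invented for illustration:

```python
import math
from collections import Counter

def dirichlet_lm(doc_tokens, p_c, mu=10.0):
    """Dirichlet-smoothed document language model:
    P(w | theta_D) = (tf(w,D) + mu * P(w|C)) / (|D| + mu)."""
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    return {w: (tf[w] + mu * p_c[w]) / (n + mu) for w in p_c}

def query_log_likelihood(query_tokens, theta_d):
    """log P(Q | theta_D) under the multinomial model."""
    return sum(math.log(theta_d[w]) for w in query_tokens)

# Toy two-document collection (invented for illustration).
docs = {"d1": list("ABBB"), "d2": list("AABD")}
total = sum(len(toks) for toks in docs.values())
vocab = {w for toks in docs.values() for w in toks}
p_c = {w: sum(toks.count(w) for toks in docs.values()) / total for w in vocab}

query = list("BB")
scores = {name: query_log_likelihood(query, dirichlet_lm(toks, p_c))
          for name, toks in docs.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # d1 ranks above d2
```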

Conclusions

- Both estimation and smoothing can be achieved using Bayesian estimation techniques
- Little difference between MAP and posterior-expectation estimates; results mostly depend on μ
- Not much difference between multinomial and multiple-Bernoulli language models
- Scoring the multinomial is cheaper
- No good reason to choose multiple-Bernoulli over multinomial in general
