
Sequence Models

With slides by me, Joshua Goodman, Fei Xia



Outline

  • Language Modeling

  • Ngram Models

  • Hidden Markov Models

    • Supervised Parameter Estimation

    • Probability of a sequence

    • Viterbi (or decoding)

    • Baum-Welch


A bad language model

(Four example slides; the illustrations appear only as images in the original.)



What is a language model?

Language Model: A distribution that assigns a probability to language utterances.

e.g., P_LM(“zxcv ./,mweaafsido”) is zero;

P_LM(“mat cat on the sat”) is tiny;

P_LM(“Colorless green ideas sleep furiously”) is bigger;

P_LM(“A cat sat on the mat”) is bigger still.



What’s a language model for?

  • Information Retrieval

  • Handwriting recognition

  • Speech Recognition

  • Spelling correction

  • Optical character recognition

  • Machine translation



Example Language Model Application

Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file).

Straightforward model:
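(The model itself is not in the transcript; presumably it is the direct model of the text given the audio, something like P(w1, …, wT | acoustic signal), estimated directly from paired data.)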

But this can be hard to train effectively (although see CRFs later).



Example Language Model Application

Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file).

Traditional solution: Bayes’ Rule

argmax_w P(w | a) = argmax_w P(a | w) P(w) / P(a)

  • P(a | w): the acoustic model (easier to train)

  • P(w): the language model

  • P(a): can be ignored, since it doesn’t matter for picking a good text



Importance of Sequence

So far, we’ve been making the exchangeability, or bag-of-words, assumption:

The order of words is not important.

It turns out, that’s actually not true (duh!).

“cat mat on the the sat” ≠ “the cat sat on the mat”

“Mary loves John” ≠ “John loves Mary”



Language Models with Sequence Information

Problem: How can we define a model that

  • assigns probabilities to sequences of words (a language model),

  • depends on the order of the words, and

  • can be trained and computed tractably?



Outline

  • Language Modeling

  • Ngram Models

  • Hidden Markov Models

    • Supervised parameter estimation

    • Probability of a sequence (decoding)

    • Viterbi (Best hidden layer sequence)

    • Baum-Welch



Smoothing: Kneser-Ney

P(Francisco | eggplant) vs. P(stew | eggplant)

  • “Francisco” is common, so backoff and interpolated methods say it is likely

  • But it only occurs in the context of “San”

  • “Stew” is common, and occurs in many different contexts

  • Idea: weight the backoff probability by the number of contexts a word occurs in



Kneser-Ney smoothing (cont)

Backoff:

Interpolation:
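(The formulas are not in the transcript; a standard bigram formulation of the two variants is:)

Backoff:

P_KN(wi | wi-1) = max(c(wi-1 wi) - D, 0) / c(wi-1)   if c(wi-1 wi) > 0
               = α(wi-1) P_continuation(wi)          otherwise

Interpolation:

P_KN(wi | wi-1) = max(c(wi-1 wi) - D, 0) / c(wi-1) + λ(wi-1) P_continuation(wi)

Here P_continuation(wi) = |{w : c(w wi) > 0}| / |{(w, w′) : c(w w′) > 0}| measures how many distinct contexts wi appears in, D is a fixed discount, and λ(wi-1) = D |{w : c(wi-1 w) > 0}| / c(wi-1) makes the interpolated distribution normalize.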



Outline

  • Language Modeling

  • Ngram Models

  • Hidden Markov Models

    • Supervised parameter estimation

    • Probability of an observation sequence

    • Viterbi (Best hidden layer sequence)

    • Baum-Welch



The Hidden Markov Model

A dynamic Bayes Net (dynamic because the size can change).

The Oi nodes are called observed nodes.

The Xi nodes are called hidden nodes.

[Diagram: a chain of hidden nodes X1, X2, …, XT, each emitting an observed node O1, O2, …, OT.]



HMMs and Language Processing

  • HMMs have been used in a variety of applications, but especially:

    • Speech recognition

      (hidden nodes are text words, observations are spoken words)

    • Part of Speech Tagging

      (hidden nodes are parts of speech, observations are words)




HMM Independence Assumptions

HMMs assume that:

  • Xi is independent of X1 through Xi-2, given Xi-1 (Markov assumption)

  • Oi is independent of all other nodes, given Xi

  • P(Xi | Xi-1) and P(Oi | Xi) do not depend on i (stationary assumption)

    Not very realistic assumptions about language – but HMMs are often good enough, and very convenient.




HMM Formula

An HMM predicts that the probability of observing a sequence o = <o1, o2, …, oT> with a particular set of hidden states x = <x1, … xT> is:
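(Writing out the formula, which the transcript omits, this is the usual HMM joint probability:)

P(o, x) = P(x1) P(o1 | x1) Πt=2..T P(xt | xt-1) P(ot | xt)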

To calculate, we need:

- Prior: P(x1) for all values of x1

- Observation: P(oi|xi) for all values of oi and xi

- Transition: P(xi|xi-1) for all values of xi and xi-1



HMM: Pieces

  • A set of hidden states H = {h1, …, hN} that are the values which hidden nodes may take.

  • A vocabulary, or set of observation symbols, V = {v1, …, vM}: the values which an observed node may take.

  • Initial probabilities P(x1=hj) for all j

    • Written as a vector of N initial probabilities, called π

    • πj = P(x1=hj)

  • Transition probabilities P(xt=hj | xt-1=hk) for all j, k

    • Written as an NxN ‘transition matrix’ A

    • Aj,k = P(xt=hj | xt-1=hk)

  • Observation probabilities P(ot=vk | xt=hj) for all k, j

    • Written as an NxM ‘observation matrix’ B

    • Bj,k = P(ot=vk | xt=hj)
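As a concrete, purely illustrative sketch (not taken from the slides), the three parameter sets can be stored as NumPy arrays. Note the code uses the common row-stochastic convention A[i, j] = P(xt = hj | xt-1 = hi), which is the transpose of the Aj,k indexing above:

    import numpy as np

    N, M = 3, 5                              # toy sizes: N hidden states, M observation symbols
    rng = np.random.default_rng(0)

    pi = rng.dirichlet(np.ones(N))           # initial probabilities, shape (N,)
    A = rng.dirichlet(np.ones(N), size=N)    # A[i, j] = P(x_t = h_j | x_{t-1} = h_i); rows sum to 1
    B = rng.dirichlet(np.ones(M), size=N)    # B[i, k] = P(o_t = v_k | x_t = h_i); rows sum to 1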



HMM for POS Tagging

  • H = {DT, NN, VB, IN, …}, the set of all POS tags.

  • V = the set of all words in English.

  • Initial probabilities πj are the probability that POS tag hj can start a sentence.

  • Transition probabilities Aj,k represent the probability that tag hj (e.g., VB) follows tag hk (e.g., NN).

  • Observation probabilities Bjk represent the probability that a tag (e.g., hj=NN) will generate a particular word (e.g., vk=cat).



Outline

  • Language Models

  • Ngram modeling

  • Hidden Markov Models

    • Supervised parameter estimation

    • Probability of a sequence

    • Viterbi: what’s the best hidden state sequence?

    • Baum-Welch: unsupervised parameter estimation


Supervised Parameter Estimation

[Trellis diagram: hidden states X1, …, XT connected by transition arcs labeled A, each emitting an observation o1, …, oT through emission arcs labeled B.]

  • Given an observation sequence and states, find the HMM parameters (π, A, and B) that are most likely to produce the sequence.

  • For example, POS-tagged data from the Penn Treebank
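With labeled data like this, estimation reduces to counting and normalizing. A minimal counting sketch (not from the slides; it assumes each training sentence is given as a list of (word, tag) pairs):

    from collections import Counter, defaultdict

    def mle_estimate(tagged_sentences):
        """Supervised MLE for an HMM from sentences of (word, tag) pairs."""
        init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
        for sent in tagged_sentences:
            tags = [tag for _, tag in sent]
            init[tags[0]] += 1                      # which tag starts the sentence
            for prev, curr in zip(tags, tags[1:]):
                trans[prev][curr] += 1              # tag-to-tag transitions
            for word, tag in sent:
                emit[tag][word] += 1                # tag-to-word emissions
        # normalize counts into probability tables
        pi = {t: c / sum(init.values()) for t, c in init.items()}
        A = {p: {c: n / sum(nxt.values()) for c, n in nxt.items()} for p, nxt in trans.items()}
        B = {t: {w: n / sum(ws.values()) for w, n in ws.items()} for t, ws in emit.items()}
        return pi, A, B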


Maximum Likelihood Parameter Estimation

[Trellis diagram as above; the count-based estimates appear on this slide as images.]
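(Written out as normalized counts over the labeled training data, these are presumably the usual maximum-likelihood estimates:)

π̂j = Count(x1 = hj) / (number of training sequences)

Âj,k = Count(xt-1 = hk, xt = hj) / Count(xt-1 = hk)

B̂j,k = Count(xt = hj, ot = vk) / Count(xt = hj)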



Outline

  • Language Modeling

  • Ngram models

  • Hidden Markov Models

    • Supervised parameter estimation

    • Probability of a sequence

    • Viterbi

    • Baum-Welch



What’s the probability of a sentence?

Suppose I asked you, ‘What’s the probability of seeing a sentence v1, …, vT on the web?’

If we have an HMM model of English, we can use it to estimate the probability.

(In other words, HMMs can be used as language models.)



Conditional Probability of a Sentence

  • If we knew the hidden states that generated each word in the sentence, it would be easy:
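(Given the states, the independence assumptions reduce this to a product of emission probabilities:)

P(o1, …, oT | x1, …, xT) = Πt=1..T P(ot | xt)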



Marginal Probability of a Sentence

Via marginalization, we have:
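P(o1, …, oT) = Σ over all sequences x1, …, xT of P(o1, …, oT, x1, …, xT)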

Unfortunately, if there are N possible values for each xi (h1 through hN), then there are N^T possible assignments to x1, …, xT.

Brute-force computation of this sum is intractable.


Forward Procedure

[Trellis diagram: hidden states x1, …, xT and observations o1, …, oT.]

  • Special structure gives us an efficient solution using dynamic programming.

  • Intuition: Probability of the first t observations is the same for all possible t+1 length state sequences.

  • Define:
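(In the notation used here, the forward variable and its recursion are:)

αt(j) = P(o1, …, ot, xt = hj)

Base case: α1(j) = πj P(o1 | x1 = hj)

Recursion: αt+1(j) = P(ot+1 | xt+1 = hj) Σk αt(k) P(xt+1 = hj | xt = hk)

Termination: P(o1, …, oT) = Σj αT(j)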


[Eight further “Forward Procedure” slides step through the recursion graphically over the trellis diagram; the formulas appear only as images in the original.]
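A minimal NumPy sketch of the forward procedure (not part of the original slides; it assumes observations are given as integer indices and uses the row-stochastic convention A[i, j] = P(next state j | current state i)):

    import numpy as np

    def forward(obs, pi, A, B):
        """alpha[t, j] = P(o_1..o_t, x_t = j); obs is a list of observation indices."""
        N, T = len(pi), len(obs)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                      # base case
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # recursion
        return alpha                                      # P(o) = alpha[-1].sum()

The total probability of the sentence is alpha[-1].sum(), i.e., the sum over final states.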


Backward Procedure

[Trellis diagram: hidden states x1, …, xT and observations o1, …, oT.]

The probability of the remaining observations ot+1, …, oT, given the state at time t.
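(In symbols, the backward variable and its recursion are:)

βt(j) = P(ot+1, …, oT | xt = hj)

Base case: βT(j) = 1

Recursion: βt(k) = Σj P(xt+1 = hj | xt = hk) P(ot+1 | xt+1 = hj) βt+1(j)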


Decoding Solution

[Trellis diagram: hidden states x1, …, xT and observations o1, …, oT.]

Forward Procedure

Backward Procedure

Combination
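(In the notation above, the three identities are:)

  • Forward: P(o1, …, oT) = Σj αT(j)

  • Backward: P(o1, …, oT) = Σj πj P(o1 | x1 = hj) β1(j)

  • Combination: P(o1, …, oT) = Σj αt(j) βt(j), for any choice of t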



Outline

  • Language modeling

  • Ngram models

  • Hidden Markov Models

    • Supervised parameter estimation

    • Probability of a sequence

    • Viterbi: what’s the best hidden state sequence?

    • Baum-Welch


Best State Sequence

Find the hidden state sequence that best explains the observations: the Viterbi algorithm.
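(Formally, the goal is:)

x* = argmax over x1, …, xT of P(x1, …, xT | o1, …, oT)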


Viterbi Algorithm

[Trellis diagram: states x1, …, xt-1 leading into state hj at time t, with observations o1, …, oT.]

The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state hj, and seeing the observation at time t
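(In symbols, this is the Viterbi variable:)

δt(j) = max over x1, …, xt-1 of P(x1, …, xt-1, o1, …, ot-1, xt = hj, ot)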


Viterbi Algorithm

[Trellis diagram: states x1, …, xt+1 and observations o1, …, oT.]

Recursive Computation
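(In standard form, the recursion is:)

δt+1(j) = P(ot+1 | xt+1 = hj) maxk [ δt(k) P(xt+1 = hj | xt = hk) ]

ψt+1(j) = argmaxk [ δt(k) P(xt+1 = hj | xt = hk) ]   (backpointer used for the backward pass)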


Viterbi Algorithm

[Trellis diagram: states x1, …, xT and observations o1, …, oT.]

Compute the most likely state sequence by working backwards
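A minimal NumPy sketch of the full algorithm, including the backward pass (not from the slides; it assumes observation indices and the row-stochastic convention A[i, j] = P(next state j | current state i)):

    import numpy as np

    def viterbi(obs, pi, A, B):
        """Most likely hidden-state sequence for the observation indices in obs."""
        N, T = len(pi), len(obs)
        delta = np.zeros((T, N))           # delta[t, j] = best score of any path ending in state j at time t
        psi = np.zeros((T, N), dtype=int)  # backpointers
        delta[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A        # scores[k, j] = delta[t-1, k] * P(j | k)
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) * B[:, obs[t]]
        states = [int(delta[-1].argmax())]            # best final state
        for t in range(T - 1, 0, -1):                 # follow backpointers
            states.append(int(psi[t, states[-1]]))
        return states[::-1]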



Outline

  • Language modeling

  • Ngram models

  • Hidden Markov Models

    • Supervised parameter estimation

    • Probability of a sequence

    • Viterbi

    • Baum-Welch: Unsupervised parameter estimation


Unsupervised Parameter Estimation

[Trellis diagram: hidden states with transition arcs A and emission arcs B to observations o1, …, oT.]

  • Given an observation sequence, find the model that is most likely to produce that sequence.

  • No analytic method

  • Given a model and observation sequence, update the model parameters to better fit the observations.


Parameter Estimation

[Trellis diagram as above.]

Probability of traversing an arc

Probability of being in state i
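(These are the expected arc-traversal and state-occupancy probabilities, usually written as:)

ξt(k, j) = P(xt = hk, xt+1 = hj | o1, …, oT)
         = αt(k) P(xt+1 = hj | xt = hk) P(ot+1 | xt+1 = hj) βt+1(j) / P(o1, …, oT)

γt(k) = P(xt = hk | o1, …, oT) = αt(k) βt(k) / P(o1, …, oT) = Σj ξt(k, j)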


Parameter Estimation

[Trellis diagram as above.]

Now we can compute the new estimates of the model parameters.
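(Following the Aj,k and Bj,k indexing from the earlier slides, the standard Baum-Welch re-estimates are:)

π̂j = γ1(j)

Âj,k = [ Σ over t = 1..T-1 of ξt(k, j) ] / [ Σ over t = 1..T-1 of γt(k) ]   (expected transitions from hk to hj, over expected visits to hk)

B̂j,k = [ Σ over t with ot = vk of γt(j) ] / [ Σ over t = 1..T of γt(j) ]   (expected emissions of vk from hj, over expected visits to hj)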


Parameter Estimation

[Trellis diagram as above.]

  • Guarantee: P(o1:T|A,B,π) <= P(o1:T|Â,B̂,π̂)

  • In other words, by repeating this procedure, we can gradually improve how well the HMM fits the unlabeled data.

  • There is no guarantee that this will converge to the best possible HMM, however (only guaranteed to find a local maximum).


The Most Important Thing

[Trellis diagram as above.]

We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not tractable.

