

  1. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms Group 9: Jaspreet Arora, Vaishnavi Ravindran, Mathanky Sankaranarayanan, Shadi Shahsavari

  2. Outline • Tagging problem in NLP • Modelling the tagging problem as Hidden Markov Models • HMM example • Training HMMs - MLE • Viterbi algorithm + example • Proposed method: Perceptron Algorithm • Proposed method: Example trigram • Theorems • Results

  3. Parts of Speech • Word classes, lexical categories, or “tags” • Noun, Verb, Adverb, Adjective, Preposition, etc. • Open & closed classes • Closed classes (e.g. pronouns, articles) have a limited, fixed membership

  4. POS Tagging Problem • The POS tagging problem is to determine the POS tag for a particular instance of a word • Problem? • Tags depend on context • New contexts and new words keep entering dictionaries in various languages • Manual POS tagging is not scalable

  5. Example

  6. Example • E.g. Mrs. Stark never got around to joining the team • All you gotta do is go around the corner • The entry fee costs around 250.

  7. Example - ambiguity • E.g. Mrs. Stark never got around/RP to joining the team • All you gotta do is go around/IN the corner • The entry fee costs around/RB 250 • RP – Particle (a particle is a function word that must be associated with another word or phrase to impart meaning) • IN – Preposition or subordinating conjunction • RB – Adverb (modifying 250)

  8. Example

  9. Set of possibilities

  10. Supervised Learning Problem • Input a sequence of observations x = x1...xn – e.g. x = the men saw the dog • Output a sequence of labels y = y1....yn – e.g. y = D N V D N

  11. Supervised Learning Problem • Learn a function h that maps input x to labels y • Y(x) is the set of possible labels for x • In a structured problem Y(x) is very large • Each member y has a structure • In tagging, we are trying to predict many tags for the entire sentence • Structured prediction as density estimation: p(y|x)

  12. Supervised Learning Problem • Input a sequence of observations x = x1...xn – e.g. x = the men saw the dog • Output a sequence of labels y = y1....yn – e.g. y = D N V D N • In a probabilistic model, we want: – argmaxy P(y|x) = argmaxy P(x,y)/P(x) = argmaxy P(x,y)

  13. Generative Probabilistic Model P(x,y) = P(x|y)P(y) E.g. The can is in the garage x = {the, can, is, in, the, garage} y = {DT, N, ...} Local - “can” is more likely a modal verb than a noun Context - a noun is much more likely than a verb to follow a determiner

  14. Generative Probabilistic Model P(x,y) = P(x|y)P(y) E.g. The can is in the garage Local - P(x|y): “can” is more likely a modal verb than a noun Context - P(y): a noun is much more likely than a verb to follow a determiner
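
To make the factorization concrete, here is a minimal sketch that scores one candidate tag sequence for "the can is in the garage" under the bigram factorization P(x,y) = ∏i P(yi|yi-1) · ∏i P(xi|yi). All probability values are invented for illustration, not estimates from real data.

```python
# Minimal sketch: score one (sentence, tag sequence) pair under the generative model
# P(x, y) = prod_i P(y_i | y_{i-1}) * prod_i P(x_i | y_i).
# Every probability below is an invented illustrative number.

words = ["the", "can", "is", "in", "the", "garage"]
tags = ["DT", "N", "V", "IN", "DT", "N"]

trans = {  # P(y_i | y_{i-1}); "<s>" marks the start of the sentence
    ("<s>", "DT"): 0.6, ("DT", "N"): 0.7, ("N", "V"): 0.5,
    ("V", "IN"): 0.3, ("IN", "DT"): 0.6,
}
emit = {   # P(x_i | y_i)
    ("DT", "the"): 0.5, ("N", "can"): 0.01, ("V", "is"): 0.2,
    ("IN", "in"): 0.3, ("N", "garage"): 0.005,
}

prob, prev = 1.0, "<s>"
for word, tag in zip(words, tags):
    prob *= trans[(prev, tag)] * emit[(tag, word)]
    prev = tag

print(f"P(x, y) = {prob:.3e}")
```

Scoring a competing sequence that tags "can" as a modal verb would simply use different entries in the same two tables; the decoding problem later in the deck is to find the highest-scoring sequence among all candidates.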

  15. Markov Chains We first need to understand what the Markov property is and what it is used for. 1. Objective of a Markov chain model: find the probability of observing a sequence of ordered events, where there is dependency between the events, i.e. find P(x1, x2, ..., xT). 2. What is the Markov property? By the chain rule, the probability of any sequence is modelled as P(x1, x2, ..., xT) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xT|x1,...,xT-1).

  16. What is the Markov property? • According to the Markov property, an event at time t depends only on the event at time t-1, so the sequence probability becomes P(x1, x2, ..., xT) = P(x1) ∏t=2..T P(xt|xt-1) • We need to know two types of probabilities to compute this: 1. At time t = 1: the initial distribution, πi = P(x1 = i). Since we have N states, we will have N such probabilities.

  17. What is the Markov property? 2. At time t > 1: the transition probabilities aij = P(xt = j | xt-1 = i). We call the N×N matrix of these the transition probability matrix A.
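
Putting the two ingredients together, the probability of an ordered state sequence is the initial probability of its first state times a product of transition probabilities. A minimal sketch, with an invented initial distribution and transition matrix (the weather states of the next slide are used purely as labels):

```python
# Minimal sketch: P(x_1, ..., x_T) = pi[x_1] * prod_{t=2..T} A[x_{t-1}][x_t].
# The initial distribution pi and transition matrix A are invented for illustration.

pi = {"sunny": 0.5, "rainy": 0.3, "foggy": 0.2}            # initial distribution
A = {                                                      # transition matrix
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

def sequence_probability(states):
    prob = pi[states[0]]
    for prev, curr in zip(states, states[1:]):
        prob *= A[prev][curr]
    return prob

print(sequence_probability(["sunny", "sunny", "rainy"]))   # 0.5 * 0.8 * 0.05 = 0.02
```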

  18. Example 1: Predicting weather • Let's say we have three states: Sunny, Rainy, Foggy • Transition probability matrix: (shown as a table on the slide)

  19. Example 1 : Predicting weather (Continued) What’s the probability that tomorrow is sunny and the day after is rainy, given today is sunny?
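
Using the invented transition matrix from the sketch above (and noting that the question conditions on today's weather, so the initial distribution is not needed): P(sunny tomorrow, rainy the day after | sunny today) = a(sunny→sunny) · a(sunny→rainy) = 0.8 · 0.05 = 0.04. The values in the slide's actual matrix may of course differ.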

  20. Hidden Markov Model We can now use the Markov property to build what are known as Hidden Markov Models. What is the objective of HMMs? • Input: an ordered sequence of events (observed sequence) • Output: also an ordered sequence of events (hidden sequence) • Example: • Find: P(<y1,y2,y3,...,yn> | <x1,x2,x3,...,xn>), the probability of a hidden sequence of ordered events given that the observed input is <x1,x2,x3,...,xn>. • The tagging problem is a perfect example of this: the input is a sequence of words (order matters) and the output is a sequence of tags for these words.

  21. The decoding problem in HMM • Given an HMM model 𝛌 = (A,B,π) and an input observation sequence x1,...,xT, we want to find the most probable state sequence: argmaxy P(y|x,𝛌) ∝ argmaxy P(x|y,𝛌) P(y|𝛌) • We refer to the observation sequence as x and the state sequence as y • In the above equation, we have two probabilities: • P(x|y,𝛌): the conditional probability of the observation sequence given a state sequence, a product of emission probabilities (by the Markov assumption) • P(y|𝛌): the prior probability of a state sequence y, a product of transition probabilities (by the Markov assumption)

  22. Learning the parameters of an HMM • Given an observation sequence x (training data) and the set of possible states in the HMM, learn the HMM parameters 𝛌 = (A,B,π) • Find the 𝛌 = (A,B,π) that locally maximizes P(x|𝛌) (Maximum Likelihood Estimation, MLE) • Find A and B using counts from the (tagged) training data: aij = C(i→j) / C(i), bi(S) = C(i→S) / C(i) • Where C(i→j) is the number of times state i transitions to state j, and C(i→S) is the number of times state i was seen with symbol S
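
A minimal sketch of these count-based MLE estimates on a tiny invented tagged corpus:

```python
# Minimal sketch: MLE estimates of HMM parameters from a tiny invented tagged corpus.
#   a_ij   = C(i -> j) / C(i)     (transition counts)
#   b_i(S) = C(i -> S) / C(i)     (emission counts)
from collections import defaultdict

tagged_sentences = [
    [("the", "DT"), ("dog", "N"), ("barks", "V")],
    [("the", "DT"), ("men", "N"), ("saw", "V"), ("the", "DT"), ("dog", "N")],
]

transition_counts = defaultdict(int)
emission_counts = defaultdict(int)
state_counts = defaultdict(int)

for sentence in tagged_sentences:
    prev = "<s>"                       # start-of-sentence pseudo-state
    state_counts[prev] += 1
    for word, tag in sentence:
        transition_counts[(prev, tag)] += 1
        emission_counts[(tag, word)] += 1
        state_counts[tag] += 1
        prev = tag

A = {pair: count / state_counts[pair[0]] for pair, count in transition_counts.items()}
B = {pair: count / state_counts[pair[0]] for pair, count in emission_counts.items()}

print(A[("DT", "N")])    # C(DT -> N) / C(DT) = 3 / 3 = 1.0
print(B[("N", "dog")])   # C(N -> dog) / C(N) = 2 / 3
```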

  23. Finding output sequence - Brute Force • We now need to enumerate all the possible state sequences, and pick the one with the maximum likelihood • How many such sequences would we have? • N^T → Why?

  24. Finding output sequence - Brute Force • N^T = |total no. of states| ^ (length of sequence) • Exponential growth
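
A minimal sketch of the brute-force search, enumerating all N^T candidate tag sequences for a three-word sentence; the scoring function is an invented stand-in for P(x,y). With N = 10 tags and a 20-word sentence this would already be 10^20 candidates, which is why brute force is hopeless in practice:

```python
# Minimal sketch of brute-force decoding: enumerate all N**T tag sequences and keep
# the best-scoring one. The scores are invented; the point is the N**T blow-up.
from itertools import product

words = ["the", "can", "is"]
tags = ["DT", "N", "V", "MD"]                              # N = 4 candidate tags
print(len(tags) ** len(words), "candidate sequences")      # 4**3 = 64

def score(word_seq, tag_seq):
    # Stand-in for P(x, y): reward a few invented (word, tag) pairings.
    preferences = {("the", "DT"): 2.0, ("can", "MD"): 1.5,
                   ("can", "N"): 1.0, ("is", "V"): 2.0}
    return sum(preferences.get(pair, 0.0) for pair in zip(word_seq, tag_seq))

best = max(product(tags, repeat=len(words)), key=lambda ts: score(words, ts))
print(best)                                                # ('DT', 'MD', 'V')
```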

  25. Viterbi Algorithm Dynamic Programming to the rescue

  26. So far what we have learnt: • POS tagging modelled with Markov chains (HMMs) • Based on probabilistic methods, we find the most probable tag sequence • Question: How do we compute the probability of the best sequence efficiently for long sentences? By using the Viterbi algorithm

  27. Viterbi Algorithm • Main idea: use previous calculations to get new results • Uses a table to store intermediate values • Approach: • Find the most probable hidden state sequence for the observation sequence • By maximizing over all possible hidden state sequences • But doing this efficiently

  28. Viterbi Algorithm: An example • How many distinct ways exist from A to B, with only right and down movements? (grid figure: start A, end B)

  29. Viterbi Algorithm: An example • Wouldn't it be easier to first know the number of ways from A to C and from A to D? (grid figure: A and B with intermediate cells C and D)
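
This path-counting puzzle is exactly the kind of problem dynamic programming solves: the number of ways to reach any cell is the number of ways to reach the cell above it plus the number of ways to reach the cell to its left, so intermediate counts (like those for C and D) are computed once and reused. A minimal sketch on an invented grid size:

```python
# Minimal sketch: count right/down paths from the top-left corner (A) to the
# bottom-right corner (B) by reusing counts for intermediate cells (like C and D).
def count_paths(rows, cols):
    ways = [[1] * cols for _ in range(rows)]   # one way along the top row / left column
    for r in range(1, rows):
        for c in range(1, cols):
            ways[r][c] = ways[r - 1][c] + ways[r][c - 1]   # from above + from the left
    return ways[-1][-1]

print(count_paths(4, 4))   # 20 distinct paths on a 4-by-4 grid of cells
```

The Viterbi algorithm applies the same idea to tagging, except that instead of summing path counts it keeps the maximum path probability at each cell.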

  30. Viterbi Algorithm in POS tagging: • A dynamic programming algorithm that allows us to compute the most probable path. • Here, P(x,y) = P(x|y)P(y) = ∏i P(xi|yi) · ∏i P(yi|yi-1) • Therefore a recursive algorithm can be proposed

  31. How To be Recursive?

  32. How To be Recursive?

  33. How To be Recursive? • Assume a sequence of words x1,x2,x3,...,xt with corresponding tags y1,y2,y3,...,yt (states). • Let's define the final state as j, i.e., yt = j • We would like to calculate P(x,y) = P(y1,y2,y3,...,yt=j, x1,x2,x3,...,xt) • Let's define vt(j) = max over y1,...,yt-1 of P(x,y) • vt(j) is the probability of the most probable path accounting for the first t observations and ending in state j

  34. How To be Recursive?
vt(j) = max over y1,...,yt-1 of P(y1,y2,...,yt-1, yt=j, x1,x2,...,xt)
= max over y1,...,yt-2 of maxi P(y1,...,yt-1=i, yt=j, x1,...,xt)
= max over y1,...,yt-2 of maxi P(y1,...,yt-1=i, x1,...,xt-1) P(yt=j | yt-1=i) P(xt | yt=j)
= maxi [max over y1,...,yt-2 of P(y1,...,yt-1=i, x1,...,xt-1)] P(yt=j | yt-1=i) P(xt | yt=j)
= maxi vt-1(i) P(yt=j | yt-1=i) P(xt | yt=j)

  35. How To be Recursive? Therefore: vt(j) = maxi vt-1(i) P(yt=j | yt-1=i) P(xt | yt=j) If we use the HMM parameters: vt(j) = maxi vt-1(i) aij bj(xt)

  36. Viterbi steps Step 1: Initialization, when t = 1: • v1(j) = πj bj(x1), for 1 ≤ j ≤ N Step 2: Recursion, when 1 < t ≤ T: • vt(j) = maxi vt-1(i) aij bj(xt), for 1 ≤ j ≤ N Step 3: Termination • P* = max P(y|x,λ) = maxj vT(j) • Backtracking from argmaxj vT(j)
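
A minimal sketch of the three steps in code. The states, observations, and all parameter values (playing the roles of π, A, and B) are invented for illustration:

```python
# Minimal Viterbi sketch: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(x_t),
# with backpointers so the best state sequence can be recovered by backtracking.
# States, observations, and all probabilities are invented for illustration.

states = ["DT", "N", "V"]
pi = {"DT": 0.6, "N": 0.3, "V": 0.1}
A = {"DT": {"DT": 0.05, "N": 0.85, "V": 0.10},
     "N":  {"DT": 0.10, "N": 0.30, "V": 0.60},
     "V":  {"DT": 0.50, "N": 0.40, "V": 0.10}}
B = {"DT": {"the": 0.7, "dog": 0.0, "barks": 0.0},
     "N":  {"the": 0.0, "dog": 0.4, "barks": 0.1},
     "V":  {"the": 0.0, "dog": 0.0, "barks": 0.5}}

def viterbi(observations):
    # Step 1: initialization, v_1(j) = pi_j * b_j(x_1)
    v = [{j: pi[j] * B[j][observations[0]] for j in states}]
    backpointers = [{}]
    # Step 2: recursion, v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(x_t)
    for t in range(1, len(observations)):
        v.append({})
        backpointers.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * A[i][j])
            v[t][j] = v[t - 1][best_i] * A[best_i][j] * B[j][observations[t]]
            backpointers[t][j] = best_i
    # Step 3: termination and backtracking from argmax_j v_T(j)
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointers[t][path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi(["the", "dog", "barks"]))   # (['DT', 'N', 'V'], best path probability)
```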

  37. How To be Recursive?

  38. How to calculate these probabilities? • A probabilistic model • Using maximum entropy methods: • log P(x,y) = ∑i log P(xi|yi) + ∑i log P(yi|yi-1) • A non-probabilistic model: • Perceptron method

  39. Perceptron Algorithm • Rosenblatt's perceptron is a binary single-neuron model. • It was the first algorithmically described neural network; it consists of a linear combiner followed by a hard limiter. • The inputs are combined by summing them with weights obtained during the training stage. • If the result of this sum is larger than a given threshold θ, the neuron fires. • When the neuron fires, its output is set to 1; otherwise it is set to -1.

  40. Perceptron Algorithm Finds a vector w such that the corresponding hyperplane separates + from -.
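
A minimal sketch of this update rule on an invented, linearly separable toy set; a bias term b plays the role of the negative threshold -θ, and on every mistake the weights are nudged toward misclassified + points and away from misclassified - points:

```python
# Minimal binary perceptron sketch: predict sign(w . x + b); on a mistake,
# update w <- w + y * x and b <- b + y. The toy data is invented.

def predict(w, b, x):
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else -1         # fire (+1) or not (-1)

def train(points, labels, epochs=10):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if predict(w, b, x) != y:          # wrong side of the hyperplane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                      # converged: the data is separated
            break
    return w, b

points = [(2.0, 1.0), (3.0, 4.0), (-1.0, -1.5), (-2.0, -3.0)]
labels = [1, 1, -1, -1]
w, b = train(points, labels)
print(w, b, [predict(w, b, x) for x in points])
```

The voted and averaged variants on the next slides keep, respectively, every intermediate weight vector or their running average, instead of only the final w.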

  41. Perceptron Algorithm

  42. Voted Perceptron

  43. Voted Perceptron

  44. Averaged Perceptron

  45. POS Tagging features for perceptron

  46. POS Tagging features for perceptron • For every word/tag pair in the training data we create multiple features that are functions of the “history” at that point and the tag. • Example: (see the sketch below) • Global features are defined for a whole sequence: they are the local features summed over every word/tag pair in that sequence.
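
A minimal sketch of what such features might look like in code; the two feature templates used here (current word with current tag, previous tag with current tag) are illustrative stand-ins, not the paper's full feature set:

```python
# Minimal sketch: local features of a (history, tag) pair, and global features for a
# whole sentence obtained by summing the local features over every position.
from collections import Counter

def local_features(words, i, prev_tag, tag):
    # The "history" here is simplified to (previous tag, the word sequence, position i).
    return Counter({
        f"WORD={words[i]},TAG={tag}": 1,
        f"PREV_TAG={prev_tag},TAG={tag}": 1,
    })

def global_features(words, tags):
    phi = Counter()
    prev = "<s>"
    for i, tag in enumerate(tags):
        phi += local_features(words, i, prev, tag)
        prev = tag
    return phi

words = ["the", "men", "saw", "the", "dog"]
tags = ["D", "N", "V", "D", "N"]
print(global_features(words, tags))   # e.g. "WORD=the,TAG=D" fires twice
```

The structured perceptron then scores a candidate tag sequence as the dot product of these global features with a weight vector, and decodes with Viterbi over the local features.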

  47. Example of Feature vector

  48. Generalised Parameter Estimation Feature vectors 𝝓 together with a parameter vector α ∈ Rd are used to define a conditional probability distribution over tags given a history as p(t | h; α) = exp( ∑s αs 𝝓s(h,t) ) / ∑t'∈T exp( ∑s αs 𝝓s(h,t') ), where the denominator (the sum over all tags t' ∈ T) normalizes the distribution.

  49. Generalised Parameter Estimation The log of the probability has the form log p(t | h; α) = ∑s αs 𝝓s(h,t) − log ∑t'∈T exp( ∑s αs 𝝓s(h,t') ) The log probability for a sequence pair (w[1:n], t[1:n]) will be ∑i=1..n log p(ti | hi; α), where hi = <ti-1, ti-2, w[1:n], i>.
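
A minimal sketch of this log-linear model for a single history, reusing the illustrative feature templates from the previous sketch; the history is simplified and the parameter values in α are invented:

```python
# Minimal sketch of the log-linear model
#   p(t | h; alpha) = exp(sum_s alpha_s * phi_s(h, t)) / sum_{t' in T} exp(sum_s alpha_s * phi_s(h, t'))
# Feature templates are illustrative, the history is simplified, and alpha values are invented.
import math

TAGS = ["D", "N", "V"]

def features(history, tag):
    prev_tag, words, i = history
    return [f"WORD={words[i]},TAG={tag}", f"PREV_TAG={prev_tag},TAG={tag}"]

alpha = {"WORD=dog,TAG=N": 2.0, "WORD=dog,TAG=V": 0.3,
         "PREV_TAG=D,TAG=N": 1.5, "PREV_TAG=D,TAG=V": -0.5}

def log_prob(history, tag):
    def score(t):
        return sum(alpha.get(f, 0.0) for f in features(history, t))
    log_z = math.log(sum(math.exp(score(t)) for t in TAGS))
    return score(tag) - log_z                 # log p(t | h; alpha)

history = ("D", ["the", "dog"], 1)            # simplified h_i = <t_{i-1}, w_[1:n], i>
print(log_prob(history, "N"))

# The log probability of a whole sequence is then the sum of log_prob(h_i, t_i) over i.
```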
