
Parts of Speech

Sudeshna Sarkar, 7 Aug 2008


Why Do We Care about Parts of Speech?

- Pronunciation
  - Hand me the lead pipe.
  - I will lead the group into the lead smelter.
- Predicting what words can be expected next
  - Personal pronoun (e.g., I, she) ____________
- Stemming
  - -s means singular for verbs, plural for nouns
- As the basis for syntactic parsing and then meaning extraction
- Machine translation
  - (E) content +N → (F) contenu +N
  - (E) content +Adj → (F) content +Adj or satisfait +Adj

What is a Part of Speech?

Is this a semantic distinction? For example, maybe Noun is the class of words for people, places and things. Maybe Adjective is the class of words for properties of nouns.

Consider: green book

book is a Noun

green is an Adjective

Now consider: book worm. Here book modifies worm, the way an adjective would.

This green is very soothing. And here green occupies a noun position. So parts of speech are really distributional (syntactic) classes, defined by where words can occur, not purely semantic ones.

How Many Parts of Speech Are There?

- A first cut at the easy distinctions:
- Open classes:
  - nouns, verbs, adjectives, adverbs
- Closed classes: function words
  - conjunctions: and, or, but
  - pronouns: I, she, him
  - prepositions: with, on
  - determiners: the, a, an

Part of speech tagging

- 8 (ish) traditional parts of speech
- Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
- This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
- Called: parts-of-speech, lexical category, word classes, morphological classes, lexical tags, POS
- We’ll use POS most frequently
- I’ll assume that you all know what these are

POS examples

| Tag | Class | Examples |
|-----|-------|----------|
| N | noun | chair, bandwidth, pacing |
| V | verb | study, debate, munch |
| ADJ | adjective | purple, tall, ridiculous |
| ADV | adverb | unfortunately, slowly |
| P | preposition | of, by, to |
| PRO | pronoun | I, me, mine |
| DET | determiner | the, a, that, those |

Tagsets

Brown corpus tagset (87 tags):

http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html

Penn Treebank tagset (45 tags):

http://www.cs.colorado.edu/~martin/SLP/Figures/ (8.6)

C7 tagset (146 tags)

http://www.comp.lancs.ac.uk/ucrel/claws7tags.html

[Figure: the sentence "the koala put the keys on the table" shown with each word linked to a tag from the set {N, V, P, DET}]

POS Tagging: Definition

The process of assigning a part-of-speech or lexical class marker to each word in a corpus.

POS tagging: Choosing a tagset

- There are many parts of speech and many potential distinctions we could draw
- To do POS tagging, we need to choose a standard set of tags to work with
- We could pick a very coarse tagset:
  - N, V, Adj, Adv

- The more commonly used set is finer grained: the Penn Treebank tagset, with 45 tags
  - PRP$, WRB, WP$, VBG

- Even more fine-grained tagsets exist

Using the UPenn tagset

- The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
- Prepositions and subordinating conjunctions are marked IN ("although/IN I/PRP ...")
- Except the preposition/complementizer "to", which gets its own tag, TO.

POS Tagging

- Words often have more than one POS: back
- The back door = JJ
- On my back = NN
- Win the voters back = RB
- Promised to back the bill = VB

- The POS tagging problem is to determine the POS tag for a particular instance of a word.
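As a quick illustration (not part of the original slides), an off-the-shelf tagger such as NLTK's default English tagger can be asked to disambiguate back in context; this sketch assumes the nltk package and its pretrained tagger model are installed:

```python
import nltk

# One-time model downloads (assumes network access; the model name
# may differ across NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["The back door is open.", "I promised to back the bill."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# Expect "back" to come out as JJ in the first sentence and VB in the second.
```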

Algorithms for POS Tagging

- Ambiguity – In the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags):

- Worse, 40% of the tokens are ambiguous.

Algorithms for POS Tagging

- Why can’t we just look them up in a dictionary?
- Words that aren’t in the dictionary

http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc

- One idea: P(ti | wi) = the probability that a random hapax legomenon in the corpus has tag ti.
- Nouns are more likely than verbs, which are more likely than pronouns.

- Another idea: use morphology.

Algorithms for POS Tagging - Knowledge

- Dictionary
- Morphological rules, e.g.,
- _____-tion
- _____-ly
- capitalization

- N-gram frequencies
- to _____
- DET _____ N
- But what about rare words, e.g., smelt (two verb forms: to melt ore, and the past tense of smell; and one noun form, a small fish)?

- Combining these (a toy sketch follows below)
  - V _____-ing: I was gracking vs. Gracking is fun.
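To make the combination concrete, here is a toy sketch (my own illustration, not from the slides) of how a dictionary, morphological rules, and n-gram context cues could be layered; the lexicon entries and rules are hypothetical stand-ins:

```python
LEXICON = {"the": "DET", "fun": "ADJ", "is": "V", "was": "V"}  # stand-in dictionary

def guess_tag(word, prev_tag):
    """Toy guesser layering dictionary, morphology, and n-gram context."""
    if word.lower() in LEXICON:                  # dictionary lookup
        return LEXICON[word.lower()]
    if word.endswith("tion"):                    # morphological rule: _____-tion
        return "N"
    if word.endswith("ly"):                      # morphological rule: _____-ly
        return "ADV"
    if word[0].isupper() and prev_tag is not None:
        return "NP"                              # capitalized mid-sentence
    if prev_tag == "TO":                         # n-gram cue: "to _____"
        return "V"
    if word.endswith("ing") and prev_tag == "V":
        return "V"                               # "I was gracking"
    return "N"                                   # default: the largest open class

print(guess_tag("gracking", "V"))   # V  ("I was gracking")
print(guess_tag("Gracking", None))  # N  ("Gracking is fun." - gerund as subject)
```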

POS Tagging - Approaches

- Rule-based tagging
  - (ENGTWOL)
- Stochastic (= probabilistic) tagging
  - HMM (Hidden Markov Model) tagging
- Transformation-based tagging
  - Brill tagger

- Do we return one best answer, or several answers and let later steps decide?
- How does the requisite knowledge get entered?

3 methods for POS tagging

1. Rule-based tagging

- Example: Karlsson (1995) EngCG tagger, based on the Constraint Grammar architecture and the ENGTWOL lexicon
- Basic Idea:
  - Assign all possible tags to words (using a morphological analyzer)
  - Remove wrong tags according to a set of constraint rules (typically more than 1000 hand-written constraint rules, but they may also be machine-learned)

3 methods for POS tagging

2. Transformation-based tagging

- Example: Brill (1995) tagger - a combination of rule-based and stochastic (probabilistic) tagging methodologies
- Basic Idea:
  - Start with a tagged corpus + dictionary (with the most frequent tags)
  - Set the most probable tag for each word as a start value
  - Change tags according to rules of the type "if word-1 is a determiner and word is a verb then change the tag to noun", applied in a specific order (like rule-based taggers)
  - Machine learning is used: the rules are automatically induced from a previously tagged training corpus (like the stochastic approach); a sketch of a single transformation appears below
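A minimal sketch of applying one Brill-style transformation (illustrative only; the real tagger learns hundreds of such rules, and their order, from the corpus):

```python
def apply_rule(tags, from_tag, to_tag, prev_required):
    """Apply one transformation: change `from_tag` to `to_tag`
    whenever the previous tag is `prev_required`."""
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_required:
            tags[i] = to_tag
    return tags

words = ["the", "race"]
tags = ["DT", "VB"]            # start values: most frequent tag per word
tags = apply_rule(tags, "VB", "NN", "DT")
print(list(zip(words, tags)))  # [('the', 'DT'), ('race', 'NN')]
```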

3 methods for POS tagging

3. Stochastic (=Probabilistic) tagging

- Example: HMM (Hidden Markov Model) tagging - a training corpus used to compute the probability (frequency) of a given word having a given POS tag in a given context

Hidden Markov Model (HMM) Tagging

- Using an HMM to do POS tagging
- HMM is a special case of Bayesian inference
- It is also related to the “noisy channel” model in ASR (Automatic Speech Recognition)

Hidden Markov Model (HMM) Taggers

- Goal: maximize P(word|tag) x P(tag|previous n tags)
- P(word|tag)
- word/lexical likelihood
- probability that given this tag, we have this word
- NOT probability that this word has this tag
- modeled through language model (word-tag matrix)

- P(tag|previous n tags)
- tag sequence likelihood
- probability that this tag follows these previous tags
- modeled through language model (tag-tag matrix)

Lexical information

Syntagmatic information

POS tagging as a sequence classification task

- We are given a sentence (an “observation” or “sequence of observations”)
- Secretariat is expected to race tomorrow
- sequence of n words w1…wn.

- What is the best sequence of tags which corresponds to this sequence of observations?
- Probabilistic/Bayesian view:
- Consider all possible sequences of tags
- Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.

Getting to HMM

- Let T = t1, t2, …, tn
- Let W = w1, w2, …, wn
- Goal: out of all sequences of tags t1…tn, find the most probable sequence of POS tags T underlying the observed sequence of words w1, w2, …, wn:

\[ \hat{T} = \arg\max_{T} P(T \mid W) \]

- The hat ^ means "our estimate of the best, i.e. the most probable, tag sequence"
- argmax_x f(x) means "the x such that f(x) is maximized"; here, the tag sequence that maximizes the probability

Getting to HMM

- This equation is guaranteed to give us the best tag sequence
- But how do we make it operational? How do we compute this value?
- Intuition of Bayesian classification:
- Use Bayes rule to transform it into a set of other probabilities that are easier to compute
- Thomas Bayes: British mathematician (1702-1761)

Bayes Rule

Bayes rule breaks down any conditional probability P(x|y) into three other probabilities:

\[ P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)} \]

P(x|y): the conditional probability of an event x given that y has occurred.

Applied to tagging:

\[ \hat{T} = \arg\max_{T} \frac{P(W \mid T)\,P(T)}{P(W)} = \arg\max_{T} P(W \mid T)\,P(T) \]

We can drop the denominator P(W): it does not change across tag sequences, since we are looking for the best tag sequence for the same observation, i.e. the same fixed sequence of words.

Likelihood and prior: Further Simplifications

1. The probability of a word appearing depends only on its own POS tag, i.e. it is independent of the other words around it:

\[ P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \]

2. Bigram assumption: the probability of a tag appearing depends only on the previous tag:

\[ P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \]

3. The most probable tag sequence estimated by the bigram tagger:

\[ \hat{T} = \arg\max_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}) \]



Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-gram.

Bigrams are used as the basis for simple statistical analysis of text

The bigram assumption is related to the first-order Markov assumption


Two kinds of probabilities (1)

- Tag transition probabilities p(ti|ti-1)
  - Determiners are likely to precede adjectives and nouns:
    - That/DT flight/NN
    - The/DT yellow/JJ hat/NN
  - So we expect P(NN|DT) and P(JJ|DT) to be high
  - But we expect P(DT|JJ) to be low

Two kinds of probabilities (1)

- Tag transition probabilities p(ti|ti-1)
- Compute P(NN|DT) by counting in a labeled corpus:

\[ P(NN \mid DT) = \frac{C(DT, NN)}{C(DT)} \]

(the number of times DT is followed by NN, divided by the number of times DT occurs)

Two kinds of probabilities (2)

- Word likelihood probabilities p(wi|ti)
  - P(is|VBZ) = probability of VBZ (3sg Pres verb) being "is"
  - Compute P(is|VBZ) by counting in a labeled corpus:

\[ P(is \mid VBZ) = \frac{C(VBZ, is)}{C(VBZ)} \]

If we were expecting a third person singular verb, how likely is it that this verb would be is? (A counting sketch follows below.)
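Both kinds of counts are straightforward to collect from a tagged corpus. A small sketch (the two-sentence corpus is a stand-in for a real treebank):

```python
from collections import Counter

corpus = [[("that", "DT"), ("flight", "NN")],
          [("the", "DT"), ("yellow", "JJ"), ("hat", "NN")]]

tag_count, word_tag, tag_pair = Counter(), Counter(), Counter()
for sentence in corpus:
    prev = "<s>"
    for word, tag in sentence:
        tag_count[tag] += 1
        word_tag[(word, tag)] += 1
        tag_pair[(prev, tag)] += 1
        prev = tag

def p_transition(tag, prev):      # P(tag | prev) = C(prev, tag) / C(prev)
    return tag_pair[(prev, tag)] / sum(
        c for (p, _), c in tag_pair.items() if p == prev)

def p_word(word, tag):            # P(word | tag) = C(tag, word) / C(tag)
    return word_tag[(word, tag)] / tag_count[tag]

print(p_transition("NN", "DT"))   # 0.5 in this tiny corpus
print(p_word("flight", "NN"))     # 0.5
```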

An Example: the verb “race”

- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?

Disambiguating “race”

- P(NN|TO) = .00047
- P(VB|TO) = .83
The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: ‘How likely are we to expect verb/noun given the previous tag TO?’

- P(race|NN) = .00057
- P(race|VB) = .00012
Lexical likelihoods from the Brown corpus for ‘race’ given a POS tag NN or VB.

- P(NR|VB) = .0027
- P(NR|NN) = .0012
  Tag sequence probabilities for the likelihood of an adverb (NR) occurring given the previous tag, verb or noun.

- P(VB|TO)P(NR|VB)P(race|VB) = .00000027
- P(NN|TO)P(NR|NN)P(race|NN) = .00000000032
  Multiply the lexical likelihoods with the tag sequence probabilities: the verb wins.
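Reproducing the slide's arithmetic (same numbers as above):

```python
# Verb reading: P(VB|TO) * P(NR|VB) * P(race|VB)
p_verb = 0.83 * 0.0027 * 0.00012
# Noun reading: P(NN|TO) * P(NR|NN) * P(race|NN)
p_noun = 0.00047 * 0.0012 * 0.00057
print(f"verb: {p_verb:.2e}  noun: {p_noun:.2e}")
# verb: 2.69e-07  noun: 3.21e-10  -> the verb reading wins
```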

Hidden Markov Models

- What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)
- Let’s just spend a bit of time tying this into the model
- In order to define HMM, we will first introduce the Markov Chain, or observable Markov Model.

Definitions

- A weighted finite-state automaton adds probabilities to the arcs
  - The probabilities of the arcs leaving any state must sum to one

- A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through
- Markov chains can't represent inherently ambiguous problems
  - But they are useful for assigning probabilities to unambiguous sequences

Markov chain = “First-order observed Markov Model”

- a set of states
- Q = q1, q2…qN; the state at time t is qt

- a set of transition probabilities:
- a set of probabilities A = a01a02…an1…ann.
- Each aij represents the probability of transitioning from state i to state j
- The set of these is the transition probability matrix A

- Distinguished start and end states
- Special initial probability vector π
  - πi is the probability that the Markov model will start in state i; each πi expresses the probability p(qi|START)


Markov Chain for weather: Example 1

- three types of weather: sunny, rainy, foggy
- we want to find the following conditional probabilities:
P(qn|qn-1, qn-2, …, q1)

- I.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days

- We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences

Problem: the larger n is, the more observations we must collect.

Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories.

Markov chain = "First-order observed Markov Model"

- Therefore, we make a simplifying assumption, called the (first-order) Markov assumption: for a sequence of observations q1, …, qn, the current state only depends on the previous state:

\[ P(q_n \mid q_{n-1}, \ldots, q_1) \approx P(q_n \mid q_{n-1}) \]

- The joint probability of certain past and current observations then factors as:

\[ P(q_1, \ldots, q_n) = \prod_{i=1}^{n} P(q_i \mid q_{i-1}) \]

Markov chain = "First-order observable Markov Model"

- Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy?
- Using the Markov assumption and the probabilities in table 1 (the weather transition table, not reproduced here), this translates into:

\[ P(q_2 = \text{sunny}, q_3 = \text{rainy} \mid q_1 = \text{sunny}) = P(\text{sunny} \mid \text{sunny}) \cdot P(\text{rainy} \mid \text{sunny}) \]

The weather figure: specific example

[Figure: Markov chain for weather, Example 2]

Markov chain for weather

- What is the probability of 4 consecutive rainy days?
- The sequence is rainy-rainy-rainy-rainy
- I.e., the state sequence is 3-3-3-3
- P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
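The same computation in a couple of lines of code (0.2 and 0.6 are the slide's figures):

```python
pi_rainy, a_rainy_rainy = 0.2, 0.6
p_four_rainy_days = pi_rainy * a_rainy_rainy ** 3
print(p_four_rainy_days)   # 0.0432
```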

Hidden Markov Model

- For Markov chains, the output symbols are the same as the states.
- See sunny weather: we’re in state sunny

- But in part-of-speech tagging (and other things)
- The output symbols are words
- But the hidden states are part-of-speech tags

- So we need an extension!
- A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
- This means we don’t know which state we are in.

Hidden Markov Models

- States Q = q1, q2…qN;
- Observations O = o1, o2…oN;
- Each observation is a symbol from a vocabulary V = {v1,v2,…vV}

- Transition probabilities (prior)
- Transition probability matrix A = {aij}

- Observation likelihoods (likelihood)
- Output probability matrix B={bi(ot)}
a set of observation likelihoods, each expressing the probability of an observation ot being generated from a state i, emission probabilities

- Special initial probability vector π
  - πi is the probability that the HMM will start in state i; each πi expresses the probability p(qi|START)

Assumptions

- Markov assumption: the probability of a particular state depends only on the previous state
- Output-independence assumption: the probability of an output observation depends only on the state that produced that observation

HMM for Ice Cream

- You are a climatologist in the year 2799
- Studying global warming
- You can’t find any records of the weather in Boston, MA for summer of 2007
- But you find Jason Eisner’s diary
- Which lists how many ice creams Jason ate every day that summer
- Our job: figure out how hot it was

The task

- Given:
  - the ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3, … (cf. the output symbols)
- Produce:
  - the weather sequence: C, C, H, C, C, C, H, … (cf. the hidden, causing states)

HMM Taggers

- Two kinds of probabilities:
  - A: transition probabilities (the prior)
  - B: observation likelihoods (the likelihood)

- HMM taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability

The A matrix for the POS HMM

[Figure: table of tag transition probabilities]

The B matrix for the POS HMM

[Figure: table of word likelihoods per tag]

HMM Taggers

- The probabilities are trained on hand-labeled training corpora (training set)
- Combine different N-gram levels
- Evaluated by comparing their output from a test set to human labels for that test set (Gold Standard)

The Viterbi Algorithm

- best tag sequence for "John likes to fish in the sea"?
- efficiently computes the most likely state sequence given a particular output sequence
- based on dynamic programming

A smaller example

[Figure: a small weighted automaton with states q and r (plus start and end states), transition probabilities, and emission probabilities for the symbols a and b]

- What is the best sequence of states for the input string “bbba”?
- Computing all possible paths and finding the one with the max probability is exponential

A smaller example (cont'd)

- For each state, store the most likely sequence that could lead to it (and its probability)
- Path probability matrix:
  - An array of states versus time (tags versus words)
  - It stores the probability of being in each state at each time, in terms of the probabilities of being in each state at the preceding time.

The Viterbi Algorithm

Intuition

- The value in each cell is computed by taking the MAX over all paths that lead to this cell.
- An extension of a path from state i at time t-1 is computed by multiplying (see the sketch below):
  - the previous path probability from the previous cell, viterbi[t-1, i]
  - the transition probability aij from previous state i to current state j
  - the observation likelihood bj(ot) that current state j matches observation symbol ot
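Putting the intuition into code: a compact Viterbi decoder over a toy HMM. The hot/cold ice-cream numbers below are illustrative stand-ins, not values from the slides:

```python
import numpy as np

states = ["HOT", "COLD"]
pi = np.array([0.8, 0.2])               # initial probabilities (assumed)
A = np.array([[0.7, 0.3],               # transitions from HOT
              [0.4, 0.6]])              # transitions from COLD
B = np.array([[0.2, 0.4, 0.4],          # P(1|HOT), P(2|HOT), P(3|HOT)
              [0.5, 0.4, 0.1]])         # P(1|COLD), P(2|COLD), P(3|COLD)
obs = [2, 0, 2]                         # ice creams eaten: 3, 1, 3 (0-indexed)

T, N = len(obs), len(states)
v = np.zeros((T, N))                    # path probability matrix (trellis)
back = np.zeros((T, N), dtype=int)      # backpointers
v[0] = pi * B[:, obs[0]]
for t in range(1, T):
    for j in range(N):
        scores = v[t - 1] * A[:, j] * B[j, obs[t]]  # extend every path to j
        back[t, j] = scores.argmax()
        v[t, j] = scores.max()

best = [int(v[-1].argmax())]            # best final state, then walk back
for t in range(T - 1, 0, -1):
    best.append(int(back[t, best[-1]]))
print([states[i] for i in reversed(best)])   # ['HOT', 'HOT', 'HOT'] here
```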

Viterbi example

[Figure: Viterbi trellis for the example]

Smoothing of probabilities

- Data sparseness is a problem when estimating probabilities based on corpus data.
- The "add one" smoothing technique:

\[ P = \frac{C + 1}{N + B} \]

- C: absolute frequency
- N: number of training instances
- B: number of different types

- Linear interpolation methods can compensate for data sparseness with higher order models. A common method is interpolating trigrams, bigrams and unigrams:

\[ P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2) \]

- The lambda values are automatically determined using a variant of the Expectation Maximization algorithm.
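A direct transcription of the two techniques (a sketch; the lambda values shown are placeholders, since in practice they are learned):

```python
def add_one(c, n, b):
    """'Add one' smoothing: (C + 1) / (N + B)."""
    return (c + 1) / (n + b)

def interpolated(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram, bigram, and trigram estimates.
    The lambdas must sum to 1; these are arbitrary placeholders."""
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

print(add_one(0, 10_000, 45))         # an unseen event still gets some mass
print(interpolated(0.01, 0.2, 0.0))   # unseen trigram, backed off smoothly
```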

Possible improvements

- In bigram POS tagging, we condition a tag only on the preceding tag
- Why not...
  - use more context (e.g., a trigram model)? This is more precise:
    - "is clearly marked" --> verb, past participle
    - "he clearly marked" --> verb, past tense
  - combine trigram, bigram, and unigram models
  - condition on words too
- But with an n-gram approach, this is too costly (too many parameters to model)

Further issues with Markov Model tagging

- Unknown words are a problem since we don’t have the required probabilities. Possible solutions:
- Assign the word probabilities based on corpus-wide distribution of POS
- Use morphological cues (capitalization, suffix) to assign a more calculated guess.

- Using higher order Markov models:
- Using a trigram model captures more context
- However, data sparseness is much more of a problem.

TnT

- Efficient statistical POS tagger developed by Thorsten Brants, ANLP-2000
- Underlying model: trigram modelling
  - The probability of a POS tag only depends on its two preceding POS tags
  - The probability of a word appearing at a particular position, given that its POS occurs at that position, is independent of everything else.

Training

- Maximum likelihood estimates:

\[ \hat{P}(t_3) = \frac{f(t_3)}{N} \qquad \hat{P}(t_3 \mid t_2) = \frac{f(t_2, t_3)}{f(t_2)} \qquad \hat{P}(t_3 \mid t_1, t_2) = \frac{f(t_1, t_2, t_3)}{f(t_1, t_2)} \]

- Smoothing: a context-independent variant of linear interpolation:

\[ P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2), \quad \lambda_1 + \lambda_2 + \lambda_3 = 1 \]

Smoothing algorithm

- Set λ1 = λ2 = λ3 = 0
- For each trigram t1 t2 t3 with f(t1,t2,t3) > 0:
  - Depending on the maximum of the following three values:
    - Case (f(t1,t2,t3) - 1) / (f(t1,t2) - 1): increment λ3 by f(t1,t2,t3)
    - Case (f(t2,t3) - 1) / (f(t2) - 1): increment λ2 by f(t1,t2,t3)
    - Case (f(t3) - 1) / (N - 1): increment λ1 by f(t1,t2,t3)
- Normalize the λi (a code sketch follows below)
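A sketch of this lambda-estimation loop (after Brants's TnT description; the tiny count table at the bottom is hypothetical):

```python
from collections import Counter

def estimate_lambdas(tri, N):
    """Deleted-interpolation estimate of (lambda1, lambda2, lambda3).
    `tri` maps (t1, t2, t3) -> corpus frequency; N = number of tag tokens."""
    f12, f23, f2, f3 = Counter(), Counter(), Counter(), Counter()
    for (t1, t2, t3), f in tri.items():
        f12[(t1, t2)] += f; f23[(t2, t3)] += f; f2[t2] += f; f3[t3] += f
    lam = [0.0, 0.0, 0.0]
    for (t1, t2, t3), f in tri.items():
        cases = [
            (f3[t3] - 1) / (N - 1),                                     # unigram
            (f23[(t2, t3)] - 1) / (f2[t2] - 1) if f2[t2] > 1 else 0,    # bigram
            (f - 1) / (f12[(t1, t2)] - 1) if f12[(t1, t2)] > 1 else 0,  # trigram
        ]
        lam[cases.index(max(cases))] += f   # credit the winning order
    total = sum(lam)
    return [x / total for x in lam]

toy = {("DT", "JJ", "NN"): 8, ("DT", "NN", "VB"): 5, ("IN", "DT", "NN"): 7}
print(estimate_lambdas(toy, N=20))
```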

Evaluation of POS taggers

- Compared with a gold standard of human performance
- Metric:
  - accuracy = % of tags that are identical to the gold standard (see the snippet below)

- Most taggers reach ~96-97% accuracy
- Must compare accuracy to:
  - ceiling (best possible results)
    - How do human annotators score compared to each other? (96-97%)
    - So systems are not bad at all!
  - baseline (worst possible results)
    - What if we take the most-likely tag (unigram model) regardless of previous tags? (90-91%)
    - So anything less is really bad
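The metric itself is a one-liner:

```python
def tagging_accuracy(predicted, gold):
    """Fraction of positions where the predicted tag matches the gold tag."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

print(tagging_accuracy(["DT", "NN", "VB"], ["DT", "NN", "NN"]))  # 0.666...
```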

More on tagger accuracy

- is 95% good?
- that’s 5 mistakes every 100 words
- if on average, a sentence is 20 words, that’s 1 mistake per sentence

- when comparing tagger accuracy, beware of:
- size of training corpus
- the bigger, the better the results

- difference between training & testing corpora (genre, domain…)
- the closer, the better the results

- size of tag set
  - prediction versus classification
- unknown words
  - the more unknown words (not in the dictionary), the worse the results

Error Analysis

- Look at a confusion matrix (contingency table)
- E.g. 4.4% of the total errors caused by mistagging VBD as VBN
- See what errors are causing problems
- Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
- Adverb (RB) vs Particle (RP) vs Prep (IN)
- Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)

- ERROR ANALYSIS IS ESSENTIAL!!!

Tag indeterminacy

Major difficulties in POS tagging

- Unknown words (proper names)
- because we do not know the set of tags it can take
- and knowing this takes you a long way (cf. baseline POS tagger)
- possible solutions:
  - assign all possible tags, with a probability distribution identical to the lexicon as a whole
  - use morphological cues to infer possible tags
    - e.g., words ending in -ed are likely to be past tense verbs or past participles

- Frequently confused tag pairs
  - preposition vs particle: <running> <up> a hill (prep) / <running up> a bill (particle)
  - verb, past tense vs. past participle vs. adjective

Unknown Words

- Most-frequent-tag approach.
- What about words that don’t appear in the training set?
- Suffix analysis:
- The probability distribution for a particular suffix is generated from all words in the training set that share the same suffix.

- Suffix estimation – Calculate the probability of a tag t given the last i letters of an n letter word.
- Smoothing: successive abstraction through sequences of increasingly more general contexts (i.e., omit more and more characters of the suffix)
- Use a morphological analyzer to get the restriction on the possible tags.
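A sketch of the suffix-analysis idea (illustrative; real implementations such as TnT's also smooth across successively shorter suffixes):

```python
from collections import Counter, defaultdict

def build_suffix_guesser(tagged_words, suffix_len=3):
    """P(tag | last `suffix_len` letters), pooled over the training set."""
    by_suffix = defaultdict(Counter)
    for word, tag in tagged_words:
        by_suffix[word[-suffix_len:]][tag] += 1

    def p_tags(word):
        dist = by_suffix[word[-suffix_len:]]
        total = sum(dist.values())
        return {t: c / total for t, c in dist.items()} if total else {}
    return p_tags

guess = build_suffix_guesser(
    [("walked", "VBD"), ("talked", "VBD"), ("marked", "VBN")])
print(guess("gracked"))   # {'VBD': 0.666..., 'VBN': 0.333...}
```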


Different Models for POS tagging

- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields

Dependency (1st order)

Disadvantage of HMMs (1)

- No rich feature information
  - Rich information is required
    - when xk is complex
    - when the data for xk is sparse
- Example: POS tagging
  - How do we evaluate P(wk|tk) for unknown words wk?
  - Useful features:
    - suffix, e.g., -ed, -tion, -ing, etc.
    - capitalization
- Generative model
  - Parameter estimation: maximize the joint likelihood of training examples

Generative Models

- Hidden Markov models (HMMs) and stochastic grammars
  - Assign a joint probability to paired observation and label sequences
  - The parameters are typically trained to maximize the joint likelihood of training examples

Generative Models (cont'd)

- Difficulties and disadvantages
- Need to enumerate all possible observation sequences
- Not practical to represent multiple interacting features or long-range dependencies of the observations
- Very strict independence assumptions on the observations

Better Approach

- A discriminative model which models P(y|x) directly
- Maximize the conditional likelihood of the training examples

Maximum Entropy modeling

- N-gram model: probabilities depend on the previous few tokens.
- We may identify a more heterogeneous set of features which contribute in some way to the choice of the current word (whether it is the first word in a story, whether the next word is to, whether one of the last 5 words is a preposition, etc.)
- Maxent combines these features in a probabilistic model.
- The given features provide constraints on the model.
- We would like a probability distribution which, outside of these constraints, is as uniform as possible, i.e. has the maximum entropy among all models that satisfy the constraints.

Maximum Entropy Markov Model

- Discriminative sub-models
  - Unify the two parameters of the generative model into one conditional model
    - Two parameters in the generative model: the source-model parameter and the noisy-channel parameter
    - Unified into a single conditional model
  - Employ the maximum entropy principle

General Maximum Entropy Principle

- Model
  - Model the distribution P(Y|X) with a set of features {f1, f2, …, fl} defined on X and Y
- Idea
  - Collect statistics about the features from the training data
- Principle
  - Model what is known
  - Assume nothing else
    - Choose the flattest distribution, i.e. the distribution with the maximum entropy

Example

- (Berger et al., 1996) example
  - Model the translation of the word "in" from English to French
    - Need to model P(wordFrench)
    - Constraints:
      - 1: possible translations: dans, en, à, au cours de, pendant
      - 2: "dans" or "en" is used 30% of the time
      - 3: "dans" or "à" is used 50% of the time
    - (a numerical sketch follows below)
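The constrained maximum-entropy distribution can be found numerically; a sketch using scipy (the optimizer choice and starting point are my own, not from the paper):

```python
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "a", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # proper distribution
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans)+p(en) = 0.3
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},  # p(dans)+p(a)  = 0.5
]
result = minimize(neg_entropy, x0=np.full(5, 0.2),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
print(dict(zip(words, result.x.round(3))))
# Within the constraints, probability is spread as evenly as possible.
```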

Features

- Features
  - 0-1 indicator functions
    - 1 if (x, y) satisfies a predefined condition
    - 0 if not
- Example for POS tagging: e.g., f(x, y) = 1 if x ends in -ing and y = VBG, 0 otherwise

Constraints

- Empirical information
  - Statistics from the training data T:

\[ E_{\tilde{p}}[f_i] = \frac{1}{|T|} \sum_{(x,y) \in T} f_i(x, y) \]

- Expected value
  - Under the distribution P(Y|X) we want to model:

\[ E_p[f_i] = \sum_{x} \tilde{p}(x) \sum_{y} P(y \mid x)\, f_i(x, y) \]

- Constraints: the two must match, \( E_p[f_i] = E_{\tilde{p}}[f_i] \)

Dual Problem

- Dual Problem
- Conditional model
- Maximum likelihood of conditional data

- Solution
- Improved iterative scaling (IIS) (Berger et al. 1996)
- Generalized iterative scaling (GIS) (McCallum et al. 2000)

Maximum Entropy Markov Model

- Use the maximum entropy approach to model
  - the 1st-order dependency
- Features
  - Basic features (like the parameters in an HMM)
    - Bigram (1st order) or trigram (2nd order) features in the source model
    - State-output pair features (Xk = xk, Yk = yk)
  - Advantage: other advanced features on (xk, yk) can be incorporated

Performance in POS Tagging

- POS tagging
  - Data set: WSJ
  - Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
- Results (Lafferty et al. 2001)
  - 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
  - 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy

ME applications

- Part of Speech (POS) Tagging (Ratnaparkhi, 1996)
- P(POS tag | context)
- Information sources
- Word window (4)
- Word features (prefix, suffix, capitalization)
- Previous POS tags

ME applications

- Abbreviation expansion (Pakhomov, 2002)
  - Information sources
    - Word window (4)
    - Document title
- Word Sense Disambiguation (WSD) (Chao & Dyer, 2002)
  - Information sources
    - Word window (4)
    - Structurally related words (4)
- Sentence Boundary Detection (Reynar & Ratnaparkhi, 1997)
  - Information sources
    - Token features (prefix, suffix, capitalization, abbreviation)
    - Word window (2)

Solution

- Global Optimization
- Optimize parameters in a global model simultaneously, not in sub models separately

- Alternatives
- Conditional random fields
- Application of perceptron algorithm

Why ME?

- Advantages
  - Combine multiple knowledge sources
    - Local
      - Word prefix, suffix, capitalization (POS (Ratnaparkhi, 1996))
      - Word POS, POS class, suffix (WSD (Chao & Dyer, 2002))
      - Token prefix, suffix, capitalization, abbreviation (sentence boundary (Reynar & Ratnaparkhi, 1997))
    - Global
      - N-grams (Rosenfeld, 1997)
      - Word window
      - Document title (Pakhomov, 2002)
      - Structurally related words (Chao & Dyer, 2002)
      - Sentence length, conventional lexicon (Och & Ney, 2002)
  - Combine dependent knowledge sources

Why ME?

- Advantages
  - Add additional knowledge sources
  - Implicit smoothing
- Disadvantages
  - Computational cost
    - Expected values computed at each iteration
    - Normalizing constant
  - Overfitting
    - Feature selection
      - Cutoffs
      - Basic feature selection (Berger et al., 1996)

Conditional Models

- Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
  - Specify the probability of possible label sequences given an observation sequence
- Allow arbitrary, non-independent features on the observation sequence X
- The probability of a transition between labels may depend on past and future observations
  - Relax the strong independence assumptions made in generative models

Discriminative Models: Maximum Entropy Markov Models (MEMMs)

- Exponential model
- Given a training set X with label sequence Y:
  - Train a model θ that maximizes P(Y|X, θ)
  - For a new data sequence x, the predicted label y maximizes P(y|x, θ)
  - Notice the per-state normalization

MEMMs (cont'd)

- MEMMs have all the advantages of Conditional Models
- Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states (“conservation of score mass”)
- Subject to Label Bias Problem
- Bias toward states with fewer outgoing transitions

Label Bias Problem

- Consider this MEMM:

[Figure: a simple MEMM in which state 1 has a single outgoing transition, to state 2]

- P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
- P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
- Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
  - In the training data, label value 2 is the only label value observed after label value 1
  - Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
- However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
- Per-state normalization does not allow the required expectation.

Solve the Label Bias Problem

- Change the state-transition structure of the model
  - Not always practical to change the set of states
- Start with a fully-connected model and let the training procedure figure out a good structure
  - This precludes the use of prior structural knowledge, which is very valuable (e.g., in information extraction)

Random Field

Conditional Random Fields (CRFs)

- CRFs have all the advantages of MEMMs without the label bias problem
  - An MEMM uses per-state exponential models for the conditional probabilities of next states given the current state
  - A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
- Undirected acyclic graph
  - Allows some transitions to "vote" more strongly than others, depending on the corresponding observations

Definition of CRFs

X is a random variable over data sequences to be labeled

Y is a random variable over corresponding label sequences

Example of CRFs

Graphical comparison among HMMs, MEMMs and CRFs

[Figure: graphical structures of an HMM, an MEMM, and a CRF]

If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields, is:

\[ p_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big) \]

x is a data sequence

y is a label sequence

v is a vertex from vertex set V = set of label random variables

e is an edge from edge set E over V

fk and gk are given and fixed; gk is a Boolean vertex feature, fk is a Boolean edge feature

k ranges over the features

λk and μk are the parameters to be estimated

y|e is the set of components of y defined by edge e

y|v is the set of components of y defined by vertex v

Conditional Distribution (cont'd)

- CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

\[ p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big) \]

Z(x) is a normalization constant over the data sequence x.

Parameter Estimation for CRFs

- The paper provides iterative scaling algorithms
  - These turn out to be very inefficient
- Prof. Dietterich's group applied a gradient descent algorithm, which is quite efficient

Training of CRFs (From Prof. Dietterich)

- Take the derivative of the log-likelihood: for each weight λk, the gradient is the empirical count of feature fk minus its expected count under the model (and the latter involves Z(x)).
- For training, the first two items are easy to get.
  - For example, for each λk, fk evaluates to a sequence of Boolean values along a training sequence, such as 00101110100111; its empirical count is just the total number of 1's in the sequence.
- The hardest thing is how to calculate Z(x); a forward-algorithm sketch follows below.
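A sketch of computing Z(x) with the forward algorithm for a linear-chain CRF (the single transition and emission weight matrices are a simplified, hypothetical parameterization, not the paper's exact feature set):

```python
import numpy as np

def log_Z(x, W_trans, W_emit):
    """log Z(x) for a linear-chain CRF, via the forward algorithm in
    log space. W_trans[i, j]: weight of label i -> label j;
    W_emit[j, o]: weight of label j emitting observation o."""
    alpha = W_emit[:, x[0]].astype(float)          # scores after position 0
    for t in range(1, len(x)):
        # log-sum-exp over predecessor labels, then add emission weights
        alpha = np.logaddexp.reduce(alpha[:, None] + W_trans, axis=0) \
                + W_emit[:, x[t]]
    return float(np.logaddexp.reduce(alpha))

# Toy check: 2 labels, 3 observation symbols, random weights.
rng = np.random.default_rng(0)
print(log_Z([0, 2, 1], rng.normal(size=(2, 2)), rng.normal(size=(2, 3))))
```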

Training of CRFs (From Prof. Dietterich) (cont'd)

[Figure: a linear chain y1 - y2 - y3 - y4 whose maximal cliques c1, c2, c3 span adjacent pairs of labels]

POS tagging Experiments

POS tagging Experiments (cont'd)

- Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
- Each word in a given input sentence must be labeled with one of 45 syntactic tags
- Added a small set of orthographic features: whether a spelling begins with a number or upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
- oov = out-of-vocabulary (not observed in the training set)

Summary

- Per-state normalized discriminative models such as MEMMs are prone to the label bias problem
- CRFs provide the benefits of discriminative models
- CRFs solve the label bias problem well and demonstrate good performance
