Parts of speech

Parts of Speech

Sudeshna Sarkar

7 Aug 2008

Why Do We Care about Parts of Speech?

  • Pronunciation

  • Hand me the lead pipe.

  • Predicting what words can be expected next

    • Personal pronoun (e.g., I, she) ____________

  • Stemming

    • -s means singular for verbs, plural for nouns

  • As the basis for syntactic parsing and then meaning extraction

    • I will lead the group into the lead smelter.

  • Machine translation

    • (E) content +N → (F) contenu +N

    • (E) content +Adj → (F) content +Adj or satisfait +Adj

    What is a Part of Speech?

    Is this a semantic distinction? For example, maybe Noun is the class of words for people, places and things. Maybe Adjective is the class of words for properties of nouns.

    Consider: green book

    book is a Noun

    green is an Adjective

    Now consider: book worm (here book modifies another noun, the way an adjective would)

    This green is very soothing. (here green is used as a noun)

    So the classes are defined by how words behave, not by meaning alone.

    How Many Parts of Speech Are There?

    • A first cut at the easy distinctions:

      • Open classes:

        • nouns, verbs, adjectives, adverbs

      • Closed classes: function words

        • conjunctions: and, or, but

        • pronouns: I, she, him

        • prepositions: with, on

        • determiners: the, a, an

    Part of speech tagging l.jpg
    Part of speech tagging

    • 8 (ish) traditional parts of speech

      • Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.

      • This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)

      • Called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS

      • We’ll use POS most frequently

      • I’ll assume that you all know what these are

    POS examples

    • N noun chair, bandwidth, pacing

    • V verb study, debate, munch

    • ADJ adj purple, tall, ridiculous

    • ADV adverb unfortunately, slowly

    • P preposition of, by, to

    • PRO pronoun I, me, mine

    • DET determiner the, a, that, those

    Tagsets

    Brown corpus tagset (87 tags):

    Penn Treebank tagset (45 tags): (8.6)

    C7 tagset (146 tags)

    POS Tagging: Definition

    • The process of assigning a part-of-speech or lexical class marker to each word in a corpus:

    POS Tagging example

    WORD tag

    the DET

    koala N

    put V

    the DET

    keys N

    on P

    the DET

    table N

    POS tagging: Choosing a tagset

    • There are many parts of speech and many potential distinctions we could draw

    • To do POS tagging, need to choose a standard set of tags to work with

    • Could pick very coarse tagsets

      • N, V, Adj, Adv.

    • More commonly used set is finer grained, the “UPenn TreeBank tagset”, 45 tags

      • PRP$, WRB, WP$, VBG

    • Even more fine-grained tagsets exist

    Using the UPenn tagset

    • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

    • Prepositions and subordinating conjunctions marked IN (“although/IN I/PRP..”)

    • Except the preposition/complementizer “to”, which simply gets its own tag, TO.
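Tagged text in this slash format is easy to work with programmatically. A minimal sketch (the helper name is mine, not from the slides):

```python
# Split Penn Treebank-style "word/TAG" tokens into (word, tag) pairs.
# rsplit on the last slash so the token "./." parses correctly.

def parse_tagged(text):
    return [tuple(tok.rsplit("/", 1)) for tok in text.split()]

sentence = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
            "number/NN of/IN other/JJ topics/NNS ./.")
print(parse_tagged(sentence)[:3])  # [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN')]
```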

    POS Tagging

    • Words often have more than one POS: back

      • The back door = JJ

      • On my back = NN

      • Win the voters back = RB

      • Promised to back the bill = VB

    • The POS tagging problem is to determine the POS tag for a particular instance of a word.

    Algorithms for POS Tagging

    • Ambiguity – In the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags):

    • Worse, 40% of the tokens are ambiguous.

    Algorithms for POS Tagging

    • Why can’t we just look them up in a dictionary?

      • Words that aren’t in the dictionary

    • One idea: P(ti| wi) = the probability that a random hapax legomenon in the corpus has tag ti.

      • Nouns are more likely than verbs, which are more likely than pronouns.

    • Another idea: use morphology.

    Algorithms for pos tagging knowledge l.jpg
    Algorithms for POS Tagging - Knowledge

    • Dictionary

    • Morphological rules, e.g.,

      • _____-tion

      • _____-ly

      • capitalization

    • N-gram frequencies

      • to _____

      • DET _____ N

      • But what about rare words, e.g., smelt (two verb readings: to melt ore, and the past tense of smell; and one noun reading, a small fish)?

    • Combining these

      • V _____-ing I was gracking vs. Gracking is fun.

    POS Tagging - Approaches

    • Approaches

      • Rule-based tagging

        • (ENGTWOL)

      • Stochastic (=Probabilistic) tagging

        • HMM (Hidden Markov Model) tagging

      • Transformation-based tagging

        • Brill tagger

    • Do we return one best answer or several answers and let later steps decide?

    • How does the requisite knowledge get entered?

    3 methods for POS tagging

    1. Rule-based tagging

    • Example: Karlsson (1995) EngCG tagger, based on the Constraint Grammar architecture and ENGTWOL lexicon

      • Basic Idea:

        • Assign all possible tags to words (morphological analyzer used)

        • Remove wrong tags according to set of constraint rules (typically more than 1000 hand-written constraint rules, but may be machine-learned)
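The assign-then-eliminate idea can be sketched with a toy lexicon and a single illustrative constraint rule (both hypothetical; ENGTWOL uses a large lexicon and over a thousand hand-written rules):

```python
# Toy constraint-based tag elimination (hypothetical mini-lexicon and rule).

LEXICON = {
    "the": {"DET"},
    "can": {"MODAL", "NOUN", "VERB"},
    "fish": {"NOUN", "VERB"},
}

def tag(words):
    # Step 1: assign all possible tags from the lexicon.
    candidates = [set(LEXICON[w]) for w in words]
    # Step 2: remove wrong tags with constraint rules.
    # Illustrative rule: right after an unambiguous determiner,
    # discard MODAL/VERB readings if a NOUN reading survives.
    for i in range(1, len(words)):
        if candidates[i - 1] == {"DET"} and "NOUN" in candidates[i]:
            candidates[i] -= {"MODAL", "VERB"}
    return candidates

print(tag(["the", "can"]))  # [{'DET'}, {'NOUN'}]
```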

    3 methods for POS tagging

    2. Transformation-based tagging

    • Example: Brill (1995) tagger - combination of rule-based and stochastic (probabilistic) tagging methodologies

      • Basic Idea:

        • Start with a tagged corpus + dictionary (with most frequent tags)

        • Set the most probable tag for each word as a start value

        • Change tags according to rules of type “if word-1 is a determiner and word is a verb then change the tag to noun” in a specific order (like rule-based taggers)

        • Machine learning is used: the rules are automatically induced from a previously tagged training corpus (like the stochastic approach)
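A single Brill-style transformation pass might look like this sketch (the most-frequent-tag dictionary and the rule are illustrative, not Brill's induced rules):

```python
# One pass of a Brill-style transformation over start-value tags.
# Hypothetical most-frequent-tag dictionary for a toy vocabulary.
MOST_FREQUENT = {"the": "DET", "can": "VERB", "rusted": "VERB"}

def brill_pass(words):
    tags = [MOST_FREQUENT[w] for w in words]  # most probable tag as start value
    for i in range(1, len(tags)):
        # Rule: if word-1 is a determiner and word is a verb, change to noun.
        if tags[i - 1] == "DET" and tags[i] == "VERB":
            tags[i] = "NOUN"
    return tags

print(brill_pass(["the", "can", "rusted"]))  # ['DET', 'NOUN', 'VERB']
```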

    3 methods for POS tagging

    3. Stochastic (=Probabilistic) tagging

    • Example: HMM (Hidden Markov Model) tagging - a training corpus used to compute the probability (frequency) of a given word having a given POS tag in a given context

    Hidden Markov Model (HMM) Tagging

    • Using an HMM to do POS tagging

    • HMM is a special case of Bayesian inference

    • It is also related to the “noisy channel” model in ASR (Automatic Speech Recognition)


    Hidden Markov Model (HMM) Taggers

    • Goal: maximize P(word|tag) × P(tag|previous n tags)

    • P(word|tag)

      • word/lexical likelihood

      • probability that given this tag, we have this word

      • NOT probability that this word has this tag

      • modeled through language model (word-tag matrix)

    • P(tag|previous n tags)

      • tag sequence likelihood

      • probability that this tag follows these previous tags

      • modeled through language model (tag-tag matrix)

    Lexical information

    Syntagmatic information

    POS tagging as a sequence classification task

    • We are given a sentence (an “observation” or “sequence of observations”)

      • Secretariat is expected to race tomorrow

      • sequence of n words w1…wn.

    • What is the best sequence of tags which corresponds to this sequence of observations?

    • Probabilistic/Bayesian view:

      • Consider all possible sequences of tags

      • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.

    Getting to HMM

    • Let T = t1,t2,…,tn

    • Let W = w1,w2,…,wn

    • Goal: Out of all sequences of tags t1…tn, get the most probable sequence of POS tags T underlying the observed sequence of words w1,w2,…,wn:

      T^ = argmaxT P(T|W)

    • Hat ^ means “our estimate of the best = the most probable tag sequence”

    • argmaxx f(x) means “the x such that f(x) is maximized”

      here, the T that maximizes P(T|W) is our estimate of the best tag sequence

    Getting to HMM

    • This equation is guaranteed to give us the best tag sequence

    • But how do we make it operational? How do we compute this value?

    • Intuition of Bayesian classification:

      • Use Bayes rule to transform it into a set of other probabilities that are easier to compute

      • Thomas Bayes: British mathematician (1702-1761)

    Bayes Rule

    Breaks down any conditional probability P(x|y) into three other probabilities:

    P(x|y) = P(y|x) P(x) / P(y)

    P(x|y): the conditional probability of an event x assuming that y has occurred

    Bayes Rule

    We can drop the denominator P(W): it does not change across tag sequences, since we are comparing tag sequences for the same observation, the same fixed set of words:

    T^ = argmaxT P(W|T) P(T)

    Likelihood and prior Further Simplifications

    1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it:

       P(W|T) ≈ ∏i P(wi|ti)

    2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag:

       P(T) ≈ ∏i P(ti|ti-1)

    3. The most probable tag sequence estimated by the bigram tagger:

       T^ = argmaxT ∏i P(wi|ti) P(ti|ti-1)

    Likelihood and prior Further Simplifications

    2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag

    Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-gram.

    Bigrams are used as the basis for simple statistical analysis of text

    The bigram assumption is related to the first-order Markov assumption
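Extracting word bigrams is a one-liner; a minimal sketch:

```python
# Word bigrams: consecutive pairs from a token sequence,
# the special case of n-grams with n = 2.

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

print(bigrams(["the", "koala", "put", "the", "keys"]))
# [('the', 'koala'), ('koala', 'put'), ('put', 'the'), ('the', 'keys')]
```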

    Likelihood and prior Further Simplifications

    3. The most probable tag sequence estimated by the bigram tagger:

       T^ = argmaxT ∏i P(wi|ti) P(ti|ti-1)   (bigram assumption)

    Two kinds of probabilities (1)

    • Tag transition probabilities p(ti|ti-1)

      • Determiners likely to precede adjs and nouns

        • That/DT flight/NN

        • The/DT yellow/JJ hat/NN

        • So we expect P(NN|DT) and P(JJ|DT) to be high

        • But we expect P(DT|JJ) to be low

    Two kinds of probabilities (1)

    • Tag transition probabilities p(ti|ti-1)

      • Compute P(NN|DT) by counting in a labeled corpus:

        P(NN|DT) = C(DT, NN) / C(DT), i.e., the number of times DT is followed by NN, divided by the number of times DT occurs
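The counting estimate can be sketched over a toy labeled corpus (the sentence below is made up; a real estimate would count over the Brown corpus or similar):

```python
from collections import Counter

# Estimate the tag transition probability P(NN|DT) by counting
# tag bigrams and tag unigrams in a tiny labeled corpus.
tagged = [("the", "DT"), ("koala", "NN"), ("put", "VB"),
          ("the", "DT"), ("keys", "NN"), ("on", "IN"),
          ("the", "DT"), ("yellow", "JJ"), ("table", "NN")]

tags = [t for _, t in tagged]
bigram_counts = Counter(zip(tags, tags[1:]))
unigram_counts = Counter(tags)

p_nn_given_dt = bigram_counts[("DT", "NN")] / unigram_counts["DT"]
print(p_nn_given_dt)  # 2 of the 3 DTs are followed by NN -> 2/3
```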

    Two kinds of probabilities (2)

    • Word likelihood probabilities p(wi|ti)

      • P(is|VBZ) = probability of VBZ (3sg Pres verb) being “is”

      • Compute P(is|VBZ) by counting in a labeled corpus:

        P(is|VBZ) = C(VBZ, is) / C(VBZ)

    If we were expecting a third person singular verb, how likely is it that this verb would be is?

    An Example: the verb “race”

    • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR

    • People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

    • How do we pick the right tag?

    Disambiguating “race”

    • P(NN|TO) = .00047

    • P(VB|TO) = .83

      The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: ‘How likely are we to expect verb/noun given the previous tag TO?’

    • P(race|NN) = .00057

    • P(race|VB) = .00012

      Lexical likelihoods from the Brown corpus for ‘race’ given a POS tag NN or VB.

    • P(NR|VB) = .0027

    • P(NR|NN) = .0012

      tag sequence probability for the likelihood of an adverb occurring given the previous tag verb or noun

    • P(VB|TO)P(NR|VB)P(race|VB) = .00000027

    • P(NN|TO)P(NR|NN)P(race|NN)=.00000000032

      Multiply the lexical likelihoods with the tag sequence probabilities: the verb wins
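Reproducing the arithmetic above with the Brown-corpus numbers from the slide:

```python
# Compare the two readings of "race" after TO, followed by NR:
verb = 0.83 * 0.0027 * 0.00012     # P(VB|TO) * P(NR|VB) * P(race|VB)
noun = 0.00047 * 0.0012 * 0.00057  # P(NN|TO) * P(NR|NN) * P(race|NN)

print(f"verb: {verb:.2e}, noun: {noun:.2e}")
assert verb > noun  # the verb reading wins by about three orders of magnitude
```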

    Hidden Markov Models

    • What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)

    • Let’s just spend a bit of time tying this into the model

    • In order to define HMM, we will first introduce the Markov Chain, or observable Markov Model.

    Definitions

    • A weighted finite-state automaton adds probabilities to the arcs

      • The probabilities on the arcs leaving any state must sum to one

    • A Markov chain is a special case of a weighted FSA in which the input sequence uniquely determines which states the automaton will go through

    • Markov chains can’t represent inherently ambiguous problems

      • Useful for assigning probabilities to unambiguous sequences

    Markov chain = “First-order observed Markov Model”

    • a set of states

      • Q = q1, q2…qN; the state at time t is qt

    • a set of transition probabilities:

      • a set of probabilities A = a01, a02, …, an1, …, ann

      • Each aij represents the probability of transitioning from state i to state j

      • The set of these is the transition probability matrix A

    • Distinguished start and end states

      Special initial probability vector π

      πi: the probability that the MM will start in state i; each πi expresses the probability p(qi|START)

    Markov chain = “First-order observed Markov Model”

    Markov Chain for weather: Example 1

    • three types of weather: sunny, rainy, foggy

    • we want to find the following conditional probabilities:

      P(qn|qn-1, qn-2, …, q1)

      - I.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days

      - We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences

      Problem: the larger n is, the more observations we must collect.

      Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories

    Markov chain = “First-order observed Markov Model”

    • Therefore, we make a simplifying assumption, called the (first-order) Markov assumption:

      for a sequence of observations q1, … qn, the current state depends only on the previous state:

      P(qn|qn-1, …, q1) ≈ P(qn|qn-1)

    • the joint probability of certain past and current observations:

      P(q1, …, qn) = ∏i P(qi|qi-1)

    Markov chain = “First-order observable Markov Model”

    Markov chain = “First-order observed Markov Model”

    • Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy?

    • Using the Markov assumption and the probabilities in table 1, this translates into:

      P(q2=sunny, q3=rainy | q1=sunny) = P(sunny|sunny) · P(rainy|sunny)

    The weather figure: specific example

    • Markov Chain for weather: Example 2

    Markov chain for weather

    • What is the probability of 4 consecutive rainy days?

    • Sequence is rainy-rainy-rainy-rainy

    • I.e., state sequence is 3-3-3-3

    • P(3,3,3,3) =

      • π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
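The same computation in code, with the initial probability of rainy (0.2) and the rainy-to-rainy transition (0.6) taken from the weather example:

```python
pi_rainy = 0.2        # initial probability of starting in state rainy
a_rainy_rainy = 0.6   # self-transition probability rainy -> rainy

# Four consecutive rainy days: one start plus three self-transitions.
p = pi_rainy * a_rainy_rainy ** 3
print(round(p, 4))  # 0.0432
```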

    Hidden Markov Model

    • For Markov chains, the output symbols are the same as the states.

      • See sunny weather: we’re in state sunny

    • But in part-of-speech tagging (and other things)

      • The output symbols are words

      • But the hidden states are part-of-speech tags

    • So we need an extension!

    • A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.

    • This means we don’t know which state we are in.

    Markov chain for words

    Observed events: words

    Hidden events: tags

    Hidden Markov Models

    • States Q = q1, q2…qN;

    • Observations O = o1, o2…oN;

      • Each observation is a symbol from a vocabulary V = {v1,v2,…vV}

    • Transition probabilities (prior)

      • Transition probability matrix A = {aij}

    • Observation likelihoods (likelihood)

      • Output probability matrix B={bi(ot)}

        a set of observation likelihoods, each expressing the probability of an observation ot being generated from a state i, emission probabilities

    • Special initial probability vector π

      πi: the probability that the HMM will start in state i; each πi expresses the probability p(qi|START)


    Assumptions

    • Markov assumption: the probability of a particular state depends only on the previous state

    • Output-independence assumption: the probability of an output observation depends only on the state that produced that observation

    HMM for Ice Cream

    • You are a climatologist in the year 2799

    • Studying global warming

    • You can’t find any records of the weather in Boston, MA for summer of 2007

    • But you find Jason Eisner’s diary

    • Which lists how many ice creams Jason ate each day that summer

    • Our job: figure out how hot it was

    The task

    • Given

      • Ice Cream Observation Sequence: 1,2,3,2,2,2,3…

        (cp. with output symbols)

    • Produce:

      • Weather Sequence: C,C,H,C,C,C,H …

        (cp. with hidden states, causing states)

    Different types of HMM structure

    Ergodic = fully connected: every state has a transition to every other state

    Bakis = left-to-right: transitions only go to the same or a later state

    HMM Taggers

    • Two kinds of probabilities

      • A: transition probabilities (prior)

      • B: observation likelihoods (likelihood)

    • HMM Taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability

    HMM Taggers

    • The probabilities are trained on hand-labeled training corpora (training set)

    • Combine different N-gram levels

    • Evaluated by comparing their output from a test set to human labels for that test set (Gold Standard)

    The Viterbi Algorithm

    • best tag sequence for "John likes to fish in the sea"?

    • efficiently computes the most likely state sequence given a particular output sequence

    • based on dynamic programming

    A smaller example



    • What is the best sequence of states for the input string “bbba”?

    • Computing all possible paths and finding the one with the max probability is exponential

    A smaller example (con’t)

    • For each state, store the most likely sequence that could lead to it (and its probability)

    • Path probability matrix:

      • An array of states versus time (tags versus words)

      • that stores the probability of being in each state at each time, in terms of the probabilities for each state at the preceding time

    Viterbi intuition: we are looking for the best ‘path’






    Slide from Dekang Lin

    Intuition

    • The value in each cell is computed by taking the MAX over all paths that lead to this cell.

    • An extension of a path from state i at time t-1 is computed by multiplying:

      • Previous path probability from previous cell viterbi[t-1,i]

      • Transition probability aij from previous state i to current state j

      • Observation likelihood bj(ot) that current state j matches observation symbol t
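The cell-update intuition above can be sketched as a minimal Viterbi decoder. The toy two-state weather HMM below uses illustrative numbers of my own, not values from the slides:

```python
# Minimal Viterbi following the update rule above:
#   viterbi[t, j] = max_i viterbi[t-1, i] * a[i][j] * b[j](o_t)

def viterbi(obs, states, start_p, trans_p, emit_p):
    # path_prob[s]: probability of the best path ending in state s
    path_prob = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    back = []  # backpointers, one dict per time step after the first
    for o in obs[1:]:
        prev = path_prob
        path_prob, pointers = {}, {}
        for j in states:
            i_best = max(states, key=lambda i: prev[i] * trans_p[i][j])
            path_prob[j] = prev[i_best] * trans_p[i_best][j] * emit_p[j][o]
            pointers[j] = i_best
        back.append(pointers)
    # Follow backpointers from the best final state.
    last = max(states, key=lambda s: path_prob[s])
    seq = [last]
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    return list(reversed(seq))

states = ("HOT", "COLD")
start = {"HOT": 0.8, "COLD": 0.2}
trans = {"HOT": {"HOT": 0.7, "COLD": 0.3}, "COLD": {"HOT": 0.4, "COLD": 0.6}}
emit = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4}, "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

print(viterbi([3, 1, 1], states, start, trans, emit))  # ['HOT', 'COLD', 'COLD']
```

Unlike enumerating all N^T paths, this fills an N-by-T table, so the cost is O(N^2 T).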

    Smoothing of probabilities

    • Data sparseness is a problem when estimating probabilities based on corpus data.

    • The “add one” smoothing technique:

      P = (C + 1) / (N + B)

      C: absolute frequency

      N: number of training instances

      B: number of different types

    • Linear interpolation methods can compensate for data sparseness with higher order models. A common method is interpolating trigrams, bigrams and unigrams:

      P(t3|t1,t2) = λ1 P̂(t3) + λ2 P̂(t3|t2) + λ3 P̂(t3|t1,t2)

    • The lambda values are automatically determined using a variant of the Expectation Maximization algorithm.
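Both techniques are small formulas; a sketch with made-up numbers (the lambda weights here are placeholders, not EM-trained values):

```python
# Add-one smoothing as defined above: P = (C + 1) / (N + B).
def add_one(C, N, B):
    return (C + 1) / (N + B)

# An unseen event (C = 0) still gets a small non-zero probability:
print(add_one(0, 100, 20))  # 1/120

# Linear interpolation of unigram, bigram and trigram estimates.
# The lambdas are assumed to sum to 1; in practice they are set by EM.
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

print(interpolate(0.01, 0.05, 0.2))
```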

    Possible improvements

    • in bigram POS tagging, we condition a tag only on the preceding tag

    • why not...

      • use more context (ex. use trigram model)

        • more precise:

          • “is clearly marked”--> verb, past participle

          • “he clearly marked” -->verb, past tense

        • combine trigram, bigram, unigram models

      • condition on words too

    • but with an n-gram approach, this is too costly (too many parameters to model)

    Further issues with Markov Model tagging

    • Unknown words are a problem since we don’t have the required probabilities. Possible solutions:

      • Assign the word probabilities based on corpus-wide distribution of POS

      • Use morphological cues (capitalization, suffix) to assign a more calculated guess.

    • Using higher order Markov models:

      • Using a trigram model captures more context

      • However, data sparseness is much more of a problem.

    TnT

    • Efficient statistical POS tagger developed by Thorsten Brants, ANLP-2000

    • Underlying model:

      Trigram modelling –

      • The probability of a POS only depends on its two preceding POS

      • The probability of a word appearing at a particular position given that its POS occurs at that position is independent of everything else.

    Training

    • Maximum likelihood estimates:

      unigram: P̂(t3) = f(t3) / N

      bigram: P̂(t3|t2) = f(t2,t3) / f(t2)

      trigram: P̂(t3|t1,t2) = f(t1,t2,t3) / f(t1,t2)

      lexical: P̂(w3|t3) = f(w3,t3) / f(t3)

    Smoothing: context-independent variant of linear interpolation.

    Smoothing algorithm

    • Set λi=0

    • For each trigram t1 t2 t3 with f(t1,t2,t3 )>0

      • Depending on the max of the following three values:

        • Case (f(t1,t2,t3)-1) / f(t1,t2) is the max: increment λ3 by f(t1,t2,t3)

        • Case (f(t2,t3)-1) / f(t2) is the max: increment λ2 by f(t1,t2,t3)

        • Case (f(t3)-1) / (N-1) is the max: increment λ1 by f(t1,t2,t3)

    • Normalize λi

    Evaluation of POS taggers

    • compared with a gold standard of human performance

    • metric:

      • accuracy = % of tags that are identical to gold standard

    • most taggers ~96-97% accuracy

    • must compare accuracy to:

      • ceiling (best possible results)

        • how do human annotators score compared to each other? (96-97%)

        • so systems are not bad at all!

      • baseline (worst possible results)

        • what if we take the most-likely tag (unigram model) regardless of previous tags ? (90-91%)

        • so anything less is really bad
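The accuracy metric is a direct tag-by-tag comparison against the gold standard; a sketch with toy tag sequences:

```python
# Accuracy = fraction of predicted tags identical to the gold standard.
def accuracy(predicted, gold):
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

pred = ["DT", "NN", "VB", "DT", "NN"]
gold = ["DT", "NN", "VBD", "DT", "NN"]
print(accuracy(pred, gold))  # 0.8
```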

    More on tagger accuracy

    • is 95% good?

      • that’s 5 mistakes every 100 words

      • if on average, a sentence is 20 words, that’s 1 mistake per sentence

    • when comparing tagger accuracy, beware of:

      • size of training corpus

        • the bigger, the better the results

      • difference between training & testing corpora (genre, domain…)

        • the closer, the better the results

      • size of tag set

        • Prediction versus classification

      • unknown words

        • the more unknown words (not in the dictionary), the worse the results

    Error Analysis

    • Look at a confusion matrix (contingency table)

    • E.g. 4.4% of the total errors caused by mistagging VBD as VBN

    • See what errors are causing problems

      • Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)

      • Adverb (RB) vs Particle (RP) vs Prep (IN)

      • Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)


    Major difficulties in POS tagging

    • Unknown words (proper names)

      • because we do not know the set of tags it can take

      • and knowing this takes you a long way (cf. baseline POS tagger)

      • possible solutions:

        • assign all possible tags with a probability distribution identical to the lexicon as a whole

        • use morphological cues to infer possible tags

          • e.g., words ending in -ed are likely to be past tense verbs or past participles

    • Frequently confused tag pairs

      • preposition vs particle

        <running> <up> a hill (prep) / <running up> a bill (particle)

      • verb, past tense vs. past participle vs. adjective

    Unknown Words

    • Most-frequent-tag approach.

    • What about words that don’t appear in the training set?

    • Suffix analysis:

      • The probability distribution for a particular suffix is generated from all words in the training set that share the same suffix.

    • Suffix estimation – Calculate the probability of a tag t given the last i letters of an n letter word.

    • Smoothing: successive abstraction through sequences of increasingly more general contexts (i.e., omit more and more characters of the suffix)

    • Use a morphological analyzer to get the restriction on the possible tags.
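A minimal sketch of the suffix-distribution idea (tiny invented training set; real taggers combine several suffix lengths with smoothing by successive abstraction):

```python
from collections import Counter

# Toy training set of (word, tag) pairs.
training = [("walked", "VBD"), ("planned", "VBD"), ("tired", "JJ"),
            ("quickly", "RB"), ("happily", "RB")]

def suffix_dist(suffix, tagged_words):
    """Tag distribution over all training words sharing this suffix."""
    counts = Counter(t for w, t in tagged_words if w.endswith(suffix))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def guess_tag(word, tagged_words, suffix_len=2):
    """Guess an unknown word's tag from its last suffix_len letters."""
    dist = suffix_dist(word[-suffix_len:], tagged_words)
    return max(dist, key=dist.get)

print(guess_tag("jumped", training))  # 'ed' words are most often VBD here
print(guess_tag("slowly", training))  # 'ly' words are RB here
```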


    Different Models for POS tagging

    • HMM

    • Maximum Entropy Markov Models

    • Conditional Random Fields

    Hidden Markov Model (HMM): Generative Modeling

    Source Model P(Y)

    Noisy Channel P(X|Y)



    Disadvantage of HMMs (1)

    • No Rich Feature Information

      • Rich information is required

        • When xk is complex

        • When data of xk is sparse

    • Example: POS Tagging

      • How to evaluate P(wk|tk) for unknown words wk ?

      • Useful features

        • Suffix, e.g., -ed, -tion, -ing, etc.

        • Capitalization

    • Generative Model

      • Parameter estimation: maximize the joint likelihood of training examples

    Generative Models

    • Hidden Markov models (HMMs) and stochastic grammars

      • Assign a joint probability to paired observation and label sequences

      • The parameters are typically trained to maximize the joint likelihood of training examples

    Generative Models (cont’d)

    • Difficulties and disadvantages

      • Need to enumerate all possible observation sequences

      • Not practical to represent multiple interacting features or long-range dependencies of the observations

      • Very strict independence assumptions on the observations

    • Better Approach

      • Discriminative model which models P(y|x) directly

      • Maximize the conditional likelihood of training examples

    Maximum Entropy modeling

    • N-gram model : probabilities depend on the previous few tokens.

    • We may identify a more heterogeneous set of features which contribute in some way to the choice of the current word. (whether it is the first word in a story, whether the next word is to, whether one of the last 5 words is a preposition, etc)

    • Maxent combines these features in a probabilistic model.

    • The given features provide a constraint on the model.

    • We would like to have a probability distribution which, outside of these constraints, is as uniform as possible – has the maximum entropy among all models that satisfy these constraints.

    Maximum Entropy Markov Model

    • Discriminative Sub Models

      • Unify two parameters in generative model into one conditional model

        • Two parameters in generative model,

        • parameter in source model and parameter in noisy channel

        • Unified conditional model

      • Employ maximum entropy principle

    • Maximum Entropy Markov Model

    General Maximum Entropy Principle

    • Model

      • Model distribution P(Y|X) with a set of features {f1, f2, …, fl} defined on X and Y

    • Idea

      • Collect information of features from training data

      • Principle

        • Model what is known

        • Assume nothing else

          → Flattest distribution

          → Distribution with the maximum entropy

    Example

    • (Berger et al., 1996) example

      • Model translation of word “in” from English to French

        • Need to model P(wordFrench), the distribution over French translations of “in”

        • Constraints

          • 1: Possible translations: dans, en, à, au cours de, pendant

          • 2: “dans” or “en” used 30% of the time

          • 3: “dans” or “à” used 50% of the time

    Features

    • Features

      • 0-1 indicator functions

        • 1 if (x, y) satisfies a predefined condition

        • 0 if not

    • Example: POS Tagging
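Such 0-1 indicator features are plain predicates on a (word, tag) pair; two illustrative examples (conditions of my own, not Ratnaparkhi's exact feature set):

```python
# 0-1 indicator features: 1 if the (word, tag) pair satisfies
# a predefined condition, 0 otherwise.

def f_suffix_ed_vbd(word, tag):
    return 1 if word.endswith("ed") and tag == "VBD" else 0

def f_capitalized_nnp(word, tag):
    return 1 if word[0].isupper() and tag == "NNP" else 0

print(f_suffix_ed_vbd("walked", "VBD"))    # 1
print(f_capitalized_nnp("walked", "NNP"))  # 0
```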

    Constraints

    • Empirical Information

      • Statistics from training data T

    • Expected Value

      • From the distribution P(Y|X) we want to model

    • Constraints: for each feature, the model's expected value must match the empirical value, EP[fi] = EP̃[fi]

    Maximum Entropy: Objective

    • Entropy of the conditional model:

      H(P) = − Σx,y P̃(x) P(y|x) log P(y|x)

    • Maximization problem: choose P* = argmaxP H(P) subject to the feature constraints
    Dual Problem

    • Dual Problem

      • Conditional model

      • Maximum likelihood of conditional data

    • Solution

      • Improved iterative scaling (IIS) (Berger et al. 1996)

      • Generalized iterative scaling (GIS) (McCallum et al. 2000)

    Maximum Entropy Markov Model

    • Use Maximum Entropy Approach to Model

      • 1st order

    • Features

      • Basic features (like parameters in HMM)

        • Bigram (1st order) or trigram (2nd order) in source model

        • State-output pair feature (Xk = xk,Yk=yk)

      • Advantage: incorporate other advanced features on (xk,yk)

    HMM vs MEMM (1st order)

    Maximum Entropy Markov Model (MEMM)


    Performance in POS Tagging

    • POS Tagging

      • Data set: WSJ

      • Features:

        • HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)

    • Results (Lafferty et al. 2001)

      • 1st order HMM

        • 94.31% accuracy, 54.01% OOV accuracy

      • 1st order MEMM

        • 95.19% accuracy, 73.01% OOV accuracy

    ME applications

    • Part of Speech (POS) Tagging (Ratnaparkhi, 1996)

      • P(POS tag | context)

      • Information sources

        • Word window (4)

        • Word features (prefix, suffix, capitalization)

        • Previous POS tags

    ME applications

    • Abbreviation expansion (Pakhomov, 2002)

      • Information sources

        • Word window (4)

        • Document title

    • Word Sense Disambiguation (WSD) (Chao & Dyer, 2002)

      • Information sources

        • Word window (4)

        • Structurally related words (4)

    • Sentence Boundary Detection (Reynar & Ratnaparkhi, 1997)

      • Information sources

        • Token features (prefix, suffix, capitalization, abbreviation)

        • Word window (2)

    Solution

    • Global Optimization

      • Optimize parameters in a global model simultaneously, not in sub-models separately

    • Alternatives

      • Conditional random fields

      • Application of perceptron algorithm

    Why ME?

    • Advantages

      • Combine multiple knowledge sources

        • Local

          • Word prefix, suffix, capitalization (POS - (Ratnaparkhi, 1996))

          • Word POS, POS class, suffix (WSD - (Chao & Dyer, 2002))

          • Token prefix, suffix, capitalization, abbreviation (Sentence Boundary - (Reynar & Ratnaparkhi, 1997))

        • Global

          • N-grams (Rosenfeld, 1997)

          • Word window

          • Document title (Pakhomov, 2002)

          • Structurally related words (Chao & Dyer, 2002)

          • Sentence length, conventional lexicon (Och & Ney, 2002)

      • Combine dependent knowledge sources

    Why ME?

    • Advantages

      • Add additional knowledge sources

      • Implicit smoothing

    • Disadvantages

      • Computational

        • Expected value at each iteration

        • Normalizing constant

      • Overfitting

        • Feature selection

          • Cutoffs

          • Basic Feature Selection (Berger et al., 1996)

    Conditional Models

    • Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)

      • Specify the probability of possible label sequences given an observation sequence

    • Allow arbitrary, non-independent features on the observation sequence X

    • The probability of a transition between labels may depend on past and future observations

      • Relax strong independence assumptions in generative models

    Discriminative Models: Maximum Entropy Markov Models (MEMMs)

    • Exponential model

    • Given training set X with label sequence Y:

      • Train a model θ that maximizes P(Y|X, θ)

      • For a new data sequence x, the predicted label y maximizes P(y|x, θ)

      • Notice the per-state normalization

    MEMMs (cont’d)

    • MEMMs have all the advantages of Conditional Models

    • Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states (“conservation of score mass”)

    • Subject to Label Bias Problem

      • Bias toward states with fewer outgoing transitions

    Label Bias Problem

    • Consider this MEMM:

    • P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)

    • P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)

    • In the training data, label value 2 is the only label value observed after label value 1

    • Therefore P(2 | 1) = 1, and so P(2 | 1 and x) = 1 for all x

    • It follows that P(1 and 2 | ro) = P(1 and 2 | ri)

    • However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro)

    • Per-state normalization cannot express this expectation
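The effect can be checked numerically. In the sketch below the transition tables are hypothetical, but they reproduce the key property of the slide's example: state 1 has a single successor, so P(2 | 1, x) = 1 regardless of the observation, and the two path probabilities come out equal:

```python
# Per-state conditional tables for a toy MEMM. From start state 0, 'r' can go
# to state 1 or state 4 (the 0.5/0.5 split is a hypothetical training statistic);
# states 1 and 4 each have a single outgoing transition, so the per-state model
# assigns it probability 1 no matter what is observed.
P_next = {
    (0, "r"): {1: 0.5, 4: 0.5},
    (1, "i"): {2: 1.0},
    (1, "o"): {2: 1.0},
    (4, "i"): {5: 1.0},
    (4, "o"): {5: 1.0},
}

def path_prob(obs, states):
    """Probability of a state path given the observations, starting at state 0."""
    p, cur = 1.0, 0
    for x, nxt in zip(obs, states):
        p *= P_next[(cur, x)][nxt]
        cur = nxt
    return p

p_ri = path_prob("ri", [1, 2])
p_ro = path_prob("ro", [1, 2])
```

The second observation is ignored entirely: the model cannot prefer the path 1→2 for "ri" over "ro", which is exactly the label bias problem.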

    Solve the Label Bias Problem

    • Change the state-transition structure of the model

      • Not always practical to change the set of states

    • Start with a fully-connected model and let the training procedure figure out a good structure

      • Precludes the use of prior knowledge, which is very valuable (e.g. in information extraction)

    Random Field

    Conditional Random Fields (CRFs)

    • CRFs have all the advantages of MEMMs without label bias problem

      • MEMM uses per-state exponential model for the conditional probabilities of next states given the current state

      • CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence

    • Undirected acyclic graph

    • Allow some transitions to “vote” more strongly than others, depending on the corresponding observations

    Definition of CRFs

    X is a random variable over data sequences to be labeled

    Y is a random variable over corresponding label sequences

    Example of CRFs

    Graphical comparison among HMMs, MEMMs and CRFs


    Conditional Distribution

    If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is:

    p(y | x) ∝ exp( Σe∈E,k λk fk(e, y|e, x) + Σv∈V,k μk gk(v, y|v, x) )

    x is a data sequence

    y is a label sequence

    v is a vertex from vertex set V = set of label random variables

    e is an edge from edge set E over V

    fk and gk are given and fixed. gk is a Boolean vertex feature; fk is a Boolean edge feature

    k is the number of features

    λk and μk are the parameters to be estimated (weights for fk and gk, respectively)

    y|e is the set of components of y defined by edge e

    y|v is the set of components of y defined by vertex v


    Conditional Distribution (cont’d)

    • CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

    p(y | x) = (1 / Z(x)) exp( Σe∈E,k λk fk(e, y|e, x) + Σv∈V,k μk gk(v, y|v, x) )

    Z(x) is a normalization constant over the data sequence x
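For a linear chain, this globally normalized conditional can be sketched as follows (trans and emit stand in for weighted edge features fk and vertex features gk; the brute-force Z(x) is only feasible for tiny examples):

```python
import math
from itertools import product

def score(x, y, trans, emit):
    """Unnormalized log-score: sum of vertex (emit) and edge (trans) weights."""
    s = sum(emit.get((y[i], x[i]), 0.0) for i in range(len(x)))
    s += sum(trans.get((y[i - 1], y[i]), 0.0) for i in range(1, len(x)))
    return s

def crf_prob(x, y, labels, trans, emit):
    """P(y | x) with a single global normalization Z(x) over all label sequences."""
    z = sum(math.exp(score(x, yy, trans, emit))
            for yy in product(labels, repeat=len(x)))
    return math.exp(score(x, y, trans, emit)) / z
```

Because there is one Z(x) for the whole sequence instead of one per state, transitions supported by strong observations can outweigh others, which is what avoids the label bias problem.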

    Parameter Estimation for CRFs

    • The paper provided iterative scaling algorithms

    • These iterative scaling algorithms turn out to be very inefficient

    • Prof. Dietterich’s group applied a gradient descent algorithm, which is quite efficient

    Training of CRFs (From Prof. Dietterich)

    • Then, take the derivative of the above equation

    • For training, the first 2 items are easy to get.

    • For example, for each λk, fk evaluates over the training data to a sequence of Boolean values, such as 00101110100111.

      • Its empirical count is just the total number of 1’s in the sequence.

    • The hardest thing is how to calculate Z(x)
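For a linear-chain CRF, Z(x) need not be computed by brute force: a forward (dynamic-programming) recursion over the label lattice computes it in O(n·|labels|²). A sketch with both versions for comparison (trans and emit are stand-ins for the weighted feature sums):

```python
import math
from itertools import product

def log_Z_brute(x, labels, trans, emit):
    """Brute force: sum exp(score) over all |labels|^len(x) label sequences."""
    def score(y):
        s = sum(emit.get((y[i], x[i]), 0.0) for i in range(len(x)))
        return s + sum(trans.get((y[i - 1], y[i]), 0.0) for i in range(1, len(x)))
    return math.log(sum(math.exp(score(y)) for y in product(labels, repeat=len(x))))

def log_Z_forward(x, labels, trans, emit):
    """Forward recursion: alpha[l] accumulates the mass of all prefixes ending in l."""
    alpha = {l: emit.get((l, x[0]), 0.0) for l in labels}
    for t in range(1, len(x)):
        alpha = {l: math.log(sum(math.exp(alpha[p] + trans.get((p, l), 0.0))
                                 for p in labels)) + emit.get((l, x[t]), 0.0)
                 for l in labels}
    return math.log(sum(math.exp(a) for a in alpha.values()))
```

Both routes give the same value; only the forward recursion scales to real sentences.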

    Training of CRFs (From Prof. Dietterich) (cont’d)

    • Maximal cliques

    POS tagging Experiments (cont’d)

    • Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging

    • Each word in a given input sentence must be labeled with one of 45 syntactic tags

    • Add a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and if it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies

    • oov = out-of-vocabulary (not observed in the training set)
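The orthographic features listed above can be sketched as a simple extractor (the feature names are invented for illustration):

```python
def ortho_features(word):
    """Orthographic features from the experiments: initial digit or upper-case
    letter, internal hyphen, and the listed suffixes."""
    feats = {
        "starts_digit": word[:1].isdigit(),
        "init_cap": word[:1].isupper(),
        "has_hyphen": "-" in word,
    }
    for suf in ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies"):
        feats["suffix_" + suf] = word.endswith(suf)
    return feats
```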

    Summary

    • Per-state normalized discriminative models (such as MEMMs) are prone to the label bias problem

    • CRFs provide the benefits of discriminative models

    • CRFs solve the label bias problem well, and demonstrate good performance