Parts of Speech

Parts of Speech Sudeshna Sarkar 7 Aug 2008

Why Do We Care about Parts of Speech? • Pronunciation • Hand me the lead pipe. • Predicting what words can be expected next • Personal pronoun (e.g., I, she) ____________ • Stemming • -s means singular for verbs, plural for nouns • As the basis for syntactic parsing and then meaning extraction • I will lead the group into the lead smelter. • Machine translation • (E) content +N  (F) contenu +N • (E) content +Adj  (F) content +Adj or satisfait +Adj

What is a Part of Speech? Is this a semantic distinction? For example, maybe Noun is the class of words for people, places and things. Maybe Adjective is the class of words for properties of nouns. Consider: green book book is a Noun green is an Adjective Now consider: book worm This green is very soothing.

How Many Parts of Speech Are There? • A first cut at the easy distinctions: • Open classes: • nouns, verbs, adjectives, adverbs • Closed classes: function words • conjunctions: and, or, but • pronounts: I, she, him • prepositions: with, on • determiners: the, a, an

Part of speech tagging • 8 (ish) traditional parts of speech • Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc • This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.) • Called: parts-of-speech, lexical category, word classes, morphological classes, lexical tags, POS • We’ll use POS most frequently • I’ll assume that you all know what these are

POS examples • N noun chair, bandwidth, pacing • V verb study, debate, munch • ADJ adj purple, tall, ridiculous • ADV adverb unfortunately, slowly, • P preposition of, by, to • PRO pronoun I, me, mine • DET determiner the, a, that, those

Tagsets Brown corpus tagset (87 tags): http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html Penn Treebank tagset (45 tags): http://www.cs.colorado.edu/~martin/SLP/Figures/ (8.6) C7 tagset (146 tags) http://www.comp.lancs.ac.uk/ucrel/claws7tags.html

WORDS TAGS the koala put the keys on the table N V P DET POS Tagging: Definition • The process of assigning a part-of-speech or lexical class marker to each word in a corpus:

POS Tagging example WORD tag the DET koala N put V the DET keys N on P the DET table N

POS tagging: Choosing a tagset • There are so many parts of speech, potential distinctions we can draw • To do POS tagging, need to choose a standard set of tags to work with • Could pick very coarse tagets • N, V, Adj, Adv. • More commonly used set is finer grained, the “UPenn TreeBank tagset”, 45 tags • PRP$, WRB, WP$, VBG • Even more fine-grained tagsets exist

Penn TreeBank POS Tag set

Using the UPenn tagset • The/DT grand/JJ jury/NN commmented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. • Prepositions and subordinating conjunctions marked IN (“although/IN I/PRP..”) • Except the preposition/complementizer “to” is just marked “to”.

POS Tagging • Words often have more than one POS: back • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB • The POS tagging problem is to determine the POS tag for a particular instance of a word.

How hard is POS tagging? Measuring ambiguity

Algorithms for POS Tagging • Ambiguity – In the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags): • Worse, 40% of the tokens are ambiguous.

Algorithms for POS Tagging • Why can’t we just look them up in a dictionary? • Words that aren’t in the dictionary http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc • One idea: P(ti| wi) = the probability that a random hapax legomenon in the corpus has tag ti. • Nouns are more likely than verbs, which are more likely than pronouns. • Another idea: use morphology.

Algorithms for POS Tagging - Knowledge • Dictionary • Morphological rules, e.g., • _____-tion • _____-ly • capitalization • N-gram frequencies • to _____ • DET _____ N • But what about rare words, e.g, smelt (two verb forms, melt and past tense of smell, and one noun form, a small fish) • Combining these • V _____-ing I was gracking vs. Gracking is fun.

POS Tagging - Approaches • Approaches • Rule-based tagging • (ENGTWOL) • Stochastic (=Probabilistic) tagging • HMM (Hidden Markov Model) tagging • Transformation-based tagging • Brill tagger • Do we return one best answer or several answers and let later steps decide? • How does the requisite knowledge get entered?

3 methods for POS tagging 1. Rule-based tagging • Example: Karlsson (1995) EngCGtagger based on the Constraint Grammar architecture and ENGTWOL lexicon • Basic Idea: • Assign all possible tags to words (morphological analyzer used) • Remove wrong tags according to set of constraint rules (typically more than 1000 hand-written constraint rules, but may be machine-learned)

3 methods for POS tagging 2. Transformation-based tagging • Example: Brill (1995) tagger - combination of rule-based and stochastic (probabilistic) tagging methodologies • Basic Idea: • Start with a tagged corpus + dictionary (with most frequent tags) • Set the most probable tag for each word as a start value • Change tags according to rules of type “if word-1 is a determiner and word is a verb then change the tag to noun” in a specific order (like rule-based taggers) • machine learning is used—the rules are automatically induced from a previously tagged training corpus (like stochastic approach)

3 methods for POS tagging 3. Stochastic (=Probabilistic) tagging • Example: HMM (Hidden Markov Model) tagging - a training corpus used to compute the probability (frequency) of a given word having a given POS tag in a given context

Hidden Markov Model (HMM) Tagging • Using an HMM to do POS tagging • HMM is a special case of Bayesian inference • It is also related to the “noisy channel” model in ASR (Automatic Speech Recognition)

Hidden Markov Model (HMM) Taggers • Goal: maximize P(word|tag) x P(tag|previous n tags) • P(word|tag) • word/lexical likelihood • probability that given this tag, we have this word • NOT probability that this word has this tag • modeled through language model (word-tag matrix) • P(tag|previous n tags) • tag sequence likelihood • probability that this tag follows these previous tags • modeled through language model (tag-tag matrix) Lexical information Syntagmatic information

POS tagging as a sequence classification task • We are given a sentence (an “observation” or “sequence of observations”) • Secretariat is expected to race tomorrow • sequence of n words w1…wn. • What is the best sequence of tags which corresponds to this sequence of observations? • Probabilistic/Bayesian view: • Consider all possible sequences of tags • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.

Getting to HMM • Let T = t1,t2,…,tn • Let W = w1,w2,…,wn • Goal: Out of all sequences of tags t1…tn, get the the most probable sequence of POS tags T underlying the observed sequence of words w1,w2,…,wn • Hat ^ means “our estimate of the best = the most probable tag sequence” • Argmaxxf(x) means “the x such that f(x) is maximized” it maximazes our estimate of the best tag sequence

Getting to HMM • This equation is guaranteed to give us the best tag sequence • But how do we make it operational? How do we compute this value? • Intuition of Bayesian classification: • Use Bayes rule to transform it into a set of other probabilities that are easier to compute • Thomas Bayes: British mathematician (1702-1761)

Bayes Rule Breaks down any conditional probability P(x|y) into three other probabilities P(x|y): The conditional probability of an event x assuming that y has occurred

Bayes Rule We can drop the denominator: it does not change for each tag sequence; we are looking for the best tag sequence for the same observation, for the same fixed set of words

Bayes Rule

Likelihood and prior n

Likelihood and prior Further Simplifications 1. the probability of a word appearing depends only on its own POS tag, i.e, independent of other words around it n 2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag 3. The most probable tag sequence estimated by the bigram tagger

WORDS TAGS the koala put the keys on the table N V P DET Likelihood and prior Further Simplifications 1. the probability of a word appearing depends only on its own POS tag, i.e, independent of other words around it n

Likelihood and prior Further Simplifications 2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-gram. Bigrams are used as the basis for simple statistical analysis of text The bigram assumption is related to the first-order Markov assumption

Likelihood and prior Further Simplifications 3. The most probable tag sequence estimated by the bigram tagger --------------------------------------------------------------------------------------------------------------- n biagram assumption

Two kinds of probabilities (1) • Tag transition probabilities p(ti|ti-1) • Determiners likely to precede adjs and nouns • That/DT flight/NN • The/DT yellow/JJ hat/NN • So we expect P(NN|DT) and P(JJ|DT) to be high • But P(DT|JJ) to be:?

Two kinds of probabilities (1) • Tag transition probabilities p(ti|ti-1) • Compute P(NN|DT) by counting in a labeled corpus: # of times DT is followed by NN

Two kinds of probabilities (2) • Word likelihood probabilities p(wi|ti) • P(is|VBZ) = probability of VBZ (3sg Pres verb) being “is” • Compute P(is|VBZ) by counting in a labeled corpus: If we were expecting a third person singular verb, how likely is it that this verb would be is?

An Example: the verb “race” • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR • People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DTrace/NN for/IN outer/JJ space/NN • How do we pick the right tag?

Disambiguating “race”

Disambiguating “race” • P(NN|TO) = .00047 • P(VB|TO) = .83 The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: ‘How likely are we to expect verb/noun given the previous tag TO?’ • P(race|NN) = .00057 • P(race|VB) = .00012 Lexical likelihoods from the Brown corpus for ‘race’ given a POS tag NN or VB. • P(NR|VB) = .0027 • P(NR|NN) = .0012 tag sequence probability for the likelihood of an adverb occurring given the previous tag verb or noun • P(VB|TO)P(NR|VB)P(race|VB) = .00000027 • P(NN|TO)P(NR|NN)P(race|NN)=.00000000032 Multiply the lexical likelihoods with the tag sequence probabiliies: the verb wins

Hidden Markov Models • What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM) • Let’s just spend a bit of time tying this into the model • In order to define HMM, we will first introduce the Markov Chain, or observable Markov Model.

Definitions • A weighted finite-state automaton adds probabilities to the arcs • The sum of the probabilities leaving any arc must sum to one • A Markov chain is a special case of a WFST in which the input sequence uniquely determines which states the automaton will go through • Markov chains can’t represent inherently ambiguous problems • Useful for assigning probabilities to unambiguous sequences

Markov chain = “First-order observed Markov Model” • a set of states • Q = q1, q2…qN; the state at time t is qt • a set of transition probabilities: • a set of probabilities A = a01a02…an1…ann. • Each aij represents the probability of transitioning from state i to state j • The set of these is the transition probability matrix A • Distinguished start and end states Special initial probability vector  ithe probability that the MM will start in state i, each iexpresses the probability p(qi|START)

Markov chain = “First-order observed Markov Model” Markov Chain for weather: Example 1 • three types of weather: sunny, rainy, foggy • we want to find the following conditional probabilities: P(qn|qn-1, qn-2, …, q1) - I.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days - We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences Problem: the larger n is, the more observations we must collect. Suppose that n=6, then we have to collect statistics for 3(6-1) = 243 past histories

Markov chain = “First-order observed Markov Model” • Therefore, we make a simplifying assumption, called the (first-order) Markov assumption for a sequence of observations q1, … qn, current state only depends on previous state • the joint probability of certain past and current observations

Markov chain = “First-order observable Markov Model”

Markov chain = “First-order observed Markov Model” • Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy? • Using the Markov assumption and the probabilities in table 1, this translates into:

The weather figure: specific example • Markov Chain for weather: Example 2

Markov chain for weather • What is the probability of 4 consecutive rainy days? • Sequence is rainy-rainy-rainy-rainy • I.e., state sequence is 3-3-3-3 • P(3,3,3,3) = • 1a11a11a11a11 = 0.2 x (0.6)3 = 0.0432

Hidden Markov Model • For Markov chains, the output symbols are the same as the states. • See sunny weather: we’re in state sunny • But in part-of-speech tagging (and other things) • The output symbols are words • But the hidden states are part-of-speech tags • So we need an extension! • A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states. • This means we don’t know which state we are in.

Parts of Speech

Parts of Speech

Presentation Transcript

Parts of Speech

Parts of Speech

PARTS OF SPEECH

Parts of Speech

Parts of Speech

Parts of Speech

Parts of Speech

PARTS OF SPEECH

Parts of Speech

Parts of Speech

Parts of Speech

Parts of Speech

Parts of Speech

Parts of Speech

Parts of Speech

Parts of speech

Parts of Speech:

Parts of Speech

Parts of Speech

Parts of Speech

Parts of Speech

Parts of Speech.