Part of Speech (POS) Tagging • CSC 9010: Special Topics: Natural Language Processing • Paula Matuszek, Mary-Angela Papalaskari • Spring, 2005
Sources (and Resources) • Some slides adapted from • Dorr, www.umiacs.umd.edu/~christof/courses/cmsc723-fall04 • Jurafsky, www.stanford.edu/class/linguist238 • McCoy, www.cis.udel.edu/~mccoy/courses/cisc882.03f • With some additional examples and ideas from • Martin: www.cs.colorado.edu/~martin/csci5832.html • Hearst: www.sims.berkeley.edu/courses/is290-2/f04/resources.html • Litman: www.cs.pitt.edu/~litman/courses/cs2731f03/cs2731.html • Rich: www.cs.utexas.edu/users/ear/cs378NLP • You may find some or all of these useful resources throughout the course.
Word Classes and Part-of-Speech Tagging • What is POS tagging? • Why do we need POS? • Word Classes • Rule-based Tagging • Stochastic Tagging • Transformation-Based Tagging • Tagging Unknown Words • Evaluating POS Taggers
Parts of Speech • 8 traditional parts of speech (more or less) • Noun, verb, adjective, preposition, adverb, article, pronoun, conjunction • This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.) • Also called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS • Actual categories vary by language, by reason for tagging, and by who you ask!
POS examples • N (noun): chair, bandwidth, pacing • V (verb): study, debate, munch • ADJ (adjective): purple, tall, ridiculous • ADV (adverb): unfortunately, slowly • P (preposition): of, by, to • PRO (pronoun): I, me, mine • DET (determiner): the, a, that, those
Definition of POS Tagging • "The process of assigning a part-of-speech or other lexical class marker to each word in a corpus" (Jurafsky and Martin) • Example: the/DET girl/N kissed/V the/DET boy/N on/P the/DET cheek/N
POS Tagging example • the/DET koala/N put/V the/DET keys/N on/P the/DET table/N
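As a concrete aside (not part of the original slides): the modern NLTK toolkit, which postdates this course, can tag such a sentence in a couple of lines. Its default tagger returns Penn Treebank tags (DT, NN, VBD, …) rather than the simplified tags above; a minimal sketch:

```python
# A minimal sketch using NLTK, a toolkit not assumed by these slides.
# Resource names below are for recent NLTK versions and may vary.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("the koala put the keys on the table")
print(nltk.pos_tag(tokens))
# e.g. [('the', 'DT'), ('koala', 'NN'), ('put', 'VBD'), ('the', 'DT'),
#       ('keys', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('table', 'NN')]
```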
What does Tagging do? • Collapses Distinctions • Lexical identity may be discarded • e.g. all personal pronouns tagged with PRP • Introduces Distinctions • Ambiguities may be removed • e.g. deal tagged with NN or VB • e.g. deal tagged with DEAL1 or DEAL2 • Helps classification and prediction • (Modified from Diane Litman's version of Steve Bird's notes)
Significance of Parts of Speech • A word's POS tells us a lot about the word and its neighbors: • Limits the range of meanings (deal), pronunciation (OBject vs. obJECT), or both (wind) • Helps in stemming • Limits the range of following words for speech recognition • Can help select nouns from a document for IR • Basis for partial parsing (chunked parsing) • Parsers can build trees directly on the POS tags instead of maintaining a lexicon • (Modified from Diane Litman's version of Steve Bird's notes)
Word Classes • What are we trying to classify words into? • Classes based on • Syntactic properties: what can precede/follow • Morphological properties: what affixes they take • Not primarily by semantic coherence (Conjunction Junction notwithstanding!) • Broad "grammar" categories are familiar • NLP uses much richer "tagsets"
Open and closed class words • Two major categories of classes: • Closed class: a relatively fixed membership • Prepositions: of, in, by, … • Auxiliaries: may, can, will, had, been, … • Pronouns: I, you, she, mine, his, them, … • Usually function words (short common words which play a role in grammar) • Open class: new ones can be created all the time • English has 4: nouns, verbs, adjectives, adverbs • Many languages have all 4, but not all!
Open Class Words • Every known human language has nouns and verbs • Nouns: people, places, things • Classes of nouns: proper vs. common, count vs. mass • Verbs: actions and processes • Adjectives: properties, qualities • Adverbs: hodgepodge! • Unfortunately, John walked home extremely slowly yesterday
Closed Class Words • Idiosyncratic; differ more from language to language • Language strongly resists additions • Examples: • prepositions: on, under, over, … • particles: up, down, on, off, … • determiners: a, an, the, … • pronouns: she, who, I, … • conjunctions: and, but, or, … • auxiliary verbs: can, may, should, … • numerals: one, two, three, third, …
Prepositions from CELEX • [frequency table not reproduced]
English Single-Word Particles • [table not reproduced]
Pronouns in CELEX • [frequency table not reproduced]
Conjunctions • [table not reproduced]
Auxiliaries • [table not reproduced]
POS Tagging: Choosing a Tagset • There are many parts of speech and many potential distinctions among them • To do POS tagging, we need to choose a standard set of tags to work with • Tag sets vary in size, from a dozen to over 200 tags • The size depends on the language, the objectives, and the purpose of the tagging • Need to strike a balance between: • getting better information about context (best served by introducing more distinctions) • making it possible for classifiers to do their job (which requires minimizing distinctions)
Some of the best-known Tagsets • Brown corpus: 87 tags • Penn Treebank: 45 tags • Lancaster UCREL C5 (used to tag the BNC): 61 tags • Lancaster C7: 145 tags • (Slide modified from Massimo Poesio's)
The Brown Corpus • The first digital corpus (1961) • Francis and Kucera, Brown University • Contents: 500 texts, each 2000 words long • From American books, newspapers, magazines • Representing genres: science fiction, romance fiction, press reportage, scientific writing, popular lore • (Modified from Diane Litman's version of Steve Bird's notes)
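The Brown Corpus is distributed with NLTK, so its tagged text is easy to inspect; a minimal sketch (assumes NLTK is installed and the corpus has been downloaded):

```python
import nltk
nltk.download("brown", quiet=True)
from nltk.corpus import brown

print(len(brown.words()))          # about 1.16 million words in 500 texts
print(brown.tagged_words()[:5])    # [(word, Brown tag), ...] pairs
```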
Penn Treebank • First syntactically annotated corpus • 1 million words from the Wall Street Journal • Part of speech tags and syntax trees • (Modified from Diane Litman's version of Steve Bird's notes)
Tag Set Example: Penn Treebank • [tag table not reproduced; it includes tags such as PRP (personal pronoun) and PRP$ (possessive pronoun)]
Example of Penn Treebank Tagging of a Brown Corpus Sentence • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. • Book/VB that/DT flight/NN ./. • Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.
POS Tagging • Words often have more than one POS: back • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB • The POS tagging problem is to determine the POS tag for a particular instance of a word.
Word Class Ambiguity (in the Brown Corpus) • Unambiguous (1 tag): 35,340 word types • Ambiguous (2-7 tags): 4,100 word types • (DeRose, 1988)
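DeRose's numbers can be loosely reproduced by collecting the set of tags each word type receives in the tagged Brown Corpus; a sketch (exact counts will differ with tokenization, case folding, and tag-set details):

```python
from collections import defaultdict
from nltk.corpus import brown   # assumes nltk.download("brown") has been run

tags_for = defaultdict(set)
for word, tag in brown.tagged_words():
    tags_for[word.lower()].add(tag)

ambiguous = sum(1 for tags in tags_for.values() if len(tags) > 1)
print(f"{len(tags_for)} word types, {ambiguous} with more than one tag")
```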
Part-of-Speech Tagging • Rule-Based Tagger: ENGTWOL • Stochastic Tagger: HMM-based • Transformation-Based Tagger: Brill
Rule-Based Tagging • Basic idea: • Assign all possible tags to words • Remove tags according to a set of rules of this type: if word+1 is an adj, adv, or quantifier, and the following is a sentence boundary, and word-1 is not a verb like "consider", then eliminate non-adv; else eliminate adv. • Typically more than 1000 hand-written rules, though rules may also be machine-learned.
Start With a Dictionary • she: PRP • promised: VBN, VBD • to: TO • back: VB, JJ, RB, NN • the: DT • bill: NN, VB • etc., for the ~100,000 words of English
Assign All Possible Tags • She: PRP • promised: VBN, VBD • to: TO • back: VB, JJ, RB, NN • the: DT • bill: NN, VB
Write rules to eliminate tags • Example rule: eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP" • Applied to "She promised to back the bill", the rule removes VBN for promised, leaving: She: PRP • promised: VBD • to: TO • back: VB, JJ, RB, NN • the: DT • bill: NN, VB
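In code, this style of tagger starts every word with all of its dictionary tags and then applies elimination rules. A toy sketch of the single rule above (the dictionary and rule are illustrative stand-ins, not ENGTWOL's actual rules):

```python
# Toy rule-based elimination for "She promised to back the bill".
LEXICON = {
    "she": {"PRP"}, "promised": {"VBN", "VBD"}, "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"}, "the": {"DT"}, "bill": {"NN", "VB"},
}

def tag(words):
    # Step 1: assign all possible tags from the dictionary.
    choices = [set(LEXICON[w.lower()]) for w in words]
    # Step 2: eliminate VBN when VBD is also an option and the
    # previous word is a sentence-initial PRP.
    for i in range(1, len(choices)):
        if i == 1 and choices[0] == {"PRP"} and {"VBN", "VBD"} <= choices[i]:
            choices[i].discard("VBN")
    return list(zip(words, choices))

print(tag("She promised to back the bill".split()))
# "promised" keeps only VBD; other words retain their remaining ambiguity
```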
Sample ENGTWOL Lexicon • [lexicon excerpt not reproduced]
Stochastic Tagging • Based on the probability of a certain tag occurring, given the various possibilities • Requires a training corpus • Problems: • No probabilities for words not in the training corpus • The training corpus may be too different from the test corpus
Stochastic Tagging (cont.) • Simple method: choose the most frequent tag in the training text for each word! • Result: 90% accuracy • Why bother with anything fancier? This simple method is the baseline: other methods will have to do better • HMM taggers are one example (a baseline sketch follows below)
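The 90% baseline takes only a few lines given a hand-tagged corpus; a sketch using Brown (the NN fallback for unseen words is an assumption, one of several reasonable choices):

```python
from collections import Counter, defaultdict
from nltk.corpus import brown   # assumes nltk.download("brown") has been run

counts = defaultdict(Counter)
for word, tag in brown.tagged_words():
    counts[word.lower()][tag] += 1

MOST_FREQUENT = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words, default="NN"):
    # Unknown words fall back to "NN" -- an assumed (and common) default.
    return [(w, MOST_FREQUENT.get(w.lower(), default)) for w in words]

print(baseline_tag("the koala put the keys on the table".split()))
```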
HMM Tagger • Intuition: pick the most likely tag for this word. • HMM taggers choose the tag sequence that maximizes this formula: P(word|tag) × P(tag|previous n tags) • Let T = t1, t2, …, tn and W = w1, w2, …, wn • Find the POS tags that generate the sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W.
Conditional Probability • A brief digression… • Conditional probability: how do we determine the likelihood of one event given another, if they are not independent? • Example: I am trying to diagnose a rash in a 6-year-old child. Is it measles? • In other words: given that the child has a rash, what is the probability that it is measles?
Conditional Probabilities cont. • What would affect your decision? • The overall frequency of rashes in 6-yr-olds • The overall frequency of measles in 6-yr-olds • The frequency with which 6-yr-olds with measles have rashes • P(measles|rash) = P(rash|measles) P(measles) / P(rash)
Bayes' Theorem • Bayes' Theorem or Bayes' Rule formalizes this intuition: P(X|Y) = P(Y|X) P(X) / P(Y) • P(X) is known as the "prior probability" or "prior"; P(Y) is the overall probability of the evidence.
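With some made-up numbers (purely hypothetical, for illustration only), the measles question becomes one line of arithmetic:

```python
# All three figures below are invented for illustration.
p_rash_given_measles = 0.95   # assumed: most measles cases show a rash
p_measles = 0.001             # assumed prior frequency of measles
p_rash = 0.05                 # assumed overall frequency of rashes

p_measles_given_rash = p_rash_given_measles * p_measles / p_rash
print(p_measles_given_rash)   # 0.019 -- a rash alone rarely means measles
```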
Probabilities • We want the best set of tags T for a sequence of words W (a sentence) • By Bayes' Rule: P(T|W) = P(W|T) P(T) / P(W) • P(W) is the same for every candidate tag sequence, so we choose the T that maximizes P(W|T) P(T): T = argmax_T P(W|T) P(T)
Tag Sequence: P(T) • How do we get the probability of a specific tag sequence? • Count the number of times each sequence occurs and divide by the number of sequences of that length? Not likely to work (far too sparse). • Instead, make a Markov assumption and use N-grams over tags: P(T) is the product of the probabilities of the N-grams that make it up.
N-Grams • The N stands for how many terms are used • Unigram: 1 term; Bigram: 2 terms; Trigram: 3 terms • Usually we don't go beyond 3 • You can use different kinds of terms, e.g.: • Character-based N-grams • Word-based N-grams • POS-based N-grams • Ordering: often adjacent, but not required • We use N-grams to help determine the context in which some linguistic phenomenon happens • E.g., look at the words before and after a period to see if it is the end of a sentence or not
P(T): Bigram Example • Given a sentence: <s> Det Adj Adj Noun </s> • Its probability is the product of four N-grams: P(Det|<s>) P(Adj|Det) P(Adj|Adj) P(Noun|Adj)
Counts • Where do you get the N-gram counts? • From a large hand-tagged corpus: for bigrams, count all the (tagi, tagi+1) pairs • Then smooth the counts to get rid of the zeroes • Alternatively, you can learn them from an untagged corpus • (A counting sketch follows below.)
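A sketch of the counting step over the tagged Brown Corpus, with add-one smoothing (one simple choice among many smoothing methods) to remove the zeroes:

```python
from collections import Counter
from nltk.corpus import brown   # assumes nltk.download("brown") has been run

BIGRAMS, UNIGRAMS = Counter(), Counter()
for sent in brown.tagged_sents():
    tags = ["<s>"] + [t for _, t in sent]
    UNIGRAMS.update(tags)
    BIGRAMS.update(zip(tags, tags[1:]))

TAGSET = set(UNIGRAMS)

def p_tag(tag, prev):
    # Add-one (Laplace) smoothing: unseen tag pairs get a small probability.
    return (BIGRAMS[(prev, tag)] + 1) / (UNIGRAMS[prev] + len(TAGSET))

print(p_tag("NN", "AT"))   # P(noun | article) under the Brown tag set
```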
What about P(W|T)? • First, it's odd: it asks for the probability of seeing "The big red dog" given the tag sequence "Det Adj Adj Noun" • Could we collect up all the times we see that tag sequence and count how often "The big red dog" shows up? Again, not likely to work.
P(W|T) • We'll make the following assumption (because it's easy): each word in the sequence depends only on its corresponding tag. So: P(W|T) ≈ ∏i P(wi|ti) • How do you get the statistics for that? (A sketch follows below.)
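A sketch of estimating P(wi|ti) from the same hand-tagged corpus: count how often each tag emits each word (unsmoothed here, so unseen word/tag pairs get probability zero):

```python
from collections import Counter, defaultdict
from nltk.corpus import brown   # assumes nltk.download("brown") has been run

EMIT, TAG_TOTAL = defaultdict(Counter), Counter()
for word, tag in brown.tagged_words():
    EMIT[tag][word.lower()] += 1
    TAG_TOTAL[tag] += 1

def p_word(word, tag):
    # Relative frequency estimate of P(word | tag); zero if never seen.
    return EMIT[tag][word.lower()] / TAG_TOTAL[tag]

print(p_word("race", "NN"), p_word("race", "VB"))
```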
So… • We start with: T = argmax_T P(W|T) P(T) • And, applying the word-per-tag and bigram assumptions, get: T = argmax_T ∏i P(wi|ti) P(ti|ti-1)
HMMs • This is a Hidden Markov Model (HMM) • The states in the model are the tags, and the observations are the words • The state-to-state transitions are driven by the bigram statistics • The observed words depend solely on the state you're in
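The slides don't name it, but the standard way to find the maximizing tag sequence efficiently is the Viterbi algorithm (dynamic programming over tag states). A compact sketch, assuming transition and emission functions like the p_tag and p_word sketches above:

```python
import math

def viterbi(words, tagset, p_tag, p_word):
    """Most probable tag sequence under bigram transitions and emissions.
    A sketch: assumes p_tag is smoothed (never zero) and that every word
    can be emitted by at least one tag; real taggers also handle unknowns."""
    best = {"<s>": (0.0, [])}   # state -> (log probability, tag path so far)
    for w in words:
        new = {}
        for t in tagset:
            e = p_word(w, t)
            if e == 0.0:
                continue        # this tag never emitted this word
            lp, path = max(
                (prev_lp + math.log(p_tag(t, prev)) + math.log(e), prev_path)
                for prev, (prev_lp, prev_path) in best.items()
            )
            new[t] = (lp, path + [t])
        best = new
    return max(best.values())[1]

# e.g., with the p_tag / p_word sketches above:
# print(viterbi("the race".split(), TAGSET - {"<s>"}, p_tag, p_word))
```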
An Example • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • How do we tag to/TO race/??? vs. the/DT race/??? • ti = argmaxj P(tj|ti-1) P(wi|tj) • Compare max[P(VB|TO) P(race|VB), P(NN|TO) P(race|NN)] • From Brown: P(NN|TO) = .021, P(race|NN) = .00041, product ≈ .0000086 • P(VB|TO) = .34, P(race|VB) = .00003, product ≈ .00001 • So VB wins: to/TO race/VB
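The comparison spelled out (the conditional probabilities are the Brown-corpus estimates quoted on the slide):

```python
p_vb = 0.34 * 0.00003    # P(VB|TO) * P(race|VB)  ~ 0.0000102
p_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN)  ~ 0.0000086
print("race/VB" if p_vb > p_nn else "race/NN")   # -> race/VB
```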
Performance • This method has achieved 95-96% correct with reasonably complex English tagsets and reasonable amounts of hand-tagged training data.