
Part-of-Speech Tagging CSCI-GA.2590 – Lecture 4

This lecture discusses parts of speech (POS) tagging, including influential tag sets, tricky cases, tokenization, simple models, corpus-based methods, training models, and Markov models for POS tagging.



1. Part-of-Speech Tagging
CSCI-GA.2590 – Lecture 4
Ralph Grishman, NYU

2. Parts of Speech
Grammar is stated in terms of parts of speech (‘preterminals’):
• classes of words sharing syntactic properties: noun, verb, adjective, …

3. POS Tag Sets
The most influential tag sets were those defined for projects to produce large POS-annotated corpora:
• Brown corpus
  • 1 million words from a variety of genres
  • 87 tags
• UPenn Treebank
  • initially 1 million words of Wall Street Journal text
  • later retagged Brown
  • first POS tags, then full parses
  • 45 tags (some distinctions captured in the parses)

4. The Penn POS Tag Set
• Noun categories
  • NN (common singular)
  • NNS (common plural)
  • NNP (proper singular)
  • NNPS (proper plural)
• Verb categories
  • VB (base form)
  • VBZ (3rd person singular present tense)
  • VBP (present tense, other than 3rd person singular)
  • VBD (past tense)
  • VBG (present participle)
  • VBN (past participle)

5. Some Tricky Cases
• present participles which act as prepositions:
  • according/JJ to
• nationalities:
  • English/JJ cuisine
  • an English/NNP sentence
• adjective vs. participle:
  • the striking/VBG teachers
  • a striking/JJ hat
  • he was very surprised/JJ
  • he was surprised/VBN by his wife

6. Tokenization
• any annotated corpus assumes some tokenization
• relatively straightforward for English
  • generally defined by whitespace and punctuation
• treat negative contraction as a separate token: do | n’t
• treat possessive as a separate token: cat | ’s
• do not split hyphenated terms: Chicago-based
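As a concrete illustration of these conventions, here is a minimal Python sketch (the regular expression and function name are our own, not the corpus's actual tokenization rules, and it only handles straight apostrophes):

```python
import re

# words (with internal apostrophes and hyphens) or single punctuation marks
WORD = re.compile(r"[\w'-]+|[^\w\s]")

def tokenize(text):
    tokens = []
    for tok in WORD.findall(text):
        low = tok.lower()
        if low.endswith("n't") and len(tok) > 3:     # doesn't -> does | n't
            tokens.extend([tok[:-3], tok[-3:]])
        elif low.endswith("'s") and len(tok) > 2:    # cat's -> cat | 's
            tokens.extend([tok[:-2], tok[-2:]])
        else:
            tokens.append(tok)                        # Chicago-based stays whole
    return tokens

print(tokenize("The Chicago-based cat's owner doesn't know."))
# ['The', 'Chicago-based', 'cat', "'s", 'owner', 'does', "n't", 'know', '.']
```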

7. The Tagging Task
Task: assigning a POS to each word
• not trivial: many words have several tags
• a dictionary only lists the possible POS, independent of context
• how about using a parser to determine tags?
  • some analyzers (e.g., partial parsers) assume the input is already tagged

8. Why Tag?
• POS tagging can help parsing by reducing ambiguity
• can resolve some pronunciation ambiguities for text-to-speech (“desert”)
• can resolve some semantic ambiguities

9. Simple Models
• Natural language is very complex
• we don't know how to model it fully, so we build simplified models which provide some approximation to natural language

10. Corpus-Based Methods
How can we measure how good these models are?
• we assemble a text corpus
• annotate it by hand with respect to the phenomenon we are interested in
• compare it with the predictions of our model
  • for example, how well the model predicts part-of-speech or syntactic structure

11. Preparing a Good Corpus
• To build a good corpus
  • we must define a task people can do reliably (choose a suitable POS set, for example)
  • we must provide good documentation for the task
    • so annotation can be done consistently
  • we must measure human performance (through dual annotation and inter-annotator agreement)
• Often requires several iterations of refinement

12. Training the Model
How to build a model?
• need a goodness metric
• train by hand, by adjusting rules and analyzing errors (ex: Constraint Grammar)
• train automatically
  • develop new rules
  • build a probabilistic model (generally very hard to do by hand)
• choice of model affected by ability to train it (NN)

13. The Simplest Model
• The simplest POS model considers each word separately:
  • we tag each word with its most likely part-of-speech
• this works quite well: about 90% accuracy when trained and tested on similar texts
  • although many words have multiple parts of speech, one POS typically dominates within a single text type
• How can we take advantage of context to do better?
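A minimal sketch of this most-frequent-tag baseline, assuming the training data is a list of sentences, each a list of (word, tag) pairs (names and data format are our assumptions, not from the slides):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """For each word seen in training, remember its most common POS tag."""
    counts = defaultdict(Counter)
    tag_counts = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
            tag_counts[tag] += 1
    most_likely = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = tag_counts.most_common(1)[0][0]   # fallback for unseen words
    return most_likely, default_tag

def tag_baseline(words, most_likely, default_tag):
    # each word is tagged independently of its context
    return [(w, most_likely.get(w, default_tag)) for w in words]
```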

14. A Language Model
• To see how we might do better, let us consider a related problem: building a language model
• a language model can generate sentences following some probability distribution

15. Markov Model
• In principle each word we select depends on all the decisions which came before (all preceding words in the sentence)
• But we’ll make life simple by assuming that the decision depends on only the immediately preceding decision
  • [first-order] Markov Model
• representable by a finite state transition network
  • T_ij = probability of a transition from state i to state j
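Written out (a standard formulation, not copied from the slide), the first-order assumption factors the probability of a state sequence s_1 … s_n, with s_0 a designated start state, as

$$P(s_1, \ldots, s_n) \;\approx\; \prod_{i=1}^{n} P(s_i \mid s_{i-1}) \;=\; \prod_{i=1}^{n} T_{s_{i-1}\,s_i}$$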

16. Finite State Network
[diagram: a finite-state transition network with start and end states and two speaker states, dog: “woof” and cat: “meow”; the arcs are labeled with transition probabilities (0.30, 0.40, 0.50, …)]

17. Our Bilingual Pets
• Suppose our cat learned to say “woof” and our dog “meow”
• … they started chatting in the next room
• … and we wanted to know who said what

18. Hidden State Network
[diagram: the same start/end network, but the dog and cat states are now hidden, and each state can emit either “woof” or “meow”]

19. How Do We Predict?
• When the cat is talking: t_i = cat
• When the dog is talking: t_i = dog
• We construct a probabilistic model of the phenomenon
• And then seek the most likely state sequence S

20. Hidden Markov Model
• Assume the current word depends only on the current tag
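The slide's formulas are not reproduced in the transcript; under the two assumptions just made (the tags form a first-order Markov chain, and each word depends only on its own tag), the joint probability of a sentence and a tag sequence is conventionally written as

$$P(w_1,\ldots,w_n,\; t_1,\ldots,t_n) \;\approx\; \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)$$

with t_0 a start-of-sentence tag; tagging then means finding the tag sequence that maximizes this product.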

21. HMM for POS Tagging
• We can use the same formulas for POS tagging: states ↔ POS tags

22. Training an HMM
• Training an HMM is simple if we have a completely labeled corpus:
  • we have marked the POS of each word
  • we can directly estimate both P(t_i | t_{i-1}) and P(w_i | t_i) from corpus counts
    • using the Maximum Likelihood Estimator
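A sketch of this counting step in Python (the data format, the <s> and </s> pseudo-tags, and the function name are our own assumptions; smoothing is omitted):

```python
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Maximum-likelihood estimates of transition P(t_i | t_{i-1}) and
    emission P(w_i | t_i) probabilities from a hand-tagged corpus."""
    trans_counts = defaultdict(Counter)   # trans_counts[prev_tag][tag]
    emit_counts = defaultdict(Counter)    # emit_counts[tag][word]
    for sent in tagged_sentences:
        prev = "<s>"                       # sentence-start pseudo-tag
        for word, tag in sent:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
        trans_counts[prev]["</s>"] += 1    # sentence-end pseudo-tag

    transition = {p: {t: c / sum(cnt.values()) for t, c in cnt.items()}
                  for p, cnt in trans_counts.items()}
    emission = {t: {w: c / sum(cnt.values()) for w, c in cnt.items()}
                for t, cnt in emit_counts.items()}
    return transition, emission
```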

23. Greedy Decoder
• the simplest decoder (tagger) assigns tags deterministically, from left to right
• selects t_i to maximize P(w_i | t_i) * P(t_i | t_{i-1})
• does not take advantage of right context
• can we do better?
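A sketch of the greedy decoder, reusing the transition/emission dictionaries from the training sketch above (the NN fallback for unseen words is a placeholder, not the lecture's unknown-word model):

```python
def greedy_decode(words, tags, transition, emission):
    """Pick each tag t_i to maximize P(w_i | t_i) * P(t_i | t_{i-1}),
    given only the previously chosen tag (no right context)."""
    result = []
    prev = "<s>"
    for w in words:
        best_tag, best_score = None, 0.0
        for t in tags:
            score = transition.get(prev, {}).get(t, 0.0) * \
                    emission.get(t, {}).get(w, 0.0)
            if score > best_score:
                best_tag, best_score = t, score
        if best_tag is None:          # unseen word: crude fallback
            best_tag = "NN"
        result.append(best_tag)
        prev = best_tag
    return result
```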

24. < Viterbi decoder >
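The slide's content is elided in the transcript. A standard Viterbi decoder for the HMM above might look like the following sketch (log probabilities avoid underflow; the end-of-sentence transition is omitted for brevity):

```python
import math

def viterbi_decode(words, tags, transition, emission):
    """Find the single most probable tag sequence under the HMM,
    instead of committing greedily one tag at a time."""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = best log-prob of a tag sequence for words[:i+1] ending in t
    best = [{} for _ in words]
    back = [{} for _ in words]
    for t in tags:
        best[0][t] = logp(transition.get("<s>", {}).get(t, 0.0)) + \
                     logp(emission.get(t, {}).get(words[0], 0.0))
        back[0][t] = None
    for i in range(1, len(words)):
        for t in tags:
            e = logp(emission.get(t, {}).get(words[i], 0.0))
            prev_t, prev_score = max(
                ((p, best[i-1][p] + logp(transition.get(p, {}).get(t, 0.0)))
                 for p in tags),
                key=lambda x: x[1])
            best[i][t] = prev_score + e
            back[i][t] = prev_t
    # follow back-pointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```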

25. Performance
• Accuracy with a good unknown-word model, trained and tested on WSJ, is 96.5% to 96.8%

26. Unknown Words
• Problem (as with NB) of zero counts … words not in the training corpus
• simplest: assume all POS are equally likely for unknown words
• can make a better estimate by observing that unknown words are very likely open-class words, and most likely nouns
• base P(t|w) of an unknown word on the probability distribution of words which occur once in the corpus
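A sketch of that last idea: estimate the unknown-word tag distribution from words that occur exactly once in the training corpus (illustrative code; the names are ours):

```python
from collections import Counter

def unknown_tag_distribution(tagged_sentences):
    """Tag distribution of hapax legomena (words seen exactly once),
    used as a proxy for the behavior of unseen words."""
    word_freq = Counter(w for sent in tagged_sentences for w, _ in sent)
    hapax_tags = Counter(t for sent in tagged_sentences
                         for w, t in sent if word_freq[w] == 1)
    total = sum(hapax_tags.values())
    return {t: c / total for t, c in hapax_tags.items()}
```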

27. Unknown Words, cont’d
• can do even better by taking into account the form of a word
  • whether it is capitalized
  • whether it is hyphenated
  • its last few letters
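These word-form cues could be collected by a small helper like the following (a hypothetical function, shown only to make the features concrete):

```python
def unknown_word_features(word, suffix_len=3):
    """Surface features an unknown-word model can condition on."""
    return {
        "capitalized": word[:1].isupper(),
        "hyphenated": "-" in word,
        "suffix": word[-suffix_len:].lower(),   # last few letters
    }
```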

28. Trigram Models
• in some cases we need to look two tags back to find an informative context
  • e.g., conjunction (N and N, V and V, …)
• but there’s not enough data for a pure trigram model
• so combine unigram, bigram, and trigram
  • linear interpolation
  • backoff
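For linear interpolation, a common textbook formulation (not taken from the slide) combines the three estimates with weights λ that sum to 1 and are tuned on held-out data:

$$\hat{P}(t_i \mid t_{i-2}, t_{i-1}) \;=\; \lambda_3\, P(t_i \mid t_{i-2}, t_{i-1}) \;+\; \lambda_2\, P(t_i \mid t_{i-1}) \;+\; \lambda_1\, P(t_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1$$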

29. Domain Adaptation
• Substantial loss in shifting to a new domain
  • 8-10% loss in the shift from WSJ to the biology domain
• adding a small annotated sample (200-500 sentences) in the new domain greatly reduces error
• some reduction possible without annotated target data (Blitzer, Structural Correspondence Learning)

30. Jet Tagger
• HMM-based
• trained on WSJ
• file pos_hmm.txt

31. Transformation-Based Learning
• TBL provides a very different corpus-based approach to part-of-speech tagging
• It learns a set of rules for tagging
• the result is inspectable

32. TBL Model
• TBL starts by assigning each word its most likely part of speech
• Then it applies a series of transformations to the corpus
  • each transformation states some condition and some change to be made to the assigned POS if the condition is met
• for example:
  • Change NN to VB if the preceding tag is TO.
  • Change VBP to VB if one of the previous 3 tags is MD.
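A sketch of what one such transformation might look like in code (a simplified reading of the rules: the context is checked against the tags as they stand before the rule fires, and all names are ours):

```python
from dataclasses import dataclass

@dataclass
class Transformation:
    """Change `from_tag` to `to_tag` when `context_tag` appears within
    `window` positions (negative = before, positive = after)."""
    from_tag: str
    to_tag: str
    context_tag: str
    window: int   # -1 = preceding tag, -3 = one of previous 3, +1 = next tag, ...

    def apply(self, tags):
        new_tags = list(tags)
        for i, t in enumerate(tags):
            if t != self.from_tag:
                continue
            if self.window < 0:
                context = tags[max(0, i + self.window):i]
            else:
                context = tags[i + 1:i + 1 + self.window]
            if self.context_tag in context:
                new_tags[i] = self.to_tag
        return new_tags

# The slide's first example rule: change NN to VB if the preceding tag is TO.
rule = Transformation("NN", "VB", "TO", window=-1)
print(rule.apply(["TO", "NN"]))   # ['TO', 'VB']
```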

33. Transformation Templates
• Each transformation is based on one of a small number of templates, such as
  • Change tag x to y if the preceding tag is z.
  • Change tag x to y if one of the previous 2 tags is z.
  • Change tag x to y if one of the previous 3 tags is z.
  • Change tag x to y if the next tag is z.
  • Change tag x to y if one of the next 2 tags is z.
  • Change tag x to y if one of the next 3 tags is z.

34. Training the TBL Model
• To train the tagger, using a hand-tagged corpus, we begin by assigning each word its most common POS.
• We then try all possible rules (all instantiations of one of the templates) and keep the best rule -- the one which corrects the most errors.
• We do this repeatedly until we can no longer find a rule which corrects some minimum number of errors.
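The greedy training loop, sketched in Python over a single tagged sequence for simplicity (candidate_rules is assumed to be the set of all template instantiations, as Transformation objects from the previous sketch; min_gain plays the role of the minimum number of corrected errors):

```python
def train_tbl(gold_tags, initial_tags, candidate_rules, min_gain=2):
    """Repeatedly keep the rule that corrects the most errors,
    stopping when no rule corrects at least `min_gain` errors."""
    def errors(tags):
        return sum(1 for t, g in zip(tags, gold_tags) if t != g)

    tags = list(initial_tags)        # start from the most-common-POS tagging
    learned = []
    while True:
        best_rule, best_gain = None, 0
        for rule in candidate_rules:
            gain = errors(tags) - errors(rule.apply(tags))
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None or best_gain < min_gain:
            break
        tags = best_rule.apply(tags)
        learned.append(best_rule)
    return learned
```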

35. Some Transformations
[table: the first 9 transformations found for the WSJ corpus]

36. TBL Performance
• Performance competitive with a good HMM
  • accuracy 96.6% on WSJ
• Compared to the HMM, much slower to train, but faster to apply
