
CSA2050: Introduction to Computational Linguistics



Presentation Transcript


  1. CSA2050: Introduction to Computational Linguistics • Part of Speech (POS) Tagging I • Introduction • Tagsets • Approaches

  2. Acknowledgment • Most slides taken from Bonnie Dorr’s course notes: www.umiacs.umd.edu/~bonnie/courses/cmsc723-03 • In turn based on Jurafsky & Martin Chapter 8 (CLINT Lecture IV)

  3. Bibliography • R. Weischedel, R. Schwartz, J. Palmucci, M. Meteer, L. Ramshaw, Coping with Ambiguity and Unknown Words through Probabilistic Models, Computational Linguistics 19(2), pp. 359-382, 1993 • C. Samuelsson, Morphological tagging based entirely on Bayesian inference, in 9th Nordic Conference on Computational Linguistics, NODALIDA-93, Stockholm, 1993 • A. Ratnaparkhi, A maximum entropy model for part of speech tagging, Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1996

  4. Outline • The tagging task • Tagsets • Three different approaches

  5. Definition: PoS-Tagging • “Part-of-Speech Tagging is the process of assigning a part-of-speech or other lexical class marker to each word in a corpus” (Jurafsky and Martin) • [Figure: each of the WORDS “the girl kissed the boy on the cheek” is linked to one of the TAGS N, V, P, DET]

  6. Motivation • Corpus analysis of tagged corpora yields useful information • Speech synthesis: pronunciation depends on word class, e.g. CONtent (N) vs. conTENT (Adj) • Speech recognition: word class-based N-grams predict the category of the next word • Information retrieval: stemming, selection of high-content words • Word-sense disambiguation

  7. English Parts of Speech • Pronoun: any substitute for a noun or noun phrase • Adjective: any qualifier of a noun • Verb: any action or state of being • Adverb: any qualifier of an adjective or verb • Preposition: any establisher of relation and syntactic context • Conjunction: any syntactic connector • Interjection: any emotional greeting (or "exclamation")

  8. Tagsets: how detailed?

  9. Penn Treebank Tagset • [Table not recovered; only the entries PRP and PRP$ survive]

  10. Example of Penn Treebank Tagging of Brown Corpus Sentence • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. • Book/VB that/DT flight/NN ./. • Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.
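The word/TAG notation on this slide is easy to process mechanically. The following is a minimal sketch, assuming tokens are separated by whitespace and the tag follows the last slash; the function name `parse_tagged` is illustrative and not part of the course material.

```python
def parse_tagged(sentence: str) -> list[tuple[str, str]]:
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rpartition splits on the LAST slash, so the punctuation token
        # "./." becomes (".", ".") and words containing a slash survive.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged(
    "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
    "number/NN of/IN other/JJ topics/NNS ./."
)
words = [word for word, _ in tagged]
tags = [tag for _, tag in tagged]
```

Keeping words and tags as parallel sequences mirrors the WORDS/TAGS view of the task: a tagger receives `words` and has to produce `tags`.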

  11. 2 Problems • Multiple tags for the same word • Unknown words

  12. Multiple tags for the same word • He can can a can. • I can light a fire and you can open a can of beans. Now the can is open, and we can eat in the light of the fire. • Flying planes can be dangerous.

  13. Multiple tags for the same word • Words often belong to more than one word class, e.g. “this”: • This is a nice day = PRP (pronoun) • This day is nice = DT (determiner) • You can go this far = RB (adverb) • Many of the most common words (by volume of text) are ambiguous

  14. How Hard is the Tagging Task? • In the Brown Corpus: 11.5% of word types are ambiguous; 40% of word tokens are ambiguous • Most words in English are unambiguous. • Many of the most common words are ambiguous. • Typically ambiguous tags are not equally probable.
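The gap between the type figure and the token figure can be reproduced on a toy example. The sketch below assumes a corpus given as (word, tag) pairs and counts a token as ambiguous when its word type occurs with more than one tag; the nine-token corpus and the name `ambiguity_stats` are invented for illustration and are not Brown Corpus data.

```python
from collections import Counter, defaultdict


def ambiguity_stats(tagged_tokens):
    """Return (share of ambiguous word types, share of ambiguous tokens)."""
    tags_by_word = defaultdict(set)
    token_counts = Counter()
    for word, tag in tagged_tokens:
        key = word.lower()
        tags_by_word[key].add(tag)
        token_counts[key] += 1
    # A word type is ambiguous if it was seen with more than one tag.
    ambiguous = {w for w, seen in tags_by_word.items() if len(seen) > 1}
    type_rate = len(ambiguous) / len(tags_by_word)
    token_rate = sum(token_counts[w] for w in ambiguous) / sum(token_counts.values())
    return type_rate, token_rate


# Toy corpus (invented): "He can can a can. The can is open."
toy_corpus = [
    ("He", "PRP"), ("can", "MD"), ("can", "VB"), ("a", "DT"), ("can", "NN"),
    ("the", "DT"), ("can", "NN"), ("is", "VBZ"), ("open", "JJ"),
]
type_rate, token_rate = ambiguity_stats(toy_corpus)
```

Here only one of the six word types ("can") is ambiguous, yet it accounts for four of the nine tokens, which is the same pattern as on the slide: few ambiguous types, many ambiguous tokens, because the ambiguous words are frequent.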

  15. Word Class Ambiguity (in the Brown Corpus) (DeRose, 1988) • Unambiguous (1 tag): 35,340 types • Ambiguous (2-7 tags): 4,100 types

  16. 3 Approaches to Tagging • Rule-Based Tagger: ENGCG Tagger (Voutilainen 1995, 1999) • Stochastic Tagger: HMM-based Tagger • Transformation-Based Tagger: Brill Tagger (Brill 1995)

  17. Unknown Words • Assume every unknown word is ambiguous among all possible tags. Advantage: simplicity. Disadvantage: ignores the fact that unknown words are unlikely to be closed class • Assume that the probability distribution of tags for unknown words is the same as for words that have been seen just once • Make use of morphological information
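The second strategy on this slide can be sketched in a few lines: collect the tags of words that occur exactly once in the training data and use their relative frequencies as the tag distribution for unknown words. The training pairs and the function name `hapax_tag_distribution` below are assumptions made for illustration.

```python
from collections import Counter


def hapax_tag_distribution(tagged_tokens):
    """Estimate p(tag | unknown word) from words seen exactly once."""
    word_counts = Counter(word for word, _ in tagged_tokens)
    hapax_tags = Counter(
        tag for word, tag in tagged_tokens if word_counts[word] == 1
    )
    total = sum(hapax_tags.values())
    if total == 0:
        raise ValueError("no word occurs exactly once in the training data")
    return {tag: count / total for tag, count in hapax_tags.items()}


# Invented training data: "the" is frequent, every other word occurs once.
training = [
    ("the", "DT"), ("dog", "NN"), ("the", "DT"), ("cat", "NN"),
    ("barked", "VBD"), ("the", "DT"), ("Malta", "NNP"),
]
unknown_dist = hapax_tag_distribution(training)
```

On this toy data the closed-class tag DT receives no probability mass, because the determiner is never a once-seen word. That is the property the first strategy lacks.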

  18. Combining Features • The last method makes use of different features, e.g. ending in -ed (suggests a verb) or an initial capital (suggests a proper noun). • Typically, a given tag is correlated with a combination of such features. These have to be incorporated into the statistical model.

  19. Combining Tag-Predicting Features in Unknown Words • HMM Models • Weischedel et al. (1993): for each feature f and tag t (e.g. proper noun), build a probability estimator p(f|t); assume the features are independent and multiply the probabilities together • Samuelsson (1993): rather than preselecting features, considers all possible suffixes up to length 10 as features for predicting tags
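Both ideas on this slide can be illustrated with a rough sketch. The first part multiplies per-feature estimates p(f|t) under the independence assumption; the second part enumerates word suffixes up to length 10. The two binary features, the probability values, and all function names are invented for illustration and are not taken from Weischedel et al. or Samuelsson.

```python
# Hypothetical estimates of p(feature present | tag); numbers are invented.
P_FEATURE_GIVEN_TAG = {
    "NNP": {"capitalized": 0.90, "ends_in_ed": 0.01},
    "VBD": {"capitalized": 0.05, "ends_in_ed": 0.70},
    "NN": {"capitalized": 0.05, "ends_in_ed": 0.02},
}


def extract_features(word):
    """Map a word to the two binary features used in this sketch."""
    return {
        "capitalized": word[:1].isupper(),
        "ends_in_ed": word.endswith("ed"),
    }


def feature_likelihood(word, tag):
    """Multiply per-feature probabilities, assuming independence given the tag."""
    likelihood = 1.0
    for name, present in extract_features(word).items():
        p = P_FEATURE_GIVEN_TAG[tag][name]
        likelihood *= p if present else 1.0 - p
    return likelihood


def best_tag(word):
    """Pick the tag whose feature likelihood is highest."""
    return max(P_FEATURE_GIVEN_TAG, key=lambda tag: feature_likelihood(word, tag))


def suffixes(word, max_len=10):
    """All suffixes of the word up to max_len characters, shortest first."""
    return [word[-k:] for k in range(1, min(max_len, len(word)) + 1)]


guess_verb = best_tag("blorked")
guess_name = best_tag("Zorbia")
walked_suffixes = suffixes("walked")
```

In a full HMM tagger the product would be combined with the tag transition probabilities rather than used alone; this sketch isolates only the feature-combination step.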

  20. Combining Tag-Predicting Features in Unknown Words • Maximum Entropy (ME) Models • An ME model is a classifier which assigns a class to an observation by computing a probability from an exponential function of a weighted set of features of the observation • An MEMM uses the Viterbi Algorithm to extend the application of ME to labelling a sequence of observations. • For further details see Ratnaparkhi (1996)
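The "exponential function of a weighted set of features" can be made concrete as follows. For each tag, the weights of the active features are summed, exponentiated, and normalized so that the scores form a probability distribution. The feature names, weights, and tag inventory in this sketch are hypothetical; a real model would learn the weights from training data.

```python
import math

# Hypothetical weights for (feature, tag) pairs; values are invented.
WEIGHTS = {
    ("ends_in_ed", "VBD"): 2.0,
    ("capitalized", "NNP"): 2.5,
    ("prev_tag_is_DT", "NN"): 1.5,
}
TAGS = ["NN", "NNP", "VBD"]


def maxent_probs(active_features):
    """Return p(tag | observation) for every tag in TAGS."""
    # Linear score per tag: sum of the weights of the features that fire.
    scores = {
        tag: sum(WEIGHTS.get((feature, tag), 0.0) for feature in active_features)
        for tag in TAGS
    }
    # Exponentiate and normalize so the values sum to one.
    z = sum(math.exp(score) for score in scores.values())
    return {tag: math.exp(score) / z for tag, score in scores.items()}


probs = maxent_probs({"ends_in_ed"})
predicted = max(probs, key=probs.get)
```

Unlike the product of independent p(f|t) estimates, the weights here are not probabilities, so overlapping features can be combined without an independence assumption. An MEMM would apply such a classifier at each position and use Viterbi search to pick the best tag sequence.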

  21. Summary • External parameters to the tagging task are (i) the size of the chosen tagset and (ii) the coverage of the lexicon which gives possible tags to words. • Two main problems: (i) disambiguation of tags and (ii) dealing with unknown words • Several methods are available for dealing with (ii): HMMs and MEMMs
