CSA2050: Natural Language Processing


Presentation Transcript


  1. CSA2050: Natural Language Processing: Tagging 1 • Tagging • POS and Tagsets • Ambiguities • NLTK CSA3050: Tagging I

  2. Tagging 1 Lecture • Slides based on Mike Rosner and Marti Hearst notes • Diane Litman’s version of Steven Bird’s notes • Additions from NLTK tutorials

  3. Tagging • Mr. Sherlock Holmes, who was usually very X, … • What is the part of speech of X?

  4. Tagging • Mr. Sherlock Holmes, who was usually very late/ADJ in the mornings, save upon those not infrequent occasions when he was up all night, was Y … • What is the part of speech of Y?

  5. Tagging • Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated/VBN at the breakfast table.

  6. Tagging Terminology • Tagging: the process of associating a label with each token in a text • Tags: the labels • Tag Set: the collection of tags used for a particular task

  7. Tagging Example • Typically a tagged text is a sequence of white-space separated base/tag tokens: • The/at Pantheon’s/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
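The base/tag format above is easy to read back into (word, tag) pairs. The following is a minimal sketch, not part of the original slides; the function name `parse_tagged` is an assumption made for illustration, and the sample sentence is taken from the slide.

```python
def parse_tagged(text):
    """Split whitespace-separated base/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so that a word containing '/' stays intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash at all: treat the token as an untagged word.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


sample = "Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn ./."
tagged = parse_tagged(sample)
print(tagged)
```

Splitting on the last slash rather than the first is a design choice: punctuation tokens such as `./.` and `,/,` then come out as a one-character word with a one-character tag.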

  8. What does tagging do? • Collapses Some Distinctions • Lexical identity may be discarded • e.g. all personal pronouns tagged with PRP • … But Introduces Others • Ambiguities may be removed • e.g. deal tagged with NN or VB • e.g. deal tagged with DEAL1 or DEAL2 • Helps classification and prediction

  9. Parts of Speech (POS) • A word’s POS tells us a lot about the word and its neighbors: • Limits the range of meanings (deal), pronunciation (OBject as a noun vs obJECT as a verb) or both (wind) • Helps in stemming • Limits the range of following words for Speech Recognition • Can help select nouns from a document for IR • Basis for partial parsing (chunked parsing) • Parsers can build trees directly on the POS tags instead of maintaining a lexicon

  10. POS and Tagsets • The choice of tagset greatly affects the difficulty of the problem • Need to strike a balance between • Getting better information about context (best: introduce more distinctions) • Making it possible for classifiers to do their job (need to minimize distinctions)

  11. Common Tagsets • Brown corpus: 87 tags • Penn Treebank: 45 tags • Lancaster UCREL C5 (used to tag the British National Corpus - BNC): 61 tags • Lancaster C7: 145 tags

  12. Brown Corpus • The first digital corpus (1961) • Francis and Kucera, Brown University • Contents: 500 texts, each 2000 words long • From American books, newspapers, magazines • Representing genres: • Science fiction, romance fiction, press reportage, scientific writing, popular lore

  13. Penn Treebank • First syntactically annotated corpus • 1 million words from Wall Street Journal • Part of speech tags and syntax trees

  14. Penn Treebank • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. • Book/VB that/DT flight/NN ./. • Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.

  15. Penn Treebank [slide content not preserved in transcript]

  16. Penn Treebank – Important Tags [slide content not preserved in transcript]

  17. Penn Treebank – Verb Tags [slide content not preserved in transcript]

  18. Penn Treebank Example (S (NP-SBJ-1 (DT The) (NNP Senate)) (VP (VBZ plans) (S (NP-SBJ (-NONE- *-1)) (VP (TO to) (VP (VB take) (PRT (RP up)) (NP (DT the) (NN measure)) (ADV-TMP (RB quickly)))))) (. .))
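The part-of-speech tags sit at the leaves of the bracketed tree above, so they can be pulled out without a full parser. The sketch below is an illustration only: the names `TREE`, `LEAF`, `leaves` and `visible` are assumptions, and the regular expression simply looks for innermost `(TAG word)` pairs rather than checking that the brackets balance.

```python
import re

# The slide's example tree, laid out over several lines for readability.
TREE = """(S (NP-SBJ-1 (DT The) (NNP Senate))
   (VP (VBZ plans)
       (S (NP-SBJ (-NONE- *-1))
          (VP (TO to)
              (VP (VB take) (PRT (RP up))
                  (NP (DT the) (NN measure))
                  (ADV-TMP (RB quickly))))))
   (. .))"""

# An innermost bracket holds exactly two items with no nested parentheses.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves(tree):
    """Return (word, tag) pairs for every leaf, in left-to-right order."""
    return [(word, tag) for tag, word in LEAF.findall(tree)]


def visible(pairs):
    """Drop empty elements such as the trace *-1, which carry the tag -NONE-."""
    return [(word, tag) for word, tag in pairs if tag != "-NONE-"]


print(" ".join(f"{word}/{tag}" for word, tag in visible(leaves(TREE))))
```

Filtering out `-NONE-` leaves the ten tokens that actually appear in the sentence, in the same word/tag form used on the earlier Penn Treebank slide.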

  19. Tagging • Typically the set of tags is larger than basic parts of speech • Tags often contain some morphological information • Often referred to as “morphosyntactic labels”
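One way to see the morphological information inside a tag is to strip it away. The sketch below collapses Penn Treebank style tags into a few coarse classes; the `COARSE` table and its prefix-matching rule are assumptions made for illustration, not a standard mapping.

```python
# Hypothetical coarse classes, keyed by tag prefix (an illustrative assumption).
COARSE = {"NN": "NOUN", "VB": "VERB", "JJ": "ADJ"}


def coarsen(tag):
    """Map a fine-grained tag to a coarse class; unknown tags pass through."""
    for prefix, coarse in COARSE.items():
        if tag.startswith(prefix):
            return coarse
    return tag


# Tags of "The grand jury commented on a number of other topics ."
fine = ["DT", "JJ", "NN", "VBD", "IN", "DT", "NN", "IN", "JJ", "NNS", "."]
coarse = [coarsen(tag) for tag in fine]

print(coarse)
print(len(set(fine)), "distinct fine tags,", len(set(coarse)), "distinct coarse tags")
```

After coarsening, NN and NNS fall together, so the singular/plural distinction is gone, and VBD no longer records past tense. That lost detail is the morphological part of the label.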

  20. Tagging Ambiguities • FRUIT/N FLIES/N-V LIKE/V-IN A/DT BANANA/N

  21. Interpretation 1 (S (NP (N FRUIT) (N FLIES)) (VP (V LIKE) (NP (DT A) (N BANANA))))

  22. Interpretation 2 (S (NP (N FRUIT)) (VP (V FLIES) (PP (IN LIKE) (NP (DT A) (N BANANA)))))
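The two interpretations correspond to two of the tag sequences that the per-word candidate tags allow. A small sketch, assuming only the candidate tags shown on the ambiguity slide (the names `CANDIDATES` and `sequences` are illustrative), enumerates all of them:

```python
from itertools import product

# Candidate tags per word, as given on the "Tagging Ambiguities" slide.
CANDIDATES = [
    ("FRUIT", ["N"]),
    ("FLIES", ["N", "V"]),
    ("LIKE", ["V", "IN"]),
    ("A", ["DT"]),
    ("BANANA", ["N"]),
]

words = [word for word, _ in CANDIDATES]
# Cartesian product of the per-word tag lists: one tuple per tag sequence.
sequences = list(product(*(tags for _, tags in CANDIDATES)))

for seq in sequences:
    print(" ".join(f"{word}/{tag}" for word, tag in zip(words, seq)))
```

Two ambiguous words with two tags each give four candidate sequences. Only N N V DT N and N V IN DT N match the trees above; a tagger has to rule out the other two as well as choose between these.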

  23. Lots of ambiguities… • He can can a can. • I can light a fire and you can open a can of beans. Now the can is open, and we can eat in the light of the fire.

  24. Lots of ambiguities… • In the Brown Corpus • 11.5% of word types are ambiguous • 40% of word tokens are ambiguous • Most words in English are unambiguous. • Many of the most common words are ambiguous. • Typically ambiguous tags are not equally probable.
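Because the tags of an ambiguous word are usually not equally probable, a simple baseline is to give every word the tag it carries most often in some training data. The sketch below is an illustration under stated assumptions: the toy corpus is hand-tagged for this example (it is not Brown Corpus data), and `train`, `tag` and the fallback tag `"NN"` are hypothetical choices.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged toy corpus, for illustration only.
TOY = [
    ("he", "PRP"), ("can", "MD"), ("light", "VB"), ("a", "DT"), ("fire", "NN"),
    ("you", "PRP"), ("can", "MD"), ("open", "VB"), ("a", "DT"), ("can", "NN"),
    ("we", "PRP"), ("can", "MD"), ("eat", "VB"),
    ("the", "DT"), ("can", "NN"), ("is", "VBZ"), ("open", "JJ"),
]


def train(tagged):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for word, tag_ in tagged:
        counts[word.lower()][tag_] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(words, model, default="NN"):
    """Tag each word with its most frequent tag; unseen words get `default`."""
    return [(word, model.get(word.lower(), default)) for word in words]


model = train(TOY)
print(tag(["the", "can", "can", "light"], model))
```

In the toy data "can" is a modal three times and a noun twice, so the baseline tags every "can" as MD, including the noun in "the can". Errors of this kind are what context-sensitive taggers aim to fix.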

  25. Lots of ambiguities… • Brown Corpus • Unambiguous (1 tag): 35,340 types • Ambiguous (2-7 tags): 4,100 types • (Table: DeRose, 1988)
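Type-level and token-level ambiguity figures of this kind can be computed from any tagged corpus. The sketch below shows the calculation on the sentence "He can can a can." with tags assigned by hand for illustration; the function name `ambiguity` is an assumption, and the numbers it prints describe this one sentence, not the Brown Corpus.

```python
from collections import defaultdict


def ambiguity(tagged):
    """Return (share of ambiguous word types, share of ambiguous word tokens)."""
    tags_of = defaultdict(set)
    for word, tag in tagged:
        tags_of[word].add(tag)
    # A type is ambiguous if it occurs with more than one tag.
    ambiguous = {word for word, tags in tags_of.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_of)
    token_rate = sum(1 for word, _ in tagged if word in ambiguous) / len(tagged)
    return type_rate, token_rate


# "He can can a can ." with hand-assigned tags (illustrative assumption).
SENTENCE = [("he", "PRP"), ("can", "MD"), ("can", "VB"),
            ("a", "DT"), ("can", "NN"), (".", ".")]

type_rate, token_rate = ambiguity(SENTENCE)
print(f"ambiguous types: {type_rate:.0%}, ambiguous tokens: {token_rate:.0%}")
```

Only one of the four word types is ambiguous, yet it accounts for half of the tokens. The same effect, frequent words being the ambiguous ones, is why the token figure for a corpus is much higher than the type figure.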

  26. Approaches to Tagging • Rule-Based Tagger: ENGTWOL Tagger (Voutilainen 1995) • Stochastic Tagger: HMM-based Tagger • Transformation-Based Tagger: Brill Tagger (Brill 1995)

  27. NLTK • Natural Language Toolkit (NLTK) • http://nltk.sourceforge.net/ • Please download and install! • Runs on Python

  28. NLTK Introduction • The Natural Language Toolkit (NLTK) provides: • Basic classes for representing data relevant to natural language processing. • Standard interfaces for performing tasks, such as tokenization, tagging, and parsing. • Standard implementations of each task, which can be combined to solve complex problems. • Two versions: NLTK and NLTK-Lite

  29. NLTK Modules • nltk.token: processing individual elements of text, such as words or sentences. • nltk.probability: modeling frequency distributions and probabilistic systems. • nltk.tagger: tagging tokens with supplemental information, such as parts of speech or wordnet sense tags. • nltk.parser: high-level interface for parsing texts. • nltk.chartparser: a chart-based implementation of the parser interface. • nltk.chunkparser: a regular-expression based surface parser.

  30. Python for NLP • Python is a great language for NLP: • Simple • Easy to debug: exceptions, interpreted language • Easy to structure: modules, object-oriented programming • Powerful string manipulation

  31. Python Modules and Packages • Python modules “package program code and data for reuse.” (Lutz) • Similar to library in C, package in Java. • Python packages are hierarchical modules (i.e., modules that contain other modules). • Three commands for accessing modules: import, from…import, reload

  32. Import Command • The import command loads a module: # Load the regular expression module >>> import re • To access the contents of a module, use dotted names: # Use the search method from the re module >>> re.search(r'\w+', str) • To list the contents of a module, use dir: >>> dir(re) ['DOTALL', 'I', 'IGNORECASE', …]
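The three steps on this slide can be run as one short script. This is a minimal sketch: the variable name `text` and the sample sentence are assumptions added so that the call has a real string to search.

```python
# Load the regular expression module.
import re

text = "Fruit flies like a banana"

# Dotted name: the search function lives in the re module's namespace.
match = re.search(r"\w+", text)
print(match.group())  # the first run of word characters in `text`

# dir() lists the names a module defines; show a few public ones.
public_names = [name for name in dir(re) if not name.startswith("_")]
print(public_names[:5])
```

The pattern is written as a raw string (`r"\w+"`) so that the backslash reaches the regular expression engine unchanged.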

  33. from...import • The from…import command loads individual functions and objects from a module: # Load the search function from the re module >>> from re import search • Once an individual function or object is loaded with from…import, it can be used directly: # Use the search method from the re module >>> search(r'\w+', str)

  34. Import vs. from...import • import • Keeps module functions separate from user functions. • Requires the use of dotted names. • Works with reload. • from…import • Puts module functions and user functions together. • More convenient names. • Does not work with reload.
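The practical consequence of putting module functions and user functions together is that one can silently replace the other. The sketch below illustrates this with a deliberately clashing user function; the name `imported_search` and the function body are assumptions made for the demonstration.

```python
import re
from re import search

# Keep a reference to the imported function before the name is reused.
imported_search = search


def search(pattern, text):
    """A user function that happens to share the imported name."""
    return "user-defined search"


# The bare name now refers to the user function...
print(search(r"\w+", "tagging"))
# ...while the dotted name still reaches the module's own function.
print(re.search(r"\w+", "tagging").group())
```

With plain `import re`, the module's functions stay behind the `re.` prefix, so a user function called `search` cannot shadow them.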

  35. Reload • If you edit a module, you must use the reload command before the changes become visible in Python: >>> import mymodule ... >>> reload(mymodule) • The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.
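The bare `reload(...)` shown on the slide is the Python 2 built-in; in Python 3 the same operation is `importlib.reload`. The sketch below demonstrates both points from the slide under Python 3. The module name `mymodule_demo`, the temporary directory and the `greet` function are assumptions created only for this demonstration.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway module on disk so there is something to edit and reload.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "mymodule_demo.py")
sys.path.insert(0, workdir)
sys.dont_write_bytecode = True  # avoid a stale cached .pyc in this quick demo

with open(path, "w") as f:
    f.write("def greet():\n    return 'old'\n")

import mymodule_demo                # loaded with import
from mymodule_demo import greet     # loaded with from...import

# "Edit" the module by rewriting its source file.
with open(path, "w") as f:
    f.write("def greet():\n    return 'new!'\n")

importlib.invalidate_caches()
importlib.reload(mymodule_demo)

print(mymodule_demo.greet())  # dotted name sees the edited code
print(greet())                # name bound by from...import still has the old code
```

The dotted name is looked up in the module each time it is used, so it picks up the reloaded definition, whereas `from...import` copied a reference to the old function object at import time.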

  37. Next Sessions… • Rule-Based Tagging • Stochastic Tagging • Hidden Markov Models (HMMs) • N-Grams • Read Jurafsky and Martin Chapter 4 (PDF) • Install NLTK
