CSA3180: Natural Language Processing



  1. CSA3180: Natural Language Processing
  Text Processing 2
  • Python and NLTK
  • Shallow Parsing and Chunking
  • NLTK-Lite Exercises

  2. Python and NLTK
  • Natural Language Toolkit (NLTK)
  • http://nltk.sourceforge.net/
  • NLTK slides partly based on Diane Litman's lectures
  • Chunk parsing slides partly based on Marti Hearst's lectures

  3. Python for NLP
  • Python is a great language for NLP:
    • Simple
    • Easy to debug:
      • Exceptions
      • Interpreted language
    • Easy to structure:
      • Modules
      • Object-oriented programming
    • Powerful string manipulation

  4. Python Modules and Packages
  • Python modules “package program code and data for reuse” (Lutz).
  • Similar to a library in C or a package in Java.
  • Python packages are hierarchical modules (i.e., modules that contain other modules).
  • Three commands for accessing modules:
    • import
    • from…import
    • reload

  5. Import Command
  • The import command loads a module:
      # Load the regular expression module
      >>> import re
  • To access the contents of a module, use dotted names:
      # Use the search function from the re module
      >>> re.search(r'\w+', s)
  • To list the contents of a module, use dir:
      >>> dir(re)
      ['DOTALL', 'I', 'IGNORECASE', ...]

  6. from…import
  • The from…import command loads individual functions and objects from a module:
      # Load the search function from the re module
      >>> from re import search
  • Once an individual function or object is loaded with from…import, it can be used directly:
      # Use the search function without the module prefix
      >>> search(r'\w+', s)
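  A minimal runnable sketch contrasting the two import styles above (the sample string s is an arbitrary example, not from the slides):

      # Style 1: import the module and use dotted names.
      import re
      s = "the little cat sat on the mat"
      print(re.search(r"\w+", s).group())    # -> "the"

      # Style 2: import the name directly and call it without the module prefix.
      from re import search
      print(search(r"\w+", s).group())       # -> "the"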

  7. import vs. from…import
  • import
    • Keeps module functions separate from user functions.
    • Requires the use of dotted names.
    • Works with reload.
  • from…import
    • Puts module functions and user functions together.
    • More convenient names.
    • Does not work with reload.

  8. Reload
  • If you edit a module, you must use the reload command before the changes become visible in Python:
      >>> import mymodule
      ...
      >>> reload(mymodule)
  • The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from…import.
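  Note that a bare reload built-in is Python 2 usage (the NLTK-Lite era); in Python 3 the same operation is importlib.reload. A minimal sketch, assuming a module file mymodule.py (the hypothetical name from the slide) exists on the path:

      import importlib
      import mymodule                      # hypothetical module from the slide

      # ... edit mymodule.py on disk ...

      # In Python 3, reload is a function in importlib rather than a built-in.
      mymodule = importlib.reload(mymodule)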

  9. NLTK Introduction
  • The Natural Language Toolkit (NLTK) provides:
    • Basic classes for representing data relevant to natural language processing.
    • Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
    • Standard implementations of each task, which can be combined to solve complex problems.
  • Two versions: NLTK and NLTK-Lite
  • Using NLTK-Lite for this course

  10. NLTK Example Modules
  • nltk.token: processing individual elements of text, such as words or sentences.
  • nltk.probability: modeling frequency distributions and probabilistic systems.
  • nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags.
  • nltk.parser: high-level interface for parsing texts.
  • nltk.chartparser: a chart-based implementation of the parser interface.
  • nltk.chunkparser: a regular-expression based surface parser.
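  The module names above are those of NLTK/NLTK-Lite at the time; current NLTK releases expose the same functionality under different names. A rough sketch of roughly equivalent calls in a modern NLTK install (assumes the tokenizer and tagger resources have already been fetched with nltk.download):

      import nltk

      # One-time resource downloads (current-NLTK resource names):
      # nltk.download("punkt")
      # nltk.download("averaged_perceptron_tagger")

      text = "The little cat sat on the mat."

      tokens = nltk.word_tokenize(text)   # tokenization (cf. nltk.token)
      tagged = nltk.pos_tag(tokens)       # part-of-speech tagging (cf. nltk.tagger)
      freqs  = nltk.FreqDist(tokens)      # frequency distribution (cf. nltk.probability)

      print(tagged)
      print(freqs.most_common(3))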

  11. Shallow/Chunk Parsing
  • Goal: divide a sentence into a sequence of chunks.
  • Chunks are non-overlapping regions of a text:
      [I] saw [a tall man] in [the park].
  • Chunks are non-recursive:
    • A chunk cannot contain other chunks.
  • Chunks are non-exhaustive:
    • Not all words are included in chunks.

  12. Chunk Parsing Examples
  • Noun-phrase chunking:
      [I] saw [a tall man] in [the park].
  • Verb-phrase chunking:
      The man who [was in the park] [saw me].
  • Prosodic chunking:
      [I saw] [a tall man] [in the park].
  • Question answering:
      What [Spanish explorer] discovered [the Mississippi River]?

  13. Motivation
  • Locating information (e.g. text retrieval):
    • Index a document collection on its noun phrases.
  • Ignoring information:
    • Generalize in order to study higher-level patterns.
    • e.g. phrases involving “gave” in the Penn Treebank:
      gave NP; gave up NP in NP; gave NP up; gave NP help; gave NP to NP
  • Sometimes a full parse has too much structure:
    • Too nested.
    • Chunks usually are not recursive.

  14. Representation
  • BIO (or IOB) tags
  • Trees
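  As an illustration of the BIO/IOB encoding (hand-labelled for the running example, not tool output): B marks the first token of a chunk, I a token inside a chunk, and O a token outside any chunk. The same structure can equivalently be drawn as a flat tree with NP nodes over the chunks.

      I/B-NP  saw/O  a/B-NP  tall/I-NP  man/I-NP  in/O  the/B-NP  park/I-NP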

  15. Comparison with Full Parsing
  • Parsing is usually an intermediate stage:
    • Builds structures that are used by later stages of processing.
  • Full parsing is a sufficient but not necessary intermediate stage for many NLP tasks:
    • Parsing often provides more information than we need.
  • Shallow parsing is an easier problem:
    • Less word-order flexibility within chunks than between chunks.
    • More locality:
      • Fewer long-range dependencies
      • Less context-dependence
      • Less ambiguity

  16. Chunks and Constituency
  • Constituents: [[a tall man] [in [the park]]].
  • Chunks: [a tall man] in [the park].
  • A constituent is part of some higher unit in the hierarchical syntactic parse.
  • Chunks are not constituents:
    • Constituents are recursive.
  • But chunks are typically subsequences of constituents:
    • Chunks do not cross major constituent boundaries.

  17. Chunk Parsing in NLTK
  • Chunk parsers usually ignore lexical content:
    • Only need to look at part-of-speech tags.
  • Possible steps in chunk parsing:
    • Chunking, unchunking
    • Chinking
    • Merging, splitting
  • Evaluation:
    • Compare to a baseline.
    • Evaluate in terms of precision, recall, F-measure (standard definitions below).
    • Missed (false negative), incorrect (false positive).
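  For reference, the standard definitions behind those three measures (not spelled out on the slide), counting whole chunks as the units:

      Precision P = (# correct chunks found) / (# chunks proposed by the parser)
      Recall    R = (# correct chunks found) / (# chunks in the gold standard)
      F-measure F = 2PR / (P + R)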

  18. Chunk Parsing in NLTK
  • Define a regular expression that matches the sequences of tags in a chunk.
  • A simple noun-phrase chunk regexp (note that <NN.?> matches NN plus at most one further character, e.g. NN, NNS, NNP):
      <DT>? <JJ>* <NN.?>
  • Chunk all matching subsequences:
      the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
      [the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
  • If matching subsequences overlap, the first one gets priority.
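  A runnable sketch of the noun-phrase rule above, written for the regular-expression chunker in current NLTK releases (nltk.RegexpParser; the class names differ slightly from the NLTK-Lite API used in these slides):

      import nltk

      # The NP pattern from the slide: optional determiner, any adjectives, a noun tag.
      grammar = r"NP: {<DT>?<JJ>*<NN.?>}"

      tagged = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
                ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

      parser = nltk.RegexpParser(grammar)
      print(parser.parse(tagged))
      # Expected structure:
      # (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))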

  19. Unchunking
  • Remove any chunk with a given pattern.
  • e.g., UnChunkRule('<NN|DT>+', 'Unchunk NN/DT')
  • Combine with the chunk rule <NN|DT|JJ>+ and chunk all matching subsequences:
    • Input:
        the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
    • Apply the chunk rule:
        [the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
    • Apply the unchunk rule:
        [the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN
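  The rule classes still exist in current NLTK under nltk.chunk.regexp, so the example above can be reproduced roughly as follows (a sketch under that assumption; constructor details may differ from the NLTK-Lite API quoted on the slide):

      from nltk import Tree
      from nltk.chunk.regexp import ChunkRule, UnChunkRule, RegexpChunkParser

      tagged = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
                ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

      rules = [
          ChunkRule("<NN|DT|JJ>+", "Chunk sequences of NN, DT and JJ"),
          UnChunkRule("<NN|DT>+", "Unchunk chunks made only of NN and DT"),
      ]
      parser = RegexpChunkParser(rules, chunk_label="NP")
      print(parser.parse(Tree("S", tagged)))
      # Only (NP the/DT little/JJ cat/NN) survives; the NN/DT-only chunk is removed.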

  20. Chinking
  • A chink is a subsequence of the text that is not a chunk.
  • Define a regular expression that matches the sequences of tags in a chink.
  • A simple chink regexp for finding NP chunks:
      (<VB.?>|<IN>)+
  • First apply a chunk rule to chunk everything:
    • Input:
        the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
    • ChunkRule('<.*>+', 'Chunk everything'):
        [the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN]
  • Then apply the chink rule above:
      [the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
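  The same chunk-then-chink sequence, written in the grammar-string notation of current NLTK's RegexpParser ({...} chunks, }...{ chinks); the tag pattern <VB.?|IN> is an equivalent way of writing the alternation shown on the slide:

      import nltk

      grammar = r"""
      NP:
          {<.*>+}          # first chunk everything
          }<VB.?|IN>+{     # then chink (remove) verbs and prepositions
      """

      tagged = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
                ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

      print(nltk.RegexpParser(grammar).parse(tagged))
      # Expected structure:
      # (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))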

  21. Merging
  • Combine adjacent chunks into a single chunk.
  • Define a regular expression that matches the sequences of tags on both sides of the point to be merged.
  • Example: merge a chunk ending in JJ with a chunk starting with NN:
      MergeRule('<JJ>', '<NN>', 'Merge adjs and nouns')
      [the/DT little/JJ] [cat/NN] sat/VBD on/IN the/DT mat/NN
      [the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN
  • Splitting is the opposite of merging.
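  A sketch of the merge step with current NLTK's rule classes; the two preliminary ChunkRules are only there to recreate the starting chunks shown on the slide (they also chunk mat/NN, which the slide leaves unchunked):

      from nltk import Tree
      from nltk.chunk.regexp import ChunkRule, MergeRule, RegexpChunkParser

      tagged = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
                ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

      rules = [
          ChunkRule("<DT><JJ>", "Chunk determiner + adjective"),
          ChunkRule("<NN>", "Chunk bare nouns"),
          MergeRule("<JJ>", "<NN>", "Merge adjs and nouns"),
      ]
      print(RegexpChunkParser(rules, chunk_label="NP").parse(Tree("S", tagged)))
      # Expected structure:
      # (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN the/DT (NP mat/NN))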

  23. NLTK Exercises for Next Week
  • Series of tutorials by Steven Bird, Ewan Klein and Edward Loper
  • http://nltk.sourceforge.net/lite/doc/en/
  • University of Pennsylvania
  • By the next lecture, please read and do the exercises in:
    • Introduction
    • Programming
    • Tokenize
    • Tag

  24. Next Sessions…
  • Natural Language Toolkit (NLTK) Exercises
    • http://nltk.sourceforge.net/
  • Discovery of Word Associations
  • Text Classification
  • Clustering/Data Mining
  • TF.IDF
  • Linear and Non-Linear Classification
  • Binary Classification
  • Multi-Class Classification
