
Python for NLP and the Natural Language Toolkit

CS1573: AI Application Development, Spring 2003

(modified from Edward Loper’s notes)



Outline

Review: Introduction to NLP (knowledge of language, ambiguity, representations and algorithms, applications)

HW 2 discussion

Tutorials: Basics, Probability



Python and Natural Language Processing

  • Python is a great language for NLP:

    • Simple

    • Easy to debug:

      • Exceptions

      • Interpreted language

    • Easy to structure

      • Modules

      • Object oriented programming

    • Powerful string manipulation



Modules and Packages

Python modules “package program code and data for reuse.” (Lutz)

Similar to library in C, package in Java.

Python packages are hierarchical modules (i.e., modules that contain other modules).

Three commands for accessing modules:

import

from…import

reload
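A minimal combined sketch of all three commands, using the standard library's os.path as a stand-in module (output shown for Unix; not from the original slides):

    >>> import os.path                       # import: use dotted names
    >>> os.path.join('data', 'corpus.txt')
    'data/corpus.txt'
    >>> from os.path import join             # from…import: use the name directly
    >>> join('data', 'corpus.txt')
    'data/corpus.txt'
    >>> reload(os.path)                      # reload: re-execute the module (Python 2 builtin)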


Modules and packages import l.jpg

Modules and Packages: import

  • The import command loads a module:

    # Load the regular expression module

    >>> import re

  • To access the contents of a module, use dotted names:

    # Use the search function from the re module

    >>> re.search(r'\w+', text)

  • To list the contents of a module, use dir:

    >>> dir(re)

    ['DOTALL', 'I', 'IGNORECASE', …]



Modules and Packages: from…import

  • The from…import command loads individual functions and objects from a module:

    # Load the search function from the re module

    >>> from re import search

  • Once an individual function or object is loaded with from…import, it can be used directly:

    # Use the search method from the re module

    >>> search(r'\w+', text)



Import vs. from…import

import:

  • Keeps module functions separate from user functions.

  • Requires the use of dotted names.

  • Works with reload.

from…import:

  • Puts module functions and user functions together.

  • More convenient names.

  • Does not work with reload.



Modules and Packages: reload

  • If you edit a module, you must use the reload command before the changes become visible in Python:

    >>> import mymodule

    ...

    >>> reload(mymodule)

  • The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.
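These notes predate Python 3, where reload is no longer a builtin; the equivalent lives in importlib (a modern aside, not part of the original slides):

    >>> import importlib
    >>> import mymodule
    >>> importlib.reload(mymodule)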



Introduction to NLTK

  • The Natural Language Toolkit (NLTK) provides:

    • Basic classes for representing data relevant to natural language processing.

    • Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.

    • Standard implementations of each task, which can be combined to solve complex problems.



NLTK: Example Modules

  • nltk.token: processing individual elements of text, such as words or sentences.

  • nltk.probability: modeling frequency distributions and probabilistic systems.

  • nltk.tagger: tagging tokens with supplemental information, such as parts of speech or wordnet sense tags.

  • nltk.parser: high-level interface for parsing texts.

  • nltk.chartparser: a chart-based implementation of the parser interface.

  • nltk.chunkparser: a regular-expression based surface parser.



NLTK: Top-Level Organization

  • NLTK is organized as a flat hierarchy of packages and modules.

  • Each module provides the tools necessary to address a specific task.

  • Modules contain two types of classes:

    • Data-oriented classes are used to represent information relevant to natural language processing.

    • Task-oriented classes encapsulate the resources and methods needed to perform a specific task.



To the First Tutorials

  • Tokens and Tokenization

  • Frequency Distributions



The Token Module

  • It is often useful to think of a text in terms of smaller elements, such as words or sentences.

  • The nltk.token module defines classes for representing and processing these smaller elements.

  • What might be other useful smaller elements?



Tokens and Types

  • The term word can be used in two different ways:

    • To refer to an individual occurrence of a word

    • To refer to an abstract vocabulary item

  • For example, the sentence “my dog likes his dog” contains five occurrences of words, but four vocabulary items.

  • To avoid confusion use more precise terminology:

    • Word token: an occurrence of a word

    • Word Type: a vocabulary item
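A plain-Python illustration of the distinction, counting the example sentence above (a sketch; no NLTK needed):

    >>> sentence = 'my dog likes his dog'
    >>> tokens = sentence.split()
    >>> len(tokens)        # word tokens: individual occurrences
    5
    >>> len(set(tokens))   # word types: distinct vocabulary items
    4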



Tokens and Types (continued)

  • In NLTK, tokens are constructed from their types using the Token constructor:

    >>> from nltk.token import *
    >>> my_word_type = 'dog'
    >>> my_word_token = Token(my_word_type)
    >>> my_word_token
    'dog'@[?]

  • Token member functions include type and loc



Text Locations

  • A text location @[s:e] specifies a region of a text:

    • s is the start index

    • e is the end index

  • The text location @[s:e] specifies the text beginning at s, and including everything up to (but not including) the text at e.

  • This definition is consistent with Python slice notation.

  • Think of indices as appearing between elements:

        I   saw   a   man
      0   1     2   3     4

  • A shorthand notation is used when the location width is 1 (e.g., @[3w] rather than @[3w:4w]).
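Because the convention matches Python slices, a plain list shows the same start/stop behavior (a sketch, not NLTK code):

    >>> words = 'I saw a man'.split()
    >>> words[1:3]   # starts at index 1, stops before index 3
    ['saw', 'a']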



Text Locations (continued)

  • Indices can be based on different units:

    • character

    • word

    • sentence

  • Locations can be tagged with sources (files, other text locations – e.g., the first word of the first sentence in the file)

  • Location member functions:

    • start

    • end

    • unit

    • source



Tokenization

  • The simplest way to represent a text is with a single string.

  • Difficult to process text in this format.

  • Often, it is more convenient to work with a list of tokens.

  • The task of converting a text from a single string to a list of tokens is known as tokenization.



Tokenization (continued)

  • Tokenization is harder than it seems:

    I’ll see you in New York.

    The aluminum-export ban.

  • The simplest approach is to use “graphic words” (i.e., separate words using whitespace)

  • Another approach is to use regular expressions to specify which substrings are valid words (both approaches are sketched below).

  • NLTK provides a generic tokenization interface: TokenizerI
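A sketch of the two approaches just listed, using only the re module (the pattern is illustrative, not NLTK's):

    >>> text = "I'll see you in New York."
    >>> text.split()   # graphic words: whitespace-separated
    ["I'll", 'see', 'you', 'in', 'New', 'York.']
    >>> import re
    >>> re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)   # words, contractions, punctuation
    ["I'll", 'see', 'you', 'in', 'New', 'York', '.']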



TokenizerI

  • Defines a single method, tokenize, which takes a string and returns a list of tokens.

  • tokenize is independent of the level of tokenization and of the implementation algorithm.
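The interface idea can be sketched in a few lines of plain Python (an illustration, not NLTK's actual class):

    class WhitespaceTokenizerSketch:
        """Toy tokenizer in the spirit of TokenizerI."""
        def tokenize(self, text):
            # The level of tokenization (words) and the algorithm
            # (whitespace splitting) are hidden behind tokenize().
            return text.split()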



Example

  • from nltk.token import WSTokenizer
    from nltk.draw.plot import Plot

    # Extract a list of words from the corpus
    corpus = open('corpus.txt').read()
    tokens = WSTokenizer().tokenize(corpus)

    # Count up how many times each word length occurs
    wordlen_count_list = []
    for token in tokens:
        wordlen = len(token.type())
        # Add zeros until wordlen_count_list is long enough
        while wordlen >= len(wordlen_count_list):
            wordlen_count_list.append(0)
        # Increment the count for this word length
        wordlen_count_list[wordlen] += 1

    Plot(wordlen_count_list)
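The counting logic has a compact modern equivalent with collections.Counter (plotting omitted; same hypothetical corpus.txt):

    from collections import Counter

    corpus = open('corpus.txt').read()
    # Map each word length to the number of words with that length
    wordlen_counts = Counter(len(word) for word in corpus.split())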



Next Tutorial: Probability

  • An experiment is any process which leads to a well-defined outcome

  • A sample is any possible outcome of a given experiment

  • Rolling a die?
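A short simulation makes the vocabulary concrete for the die example (standard library only; not from the original slides):

    import random
    from collections import Counter

    # Each roll is one run of the experiment; each face 1-6 is a possible sample
    rolls = [random.randint(1, 6) for _ in range(100)]
    freq = Counter(rolls)   # how often each sample occurred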



Outline

Review Basics

Probability

Experiments and Samples

Frequency Distributions

Conditional Frequency Distributions



Review: NLTK Goals

Classes for NLP data

Interfaces for NLP tasks

Implementations, easily combined (what is an example?)



Accessing NLTK

What is the relation to Python?



Words

Types and Tokens

Text Locations

Member Functions



Tokenization

TokenizerI

Implementations

>>> tokenizer = WSTokenizer()

>>> tokenizer.tokenize(text_str)
['Hello'@[0w], 'world.'@[1w], 'This'@[2w], 'is'@[3w], 'a'@[4w], 'test'@[5w], 'file.'@[6w]]



Word Length Freq. Distribution Example

from nltk.token import WSTokenizer
from nltk.probability import SimpleFreqDist

# Extract a list of words from the corpus
corpus = open('corpus.txt').read()
tokens = WSTokenizer().tokenize(corpus)

# Construct a frequency distribution of word lengths
wordlen_freqs = SimpleFreqDist()
for token in tokens:
    wordlen_freqs.inc(len(token.type()))

# Extract the set of word lengths found in the corpus
wordlens = wordlen_freqs.samples()



Frequency Distributions

  • A frequency distribution records the number of times each outcome of an experiment has occurred

  • >>> freq_dist = FreqDist()
    >>> for token in document:
    ...     freq_dist.inc(token.type())

  • First construct the distribution, then initialize it by recording experimental outcomes.
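In current NLTK releases FreqDist subclasses collections.Counter, so the old inc() call becomes ordinary counter arithmetic (a modern aside, not the 2003 API):

    from nltk import FreqDist

    fd = FreqDist()
    for word in 'my dog likes his dog'.split():
        fd[word] += 1                    # replaces fd.inc(word)
    fd.N(), fd.max(), fd.freq('dog')     # (5, 'dog', 0.4)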



Methods

  • The freq method returns the frequency of a given sample.

  • We can find the number of times a given sample occurred with the count method

  • We can find the total number of sample outcomes recorded by a frequency distribution with the N method

  • The samples method returns a list of all samples that have been recorded as outcomes by a frequency distribution

  • We can find the sample with the greatest number of outcomes with the max method



Examples of Methods

  • >>> freq_dist.count('the')
    6

  • >>> freq_dist.freq('the')
    0.012

  • >>> freq_dist.N()
    500

  • >>> freq_dist.max()
    'the'



Simple Word Length Example

  • >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import FreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    # What is the distribution of word lengths in a corpus?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     freq_dist.inc(len(token.type()))

    • What is the "outcome" for our experiment?



Simple Word Length Example

  • >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import FreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    # What is the distribution of word lengths in a corpus?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     freq_dist.inc(len(token.type()))

    • This length is the "outcome" for our experiment, so we use inc() to increment its count in a frequency distribution.



Complex Word Length Example

  • # Define vowels as "a", "e", "i", "o", and "u"
    >>> VOWELS = ('a', 'e', 'i', 'o', 'u')
    # Distribution of lengths for words ending in vowels?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     if token.type()[-1].lower() in VOWELS:
    ...         freq_dist.inc(len(token.type()))

  • What is the condition?



More Complex Example

  • # What is the distribution of word lengths for
    # words following words that end in vowels?
    >>> ended_in_vowel = 0   # Did the last word end in a vowel?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     if ended_in_vowel:
    ...         freq_dist.inc(len(token.type()))
    ...     ended_in_vowel = token.type()[-1].lower() in VOWELS



Conditional Frequency Distributions

  • A condition specifies the context in which an experiment is performed

  • A conditional frequency distribution is a collection of frequency distributions for the same experiment, run under different conditions

  • The individual frequency distributions are indexed by the condition.

  • NLTK ConditionalFreqDist class

  • >>> cfdist = ConditionalFreqDist()
    >>> cfdist
    <ConditionalFreqDist with 0 conditions>



Conditional Frequency Distributions (continued)

  • To access the frequency distribution for a condition, use the indexing operator:

    >>> cfdist['a']
    <FreqDist with 0 outcomes>

  • # Record lengths of some words starting with 'a'
    >>> for word in 'apple and arm'.split():
    ...     cfdist['a'].inc(len(word))

  • # What fraction are 3 characters long?
    >>> cfdist['a'].freq(3)
    0.66667

  • To list the conditions that have been accessed, use the conditions method:

    >>> cfdist.conditions()
    ['a']
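A rough plain-Python analogue of the same bookkeeping, using only the standard library (a sketch under that assumption):

    from collections import defaultdict, Counter

    cfdist = defaultdict(Counter)        # condition -> outcome counts
    for word in 'apple and arm'.split():
        cfdist['a'][len(word)] += 1      # condition 'a', outcome = word length

    cfdist['a'][3] / sum(cfdist['a'].values())   # 2/3, matching freq(3) above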



Example: Conditioning on a Word’s Initial Letter

  • >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import ConditionalFreqDist
    >>> from nltk.draw.plot import Plot
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    >>> cfdist = ConditionalFreqDist()



Example (continued)

  • # How does initial letter affect word length?
    >>> for token in tokens:
    ...     outcome = len(token.type())
    ...     condition = token.type()[0].lower()
    ...     cfdist[condition].inc(outcome)

  • What are the condition and the outcome?



Example (continued)

  • # How does initial letter affect word length?
    >>> for token in tokens:
    ...     outcome = len(token.type())
    ...     condition = token.type()[0].lower()
    ...     cfdist[condition].inc(outcome)

  • What are the condition and the outcome?

  • Condition = the initial letter of the token

  • Outcome = its word length



Prediction

  • Prediction is the problem of deciding a likely outcome for a given run of an experiment.

  • To predict the outcome, we first examine a training corpus.

  • Training corpus

    • The context and outcome for each run are known

    • Given a new run, we choose the outcome that occurred most frequently for the context

    • A conditional frequency distribution finds the most frequent occurrence



Prediction Example: Outline

Record each outcome in the training corpus, using the context in which the experiment was run as the condition

Access the frequency distribution for a given context with the indexing operator

Use the max() method to find the most likely outcome



Example: Predicting Words

  • Predict a word's type, based on the type of the preceding word

  • >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import ConditionalFreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    >>> cfdist = ConditionalFreqDist()   # empty



Example (continued)

  • >>> context = None   # The type of the preceding word
    >>> for token in tokens:
    ...     outcome = token.type()
    ...     cfdist[context].inc(outcome)
    ...     context = token.type()



Example (continued)

  • >>> cfdist['prediction'].max()
    'problems'
    >>> cfdist['problems'].max()
    'in'
    >>> cfdist['in'].max()
    'the'

  • What are we predicting here?



Example (continued)

We predict the most likely word for any context

Generation application:

>>> word = 'prediction'
>>> for i in range(15):
...     print word,
...     word = cfdist[word].max()

prediction problems in the frequency distribution of the frequency distribution of the frequency distribution of
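The same generation loop in modern Python, with the conditional distribution built from scratch (assumes the same hypothetical corpus.txt, and that 'prediction' occurs in it):

    from collections import defaultdict, Counter

    tokens = open('corpus.txt').read().split()

    cfdist = defaultdict(Counter)   # previous word -> next-word counts
    prev = None
    for tok in tokens:
        cfdist[prev][tok] += 1
        prev = tok

    word = 'prediction'
    for _ in range(15):
        print(word, end=' ')
        word = cfdist[word].most_common(1)[0][0]   # most likely successor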



For Next Time

HW3

To run NLTK from unixs.cis.pitt.edu, you should add /afs/cs.pitt.edu/projects/nltk/bin to your search path

Regular Expressions (J&M handout, NLTK tutorial)

