NLTK & Python Day 5


NLTK & Python Day 5. LING 681.02 Computational Linguistics Harry Howard Tulane University. Course organization. I have requested that Python and NLTK be installed on the computers in this room. NLPP. §1.2 A Closer Look at Python: Texts as Lists of Words. Variables. variable = expression

Presentation Transcript

NLTK & Python Day 5

LING 681.02

Computational Linguistics

Harry Howard

Tulane University

Course organization
  • I have requested that Python and NLTK be installed on the computers in this room.

LING 681.02, Prof. Howard, Tulane University

NLPP

§1.2 A Closer Look at Python: Texts as Lists of Words

Variables
  • variable = expression

>>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
... 'forth', 'from', 'Camelot', '.']
>>> noun_phrase = my_sent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']

How to name variables
  • Valid names (or identifiers) …
    • must start with a letter or an underscore, optionally followed by letters, digits, or underscores;
    • are case-sensitive;
    • cannot contain whitespace (use an underscore) or a dash (means minus);
    • cannot be a reserved word.
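
The naming rules above can be checked from within Python itself. The following is a minimal sketch using only the standard library; the helper name is_valid_name is made up for illustration and is not part of Python or NLTK.

```python
import keyword


def is_valid_name(name):
    """Return True if `name` could be used as a variable name."""
    # str.isidentifier() checks the character rules (no spaces, no
    # dashes, no leading digit); keyword.iskeyword() rules out
    # reserved words such as 'not'.
    return name.isidentifier() and not keyword.iskeyword(name)


# Hypothetical candidate names, chosen to exercise each rule.
for candidate in ['my_sent', 'wOrDs', 'my sent', 'my-sent', '2nd_word', 'not']:
    print(candidate, is_valid_name(candidate))
```

Only the first two candidates pass; the others break the whitespace, dash, leading-digit, and reserved-word rules respectively.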

Strings
  • A string is a sequence of characters, such as an individual word. Like lists, strings can be indexed, sliced, and concatenated.
  • Some operations and methods for strings

>>> name = 'Monty'
>>> name[0]
'M'
>>> name[:4]
'Mont'
>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']


NLPP

§1.3. Computing with Language: Simple Statistics

Frequency distribution
  • What is a frequency distribution?
    • It tells us the frequency of each vocabulary item in a text.
    • It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items.
  • What function in NLTK calculates it?
    • FreqDist(text_name)
  • What expression lists the tokens with their distribution?
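    • fdist1.keys(), where fdist1 = FreqDist(text_name); in NLTK 3, fdist1.most_common() lists each item with its count, most frequent first.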
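
To make the idea of a frequency distribution concrete, here is a small sketch on a made-up token list. It uses collections.Counter from the standard library as a stand-in, so that it runs without NLTK or its corpora; the token list and variable names are assumptions for illustration only.

```python
from collections import Counter

# A toy "text": a list of word tokens (made up for this sketch).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']

# Count how many times each vocabulary item occurs.
fdist = Counter(tokens)

# The counts sum to the total number of tokens: this is the sense in
# which the tokens are "distributed" across the vocabulary items.
total = sum(fdist.values())

# Vocabulary items paired with their counts, most frequent first.
ranked = fdist.most_common()

print(total)
print(ranked)
```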

Very frequent words
  • How would you describe the 50 most frequent elements in Moby Dick?

>>> fdist1.plot(50, cumulative=True)
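
What a cumulative plot shows can be worked out by hand on a toy example. The sketch below is not NLTK's plotting code: it uses collections.Counter and a made-up token list to compute the running totals for the n most frequent items and the share of the text they cover.

```python
from collections import Counter
from itertools import accumulate

# Made-up token list for illustration.
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']
fdist = Counter(tokens)

# Counts of the n most frequent items, largest first.
n = 2
top_counts = [count for _, count in fdist.most_common(n)]

# Running totals: the heights a cumulative plot would display.
running = list(accumulate(top_counts))

# Share of all tokens accounted for by the n most frequent items.
coverage = running[-1] / len(tokens)

print(running, coverage)
```

A curve that rises steeply and then flattens means that a handful of items account for a large share of the tokens.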

Very infrequent words
  • Words that occur only once are called hapaxes.
    • >>> fdist1.hapaxes()
    • In Moby Dick, "lexicographer, cetological, contraband, expostulations", and about 9,000 others.
    • How would you describe them?
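
Finding hapaxes amounts to selecting the items whose count is exactly one. A minimal sketch on a made-up token list, again with collections.Counter standing in for an NLTK frequency distribution:

```python
from collections import Counter

# Made-up token list for illustration.
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']
fdist = Counter(tokens)

# A hapax is a vocabulary item that occurs exactly once.
hapaxes = sorted(w for w, count in fdist.items() if count == 1)

print(hapaxes)
```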

Summary

Question
  • Which group would you look in to find words that help you understand what the text is about?
    • Neither.

Fine-grained word selection
  • Some Python expressions are based on set theory.
    • {w | w ∈ V & P(w)}
    • [w for w in V if p(w)], though this returns a list, not a set. (What's the difference?)
  • Real NLTK

>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
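
As one way of approaching the list-versus-set question, the sketch below applies the same condition with a list comprehension and a set comprehension to a made-up token list. It filters the raw tokens rather than a vocabulary set V, so that the effect on duplicates is visible.

```python
# Made-up token list with repeated words.
text = ['call', 'me', 'Ishmael', 'call', 'me']

# List comprehension: the result keeps order and duplicates.
as_list = [w for w in text if len(w) > 2]

# Set comprehension: the result is unordered and duplicates collapse.
as_set = {w for w in text if len(w) > 2}

print(as_list)
print(len(as_list), len(as_set))
```

When the comprehension iterates over a set such as V, there are no duplicates to begin with, so the remaining differences are that a list has a definite order and supports indexing.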

Finding words that characterize a text
  • Not too short (>?) and not too infrequent (>?)
    • >>> fdist1 = FreqDist(text1)
    • >>> informative_words = [w for w in V if len(w) > 7 and fdist1[w] > 7]
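
The combination of a length condition and a frequency condition can be tried on a toy example. In this sketch the token list and both thresholds are made up (real texts need larger thresholds, such as the 7 used above), and collections.Counter stands in for the NLTK frequency distribution.

```python
from collections import Counter

# Made-up tokens: a short frequent word, a short very frequent word,
# a long repeated word, and a long word that occurs only once.
tokens = ['whale'] * 3 + ['the'] * 5 + ['harpooneer'] * 2 + ['cetological']

fdist = Counter(tokens)
V = set(tokens)

# Toy thresholds, chosen only to suit this tiny example.
min_len, min_count = 5, 1

# Look up each word's own count with fdist[w]; the frequency test
# applies to the individual word, not to the whole distribution.
informative = sorted(w for w in V if len(w) > min_len and fdist[w] > min_count)

print(informative)
```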

Finding groups of words
  • What is the name for a sequence of two words?
    • Bigram ~ bigrams()

>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

  • What is the name for a sequence of words that occur together unusually often?
    • Collocation ~ collocations()
    • They are essentially bigrams that occur more often than we would expect based on the frequency of individual words.
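
The idea of "more often than we would expect" can be illustrated with a crude ratio of observed to expected bigram counts. This is only a sketch under stated assumptions: the token list is made up, the function name association is hypothetical, and the ratio is a simple stand-in that does not reproduce the scoring and filtering that collocations() applies.

```python
from collections import Counter

# Made-up token list for illustration.
tokens = ['the', 'United', 'States', 'and', 'the', 'people', 'of',
          'the', 'United', 'States']

# Bigrams: pair each token with its successor.
pairs = list(zip(tokens, tokens[1:]))

word_counts = Counter(tokens)
pair_counts = Counter(pairs)


def association(pair):
    """Observed bigram count divided by the count we would expect
    if the two words occurred independently of each other."""
    w1, w2 = pair
    expected = word_counts[w1] * word_counts[w2] / len(tokens)
    return pair_counts[pair] / expected


print(association(('United', 'States')))
print(association(('the', 'United')))
```

Both bigrams occur twice, but 'the' is more frequent on its own, so ('the', 'United') is less surprising and scores lower than ('United', 'States').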

Example
  • >>> text4.collocations()
  • Building collocations list
  • United States; fellow citizens; years ago; Federal Government; General Government; American people; Vice President; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes; public debt; foreign nations; political parties; State governments; National Government; United Nations; public money

Counting Other Things

Next time

First quiz/project

NLPP: finish §1 and do all exercises;

do up to Ex 8 in §2