
NLTK & Python Day 5

LING 681.02

Computational Linguistics

Harry Howard

Tulane University

Course organization
  • I have requested that Python and NLTK be installed on the computers in this room.

LING 681.02, Prof. Howard, Tulane University


NLPP

§1.2 A Closer Look at Python: Texts as Lists of Words

Variables
  • variable = expression

>>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
... 'forth', 'from', 'Camelot', '.']
>>> noun_phrase = my_sent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']

How to name variables
  • Valid names (or identifiers) …
    • must start with a letter or underscore, optionally followed by letters, digits, or underscores;
    • are case-sensitive;
    • cannot contain whitespace (use an underscore) or a dash (means minus);
    • cannot be a reserved word.
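
The naming rules above can be checked from within Python itself. This is a minimal sketch using only the standard library; the helper name is_valid_name and the candidate names are made up for illustration.

```python
import keyword

def is_valid_name(name):
    """Return True if `name` can be used as a Python variable name."""
    # str.isidentifier() checks the character rules (no leading digit,
    # no whitespace, no dash); keyword.iskeyword() rules out reserved words.
    return name.isidentifier() and not keyword.iskeyword(name)

# Hypothetical candidates: the first three pass, the rest fail.
for candidate in ["my_sent", "wOrDs", "_text", "2nd_word",
                  "noun phrase", "noun-phrase", "not"]:
    print(candidate, is_valid_name(candidate))
```

Case sensitivity is a separate point: wOrDs and words are both valid, but they name two different variables.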

Strings
  • A string is a sequence of characters; an individual word is stored as a string, and it can be indexed and sliced much like a list.
  • Some operations on strings

>>> name = 'Monty'
>>> name[0]
'M'
>>> name[:4]
'Mont'
>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']


NLPP

§1.3. Computing with Language: Simple Statistics

Frequency distribution
  • What is a frequency distribution?
    • It tells us the frequency of each vocabulary item in a text.
    • It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items.
  • What function in NLTK calculates it?
    • FreqDist(text_name)
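  • What expression lists the vocabulary items with their counts?
    • fdist.most_common(n) in NLTK 3, where fdist = FreqDist(text_name); fdist.keys() lists the vocabulary items alone (in NLTK 2, sorted by decreasing frequency).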
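
As a rough illustration of what a frequency distribution holds, here is a sketch that uses collections.Counter from the standard library as a stand-in for FreqDist (in NLTK 3, FreqDist is a subclass of Counter). The toy token list is made up.

```python
from collections import Counter

# Toy text, already split into word tokens (made up for illustration).
tokens = "the whale and the sea and the ship".split()

fdist = Counter(tokens)        # stand-in for FreqDist(tokens)

print(fdist["the"])            # frequency of one vocabulary item
print(fdist.most_common(2))    # → [('the', 3), ('and', 2)]

# The counts add up to the total number of tokens: that is the sense in
# which the tokens are "distributed" across the vocabulary items.
print(sum(fdist.values()) == len(tokens))
```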

Very frequent words
  • How would you describe the 50 most frequent elements in Moby Dick?

>>> fdist1 = FreqDist(text1)
>>> fdist1.plot(50, cumulative=True)
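
What a cumulative plot adds up can also be computed without drawing anything. The sketch below is an assumption-laden toy: it uses collections.Counter in place of FreqDist, a made-up token list, and the top 2 items instead of the top 50.

```python
from collections import Counter
from itertools import accumulate

# Toy token list (made up); with NLTK this would be text1.
tokens = "a b a c a b d a".split()
fdist = Counter(tokens)

top = fdist.most_common(2)                      # [('a', 4), ('b', 2)]
# Running totals of the counts, most frequent item first.
cumulative = list(accumulate(count for _, count in top))
# Share of all tokens covered by these top items.
share = cumulative[-1] / len(tokens)

print(cumulative)   # → [4, 6]
print(share)        # → 0.75
```

The final running total divided by the number of tokens tells you how much of the text the most frequent items account for.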

Very infrequent words
  • Words that occur only once are called hapaxes.
    • >>> fdist1.hapaxes()
    • In Moby Dick, "lexicographer, cetological, contraband, expostulations", and about 9,000 others.
    • How would you describe them?
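
Hapaxes are simply the vocabulary items whose count is 1. A minimal sketch, again with collections.Counter standing in for FreqDist and a made-up token list:

```python
from collections import Counter

tokens = "the whale and the sea and the ship".split()
fdist = Counter(tokens)        # stand-in for FreqDist(tokens)

# Keep the items that occur exactly once.
hapaxes = sorted(word for word, count in fdist.items() if count == 1)
print(hapaxes)                 # → ['sea', 'ship', 'whale']
```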

Summary

Question
  • Which group would you look in to find words that help you understand what the text is about?
    • Neither.

Fine-grained word selection
  • Some Python expressions are based on set theory.
    • {w | w ∈ V & P(w)}
    • [w for w in V if p(w)], though this returns a list, not a set. (What's the difference?)
  • Real NLTK

>>> V = set(text1)

>>> long_words = [w for w in V if len(w) > 15]
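
One way to explore the list-versus-set question is to run the same condition through both kinds of comprehension. This sketch uses a made-up token list rather than text1, so the length threshold is lowered to suit it.

```python
# Toy token list with repeats (made up for illustration).
text = ["call", "me", "Ishmael", "call", "me"]

# List comprehension: keeps order and repeats.
as_list = [w for w in text if len(w) > 2]
# Set comprehension: same condition, but duplicates collapse and order is not kept.
as_set = {w for w in text if len(w) > 2}

print(as_list)        # → ['call', 'Ishmael', 'call']
print(len(as_set))    # → 2
```

Iterating over V = set(text1) first, as on the slide, means each word is tested once, so the resulting list has no repeats either.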

Finding words that characterize a text
  • Not too short (>?) and not too infrequent (>?)
    • >>> informative_words = [w for w in V if len(w) > 7 and fdist1[w] > 7], where fdist1 = FreqDist(text1)
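
The two conditions, a length test on the word and a frequency test on its count, can be tried out on a toy text. In this sketch collections.Counter stands in for FreqDist, the tokens are made up, and the thresholds are lowered (4 and 2 rather than 7 and 7) to fit so small a sample.

```python
from collections import Counter

# Toy text: two long frequent words, one short frequent word, one long hapax.
tokens = ("whale " * 3 + "harpooneer " * 3 + "a " * 5 + "cetological").split()

fdist = Counter(tokens)   # count each token once, up front
V = set(tokens)           # the vocabulary

# The count is looked up per word with fdist[w]; the length test is on w itself.
informative_words = sorted(w for w in V if len(w) > 4 and fdist[w] > 2)
print(informative_words)  # → ['harpooneer', 'whale']
```

The short but frequent "a" fails the length test, and the long but rare "cetological" fails the frequency test.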

Finding groups of words
  • What is the name for a sequence of two words?
    • Bigram ~ bigrams()

>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

  • What is the name for a sequence of words that occur together unusually often?
    • Collocation ~ collocations()
    • They are essentially bigrams that occur more often than we would expect based on the frequency of individual words.
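
The idea of "more often than we would expect" can be sketched with plain counts. This is only an illustration under stated assumptions: the token list is made up, the bigrams are built with zip rather than NLTK's bigrams(), and the ratio computed by the hypothetical helper association is a simple observed-to-expected comparison, not necessarily the measure that collocations() uses.

```python
from collections import Counter

tokens = ("united states of the united states and "
          "the people of the states").split()

# Pair each token with its successor to get the bigrams.
bigram_list = list(zip(tokens, tokens[1:]))

unigram_counts = Counter(tokens)
bigram_counts = Counter(bigram_list)
n = len(tokens)

def association(pair):
    """Observed bigram count divided by the count expected by chance."""
    w1, w2 = pair
    # Rough expected count if the two words occurred independently.
    expected = unigram_counts[w1] * unigram_counts[w2] / n
    return bigram_counts[pair] / expected

print(association(("united", "states")))   # → 4.0
print(association(("the", "states")) < association(("united", "states")))
```

Both pairs end in the frequent word "states", but "united states" occurs far more often than the individual word counts alone would predict.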

Example
  • >>> text4.collocations()
  • Building collocations list
  • United States; fellow citizens; years ago; Federal Government; General Government; American people; Vice President; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes; public debt; foreign nations; political parties; State governments; National Government; United Nations; public money

Counting Other Things


Next time

First quiz/project

NLPP: finish §1 and do all exercises;

do up to Ex 8 in §2
