
NLTK & Python Day 5

LING 681.02

Computational Linguistics

Harry Howard

Tulane University



Course organization

  • I have requested that Python and NLTK be installed on the computers in this room.

LING 681.02, Prof. Howard, Tulane University



NLPP

§1.2 A Closer Look at Python: Texts as Lists of Words



Variables

  • variable = expression

    >>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',

    ... 'forth', 'from', 'Camelot', '.']

    >>> noun_phrase = my_sent[1:4]

    >>> noun_phrase

    ['bold', 'Sir', 'Robin']

    >>> wOrDs = sorted(noun_phrase)

    >>> wOrDs

    ['Robin', 'Sir', 'bold']


How to name variables

  • Valid names (or identifiers) …

    • must start with a letter or an underscore, optionally followed by letters, digits, or underscores;

    • are case-sensitive;

    • cannot contain whitespace (use an underscore) or a dash (means minus);

    • cannot be a reserved word.
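
A minimal sketch of checking these naming rules with the standard library. The helper name is_valid_name and the candidate names are illustrative assumptions, not part of the course material.

```python
import keyword

def is_valid_name(name):
    """Return True if `name` can be used as a Python variable name."""
    # str.isidentifier() checks the character rules (first character, no spaces or dashes);
    # keyword.iskeyword() rejects reserved words such as 'not' or 'import'.
    return name.isidentifier() and not keyword.iskeyword(name)

# Hypothetical candidates, one for each rule on the slide.
candidates = ['my_sent', 'wOrDs', '_tmp', '2nd_word', 'noun-phrase', 'noun phrase', 'not']
results = {name: is_valid_name(name) for name in candidates}

for name, ok in results.items():
    print(repr(name), ok)
```

Because names are case-sensitive, wOrDs and words would be two different variables, although both are valid.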


Strings

  • Strings are individual words. A string is a sequence of characters, so some list operations also work on it.

  • Some operations on strings

    >>> name = 'Monty'

    >>> name[0]

    'M'

    >>> name[:4]

    'Mont'

    >>> name * 2

    'MontyMonty'

    >>> name + '!'

    'Monty!'

    >>> ' '.join(['Monty', 'Python'])

    'Monty Python'

    >>> 'Monty Python'.split()

    ['Monty', 'Python']


NLPP

§1.3. Computing with Language: Simple Statistics



Frequency distribution

  • What is a frequency distribution?

    • It tells us the frequency of each vocabulary item in a text.

    • It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items.

  • What function in NLTK calculates it?

    • FreqDist(text_name)

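  • What expression lists the vocabulary items counted in the distribution?

    • fdist.keys(), where fdist = FreqDist(text_name). Older NLTK releases return the keys sorted by decreasing frequency; in NLTK 3, fdist.most_common() gives that ordering.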
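
A minimal sketch of what a frequency distribution holds, using collections.Counter from the standard library as a stand-in (in NLTK 3, FreqDist is built on Counter). The toy token list is an assumption made so the example runs without downloading the NLTK corpora.

```python
from collections import Counter

# Toy text standing in for an NLTK text such as text1.
tokens = 'the whale and the sea and the ship'.split()

fdist = Counter(tokens)          # one count per vocabulary item

print(fdist['the'])              # frequency of a single item
print(len(fdist))                # number of distinct vocabulary items
print(sum(fdist.values()))       # total number of word tokens
print(fdist.most_common(2))      # items ordered by decreasing frequency
```

The counts always add up to the number of tokens, which is the sense in which the tokens are "distributed" across the vocabulary.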


Very frequent words

  • How would you describe the 50 most frequent elements in Moby Dick?

    >>> fdist1.plot(50, cumulative=True)
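
A rough sketch of the quantity a cumulative plot displays: the share of all tokens covered by the n most frequent items. The function cumulative_share and the toy text are assumptions for illustration, and Counter again stands in for FreqDist.

```python
from collections import Counter

# Toy text; a real run would use the tokens of Moby Dick.
tokens = 'the whale and the sea and the ship of the sea'.split()
fdist = Counter(tokens)

def cumulative_share(fdist, n):
    """Fraction of all tokens accounted for by the n most frequent items."""
    covered = sum(count for _, count in fdist.most_common(n))
    return covered / sum(fdist.values())

for n in (1, 3, len(fdist)):
    print(n, round(cumulative_share(fdist, n), 2))
```

If a handful of items already covers a large share of the tokens, the curve rises steeply at first and then flattens.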


Very infrequent words

  • Words that occur only once are called hapaxes.

    • >>> fdist1.hapaxes()

    • In Moby Dick, "lexicographer, cetological, contraband, expostulations", and about 9,000 others.

    • How would you describe them?
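
A minimal sketch of how hapaxes could be picked out of a frequency distribution by hand. It uses Counter on a toy text (both are assumptions) rather than calling NLTK.

```python
from collections import Counter

tokens = 'the whale and the sea and the ship'.split()
fdist = Counter(tokens)

# A hapax is a vocabulary item whose count is exactly 1.
hapaxes = [word for word, count in fdist.items() if count == 1]
print(hapaxes)
```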


Summary


Question

  • Which group would you look in to find words that help you understand what the text is about?

    • Neither.


Fine-grained word selection

  • Some Python expressions are based on set theory.

    • {w | w ∈ V & P(w)}

    • [w for w in V if p(w)], though this returns a list, not a set. (What's the difference?)

  • Real NLTK

    >>> V = set(text1)

    >>> long_words = [w for w in V if len(w) > 15]
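
One way to see the difference between a list and a set is to run the same condition over a toy token list. The token list and variable names below are assumptions for illustration.

```python
tokens = 'the whale saw the whale near the ship'.split()

V = set(tokens)                                  # vocabulary: duplicates removed

from_tokens = [w for w in tokens if len(w) > 3]  # list: keeps text order and repeats
from_vocab = [w for w in V if len(w) > 3]        # list built from a set: one entry per word
as_set = {w for w in tokens if len(w) > 3}       # set comprehension: unique, unordered

print(from_tokens)
print(sorted(from_vocab))
print(sorted(as_set))
```

A list keeps order and duplicates, while a set has neither, which is why the slide's expression iterates over V to get each long word only once.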


Finding words that characterize a text

  • Not too short (>?) and not too infrequent (>?)

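    • >>> informative_words = [w for w in V if len(w) > 7 and fdist1[w] > 7]

    • Here fdist1 is FreqDist(text1), the distribution for Moby Dick, and fdist1[w] is the count of the word w.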
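
A small sketch of the length-and-frequency filter on a toy text. The thresholds min_len and min_count and the sample sentence are assumptions chosen so the toy text yields a result; Counter stands in for FreqDist.

```python
from collections import Counter

# Toy text: a repeated clause plus one sentence of rarer words.
tokens = ('the captain chased the whale ' * 3
          + 'a harpooneer watched the captain').split()

fdist = Counter(tokens)   # count per word, looked up as fdist[w]
V = set(tokens)           # vocabulary

min_len, min_count = 5, 2
informative = sorted(w for w in V
                     if len(w) > min_len and fdist[w] > min_count)
print(informative)
```

The very frequent but short "the" and the long but rare "harpooneer" are both filtered out, which is the point of combining the two conditions.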


Finding groups of words

  • What is the name for a sequence of two words?

    • Bigram ~ bigrams()

      >>> list(bigrams(['more', 'is', 'said', 'than', 'done']))

      [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

  • What is the name for a sequence of words that occur together unusually often?

    • Collocation ~ collocations()

    • They are essentially bigrams that occur more often than we would expect based on the frequency of individual words.
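
A rough sketch of both ideas without NLTK. The bigrams function here is a hand-written equivalent, and association is a simplified observed-to-expected ratio; it is an assumption made for illustration, not the scoring that collocations() actually uses.

```python
from collections import Counter

def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

def association(pair, tokens):
    """Observed count of a bigram divided by the count expected if its
    two words were scattered through the text independently."""
    unigrams = Counter(tokens)
    observed = Counter(bigrams(tokens))[pair]
    expected = unigrams[pair[0]] * unigrams[pair[1]] / len(tokens)
    return observed / expected if expected else 0.0

pairs = bigrams('more is said than done'.split())
print(pairs)

# Toy text; values above 1 mean the pair is more frequent than chance predicts.
text = 'the united states and the united nations and the people'.split()
print(round(association(('the', 'united'), text), 2))
print(association(('united', 'and'), text))
```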


Example

  • >>> text4.collocations()

  • Building collocations list

  • United States; fellow citizens; years ago; Federal Government; General Government; American people; Vice President; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes; public debt; foreign nations; political parties; State governments; National Government; United Nations; public money


Counting Other Things


Next time

First quiz/project

NLPP: finish §1 and do all exercises;

do up to Ex 8 in §2

