
NLTK & Python Day 7

NLTK & Python Day 7. LING 681.02 Computational Linguistics Harry Howard Tulane University. Course organization. I have requested that NLTK be installed on the computers in this room. NLPP §2 Accessing text corpora and lexical resources. §2.1 Accessing text corpora. What's that word.


Presentation Transcript


  1. NLTK & Python Day 7 LING 681.02 Computational Linguistics Harry Howard Tulane University

  2. Course organization • I have requested that NLTK be installed on the computers in this room. LING 681.02, Prof. Howard, Tulane University

  3. NLPP §2 Accessing text corpora and lexical resources §2.1 Accessing text corpora

  4. What's that word • What is a corpus (plural: corpora)? • "large bodies of linguistic data"

  5. Some corpora in NLTK • The Project Gutenberg electronic text archive (25k free electronic books at http://www.gutenberg.org/) • Web and chat text • The Brown corpus (the first 1M-word electronic corpus, from 500 sources) • The Reuters corpus • The Inaugural Address corpus • Annotated text corpora • Corpora in other languages

  6. Using corpora in NLTK • Only the texts loaded by nltk.book are already nltk.Text objects, so only they can be used directly with Text methods such as collocations(). • To wrap the words of another corpus in the same way, use: your_text_name = nltk.Text(corpus_words), where corpus_words is a list of words such as gutenberg.words('austen-emma.txt').

  7. Basic corpus functions (Table 2.3)

  8. Basic corpus functions (Table 2.3)

  9. Code to get started
     >>> import nltk
     >>> from nltk.corpus import gutenberg
     >>> emma = gutenberg.words('austen-emma.txt')
     >>> emma = nltk.Text(emma)
     >>> emma.collocations()
     Frank Churchill; Miss Woodhouse; Miss Bates; Jane Fairfax; Miss Fairfax;
     young man; great deal; John Knightley; Maple Grove; Miss Smith;
     Miss Taylor; Robert Martin; Colonel Campbell; Box Hill; Harriet Smith;
     William Larkins; Brunswick Square; young lady; young woman; Miss Hawkins
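As a rough intuition for the collocations output: a collocation is a pair of words that turn up next to each other unusually often. The sketch below only counts raw adjacent pairs with the standard library. It is not NLTK's method, since collocations() applies its own scoring and filtering that this sketch does not attempt. The token list and the `adjacent_pairs` helper are made up for illustration.

```python
from collections import Counter


def adjacent_pairs(tokens):
    """Count how often each pair of neighbouring tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))


# Toy token list (hypothetical, not taken from Emma).
tokens = "miss bates met miss taylor and miss bates smiled".split()

pair_counts = adjacent_pairs(tokens)
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)  # ('miss', 'bates') 2
```

On real text, raw counts like these are dominated by pairs such as "of the", which is one reason a collocation finder needs more than simple counting.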

  10. Loading your own corpus (Table 2.3)

  11. NLPP §2 Accessing text corpora and lexical resources §2.2 Conditional frequency distributions

  12. Back to frequency • FreqDist(mylist) counts the occurrences of each item in 'mylist'. • ConditionalFreqDist(mypairs) takes (condition, item) pairs in 'mypairs' and counts the occurrences of each item separately for each condition, • where the pairing might be author & word, genre & word, topic & word, etc.; in general, condition & item.
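To make the difference between the two kinds of count concrete, here is a minimal sketch that uses only the Python standard library as a stand-in for the two NLTK classes. The variable names and toy data are made up for illustration, and `Counter` and `defaultdict` are not the NLTK API itself.

```python
from collections import Counter, defaultdict

# FreqDist-style count: one tally over a flat list of items.
mylist = ["the", "cat", "the", "dog"]
freq = Counter(mylist)

# ConditionalFreqDist-style count: one tally per condition,
# built from (condition, item) pairs.
mypairs = [
    ("news", "the"),
    ("news", "court"),
    ("romance", "the"),
    ("news", "the"),
]
cond_freq = defaultdict(Counter)
for condition, item in mypairs:
    cond_freq[condition][item] += 1

print(freq["the"])                  # 2
print(cond_freq["news"]["the"])     # 2
print(cond_freq["romance"]["the"])  # 1
```

The flat count answers "how often does this word occur?", while the conditional count answers "how often does this word occur in this genre?".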

  13. An example
      >>> import nltk
      >>> from nltk.corpus import brown
      >>> cfd = nltk.ConditionalFreqDist(
      ...     (genre, word)
      ...     for genre in brown.categories()
      ...     for word in brown.words(categories=genre))
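The nested generator expression in this example can be hard to read at first. The sketch below reproduces its shape on a small made-up mapping, so it runs without the Brown corpus installed. `toy_corpus` is a hypothetical stand-in for brown.categories() and brown.words(categories=genre), and the counting is done with the standard library, not with NLTK.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a categorized corpus: genre -> list of words.
toy_corpus = {
    "news": ["the", "jury", "said", "the", "vote", "could", "stand"],
    "romance": ["she", "could", "not", "say", "what", "she", "could", "do"],
}

# Same shape as the slide: the outer loop picks the genre (the condition),
# the inner loop walks that genre's words, and each step yields one pair.
pairs = (
    (genre, word)
    for genre in toy_corpus
    for word in toy_corpus[genre]
)

# Tally each word separately under its genre.
table = defaultdict(Counter)
for genre, word in pairs:
    table[genre][word] += 1

# Compare one word across the conditions.
for genre in table:
    print(genre, table[genre]["could"])
```

Each word becomes one (genre, word) pair, so the total count under a genre equals the number of words in that genre.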

  14. Next time • NLPP: §2.3ff • Do "Your Turn" up to p. 55 • Exercises 2.8.2-4, 2.8.8
