
NLTK & Python Day 5

NLTK & Python Day 5. LING 681.02 Computational Linguistics Harry Howard Tulane University. Course organization. I have requested that Python and NLTK be installed on the computers in this room. NLPP. §1.2 A Closer Look at Python: Texts as Lists of Words. Variables. variable = expression


Presentation Transcript


  1. NLTK & Python Day 5 • LING 681.02 Computational Linguistics • Harry Howard • Tulane University

  2. Course organization • I have requested that Python and NLTK be installed on the computers in this room. LING 681.02, Prof. Howard, Tulane University

  3. NLPP §1.2 A Closer Look at Python: Texts as Lists of Words

  4. Variables
     • variable = expression
     >>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
     ... 'forth', 'from', 'Camelot', '.']
     >>> noun_phrase = my_sent[1:4]
     >>> noun_phrase
     ['bold', 'Sir', 'Robin']
     >>> wOrDs = sorted(noun_phrase)
     >>> wOrDs
     ['Robin', 'Sir', 'bold']
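
The slide's sorted() result puts 'Robin' and 'Sir' before 'bold'. A minimal sketch of why: Python compares strings by character code, so capitals come before lowercase letters. The names default_order and caseless_order are illustrative only and are not from the slides.

```python
# Same list as on the slide.
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
           'forth', 'from', 'Camelot', '.']

# The slice [1:4] takes the items at index 1, 2 and 3 (the end index is excluded).
noun_phrase = my_sent[1:4]

# sorted() compares strings by character code, so capitals sort before lowercase.
default_order = sorted(noun_phrase)

# Passing key=str.lower gives a case-insensitive ordering instead.
caseless_order = sorted(noun_phrase, key=str.lower)

print(default_order)
print(caseless_order)
```

Note that sorted() returns a new list and leaves noun_phrase unchanged.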

  5. How to name variables
     • Valid names (or identifiers) …
       • must start with a letter or an underscore, optionally followed by letters, digits, or underscores;
       • are case-sensitive;
       • cannot contain whitespace or a dash (a dash means minus); use an underscore instead;
       • cannot be a reserved word.
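
The naming rules can be checked from Python itself. The sketch below uses the standard-library calls str.isidentifier() and keyword.iskeyword(); the helper is_valid_name and the list of candidate names are hypothetical, chosen only for illustration.

```python
import keyword

def is_valid_name(name):
    """Return True if `name` could be used as a Python variable name."""
    # str.isidentifier() checks the character rules;
    # keyword.iskeyword() rules out reserved words such as 'not'.
    return name.isidentifier() and not keyword.iskeyword(name)

# Hypothetical candidates: some valid, some breaking one rule each.
candidates = ['my_sent', 'wOrDs', '_hidden', 'text2',
              '2text', 'my sent', 'my-sent', 'not']

for name in candidates:
    print(f'{name!r:12} {is_valid_name(name)}')
```

The last four candidates fail: one starts with a digit, one contains whitespace, one contains a dash, and one is a reserved word.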

  6. Strings
     • A string is a sequence of characters; an individual word is a string, not a single-element list.
     • Some of the operations used on lists also work on strings, and strings have methods of their own:
     >>> name = 'Monty'
     >>> name[0]
     'M'
     >>> name[:4]
     'Mont'
     >>> name * 2
     'MontyMonty'
     >>> name + '!'
     'Monty!'
     >>> ' '.join(['Monty', 'Python'])
     'Monty Python'
     >>> 'Monty Python'.split()
     ['Monty', 'Python']
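
To make the relation between a word and a list concrete, here is a small sketch contrasting the string 'Monty' with a list that contains it. The variable names as_list, words, and rejoined are illustrative assumptions.

```python
name = 'Monty'        # a string: a sequence of 5 characters
as_list = ['Monty']   # a list holding one string

print(len(name), len(as_list))   # number of characters versus number of items
print(name[0], as_list[0])       # first character versus first item

# split() and join() convert between the two kinds of object.
words = 'Monty Python'.split()
rejoined = ' '.join(words)
print(words, rejoined)
```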

  7. NLPP §1.3. Computing with Language: Simple Statistics

  8. Frequency distribution
     • What is a frequency distribution?
       • It tells us the frequency of each vocabulary item in a text.
       • It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items.
     • What function in NLTK calculates it?
       • fdist1 = FreqDist(text_name)
     • What expression lists the vocabulary items of the distribution?
       • fdist1.keys(), called on the distribution rather than on the text. In NLTK 3, fdist1.most_common(50) lists the 50 most frequent items with their counts.
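
A frequency distribution can be sketched without installing NLTK, because in NLTK 3 FreqDist is built on Python's collections.Counter. The example below uses Counter as a stand-in; the toy token list is an assumption replacing a real text such as Moby Dick.

```python
from collections import Counter

# A toy token list standing in for an NLTK text (assumption, not real data).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']

# With NLTK installed this line would be: fdist = FreqDist(tokens)
fdist = Counter(tokens)

print(fdist['the'])           # frequency of one vocabulary item
print(fdist.most_common(2))   # the two most frequent items with their counts

# The counts add up to the total number of tokens: that is the "distribution".
print(sum(fdist.values()) == len(tokens))
```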

  9. Very frequent words
     • How would you describe the 50 most frequent elements in Moby Dick?
     >>> fdist1.plot(50, cumulative=True)
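
The cumulative plot shows how much of the text the top items account for. The same numbers can be computed by hand; this is a rough sketch using collections.Counter in place of FreqDist, with a toy token list and the illustrative names running and cumulative.

```python
from collections import Counter

# Toy tokens (assumption); a real run would use the tokens of Moby Dick.
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship', 'of', 'the']
fdist = Counter(tokens)
total = sum(fdist.values())

running = 0
cumulative = []
# Walk through the most frequent items, accumulating their counts.
for word, count in fdist.most_common(3):
    running += count
    cumulative.append((word, running, running / total))

for word, so_far, share in cumulative:
    print(f'{word:6} {so_far:3} {share:.0%}')
```

Each row gives an item, the running token count, and the share of all tokens covered so far, which is what the cumulative curve plots.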

  10. Very infrequent words
      • Words that occur only once are called hapaxes.
      >>> fdist1.hapaxes()
      • In Moby Dick: "lexicographer", "cetological", "contraband", "expostulations", and about 9,000 others.
      • How would you describe them?
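
Finding hapaxes amounts to keeping the items whose count is exactly 1. A minimal sketch with collections.Counter standing in for FreqDist, on a toy token list (an assumption):

```python
from collections import Counter

# Toy tokens (assumption); fdist1.hapaxes() would do this on a real text.
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship', 'of', 'the']
fdist = Counter(tokens)

# A hapax is a vocabulary item that occurs exactly once.
hapaxes = sorted(w for w, count in fdist.items() if count == 1)

print(hapaxes)
```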

  11. Summary

  12. Question
      • Which group would you look in to find words that help you understand what the text is about?
      • Neither: the very frequent words are mostly function words and punctuation, and the hapaxes are too rare to characterize the text.

  13. Fine-grained word selection
      • Some Python expressions are based on set theory.
        • {w | w ∈ V & P(w)}
        • [w for w in V if p(w)], though this returns a list, not a set. (What's the difference?)
      • Real NLTK
      >>> V = set(text1)
      >>> long_words = [w for w in V if len(w) > 15]
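
One way to answer the slide's question about lists and sets: a list keeps order and duplicates, while a set keeps each item once and has no order. The sketch below runs both kinds of comprehension over a toy token list (an assumption); the length threshold is lowered to suit the toy data.

```python
# Toy tokens (assumption) with one long word repeated.
tokens = ['a', 'extraordinarily', 'long', 'extraordinarily', 'word']

# List comprehension: keeps every match, in order, duplicates included.
as_list = [w for w in tokens if len(w) > 10]

# Set comprehension: same condition, but each match is stored only once.
as_set = {w for w in tokens if len(w) > 10}

# Building the vocabulary first, as on the slide, removes duplicates up front.
V = set(tokens)
long_words = sorted(w for w in V if len(w) > 10)

print(as_list)
print(as_set)
print(long_words)
```

Wrapping the result in sorted() gives a repeatable order, since iterating over a set does not guarantee one.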

  14. Finding words that characterize a text
      • Not too short (>?) and not too infrequent (>?)
      >>> fdist1 = FreqDist(text1)
      >>> informative_words = [w for w in V if len(w) > 7 and fdist1[w] > 7]
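
The selection combines two tests per word: its length and its count in the frequency distribution. Here is a sketch with collections.Counter in place of FreqDist; the toy tokens and the constants MIN_LENGTH and MIN_COUNT are assumptions, with the count threshold lowered to fit the tiny sample.

```python
from collections import Counter

# Toy tokens (assumption): one long frequent word, one long rare word,
# one short frequent word, and a function word.
tokens = ('harpooner ' * 3 + 'whale ' * 5 + 'cetological the the the the').split()

fdist = Counter(tokens)   # with NLTK: FreqDist(tokens)
V = set(tokens)

MIN_LENGTH = 7   # keep words longer than this, as on the slide
MIN_COUNT = 2    # keep words more frequent than this (slide uses 7)

# Look up each word's own count with fdist[w].
informative_words = sorted(
    w for w in V if len(w) > MIN_LENGTH and fdist[w] > MIN_COUNT
)

print(informative_words)
```

'cetological' is long but occurs once, and 'whale' is frequent but short, so only 'harpooner' passes both tests.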

  15. Finding groups of words
      • What is the name for a sequence of two words?
        • Bigram ~ bigrams()
      >>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
      [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
        • In NLTK 3, bigrams() returns a generator, so wrap it in list() to see the pairs.
      • What is the name for a sequence of words that occur together unusually often?
        • Collocation ~ collocations()
        • They are essentially bigrams that occur more often than we would expect based on the frequency of the individual words.
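
Both ideas can be sketched in plain Python. The helper adjacent_pairs mimics what bigrams() produces, and association compares how often a pair was observed with how often it would be expected if the two words were independent. This is a simplified illustration, not the scoring that NLTK's collocations() uses; the helper names, the toy sentence, and the minimum count of 2 are all assumptions.

```python
from collections import Counter

def adjacent_pairs(tokens):
    """Return every pair of neighbouring tokens as a list of tuples."""
    return list(zip(tokens, tokens[1:]))

print(adjacent_pairs(['more', 'is', 'said', 'than', 'done']))

# Toy text (assumption) in which 'united states' recurs.
tokens = 'the united states and the people of the united states'.split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(adjacent_pairs(tokens))
n = len(tokens)

def association(pair):
    """Observed pair count divided by the count expected under independence."""
    first, second = pair
    expected = unigram_counts[first] * unigram_counts[second] / n
    return bigram_counts[pair] / expected

# Ignore pairs seen only once, then rank the rest by association.
candidates = [pair for pair, count in bigram_counts.items() if count >= 2]
ranked = sorted(candidates, key=association, reverse=True)

print(ranked)
```

The minimum-count filter matters: without it, pairs of rare words that happen to meet once would outrank genuine recurring pairs.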

  16. Example
      >>> text4.collocations()
      Building collocations list
      United States; fellow citizens; years ago; Federal Government; General Government; American people; Vice President; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes; public debt; foreign nations; political parties; State governments; National Government; United Nations; public money

  17. Counting Other Things

  18. Next time
      • First quiz/project
      • NLPP: finish §1 and do all exercises; do up to Ex 8 in §2
