NLTK & Python Day 5

NLTK & Python Day 5
LING 681.02 Computational Linguistics
Harry Howard
Tulane University

Course organization
• I have requested that Python and NLTK be installed on the computers in this room.
LING 681.02, Prof. Howard, Tulane University

NLPP §1.2 A Closer Look at Python: Texts as Lists of Words

Variables
• variable = expression
    >>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
    ... 'forth', 'from', 'Camelot', '.']
    >>> noun_phrase = my_sent[1:4]
    >>> noun_phrase
    ['bold', 'Sir', 'Robin']
    >>> wOrDs = sorted(noun_phrase)
    >>> wOrDs
    ['Robin', 'Sir', 'bold']

How to name variables
• Valid names (or identifiers) …
  • must start with a letter or underscore, optionally followed by letters, digits, or underscores;
  • are case-sensitive;
  • cannot contain whitespace (use an underscore instead) or a dash (it means minus);
  • cannot be a reserved word.

Strings
• A string is an individual word or, more generally, a sequence of characters; it can be indexed and sliced like a list.
• Some operations and methods for strings
    >>> name = 'Monty'
    >>> name[0]
    'M'
    >>> name[:4]
    'Mont'
    >>> name * 2
    'MontyMonty'
    >>> name + '!'
    'Monty!'
    >>> ' '.join(['Monty', 'Python'])
    'Monty Python'
    >>> 'Monty Python'.split()
    ['Monty', 'Python']

NLPP §1.3. Computing with Language: Simple Statistics

Frequency distribution
• What is a frequency distribution?
  • It tells us the frequency of each vocabulary item in a text.
  • It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items.
• What function in NLTK calculates it?
  • fdist1 = FreqDist(text_name)
• What expression lists the vocabulary items that were counted?
  • fdist1.keys() (in NLTK 3 the keys are no longer sorted by frequency; fdist1.most_common() lists each item with its count, most frequent first)
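A minimal sketch of the idea, using the standard library's collections.Counter on a made-up token list instead of NLTK itself. In NLTK 3, FreqDist is a subclass of Counter, so the calls shown here should carry over; the sample tokens and variable names are illustrative assumptions, not part of the course material.

```python
from collections import Counter

# Toy token list standing in for an NLTK text (an assumption for illustration).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']

# Count how often each vocabulary item occurs.
fdist = Counter(tokens)

# The counts add up to the number of tokens: the tokens are
# "distributed" across the vocabulary items.
total_tokens = sum(fdist.values())

# Vocabulary items paired with their counts, most frequent first.
ranked = fdist.most_common()

print(total_tokens)   # 8
print(ranked[:2])     # [('the', 3), ('and', 2)]
```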

Very frequent words
• How would you describe the 50 most frequent elements in Moby Dick?
    >>> fdist1.plot(50, cumulative=True)
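The cumulative plot shows how much of the text the top-ranked words cover. The same quantity can be computed directly, as in this hedged sketch with collections.Counter and a toy token list; the function name cumulative_share and the sample data are assumptions made for illustration.

```python
from collections import Counter

# Toy token list standing in for an NLTK text (an assumption for illustration).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']
fdist = Counter(tokens)

def cumulative_share(fdist, n):
    """Return the fraction of all tokens covered by the n most frequent items."""
    total = sum(fdist.values())
    top_counts = [count for _, count in fdist.most_common(n)]
    return sum(top_counts) / total

# 'the' (3) and 'and' (2) together cover 5 of the 8 tokens.
print(cumulative_share(fdist, 2))  # 0.625
```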

Very infrequent words
• Words that occur only once are called hapaxes.
    >>> fdist1.hapaxes()
• In Moby Dick, "lexicographer, cetological, contraband, expostulations", and about 9,000 others.
• How would you describe them?
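What hapaxes() computes can be sketched by hand: keep every vocabulary item whose count is exactly one. The example below uses collections.Counter and a toy token list, both assumptions for illustration, not the NLTK implementation.

```python
from collections import Counter

# Toy token list standing in for an NLTK text (an assumption for illustration).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']
fdist = Counter(tokens)

# A hapax is a vocabulary item that occurs exactly once.
hapaxes = sorted(w for w, count in fdist.items() if count == 1)

print(hapaxes)  # ['sea', 'ship', 'whale']
```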

Summary

Question
• Which group would you look in to find words that help you understand what the text is about?
  • Neither.

Fine-grained word selection
• Some Python expressions are based on set theory.
  • {w | w ∈ V & P(w)}
  • [w for w in V if p(w)], though this returns a list, not a set. (What's the difference?)
• Real NLTK
    >>> V = set(text1)
    >>> long_words = [w for w in V if len(w) > 15]
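One way to explore the list-versus-set question is to run the same condition through both kinds of comprehension. In this sketch the token list and the length threshold are made up for illustration; the point is that a list keeps order and duplicates, while a set keeps each item once and has no positions to index.

```python
# Toy token list with a repeated long word (an assumption for illustration).
tokens = ['the', 'whale', 'and', 'the', 'whale', 'ship']

# List comprehension: order and duplicates are preserved.
as_list = [w for w in tokens if len(w) > 3]

# Set comprehension: each qualifying word appears once, with no fixed order.
as_set = {w for w in tokens if len(w) > 3}

print(as_list)         # ['whale', 'whale', 'ship']
print(sorted(as_set))  # ['ship', 'whale']
```

When the comprehension iterates over a set such as V, the resulting list has no duplicates either, but it can still be indexed, sliced, and sorted in place, which a set cannot.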

Finding words that characterize a text
• Not too short (>?) and not too infrequent (>?)
    >>> fdist1 = FreqDist(text1)
    >>> informative_words = [w for w in V if len(w) > 7 and fdist1[w] > 7]
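The selection combines two conditions on each vocabulary item: its length and its count in the frequency distribution. A small sketch with collections.Counter follows; the toy tokens and the two thresholds (MIN_LENGTH, MIN_COUNT) are assumptions chosen so that the tiny example has something to show.

```python
from collections import Counter

# Toy text: 'whale' and 'harpooner' occur 3 times, 'a' 4 times,
# 'lexicographer' once (all made up for illustration).
tokens = ('whale ' * 3 + 'harpooner ' * 3 + 'lexicographer a a a a').split()

fdist = Counter(tokens)   # count of each vocabulary item
V = set(tokens)           # the vocabulary

MIN_LENGTH = 7  # hypothetical threshold: "not too short"
MIN_COUNT = 2   # hypothetical threshold: "not too infrequent"

# Look up each word's own count with fdist[w]; comparing the whole
# distribution to a number would not test the individual word.
informative_words = sorted(
    w for w in V if len(w) > MIN_LENGTH and fdist[w] > MIN_COUNT
)

print(informative_words)  # ['harpooner']
```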

Finding groups of words
• What is the name for a sequence of two words?
  • Bigram ~ bigrams()
    >>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
    [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
  • In NLTK 3, bigrams() returns a generator, hence the list() call.
• What is the name for a sequence of words that occur together unusually often?
  • Collocation ~ collocations()
  • They are essentially bigrams that occur more often than we would expect based on the frequency of individual words.
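Extracting bigrams amounts to pairing each word with the one that follows it. A minimal pure-Python sketch is given below; the helper name simple_bigrams is hypothetical and is not the NLTK function.

```python
def simple_bigrams(words):
    """Pair each word with its successor (hypothetical helper, not nltk.bigrams)."""
    # zip stops at the shorter sequence, so the last word starts no pair.
    return list(zip(words, words[1:]))

pairs = simple_bigrams(['more', 'is', 'said', 'than', 'done'])
print(pairs)
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```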

Example
    >>> text4.collocations()
    Building collocations list
    United States; fellow citizens; years ago; Federal Government; General Government; American people; Vice President; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes; public debt; foreign nations; political parties; State governments; National Government; United Nations; public money
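To make "more often than we would expect" concrete, the sketch below scores bigrams with pointwise mutual information (PMI), which compares a pair's observed probability with the product of its words' individual probabilities. PMI is swapped in here as one common association measure; it is not claimed to be the scoring that collocations() uses. The function name pmi_scores, the min_count cutoff, and the toy sentence are all assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Rank bigrams by PMI, skipping pairs seen fewer than min_count times."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (a, b), count in pair_counts.items():
        if count < min_count:
            continue  # rare pairs get inflated PMI, so filter them out
        p_pair = count / n_pairs
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        # Positive PMI: the pair occurs more often than chance predicts.
        scores[(a, b)] = math.log2(p_pair / (p_a * p_b))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up sample text for illustration.
tokens = ('the united states and the united nations and '
          'the people of the united states').split()

ranked = pmi_scores(tokens)
print(ranked[0][0])  # ('united', 'states')
```

The min_count filter reflects a design choice: without it, pairs of words that each occur once would score highest, even though a single co-occurrence is weak evidence.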

Counting Other Things

Next time
• First quiz/project
• NLPP: finish §1 and do all exercises; do up to Ex 8 in §2
