
Python 3



Presentation Transcript


  1. Python 3 March 15, 2011

  2. NLTK

         import nltk
         nltk.download()

  3. NLTK • 1. Look at the lists of available texts

         import nltk
         from nltk.book import *
         texts()

  4. NLTK • 2. Check out what the text1 (Moby Dick) object looks like

         import nltk
         from nltk.book import *
         print text1[0:50]

     Looks like a list of word tokens

  6. NLTK • 3. Get a list of the most frequent word TOKENS

         import nltk
         from nltk.book import *
         fd=FreqDist(text1)
         print fd.keys()[0:10]

     • FreqDist is an object defined by NLTK: http://www.opendocs.net/nltk/0.9.5/api/nltk.probability.FreqDist-class.html
     • Give it a list of word tokens.
     • Its keys will be automatically sorted, most frequent first.
     • Print the first 10 keys.
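A minimal Python 3 sketch of the same idea, not the code from the slides. It assumes a made-up token list in place of text1 and uses `collections.Counter`, which NLTK 3's FreqDist is built on. In Python 3 the result of `fd.keys()` cannot be sliced and is not sorted by frequency, so `most_common(n)` is the usual way to get the top tokens.

```python
from collections import Counter

# Toy token list standing in for text1 (an assumption; the real text is far longer).
tokens = ["the", "whale", ",", "the", "sea", ",", "and", "the", "ship"]

# Count how often each distinct token occurs.
fd = Counter(tokens)

# most_common(n) returns (token, count) pairs, most frequent first.
top = fd.most_common(2)
print(top)  # [('the', 3), (',', 2)]
```

Note that punctuation tokens are counted like any other token, which is what motivates the clean-up steps later in the deck.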

  8. NLTK • 4. Now get a concordance of the third most common word

         import nltk
         from nltk.book import *
         text1.concordance("and")

     • concordance is a method defined for an nltk text: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.text.Text-class.html#concordance
     • concordance(self, word, width=79, lines=25): Print a concordance for word with the specified context window.
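To make the idea of a concordance concrete, here is a rough sketch in plain Python 3. The function `simple_concordance` and the toy token list are hypothetical and are not part of NLTK. It also measures context in tokens, whereas the NLTK method above takes a width in characters, so treat it only as an illustration of the concept.

```python
def simple_concordance(tokens, word, window=2):
    """Return each occurrence of `word` with `window` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # Mark the matched token with brackets so it stands out.
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Toy tokens (an assumption, not taken from Moby Dick).
tokens = ["the", "sea", "and", "the", "sky", "and", "the", "ship"]
for line in simple_concordance(tokens, "and"):
    print(line)
# the sea [and] the sky
# the sky [and] the ship
```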

  10. String Operations • 5. What if you don't want punctuation in your list? • First, a simple way to fix it:

          import nltk
          from nltk.book import *
          mobyDick=[x.replace(",","") for x in text1]
          mobyDick=[x.replace(";","") for x in mobyDick]
          mobyDick=[x.replace(".","") for x in mobyDick]
          mobyDick=[x.replace("'","") for x in mobyDick]
          mobyDick=[x.replace("-","") for x in mobyDick]
          mobyDick=[x for x in mobyDick if len(x)>1]
          fd=FreqDist(mobyDick)
          print fd.keys()[0:10]

      • Make a new list of tokens; call it mobyDick.
      • For each token x in the original list, copy the token into the new list, except replace each "," with nothing. The next four lines do the same for ";", ".", "'" and "-".
      • Then, finally, keep only the tokens longer than one character. This drops the tokens that were originally "." and are now empty, and also one-letter words such as "a" and "I".
      • Make a new FreqDist with the new list of tokens; call it fd.
      • Print it like before.
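The chain of replace calls can be written as a loop. The following is a Python 3 sketch on a made-up token list (an assumption standing in for text1), not the code from the slides; it shows what the `len(x) > 1` filter keeps and drops.

```python
# Toy tokens standing in for text1 (an assumption).
tokens = ["Call", "me", "Ishmael", ".", "whale", "-", "ship", ",", "I", "o'clock"]

cleaned = tokens
for mark in [",", ";", ".", "'", "-"]:
    # Remove every occurrence of this punctuation mark from every token.
    cleaned = [tok.replace(mark, "") for tok in cleaned]

# Pure-punctuation tokens are now empty strings; the filter removes them,
# along with one-character tokens such as "I".
cleaned = [tok for tok in cleaned if len(tok) > 1]
print(cleaned)  # ['Call', 'me', 'Ishmael', 'whale', 'ship', 'oclock']
```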

  19. Regular Expressions • 6. Now the more complicated, but less typing way:

          import nltk
          from nltk.book import *
          import re
          punctuation = re.compile("[,.; '-]")
          punctuationRemoved=[punctuation.sub("",x) for x in text1]
          fd=FreqDist([x for x in punctuationRemoved if len(x)>1])
          print fd.keys()[0:10]

      • Import the regular expression module.
      • Compile a regular expression. The RegEx will match any of the characters inside the brackets.
      • Call the "sub" function associated with the RegEx named punctuation: replace anything that matches the RegEx with nothing.
      • As before, do this to each token in the text1 list. Call this new list punctuationRemoved.
      • Get a FreqDist of all tokens with length >1.
      • Print the top 10 word tokens as usual.

      Regular Expressions are Really Powerful and Useful!
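A Python 3 sketch of the regular expression version, again on a made-up token list and with `collections.Counter` in place of FreqDist (both are assumptions, chosen so the example runs without the NLTK data). The hyphen is placed last inside the brackets so that it is read as a literal hyphen rather than as a range.

```python
import re
from collections import Counter

# Toy tokens standing in for text1 (an assumption).
tokens = ["whale", ",", "whale", "-", "ship", ";", "a", "o'clock", "whale", "."]

# Character class: matches any single one of , . ; ' -
punctuation = re.compile(r"[,.;'-]")

# Replace every match with the empty string, token by token.
stripped = [punctuation.sub("", tok) for tok in tokens]

# Count only tokens that still have more than one character.
fd = Counter(tok for tok in stripped if len(tok) > 1)
print(fd.most_common(1))  # [('whale', 3)]
```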

  30. Quick Diversion • 7. What if you wanted to see the least common word tokens?

          import nltk
          from nltk.book import *
          import re
          print fd.keys()[-10:]

      Print the tokens from position -10 to the end, that is, the last 10 keys. Here fd is the FreqDist built in step 6.

  32. Quick Diversion • 8. And what if you wanted to see the frequencies with the words?

          import nltk
          from nltk.book import *
          import re
          print [(k, fd[k]) for k in fd.keys()[0:10]]

      For each of the first 10 keys "k" in the FreqDist, print it and look up its value (fd[k])
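Both diversions can be sketched in Python 3 with `collections.Counter` (used here as a stand-in for FreqDist) on a toy token list. All names and data below are assumptions for illustration. `most_common()` with no argument returns every (token, count) pair, most frequent first, so slicing from the end gives the rarest tokens.

```python
from collections import Counter

# Toy tokens (an assumption).
tokens = "the whale the sea the whale and".split()
fd = Counter(tokens)

# Every (token, count) pair, ordered from most to least frequent.
ranked = fd.most_common()

top_pairs = ranked[:2]   # most frequent tokens with their counts
least = ranked[-2:]      # least frequent tokens with their counts

print(top_pairs)  # [('the', 3), ('whale', 2)]
print(least)
```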

  33. Back to Regular Expressions • 9. Another simple example

          import re
          myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
          colorsRegEx=re.compile("blue|red|green")
          print colorsRegEx.sub("color",myString)

      • Looks similar to the RegEx that matched punctuation before.
      • This RegEx matches the substring "blue" or the substring "red" or the substring "green".
      • Here, substitute anything that matches the RegEx with the string "color".
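Because the alternation matches substrings, it also fires inside longer words. The Python 3 sketch below uses a shortened, slightly altered sentence (the word "hundred" is an addition for illustration, not from the slides) to show the difference between the plain pattern and one wrapped in `\b` word boundaries.

```python
import re

# Shortened example sentence; "hundred" is added to expose the substring issue.
text = "I have a hundred red shoes and blue pants and a green shirt."

plain = re.compile("blue|red|green")
whole_words = re.compile(r"\b(?:blue|red|green)\b")

plain_result = plain.sub("color", text)
bounded_result = whole_words.sub("color", text)

print(plain_result)
# I have a hundcolor color shoes and color pants and a color shirt.
print(bounded_result)
# I have a hundred color shoes and color pants and a color shirt.
```

Whether the word boundaries are wanted depends on the task; for the sentence used in the slides both patterns give the same result.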

  37. Back to Regular Expressions • 10. A more interesting example

      What if we wanted to identify all of the phone numbers in the string?

          import re
          myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
          phoneNumbersRegEx=re.compile(r'\d{11}')
          print phoneNumbersRegEx.findall(myString)

      • This is a start. Output: ['18005551234']
      • Note that \d is a digit, and {11} matches 11 digits in a row.
      • findall will return a list of all substrings of myString that match the RegEx.
      • Also will need to know: "?" will match 0 or 1 repetitions of the previous element.
      • Note: find lots more information on regular expressions here: http://docs.python.org/library/re.html
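A short Python 3 illustration of the "?" quantifier on its own, using made-up strings (assumptions, not from the slides). In each pattern the "?" applies only to the single element immediately before it.

```python
import re

# "u?" makes the letter u optional: zero or one occurrence.
spellings = re.findall(r"colou?r", "color colour colouur")
print(spellings)  # ['color', 'colour']

# "-?" makes the hyphen optional between the two digit groups.
numbers = re.findall(r"\d{3}-?\d{4}", "555-1234 and 5551234")
print(numbers)  # ['555-1234', '5551234']
```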

  41. Back to Regular Expressions • 10. A more interesting example

          import re
          # myString is the same string as in step 9
          phoneNumbersRegEx=re.compile(r'1?-?\(?\d{3}\)?-?\d{3}-?\d{4}')
          print phoneNumbersRegEx.findall(myString)

      Answer is here, but let's derive it together
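One possible way to derive the pattern, sketched in Python 3. The intermediate stages are an assumption about how the derivation might go, not a record of what was done in class; only the final pattern comes from the slide. The string is the phone-number part of myString.

```python
import re

my_string = (
    "My phone number is 8005551234 and my friend's phone number is "
    "(800)-565-7568 and my cell number is 1-800-123-4567. "
    "You could also call me at 18005551234 if you'd like."
)

stages = [
    r"\d{10}",                         # ten digits in a row
    r"1?\d{10}",                       # optional leading 1
    r"1?-?\d{3}-?\d{3}-?\d{4}",        # optional hyphens between groups
    r"1?-?\(?\d{3}\)?-?\d{3}-?\d{4}",  # optional parentheses around the area code
]

results = [re.findall(pattern, my_string) for pattern in stages]
for pattern, found in zip(stages, results):
    print(pattern, "->", found)

# Stage 1: ['8005551234', '1800555123']   (the 11-digit number is cut short)
# Stage 2: ['8005551234', '18005551234']
# Stage 3: ['8005551234', '1-800-123-4567', '18005551234']
# Stage 4: ['8005551234', '(800)-565-7568', '1-800-123-4567', '18005551234']
```

Each stage adds one optional element with "?", and each addition picks up a number format that the previous stage missed or truncated.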

  42. Homework • Webpage Identifying Information

      Write two regular expressions to match:
        1. Email addresses
        2. Phone numbers

      List, remove, or tag those found in the webpage: https://jshare.johnshopkins.edu/kchurch4/public_html/teaching/103/Spring2011/

      Hint: Use part 2 of the last homework (and urllib) and two regular expressions. For phone numbers, go ahead and use the example from this class!

      As always, email answers to Ken (Kenneth.Church@jhu.edu) and Ann (annirvine@gmail.com) by dawn Thursday, March 17
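The three operations named in the homework (list, remove, tag) map onto `findall` and `sub`. The Python 3 sketch below shows the mechanics on a made-up string; the email pattern is a deliberately simplified assumption rather than a complete validator, the address is hypothetical, and fetching the real page with urllib is left out.

```python
import re

# Simplified patterns (assumptions); the phone pattern is the one from class.
email_re = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
phone_re = re.compile(r"1?-?\(?\d{3}\)?-?\d{3}-?\d{4}")

# Hypothetical page text; a real run would download the page first.
page = "Write to alice@example.org or call 1-800-123-4567 today."

# List: collect every match.
emails = email_re.findall(page)
phones = phone_re.findall(page)

# Tag: replace each match with a label. Replacing with "" would remove it instead.
tagged = phone_re.sub("[PHONE]", email_re.sub("[EMAIL]", page))

print(emails)  # ['alice@example.org']
print(phones)  # ['1-800-123-4567']
print(tagged)  # Write to [EMAIL] or call [PHONE] today.
```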
