
CSCE 590 Web Scraping – NLTK

Presentation Transcript


  1. CSCE 590 Web Scraping – NLTK • Topics • The Natural Language Toolkit (NLTK) • Readings: Online book – http://www.nltk.org/book/ • March 23, 2017

  2. Natural Language Toolkit (NLTK) • Part-of-speech taggers • Statistical libraries • Parsers • Corpora

  3. Installing NLTK • http://www.nltk.org/ • Mac/Unix • Install NLTK: run sudo pip install -U nltk • Install NumPy (optional): run sudo pip install -U numpy • Test the installation: run python, then type import nltk • For older versions of Python it may be necessary to install setuptools (see http://pypi.python.org/pypi/setuptools) and then pip (sudo easy_install pip).
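
A quick sanity check from inside the interpreter (a minimal sketch; nltk.__version__ is the package's standard version attribute):

    >>> import nltk
    >>> nltk.__version__   # shows which version pip installed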

  4. nltk.download() • >>> import nltk • >>> nltk.download()
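
Called with no arguments, nltk.download() opens the interactive downloader. It also accepts a package name, which is convenient on headless machines ('brown' and 'book' are standard downloader identifiers):

    >>> import nltk
    >>> nltk.download('brown')   # fetch a single corpus non-interactively
    >>> nltk.download('book')    # or the whole collection used by the NLTK book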

  5. Test of download • >>> from nltk.corpus import brown • >>> brown.words() • ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...] • >>> len(brown.words()) • 1161192
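
The corpus reader exposes more than the flat word list; a short sketch using the standard corpus-reader API (brown.categories() and the categories= keyword):

    >>> brown.categories()                      # genres such as 'news', 'romance', 'adventure'
    >>> news = brown.words(categories='news')   # restrict the word list to one genre
    >>> len(news)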

  6. Examples from the NLTK Book • Loading text1, ..., text9 and sent1, ..., sent9 • Type: 'texts()' or 'sents()' to list the materials. • text1: Moby Dick by Herman Melville 1851 • text2: Sense and Sensibility by Jane Austen 1811 • text3: The Book of Genesis • text4: Inaugural Address Corpus • text5: Chat Corpus • text6: Monty Python and the Holy Grail • text7: Wall Street Journal • text8: Personals Corpus • text9: The Man Who Was Thursday by G. K. Chesterton 1908 • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3364-3367). O'Reilly Media. Kindle Edition.
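
The text1 ... text9 objects in this list become available by importing the book module; a minimal session, assuming the book data has already been fetched with nltk.download():

    >>> from nltk.book import *              # loads text1..text9 and sent1..sent9
    >>> text6                                # each is an nltk.text.Text object
    <Text: Monty Python and the Holy Grail>
    >>> text6.concordance("coconut")         # Text objects also support concordance searches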

  7. Simple Statistical Analysis Using NLTK • >>> len(text6)/len(set(text6)) • 7.833333333333333 • >>> from nltk import FreqDist • >>> fdist = FreqDist(text6) • >>> fdist.most_common(10) • [(':', 1197), ('.', 816), ('!', 801), (',', 731), ("'", 421), ('[', 319), (']', 312), ('the', 299), ('I', 255), ('ARTHUR', 225)] • >>> fdist["Grail"] • 34 • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3375-3385). O'Reilly Media. Kindle Edition.
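
The first line above is a crude lexical-diversity measure: the average number of times each distinct token is used. It can be wrapped as a small helper (the function name here is ours, not part of NLTK):

    >>> def lexical_diversity(text):
    ...     return len(text) / len(set(text))   # average uses per distinct token
    ...
    >>> lexical_diversity(text6)
    7.833333333333333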

  8. Bigrams - ngrams

    from nltk.book import *
    from nltk import ngrams

    # ngrams() returns a lazy generator of n-item tuples over the token stream
    fourgrams = ngrams(text6, 4)
    for fourgram in fourgrams:
        if fourgram[0] == "coconut":   # keep only 4-grams that start with "coconut"
            print(fourgram)

  Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3407-3412). O'Reilly Media. Kindle Edition.

  9. nltkFreqDist.py – BeautifulSoup + NLTK example

    from nltk import FreqDist, word_tokenize
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
    bsObj = BeautifulSoup(html.read(), "lxml")
    # print(bsObj.h1)
    mytext = bsObj.get_text()
    # Tokenize before counting: FreqDist over a raw string would tally characters, not words
    fdist = FreqDist(word_tokenize(mytext))
    print(fdist.most_common(10))

  10. FreqDist of ngrams (here, 4-grams) • >>> from nltk import ngrams • >>> fourgrams = ngrams(text6, 4) • >>> fourgramsDist = FreqDist(fourgrams) • >>> fourgramsDist[("father", "smelt", "of", "elderberries")] • 1 • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3398-3403). O'Reilly Media. Kindle Edition.

  11. Penn Treebank Tagging (default)

  12. POS tagging
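
By default, nltk.pos_tag labels tokens with Penn Treebank tags. A minimal illustration (the tagger model must first be fetched via nltk.download(); exact tags can vary slightly between tagger versions):

    >>> from nltk import word_tokenize, pos_tag
    >>> pos_tag(word_tokenize("Google is a company"))
    [('Google', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('company', 'NN')]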

  13. NltkAnalysis.py

    from nltk import word_tokenize, sent_tokenize, pos_tag

    sentences = sent_tokenize("Google is one of the best companies in the world. I constantly google myself to see what I'm up to.")
    nouns = ['NN', 'NNS', 'NNP', 'NNPS']   # Penn Treebank noun tags (common/proper, singular/plural)

    for sentence in sentences:
        if "google" in sentence.lower():
            taggedWords = pos_tag(word_tokenize(sentence))
            for word in taggedWords:
                if word[0].lower() == "google" and word[1] in nouns:
                    print(sentence)
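
As written, the script re-prints a sentence only when "google" is tagged with one of the listed noun tags, so the intent is to match the first sentence, where Google is a company name (tagged NNP), and skip the second, where "google" is used as a verb.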
