
CSCE 590 Web Scraping – NLTK

Presentation Transcript


  1. CSCE 590 Web Scraping – NLTK • Topics • The Natural Language Toolkit (NLTK) • Readings: Online book – http://www.nltk.org/book/ • March 23, 2017

  2. Natural Language Toolkit (NLTK) • Part-of-speech taggers • Statistical libraries • Parsers • Corpora

  3. Installing NLTK • http://www.nltk.org/ • Mac/Unix • Install NLTK: run sudo pip install -U nltk • Install NumPy (optional): run sudo pip install -U numpy • Test the installation: run python, then type import nltk • For older versions of Python it may be necessary to install setuptools (see http://pypi.python.org/pypi/setuptools) and then pip (sudo easy_install pip).
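
A quick sanity check from inside the interpreter (a minimal sketch; nltk.__version__ is the package's standard version attribute):

    >>> import nltk
    >>> nltk.__version__   # shows which version pip installed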

  4. nltk.download() • >>> import nltk • >>> nltk.download()
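
Called with no arguments, nltk.download() opens the interactive downloader. It also accepts a package name, which is convenient on headless machines ('brown' and 'book' are standard downloader identifiers):

    >>> import nltk
    >>> nltk.download('brown')   # fetch a single corpus non-interactively
    >>> nltk.download('book')    # or the whole collection used by the NLTK book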

  5. Test of download • >>> from nltk.corpus import brown • >>> brown.words() • ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...] • >>> len(brown.words()) • 1161192
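
The corpus reader exposes more than the flat word list; a short sketch using the standard corpus-reader API (brown.categories() and the categories= keyword):

    >>> brown.categories()                      # genres such as 'news', 'romance', 'adventure'
    >>> news = brown.words(categories='news')   # restrict the word list to one genre
    >>> len(news)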

  6. Examples from the NLTK Book • Loading text1, ..., text9 and sent1, ..., sent9 • Type: 'texts()' or 'sents()' to list the materials. • text1: Moby Dick by Herman Melville 1851 • text2: Sense and Sensibility by Jane Austen 1811 • text3: The Book of Genesis • text4: Inaugural Address Corpus • text5: Chat Corpus • text6: Monty Python and the Holy Grail • text7: Wall Street Journal • text8: Personals Corpus • text9: The Man Who Was Thursday by G. K. Chesterton 1908 • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3364-3367). O'Reilly Media. Kindle Edition.
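
The text1 ... text9 objects in this list become available by importing the book module; a minimal session, assuming the book data has already been fetched with nltk.download():

    >>> from nltk.book import *              # loads text1..text9 and sent1..sent9
    >>> text6                                # each is an nltk.text.Text object
    <Text: Monty Python and the Holy Grail>
    >>> text6.concordance("coconut")         # Text objects also support concordance searches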

  7. Simple Statistical Analysis Using NLTK • >>> len(text6)/len(set(text6)) • 7.833333333333333 • >>> from nltk import FreqDist • >>> fdist = FreqDist(text6) • >>> fdist.most_common(10) • [(':', 1197), ('.', 816), ('!', 801), (',', 731), ("'", 421), ('[', 319), (']', 312), ('the', 299), ('I', 255), ('ARTHUR', 225)] • >>> fdist["Grail"] • 34 • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3375-3385). O'Reilly Media. Kindle Edition.
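
The first line above is a crude lexical-diversity measure: the average number of times each distinct token is used. It can be wrapped as a small helper (the function name here is ours, not part of NLTK):

    >>> def lexical_diversity(text):
    ...     return len(text) / len(set(text))   # average uses per distinct token
    ...
    >>> lexical_diversity(text6)
    7.833333333333333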

  8. Bigrams - ngrams

    from nltk.book import *
    from nltk import ngrams

    # ngrams() returns a lazy generator of n-item tuples over the token stream
    fourgrams = ngrams(text6, 4)
    for fourgram in fourgrams:
        if fourgram[0] == "coconut":   # keep only 4-grams that start with "coconut"
            print(fourgram)

  Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3407-3412). O'Reilly Media. Kindle Edition.

  9. nltkFreqDist.py – BeautifulSoup + NLTK example

    from nltk import FreqDist, word_tokenize
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
    bsObj = BeautifulSoup(html.read(), "lxml")
    # print(bsObj.h1)
    mytext = bsObj.get_text()
    # Tokenize before counting: FreqDist over a raw string would tally characters, not words
    fdist = FreqDist(word_tokenize(mytext))
    print(fdist.most_common(10))

  10. FreqDist of ngrams (here, 4-grams) • >>> from nltk import ngrams • >>> fourgrams = ngrams(text6, 4) • >>> fourgramsDist = FreqDist(fourgrams) • >>> fourgramsDist[("father", "smelt", "of", "elderberries")] • 1 • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3398-3403). O'Reilly Media. Kindle Edition.

  11. Penn Treebank Tagging (default)

  12. POS tagging
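
By default, nltk.pos_tag labels tokens with Penn Treebank tags. A minimal illustration (the tagger model must first be fetched via nltk.download(); exact tags can vary slightly between tagger versions):

    >>> from nltk import word_tokenize, pos_tag
    >>> pos_tag(word_tokenize("Google is a company"))
    [('Google', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('company', 'NN')]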

  13. NltkAnalysis.py

    from nltk import word_tokenize, sent_tokenize, pos_tag

    sentences = sent_tokenize("Google is one of the best companies in the world. I constantly google myself to see what I'm up to.")
    nouns = ['NN', 'NNS', 'NNP', 'NNPS']   # Penn Treebank noun tags (common/proper, singular/plural)

    for sentence in sentences:
        if "google" in sentence.lower():
            taggedWords = pos_tag(word_tokenize(sentence))
            for word in taggedWords:
                if word[0].lower() == "google" and word[1] in nouns:
                    print(sentence)
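
As written, the script re-prints a sentence only when "google" is tagged with one of the listed noun tags, so the intent is to match the first sentence, where Google is a company name (tagged NNP), and skip the second, where "google" is used as a verb.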
