
Web Scraping and NLTK for Information Extraction (IE)

Learn about web scraping and NLTK's Information Extraction (IE) capabilities, including extracting relations and identifying patterns in text.


Presentation Transcript


  1. CSCE 590 Web Scraping – NLTK IE • Topics • NLTK Information Extraction (IE) • Readings: • Online Book: http://www.nltk.org/book/ch07.html • March 28, 2017

  2. NLTK Information Extraction

  3. Relations • """ • Created on Mon Mar 27 16:50:27 2017 • @author: http://www.nltk.org/book/ch07.html • """ • locs = [('Omnicom', 'IN', 'New York'), • ('DDB Needham', 'IN', 'New York'), • ('Kaplan Thaler Group', 'IN', 'New York'), • ('BBDO South', 'IN', 'Atlanta'), • ('Georgia-Pacific', 'IN', 'Atlanta')] • query = [e1 for (e1, rel, e2) in locs if e2=='Atlanta'] • print(query) • ###['BBDO South', 'Georgia-Pacific']

  4. BeautifulSoup + NLTK • import nltk • from nltk import FreqDist • from urllib.request import urlopen • from bs4 import BeautifulSoup • html = urlopen("http://www.nltk.org/book/ch07.html") • bsObj = BeautifulSoup(html.read(), "lxml") • mytext = bsObj.get_text() • sentences = nltk.sent_tokenize(mytext) • sentences = [nltk.word_tokenize(sent) for sent in sentences] • sentences2 = [nltk.pos_tag(sent) for sent in sentences] • #print(sentences2)

  5. tag patterns • The rules that make up a chunk grammar use tag patterns to describe sequences of tagged words. • A tag pattern is a sequence of part-of-speech tags delimited using angle brackets, e.g. <DT>?<JJ>*<NN>. • Tag patterns are similar to regular expression patterns • Noun Phrases from Wall Street Journal: • another/DT sharp/JJ dive/NN • trade/NN figures/NNS • any/DT new/JJ policy/NN measures/NNS • earlier/JJR stages/NNS • Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP
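As a quick check of the syntax above, here is a minimal sketch (assuming only that nltk is installed) that runs the tag pattern <DT>?<JJ>*<NN> over the first Wall Street Journal noun phrase listed on the slide:

import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"  # optional determiner, any adjectives, then a noun
cp = nltk.RegexpParser(grammar)
# "another sharp dive" as a list of (word, tag) pairs
print(cp.parse([("another", "DT"), ("sharp", "JJ"), ("dive", "NN")]))
# (S (NP another/DT sharp/JJ dive/NN))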

  6. Chunk Parsers • grammar = "NP: {<DT>?<JJ>*<NN>}" • cp = nltk.RegexpParser(grammar) • result = cp.parse(sentences2[1]) • print(result)

  7. grammar = "NP: {<DT>?<JJ>*<NN>}" • cp = nltk.RegexpParser(grammar) • sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), • ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), • ("the", "DT"), ("cat", "NN")] • result = cp.parse(sentence) • print(result) • result.draw()

  8. grammar = r""" • NP: {<DT|PP\$>?<JJ>*<NN>} # chunk determiner/possessive, • # and adjectives and noun • {<NNP>+} # chunk sequences of proper nouns • """ • cp = nltk.RegexpParser(grammar) • sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), • ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")] • result = cp.parse(sentence) • print(result) • result.draw()
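For reference, the NLTK book shows this two-rule grammar producing the following parse, with the proper noun and the possessive noun phrase each picked out as an NP:

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))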

  9. Noun Noun phrases • # as in “money market fund” • nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")] • grammar = "NP: {<NN><NN>} # Chunk two consecutive nouns" • cp = nltk.RegexpParser(grammar) • result = cp.parse(nouns)
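The slide stops before printing the result. Doing so exposes a subtlety discussed in the NLTK book: the chunker pairs money with market and leaves fund unchunked, because once market has been consumed by a chunk it cannot also start one. A more permissive pattern such as {<NN>+} avoids this.

print(result)
# (S (NP money/NN market/NN) fund/NN)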

  10. Infinitive phrases in Brown Corpus • cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}') • brown = nltk.corpus.brown • for sent in brown.tagged_sents(): • tree = cp.parse(sent) • for subtree in tree.subtrees(): • if subtree.label() == 'CHUNK': print(subtree) • _________________________________________________________ • (CHUNK started/VBD to/TO take/VB) • (CHUNK began/VBD to/TO fascinate/VB) • (CHUNK left/VBD to/TO spend/VB) • (CHUNK trying/VBG to/TO talk/VB) • (CHUNK lied/VBD to/TO shorten/VB) • …

  11. 2.5 Chinking • Sometimes it is easier to define what we want to exclude from a chunk. • We can define a chink to be a sequence of tokens that is not included in a chunk. • In the following example, barked/VBD at/IN is a chink: • [ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]

  12. Chunks in – Chinks out • grammar = r""" • NP: • {<.*>+} # Chunk everything • }<VBD|IN>+{ # Chink sequences of VBD and IN • """ • sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), • ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")] • cp = nltk.RegexpParser(grammar) • print(cp.parse(sentence))

  13. >>> print(cp.parse(sentence)) • (S • (NP the/DT little/JJ yellow/JJ dog/NN) • barked/VBD • at/IN • (NP the/DT cat/NN))

  14. Tag versus tree representations
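The figure for this slide is not reproduced in the transcript, so here is a minimal sketch (assuming only nltk) showing the same chunk structure in both representations: nltk.chunk.tree2conlltags flattens a chunk tree into (word, pos, IOB-tag) triples, and nltk.chunk.conlltags2tree rebuilds the tree from them.

import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]
tree = cp.parse(sentence)

# tree -> tag representation
tags = nltk.chunk.tree2conlltags(tree)
print(tags)
# [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'), ('yellow', 'JJ', 'I-NP'),
#  ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O'), ('at', 'IN', 'O'),
#  ('the', 'DT', 'B-NP'), ('cat', 'NN', 'I-NP')]

# tag representation -> tree (round-trips to the original parse)
print(nltk.chunk.conlltags2tree(tags))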

  15. 3 Developing and Evaluating Chunkers • Precision • Recall • F-measure • False-positives • False-negatives
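To make the relationship between these measures concrete, here is a small worked example with illustrative numbers (not corpus results): suppose a chunker proposes 80 chunks, 60 of which exactly match the gold standard, and the gold standard contains 100 chunks in total.

precision = 60 / 80    # 0.75: fraction of proposed chunks that are correct
recall    = 60 / 100   # 0.60: fraction of gold chunks that were found
f_measure = 2 * precision * recall / (precision + recall)   # about 0.667

The 20 wrong proposals are false positives; the 40 gold chunks never proposed are false negatives.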

  16. IOB Format and the CoNLL 2000 Corpus • In IOB format each token is labeled B (begins a chunk), I (inside a chunk), or O (outside any chunk), with the chunk type appended, e.g. B-NP. • text = ''' • he PRP B-NP • accepted VBD B-VP • the DT B-NP • position NN I-NP • of IN B-PP • vice NN B-NP • chairman NN I-NP • of IN B-PP • … • . . O • ''' • nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

  17. from nltk.corpus import conll2000 • print(conll2000.chunked_sents('train.txt')[99]) • print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99]) • # output of the first print: • (S • (PP Over/IN) • (NP a/DT cup/NN) • (PP of/IN) • (NP coffee/NN) • ,/, • (NP Mr./NNP Stone/NNP) • (VP told/VBD) • (NP his/PRP$ story/NN) • ./.)

  18. from nltk.corpus import conll2000 • cp = nltk.RegexpParser("") • test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP']) • print(cp.evaluate(test_sents))
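This establishes a baseline: with an empty grammar nothing is chunked, so precision and recall are zero, and IOB accuracy is non-zero only because the O tags of unchunked tokens still count as correct. The NLTK book reports roughly the following for this baseline:

ChunkParse score:
    IOB Accuracy:  43.4%
    Precision:      0.0%
    Recall:         0.0%
    F-Measure:      0.0%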

  19. Unigram • class UnigramChunker(nltk.ChunkParserI): • def __init__(self, train_sents): • train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)] • for sent in train_sents] • self.tagger = nltk.UnigramTagger(train_data) • def parse(self, sentence): • pos_tags = [pos for (word,pos) in sentence] • tagged_pos_tags = self.tagger.tag(pos_tags) • chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags] • conlltags = [(word, pos, chunktag) for ((word,pos),chunktag) • in zip(sentence, chunktags)] • return nltk.chunk.conlltags2tree(conlltags)

  20. >>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP']) • >>> train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP']) • >>> unigram_chunker = UnigramChunker(train_sents) • >>> print(unigram_chunker.evaluate(test_sents)) • ChunkParse score: • IOB Accuracy: 92.9% • Precision: 79.9% • Recall: 86.8% • F-Measure: 83.2%

  21. Bigram • import nltk • from nltk.corpus import conll2000 • class BigramChunker(nltk.ChunkParserI): • def __init__(self, train_sents): • train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)] • for sent in train_sents] • self.tagger = nltk.BigramTagger(train_data) • def parse(self, sentence): • pos_tags = [pos for (word,pos) in sentence] • tagged_pos_tags = self.tagger.tag(pos_tags) • chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags] • conlltags = [(word, pos, chunktag) for ((word,pos),chunktag) • in zip(sentence, chunktags)] • return nltk.chunk.conlltags2tree(conlltags)

  22. test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP']) • train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP']) • bigram_chunker = BigramChunker(train_sents) • print(bigram_chunker.evaluate(test_sents))
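For comparison with the unigram scores on slide 20, the NLTK book reports the bigram chunker scoring slightly higher on every measure, roughly:

ChunkParse score:
    IOB Accuracy:  93.3%
    Precision:     82.3%
    Recall:        86.8%
    F-Measure:     84.5%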
