
Lecture on NLTK POS Tagging Part 3 - Rule Based, Probabilistic, and Transformation Based Taggers

This lecture provides an overview of Part of Speech (POS) tagging in NLTK, covering rule-based, probabilistic (n-gram), and transformation-based (Brill) taggers, along with supervised training and evaluation.


Presentation Transcript


  1. Lecture 10 NLTK POS Tagging Part 3 CSCE 771 Natural Language Processing • Topics • Taggers • Rule Based Taggers • Probabilistic Taggers • Transformation Based Taggers - Brill • Supervised learning • Readings: Chapter 5.4-? February 18, 2013

  2. Overview • Last Time • Overview of POS Tags • Today • Part of Speech Tagging • Parts of Speech • Rule Based taggers • Stochastic taggers • Transformational taggers • Readings • Chapter 5.4-5.?

  3. What tags follow "often"? • brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True) • tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often'] • fd = nltk.FreqDist(tags) • fd.tabulate() • VN V VD ADJ DET ADV P , CNJ . TO VBZ VG WH • 15 12 8 5 5 4 4 3 3 1 1 1 1 1

  4. highly ambiguous words • >>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True) • >>> data = nltk.ConditionalFreqDist((word.lower(), tag) ... for (word, tag) in brown_news_tagged) • >>> for word in data.conditions(): • ... if len(data[word]) > 3: • ... tags = data[word].keys() • ... print word, ' '.join(tags) • ... • best ADJ ADV NP V • better ADJ ADV V DET • ….

  5. Tag Package • http://nltk.org/api/nltk.tag.html#module-nltk.tag
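The package's convenience entry point is nltk.pos_tag, which applies NLTK's recommended pre-trained tagger. A minimal sketch using the book's example sentence (the nltk.download model names are from current NLTK and are an assumption relative to the 2013-era library):

    import nltk
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # one-time, modern NLTK
    tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
    print nltk.pos_tag(tokens)
    # [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ...]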

  6. Python's Dictionary Methods (the slide's summary table was not preserved in the transcript; a sketch follows below)
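As a stand-in for the lost table, a minimal sketch of the dict operations the tagging code in this lecture relies on (building word-to-tag maps, lookups with defaults); the toy tag_of dictionary is illustrative only:

    tag_of = {'the': 'AT', 'dog': 'NN', 'ran': 'VBD'}
    tag_of['cat'] = 'NN'               # add or update an entry
    print tag_of.get('cat', 'UNK')     # lookup with a default -> 'NN'
    print tag_of.get('blog', 'UNK')    # missing key falls back to the default -> 'UNK'
    print tag_of.keys()                # all words
    print tag_of.items()               # (word, tag) pairs
    print 'dog' in tag_of              # membership test -> True
    del tag_of['dog']                  # remove an entry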

  7. 5.4   Automatic Tagging • Training set • Test set • ### setup • import nltk, re, pprint • from nltk.corpus import brown • brown_tagged_sents = brown.tagged_sents(categories='news') • brown_sents = brown.sents(categories='news')

  8. DefaultTagger → NN • tags = [tag for (word, tag) in brown.tagged_words(categories='news')] • print nltk.FreqDist(tags).max() • raw = 'I do not like green eggs and ham, I …Sam I am!' • tokens = nltk.word_tokenize(raw) • default_tagger = nltk.DefaultTagger('NN') • print default_tagger.tag(tokens) • [('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), … • print default_tagger.evaluate(brown_tagged_sents) • 0.130894842572

  9. Tagger2: regexp_tagger • patterns = [ • (r'.*ing$', 'VBG'), # gerunds • (r'.*ed$', 'VBD'), # simple past • (r'.*es$', 'VBZ'), # 3rd singular present • (r'.*ould$', 'MD'), # modals • (r'.*\'s$', 'NN$'), # possessive nouns • (r'.*s$', 'NNS'), # plural nouns • (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers (note the escaped dot) • (r'.*', 'NN') # nouns (default) • ] • regexp_tagger = nltk.RegexpTagger(patterns)

  10. Evaluate regexp_tagger • regexp_tagger = nltk.RegexpTagger(patterns) • print regexp_tagger.tag(brown_sents[3]) • [('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), … • print regexp_tagger.evaluate(brown_tagged_sents) • 0.203263917895

  11. Unigram Tagger: 100 Most Freq tag • fd = nltk.FreqDist(brown.words(categories='news')) • cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news')) • most_freq_words = fd.keys()[:100] # NLTK 2 FreqDist keys are sorted by decreasing frequency • likely_tags = dict((word, cfd[word].max()) for word in most_freq_words) • baseline_tagger = nltk.UnigramTagger(model=likely_tags) • print baseline_tagger.evaluate(brown_tagged_sents) • 0.455784951369

  12. Likely_tags; Backoff to NN • sent = brown.sents(categories='news')[3] • baseline_tagger.tag(sent) • [('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'NN'), … • baseline_tagger = nltk.UnigramTagger(model=likely_tags, • backoff=nltk.DefaultTagger('NN')) • print baseline_tagger.tag(sent) • [('Only', 'NN'), ('a', 'AT'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'IN'), … • print baseline_tagger.evaluate(brown_tagged_sents) • 0.581776955666

  13. Performance of Easy Taggers (summarized below; the slide's original table was not preserved in the transcript)
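Collecting the accuracies reported on the preceding slides (all evaluated on the Brown news tagged sentences):

    Tagger                              Accuracy
    DefaultTagger('NN')                 0.131
    RegexpTagger(patterns)              0.203
    Lookup (100 most frequent words)    0.456
    Lookup + DefaultTagger backoff      0.582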

  14. def performance(cfd, wordlist): • lt = dict((word, cfd[word].max()) for word in wordlist) • baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN')) • return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

  15. Display • def display(): • import pylab • words_by_freq = list(nltk.FreqDist(brown.words(categories='news'))) • cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news')) • sizes = 2 ** pylab.arange(15) • perfs = [performance(cfd, words_by_freq[:size]) for size in sizes] • pylab.plot(sizes, perfs, '-bo') • pylab.title('Lookup Tagger Perf. vs Model Size') • pylab.xlabel('Model Size') • pylab.ylabel('Performance') • pylab.show()

  16. Error !? • Traceback (most recent call last): • File "C:/Users/mmm/Documents/Courses/771/Python771/ch05.4.py", line 70, in <module> • import pylab • ImportError: No module named pylab • Fix: pylab is provided by matplotlib, so installing matplotlib (e.g., pip install matplotlib) resolves the import; it is commonly distributed as part of the scipy stack.

  17. 5.5 N-gram Tagging • from nltk.corpus import brown • brown_tagged_sents = brown.tagged_sents(categories='news') • brown_sents = brown.sents(categories='news') • unigram_tagger = nltk.UnigramTagger(brown_tagged_sents) • print unigram_tagger.tag(brown_sents[2007]) • [('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), • print unigram_tagger.evaluate(brown_tagged_sents) • 0.934900650397

  18. Dividing into Training/Test Sets • size = int(len(brown_tagged_sents) * 0.9) • print size • 4160 • train_sents = brown_tagged_sents[:size] • test_sents = brown_tagged_sents[size:] • unigram_tagger = nltk.UnigramTagger(train_sents) • print unigram_tagger.evaluate(test_sents) • 0.811023622047

  19. bigram_tagger, first try • bigram_tagger = nltk.BigramTagger(train_sents) • print "bigram_tagger.tag-2007", bigram_tagger.tag(brown_sents[2007]) • bigram_tagger.tag-2007 [('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER') • unseen_sent = brown_sents[4203] • print "bigram_tagger.tag-4203", bigram_tagger.tag(unseen_sent) • bigram_tagger.tag-4203 [('The', 'AT'), ('is', 'BEZ'), ('13.5', None), ('million', None), (',', None), ('divided', None), • print bigram_tagger.evaluate(test_sents) • 0.102162862554 (not good: once the tagger meets an unseen bigram context it assigns None, and every subsequent context then contains None, so the rest of the sentence also gets None)

  20. Backoff: bigram → unigram → NN • t0 = nltk.DefaultTagger('NN') • t1 = nltk.UnigramTagger(train_sents, backoff=t0) • t2 = nltk.BigramTagger(train_sents, backoff=t1) • print t2.evaluate(test_sents) • 0.844712448919

  21. Your turn: trigram → bigram → unigram → NN (a possible solution is sketched below)
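A minimal sketch of the exercise, extending the previous slide's backoff chain with nltk.TrigramTagger; the resulting accuracy is left for the reader to check:

    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    t3 = nltk.TrigramTagger(train_sents, backoff=t2)  # trigram -> bigram -> unigram -> NN
    print t3.evaluate(test_sents)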

  22. Tagging Unknown Words • Our approach to tagging unknown words still uses backoff to a regular-expression tagger or a default tagger. These are unable to make use of context. Thus, if our tagger encountered the word blog, not seen during training, it would assign it the same tag, regardless of whether this word appeared in the context the blog or to blog. How can we do better with these unknown words, or out-of-vocabulary items? • A useful method to tag unknown words based on context is to limit the vocabulary of a tagger to the most frequent n words, and to replace every other word with a special word UNK using the method shown in 5.3. During training, a unigram tagger will probably learn that UNK is usually a noun. However, the n-gram taggers will detect contexts in which it has some other tag. For example, if the preceding word is to (tagged TO), then UNK will probably be tagged as a verb.
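A minimal sketch of this UNK preprocessing; the vocabulary cutoff of 1000 words and the helper unkify are illustrative assumptions, not from the slides:

    vocab = nltk.FreqDist(brown.words(categories='news'))
    top = set(vocab.keys()[:1000])  # NLTK 2: keys are frequency-sorted; in NLTK 3 use most_common(1000)

    def unkify(tagged_sents, vocab_set):
        # replace every out-of-vocabulary word with the special token UNK
        return [[(w if w in vocab_set else 'UNK', t) for (w, t) in sent]
                for sent in tagged_sents]

    train_unk = unkify(train_sents, top)
    test_unk = unkify(test_sents, top)
    t2_unk = nltk.BigramTagger(train_unk,
                 backoff=nltk.UnigramTagger(train_unk, backoff=nltk.DefaultTagger('NN')))
    print t2_unk.evaluate(test_unk)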

  23. Serialization = pickle • Saving (object serialization) • from cPickle import dump • output = open('t2.pkl', 'wb') • dump(t2, output, -1) • output.close() • Loading • from cPickle import load • input = open('t2.pkl', 'rb') • tagger = load(input) • input.close()

  24. Performance Limitations

  25. text = """The board's action shows what free enterprise • is up against in our complex maze of regulatory laws .""" • tokens = text.split() • tagger.tag(tokens) • cfd = nltk.ConditionalFreqDist( • ((x[1], y[1], z[0]), z[1]) • for sent in brown_tagged_sents • for x, y, z in nltk.trigrams(sent)) • ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1] • # fraction of trigram contexts that are ambiguous (float() avoids Python 2 integer division) • print float(sum(cfd[c].N() for c in ambiguous_contexts)) / cfd.N()

  26. Confusion Matrix • test_tags = [tag for sent in brown.sents(categories='editorial') • for (word, tag) in t2.tag(sent)] • gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')] • print nltk.ConfusionMatrix(gold_tags, test_tags) • overwhelming output

  27. nltk.tag.brill.demo() • Loading tagged data... • Done loading. • Training unigram tagger: • [accuracy: 0.832151] • Training bigram tagger: • [accuracy: 0.837930] • Training Brill tagger on 1600 sentences... • Finding initial useful rules... • Found 9757 useful rules.

  28. Score Fixed Broken Other | Rule • Score = Fixed - Broken • Fixed = num tags changed incorrect -> correct • Broken = num tags changed correct -> incorrect • Other = num tags changed incorrect -> incorrect • ------------------+------------------------------------------------------- • 11 15 4 0 | WDT -> IN if the tag of words i+1...i+2 is 'DT' • 10 12 2 0 | IN -> RB if the text of the following word is 'well' • 9 9 0 0 | WDT -> IN if the tag of the preceding word is 'NN', and the tag of the following word is 'NNP' • 7 9 2 0 | RBR -> JJR if the tag of words i+1...i+2 is 'NNS' • 7 10 3 0 | WDT -> IN if the tag of words i+1...i+2 is 'NNS'

  29. 5 5 0 0 | WDT -> IN if the tag of the preceding word is 'NN', and the tag of the following word is 'PRP' • 4 4 0 1 | WDT -> IN if the tag of words i+1...i+3 is 'VBG' • 3 3 0 0 | RB -> IN if the tag of the preceding word is 'NN', and the tag of the following word is 'DT' • 3 3 0 0 | RBR -> JJR if the tag of the following word is 'NN' • 3 3 0 0 | VBP -> VB if the tag of words i-3...i-1 is 'MD' • 3 3 0 0 | NNS -> NN if the text of the preceding word is 'one' • 3 3 0 0 | RP -> RB if the text of words i-3...i-1 is 'were' • 3 3 0 0 | VBP -> VB if the text of words i-2...i-1 is "n't" • Brill accuracy: 0.839156 • Done; rules and errors saved to rules.yaml and errors.out.
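In current NLTK (3.x) the demo's training loop can be reproduced roughly as follows; BrillTaggerTrainer, the fntbl37 template set, and the max_rules value are the modern API and are assumptions relative to the 2013-era nltk.tag.brill.demo():

    from nltk.corpus import brown
    from nltk.tag import DefaultTagger, UnigramTagger
    from nltk.tag.brill import fntbl37                  # a standard transformation-template set
    from nltk.tag.brill_trainer import BrillTaggerTrainer

    tagged_sents = brown.tagged_sents(categories='news')
    size = int(len(tagged_sents) * 0.9)
    train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

    # initial-state tagger whose output the learned transformation rules will correct
    baseline = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))

    trainer = BrillTaggerTrainer(baseline, fntbl37(), trace=1)
    brill_tagger = trainer.train(train_sents, max_rules=100)  # learn up to 100 rules
    print(brill_tagger.evaluate(test_sents))
    print(brill_tagger.rules()[:5])                     # inspect the top learned rules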
