
Lecture 12 Classifiers Part 2

Lecture 12 Classifiers Part 2. CSCE 771 Natural Language Processing. Topics: Classifiers, Maxent Classifiers, Maximum Entropy Markov Models, Information Extraction and chunking intro. Readings: Chapter 6, 7.1. February 25, 2013. Overview. Last Time Confusion Matrix

blancheb

Presentation Transcript


  1. Lecture 12 Classifiers Part 2 • CSCE 771 Natural Language Processing • Topics • Classifiers • Maxent Classifiers • Maximum Entropy Markov Models • Information Extraction and chunking intro • Readings: Chapter 6, 7.1 • February 25, 2013

  2. Overview • Last Time • Confusion Matrix • Brill Demo • NLTK Ch 6 - Text Classification • Today • Confusion Matrix • Brill Demo • NLTK Ch 6 - Text Classification • Readings • NLTK Ch 6

  3. Evaluation of classifiers again • Last time • Recall • Precision • F value • Accuracy
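
The four measures listed on this slide can all be computed from the counts of true positives, false positives, false negatives and true negatives of a binary classifier. A minimal sketch in Python 3, using only the standard library; the function name and the example counts are made up for illustration.

```python
def evaluation_scores(tp, fp, fn, tn):
    """Return (precision, recall, F1, accuracy) from binary confusion counts."""
    # Precision: of everything labeled positive, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that is really positive, how much was found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F value (balanced F1): harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # Accuracy: fraction of all decisions that were correct.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Hypothetical counts for one classifier.
p, r, f1, acc = evaluation_scores(tp=40, fp=10, fn=20, tn=30)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))
```

With these counts precision is 0.8, recall is about 0.667, F1 is about 0.727 and accuracy is 0.7.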

  4. Reuters Data set • 21578 documents • 118 categories • document can be in multiple classes • 118 binary classifiers

  5. Confusion matrix • Cij – documents that are really Ci but are classified as Cj • Cii – documents that are really Ci and are correctly classified
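
As a rough illustration of the Cij notation, the sketch below counts (true class, predicted class) pairs with a `Counter`. It is plain Python 3, not the NLTK `ConfusionMatrix` class, and the class labels and predictions are invented for the example.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Map (true class i, predicted class j) to the count C_ij."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


# Invented labels for six documents.
gold = ["grain", "grain", "trade", "trade", "trade", "crude"]
predicted = ["grain", "trade", "trade", "trade", "crude", "crude"]
cm = confusion_matrix(gold, predicted)

# Diagonal cells C_ii are the correctly classified documents.
correct = sum(count for (true, pred), count in cm.items() if true == pred)
print(cm[("grain", "trade")], correct)
```

Here one "grain" document is misclassified as "trade", and four of the six documents sit on the diagonal.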

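  6. Micro averaging vs Macro Averaging • Macro averaging – average the performance of the individual classifiers (average of averages) • Micro averaging – sum up all correct decisions, false positives (fp) and false negatives (fn) over all classifiers, then compute the measure once from the pooled counts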
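
The difference between the two averages shows up when classes differ a lot in size. A small sketch for precision, in standard-library Python 3; the per-class counts are hypothetical, and the code assumes every class has at least one positive prediction.

```python
def macro_micro_precision(counts):
    """counts maps class name -> (tp, fp) for that class's binary classifier."""
    # Macro: compute precision per class, then average the averages.
    per_class = [tp / (tp + fp) for tp, fp in counts.values()]
    macro = sum(per_class) / len(per_class)
    # Micro: pool the raw counts first, then compute precision once.
    total_tp = sum(tp for tp, _ in counts.values())
    total_fp = sum(fp for _, fp in counts.values())
    micro = total_tp / (total_tp + total_fp)
    return macro, micro


# Hypothetical counts: one frequent, well-handled class and one rare, poorly handled class.
counts = {"common": (90, 10), "rare": (1, 9)}
macro, micro = macro_micro_precision(counts)
print(round(macro, 3), round(micro, 3))
```

Macro precision is 0.5 because both classes count equally, while micro precision is about 0.827 because the frequent class dominates the pooled counts.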

  7. Training, Development and Test Sets

  8. nltk.tag • Classes • AffixTagger, BigramTagger, BrillTagger, BrillTaggerTrainer, DefaultTagger, FastBrillTaggerTrainer, HiddenMarkovModelTagger, HiddenMarkovModelTrainer, NgramTagger, RegexpTagger, TaggerI, TrigramTagger, UnigramTagger • Functions • batch_pos_tag, pos_tag, untag

  9. Module nltk.tag.hmm • Source Code for Module nltk.tag.hmm • import nltk • nltk.tag.hmm.demo() • nltk.tag.hmm.demo_pos() • nltk.tag.hmm.demo_pos_bw()

  10. HMM demo • import nltk • nltk.tag.hmm.demo() • nltk.tag.hmm.demo_pos() • nltk.tag.hmm.demo_pos_bw()

  11. Common Suffixes
      import nltk
      from nltk.corpus import brown
      suffix_fdist = nltk.FreqDist()
      # Count the last one, two and three letters of every word.
      for word in brown.words():
          word = word.lower()
          suffix_fdist.inc(word[-1:])
          suffix_fdist.inc(word[-2:])
          suffix_fdist.inc(word[-3:])
      # In NLTK 2, FreqDist keys come back sorted by decreasing count.
      common_suffixes = suffix_fdist.keys()[:100]
      print common_suffixes

  12. rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
      extractor = nltk.RTEFeatureExtractor(rtepair)
      print extractor.text_words
      set(['Russia', 'Organisation', 'Shanghai', …
      print extractor.hyp_words
      set(['member', 'SCO', 'China'])
      print extractor.overlap('word')
      set([ ])
      print extractor.overlap('ne')
      set(['SCO', 'China'])
      print extractor.hyp_extra('word')
      set(['member'])

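  13. import random
      import nltk
      from nltk.corpus import brown
      # Option 1: shuffle the sentences and hold out 10% for testing.
      tagged_sents = list(brown.tagged_sents(categories='news'))
      random.shuffle(tagged_sents)
      size = int(len(tagged_sents) * 0.1)
      train_set, test_set = tagged_sents[size:], tagged_sents[:size]
      # Option 2: split by file, so train and test come from different documents.
      file_ids = brown.fileids(categories='news')
      size = int(len(file_ids) * 0.1)
      train_set = brown.tagged_sents(file_ids[size:])
      test_set = brown.tagged_sents(file_ids[:size])
      # Option 3: train and test on different genres.
      train_set = brown.tagged_sents(categories='news')
      test_set = brown.tagged_sents(categories='fiction')
      # Passing tagged sentences directly raises the ValueError on the next slide.
      classifier = nltk.NaiveBayesClassifier.train(train_set)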

  14. Traceback (most recent call last):
        File "C:\Users\mmm\Documents\Courses\771\Python771\ch06\ch06d.py", line 80, in <module>
          classifier = nltk.NaiveBayesClassifier.train(train_set)
        File "C:\Python27\lib\site-packages\nltk\classify\naivebayes.py", line 191, in train
          for featureset, label in labeled_featuresets:
      ValueError: too many values to unpack
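
The failing line in the traceback, `for featureset, label in labeled_featuresets`, expects each training item to be a (featureset, label) pair. A tagged sentence is a list of many (word, tag) tuples, so unpacking it into two names most likely is what fails. The sketch below shows one way to flatten tagged sentences into the expected shape. It is standard-library Python 3; `pos_features` is a hypothetical feature extractor chosen for illustration, not the one used in the lecture.

```python
def pos_features(word):
    """Hypothetical feature extractor: suffix features for a single word."""
    lowered = word.lower()
    return {
        "suffix1": lowered[-1:],
        "suffix2": lowered[-2:],
        "suffix3": lowered[-3:],
    }


def to_labeled_featuresets(tagged_sents):
    """Flatten tagged sentences into a list of (featureset, label) pairs."""
    return [
        (pos_features(word), tag)
        for sent in tagged_sents
        for (word, tag) in sent
    ]


# Two tiny made-up tagged sentences standing in for brown.tagged_sents(...).
tagged_sents = [
    [("The", "AT"), ("dog", "NN"), ("barked", "VBD")],
    [("Dogs", "NNS"), ("bark", "VB")],
]
train_set = to_labeled_featuresets(tagged_sents)

# Every item now unpacks cleanly into exactly two values.
for featureset, label in train_set:
    print(label, featureset)
```

A list in this shape is what a call such as `nltk.NaiveBayesClassifier.train(train_set)` iterates over.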

  15. import nltk
      from nltk.corpus import brown
      brown_tagged_sents = brown.tagged_sents(categories='news')
      size = int(len(brown_tagged_sents) * 0.9)
      train_sents = brown_tagged_sents[:size]
      test_sents = brown_tagged_sents[size:]
      # Backoff chain: bigram tagger, then unigram tagger, then the default tag 'NN'.
      t0 = nltk.DefaultTagger('NN')
      t1 = nltk.UnigramTagger(train_sents, backoff=t0)
      t2 = nltk.BigramTagger(train_sents, backoff=t1)

  16. def tag_list(tagged_sents):
          # Flatten tagged sentences into one list of tags.
          return [tag for sent in tagged_sents for (word, tag) in sent]
      def apply_tagger(tagger, corpus):
          # Strip the gold tags, then re-tag each sentence with the tagger.
          return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]
      gold = tag_list(brown.tagged_sents(categories='editorial'))
      test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))
      cm = nltk.ConfusionMatrix(gold, test)
      print cm.pp(sort_by_count=True, show_percents=True, truncate=9)

  17.     |                                         N                      |
          |      N      I      A      J             N             V      N |
          |      N      N      T      J      .      S      ,      B      P |
      ----+----------------------------------------------------------------+
       NN | <11.8%>  0.0%      .   0.2%      .   0.0%      .   0.3%   0.0% |
       IN |   0.0%  <9.0%>     .      .      .   0.0%      .      .      . |
       AT |      .      .  <8.6%>     .      .      .      .      .      . |
       JJ |   1.7%      .      .  <3.9%>     .      .      .   0.0%   0.0% |
        . |      .      .      .      .  <4.8%>     .      .      .      . |
      NNS |   1.5%      .      .      .      .  <3.2%>     .      .   0.0% |
        , |      .      .      .      .      .      .  <4.4%>     .      . |
       VB |   0.9%      .      .   0.0%      .      .      .  <2.4%>     . |
       NP |   1.0%      .      .   0.0%      .      .      .      .  <1.8%>|
      ----+----------------------------------------------------------------+
      (row = reference; col = test)

  18. Entropy
      import math
      import nltk
      def entropy(labels):
          freqdist = nltk.FreqDist(labels)
          # Relative frequency of each distinct label.
          probs = [freqdist.freq(l) for l in freqdist]
          return -sum([p * math.log(p,2) for p in probs])

  19. print entropy(['male', 'male', 'male', 'male'])
      -0.0
      print entropy(['male', 'female', 'male', 'male'])
      0.811278124459
      print entropy(['female', 'male', 'female', 'male'])
      1.0
      print entropy(['female', 'female', 'male', 'female'])
      0.811278124459
      print entropy(['female', 'female', 'female', 'female'])
      -0.0
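
The 0.811 values can be checked by hand. The entropy of a label list is the standard Shannon entropy over the relative label frequencies p(l); the worked case below takes the three-to-one split, where one label has probability 3/4 and the other 1/4.

```latex
H = -\sum_{l} p(l) \log_2 p(l)

H = -\left( \tfrac{3}{4} \log_2 \tfrac{3}{4} + \tfrac{1}{4} \log_2 \tfrac{1}{4} \right)
  \approx 0.311 + 0.500
  = 0.811
```

A single label gives H = 0 (no uncertainty) and an even two-way split gives H = 1 bit, matching the other outputs.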

  20. The Rest of NLTK Chapter 06 • 6.5 Naïve Bayes Classifiers • 6.6 Maximum Entropy Classifiers • nltk.classify.maxent.BinaryMaxentFeatureEncoding(labels, mapping, unseen_features=False, alwayson_features=False) • 6.7 Modeling Linguistic Patterns • 6.8 Summary • But no more Code?!?

  21. Maximum Entropy Models (again) • features are elements of evidence that connect observations d with categories c • f: C × D → R • Example feature • f(c, d) = [c = LOCATION ∧ w-1 = IN ∧ isCapitalized(w)] • An “input-feature” is a property of an unlabeled token. • A “joint-feature” is a property of a labeled token.

  22. Feature-Based Linear Classifiers • $p(c \mid d, \lambda) = \dfrac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}$
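
A feature-based linear classifier of this kind typically scores each class by a weighted sum of its active features and then normalizes the exponentiated scores over all classes. The sketch below does that in standard-library Python 3. The feature function, feature names and weights are all hypothetical, and the calling convention is chosen for this example rather than taken from NLTK.

```python
import math


def maxent_probs(d, classes, feature_fn, weights):
    """Return p(c | d) for each class under a log-linear model."""
    # Score of a class: sum of the weights of its active joint features.
    scores = {
        c: sum(weights.get(f, 0.0) for f in feature_fn(c, d)) for c in classes
    }
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


def feature_fn(c, d):
    """Hypothetical joint features over a (class, datum) pair."""
    prev_word, word = d
    feats = []
    if c == "LOCATION" and prev_word == "in" and word[:1].isupper():
        feats.append("loc_after_in_capitalized")
    if c == "PERSON" and word[:1].isupper():
        feats.append("person_capitalized")
    return feats


# Made-up weights; in practice they are learned from training data.
weights = {"loc_after_in_capitalized": 1.8, "person_capitalized": 0.5}
probs = maxent_probs(("in", "Paris"), ["LOCATION", "PERSON", "OTHER"],
                     feature_fn, weights)
print(max(probs, key=probs.get))
```

With these weights the capitalized word after "in" is most likely a LOCATION, and the three probabilities sum to one.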

  23. Maxent Model revisited

  24. Maximum Entropy Markov Models (MEMM) • apply a Maxent classifier repeatedly, one position at a time, across a sequence, with each decision conditioned on the previous tag
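
One simple way to picture an MEMM is a local maxent classifier run left to right, where each decision sees the current word and the tag just chosen for the previous word. The sketch below uses greedy decoding for brevity; Viterbi search over the same local distributions is the usual alternative. It is standard-library Python 3, and the tag set, features and weights are invented for illustration.

```python
import math


def local_probs(prev_tag, word, tags, weights):
    """Maxent distribution over the current tag given previous tag and word."""
    scores = {}
    for tag in tags:
        feats = [
            "prev=%s,tag=%s" % (prev_tag, tag),
            "cap=%s,tag=%s" % (word[:1].isupper(), tag),
        ]
        scores[tag] = sum(weights.get(f, 0.0) for f in feats)
    z = sum(math.exp(s) for s in scores.values())
    return {tag: math.exp(s) / z for tag, s in scores.items()}


def greedy_decode(words, tags, weights, start="<s>"):
    """Tag left to right, feeding each chosen tag into the next decision."""
    prev, output = start, []
    for word in words:
        probs = local_probs(prev, word, tags, weights)
        prev = max(probs, key=probs.get)
        output.append(prev)
    return output


tags = ["B-PER", "I-PER", "O"]
# Made-up weights; a real model would learn these from annotated text.
weights = {
    "prev=<s>,tag=B-PER": 1.0,
    "prev=B-PER,tag=I-PER": 1.5,
    "prev=O,tag=O": 0.5,
    "cap=True,tag=B-PER": 1.0,
    "cap=True,tag=I-PER": 1.0,
    "cap=False,tag=O": 2.0,
}
tagged = greedy_decode(["Jim", "Smith", "bought", "shares"], tags, weights)
print(tagged)
```

With these toy weights the output is B-PER, I-PER, O, O: "Smith" is tagged I-PER only because the previous decision was B-PER.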

  25. Named Entity Recognition (NER) • entity – • a: being, existence; especially: independent, separate, or self-contained existence • b: the existence of a thing as contrasted with its attributes • something that has separate and distinct existence and objective or conceptual reality • an organization (as a business or governmental unit) that has an identity separate from those of its members • a named entity is one of those with a name • http://nlp.stanford.edu/software/CRF-NER.shtml

  26. Classes of Named Entities • Person (PERS) • Location (LOC) • Organization (ORG) • DATE • Example: Jim bought 300 shares of Acme Corp. in 2006. • This produces an annotated block of text, such as this one: • <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>. • http://nlp.stanford.edu/software/CRF-NER.shtml

  27. IOB tagging
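
In IOB tagging each token gets one tag: B (begin) for the first token of a chunk, I (inside) for the following tokens of the same chunk, and O (outside) for tokens in no chunk. The sketch below converts chunk spans into IOB tags in standard-library Python 3. The span format, (start, end, label) with an exclusive end, is an assumption made for this example.

```python
def spans_to_iob(tokens, chunks):
    """Turn (start, end, label) chunk spans into one IOB tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in chunks:
        tags[start] = "B-" + label          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the chunk
    return list(zip(tokens, tags))


tokens = ["the", "little", "yellow", "dog", "barked", "at", "the", "cat"]
chunks = [(0, 4, "NP"), (6, 8, "NP")]
iob = spans_to_iob(tokens, chunks)
for token, tag in iob:
    print(token, tag)
```

The two noun phrases come out as B-NP I-NP I-NP I-NP and B-NP I-NP, with "barked" and "at" tagged O.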

  28. .

  29. Chunking - partial parsing

  30. NLTK ch07.py
      import nltk
      def ie_preprocess(document):
          sentences = nltk.sent_tokenize(document)
          sentences = [nltk.word_tokenize(sent) for sent in sentences]
          sentences = [nltk.pos_tag(sent) for sent in sentences]
          return sentences
      sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
                  ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
      # NP chunk: optional determiner, any number of adjectives, then a noun.
      grammar = "NP: {<DT>?<JJ>*<NN>}"
      cp = nltk.RegexpParser(grammar)
      result = cp.parse(sentence)
      print result

  31. (S
        (NP the/DT little/JJ yellow/JJ dog/NN)
        barked/VBD
        at/IN
        (NP the/DT cat/NN))
      (S
        (NP the/DT little/JJ yellow/JJ dog/NN)
        barked/VBD
        at/IN
        (NP the/DT cat/NN))
      (S (NP money/NN market/NN) fund/NN)
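
To see what a tag pattern such as `<DT>?<JJ>*<NN>` matches, it helps to treat the tag sequence as a string and run an ordinary regular expression over it. The sketch below is a rough standard-library approximation of that single rule in Python 3; it is not `nltk.RegexpParser`, which handles far more general grammars, and the helper name `chunk_np` is made up.

```python
import re


def chunk_np(tagged_sentence):
    """Find NP chunks for the one pattern <DT>?<JJ>*<NN>."""
    # Encode the tags as one string, remembering where each token's tag starts.
    tag_string = ""
    offset_to_index = {}
    for i, (_, tag) in enumerate(tagged_sentence):
        offset_to_index[len(tag_string)] = i
        tag_string += "<%s>" % tag
    offset_to_index[len(tag_string)] = len(tagged_sentence)

    # Optional determiner, any number of adjectives, then one noun.
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    chunks = []
    for match in pattern.finditer(tag_string):
        start = offset_to_index[match.start()]
        end = offset_to_index[match.end()]
        chunks.append([word for word, _ in tagged_sentence[start:end]])
    return chunks


sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
chunks = chunk_np(sentence)
print(chunks)
```

For the example sentence this finds the same two chunks as the parse tree: "the little yellow dog" and "the cat".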

  32. (CHUNK combined/VBN to/TO achieve/VB) • (CHUNK continue/VB to/TO place/VB) • (CHUNK serve/VB to/TO protect/VB) • (CHUNK wanted/VBD to/TO wait/VB)

  33. import nltk
      from nltk.corpus import conll2000
      print conll2000.chunked_sents('train.txt')[99]
      print " B********************************************"
      print conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99]
      print " C********************************************"
      # Baseline: an empty grammar creates no chunks at all.
      cp = nltk.RegexpParser("")
      test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
      print cp.evaluate(test_sents)

  34. Information extraction • Step towards understanding • Find named entities • Figure out what is being said about them; actually just relations of named entities http://en.wikipedia.org/wiki/Information_extraction

  35. Outline of natural language processing http://en.wikipedia.org/wiki/Natural_language_processing_toolkits
