
SET (5)


Presentation Transcript


  1. SET (5) Prof. Dragomir R. Radev radev@umich.edu

  2. SET Fall 2013 … 9. Text classification Naïve Bayesian classifiers Decision trees …

  3. Introduction • Text classification: assigning documents to predefined categories: topics, languages, users • A given set of classes C • Given x, determine its class in C • Hierarchical vs. flat • Overlapping (soft) vs non-overlapping (hard)

  4. Introduction • Ideas: manual classification using rules (e.g., Columbia AND University → Education; Columbia AND “South Carolina” → Geography) • Popular techniques: generative (kNN, Naïve Bayes) vs. discriminative (SVM, regression) • Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x) • Discriminative: model p(y|x) directly.

  5. Bayes formula Full probability
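The equations on this slide did not survive the transcript; for reference, the standard statements of Bayes' formula and the full (total) probability expansion, which the example on the next slide uses, are:

```latex
% Bayes' formula
P(C \mid x) \;=\; \frac{P(x \mid C)\,P(C)}{P(x)}
% Full probability: expand the evidence over all classes C_i
P(x) \;=\; \sum_{i} P(x \mid C_i)\,P(C_i)
```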

  6. Example (performance-enhancing drug) • Drug (D) with values y/n • Test (T) with values +/- • P(D=y) = 0.001 • P(T=+|D=y) = 0.8 • P(T=+|D=n) = 0.01 • Given: an athlete tests positive • P(D=y|T=+) = P(T=+|D=y)P(D=y) / (P(T=+|D=y)P(D=y) + P(T=+|D=n)P(D=n)) = (0.8 × 0.001) / (0.8 × 0.001 + 0.01 × 0.999) ≈ 0.074
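A minimal sketch of the same computation in Python, using the numbers from the slide (the variable names are mine):

```python
# Prior and test characteristics from the slide
p_drug = 0.001            # P(D = y)
p_pos_given_drug = 0.8    # P(T = + | D = y)
p_pos_given_clean = 0.01  # P(T = + | D = n)

# Full probability of a positive test, then Bayes' formula
p_pos = p_pos_given_drug * p_drug + p_pos_given_clean * (1 - p_drug)
p_drug_given_pos = p_pos_given_drug * p_drug / p_pos
print(round(p_drug_given_pos, 3))  # 0.074
```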

  7. Naïve Bayesian classifiers • Naïve Bayesian classifier • Assumes statistical independence of the features given the class • Features are typically words (or phrases)
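The classifier definition on this slide was an equation that did not survive as text; the standard Naïve Bayes decision rule that the bullets describe (not copied from the slide) is:

```latex
\hat{c} \;=\; \arg\max_{c \in C} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

where w_1, …, w_n are the feature words of the document and C is the set of classes; the product form is exactly the conditional-independence assumption.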

  8. Example • p(well)=0.9, p(cold)=0.05, p(allergy)=0.05 • p(sneeze|well)=0.1 • p(sneeze|cold)=0.9 • p(sneeze|allergy)=0.9 • p(cough|well)=0.1 • p(cough|cold)=0.8 • p(cough|allergy)=0.7 • p(fever|well)=0.01 • p(fever|cold)=0.7 • p(fever|allergy)=0.4 Example from Ray Mooney

  9. Example (cont’d) • Features: sneeze, cough, no fever • P(well|e) = (.9)(.1)(.1)(.99) / p(e) = 0.0089/p(e) • P(cold|e) = (.05)(.9)(.8)(.3) / p(e) = 0.01/p(e) • P(allergy|e) = (.05)(.9)(.7)(.6) / p(e) = 0.019/p(e) • p(e) = 0.0089 + 0.01 + 0.019 = 0.0379 • P(well|e) = .23 • P(cold|e) = .26 • P(allergy|e) = .50 Example from Ray Mooney
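A small sketch that reproduces the arithmetic of this example, with the priors and conditionals taken from the previous slide (variable names are mine):

```python
# Priors and per-class feature probabilities from the slide
priors   = {"well": 0.9,  "cold": 0.05, "allergy": 0.05}
p_sneeze = {"well": 0.1,  "cold": 0.9,  "allergy": 0.9}
p_cough  = {"well": 0.1,  "cold": 0.8,  "allergy": 0.7}
p_fever  = {"well": 0.01, "cold": 0.7,  "allergy": 0.4}

# Evidence e: sneeze, cough, NO fever (so use 1 - p(fever|class))
scores = {c: priors[c] * p_sneeze[c] * p_cough[c] * (1 - p_fever[c])
          for c in priors}
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
# ~0.23 / ~0.28 / ~0.49; the slide rounds intermediate values, giving .23 / .26 / .50
print(posteriors)
```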

  10. Issues with NB • Where do we get the prior values – use maximum likelihood estimation (N_i / N) • Same for the conditionals – these are based on a multinomial generator, and the MLE estimate is T_ji / Σ_j' T_j'i • Smoothing is needed – why? (a term unseen in a class gets zero probability and wipes out the whole product) • Laplace smoothing: (T_ji + 1) / Σ_j' (T_j'i + 1) • Implementation: how to avoid floating-point underflow (sum logs instead of multiplying probabilities)
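A compact sketch putting these points together – MLE counts, Laplace smoothing, and log-space scoring to avoid underflow; the data layout and function names are illustrative assumptions, not from the slides:

```python
import math
from collections import Counter

def train(docs_by_class):
    """docs_by_class: {class: [token lists]}. Returns log priors and smoothed log conditionals."""
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    log_prior, log_cond = {}, {}
    for c, docs in docs_by_class.items():
        log_prior[c] = math.log(len(docs) / n_docs)        # MLE prior N_i / N
        counts = Counter(w for doc in docs for w in doc)   # T_ji
        total = sum(counts.values())
        # Laplace smoothing: (T_ji + 1) / (sum_j' T_j'i + |V|)
        log_cond[c] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                       for w in vocab}
    return log_prior, log_cond

def classify(tokens, log_prior, log_cond):
    # Sum logs instead of multiplying probabilities -> no floating-point underflow
    def score(c):
        return log_prior[c] + sum(log_cond[c][w] for w in tokens if w in log_cond[c])
    return max(log_prior, key=score)
```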

  11. Spam recognition Return-Path: <ig_esq@rediffmail.com> X-Sieve: CMU Sieve 2.2 From: "Ibrahim Galadima" <ig_esq@rediffmail.com> Reply-To: galadima_esq@netpiper.com To: webmaster@aclweb.org Date: Tue, 14 Jan 2003 21:06:26 -0800 Subject: Gooday DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A

  12. SpamAssassin • http://spamassassin.apache.org/ • http://spamassassin.apache.org/tests_3_1_x.html

  13. Feature selection: the χ² test • For a term t: C = class indicator, I_t = term-presence indicator • Testing for independence: P(C=0, I_t=0) should be equal to P(C=0) P(I_t=0) • P(C=0) = (k00 + k01)/n • P(C=1) = 1 − P(C=0) = (k10 + k11)/n • P(I_t=0) = (k00 + k10)/n • P(I_t=1) = 1 − P(I_t=0) = (k01 + k11)/n

  14. Feature selection: the χ² test • High values of χ² indicate lower belief in independence. • In practice, compute χ² for all words and pick the top k among them.
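A hedged sketch of the per-term computation, using the k00…k11 contingency counts defined two slides above (the helper name and the top-k line are mine):

```python
def chi_square(k00, k01, k10, k11):
    """chi^2 statistic for a 2x2 class / term-presence table.
    k00 = #(C=0, It=0), k01 = #(C=0, It=1), k10 = #(C=1, It=0), k11 = #(C=1, It=1)."""
    n = k00 + k01 + k10 + k11
    observed = [k00, k01, k10, k11]
    # Expected counts under independence: n * P(C=c) * P(It=i)
    expected = [(k00 + k01) * (k00 + k10) / n,
                (k00 + k01) * (k01 + k11) / n,
                (k10 + k11) * (k00 + k10) / n,
                (k10 + k11) * (k01 + k11) / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Feature selection: score every term and keep the k terms with the largest statistic, e.g.
# top_k = sorted(terms, key=lambda t: chi_square(*counts[t]), reverse=True)[:k]
```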

  15. Feature selection: mutual information • No document length scaling is needed • Documents are assumed to be generated according to the multinomial model • Measures amount of information: if the distribution is the same as the background distribution, then MI=0 • X = word; Y = class
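The MI formula itself is missing from the transcript; the standard definition for word X and class Y, consistent with the slide's description, is:

```latex
I(X;Y) \;=\; \sum_{x}\sum_{y} P(x, y)\,\log\frac{P(x, y)}{P(x)\,P(y)}
```

which is zero exactly when X and Y are independent, i.e. when the word's distribution in the class matches the background distribution.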

  16. Well-known datasets • 20 newsgroups • http://qwone.com/~jason/20Newsgroups/ • Reuters-21578 • http://www.daviddlewis.com/resources/testcollections/reuters21578/ • Categories: grain, acquisitions, corn, crude, wheat, trade… • WebKB • http://www-2.cs.cmu.edu/~webkb/ • Categories: course, student, faculty, staff, project, dept, other • NB performance on WebKB (2000), one value per category in the order above: • Precision = 26, 43, 18, 6, 13, 2, 94 • Recall = 83, 75, 77, 9, 73, 100, 35

  17. Evaluation of text classification • Macroaveraging – average the per-class scores over the classes • Microaveraging – compute the scores from a single pooled contingency table
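A small illustration of the difference, computing micro- and macro-averaged precision from per-class true-positive/false-positive counts (the counts are made up for the example):

```python
# Hypothetical per-class counts: (true positives, false positives)
per_class = {"grain": (50, 10), "crude": (20, 5), "trade": (2, 8)}

# Macroaveraging: compute precision per class, then average the class scores
macro_p = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)

# Microaveraging: pool the counts into one table, then compute precision once
tp_all = sum(tp for tp, _ in per_class.values())
fp_all = sum(fp for _, fp in per_class.values())
micro_p = tp_all / (tp_all + fp_all)

print(round(macro_p, 3), round(micro_p, 3))  # rare classes weigh more in the macro score
```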

  18. Readings • MRS18 • MRS17, MRS19
