These lecture slides cover search-engine-related tasks in Information Retrieval (IR) and text classification techniques such as Naïve Bayes classifiers and decision trees. Final-project deliverables are either a software system with documentation and examples or a research paper with code and data, plus a poster presentation and a web page that summarizes the project for the IR community. Authors of the most successful projects are encouraged to submit them to relevant IR-related conferences.
Information Retrieval: Search Engine Technology (5 & 6) • http://tangra.si.umich.edu/clair/ir09 • Prof. Dragomir R. Radev • radev@umich.edu
Final projects • Two formats: • A software system that performs a specific search-engine related task. We will create a web page with all such code and make it available to the IR community. • A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences. • Deliverables: • System (code + documentation + examples) or Paper (+ code, data) • Poster (to be presented in class) • Web page that describes the project.
SET/IR – W/S 2009 … 9. Text classification Naïve Bayesian classifiers Decision trees …
Introduction • Text classification: assigning documents to predefined categories: topics, languages, users • A given set of classes C • Given x, determine its class in C • Hierarchical vs. flat • Overlapping (soft) vs non-overlapping (hard)
Introduction • Ideas: manual classification using rules (e.g., Columbia AND University → Education; Columbia AND “South Carolina” → Geography) • Popular techniques: generative (knn, Naïve Bayes) vs. discriminative (SVM, regression) • Generative: model joint prob. p(x,y) and use Bayesian prediction to compute p(y|x) • Discriminative: model p(y|x) directly.
Bayes formula: P(A|B) = P(B|A) P(A) / P(B) • Full (total) probability: P(B) = Σi P(B|Ai) P(Ai)
Example (performance enhancing drug) • Drug (D) with values y/n • Test (T) with values +/- • P(D=y) = 0.001 • P(T=+|D=y) = 0.8 • P(T=+|D=n) = 0.01 • Given: athlete tests positive • P(D=y|T=+) = P(T=+|D=y)P(D=y) / (P(T=+|D=y)P(D=y) + P(T=+|D=n)P(D=n)) = (0.8 x 0.001) / (0.8 x 0.001 + 0.01 x 0.999) = 0.074
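As a quick check, here is a short Python snippet (not from the slides) that reproduces the drug-test calculation using the full-probability denominator and Bayes' formula:

```python
# Numbers from the slide: P(D=y)=0.001, P(T=+|D=y)=0.8, P(T=+|D=n)=0.01
p_d = 0.001               # prior probability of drug use
p_pos_given_d = 0.8       # P(T=+ | D=y)
p_pos_given_not_d = 0.01  # P(T=+ | D=n), the false positive rate

# Full probability of a positive test
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes formula: posterior probability of drug use given a positive test
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # 0.074
```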
Naïve Bayesian classifiers • Naïve Bayesian classifier: choose the class c* = argmaxc P(c) ∏i P(wi | c) • Assumes statistical independence of the features given the class • Features = words (or phrases) typically
Example • p(well)=0.9, p(cold)=0.05, p(allergy)=0.05 • p(sneeze|well)=0.1 • p(sneeze|cold)=0.9 • p(sneeze|allergy)=0.9 • p(cough|well)=0.1 • p(cough|cold)=0.8 • p(cough|allergy)=0.7 • p(fever|well)=0.01 • p(fever|cold)=0.7 • p(fever|allergy)=0.4 Example from Ray Mooney
Example (cont’d) • Features: sneeze, cough, no fever • P(well|e) = (.9) * (.1)(.1)(.99) / p(e) = 0.0089/p(e) • P(cold|e) = (.05) * (.9)(.8)(.3) / p(e) = 0.01/p(e) • P(allergy|e) = (.05) * (.9)(.7)(.6) / p(e) = 0.019/p(e) • P(e) = 0.0089 + 0.01 + 0.019 = 0.0379 • P(well|e) = .23 • P(cold|e) = .26 • P(allergy|e) = .50 Example from Ray Mooney
Issues with NB • Where do we get the prior values – use maximum likelihood estimation (Ni / N) • Same for the conditionals – these are based on a multinomial generator and the MLE estimate is Tji / Σj Tji • Smoothing is needed – why? (a single unseen word would otherwise zero out the whole product) • Laplace smoothing: (Tji + 1) / Σj (Tji + 1) • Implementation: how to avoid floating point underflow – work with sums of log probabilities
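To make the estimation, smoothing, and underflow points concrete, here is a minimal Python sketch (not part of the original slides) of a multinomial Naïve Bayes text classifier with Laplace smoothing and log-probability scoring; the toy spam/ham corpus is invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, class_label). Returns priors and smoothed conditionals."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)   # T_jc: count of word j in class c
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    n = len(docs)
    priors = {c: class_counts[c] / n for c in class_counts}   # MLE: N_c / N
    cond = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Laplace smoothing: (T_jc + 1) / (sum_j T_jc + |V|)
        cond[c] = {w: (word_counts[c][w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond, vocab

def classify(words, priors, cond, vocab):
    scores = {}
    for c in priors:
        # sum of log probabilities instead of a product -> no floating point underflow
        score = math.log(priors[c])
        for w in words:
            if w in vocab:
                score += math.log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus
docs = [("buy cheap pills now".split(), "spam"),
        ("meeting agenda for tomorrow".split(), "ham"),
        ("cheap pills cheap".split(), "spam"),
        ("project meeting notes".split(), "ham")]
priors, cond, vocab = train_nb(docs)
print(classify("cheap meeting pills".split(), priors, cond, vocab))
```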
Spam recognition Return-Path: <ig_esq@rediffmail.com> X-Sieve: CMU Sieve 2.2 From: "Ibrahim Galadima" <ig_esq@rediffmail.com> Reply-To: galadima_esq@netpiper.com To: webmaster@aclweb.org Date: Tue, 14 Jan 2003 21:06:26 -0800 Subject: Gooday DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
SpamAssassin • http://spamassassin.apache.org/ • http://spamassassin.apache.org/tests_3_1_x.html
Feature selection: The χ² test • For a term t: • C = class, It = indicator for the presence of term t • Testing for independence: P(C=0, It=0) should be equal to P(C=0) P(It=0) • P(C=0) = (k00 + k01)/n • P(C=1) = 1 - P(C=0) = (k10 + k11)/n • P(It=0) = (k00 + k10)/n • P(It=1) = 1 - P(It=0) = (k01 + k11)/n
Feature selection: The χ² test • High values of χ² indicate lower belief in independence. • In practice, compute χ² for all words and pick the top k among them.
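A small Python sketch (illustrative only; the counts are hypothetical) that computes the 2×2 χ² statistic from the k counts defined above:

```python
def chi_square(k00, k01, k10, k11):
    """2x2 chi-square statistic for term/class independence.
    k_ci = number of documents with class C=c and term indicator I_t=i."""
    n = k00 + k01 + k10 + k11
    observed = {(0, 0): k00, (0, 1): k01, (1, 0): k10, (1, 1): k11}
    p_c = {0: (k00 + k01) / n, 1: (k10 + k11) / n}
    p_i = {0: (k00 + k10) / n, 1: (k01 + k11) / n}
    chi2 = 0.0
    for (c, i), obs in observed.items():
        expected = n * p_c[c] * p_i[i]   # expected count under independence
        chi2 += (obs - expected) ** 2 / expected
    return chi2

# Hypothetical counts for a term vs. a class
print(chi_square(k00=4900, k01=100, k10=50, k11=150))
```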
Feature selection: mutual information • No document length scaling is needed • Documents are assumed to be generated according to the multinomial model • Measures the amount of information: if the term's distribution is the same as the background distribution, then MI = 0 • MI(X; Y) = Σx,y P(x, y) log [P(x, y) / (P(x) P(y))] • X = word; Y = class
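The same 2×2 counts can be scored with mutual information; a minimal sketch (hypothetical counts, log base 2):

```python
import math

def mutual_information(k00, k01, k10, k11):
    """MI between class C and term indicator I_t from a 2x2 count table (log base 2)."""
    n = k00 + k01 + k10 + k11
    joint = {(0, 0): k00, (0, 1): k01, (1, 0): k10, (1, 1): k11}
    p_c = {0: (k00 + k01) / n, 1: (k10 + k11) / n}
    p_i = {0: (k00 + k10) / n, 1: (k01 + k11) / n}
    mi = 0.0
    for (c, i), k in joint.items():
        if k == 0:
            continue  # 0 * log 0 is treated as 0
        p_ci = k / n
        mi += p_ci * math.log2(p_ci / (p_c[c] * p_i[i]))
    return mi  # 0 when term and class are independent

# Same hypothetical counts as the chi-square example
print(mutual_information(4900, 100, 50, 150))
```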
Well-known datasets • 20 newsgroups • http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/ • Reuters-21578 • http://www.daviddlewis.com/resources/testcollections/reuters21578/ • Categories: grain, acquisitions, corn, crude, wheat, trade… • WebKB • http://www-2.cs.cmu.edu/~webkb/ • Classes: course, student, faculty, staff, project, dept, other • NB performance (2000): per-class precision P = 26, 43, 18, 6, 13, 2, 94 and recall R = 83, 75, 77, 9, 73, 100, 35 (%)
Evaluation of text classification • Microaveraging – pools the per-class contingency tables into one table, then computes the measure once • Macroaveraging – computes the measure per class, then averages over classes
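A short sketch (not from the slides) contrasting the two averaging schemes on invented per-class counts, using precision only for brevity:

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

# Hypothetical per-class counts: (true positives, false positives)
per_class = {"grain": (20, 5), "crude": (10, 10), "trade": (2, 8)}

# Macroaveraging: compute precision per class, then average over classes
macro_p = sum(precision(tp, fp) for tp, fp in per_class.values()) / len(per_class)

# Microaveraging: pool all counts into one table, then compute precision once
tp_total = sum(tp for tp, _ in per_class.values())
fp_total = sum(fp for _, fp in per_class.values())
micro_p = precision(tp_total, fp_total)

# Macro weights every class equally; micro weights every document equally
print(round(macro_p, 3), round(micro_p, 3))   # 0.5  0.582
```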
Vector space classification • [Figure: documents from topic1 and topic2 plotted in a two-dimensional vector space (axes x1, x2)]
Decision surfaces • [Figure: a decision surface separating topic1 from topic2 in the (x1, x2) plane]
Decision trees • [Figure: the axis-parallel decision regions produced by a decision tree for topic1 and topic2 in the (x1, x2) plane]
Classification using decision trees • Expected information need: I(s1, s2, …, sm) = - Σi pi log2(pi), where pi = si / s • s = number of data samples • m = number of classes
Decision tree induction • I(s1, s2) = I(9, 5) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.940
Entropy and information gain • E(A) = Σj [(s1j + … + smj) / s] · I(s1j, …, smj) • Entropy E(A) = expected information based on the partitioning into subsets by attribute A • Gain(A) = I(s1, s2, …, sm) - E(A)
Entropy • Age <= 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971 • Age in 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0 • Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971
Entropy (cont’d) • E(age) = 5/14 I(s11, s21) + 4/14 I(s12, s22) + 5/14 I(s13, s23) = 0.694 • Gain(age) = I(s1, s2) - E(age) = 0.246 • Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
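A few lines of Python (not from the slides) that reproduce these numbers from the class counts in each age partition:

```python
import math

def entropy(counts):
    """I(s_1, ..., s_m) = -sum p_i log2 p_i, with p_i = s_i / s."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Whole training set: 9 "yes" and 5 "no" examples
i_all = entropy([9, 5])

# Partition induced by the attribute "age": <=30, 31..40, >40
partitions = [[2, 3], [4, 0], [3, 2]]
n = sum(sum(p) for p in partitions)
e_age = sum(sum(p) / n * entropy(p) for p in partitions)

print(round(i_all, 3))            # 0.94
print(round(e_age, 3))            # 0.694
print(round(i_all - e_age, 3))    # 0.247 (the slide's 0.246 comes from rounding 0.940 - 0.694)
```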
Final decision tree • [Figure: decision tree rooted at age] • age <= 30 → student? (no → no, yes → yes) • age 31..40 → yes • age > 40 → credit? (excellent → no, fair → yes)
Other techniques • Bayesian classifiers • X: age <=30, income = medium, student = yes, credit = fair • P(yes) = 9/14 = 0.643 • P(no) = 5/14 = 0.357
Example • P (age < 30 | yes) = 2/9 = 0.222P (age < 30 | no) = 3/5 = 0.600P (income = medium | yes) = 4/9 = 0.444P (income = medium | no) = 2/5 = 0.400P (student = yes | yes) = 6/9 = 0.667P (student = yes | no) = 1/5 = 0.200P (credit = fair | yes) = 6/9 = 0.667P (credit = fair | no) = 2/5 = 0.400
Example (cont’d) • P(X | yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 • P(X | no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019 • P(X | yes) P(yes) = 0.044 x 0.643 = 0.028 • P(X | no) P(no) = 0.019 x 0.357 = 0.007 • Answer: yes (0.028 > 0.007)
SET/IR – W/S 2009 … 10. Linear classifiers Kernel methods Support vector machines …
Linear boundary • [Figure: a straight line separating topic1 from topic2 in the (x1, x2) plane]
Vector space classifiers • Using centroids (Rocchio classification) • Boundary = the line (hyperplane) that is equidistant from the two centroids
Generative models: knn • Assign each test item to the class of its nearest labeled example(s) • k-nearest neighbors: take a majority vote among the k closest training examples • Very easy to program • Voronoi tessellation; nonlinear decision boundary • Issues: choosing k, b? • Demo: • http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
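A minimal kNN sketch in Python (the 2-D points and labels are invented for illustration):

```python
from collections import Counter

def knn_classify(x, labeled_points, k=3):
    """labeled_points: list of (vector, label). Assign x the majority label of its k nearest neighbors."""
    def sq_dist(point):
        return sum((a - b) ** 2 for a, b in zip(x, point[0]))
    nearest = sorted(labeled_points, key=sq_dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D term-weight vectors
points = [((1.0, 0.2), "topic1"), ((0.9, 0.1), "topic1"),
          ((0.2, 1.1), "topic2"), ((0.1, 0.9), "topic2")]
print(knn_classify((0.8, 0.3), points, k=3))   # topic1
```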
Linear separators • Two-dimensional case: the line w1x1 + w2x2 = b is the linear separator; w1x1 + w2x2 > b for the positive class • In n-dimensional spaces: the hyperplane w·x = Σi wixi = b
Example 1 • [Figure: topic1 and topic2 points in the (x1, x2) plane]
Example 2 • Classifier for “interest” in Reuters-21578 • b = 0 • If the document is “rate discount dlr world”, its score will be 0.67*1 + 0.46*1 + (-0.71)*1 + (-0.35)*1 = 0.07 > 0 Example from MSR
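The same score can be computed in a couple of lines; the term weights below are the ones from the example above, and everything else is illustrative:

```python
# Term weights for the "interest" classifier from the slide; threshold b = 0
weights = {"rate": 0.67, "discount": 0.46, "dlr": -0.71, "world": -0.35}
b = 0.0

def score(doc_terms):
    """w . x with unit term counts; compared against the threshold b."""
    return sum(weights.get(t, 0.0) for t in doc_terms)

doc = "rate discount dlr world".split()
s = score(doc)
print(round(s, 2), "->", "interest" if s > b else "not interest")   # 0.07 -> interest
```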
Example: perceptron algorithm • Input: training examples (xi, yi) with yi in {-1, +1}, learning rate η • Algorithm: initialize w = 0; repeat until no example is misclassified: for each (xi, yi), if yi (w·xi) <= 0 then w ← w + η yi xi • Output: weight vector w
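A minimal Python implementation of this update rule (the toy data is invented; eta and max_epochs are illustrative parameters):

```python
def perceptron_train(examples, eta=1.0, max_epochs=100):
    """examples: list of (feature vector, label in {-1, +1}). Returns weights and bias."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:                 # misclassified (or on the boundary)
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                mistakes += 1
        if mistakes == 0:                           # converged: data is linearly separated
            break
    return w, b

# Hypothetical linearly separable toy data
data = [((1.0, 0.1), +1), ((0.9, 0.3), +1), ((0.1, 1.0), -1), ((0.2, 0.8), -1)]
w, b = perceptron_train(data)
print(w, b)
```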
Linear classifiers • What is the major shortcoming of a perceptron? • How to determine the dimensionality of the separator? • Bias-variance tradeoff (example) • How to deal with multiple classes? • Any-of: build multiple classifiers, one per class • One-of: harder (as J hyperplanes do not divide R^M into J regions); instead: use class complements and scoring
Support vector machines • Introduced by Vapnik in the early 90s. • Large-margin classifiers: choose the separating hyperplane that maximizes the margin to the closest training points (the support vectors).
Issues with SVM • Soft margins (slack variables to handle inseparable data) • Kernels – non-linearity
The kernel idea • [Figure: the same data before and after mapping into a higher-dimensional feature space, where it becomes linearly separable]
Example (mapping to a higher-dimensional space): φ(x1, x2) = (x1², √2 x1x2, x2²), so that φ(x)·φ(z) = (x·z)²
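A small numeric check (not from the slides) that the explicit mapping φ and the polynomial kernel (x·z)² agree:

```python
import math

def phi(x):
    """Explicit map to a 3-D feature space: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, 0.5)
print(dot(phi(x), phi(z)))   # dot product computed explicitly in the mapped space
print(dot(x, z) ** 2)        # polynomial kernel (x . z)^2: same value up to floating-point rounding
```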
The kernel trick • Polynomial kernel: K(x, z) = (x·z + c)^d • Sigmoid kernel: K(x, z) = tanh(κ x·z + θ) • RBF kernel: K(x, z) = exp(-||x - z||² / (2σ²)) • Many other kernels are useful for IR: e.g., string kernels, subsequence kernels, tree kernels, etc.
SVM (Cont’d) • Evaluation: • SVM > knn > decision tree > NB • Implementation • Quadratic optimization • Use toolkit (e.g., Thorsten Joachims’s svmlight)
Semi-supervised learning • EM • Co-training • Graph-based
Exploiting Hyperlinks – Co-training • Each document instance has two alternative views (Blum and Mitchell 1998) • terms in the document, x1 • terms in the hyperlinks that point to the document, x2 • Each view alone is sufficient to determine the class of the instance • The labeling function that classifies examples is the same whether applied to x1 or x2 • x1 and x2 are conditionally independent, given the class [Slide from Pierre Baldi]
Co-training Algorithm • Labeled data are used to infer two Naïve Bayes classifiers, one for each view • Each classifier will • examine unlabeled data • pick the most confidently predicted positive and negative examples • add these to the labeled examples • Classifiers are now retrained on the augmented set of labeled examples [Slide from Pierre Baldi]
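A structural sketch of this loop (not from the original slides): the train_nb helper and the classifiers' predict_proba interface are hypothetical placeholders, and for brevity the sketch keeps the most confident predictions overall rather than separate positive and negative picks:

```python
def co_train(labeled, unlabeled, train_nb, rounds=10, pool=5):
    """labeled: list of ((x1_view, x2_view), label); unlabeled: list of (x1_view, x2_view).
    train_nb(examples, view) -> classifier whose predict_proba(x) returns (label, confidence)."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # One Naive Bayes classifier per view, trained on the current labeled set
        clf1 = train_nb(labeled, view=0)
        clf2 = train_nb(labeled, view=1)
        newly_labeled = []
        for clf in (clf1, clf2):
            # Each classifier labels the unlabeled pool and keeps its most confident predictions
            scored = sorted(((clf.predict_proba(x), x) for x in unlabeled),
                            key=lambda item: item[0][1], reverse=True)
            for (label, _conf), x in scored[:pool]:
                newly_labeled.append((x, label))
        # Move the newly labeled examples into the labeled set and retrain next round
        for x, label in newly_labeled:
            if x in unlabeled:
                unlabeled.remove(x)
                labeled.append((x, label))
    return labeled
```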