These lecture slides cover search-engine-related tasks in Information Retrieval (IR) and text classification techniques such as Naïve Bayes classifiers and decision trees. Final-project deliverables are either a software system with documentation and examples or a research paper with code and data, plus a poster presentation and a web page that summarizes the project for the IR community. Authors of the most successful projects are encouraged to submit them to relevant IR-related conferences.
Information Retrieval: Search Engine Technology (5 & 6) • http://tangra.si.umich.edu/clair/ir09 • Prof. Dragomir R. Radev • radev@umich.edu
Final projects • Two formats: • A software system that performs a specific search-engine related task. We will create a web page with all such code and make it available to the IR community. • A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences. • Deliverables: • System (code + documentation + examples) or Paper (+ code, data) • Poster (to be presented in class) • Web page that describes the project.
SET/IR – W/S 2009 … 9. Text classification Naïve Bayesian classifiers Decision trees …
Introduction • Text classification: assigning documents to predefined categories: topics, languages, users • A given set of classes C • Given x, determine its class in C • Hierarchical vs. flat • Overlapping (soft) vs non-overlapping (hard)
Introduction • Ideas: manual classification using rules (e.g., Columbia AND University → Education; Columbia AND “South Carolina” → Geography) • Popular techniques: generative (knn, Naïve Bayes) vs. discriminative (SVM, regression) • Generative: model joint prob. p(x,y) and use Bayesian prediction to compute p(y|x) • Discriminative: model p(y|x) directly.
Bayes formula: P(A|B) = P(B|A) P(A) / P(B) • Full (total) probability: P(B) = Σi P(B|Ai) P(Ai)
Example (performance enhancing drug) • Drug (D) with values y/n • Test (T) with values +/- • P(D=y) = 0.001 • P(T=+|D=y) = 0.8 • P(T=+|D=n) = 0.01 • Given: athlete tests positive • P(D=y|T=+) = P(T=+|D=y)P(D=y) / (P(T=+|D=y)P(D=y) + P(T=+|D=n)P(D=n)) = (0.8 x 0.001) / (0.8 x 0.001 + 0.01 x 0.999) = 0.074
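As a quick check, here is a short Python snippet (not from the slides) that reproduces the drug-test calculation using the full-probability denominator and Bayes' formula:

```python
# Numbers from the slide: P(D=y)=0.001, P(T=+|D=y)=0.8, P(T=+|D=n)=0.01
p_d = 0.001               # prior probability of drug use
p_pos_given_d = 0.8       # P(T=+ | D=y)
p_pos_given_not_d = 0.01  # P(T=+ | D=n), the false positive rate

# Full probability of a positive test
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes formula: posterior probability of drug use given a positive test
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # 0.074
```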
Naïve Bayesian classifiers • Naïve Bayesian classifier: choose the class c* = argmaxc P(c) ∏i P(wi | c) • Assumes statistical independence of the features given the class • Features = words (or phrases) typically
Example • p(well)=0.9, p(cold)=0.05, p(allergy)=0.05 • p(sneeze|well)=0.1 • p(sneeze|cold)=0.9 • p(sneeze|allergy)=0.9 • p(cough|well)=0.1 • p(cough|cold)=0.8 • p(cough|allergy)=0.7 • p(fever|well)=0.01 • p(fever|cold)=0.7 • p(fever|allergy)=0.4 Example from Ray Mooney
Example (cont’d) • Features: sneeze, cough, no fever • P(well|e) = (.9) * (.1)(.1)(.99) / p(e) = 0.0089/p(e) • P(cold|e) = (.05) * (.9)(.8)(.3) / p(e) = 0.01/p(e) • P(allergy|e) = (.05) * (.9)(.7)(.6) / p(e) = 0.019/p(e) • P(e) = 0.0089 + 0.01 + 0.019 = 0.0379 • P(well|e) = .23 • P(cold|e) = .26 • P(allergy|e) = .50 Example from Ray Mooney
Issues with NB • Where do we get the prior values – use maximum likelihood estimation (Ni / N) • Same for the conditionals – these are based on a multinomial generator and the MLE estimate is Tji / Σj Tji • Smoothing is needed – why? (a single unseen word would otherwise zero out the whole product) • Laplace smoothing: (Tji + 1) / Σj (Tji + 1) • Implementation: how to avoid floating point underflow – work with sums of log probabilities
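To make the estimation, smoothing, and underflow points concrete, here is a minimal Python sketch (not part of the original slides) of a multinomial Naïve Bayes text classifier with Laplace smoothing and log-probability scoring; the toy spam/ham corpus is invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, class_label). Returns priors and smoothed conditionals."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)   # T_jc: count of word j in class c
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    n = len(docs)
    priors = {c: class_counts[c] / n for c in class_counts}   # MLE: N_c / N
    cond = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Laplace smoothing: (T_jc + 1) / (sum_j T_jc + |V|)
        cond[c] = {w: (word_counts[c][w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond, vocab

def classify(words, priors, cond, vocab):
    scores = {}
    for c in priors:
        # sum of log probabilities instead of a product -> no floating point underflow
        score = math.log(priors[c])
        for w in words:
            if w in vocab:
                score += math.log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus
docs = [("buy cheap pills now".split(), "spam"),
        ("meeting agenda for tomorrow".split(), "ham"),
        ("cheap pills cheap".split(), "spam"),
        ("project meeting notes".split(), "ham")]
priors, cond, vocab = train_nb(docs)
print(classify("cheap meeting pills".split(), priors, cond, vocab))
```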
Spam recognition Return-Path: <ig_esq@rediffmail.com> X-Sieve: CMU Sieve 2.2 From: "Ibrahim Galadima" <ig_esq@rediffmail.com> Reply-To: galadima_esq@netpiper.com To: webmaster@aclweb.org Date: Tue, 14 Jan 2003 21:06:26 -0800 Subject: Gooday DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
SpamAssassin • http://spamassassin.apache.org/ • http://spamassassin.apache.org/tests_3_1_x.html
Feature selection: The χ² test • For a term t: • C = class, It = indicator for the presence of term t • Testing for independence: P(C=0, It=0) should be equal to P(C=0) P(It=0) • P(C=0) = (k00 + k01)/n • P(C=1) = 1 - P(C=0) = (k10 + k11)/n • P(It=0) = (k00 + k10)/n • P(It=1) = 1 - P(It=0) = (k01 + k11)/n
Feature selection: The χ² test • High values of χ² indicate lower belief in independence. • In practice, compute χ² for all words and pick the top k among them.
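A small Python sketch (illustrative only; the counts are hypothetical) that computes the 2×2 χ² statistic from the k counts defined above:

```python
def chi_square(k00, k01, k10, k11):
    """2x2 chi-square statistic for term/class independence.
    k_ci = number of documents with class C=c and term indicator I_t=i."""
    n = k00 + k01 + k10 + k11
    observed = {(0, 0): k00, (0, 1): k01, (1, 0): k10, (1, 1): k11}
    p_c = {0: (k00 + k01) / n, 1: (k10 + k11) / n}
    p_i = {0: (k00 + k10) / n, 1: (k01 + k11) / n}
    chi2 = 0.0
    for (c, i), obs in observed.items():
        expected = n * p_c[c] * p_i[i]   # expected count under independence
        chi2 += (obs - expected) ** 2 / expected
    return chi2

# Hypothetical counts for a term vs. a class
print(chi_square(k00=4900, k01=100, k10=50, k11=150))
```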
Feature selection: mutual information • No document length scaling is needed • Documents are assumed to be generated according to the multinomial model • Measures the amount of information: if the term's distribution is the same as the background distribution, then MI = 0 • MI(X; Y) = Σx,y P(x, y) log [P(x, y) / (P(x) P(y))] • X = word; Y = class
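The same 2×2 counts can be scored with mutual information; a minimal sketch (hypothetical counts, log base 2):

```python
import math

def mutual_information(k00, k01, k10, k11):
    """MI between class C and term indicator I_t from a 2x2 count table (log base 2)."""
    n = k00 + k01 + k10 + k11
    joint = {(0, 0): k00, (0, 1): k01, (1, 0): k10, (1, 1): k11}
    p_c = {0: (k00 + k01) / n, 1: (k10 + k11) / n}
    p_i = {0: (k00 + k10) / n, 1: (k01 + k11) / n}
    mi = 0.0
    for (c, i), k in joint.items():
        if k == 0:
            continue  # 0 * log 0 is treated as 0
        p_ci = k / n
        mi += p_ci * math.log2(p_ci / (p_c[c] * p_i[i]))
    return mi  # 0 when term and class are independent

# Same hypothetical counts as the chi-square example
print(mutual_information(4900, 100, 50, 150))
```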
Well-known datasets • 20 newsgroups • http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/ • Reuters-21578 • http://www.daviddlewis.com/resources/testcollections/reuters21578/ • Categories: grain, acquisitions, corn, crude, wheat, trade… • WebKB • http://www-2.cs.cmu.edu/~webkb/ • Classes: course, student, faculty, staff, project, dept, other • NB performance (2000): per-class precision P = 26, 43, 18, 6, 13, 2, 94 and recall R = 83, 75, 77, 9, 73, 100, 35 (%)
Evaluation of text classification • Microaveraging – pools the per-class contingency tables into one table, then computes the measure once • Macroaveraging – computes the measure per class, then averages over classes
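A short sketch (not from the slides) contrasting the two averaging schemes on invented per-class counts, using precision only for brevity:

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

# Hypothetical per-class counts: (true positives, false positives)
per_class = {"grain": (20, 5), "crude": (10, 10), "trade": (2, 8)}

# Macroaveraging: compute precision per class, then average over classes
macro_p = sum(precision(tp, fp) for tp, fp in per_class.values()) / len(per_class)

# Microaveraging: pool all counts into one table, then compute precision once
tp_total = sum(tp for tp, _ in per_class.values())
fp_total = sum(fp for _, fp in per_class.values())
micro_p = precision(tp_total, fp_total)

# Macro weights every class equally; micro weights every document equally
print(round(macro_p, 3), round(micro_p, 3))   # 0.5  0.582
```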
Vector space classification • [Figure: documents from topic1 and topic2 plotted in a two-dimensional vector space (axes x1, x2)]
Decision surfaces • [Figure: a decision surface separating topic1 from topic2 in the (x1, x2) plane]
Decision trees • [Figure: the axis-parallel decision regions produced by a decision tree for topic1 and topic2 in the (x1, x2) plane]
Classification using decision trees • Expected information need: I(s1, s2, …, sm) = - Σi pi log2(pi), where pi = si / s • s = number of data samples • m = number of classes
Decision tree induction • I(s1, s2) = I(9, 5) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.940
Entropy and information gain • E(A) = Σj [(s1j + … + smj) / s] · I(s1j, …, smj) • Entropy E(A) = expected information based on the partitioning into subsets by attribute A • Gain(A) = I(s1, s2, …, sm) - E(A)
Entropy • Age <= 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971 • Age in 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0 • Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971
Entropy (cont’d) • E(age) = 5/14 I(s11, s21) + 4/14 I(s12, s22) + 5/14 I(s13, s23) = 0.694 • Gain(age) = I(s1, s2) - E(age) = 0.246 • Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
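A few lines of Python (not from the slides) that reproduce these numbers from the class counts in each age partition:

```python
import math

def entropy(counts):
    """I(s_1, ..., s_m) = -sum p_i log2 p_i, with p_i = s_i / s."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Whole training set: 9 "yes" and 5 "no" examples
i_all = entropy([9, 5])

# Partition induced by the attribute "age": <=30, 31..40, >40
partitions = [[2, 3], [4, 0], [3, 2]]
n = sum(sum(p) for p in partitions)
e_age = sum(sum(p) / n * entropy(p) for p in partitions)

print(round(i_all, 3))            # 0.94
print(round(e_age, 3))            # 0.694
print(round(i_all - e_age, 3))    # 0.247 (the slide's 0.246 comes from rounding 0.940 - 0.694)
```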
Final decision tree • [Figure: decision tree rooted at age] • age <= 30 → student? (no → no, yes → yes) • age 31..40 → yes • age > 40 → credit? (excellent → no, fair → yes)
Other techniques • Bayesian classifiers • X: age <=30, income = medium, student = yes, credit = fair • P(yes) = 9/14 = 0.643 • P(no) = 5/14 = 0.357
Example • P (age < 30 | yes) = 2/9 = 0.222P (age < 30 | no) = 3/5 = 0.600P (income = medium | yes) = 4/9 = 0.444P (income = medium | no) = 2/5 = 0.400P (student = yes | yes) = 6/9 = 0.667P (student = yes | no) = 1/5 = 0.200P (credit = fair | yes) = 6/9 = 0.667P (credit = fair | no) = 2/5 = 0.400
Example (cont’d) • P(X | yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 • P(X | no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019 • P(X | yes) P(yes) = 0.044 x 0.643 = 0.028 • P(X | no) P(no) = 0.019 x 0.357 = 0.007 • Answer: yes (0.028 > 0.007)
SET/IR – W/S 2009 … 10. Linear classifiers Kernel methods Support vector machines …
Linear boundary • [Figure: a straight line separating topic1 from topic2 in the (x1, x2) plane]
Vector space classifiers • Using centroids (Rocchio classification) • Boundary = the line (hyperplane) that is equidistant from the two centroids
Generative models: knn • Assign each test item to the class of its nearest labeled example(s) • k-nearest neighbors: take a majority vote among the k closest training examples • Very easy to program • Voronoi tessellation; nonlinear decision boundary • Issues: choosing k, b? • Demo: • http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
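A minimal kNN sketch in Python (the 2-D points and labels are invented for illustration):

```python
from collections import Counter

def knn_classify(x, labeled_points, k=3):
    """labeled_points: list of (vector, label). Assign x the majority label of its k nearest neighbors."""
    def sq_dist(point):
        return sum((a - b) ** 2 for a, b in zip(x, point[0]))
    nearest = sorted(labeled_points, key=sq_dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D term-weight vectors
points = [((1.0, 0.2), "topic1"), ((0.9, 0.1), "topic1"),
          ((0.2, 1.1), "topic2"), ((0.1, 0.9), "topic2")]
print(knn_classify((0.8, 0.3), points, k=3))   # topic1
```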
Linear separators • Two-dimensional case: the line w1x1 + w2x2 = b is the linear separator; w1x1 + w2x2 > b for the positive class • In n-dimensional spaces: the hyperplane w·x = Σi wixi = b
Example 1 • [Figure: topic1 and topic2 points in the (x1, x2) plane]
Example 2 • Classifier for “interest” in Reuters-21578 • b = 0 • If the document is “rate discount dlr world”, its score will be 0.67*1 + 0.46*1 + (-0.71)*1 + (-0.35)*1 = 0.07 > 0 Example from MSR
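The same score can be computed in a couple of lines; the term weights below are the ones from the example above, and everything else is illustrative:

```python
# Term weights for the "interest" classifier from the slide; threshold b = 0
weights = {"rate": 0.67, "discount": 0.46, "dlr": -0.71, "world": -0.35}
b = 0.0

def score(doc_terms):
    """w . x with unit term counts; compared against the threshold b."""
    return sum(weights.get(t, 0.0) for t in doc_terms)

doc = "rate discount dlr world".split()
s = score(doc)
print(round(s, 2), "->", "interest" if s > b else "not interest")   # 0.07 -> interest
```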
Example: perceptron algorithm • Input: training examples (xi, yi) with yi in {-1, +1}, learning rate η • Algorithm: initialize w = 0; repeat until no example is misclassified: for each (xi, yi), if yi (w·xi) <= 0 then w ← w + η yi xi • Output: weight vector w
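A minimal Python implementation of this update rule (the toy data is invented; eta and max_epochs are illustrative parameters):

```python
def perceptron_train(examples, eta=1.0, max_epochs=100):
    """examples: list of (feature vector, label in {-1, +1}). Returns weights and bias."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:                 # misclassified (or on the boundary)
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                mistakes += 1
        if mistakes == 0:                           # converged: data is linearly separated
            break
    return w, b

# Hypothetical linearly separable toy data
data = [((1.0, 0.1), +1), ((0.9, 0.3), +1), ((0.1, 1.0), -1), ((0.2, 0.8), -1)]
w, b = perceptron_train(data)
print(w, b)
```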
Linear classifiers • What is the major shortcoming of a perceptron? • How to determine the dimensionality of the separator? • Bias-variance tradeoff (example) • How to deal with multiple classes? • Any-of: build multiple classifiers, one per class • One-of: harder (as J hyperplanes do not divide R^M into J regions); instead: use class complements and scoring
Support vector machines • Introduced by Vapnik in the early 90s. • Large-margin classifiers: choose the separating hyperplane that maximizes the margin to the closest training points (the support vectors).
Issues with SVM • Soft margins (slack variables to handle inseparable data) • Kernels – non-linearity
The kernel idea • [Figure: the same data before and after mapping into a higher-dimensional feature space, where it becomes linearly separable]
Example (mapping to a higher-dimensional space): φ(x1, x2) = (x1², √2 x1x2, x2²), so that φ(x)·φ(z) = (x·z)²
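A small numeric check (not from the slides) that the explicit mapping φ and the polynomial kernel (x·z)² agree:

```python
import math

def phi(x):
    """Explicit map to a 3-D feature space: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, 0.5)
print(dot(phi(x), phi(z)))   # dot product computed explicitly in the mapped space
print(dot(x, z) ** 2)        # polynomial kernel (x . z)^2: same value up to floating-point rounding
```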
The kernel trick • Polynomial kernel: K(x, z) = (x·z + c)^d • Sigmoid kernel: K(x, z) = tanh(κ x·z + θ) • RBF kernel: K(x, z) = exp(-||x - z||² / (2σ²)) • Many other kernels are useful for IR: e.g., string kernels, subsequence kernels, tree kernels, etc.
SVM (Cont’d) • Evaluation: • SVM > knn > decision tree > NB • Implementation • Quadratic optimization • Use toolkit (e.g., Thorsten Joachims’s svmlight)
Semi-supervised learning • EM • Co-training • Graph-based
Exploiting Hyperlinks – Co-training • Each document instance has two alternative views (Blum and Mitchell 1998) • terms in the document, x1 • terms in the hyperlinks that point to the document, x2 • Each view alone is sufficient to determine the class of the instance • The labeling function that classifies examples is the same whether applied to x1 or x2 • x1 and x2 are conditionally independent, given the class [Slide from Pierre Baldi]
Co-training Algorithm • Labeled data are used to infer two Naïve Bayes classifiers, one for each view • Each classifier will • examine unlabeled data • pick the most confidently predicted positive and negative examples • add these to the labeled examples • Classifiers are now retrained on the augmented set of labeled examples [Slide from Pierre Baldi]
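A structural sketch of this loop (not from the original slides): the train_nb helper and the classifiers' predict_proba interface are hypothetical placeholders, and for brevity the sketch keeps the most confident predictions overall rather than separate positive and negative picks:

```python
def co_train(labeled, unlabeled, train_nb, rounds=10, pool=5):
    """labeled: list of ((x1_view, x2_view), label); unlabeled: list of (x1_view, x2_view).
    train_nb(examples, view) -> classifier whose predict_proba(x) returns (label, confidence)."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # One Naive Bayes classifier per view, trained on the current labeled set
        clf1 = train_nb(labeled, view=0)
        clf2 = train_nb(labeled, view=1)
        newly_labeled = []
        for clf in (clf1, clf2):
            # Each classifier labels the unlabeled pool and keeps its most confident predictions
            scored = sorted(((clf.predict_proba(x), x) for x in unlabeled),
                            key=lambda item: item[0][1], reverse=True)
            for (label, _conf), x in scored[:pool]:
                newly_labeled.append((x, label))
        # Move the newly labeled examples into the labeled set and retrain next round
        for x, label in newly_labeled:
            if x in unlabeled:
                unlabeled.remove(x)
                labeled.append((x, label))
    return labeled
```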