SET (6) Prof. Dragomir R. Radev radev@umich.edu
Vector space classification [figure: documents from topic1 and topic2 plotted in a two-dimensional feature space with axes x1 and x2]
Decision surfaces [figure: a decision surface separating topic1 from topic2 in the (x1, x2) plane]
Decision trees [figure: the partition of the (x1, x2) plane into topic1 and topic2 regions produced by a decision tree]
Classification using decision trees • Expected information needed to classify a sample: I(s1, s2, …, sm) = − Σi=1..m pi log2(pi), where pi = si / s • s = number of data samples (si = samples in class i) • m = number of classes
Decision tree induction • I(s1, s2) = I(9, 5) = −9/14 log2(9/14) − 5/14 log2(5/14) = 0.940
Entropy and information gain • E(A) = Σj [(s1j + … + smj) / s] · I(s1j, …, smj) • Entropy E(A) = expected information based on the partitioning into subsets by attribute A • Gain(A) = I(s1, s2, …, sm) − E(A)
Entropy • Age <= 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971 • Age in 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0 • Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971
Entropy (cont’d) • E(age) = 5/14 I(s11, s21) + 4/14 I(s12, s22) + 5/14 I(s13, s23) = 0.694 • Gain(age) = I(s1, s2) − E(age) = 0.246 • Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
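For concreteness, the entropy and gain numbers above can be reproduced with a short Python sketch (the class counts per age bucket are taken from the slides; the code itself is generic):

```python
import math

def info(*counts):
    """Expected information I(s1, ..., sm) = -sum(pi * log2(pi)), pi = si / s."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Class counts from the slides: 9 "yes" and 5 "no" samples overall.
i_all = info(9, 5)                                   # 0.940

# Counts (yes, no) per value of the "age" attribute: <=30, 31..40, >40.
age_partitions = [(2, 3), (4, 0), (3, 2)]
n = 14
e_age = sum((s1 + s2) / n * info(s1, s2) for s1, s2 in age_partitions)   # 0.694
gain_age = i_all - e_age                             # 0.246

print(round(i_all, 3), round(e_age, 3), round(gain_age, 3))
```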
Final decision tree [figure: the root tests age; 31..40 → yes; <= 30 → test student (no → no, yes → yes); > 40 → test credit (excellent → no, fair → yes)]
Other techniques • Bayesian classifiers • Test sample X: age <= 30, income = medium, student = yes, credit = fair • P(yes) = 9/14 = 0.643 • P(no) = 5/14 = 0.357
Example • P(age <= 30 | yes) = 2/9 = 0.222 • P(age <= 30 | no) = 3/5 = 0.600 • P(income = medium | yes) = 4/9 = 0.444 • P(income = medium | no) = 2/5 = 0.400 • P(student = yes | yes) = 6/9 = 0.667 • P(student = yes | no) = 1/5 = 0.200 • P(credit = fair | yes) = 6/9 = 0.667 • P(credit = fair | no) = 2/5 = 0.400
Example (cont’d) • P (X | yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 • P (X | no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019 • P (X | yes) P (yes) = 0.044 x 0.643 = 0.028 • P (X | no) P (no) = 0.019 x 0.357 = 0.007 • Answer: yes/no?
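The same naive Bayes calculation as a short Python sketch, plugging in the conditional probabilities and priors from the slides:

```python
# Conditional probabilities for test sample X, taken from the slides.
p_x_given_yes = 0.222 * 0.444 * 0.667 * 0.667   # ~0.044
p_x_given_no  = 0.600 * 0.400 * 0.200 * 0.400   # ~0.019

p_yes, p_no = 9 / 14, 5 / 14                    # class priors

score_yes = p_x_given_yes * p_yes               # ~0.028
score_no  = p_x_given_no * p_no                 # ~0.007

# Naive Bayes picks the class with the larger unnormalized posterior.
print("yes" if score_yes > score_no else "no")  # -> "yes"
```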
SET Fall 2013 … 10. Linear classifiers • Kernel methods • Support vector machines • …
Linear boundary [figure: a linear boundary separating topic1 from topic2 in the (x1, x2) plane]
Vector space classifiers • Using centroids • Boundary = the line (hyperplane) equidistant from the two class centroids
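A minimal sketch of centroid-based classification as described above (the 2-D document vectors are invented purely for illustration):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(doc, centroid1, centroid2):
    """Assign the document to the topic whose centroid is closer.
    The decision boundary is exactly the set of points equidistant
    from the two centroids."""
    d1, d2 = math.dist(doc, centroid1), math.dist(doc, centroid2)
    return "topic1" if d1 <= d2 else "topic2"

# Hypothetical 2-D document vectors.
topic1_docs = [[1.0, 0.2], [0.9, 0.4]]
topic2_docs = [[0.1, 1.0], [0.3, 0.8]]
print(classify([0.8, 0.3], centroid(topic1_docs), centroid(topic2_docs)))  # -> topic1
```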
Generative models: kNN • Assign each element to the class of its nearest labeled neighbors • k-nearest neighbors • Very easy to program • Voronoi tessellation; nonlinear decision boundaries • Issues: choosing k, b? • Demo: • http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
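A minimal kNN sketch matching the description above (the 2-D points are invented for illustration):

```python
import math
from collections import Counter

def knn_classify(query, labeled_points, k=3):
    """Classify `query` by majority vote among its k nearest labeled points.
    `labeled_points` is a list of (vector, label) pairs."""
    neighbors = sorted(labeled_points, key=lambda p: math.dist(query, p[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [([1.0, 0.1], "topic1"), ([0.9, 0.3], "topic1"),
         ([0.2, 0.9], "topic2"), ([0.1, 1.1], "topic2")]
print(knn_classify([0.8, 0.2], train, k=3))   # -> topic1
```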
Linear separators • In two dimensions the separator is a line: w1x1 + w2x2 = b, with w1x1 + w2x2 > b for the positive class • In n-dimensional spaces the separator is a hyperplane: w·x = w1x1 + … + wnxn = b, with w·x > b for the positive class
Example 1 [figure: a linear separator between topic1 and topic2 in the (x1, x2) plane]
Example 2 • Classifier for "interest" in Reuters-21578 • b = 0 • If the document is "rate discount dlr world", its score is 0.67·1 + 0.46·1 + (−0.71)·1 + (−0.35)·1 = 0.07 > 0 • Example from MSR
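The scoring above as a tiny Python sketch; the weight dictionary holds only the four term weights visible in the example, so treat it as illustrative rather than the full "interest" classifier:

```python
# Term weights for the "interest" classifier, as given in the example.
weights = {"rate": 0.67, "discount": 0.46, "dlr": -0.71, "world": -0.35}
b = 0.0

def score(doc_terms):
    """Linear classifier: sum of the weights of terms present in the document."""
    return sum(weights.get(t, 0.0) for t in doc_terms)

doc = "rate discount dlr world".split()
s = score(doc)
print(round(s, 2), "positive" if s > b else "negative")   # 0.07 positive
```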
Example: perceptron algorithm [algorithm shown as a figure: input (training examples), update loop, output (weight vector)]
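A minimal sketch of the standard perceptron update rule in the w·x > b form used above (the textbook algorithm, not necessarily the exact pseudocode from the original slide; the training data is invented):

```python
def perceptron_train(examples, n_features, epochs=10, lr=1.0):
    """Learn a separator w.x > b from (vector, label) pairs with labels +1/-1.
    On each misclassified example, move the weights toward that example."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) - b
            if y * activation <= 0:                       # mistake: update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b -= lr * y
    return w, b

data = [([1.0, 0.1], +1), ([0.9, 0.3], +1), ([0.1, 0.9], -1), ([0.2, 1.0], -1)]
w, b = perceptron_train(data, n_features=2)
print(w, b)
```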
Linear classifiers • What is the major shortcoming of the perceptron? • How to determine the dimensionality of the separator? • Bias-variance tradeoff (example) • How to deal with multiple classes? • Any-of: build a separate classifier for each class • One-of: harder (J hyperplanes do not divide R^M into J regions); instead, use class complements and scoring
Support vector machines • Introduced by Vapnik in the early 90s.
Issues with SVM • Soft margins (to handle inseparable data) • Kernels (to handle non-linearity)
The kernel idea [figure: the same data before and after mapping to a higher-dimensional feature space, where it becomes linearly separable]
Example (mapping to a higher-dimensional space)
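A standard illustration of such a mapping (an assumption; not necessarily the example shown on the original slide): phi(x1, x2) = (x1^2, sqrt(2)·x1·x2, x2^2) maps 2-D points into 3-D. A circular boundary in the original space becomes a linear separator in the mapped space, and inner products satisfy phi(x)·phi(z) = (x·z)^2, which is exactly the degree-2 polynomial kernel.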
The kernel trick • Polynomial kernel: K(x, z) = (x·z + c)^d • Sigmoid kernel: K(x, z) = tanh(κ x·z + c) • RBF (Gaussian) kernel: K(x, z) = exp(−||x − z||² / (2σ²)) • Many other kernels are useful for IR: e.g., string kernels, subsequence kernels, tree kernels, etc.
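A small numeric check of the kernel trick for the degree-2 polynomial kernel (the vectors are invented; the point is that the kernel value computed in the original space equals the inner product after the explicit mapping):

```python
import math

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel without bias term: K(x, z) = (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

x, z = [1.0, 2.0], [3.0, 0.5]
lhs = sum(a * b for a, b in zip(phi(x), phi(z)))   # inner product after mapping
rhs = poly2_kernel(x, z)                           # kernel in the original space
print(round(lhs, 6), round(rhs, 6))                # 16.0 16.0
```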
SVM (cont’d) • Evaluation: SVM > kNN > decision tree > NB • Implementation: quadratic optimization • Use a toolkit (e.g., Thorsten Joachims’ SVMlight)
Semi-supervised learning • EM • Co-training • Graph-based
Exploiting Hyperlinks – Co-training • Each document instance has two alternate views (Blum and Mitchell 1998): • the terms in the document, x1 • the terms in the hyperlinks that point to the document, x2 • Each view alone is sufficient to determine the class of the instance • The labeling function that classifies examples is the same whether applied to x1 or x2 • x1 and x2 are conditionally independent, given the class [Slide from Pierre Baldi]
Co-training algorithm • Labeled data are used to train two Naïve Bayes classifiers, one for each view • Each classifier then: examines the unlabeled data, picks its most confidently predicted positive and negative examples, and adds them to the labeled examples • Both classifiers are retrained on the augmented set of labeled examples, and the process repeats [Slide from Pierre Baldi]
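A compact sketch of this loop, assuming scikit-learn's MultinomialNB and two precomputed feature-view matrices X1 and X2 (an illustrative implementation, not one given in the course):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1, X2, y, labeled_idx, unlabeled_idx, rounds=10):
    """Co-training sketch: X1, X2 are the two feature views (rows = documents),
    y gives 0/1 labels for labeled_idx (assumed to contain both classes).
    Each round, each view's classifier adds its most confident positive and
    negative unlabeled example to the labeled set, then both are retrained."""
    labeled = list(labeled_idx)
    unlabeled = list(unlabeled_idx)
    labels = {i: lab for i, lab in zip(labeled, y)}
    for _ in range(rounds):
        clf1 = MultinomialNB().fit(X1[labeled], [labels[i] for i in labeled])
        clf2 = MultinomialNB().fit(X2[labeled], [labels[i] for i in labeled])
        for clf, X in ((clf1, X1), (clf2, X2)):
            if not unlabeled:
                break
            probs = clf.predict_proba(X[unlabeled])      # columns follow clf.classes_ = [0, 1]
            picks = [(unlabeled[int(np.argmax(probs[:, 1]))], 1),   # most confident positive
                     (unlabeled[int(np.argmax(probs[:, 0]))], 0)]   # most confident negative
            for idx, lab in picks:
                if idx in unlabeled:
                    labels[idx] = lab
                    labeled.append(idx)
                    unlabeled.remove(idx)
    return clf1, clf2
```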
Conclusion • SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters • NB is also good in many circumstances