SET (6) Prof. Dragomir R. Radev radev@umich.edu
Vector space classification [figure: documents from topic1 and topic2 plotted in a two-dimensional feature space with axes x1 and x2]
Decision surfaces [figure: a decision surface separating topic1 from topic2 in the (x1, x2) plane]
Decision trees [figure: the partition of the (x1, x2) plane into topic1 and topic2 regions produced by a decision tree]
Classification using decision trees • Expected information needed to classify a sample: I(s1, s2, …, sm) = − Σi=1..m pi log2(pi), where pi = si / s • s = number of data samples (si = samples in class i) • m = number of classes
Decision tree induction • I(s1, s2) = I(9, 5) = −9/14 log2(9/14) − 5/14 log2(5/14) = 0.940
Entropy and information gain • E(A) = Σj [(s1j + … + smj) / s] · I(s1j, …, smj) • Entropy E(A) = expected information based on the partitioning into subsets by attribute A • Gain(A) = I(s1, s2, …, sm) − E(A)
Entropy • Age <= 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971 • Age in 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0 • Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971
Entropy (cont’d) • E(age) = 5/14 I(s11, s21) + 4/14 I(s12, s22) + 5/14 I(s13, s23) = 0.694 • Gain(age) = I(s1, s2) − E(age) = 0.246 • Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
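For concreteness, the entropy and gain numbers above can be reproduced with a short Python sketch (the class counts per age bucket are taken from the slides; the code itself is generic):

```python
import math

def info(*counts):
    """Expected information I(s1, ..., sm) = -sum(pi * log2(pi)), pi = si / s."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Class counts from the slides: 9 "yes" and 5 "no" samples overall.
i_all = info(9, 5)                                   # 0.940

# Counts (yes, no) per value of the "age" attribute: <=30, 31..40, >40.
age_partitions = [(2, 3), (4, 0), (3, 2)]
n = 14
e_age = sum((s1 + s2) / n * info(s1, s2) for s1, s2 in age_partitions)   # 0.694
gain_age = i_all - e_age                             # 0.246

print(round(i_all, 3), round(e_age, 3), round(gain_age, 3))
```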
Final decision tree [figure: the root tests age; 31..40 → yes; <= 30 → test student (no → no, yes → yes); > 40 → test credit (excellent → no, fair → yes)]
Other techniques • Bayesian classifiers • Test sample X: age <= 30, income = medium, student = yes, credit = fair • P(yes) = 9/14 = 0.643 • P(no) = 5/14 = 0.357
Example • P(age <= 30 | yes) = 2/9 = 0.222 • P(age <= 30 | no) = 3/5 = 0.600 • P(income = medium | yes) = 4/9 = 0.444 • P(income = medium | no) = 2/5 = 0.400 • P(student = yes | yes) = 6/9 = 0.667 • P(student = yes | no) = 1/5 = 0.200 • P(credit = fair | yes) = 6/9 = 0.667 • P(credit = fair | no) = 2/5 = 0.400
Example (cont’d) • P (X | yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 • P (X | no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019 • P (X | yes) P (yes) = 0.044 x 0.643 = 0.028 • P (X | no) P (no) = 0.019 x 0.357 = 0.007 • Answer: yes/no?
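The same naive Bayes calculation as a short Python sketch, plugging in the conditional probabilities and priors from the slides:

```python
# Conditional probabilities for test sample X, taken from the slides.
p_x_given_yes = 0.222 * 0.444 * 0.667 * 0.667   # ~0.044
p_x_given_no  = 0.600 * 0.400 * 0.200 * 0.400   # ~0.019

p_yes, p_no = 9 / 14, 5 / 14                    # class priors

score_yes = p_x_given_yes * p_yes               # ~0.028
score_no  = p_x_given_no * p_no                 # ~0.007

# Naive Bayes picks the class with the larger unnormalized posterior.
print("yes" if score_yes > score_no else "no")  # -> "yes"
```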
SET Fall 2013 … 10. Linear classifiers • Kernel methods • Support vector machines • …
Linear boundary [figure: a linear boundary separating topic1 from topic2 in the (x1, x2) plane]
Vector space classifiers • Using centroids • Boundary = the line (hyperplane) equidistant from the two class centroids
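A minimal sketch of centroid-based classification as described above (the 2-D document vectors are invented purely for illustration):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(doc, centroid1, centroid2):
    """Assign the document to the topic whose centroid is closer.
    The decision boundary is exactly the set of points equidistant
    from the two centroids."""
    d1, d2 = math.dist(doc, centroid1), math.dist(doc, centroid2)
    return "topic1" if d1 <= d2 else "topic2"

# Hypothetical 2-D document vectors.
topic1_docs = [[1.0, 0.2], [0.9, 0.4]]
topic2_docs = [[0.1, 1.0], [0.3, 0.8]]
print(classify([0.8, 0.3], centroid(topic1_docs), centroid(topic2_docs)))  # -> topic1
```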
Generative models: kNN • Assign each element to the class of its nearest labeled neighbors • k-nearest neighbors • Very easy to program • Voronoi tessellation; nonlinear decision boundaries • Issues: choosing k, b? • Demo: • http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
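A minimal kNN sketch matching the description above (the 2-D points are invented for illustration):

```python
import math
from collections import Counter

def knn_classify(query, labeled_points, k=3):
    """Classify `query` by majority vote among its k nearest labeled points.
    `labeled_points` is a list of (vector, label) pairs."""
    neighbors = sorted(labeled_points, key=lambda p: math.dist(query, p[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [([1.0, 0.1], "topic1"), ([0.9, 0.3], "topic1"),
         ([0.2, 0.9], "topic2"), ([0.1, 1.1], "topic2")]
print(knn_classify([0.8, 0.2], train, k=3))   # -> topic1
```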
Linear separators • In two dimensions the separator is a line: w1x1 + w2x2 = b, with w1x1 + w2x2 > b for the positive class • In n-dimensional spaces the separator is a hyperplane: w·x = w1x1 + … + wnxn = b, with w·x > b for the positive class
Example 1 [figure: a linear separator between topic1 and topic2 in the (x1, x2) plane]
Example 2 • Classifier for "interest" in Reuters-21578 • b = 0 • If the document is "rate discount dlr world", its score is 0.67·1 + 0.46·1 + (−0.71)·1 + (−0.35)·1 = 0.07 > 0 • Example from MSR
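The scoring above as a tiny Python sketch; the weight dictionary holds only the four term weights visible in the example, so treat it as illustrative rather than the full "interest" classifier:

```python
# Term weights for the "interest" classifier, as given in the example.
weights = {"rate": 0.67, "discount": 0.46, "dlr": -0.71, "world": -0.35}
b = 0.0

def score(doc_terms):
    """Linear classifier: sum of the weights of terms present in the document."""
    return sum(weights.get(t, 0.0) for t in doc_terms)

doc = "rate discount dlr world".split()
s = score(doc)
print(round(s, 2), "positive" if s > b else "negative")   # 0.07 positive
```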
Example: perceptron algorithm [algorithm shown as a figure: input (training examples), update loop, output (weight vector)]
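A minimal sketch of the standard perceptron update rule in the w·x > b form used above (the textbook algorithm, not necessarily the exact pseudocode from the original slide; the training data is invented):

```python
def perceptron_train(examples, n_features, epochs=10, lr=1.0):
    """Learn a separator w.x > b from (vector, label) pairs with labels +1/-1.
    On each misclassified example, move the weights toward that example."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) - b
            if y * activation <= 0:                       # mistake: update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b -= lr * y
    return w, b

data = [([1.0, 0.1], +1), ([0.9, 0.3], +1), ([0.1, 0.9], -1), ([0.2, 1.0], -1)]
w, b = perceptron_train(data, n_features=2)
print(w, b)
```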
Linear classifiers • What is the major shortcoming of the perceptron? • How to determine the dimensionality of the separator? • Bias-variance tradeoff (example) • How to deal with multiple classes? • Any-of: build a separate classifier for each class • One-of: harder (J hyperplanes do not divide R^M into J regions); instead, use class complements and scoring
Support vector machines • Introduced by Vapnik in the early 90s.
Issues with SVM • Soft margins (to handle inseparable data) • Kernels (to handle non-linearity)
The kernel idea [figure: the same data before and after mapping to a higher-dimensional feature space, where it becomes linearly separable]
Example (mapping to a higher-dimensional space)
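A standard illustration of such a mapping (an assumption; not necessarily the example shown on the original slide): phi(x1, x2) = (x1^2, sqrt(2)·x1·x2, x2^2) maps 2-D points into 3-D. A circular boundary in the original space becomes a linear separator in the mapped space, and inner products satisfy phi(x)·phi(z) = (x·z)^2, which is exactly the degree-2 polynomial kernel.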
The kernel trick • Polynomial kernel: K(x, z) = (x·z + c)^d • Sigmoid kernel: K(x, z) = tanh(κ x·z + c) • RBF (Gaussian) kernel: K(x, z) = exp(−||x − z||² / (2σ²)) • Many other kernels are useful for IR: e.g., string kernels, subsequence kernels, tree kernels, etc.
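A small numeric check of the kernel trick for the degree-2 polynomial kernel (the vectors are invented; the point is that the kernel value computed in the original space equals the inner product after the explicit mapping):

```python
import math

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel without bias term: K(x, z) = (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

x, z = [1.0, 2.0], [3.0, 0.5]
lhs = sum(a * b for a, b in zip(phi(x), phi(z)))   # inner product after mapping
rhs = poly2_kernel(x, z)                           # kernel in the original space
print(round(lhs, 6), round(rhs, 6))                # 16.0 16.0
```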
SVM (cont’d) • Evaluation: SVM > kNN > decision tree > NB • Implementation: quadratic optimization • Use a toolkit (e.g., Thorsten Joachims’ SVMlight)
Semi-supervised learning • EM • Co-training • Graph-based
Exploiting Hyperlinks – Co-training • Each document instance has two alternate views (Blum and Mitchell 1998): • the terms in the document, x1 • the terms in the hyperlinks that point to the document, x2 • Each view alone is sufficient to determine the class of the instance • The labeling function that classifies examples is the same whether applied to x1 or x2 • x1 and x2 are conditionally independent, given the class [Slide from Pierre Baldi]
Co-training algorithm • Labeled data are used to train two Naïve Bayes classifiers, one for each view • Each classifier then: examines the unlabeled data, picks its most confidently predicted positive and negative examples, and adds them to the labeled examples • Both classifiers are retrained on the augmented set of labeled examples, and the process repeats [Slide from Pierre Baldi]
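A compact sketch of this loop, assuming scikit-learn's MultinomialNB and two precomputed feature-view matrices X1 and X2 (an illustrative implementation, not one given in the course):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1, X2, y, labeled_idx, unlabeled_idx, rounds=10):
    """Co-training sketch: X1, X2 are the two feature views (rows = documents),
    y gives 0/1 labels for labeled_idx (assumed to contain both classes).
    Each round, each view's classifier adds its most confident positive and
    negative unlabeled example to the labeled set, then both are retrained."""
    labeled = list(labeled_idx)
    unlabeled = list(unlabeled_idx)
    labels = {i: lab for i, lab in zip(labeled, y)}
    for _ in range(rounds):
        clf1 = MultinomialNB().fit(X1[labeled], [labels[i] for i in labeled])
        clf2 = MultinomialNB().fit(X2[labeled], [labels[i] for i in labeled])
        for clf, X in ((clf1, X1), (clf2, X2)):
            if not unlabeled:
                break
            probs = clf.predict_proba(X[unlabeled])      # columns follow clf.classes_ = [0, 1]
            picks = [(unlabeled[int(np.argmax(probs[:, 1]))], 1),   # most confident positive
                     (unlabeled[int(np.argmax(probs[:, 0]))], 0)]   # most confident negative
            for idx, lab in picks:
                if idx in unlabeled:
                    labels[idx] = lab
                    labeled.append(idx)
                    unlabeled.remove(idx)
    return clf1, clf2
```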
Conclusion • SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters • NB is also good in many circumstances