STUDENT RESEARCH SYMPOSIUM 2005

Presentation Transcript


  1. STUDENT RESEARCH SYMPOSIUM 2005 Title: Strategically using Pairwise Classification to Improve Category Prediction Presenter: Pinar Donmez Advisors: Carolyn Penstein Rosé, Jaime Carbonell LTI, SCS Carnegie Mellon University

  2. Outline • Problem Definition • Overview: Multi-label text classification methods • Motivation for Ensemble Approaches • Technical Details: Selective Concentration Classifiers • Formal Evaluation • Conclusions and Future Work

  3. Problem Definition • Multi-label Text Classification (TC): one-to-one mapping of documents to pre-defined categories • Problems with TC: • Limited training data • Poor choice of features • Flaws in the learning process • Goal: Improve predictions on unseen data

  4. Multi-label TC Methods • ECOC • Boosting • Pairwise Coupling and Latent Variable Approach

  5. ECOC • Recall the problems with TC: • Limited training data • Poor choice of features • Flaws in the learning process • Result: Poor classification performance • ECOC: • Encode each class in a code vector • Encode each example in a code vector • Calculate the probability of each bit being 1 using, e.g., decision trees, neural networks, etc. • Combine these probabilities into a vector • To classify a given example, calculate the distance between this vector and each class's codeword
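
To make the ECOC decoding step concrete, here is a minimal Python sketch. The 4-class, 6-bit codeword matrix and the per-bit probabilities are made up for illustration; the slide does not specify a particular code or distance, so L1 distance is assumed here.

```python
import numpy as np

# Toy codeword matrix: 4 classes x 6 bit-classifiers (values are made up).
codewords = np.array([
    [1, 0, 1, 0, 1, 0],  # class 0
    [0, 1, 1, 0, 0, 1],  # class 1
    [1, 1, 0, 1, 0, 0],  # class 2
    [0, 0, 0, 1, 1, 1],  # class 3
])

def ecoc_predict(bit_probs):
    """bit_probs: vector of P(bit = 1 | x), one entry per bit classifier."""
    # L1 distance between the probability vector and each class codeword;
    # the class with the nearest codeword wins.
    distances = np.abs(codewords - bit_probs).sum(axis=1)
    return int(np.argmin(distances))

# One test document's per-bit probabilities -> predicted class 0.
print(ecoc_predict(np.array([0.9, 0.2, 0.8, 0.1, 0.7, 0.3])))
```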

  6. Boosting • Main idea: Evolve a set of weights over the training set, upweighting misclassified examples at each round (a sketch follows below)
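
The slide does not pin down a specific boosting variant; below is a minimal AdaBoost-style sketch of the weight evolution it alludes to, with all names illustrative.

```python
import numpy as np

def boost_round(weights, y_true, y_pred):
    """One round: upweight the examples the current model got wrong."""
    miss = y_true != y_pred
    err = weights[miss].sum() / weights.sum()           # weighted error rate
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # model confidence
    weights = weights * np.exp(np.where(miss, alpha, -alpha))
    return weights / weights.sum(), alpha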

  7. Pairwise Coupling • K classes, N observations: • X = (f1, f2, f3, …, fp) is an observation with p features • K=2 case is generally easier than K>2 cases – since only one decision boundary has to be learned • Friedman’s rule for K-class problem (K>2): Max-wins Rule
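
A minimal sketch of the max-wins rule: run all K*(K-1)/2 pairwise classifiers and predict the class that wins the most contests. The `pairwise_predict` callable is an assumed stand-in for the trained binary classifiers.

```python
def max_wins(x, classes, pairwise_predict):
    """Predict the class that wins the most pairwise contests.

    pairwise_predict(x, ci, cj) stands in for the trained binary
    classifier between classes ci and cj and returns the winner.
    """
    votes = {c: 0 for c in classes}
    for i, ci in enumerate(classes):
        for cj in classes[i + 1:]:
            votes[pairwise_predict(x, ci, cj)] += 1
    return max(votes, key=votes.get)
```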

  8. Latent Variable Approach* • Use hidden variables that indicate whether the corresponding model is good at capturing particular patterns of the data • The decision is based on the posterior probability p(y|x) = Σi p(mi|x) · p(y|x, mi), where p(mi|x) is the likelihood that the ith model should be used for class prediction given input x, and p(y|x, mi) is the probability of y given input x and the ith model * Y. Liu, J. Carbonell, and R. Jin. A pairwise ensemble approach for accurate genre classification. In ECML '03, 2003.
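
A small numeric sketch of the mixture decision rule above; the gating weights and per-model class probabilities are made up for illustration.

```python
import numpy as np

def mixture_posterior(gate, class_probs):
    """gate: p(m_i | x), shape (M,); class_probs: p(y | x, m_i), shape (M, K).
    Returns p(y | x), shape (K,)."""
    return gate @ class_probs

gate = np.array([0.8, 0.2])                    # p(m_i | x), made up
class_probs = np.array([[0.6, 0.4],            # p(y | x, m_1)
                        [0.2, 0.8]])           # p(y | x, m_2)
print(mixture_posterior(gate, class_probs).argmax())  # -> class 0
```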

  9. [Figure: pairwise ensemble structure for a class Ci] • Liu et al. build a structure like the one above for each class • Compute the corresponding score for each test example • Assign the example to the class with the highest score

  10. Intuition Behind Our Method • Multiple classes => single decision boundary is not powerful enough • Ensemble notion: • Partition the data into focused subsets • Learn a classification model on each subset • Combine the predictions of each model • What is the problem with ensemble techniques? • When the category space is large, time complexity to build models on subsets becomes intractable • Our method addresses this problem. But how?

  11. Technical Details • Build one-vs-all classifiers iteratively • At each iteration choose which sub-classifiers to build based on an analysis of error distributions • Idea: Focus on the classes that are highly confusable • Similar to Boosting • Boosting modifies the weights of misclassified examples to penalize inaccurate models • In decision stage: • If a confusable class is chosen for prediction of a test example, predictions of the sub-classifiers for that class are also taken into account
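
A hypothetical sketch of the decision stage described above; `ova_score`, `sub_predict`, and `confusable` are illustrative stand-ins for the trained one-vs-all models, the focused sub-classifiers, and the set of confusable classes, not the authors' code.

```python
def decide(x, classes, ova_score, sub_predict, confusable):
    # Score x under each one-vs-all model and take the top class.
    scores = {c: ova_score(c, x) for c in classes}
    best = max(scores, key=scores.get)
    # If the winner is a confusable class with its own sub-classifier,
    # also consult that focused sub-classifier's prediction.
    if best in confusable:
        best = sub_predict(best, x)
    return best
```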

  12. [Diagram] Train K one-vs-all models → confusion matrix → identify confusable classes (e.g., A confused with B; D confused with F and H) → build A-vs-B and D-vs-{F and H} sub-classifiers → recompute the confusion matrix and iterate. Note: Continue to build sub-classifiers until either there is no need or you cannot divide any further!

  13. How to choose sub-classifiers? • fi(λ) = λ*µi + (1 − λ)*σ²i • gi(β) = µi + β*σ²i, where µi = average number of false positives for class i and σ²i = variance of the false positives for class i • Focus on classes for which fi(λ) > T (T a predefined threshold) • For every i for which the above inequality holds: • Choose all classes j where C(i,j) > gi(β) • C(i,j) = entry in the confusion matrix where i is the predicted class and j is the true class • 3 parameters: λ, β, and T • Tuned on a held-out set
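
A sketch of the selection rule on this slide, assuming C is a K×K confusion matrix with C[i, j] counting examples of true class j predicted as class i, so that row i holds class i's false positives; the function and variable names are illustrative.

```python
import numpy as np

def choose_subclassifiers(C, lam, beta, T):
    """Return (i, confused_classes) pairs for which to build sub-classifiers."""
    K = C.shape[0]
    pairs = []
    for i in range(K):
        fp = np.delete(C[i], i)          # off-diagonal entries of row i
        mu, var = fp.mean(), fp.var()
        if lam * mu + (1 - lam) * var > T:        # f_i(lambda) > T
            g = mu + beta * var                   # g_i(beta)
            confused = [j for j in range(K) if j != i and C[i, j] > g]
            if confused:
                pairs.append((i, confused))       # build i-vs-{confused}
    return pairs
```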

  14. Analysis of error distribution for some classifiers I • Analysis on the 20 Newsgroups dataset: • These errors are more uniformly distributed • The average number of false positives is not very high • The two criteria aren't met: • Skewed error distribution • Large number of errors

  15. Analysis of error distribution for some classifiers II • Common in all three: • Skewed distribution of errors (false positives) • These peaks will form the sub-classifiers

  16. Implications of our method • Objective: Obtain high accuracy by choosing a small set of sub-classifiers within a small number of iterations • Pros: • Strategically choosing sub-classifiers reduces training time compared to building all one-vs-one classifiers • O(n log n) classifiers on average • Sub-classifiers are trained on more focused sets, so they are likely to do a better job • Cons: • We focus on the classes that are hardest to distinguish when building sub-classifiers, so performance might be hurt as we increase the number of iterations

  17. Evaluation • Dataset: 20 Newsgroups* • Evaluation on two versions: • Original 20 Newsgroups (19,997 documents evenly distributed across 20 classes) • Cleaned version (headers, stopwords, and words occurring only once removed) • Vocabulary size ~62,000 * J. Rennie and R. Rifkin. Improving Multiclass Text Classification with SVM. MIT AI Memo AIM-2001-026, 2001.

  18. Comparison of Results I • Results are based on the evaluation of the cleaned version of the 20 Newsgroups dataset • Selective Concentration performed comparably to the Latent Variable Approach • Selective Concentration uses O(n log n) classifiers on average, while the Latent Variable Approach uses O(n²) classifiers

  19. Comparison of Results II • Results are based on the original version of the 20 Newsgroups data • The Selective Concentration method is significantly better than the baseline • The difference between the numbers of classifiers in the two methods is not very large

  20. Conclusion and Future Work • We can achieve comparable accuracy with less training time by strategically selecting sub-classifiers • O(n log n) vs. O(n²) classifiers • Continued formalization of how different error distributions affect the advantage of this approach • Application to semantic role labeling
