STUDENT RESEARCH SYMPOSIUM 2005

Presentation Transcript


  1. STUDENT RESEARCH SYMPOSIUM 2005 Title: Strategically using Pairwise Classification to Improve Category Prediction Presenter: Pinar Donmez Advisors: Carolyn Penstein Rosé, Jaime Carbonell LTI, SCS Carnegie Mellon University

  2. Outline • Problem Definition • Overview: Multi-label text classification methods • Motivation for Ensemble Approaches • Technical Details: Selective Concentration Classifiers • Formal Evaluation • Conclusions and Future Work

  3. Problem Definition • Multi-label Text Classification (TC): one-to-one mapping of documents to pre-defined categories • Problems with TC: • Limited training data • Poor choice of features • Flaws in the learning process • Goal: Improve predictions on unseen data

  4. Multi-label TC Methods • ECOC • Boosting • Pairwise Coupling and Latent Variable Approach

  5. ECOC • Recall the problems with TC: • Limited training data • Poor choice of features • Flaws in the learning process • Result: Poor classification performance • ECOC: • Encode each class in a code vector • Encode each example in a code vector • Calculate the probability of each bit being 1 using, e.g., decision trees, neural networks, etc. • Combine these probabilities into a vector • To classify a given example, calculate the distance between this vector and each class's codeword
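
To make the ECOC decoding step concrete, here is a minimal Python sketch. The 4-class, 6-bit codeword matrix and the per-bit probabilities are made up for illustration; the slide does not specify a particular code or distance, so L1 distance is assumed here.

```python
import numpy as np

# Toy codeword matrix: 4 classes x 6 bit-classifiers (values are made up).
codewords = np.array([
    [1, 0, 1, 0, 1, 0],  # class 0
    [0, 1, 1, 0, 0, 1],  # class 1
    [1, 1, 0, 1, 0, 0],  # class 2
    [0, 0, 0, 1, 1, 1],  # class 3
])

def ecoc_predict(bit_probs):
    """bit_probs: vector of P(bit = 1 | x), one entry per bit classifier."""
    # L1 distance between the probability vector and each class codeword;
    # the class with the nearest codeword wins.
    distances = np.abs(codewords - bit_probs).sum(axis=1)
    return int(np.argmin(distances))

# One test document's per-bit probabilities -> predicted class 0.
print(ecoc_predict(np.array([0.9, 0.2, 0.8, 0.1, 0.7, 0.3])))
```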

  6. Boosting • Main idea: Evolve a set of weights over the training set, upweighting misclassified examples at each round (a sketch follows below)
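
The slide does not pin down a specific boosting variant; below is a minimal AdaBoost-style sketch of the weight evolution it alludes to, with all names illustrative.

```python
import numpy as np

def boost_round(weights, y_true, y_pred):
    """One round: upweight the examples the current model got wrong."""
    miss = y_true != y_pred
    err = weights[miss].sum() / weights.sum()           # weighted error rate
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # model confidence
    weights = weights * np.exp(np.where(miss, alpha, -alpha))
    return weights / weights.sum(), alpha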

  7. Pairwise Coupling • K classes, N observations: • X = (f1, f2, f3, …, fp) is an observation with p features • K=2 case is generally easier than K>2 cases – since only one decision boundary has to be learned • Friedman’s rule for K-class problem (K>2): Max-wins Rule
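
A minimal sketch of the max-wins rule: run all K*(K-1)/2 pairwise classifiers and predict the class that wins the most contests. The `pairwise_predict` callable is an assumed stand-in for the trained binary classifiers.

```python
def max_wins(x, classes, pairwise_predict):
    """Predict the class that wins the most pairwise contests.

    pairwise_predict(x, ci, cj) stands in for the trained binary
    classifier between classes ci and cj and returns the winner.
    """
    votes = {c: 0 for c in classes}
    for i, ci in enumerate(classes):
        for cj in classes[i + 1:]:
            votes[pairwise_predict(x, ci, cj)] += 1
    return max(votes, key=votes.get)
```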

  8. Latent Variable Approach* • Use hidden variables that indicate whether the corresponding model is good at capturing particular patterns of the data • The decision is based on the posterior probability p(y|x) = Σi p(mi|x) · p(y|x, mi), where p(mi|x) is the likelihood that the ith model should be used for class prediction given input x, and p(y|x, mi) is the probability of y given input x and the ith model * Y. Liu, J. Carbonell, and R. Jin. A pairwise ensemble approach for accurate genre classification. In ECML '03, 2003.
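
A small numeric sketch of the mixture decision rule above; the gating weights and per-model class probabilities are made up for illustration.

```python
import numpy as np

def mixture_posterior(gate, class_probs):
    """gate: p(m_i | x), shape (M,); class_probs: p(y | x, m_i), shape (M, K).
    Returns p(y | x), shape (K,)."""
    return gate @ class_probs

gate = np.array([0.8, 0.2])                    # p(m_i | x), made up
class_probs = np.array([[0.6, 0.4],            # p(y | x, m_1)
                        [0.2, 0.8]])           # p(y | x, m_2)
print(mixture_posterior(gate, class_probs).argmax())  # -> class 0
```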

  9. [Figure: pairwise ensemble structure for a class Ci] • Liu et al. build a structure like the one above for each class • Compute the corresponding score for each test example • Assign the example to the class with the highest score

  10. Intuition Behind Our Method • Multiple classes => single decision boundary is not powerful enough • Ensemble notion: • Partition the data into focused subsets • Learn a classification model on each subset • Combine the predictions of each model • What is the problem with ensemble techniques? • When the category space is large, time complexity to build models on subsets becomes intractable • Our method addresses this problem. But how?

  11. Technical Details • Build one-vs-all classifiers iteratively • At each iteration choose which sub-classifiers to build based on an analysis of error distributions • Idea: Focus on the classes that are highly confusable • Similar to Boosting • Boosting modifies the weights of misclassified examples to penalize inaccurate models • In decision stage: • If a confusable class is chosen for prediction of a test example, predictions of the sub-classifiers for that class are also taken into account
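
A hypothetical sketch of the decision stage described above; `ova_score`, `sub_predict`, and `confusable` are illustrative stand-ins for the trained one-vs-all models, the focused sub-classifiers, and the set of confusable classes, not the authors' code.

```python
def decide(x, classes, ova_score, sub_predict, confusable):
    # Score x under each one-vs-all model and take the top class.
    scores = {c: ova_score(c, x) for c in classes}
    best = max(scores, key=scores.get)
    # If the winner is a confusable class with its own sub-classifier,
    # also consult that focused sub-classifier's prediction.
    if best in confusable:
        best = sub_predict(best, x)
    return best
```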

  12. [Diagram] Train K one-vs-all models → confusion matrix → identify confusable classes (e.g., A confused with B; D confused with F and H) → build A-vs-B and D-vs-{F and H} sub-classifiers → recompute the confusion matrix and iterate. Note: Continue to build sub-classifiers until either there is no need or you cannot divide any further!

  13. How to choose sub-classifiers? • fi(λ) = λ*µi + (1 − λ)*σ²i • gi(β) = µi + β*σ²i, where µi = average number of false positives for class i and σ²i = variance of the false positives for class i • Focus on classes for which fi(λ) > T (T a predefined threshold) • For every i for which the above inequality holds: • Choose all classes j where C(i,j) > gi(β) • C(i,j) = entry in the confusion matrix where i is the predicted class and j is the true class • 3 parameters: λ, β, and T • Tuned on a held-out set
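
A sketch of the selection rule on this slide, assuming C is a K×K confusion matrix with C[i, j] counting examples of true class j predicted as class i, so that row i holds class i's false positives; the function and variable names are illustrative.

```python
import numpy as np

def choose_subclassifiers(C, lam, beta, T):
    """Return (i, confused_classes) pairs for which to build sub-classifiers."""
    K = C.shape[0]
    pairs = []
    for i in range(K):
        fp = np.delete(C[i], i)          # off-diagonal entries of row i
        mu, var = fp.mean(), fp.var()
        if lam * mu + (1 - lam) * var > T:        # f_i(lambda) > T
            g = mu + beta * var                   # g_i(beta)
            confused = [j for j in range(K) if j != i and C[i, j] > g]
            if confused:
                pairs.append((i, confused))       # build i-vs-{confused}
    return pairs
```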

  14. Analysis of error distribution for some classifiers I • Analysis on the 20 Newsgroups dataset: • These errors are more uniformly distributed • The average number of false positives is not very high • The two criteria aren't met: • Skewed error distribution • Large number of errors

  15. Analysis of error distribution for some classifiers II • Common in all three: • Skewed distribution of errors (false positives) • These peaks will form the sub-classifiers

  16. Implications of our method • Objective: Obtain high accuracy by choosing a small set of sub-classifiers within a small number of iterations • Pros: • Strategically choosing sub-classifiers reduces training time compared to building all one-vs-one classifiers • O(n log n) classifiers on average • Sub-classifiers are trained on more focused sets, so they are likely to do a better job • Cons: • We focus on the classes that are hardest to distinguish when building sub-classifiers, so performance might be hurt as we increase the number of iterations

  17. Evaluation • Dataset: 20 Newsgroups* • Evaluation on two versions: • Original 20 Newsgroups (19,997 documents evenly distributed across 20 classes) • Cleaned version (headers, stopwords, and words occurring only once removed) • Vocabulary size ~62,000 * J. Rennie and R. Rifkin. Improving Multiclass Text Classification with SVM. MIT AI Memo AIM-2001-026, 2001.

  18. Comparison of Results I • Results are based on the evaluation of the cleaned version of the 20 Newsgroups dataset • Selective Concentration performed comparably to the Latent Variable Approach • Selective Concentration uses O(n log n) classifiers on average, while the Latent Variable Approach uses O(n²) classifiers

  19. Comparison of Results II • Results are based on the original version of the 20 Newsgroups data • The Selective Concentration method is significantly better than the baseline • The difference between the numbers of classifiers in the two methods is not very large

  20. Conclusion and Future Work • We can achieve comparable accuracy with less training time by strategically selecting sub-classifiers • O(n log n) vs. O(n²) classifiers • Continued formalization of how different error distributions affect the advantage of this approach • Application to semantic role labeling
