
Boosting to Correct Inductive Bias in Text Classification


Presentation Transcript


  1. CIKM’02
Boosting to Correct Inductive Bias in Text Classification
Yan Liu, Yiming Yang and Jaime Carbonell
School of Computer Science, Carnegie Mellon University
Nov 1, 2002

  2. Introduction to Boosting
• Boosting
  • Runs a weak learning algorithm on sampled examples
  • Combines the classifiers produced by the weak learners into a single composite classifier
• Characteristics
  • Error-driven sampling
  • Combination strategies
• Variants
  • AdaBoost vs. adaptive resampling

  3. Boosting Algorithm: AdaBoost
• AdaBoost algorithm (by Freund and Schapire)
• Sampling strategy: reweight the training examples each round so that misclassified examples count more
• Combination strategy: a weighted vote over the weak classifiers (see the sketch below)
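The slide's formulas are not preserved in this transcript. As a reference point, here is a minimal sketch of binary AdaBoost's error-driven reweighting and weighted-vote combination; the `weak_learner_factory` interface and constants are illustrative, not from the paper.

```python
import numpy as np

def adaboost(X, y, weak_learner_factory, rounds=50):
    """Minimal binary AdaBoost sketch (labels in {-1, +1}).

    weak_learner_factory() must return an object with
    fit(X, y, sample_weight) and predict(X) methods.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)            # start with uniform example weights
    learners, alphas = [], []
    for _ in range(rounds):
        h = weak_learner_factory()
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.sum(w[pred != y])     # weighted training error
        if eps >= 0.5:                 # no better than chance: stop early
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
        # Error-driven update: up-weight misclassified examples
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)

    def classify(X_new):
        # Weighted-vote combination of the weak classifiers
        votes = sum(a * h.predict(X_new) for a, h in zip(alphas, learners))
        return np.sign(votes)
    return classify
```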

  4. Boosting Algorithm: AdaBoost
• Theoretical analysis
  • Bound on error
    • Training error drops exponentially fast [Schapire]
    • A qualitative bound on the generalization error
  • Connections
    • Logistic regression [Friedman]
    • Game theory and linear programming [Schapire]
    • Exponential models [Lebanon & Lafferty]
• Applications
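The transcript drops the formula behind the first bullet. For reference, Schapire's bound states that if the weak learner in round t has weighted training error \(\epsilon_t = 1/2 - \gamma_t\), then the training error of the combined classifier H after T rounds satisfies

\[
\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\bigl[H(x_i)\neq y_i\bigr]
\;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
\;=\; \prod_{t=1}^{T}\sqrt{1-4\gamma_t^{2}}
\;\le\; \exp\Bigl(-2\sum_{t=1}^{T}\gamma_t^{2}\Bigr),
\]

so any consistent edge \(\gamma_t \ge \gamma > 0\) over random guessing drives the training error down exponentially in T.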

  5. Boosting Algorithm: Adaptive Resampling
• Adaptive resampling (by Weiss et al.)
• Sampling strategy: resample the training data so that misclassified examples are more likely to be drawn (see the sketch below)
• Combination strategy
  • Linear combination: unweighted voting
• Theoretical basis
  • Resampling with any technique that increases the selection likelihood of the misclassified examples will achieve an improvement
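A minimal sketch of this scheme, assuming {-1, +1} labels and numpy arrays; the doubling factor for misclassified examples is an illustrative choice, not the specific update used by Weiss et al.

```python
import numpy as np

def adaptive_resampling(X, y, learner_factory, rounds=10, rng=None):
    """Sketch of adaptive resampling with unweighted voting.

    Each round draws a bootstrap sample in which previously
    misclassified examples are more likely to be selected.
    """
    rng = rng or np.random.default_rng(0)
    n = len(y)
    p = np.full(n, 1.0 / n)               # per-example sampling probabilities
    learners = []
    for _ in range(rounds):
        idx = rng.choice(n, size=n, p=p)  # error-driven bootstrap sample
        h = learner_factory()
        h.fit(X[idx], y[idx])
        learners.append(h)
        miss = (h.predict(X) != y)
        p[miss] *= 2.0                    # illustrative up-weighting factor
        p /= p.sum()

    def classify(X_new):
        # Unweighted (majority) vote over all sampled-round classifiers
        votes = sum(h.predict(X_new) for h in learners)
        return np.sign(votes)
    return classify
```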

  6. Task Identification
• Perspective
  • How does boosting react to the inductive bias of different classifiers?
• Main focus
  • How well does boosting work for "non-weak" learning algorithms?
  • Decision trees, naïve Bayes, support vector machines and the Rocchio-based classifier

  7. Inductive Bias
• Inductive learning
  • Inducing classification functions from a set of training examples
• Inductive bias
  • The underlying assumptions made in the inductive inference
  • Restriction bias vs. preference bias
    • Restricting the search space (e.g., naïve Bayes assumes conditionally independent features) vs. preferring some hypotheses in the search strategy (e.g., decision trees favor shorter trees)

  8. Boosting Decision Tree

  9. Boosting Naïve Bayes

  10. Boosting SVMs
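The figures on this slide are not preserved in the transcript. Purely as an illustration of the setup, a linear SVM can serve as the base learner in the adaboost() sketch from slide 3, since scikit-learn's LinearSVC accepts a sample_weight argument in fit; X_train, y_train and X_test are placeholders, and labels are assumed to be in {-1, +1}.

```python
from sklearn.svm import LinearSVC

# LinearSVC.fit accepts sample_weight, so it matches the
# weak_learner_factory interface of the adaboost() sketch above.
# X_train, y_train, X_test are placeholder document vectors and
# labels, not data from the paper.
classify = adaboost(X_train, y_train,
                    weak_learner_factory=lambda: LinearSVC(C=1.0),
                    rounds=20)
y_pred = classify(X_test)
```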

  11. Boosting Rocchio

  12. Experiments
• Data collection
  • Reuters-21578 corpus
  • 90 categories, 7,769 training examples and 3,019 test examples
• Pre-processing
  • Stopword removal and stemming
• Measurement
  • Micro-averaged F1 vs. macro-averaged F1
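On the measurement bullet: micro-averaged F1 pools true/false positives and negatives across all 90 categories, so common categories dominate the score, while macro-averaged F1 averages the per-category F1 values, giving rare categories equal weight. A minimal sketch with scikit-learn, assuming y_true and y_pred are multilabel indicator arrays (documents × categories):

```python
from sklearn.metrics import f1_score

# y_true, y_pred: binary indicator arrays of shape (n_docs, n_categories).
# Micro-averaging pools decisions over all categories (common classes
# dominate); macro-averaging weights each category equally (sensitive
# to rare classes).
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
```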

  13. Experiment Results: Boosting SVMs
• Micro-averaged F1
  • Highest score: 0.875 (with adaptive resampling)
  • Overfitting problems
• Macro-averaged F1
  • A 17% improvement over the results of Yang & Liu
  • More effective for rare classes than for common classes

  14. Experiment Results: Boosting Decision Trees
• Micro-averaged F1
  • Only a slight improvement
• Macro-averaged F1
  • A 13% improvement over the baseline
  • More effective for rare classes than for common classes

  15. Experiment Results: Boosting Naïve Bayes
• Micro-averaged F1
  • Only a marginal improvement
• Macro-averaged F1
  • Performance decreases
  • Not effective for the rare-class problems of this dataset
  • An open question

  16. Discussion
• Boosting is effective at correcting the inductive bias of SVMs and decision trees on rare categories

  17. Conclusion
• Rare categories: the effectiveness of correcting the inductive bias varies across classifiers
  • Good for SVMs and decision trees: boosting them yields a 13-17% improvement in macro-averaged F1
• Common categories: boosting is not significantly effective at correcting the inductive bias
  • However, the best micro-averaged F1 is achieved by boosting SVMs
