Text Categorization With Support Vector Machines: Learning With Many Relevant Features

By Thorsten Joachims. Presented by Meghneel Gore.

Goal of Text Categorization • Classify documents into a number of pre-defined categories. • Documents can be in multiple categories • Documents can be in none of the categories

Applications of Text Categorization • Categorization of news stories for online retrieval • Finding interesting information from the WWW • Guiding a user's search through hypertext

Representation of Text • Removal of stop words • Reduction of each word to its stem • Preparation of the feature vector

Representation of Text • [Figure: the text of a document mapped to its document vector] • Vector entries shown: 2 Comput 1 Process 2 Buy 3 Memory .... • This is a document vector
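The three preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list and the suffix-stripping rule are hypothetical stand-ins invented for this example (the slides do not say which stop list or stemmer was used), and the sample sentence is made up.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "for", "with", "more"}

# Hypothetical suffixes for a crude stemmer; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ers", "er", "es", "ed", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_vector(text: str) -> Counter:
    """Map raw text to a sparse vector of stem counts."""
    tokens = re.findall(r"[a-z]+", text.lower())                      # tokenize
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]    # drop stop words, then stem
    return Counter(stems)                                             # count each stem


doc = "Computers and computing: the buyer is buying more memory for the computer."
vec = document_vector(doc)
print(vec)
```

Only stems that actually occur are stored, which mirrors the sparsity of document vectors: most entries of the full vocabulary-sized vector are zero.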

What's Next... • Appropriateness of support vector machines for this application • Support vector machine theory • Conventional learning methods • Experiments • Results • Conclusions

Why SVMs? • High-dimensional input space • Few irrelevant features • Sparse document vectors • Most text categorization problems are linearly separable

Support Vector Machines • [Figure: visualization of a support vector machine]

Support Vector Machines • Structural risk minimization

Support Vector Machines • We define a structure of hypothesis spaces H_i such that their respective VC dimensions d_i increase

Support Vector Machines • Lemma [Vapnik, 1982]: Consider hyperplanes as hypotheses

Support Vector Machines • If all example vectors are contained in a hypersphere of radius R and it is required that

Support Vector Machines • Then this set of hyperplanes has a VC dimension d bounded by
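The formulas on the three lemma slides were images and are not part of this transcript. As a hedged reconstruction based on the lemma as it appears in Joachims' paper, the statement reads roughly as follows. The symbols A (the norm of the weight vector) and n (the number of features) are taken from that statement and do not appear in the slide text; the square brackets are kept as written in the paper.

```latex
\begin{align*}
\text{Hypotheses:}\quad & h(\vec{d}) = \operatorname{sign}\{\vec{w}\cdot\vec{d} + b\} \\
\text{Requirement:}\quad & |\vec{w}\cdot\vec{d}_i + b| \ge 1 \ \text{ for all examples } \vec{d}_i,
  \qquad \|\vec{w}\| = A \\
\text{Bound:}\quad & d \le \min\left(\left[R^2 A^2\right],\, n\right) + 1
\end{align*}
```

The point of the bound is that the VC dimension can be controlled through the norm of the weight vector rather than through the number of features, which is what makes the approach attractive for high-dimensional text data.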

Support Vector Machines • Minimize • Such that
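The objective and constraint on this slide were also images. Assuming the standard hard-margin formulation (minimize the norm of w subject to y_i (w · d_i + b) ≥ 1 for every training example), the sketch below checks whether a candidate hyperplane satisfies the constraints and compares two feasible candidates by their norm. It does not solve the optimization problem; the function names and the toy data are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def functional_margins(w, b, docs, labels) -> List[float]:
    """y_i * (w . d_i + b) for every training example."""
    return [y * (dot(w, d) + b) for d, y in zip(docs, labels)]


def is_feasible(w, b, docs, labels, tol: float = 1e-9) -> bool:
    """True if every example satisfies the assumed constraint y_i (w . d_i + b) >= 1."""
    return all(m >= 1 - tol for m in functional_margins(w, b, docs, labels))


def geometric_margin(w) -> float:
    """Distance from the hyperplane to the closest admissible example: 1 / ||w||."""
    return 1.0 / math.sqrt(dot(w, w))


# Toy, linearly separable data (made up for this example).
docs = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 2.0)]
labels = [1, 1, -1, -1]

w_a, b_a = (1.0, 0.0), -1.0   # candidate A
w_b, b_b = (2.0, 0.0), -2.0   # candidate B: same separating line, larger norm

for name, (w, b) in {"A": (w_a, b_a), "B": (w_b, b_b)}.items():
    print(name, is_feasible(w, b, docs, labels), geometric_margin(w))
```

Both candidates are feasible, but A has the smaller norm and therefore the larger margin, so under the assumed objective it is the preferred hyperplane of the two.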

Conventional Learning Methods • Naïve Bayes classifier • Rocchio algorithm • k-nearest neighbors • Decision tree classifier

Naïve Bayes Classifier • Consider a document vector with attributes a1, a2… an with target values v • Bayesian approach:

Naïve Bayes Classifier • We can rewrite that using Bayes theorem as

Naïve Bayes Classifier • The Naïve Bayes method assumes that the attributes are conditionally independent given the target value
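The equations on the three Naïve Bayes slides are missing from this transcript. The usual textbook derivation that matches the slide text is sketched below; the names v_MAP and v_NB and the set V of target values are conventional notation assumed here, not taken from the slides.

```latex
\begin{align*}
v_{MAP} &= \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)
  && \text{(Bayesian approach)} \\
        &= \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)}
  && \text{(Bayes' theorem)} \\
        &= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)
  && \text{(denominator does not depend on } v_j\text{)} \\
v_{NB}  &= \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)
  && \text{(conditional independence)}
\end{align*}
```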

Experiments • Datasets • Performance measures • Results

Datasets • Reuters-21578 dataset: 9,603 training documents, 3,299 test documents • Ohsumed corpus: 10,000 training documents, 10,000 test documents

Performance Measures • Precision: probability that a document predicted to be in class ‘x’ truly belongs to that class • Recall: probability that a document belonging to class ‘x’ is classified into that class • Precision/recall break-even point: the value at which precision and recall are equal
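A minimal sketch of these measures for a single class follows. Treating the classifier output as a ranked list and sweeping a cutoff to find where precision and recall meet is an assumption made for this example; the slides do not describe how the break-even point was computed in the experiments (for instance, whether values were interpolated or averaged across categories). The document ids and scores are made up.

```python
from typing import Dict, Set, Tuple


def precision_recall(predicted: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a predicted set against the truly relevant set."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def breakeven_point(scores: Dict[str, float], relevant: Set[str]) -> float:
    """Sweep the cutoff k over the ranked documents and return the mean of
    precision and recall at the cutoff where the two are closest."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best_gap, best_value = None, 0.0
    for k in range(1, len(ranked) + 1):
        p, r = precision_recall(set(ranked[:k]), relevant)
        gap = abs(p - r)
        if best_gap is None or gap < best_gap:
            best_gap, best_value = gap, (p + r) / 2
    return best_value


# Hypothetical classifier scores for class 'x' and the documents that truly belong to it.
scores = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.6, "d5": 0.5}
relevant = {"d1", "d3", "d4"}

p, r = precision_recall({"d1", "d2"}, relevant)
bep = breakeven_point(scores, relevant)
print(p, r, bep)
```

When the cutoff equals the number of relevant documents, the predicted and relevant sets have the same size, so precision and recall coincide exactly; that is the cutoff the sweep settles on here.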

Results • Precision/recall break-even point on the Ohsumed dataset

Results • Precision/recall break-even point on the Reuters dataset

Conclusions • Introduces SVMs for text categorization • Theoretical and empirical evidence that SVMs are well suited for text categorization • Consistent improvement in accuracy over other methods
