This document explores the use of Support Vector Machines (SVM) for text categorization, addressing the challenges of classifying documents into multiple predefined categories or none at all. Key applications include categorizing news stories, enhancing information retrieval from the web, and guiding user searches. We discuss the theoretical foundations of SVM, compare it with traditional learning methods like Naïve Bayes and K-Nearest Neighbors, and present empirical results demonstrating SVM's superior accuracy across various datasets. The findings underline SVM's efficacy in managing high-dimensional input spaces with few irrelevant features.
Text Categorization With Support Vector Machines: Learning With Many Relevant Features By Thorsten Joachims Presented By Meghneel Gore
Goal of Text Categorization • Classify documents into a number of pre-defined categories. • Documents can be in multiple categories • Documents can be in none of the categories
Applications of Text Categorization • Categorization of news stories for online retrieval • Finding interesting information from the WWW • Guiding a user's search through hypertext
Representation of Text • Removal of stop words • Reduction of word to its stem • Preparation of feature vector
Representation of Text • Figure: a document and its document vector of stemmed-term counts (2 Comput, 1 Process, 2 Buy, 3 Memory, ...)
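The three preparation steps can be sketched in a few lines of Python. This is a minimal illustration, not the preprocessing used in the paper: the stop-word list and the suffix-stripping `stem` function are hypothetical stand-ins for a full stop list and a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "for", "in", "this"}

# Toy suffix stripper standing in for a real stemming algorithm (assumption).
SUFFIXES = ("ing", "ers", "er", "es", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_vector(text):
    """Map raw text to a sparse vector of stemmed-term counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)


vec = document_vector("Buying computers: the computer processes memory.")
print(dict(vec))
```

Storing the vector as a `Counter` keeps only the terms that occur, which matches the sparsity of document vectors noted later in the talk.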
What's Next... • Appropriateness of support vector machines for this application • Support vector machine theory • Conventional learning methods • Experiments • Results • Conclusions
Why SVMs? • High dimensional input space • Few irrelevant features • Sparse document vectors • Most text categorization problems are linearly separable
Support Vector Machines Visualization of a Support Vector Machine
Support Vector Machines • Structural risk minimization
Support Vector Machines • We define a structure of hypothesis spaces H_i such that their respective VC dimensions d_i increase
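For reference, the generalization bound that structural risk minimization builds on is commonly written as below. This is a sketch in assumed notation, not a formula taken from the slide text: n is the number of training examples, d the VC dimension of the hypothesis space containing h, and η a confidence parameter, with the bound holding with probability at least 1 − η.

```latex
P(\mathrm{error}(h)) \;\le\; \mathrm{train\_error}(h)
  + 2\sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) - \ln\frac{\eta}{4}}{n}}
```

Choosing the hypothesis space H_i trades off the two terms: a larger d_i can lower the training error but raises the square-root term.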
Support Vector Machines • Lemma [Vapnik, 1982] Consider hyperplanes as hypotheses
Support Vector Machines If all example vectors are contained in a hypersphere of radius R and it is required that
Support Vector Machines • Then this set of hyperplanes has a VC dimension d bounded by
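The lemma's three parts can be written out as follows. This is a hedged reconstruction following the usual statement of the result. The symbols are assumptions, since the formulas do not appear in the slide text: x_i are the example vectors, w and b the hyperplane's weight vector and offset, A the norm of w, and N the dimensionality of the input space.

```latex
% Hypotheses: separating hyperplanes
h(\vec{x}) = \operatorname{sign}\{\vec{w} \cdot \vec{x} + b\}

% Requirement on all examples, all lying in a hypersphere of radius R
\left|\vec{w} \cdot \vec{x}_i + b\right| \ge 1 \quad \text{for all } i,
\qquad \lVert \vec{w} \rVert = A

% Resulting bound on the VC dimension d
d \;\le\; \min\!\left(\left[R^2 A^2\right],\, N\right) + 1
```

The point of the bound is that d depends on the norm of w and not necessarily on the number of features N, which is why a small-norm hyperplane can generalize well even in a very high dimensional input space.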
Support Vector Machines • Minimize • Such that
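In words, the problem usually stated at this point is: minimize the norm of the weight vector w such that y_i (w · x_i + b) ≥ 1 for every training example (x_i, y_i); the notation is assumed here, not taken from the slide. The sketch below does not solve that quadratic program. It swaps in a simpler technique, subgradient descent on the regularized hinge loss (a soft-margin variant), only to show the idea on toy data. The function names, learning rate, and regularization constant are all illustrative assumptions.

```python
def train_linear_svm(examples, labels, epochs=200, lr=0.1, lam=0.01):
    """Subgradient descent on lam/2 * ||w||^2 + hinge loss (soft-margin sketch)."""
    dim = len(examples[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrinking w corresponds to preferring a wide margin.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Example violates the margin constraint: move the hyperplane toward it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Classify x by the side of the hyperplane it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (hypothetical).
X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

A production system would use a dedicated SVM solver; this loop is only meant to make the roles of the objective (small w) and the constraints (margin at least 1) concrete.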
Conventional Learning Methods • Naïve Bayes classifier • Rocchio algorithm • K-nearest Neighbors • Decision tree classifier
Naïve Bayes Classifier • Consider a document vector with attributes a1, a2… an with target values v • Bayesian approach:
Naïve Bayes Classifier • We can rewrite that using Bayes theorem as
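In the notation of the previous slide (attributes a_1, ..., a_n and target value v), the Bayesian approach and its rewriting are commonly given as below. This is a sketch of the standard textbook form; V, the set of possible target values, is an assumed symbol.

```latex
v_{MAP} = \arg\max_{v \in V} P(v \mid a_1, a_2, \ldots, a_n)
        = \arg\max_{v \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v)\, P(v)}{P(a_1, a_2, \ldots, a_n)}
        = \arg\max_{v \in V} P(a_1, a_2, \ldots, a_n \mid v)\, P(v)
```

The denominator can be dropped in the last step because it does not depend on v.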
Naïve Bayes Classifier • Naïve Bayes method assumes that the attributes are independent
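Under the independence assumption, the joint likelihood factors into a product of per-attribute probabilities, so classification reduces to picking the v that maximizes P(v) · Π P(a_i | v). A minimal sketch over word tokens follows. The add-one (Laplace) smoothing, the function names, and the toy training data are assumptions for illustration, not details from the talk.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: class name per document."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(tokens, model):
    """Return the class maximizing log P(v) + sum_i log P(a_i | v)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior P(v)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the product.
            p = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus with two categories.
docs = [["wheat", "grain", "export"], ["grain", "harvest"],
        ["stock", "share", "profit"], ["profit", "merger"]]
labels = ["grain", "grain", "earn", "earn"]
model = train_naive_bayes(docs, labels)
print(classify(["grain", "export"], model), classify(["profit", "share"], model))
```

Working in log space avoids numerical underflow when documents contain many words.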
Experiments • Datasets • Performance measures • Results
Datasets • Reuters-21578 dataset • 9603 training examples • 3299 testing documents • Ohsumed Corpus • 10000 training documents • 10000 testing examples
Performance Measures • Precision • Probability that a document predicted to be in class ‘x’ truly belongs to that class • Recall • Probability that a document belonging to class ‘x’ is classified into that class • Precision/recall breakeven point
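These measures can be computed directly from true labels and classifier outputs. The sketch below is one simple way to locate a breakeven point, assumed here for illustration: it predicts as positive exactly as many top-scored documents as there are true positives, which forces precision and recall to coincide. The function names and the toy scores are hypothetical, and evaluation procedures in practice may instead sweep a threshold and interpolate.

```python
def precision_recall(actual, predicted):
    """actual, predicted: parallel lists of booleans for one category."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def breakeven_point(actual, scores):
    """Predict the top-k scored documents as positive, with k = number of true positives."""
    n_pos = sum(1 for a in actual if a)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_pos])
    predicted = [i in top for i in range(len(scores))]
    return precision_recall(actual, predicted)


# Hypothetical labels and classifier scores for six documents.
actual = [True, True, True, False, False, False]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
p, r = breakeven_point(actual, scores)
print(p, r)
```

With k equal to the number of true positives, false positives and false negatives are equal in number, so the two measures agree by construction.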
Results Precision/recall break-even point on Ohsumed dataset
Results Precision/recall break-even point on Reuters dataset
Conclusions • Introduces SVMs for text categorization • Theoretical and empirical evidence that SVMs are well suited for text categorization • Consistent improvement in accuracy over other methods