
Text Categorization With Support Vector Machines: Learning With Many Relevant Features - PowerPoint PPT Presentation



Text Categorization With Support Vector Machines: Learning With Many Relevant Features. By Thorsten Joachims. Presented by Meghneel Gore.

Presentation Transcript
Text Categorization With Support Vector Machines: Learning With Many Relevant Features

By Thorsten Joachims

Presented By Meghneel Gore


Goal of Text Categorization

  • Classify documents into a number of pre-defined categories.

    • Documents can be in multiple categories

    • Documents can be in none of the categories


Applications of Text Categorization

  • Categorization of news stories for online retrieval

  • Finding interesting information from the WWW

  • Guiding a user's search through hypertext


Representation of Text

  • Removal of stop words

  • Reduction of each word to its stem

  • Preparation of the feature vector


Representation of Text

Figure: a document is mapped to a document vector of stemmed-word counts (2 Comput, 1 Process, 2 Buy, 3 Memory, ...)
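
The three preprocessing steps (stop-word removal, stemming, building the feature vector) can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the paper: the stop-word list and the `crude_stem` suffix stripper are hypothetical stand-ins for a real stop list and a real stemmer such as Porter's, and the sample sentence is made up.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "in", "for", "with"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    if word.endswith("ss"):  # leave words like "process" untouched
        return word
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def document_vector(text):
    """Map raw text to a sparse vector of stem counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)

vec = document_vector(
    "Buy memory for the computer: computers process memory, "
    "and buying memory is cheap."
)
```

Here `vec` holds one count per word stem, which is the sparse, high-dimensional representation the rest of the presentation assumes.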


What's Next...

  • Appropriateness of support vector machines for this application

  • Support vector machine theory

  • Conventional learning methods

  • Experiments

  • Results

  • Conclusions


Why SVMs?

  • High dimensional input space

  • Few irrelevant features

  • Sparse document vectors

  • Most text categorization problems are linearly separable


Support Vector Machines

Visualization of a Support Vector Machine


Support Vector Machines

  • Structural risk minimization

    • We define a structure of hypothesis spaces $H_i$ such that their respective VC dimensions $d_i$ increase
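
For reference, structural risk minimization is usually motivated by a bound of the following form, which holds with probability at least $1 - \eta$. The symbols $n$ (number of training examples), $\eta$ (confidence parameter) and $d$ (VC dimension of the hypothesis space containing $h$) follow the common statement of the bound and are assumptions about notation, not text recovered from the slides.

```latex
P(\mathrm{error}(h)) \le \mathrm{train\_error}(h)
  + 2\sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) - \ln\frac{\eta}{4}}{n}}

H_1 \subset H_2 \subset \dots \subset H_i \subset \dots
  \quad \text{with} \quad \forall i:\ d_i \le d_{i+1}
```

The goal is to pick the hypothesis space in this nested structure for which the right-hand side of the bound is smallest.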


Support Vector Machines

  • Lemma [Vapnik, 1982]

    Consider hyperplanes $h(\vec{d}) = \mathrm{sign}\{\vec{w} \cdot \vec{d} + b\}$ as hypotheses. If all example vectors $\vec{d}_i$ are contained in a hypersphere of radius $R$ and it is required that, for all examples $\vec{d}_i$, $|\vec{w} \cdot \vec{d}_i + b| \ge 1$ with $\|\vec{w}\| = A$, then this set of hyperplanes has a VC dimension $d$ bounded by $d \le \min([R^2 A^2], n) + 1$.


Support Vector Machines

  • Minimize $\|\vec{w}\|$

  • Such that $\forall i: y_i [\vec{w} \cdot \vec{d}_i + b] \ge 1$, where $y_i \in \{+1, -1\}$ is the class label of example $\vec{d}_i$

Conventional Learning Methods

  • Naïve Bayes classifier

  • Rocchio algorithm

  • k-nearest neighbors

  • Decision tree classifier


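Naïve Bayes Classifier

  • Consider a document vector with attributes $a_1, a_2, \ldots, a_n$ and a target value $v$ drawn from a set $V$ of classes

  • Bayesian approach: choose the most probable target value given the attributes

    $v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$

  • We can rewrite that using Bayes' theorem as

    $v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j) \, P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j) \, P(v_j)$

  • The naïve Bayes method assumes that the attributes are conditionally independent given the target value

    $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$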
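
The naïve Bayes decision rule, which picks the class $v$ maximizing $P(v) \prod_i P(a_i \mid v)$, can be sketched in Python as a multinomial classifier over word stems. The add-one (Laplace) smoothing, the log-space scoring, and all names and toy documents are assumptions made for illustration; the slides do not specify an estimator.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect the counts needed to estimate P(v) and P(a | v)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return class_counts, word_counts, vocab

def classify(tokens, model):
    """Return the class maximizing log P(v) + sum_i log P(a_i | v)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                # Add-one smoothing avoids zero probabilities.
                p = (word_counts[label][t] + 1) / (total_words + len(vocab))
                score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [["wheat", "grain", "price"], ["grain", "export"],
        ["comput", "memory", "price"], ["memory", "chip"]]
labels = ["grain", "grain", "tech", "tech"]
model = train_naive_bayes(docs, labels)
label_a = classify(["memory", "price"], model)
label_b = classify(["wheat", "export"], model)
```

Working in log space keeps the product of many small probabilities from underflowing, which matters for long documents.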


Experiments

  • Datasets

  • Performance measures

  • Results


Datasets

  • Reuters-21578 dataset

    • 9603 training documents

    • 3299 test documents

  • Ohsumed corpus

    • 10000 training documents

    • 10000 test documents


Performance Measures

  • Precision

    • Probability that a document predicted to be in class ‘x’ truly belongs to that class

  • Recall

    • Probability that a document belonging to class ‘x’ is classified into that class

  • Precision/recall break-even point

    • The value at which precision and recall are equal
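
These measures are simple to compute. The sketch below uses hypothetical function names and toy data: it computes precision and recall for one class, then approximates the break-even point by sweeping how many of the top-scored documents are assigned to the class and keeping the cutoff where precision and recall are closest. Interpolation schemes used in practice may differ.

```python
def precision_recall(predicted, actual):
    """predicted / actual: sets of document ids assigned to / truly in the class."""
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

def breakeven_point(scores, actual):
    """scores: dict doc_id -> classifier score (higher = more likely in class)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = None
    for k in range(1, len(ranked) + 1):
        p, r = precision_recall(set(ranked[:k]), actual)
        gap = abs(p - r)
        if best is None or gap < best[0]:
            best = (gap, (p + r) / 2)
    return best[1]

scores = {"d1": 0.9, "d2": 0.8, "d3": 0.4, "d4": 0.3, "d5": 0.1}
actual = {"d1", "d3", "d4"}
p, r = precision_recall({"d1", "d2"}, actual)
bep = breakeven_point(scores, actual)
```

In this toy example, precision and recall coincide when exactly as many documents are assigned to the class as truly belong to it.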


Results

Precision/recall break-even point on Ohsumed dataset


Results

Precision/recall break-even point on Reuters dataset


Conclusions

  • Introduces SVMs for text categorization

  • Theoretical and empirical evidence that SVMs are well suited for text categorization

  • Consistent improvement in accuracy over other methods

