Support vector machines classification with a very large scale taxonomy


Support Vector Machines Classification with A Very Large-scale Taxonomy

Advisor: Dr. Hsu

Presenter: Chien-Shing Chen

Authors: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma

SIGKDD, 2004


Outline

  • Motivation

  • Objective

  • Introduction

  • Dataset characteristics

  • Complexity Analysis

  • Effectiveness Analysis

  • Experimental Settings

  • Conclusions

  • Personal Opinion


Motivation

  • Very large-scale classification taxonomies have

    • hundreds of thousands of categories

    • deep hierarchies

    • a skewed category distribution over documents

  • It is an open question whether state-of-the-art text categorization technologies can scale to, and remain effective on, such taxonomies

  • The paper evaluates SVMs in web-page classification over the full taxonomy of the Yahoo! categories


Objective

  • Study both scalability and effectiveness

  • A data analysis of the Yahoo! taxonomy

  • Development of a scalable system for large-scale text categorization

  • Theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical settings for classification

  • Evaluation of threshold tuning algorithms with respect to the time complexity and accuracy of SVMs


Introduction

  • Text categorization (TC) methods: SVMs, kNN, Naive Bayes (NB), …

  • In recent years, the scale of TC problems has become larger and larger

  • Can these methods cope with such scale? The paper answers this question from the viewpoints of scalability and effectiveness


SVM

  • Flat SVMs vs. hierarchical SVMs

    • hierarchical SVMs make use of the structure of the taxonomy tree


SVM

  • An SVM finds the optimal separating hyperplane between two classes by maximizing the margin between the classes' closest points


SVM

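  • Multi-class classification

    • SVMs natively solve only binary classification problems

    • a multi-class problem is therefore decomposed into binary sub-classifiers, all of which are fitted

  • One-against-all

    • N two-class classifiers, each separating one class (the true class) from all the others (the false class)

  • One-against-one

    • N(N-1)/2 classifiers, one per pair of classes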
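The two decomposition strategies above can be illustrated with a minimal sketch that only enumerates the binary tasks each strategy creates, without training anything. The function names and the four category labels are hypothetical and chosen purely for illustration.

```python
from itertools import combinations

def one_against_all_tasks(classes):
    """One binary task per class: that class (true) vs. all others (false)."""
    return [(c, [o for o in classes if o != c]) for c in classes]

def one_against_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))

# Hypothetical category labels, for illustration only.
classes = ["Arts", "Business", "Science", "Sports"]
n = len(classes)

ova = one_against_all_tasks(classes)
ovo = one_against_one_tasks(classes)

print(len(ova))  # N binary classifiers
print(len(ovo))  # N(N-1)/2 binary classifiers
```

With hundreds of thousands of categories, the quadratic growth of one-against-one quickly becomes impractical, which is one reason a one-against-rest decomposition is the natural baseline at this scale.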


SVM

  • Given a set of n labeled training documents

  • a linear discriminant function is learned

  • with a corresponding classification function

  • and the margin of a weight vector is defined
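In standard binary linear SVM notation, the four items on this slide can be written as follows. The symbols D, x_i, y_i, w, b, f, h and γ are assumptions taken from the usual textbook formulation and may differ from those on the original slide.

```latex
\begin{align}
D &= \{(x_i, y_i)\}_{i=1}^{n}, \qquad x_i \in \mathbb{R}^d,\; y_i \in \{-1, +1\} \\
f(x) &= w \cdot x + b \\
h(x) &= \operatorname{sign}\bigl(f(x)\bigr) \\
\gamma(w, b) &= \min_{1 \le i \le n} \frac{y_i \,(w \cdot x_i + b)}{\lVert w \rVert}
\end{align}
```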


SVM

  • Optimal separation

  • soft-margin multiclass formulation
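As a reference point, the standard binary soft-margin problem is sketched below. It assumes training pairs (x_i, y_i) with y_i in {-1, +1}, a weight vector w, a bias b, slack variables ξ_i and a trade-off constant C; these symbols are assumptions, and the multiclass variant the slide refers to is not reproduced here.

```latex
\begin{align}
\min_{w,\, b,\, \xi} \quad & \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
\end{align}
```

Setting every ξ_i to zero recovers the hard-margin (optimal separation) case, which is only feasible when the training data are linearly separable.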


Dataset: First Characteristic

  • The full domain of the Yahoo! Directory

    • 292,216 categories

    • 792,601 documents


Dataset: Second Characteristic

  • Over 76% of the Yahoo! categories have fewer than 5 labeled documents

  • The proportion of "rare categories" increases at deeper hierarchy levels

    • 36% are rare categories at deep levels


Dataset: Third Characteristic

  • Many documents have multiple labels

    • a document has 2.23 labels on average

    • the largest number of labels for a single document is 31


Dataset

  • The Yahoo! Directory is split into a training set and a testing set with a ratio of 7:3

  • Categories containing only one labeled document are removed, leaving

    • 132,199 categories

    • 492,617 training documents and 275,364 testing documents
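The preparation steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the order of the two steps, the random split, the choice to drop documents left without any label, and all names and toy data are hypothetical.

```python
import random
from collections import Counter

def drop_singleton_categories(docs):
    """Remove categories that label only one document.

    Assumption: documents left with no label afterwards are dropped too.
    """
    counts = Counter(label for labels in docs.values() for label in labels)
    kept = {}
    for doc_id, labels in docs.items():
        remaining = [label for label in labels if counts[label] > 1]
        if remaining:
            kept[doc_id] = remaining
    return kept

def split_train_test(docs, train_fraction=0.7, seed=0):
    """Randomly split document ids into training and testing sets (7:3)."""
    ids = sorted(docs)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

# Hypothetical toy corpus: document id -> list of category labels.
docs = {f"d{i}": ["A"] for i in range(10)}
docs["d0"] = ["A", "B"]   # "B" labels only this document
docs["x"] = ["C"]         # "C" labels only this document

filtered = drop_singleton_categories(docs)
train_ids, test_ids = split_train_test(filtered)
print(len(filtered), len(train_ids), len(test_ids))
```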


Complexity and Effectiveness

  • Flat SVMs use the one-against-rest strategy

  • N is the number of training documents

  • M is the number of categories

  • the analysis also uses the average training time per SVM model


Complexity and Effectiveness

  • Hierarchical SVMs

  • m_i is the number of categories defined at the i-th level

  • j is the size-based rank of the categories

  • n_ij is the number of training documents for the j-th category at the i-th level

  • n_i1 is the number of training documents for the most common category at the i-th level

  • the category-size distribution at each level also depends on a level-specific parameter


Complexity and Effectiveness

  • An approximation was used for the number of categories at the i-th level


Complexity and Effectiveness

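  • In the testing phase, hierarchical SVMs search the taxonomy top-down

  • Pachinko-machine search: start from the root and, at each step, select and open the most probable child of the current category, until a leaf is reached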
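The Pachinko-machine search described above can be sketched as a greedy walk down the taxonomy. The toy taxonomy, the score dictionary and the function name below are hypothetical; in a real system each score would come from the SVM trained for that category and would only be computed for the children of the node currently being expanded.

```python
def pachinko_search(tree, scores, root="root"):
    """Greedy top-down search: follow the best-scoring child until a leaf."""
    path = []
    node = root
    while tree.get(node):  # stop when the current node has no children
        node = max(tree[node], key=lambda child: scores[child])
        path.append(node)
    return path

# Hypothetical toy taxonomy: parent -> list of children.
tree = {
    "root": ["Science", "Sports"],
    "Science": ["Physics", "Biology"],
    "Sports": ["Soccer", "Tennis"],
}

# Hypothetical per-category classifier scores for a single document.
scores = {
    "Science": 0.8, "Sports": 0.1,
    "Physics": -0.2, "Biology": 0.6,
    "Soccer": 0.9, "Tennis": 0.3,
}

path = pachinko_search(tree, scores)
print(path)  # ['Science', 'Biology']
```

Note that "Soccer" has the highest score overall but is never reached, because the search commits to "Science" at the first level. An error near the root therefore cannot be corrected further down, which is the price paid for evaluating only a small fraction of the classifiers per document.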


Complexity of SVM Classification with Threshold Tuning

  • SCut

    • For each category, the threshold is tuned so that the classifier's performance on that category is optimal

    • The per-category thresholds are then fixed when applying the classifier to new documents in the test set

  • RCut

    • For each document, sort the categories by score and assign YES to each of the t top-ranking categories
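Both thresholding strategies can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the use of F1 on held-out validation scores as the SCut criterion, the candidate thresholds, and all names and toy data are assumptions.

```python
def rcut(doc_scores, t):
    """RCut: assign YES to the t top-ranking categories of one document."""
    ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
    return set(ranked[:t])

def f1(pred, truth):
    """F1 for one category from parallel lists of booleans."""
    tp = sum(p and y for p, y in zip(pred, truth))
    fp = sum(p and not y for p, y in zip(pred, truth))
    fn = sum((not p) and y for p, y in zip(pred, truth))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def scut(val_scores, val_truth):
    """SCut: pick, for one category, the threshold with the best validation F1."""
    candidates = sorted(set(val_scores))
    return max(
        candidates,
        key=lambda th: f1([s >= th for s in val_scores], val_truth),
    )

# Hypothetical scores of one document for three categories.
doc_scores = {"Physics": 0.7, "Biology": 0.2, "Soccer": -0.5}
top2 = rcut(doc_scores, 2)

# Hypothetical validation scores and true labels for one category.
val_scores = [0.9, 0.4, 0.1, -0.3]
val_truth = [True, True, False, False]
threshold = scut(val_scores, val_truth)

print(top2, threshold)
```

RCut needs only one sort per document, while SCut requires a scan over candidate thresholds for every category, so its tuning cost grows with the number of categories.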


Effectiveness Analysis

  • Compared with scalability, classification effectiveness is not as clear or predictable

    • it can be affected by many other factors

  • Potential problems for SVMs

    • noisy and imbalanced training data

  • The performance of hierarchical SVMs cannot be expected to be very good


Experimental Results

  • 10 machines, each with four 3GHz CPUs and 4 GB of memory


Conclusions

  • The study examines how text categorization algorithms scale to very large problems, especially large-scale Web taxonomies


Conclusions

  • Drawback

    • Lower performance at deep levels

  • Application

    • Combine SVMs with a concept hierarchy tree

    • Apply to text or other domains

    • Pachinko-machine search…

  • Future Work

    • Learn SVM kernels for the implementation?

