Support vector machines classification with a very large scale taxonomy

Support Vector Machines Classification with A Very Large-scale Taxonomy. Advisor: Dr. Hsu. Presenter: Chien-Shing Chen. Authors: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma. SIGKDD, 2004.


Support Vector Machines Classification with A Very Large-scale Taxonomy

Advisor: Dr. Hsu

Presenter: Chien-Shing Chen

Authors: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma

SIGKDD, 2004


Outline

  • Motivation

  • Objective

  • Introduction

  • Dataset characteristic

  • Complexity Analysis

  • Effectiveness Analysis

  • Experimental Settings

  • Conclusions

  • Personal Opinion


Motivation

  • Very large-scale classification taxonomies have

    • Hundreds of thousands of categories

    • Deep hierarchies

    • Skewed category distribution over documents

  • It is an open question whether state-of-the-art text categorization technologies can scale to, and perform well on, such taxonomies

  • This work evaluates SVMs in web-page classification over the full taxonomy of the Yahoo! categories


Objective

  • Study scalability and effectiveness

  • Data analysis of the Yahoo! taxonomy

  • Development of a scalable system for large-scale text categorization

  • Theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical settings for classification

  • Evaluation of threshold tuning algorithms with respect to the time complexity and accuracy of SVMs


Introduction

  • Text categorization (TC) methods: SVMs, kNN, NB, …

  • In recent years, the scale of TC problems has grown larger and larger

  • Whether these methods still work at that scale is examined from the viewpoints of scalability and effectiveness


SVM

  • Flat SVMs vs. hierarchical SVMs

    • Hierarchical SVMs use the structure of the taxonomy tree


SVM

  • Find the optimal separating hyperplane between the two classes by maximizing the margin between the classes’ closest points


SVM

  • Multi-class classification

    • Basically, SVMs can only solve binary classification problems

    • A multi-class problem is decomposed into binary sub-classifiers, all of which are fitted

  • One-against-all

    • N two-class classifiers (true class vs. false class), one per class

  • One-against-one

    • N(N-1)/2 classifiers, one per pair of classes
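The two decomposition strategies can be illustrated by enumerating the binary tasks each one creates. This is a minimal sketch: the class labels and function names are hypothetical, and no SVM is actually trained.

```python
from itertools import combinations


def one_against_all_tasks(classes):
    """One binary task per class: that class (true) vs. all others (false)."""
    return [(c, "rest") for c in classes]


def one_against_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


classes = ["Arts", "Business", "Science", "Sports"]  # hypothetical labels
ova = one_against_all_tasks(classes)
ovo = one_against_one_tasks(classes)
print(len(ova))  # N = 4
print(len(ovo))  # N(N-1)/2 = 6
```

With hundreds of thousands of categories, the quadratic growth of one-against-one quickly becomes impractical, which is one reason to prefer the one-against-all (one-against-rest) strategy at this scale.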


SVM

  • Given a set of n labeled training documents

  • A linear discriminant function

  • The corresponding classification function

  • The margin of a weight vector
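For reference, the standard textbook forms of these four quantities for a linear SVM are sketched below. The symbols (D, x_i, y_i, w, b, f, h, γ) are assumed here and may not match the notation of the original slide.

```latex
\begin{align*}
D &= \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}, \qquad y_i \in \{-1, +1\} \\
f(\mathbf{x}) &= \mathbf{w}^{\top}\mathbf{x} + b \\
h(\mathbf{x}) &= \operatorname{sign}\bigl(f(\mathbf{x})\bigr) \\
\gamma(\mathbf{w}, b) &= \min_{1 \le i \le n} \frac{y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b)}{\lVert \mathbf{w} \rVert}
\end{align*}
```

Here the margin is the geometric distance from the hyperplane to the closest training document, which is the quantity an SVM maximizes.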


SVM

  • Optimal separation

  • soft-margin multiclass formulation


Database: first characteristic

  • The full domain of the Yahoo! Directory

    • 292,216 categories

    • 792,601 documents


Database: second characteristic

  • Over 76% of the Yahoo! categories have fewer than 5 labeled documents

  • The proportion of “rare categories” increases at deeper hierarchy levels

    • 36% are rare categories at deep levels


Database: third characteristic

  • Many documents have multiple labels

    • A document has 2.23 labels on average

    • The largest number of labels for a single document is 31


Database

  • Split the Yahoo! Directory into a training set and a testing set with a ratio of 7:3

  • Remove those categories containing only one labeled document

    • 132,199 categories

    • 492,617 training documents

    • 275,364 testing documents


Complexity and Effectiveness

  • Flat SVMs, with the one-against-rest strategy

  • N is the number of training documents

  • M is the number of categories

  • The analysis is expressed in terms of the average training time per SVM model
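A rough way to see how flat one-against-rest training scales is to multiply the number of categories by a per-model cost. The sketch below assumes each binary SVM costs about N ** c to train; the exponent c and the function name are illustrative assumptions, not values taken from the slides. N and M are the training-set figures from the Database slide.

```python
def flat_svm_training_cost(num_docs, num_categories, c=1.5):
    """Relative training cost of flat one-against-rest SVMs.

    Assumes each binary SVM costs about num_docs ** c to train;
    the exponent c is an illustrative assumption, not a slide value.
    """
    per_model = num_docs ** c  # assumed cost of one binary model
    return num_categories * per_model  # one model per category


N = 492_617  # training documents
M = 132_199  # categories after removing single-document categories
cost = flat_svm_training_cost(N, M)
print(f"relative cost: {cost:.3e}")
```

The result is only a relative unit count, but it shows that the flat approach pays the full-dataset cost once for every category.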


Complexity and Effectiveness

  • Hierarchical SVMs

  • m_i is the number of categories defined at the i-th level

  • j is the size-based rank of the categories

  • n_ij is the number of training documents for the j-th category at the i-th level

  • n_i1 is the number of training documents for the most common category at the i-th level

  • A level-specific parameter relates n_ij to n_i1 and the rank j


Complexity and Effectiveness

  • was used to approximate the number of categories at the i-th level


Complexity and Effectiveness

  • For the testing phase of hierarchical SVMs

  • Pachinko-machine search: starting from the root, repeatedly select the most likely child of the current category and descend into it, until a leaf is reached
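The Pachinko-machine search described above can be sketched as a greedy top-down walk. The toy taxonomy, the fixed scores standing in for per-category SVM outputs, and the function name are all hypothetical.

```python
def pachinko_search(root, children, score):
    """Greedy top-down search: from the root, repeatedly descend into the
    highest-scoring child until a leaf (a node without children) is reached."""
    path = [root]
    node = root
    while children.get(node):
        node = max(children[node], key=score)
        path.append(node)
    return path


# Hypothetical toy taxonomy; scores stand in for SVM outputs on one document.
children = {
    "Root": ["Science", "Sports"],
    "Science": ["Physics", "Biology"],
    "Sports": ["Soccer"],
}
scores = {"Science": 0.8, "Sports": 0.1, "Physics": -0.2,
          "Biology": 0.6, "Soccer": 0.9}
path = pachinko_search("Root", children, scores.get)
print(path)  # ['Root', 'Science', 'Biology']
```

Only the children of the current node are scored at each step, so "Soccer" is never considered even though its score is the highest. This keeps testing cheap, but an early wrong turn cannot be corrected at deeper levels.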



Complexity of SVM Classification with Threshold Tuning

  • SCut

    • For each category, tune a score threshold so that the classifier's performance on that category is optimal

    • Fix the per-category thresholds when applying the classifier to new documents in the test set

  • RCut

    • For each document, sort the categories by score and assign YES to each of the t top-ranking categories
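Both thresholding strategies can be sketched in a few lines. This is a minimal illustration under stated assumptions: F1 is assumed as the performance measure for SCut, candidate thresholds are limited to the observed validation scores, and all names and data are hypothetical.

```python
def rcut(scores, t):
    """RCut: assign YES to the t top-scoring categories of one document."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:t])


def f1(predicted, actual):
    """F1 of boolean predictions against boolean labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def scut(val_scores, val_labels):
    """SCut: pick the score threshold that maximizes F1 for one category
    on validation documents; candidates are the observed scores."""
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(val_scores)):
        predicted = [s >= threshold for s in val_scores]
        value = f1(predicted, val_labels)
        if value > best_f1:
            best_threshold, best_f1 = threshold, value
    return best_threshold


# Hypothetical scores of one document across four categories.
doc_scores = {"Arts": 0.2, "Science": 1.3, "Sports": -0.5, "Health": 0.9}
top2 = rcut(doc_scores, 2)

# Hypothetical validation scores and labels for a single category.
val_scores = [0.9, 0.4, 0.1, -0.3, -0.8]
val_labels = [True, True, False, True, False]
threshold = scut(val_scores, val_labels)
print(sorted(top2), threshold)
```

SCut needs one tuning pass per category, so its cost grows with the number of categories, whereas RCut only needs a sort per document.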


Effectiveness Analysis

  • Compared to scalability, classification effectiveness is not as clear and predictable

    • It is affected by many other factors

  • Potential problems for SVMs

    • Noisy and imbalanced training data

  • The performance of hierarchical SVMs cannot be expected to be very good


Experimental Results

  • 10 machines, each with four 3GHz CPUs and 4 GB of memory



Conclusions

  • Scaling text categorization algorithms to very large problems, especially large-scale Web taxonomies


Conclusions

  • Drawback

    • Lower performance at deep levels

  • Application

    • Combine SVMs with a concept hierarchy tree

    • Applicable to text or other domains

    • Pachinko-machine search…

  • Future Work

    • Learn SVM kernels for the implementation?

