Support Vector Machines Classification with A Very Large-scale Taxonomy

Advisor: Dr. Hsu

Presenter: Chien-Shing Chen

Authors: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma

SIGKDD, 2004

Outline

- Motivation
- Objective
- Introduction
- Dataset characteristics
- Complexity Analysis
- Effectiveness Analysis
- Experimental Settings
- Conclusions
- Personal Opinion

- Very large-scale classification taxonomies
- Hundreds of thousands of categories
- Deep hierarchies
- Skewed category distribution over documents

- It is an open question whether state-of-the-art technologies in text categorization can scale to, and perform well on, such large taxonomies
- Evaluation of SVMs in web-page classification over the full taxonomy of the Yahoo! categories

- Scalability and effectiveness
- A data analysis of the Yahoo! taxonomy
- Development of a scalable system for large-scale text categorization
- Theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical settings for classification
- Comparison of threshold tuning algorithms with respect to the time complexity and accuracy of SVMs

- Text categorization (TC) methods: SVMs, KNN, NB, …
- In recent years, the scale of TC problems has become larger and larger
- Answer this question from the viewpoints of scalability and effectiveness

- Flat SVMs vs. hierarchical SVMs
- Structure of the taxonomy tree

- SVMs find the optimal separating hyperplane between two classes by maximizing the margin between the classes’ closest points

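- Multi-class classification
- Basically, SVMs can only solve binary classification problems
- A multi-class problem is therefore decomposed into binary sub-classifiers, all of which must be fitted

- One-against-all
- N two-class classifiers (true class vs. false class), one per category

- One-against-one
- N(N-1)/2 classifiers, one per pair of categories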
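To make the difference between the two decompositions concrete, the following minimal sketch counts the binary classifiers each strategy needs. The function names are illustrative and not taken from the presentation.

```python
def one_against_all_count(n_classes: int) -> int:
    """One binary classifier per class (that class vs. everything else)."""
    return n_classes


def one_against_one_count(n_classes: int) -> int:
    """One binary classifier per unordered pair of classes."""
    return n_classes * (n_classes - 1) // 2


if __name__ == "__main__":
    for n in (3, 10, 1000):
        print(n, one_against_all_count(n), one_against_one_count(n))
```

The pairwise count grows quadratically with the number of categories, which is one reason a one-against-all (one-against-rest) decomposition is the more practical choice for very large taxonomies.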

- Given a set of n labeled training documents
- A linear discriminant function
- A corresponding classification function
- The margin of a weight vector
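The formulas belonging to these bullets are not present in the text. As a hedged reconstruction, the standard binary linear SVM notation that matches the four bullets could be written as follows. The symbols $x_i$, $y_i$, $w$, $b$, $f$, $h$ and $\gamma$ are assumptions and may differ from the authors' notation.

```latex
\begin{align}
  D &= \{(x_i, y_i)\}_{i=1}^{n}, \qquad y_i \in \{-1, +1\}
      && \text{(labeled training documents)} \\
  f(x) &= w \cdot x + b
      && \text{(linear discriminant function)} \\
  h(x) &= \operatorname{sign}\bigl(f(x)\bigr)
      && \text{(classification function)} \\
  \gamma(w) &= \min_{1 \le i \le n} \frac{y_i \,(w \cdot x_i + b)}{\lVert w \rVert}
      && \text{(margin of the weight vector)}
\end{align}
```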

- Optimal separation
- soft-margin multiclass formulation
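For reference, the familiar binary soft-margin problem is sketched below. This is the textbook form under assumed notation (slack variables $\xi_i$, penalty constant $C$), not necessarily the multiclass formulation shown on the original slide. A multiclass variant would typically keep one weight vector per class and constrain the score of the true class to exceed the scores of the other classes.

```latex
\begin{align}
  \min_{w,\, b,\, \xi} \quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
  \text{subject to} \quad & y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
\end{align}
```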

- The full domain of the Yahoo! Directory
- 292,216 categories
- 792,601 documents

- Over 76% of the Yahoo! Categories have fewer than 5 labeled documents
- The proportion of “rare categories” increases at deeper hierarchy levels
- 36% are rare categories at deep levels

- many documents have multiple labels
- On average, a document has 2.23 labels
- the largest number of labels for a single document is 31

- Split the Yahoo! Directory into a training set and a testing set with a ratio of 7:3
- Remove those categories containing only one labeled document
- 132,199 categories
- 492,617 training documents
- 275,364 testing documents
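The preprocessing described above can be sketched roughly as follows. The data layout (a dict from document id to category list), the function names, and the order of the two steps are assumptions made for illustration; the presentation does not specify them.

```python
import random
from collections import Counter


def drop_singleton_categories(doc_labels):
    """Remove categories that have only one labeled document.

    doc_labels maps a document id to its list of categories (hypothetical layout).
    Documents left without any category are dropped as well.
    """
    counts = Counter(c for labels in doc_labels.values() for c in labels)
    kept = {}
    for doc, labels in doc_labels.items():
        remaining = [c for c in labels if counts[c] > 1]
        if remaining:
            kept[doc] = remaining
    return kept


def split_train_test(doc_ids, train_fraction=0.7, seed=0):
    """Shuffle document ids reproducibly and split them 7:3 by default."""
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(len(ids) * train_fraction))
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    toy = {"d1": ["a", "b"], "d2": ["a"], "d3": ["c"]}
    filtered = drop_singleton_categories(toy)
    train, test = split_train_test(filtered)
    print(filtered, train, test)
```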

- Flat SVMs with the one-against-rest strategy
- N is the number of training documents
- M is the number of categories
- denotes the average training time per SVM model
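The cost formula itself is missing from the text. As a rough sketch of the idea, flat one-against-rest training fits M binary models, each on all N documents, so the total cost is M times the per-model training time. The super-linear per-model cost `n_docs ** exponent` and the default exponent are assumptions chosen for illustration, not values taken from the presentation.

```python
def flat_training_cost(n_docs: int, n_categories: int, exponent: float = 1.5) -> float:
    """Estimated cost of flat one-against-rest training.

    Assumes each binary SVM costs about n_docs ** exponent to train
    (a hypothetical empirical model) and that all n_categories models
    are trained on the full training set.
    """
    per_model_cost = n_docs ** exponent
    return n_categories * per_model_cost


if __name__ == "__main__":
    # Doubling the number of categories doubles the estimated cost.
    print(flat_training_cost(10_000, 100))
    print(flat_training_cost(10_000, 200))
```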

- Hierarchical
- m_i is the number of categories defined at the i-th level
- j is the size-based rank of the categories
- n_ij is the number of training documents for the j-th category at the i-th level
- n_i1 is the number of training documents for the most common category at the i-th level
- is a level-specific parameter

- was used to approximate the number of categories at the i-th level
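The definitions above (size-based rank, size of the most common category, a level-specific parameter) suggest a rank-size model for category sizes at each level. The sketch below assumes a Zipf-like power law, n_ij ≈ n_i1 · j^(-beta_i); both the functional form and the name `beta_i` are assumptions, since the original formula is not present in the text.

```python
def category_size(n_i1: float, rank: int, beta_i: float) -> float:
    """Approximate training-set size of the category with the given size rank.

    Assumed power-law model: n_ij = n_i1 * rank ** (-beta_i), where n_i1 is
    the size of the most common category at level i and beta_i is a
    hypothetical level-specific parameter.
    """
    return n_i1 * rank ** (-beta_i)


def level_sizes(n_i1: float, m_i: int, beta_i: float) -> list:
    """Approximate sizes of all m_i categories at one level, largest first."""
    return [category_size(n_i1, j, beta_i) for j in range(1, m_i + 1)]


if __name__ == "__main__":
    print(level_sizes(1000, 5, 0.9))
```

Under such a model most of the categories at a level are small, which is consistent with the skewed category distribution noted earlier in the slides.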

- For the testing phase of hierarchical SVMs
- Pachinko-machine search: start at the root; at each step, select the most likely child of the current category and descend into it, until a leaf is reached
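A minimal sketch of Pachinko-machine search over a taxonomy tree is shown below. The tree representation (a dict from category to child list) and the `score` callable, which stands in for the output of a per-category SVM, are assumptions made for illustration.

```python
def pachinko_search(root, children, score):
    """Greedy top-down search through a category tree.

    children maps a category to the list of its child categories;
    score(category) returns the classifier score of the document for
    that category (hypothetical interface). Returns the path from the
    root to the chosen leaf.
    """
    node = root
    path = [node]
    # Descend while the current category still has children.
    while children.get(node):
        node = max(children[node], key=score)
        path.append(node)
    return path


if __name__ == "__main__":
    tree = {"root": ["Arts", "Science"], "Science": ["Physics", "Biology"]}
    scores = {"Arts": 0.2, "Science": 0.7, "Physics": 0.4, "Biology": 0.6}
    print(pachinko_search("root", tree, scores.get))
```

Only the children of the currently selected category are scored at each level, so far fewer classifiers are evaluated than in the flat setting. The trade-off is that a wrong choice near the root cannot be corrected further down.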

- SCut
- For each category, tune the score threshold on a validation set until the performance of the classifier is optimal for that category
- Fix the per-category thresholds when applying the classifier to new documents in the test set
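As a hedged illustration of SCut, the sketch below tunes one category's threshold on validation scores and returns the value that would then be held fixed at test time. Using F1 as the performance measure and trying each observed score as a candidate threshold are assumptions; the slide does not name the measure.

```python
def f1_score(predicted, actual):
    """F1 for one category, given parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def scut_threshold(scores, labels):
    """Pick the score threshold that maximizes F1 on validation data.

    scores: classifier scores of validation documents for one category.
    labels: True where the document really belongs to the category.
    """
    best_threshold, best_f1 = None, -1.0
    for candidate in sorted(set(scores)):
        predicted = [s >= candidate for s in scores]
        f1 = f1_score(predicted, labels)
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold


if __name__ == "__main__":
    val_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
    val_labels = [True, True, False, True, False]
    print(scut_threshold(val_scores, val_labels))
```

This search is repeated for every category, so its cost grows with the number of categories. For rare categories there are few positive validation examples, which makes the tuned threshold unreliable.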

- RCut
- Sort categories by score and assign YES to each of the t top-ranking categories
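RCut can be sketched in a few lines: for each document, rank the categories by score and keep the top t. The dict-based input format and the function name are illustrative assumptions.

```python
def rcut(category_scores, t):
    """Assign YES to the t top-ranking categories for one document.

    category_scores maps a category name to the classifier score of the
    document for that category (hypothetical layout).
    """
    ranked = sorted(category_scores, key=category_scores.get, reverse=True)
    return set(ranked[:t])


if __name__ == "__main__":
    print(rcut({"Arts": 0.1, "Science": 0.9, "Sports": 0.5}, t=2))
```

Unlike SCut, RCut needs no per-category tuning, only a single value of t.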

- Compared to scalability analysis, classification effectiveness is not as clear and predictable
- It is affected by many other factors

- Potential problems of SVMs
- Noisy and imbalanced training data

- The performance of hierarchical SVMs cannot be expected to be very good

- 10 machines, each with four 3GHz CPUs and 4 GB of memory

- Examined how text categorization algorithms scale to very large problems, especially large-scale Web taxonomies

- Drawback
- Lower performance at deep levels

- Application
- Combine SVMs with a concept hierarchy tree
- Applicable to text or other domains
- Pachinko-machine search…

- Future Work
- Learn about SVM kernels and how to implement them?