Support Vector Machines Classification with A Very Large-scale Taxonomy

Support Vector Machines Classification with A Very Large-scale Taxonomy. Advisor : Dr. Hsu Presenter : Chien-Shing Chen Author: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma. SIGKDD , 2004. Outline. Motivation Objective Introduction


Support Vector Machines Classification with A Very Large-scale Taxonomy

Advisor: Dr. Hsu

Presenter: Chien-Shing Chen

Authors: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma

SIGKDD, 2004

Outline
  • Motivation
  • Objective
  • Introduction
  • Dataset characteristic
  • Complexity Analysis
  • Effectiveness Analysis
  • Experimental Settings
  • Conclusions
  • Personal Opinion
Motivation
  • Very large-scale classification taxonomies have:
    • Hundreds of thousands of categories
    • Deep hierarchies
    • Skewed category distribution over documents
  • It is an open question whether state-of-the-art text categorization technologies can scale to, and perform well on, such taxonomies
  • This work evaluates SVMs in web-page classification over the full taxonomy of the Yahoo! categories
Objective
  • Assess the scalability and effectiveness of SVMs on a very large taxonomy
  • A data analysis of the Yahoo! taxonomy
  • Development of a scalable system for large-scale text categorization
  • Theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical settings for classification
  • An investigation of threshold tuning algorithms with respect to their time complexity and their effect on the accuracy of SVMs
Introduction
  • Text categorization (TC) methods: SVMs, kNN, NB, …
  • In recent years, the scale of TC problems has become larger and larger
  • Whether existing methods can cope is answered from the views of scalability and effectiveness
SVM
  • Flat SVMs vs. hierarchical SVMs
    • Hierarchical SVMs use the structure of the taxonomy tree
SVM
  • An SVM finds the optimal separating hyperplane between two classes by maximizing the margin between the classes’ closest points
SVM
  • Multi-class classification
    • SVMs natively solve only binary classification problems
    • Multi-class problems are handled by fitting a set of binary sub-classifiers
  • One-against-all
    • N two-class classifiers for N classes (true class vs. false class)
  • One-against-one
    • N(N-1)/2 classifiers, one per pair of classes
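To make the two decomposition strategies concrete, the following minimal sketch counts the binary classifiers each one needs. The function names are illustrative only; the figure 132,199 is the filtered category count reported in the dataset slides of this presentation.

```python
def one_against_all_count(num_classes: int) -> int:
    """One binary classifier per class (that class vs. all the others)."""
    return num_classes


def one_against_one_count(num_classes: int) -> int:
    """One binary classifier per unordered pair of classes."""
    return num_classes * (num_classes - 1) // 2


if __name__ == "__main__":
    for n in (4, 100, 132_199):
        print(n, one_against_all_count(n), one_against_one_count(n))
```

At the scale of this taxonomy, one-against-one would need roughly 8.7 billion pairwise classifiers, which is why the one-against-rest decomposition is the practical choice here.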
SVM
  • Given a set of n labeled training documents, the slide defines:
    • A linear discriminant function
    • A corresponding classification function
    • The margin of a weight vector
SVM
  • Optimal separation
  • soft-margin multiclass formulation
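For reference, the standard textbook binary form of these definitions can be written as follows. The symbols $D$, $x_i$, $y_i$, $w$, $b$, $\gamma$, $\xi_i$ and $C$ are the usual conventions and are assumptions, not necessarily the notation of the original slides. The last line is the binary soft-margin problem, shown here in place of the multiclass formulation the slide names.

```latex
\begin{align}
D &= \{(x_i, y_i)\}_{i=1}^{n}, \qquad y_i \in \{-1, +1\} \\
f(x) &= w \cdot x + b, \qquad h(x) = \operatorname{sign}\bigl(f(x)\bigr) \\
\gamma(w, b) &= \min_{1 \le i \le n} \frac{y_i \,(w \cdot x_i + b)}{\lVert w \rVert} \\
\min_{w,\, b,\, \xi}\; & \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0
\end{align}
```

Maximizing the margin $\gamma$ for separable data is equivalent to minimizing $\lVert w \rVert$ under the constraints $y_i (w \cdot x_i + b) \ge 1$; the slack variables $\xi_i$ relax those constraints for data that cannot be separated exactly.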
Dataset: first characteristic
  • The full domain of the Yahoo! Directory
    • 292,216 categories
    • 792,601 documents
Dataset: second characteristic
  • Over 76% of the Yahoo! categories have fewer than 5 labeled documents
  • The proportion of “rare categories” increases at deeper hierarchy levels
    • 36% are rare categories at deep levels
Dataset: third characteristic
  • Many documents have multiple labels
    • A document has 2.23 labels on average
    • the largest number of labels for a single document is 31
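The three characteristics above are simple corpus statistics. A minimal sketch of how they could be computed from a multi-label corpus follows; the function name, the input layout (one list of labels per document) and the toy data are assumptions for illustration, and only the threshold of 5 comes from the slides.

```python
from collections import Counter


def taxonomy_stats(doc_labels, rare_threshold=5):
    """Summarise a multi-label corpus.

    doc_labels: list of label lists, one per document.
    rare_threshold: categories with fewer labeled documents than this
        count as "rare" (5 follows the slides' definition).
    """
    # Count each category at most once per document.
    docs_per_category = Counter(
        label for labels in doc_labels for label in set(labels)
    )
    rare = sum(1 for c in docs_per_category.values() if c < rare_threshold)
    label_counts = [len(set(labels)) for labels in doc_labels]
    return {
        "rare_fraction": rare / len(docs_per_category),
        "avg_labels": sum(label_counts) / len(label_counts),
        "max_labels": max(label_counts),
    }


if __name__ == "__main__":
    toy = [["a", "b"], ["a"], ["a", "c"], ["a"], ["a", "b", "c"], ["b"]]
    print(taxonomy_stats(toy))
```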
Dataset
  • The Yahoo! Directory is split into a training set and a testing set with a ratio of 7:3
  • Categories containing only one labeled document are removed, leaving:
    • 132,199 categories
    • 492,617 training documents
    • 275,364 testing documents
Complexity and Effectiveness
  • Flat SVMs use the one-against-rest strategy
  • N is the number of training documents
  • M is the number of categories
  • The total training cost is M times the average training time per SVM model, because one-against-rest trains one model per category
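The flat setting can be sketched as follows: every one of the M categories gets its own binary problem, and each problem spans all N training documents. The function name and the toy data are hypothetical; no actual SVM is trained here, only the decomposition is shown.

```python
def one_against_rest_problems(doc_labels, categories):
    """Yield (category, binary_targets) for each of the M categories.

    doc_labels: one set of category labels per training document.
    Every binary problem covers all N documents, so the flat approach
    trains M models on N examples each.
    """
    for category in categories:
        targets = [1 if category in labels else -1 for labels in doc_labels]
        yield category, targets


if __name__ == "__main__":
    docs = [{"sports"}, {"news", "sports"}, {"arts"}]  # N = 3 toy documents
    cats = ["arts", "news", "sports"]                   # M = 3 categories
    for cat, y in one_against_rest_problems(docs, cats):
        print(cat, y)
```

Because each of the M problems touches every document, the work grows with both M and N, which is what makes the flat approach expensive on a taxonomy with over a hundred thousand categories.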
Complexity and Effectiveness
  • Hierarchical SVMs
  • m_i is the number of categories defined at the i-th level
  • j is the size-based rank of a category within its level
  • n_ij is the number of training documents for the j-th category at the i-th level
  • n_i1 is the number of training documents for the most common category at the i-th level
  • A level-specific parameter relates n_ij to n_i1 and the rank j
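The symbols on this slide describe how category sizes fall off with rank at each level. One common way to model such a skewed distribution is a Zipf-like power law. The sketch below assumes n_ij = n_i1 · j^(-beta_i); both the functional form and the name beta_i are assumptions made for illustration, not formulas recovered from the slide.

```python
def power_law_sizes(n_i1, m_i, beta_i):
    """Category sizes at one level under an ASSUMED power law.

    n_i1: training documents in the most common category at the level.
    m_i: number of categories at the level.
    beta_i: assumed level-specific decay exponent (hypothetical name).
    Returns [n_i1 * j ** (-beta_i) for ranks j = 1 .. m_i].
    """
    return [n_i1 * j ** (-beta_i) for j in range(1, m_i + 1)]


if __name__ == "__main__":
    sizes = power_law_sizes(n_i1=1000, m_i=5, beta_i=1.0)
    print([round(s, 1) for s in sizes])
```

Under this assumption, a larger exponent means a more skewed level: a few large categories and a long tail of rare ones.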
Complexity and Effectiveness
  • An approximation was used for the number of categories at the i-th level
Complexity and Effectiveness
  • Testing phase of hierarchical SVMs
  • Pachinko-machine search: start from the root; at each step, select the most likely child category of the current category and open it, until a leaf is reached
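The Pachinko-machine search described above is a greedy top-down walk through the category tree. A minimal sketch follows; the tree representation (a dict from category to child list) and the `score` callable, which stands in for the per-category classifier's decision value, are hypothetical interfaces chosen for illustration.

```python
def pachinko_search(tree, score, document, root="root"):
    """Greedy top-down search through a category tree.

    tree: dict mapping each category to a list of its child categories
        (leaves map to an empty list or are absent from the dict).
    score: callable (category, document) -> float, a stand-in for the
        classifier attached to each category (hypothetical interface).
    Returns the path of categories from the root to the chosen leaf.
    """
    path = [root]
    current = root
    while tree.get(current):
        # Open only the single most likely child at each level.
        current = max(tree[current], key=lambda child: score(child, document))
        path.append(current)
    return path


if __name__ == "__main__":
    toy_tree = {"root": ["science", "arts"], "science": ["physics", "biology"]}
    toy_scores = {"science": 0.8, "arts": 0.1, "physics": 0.3, "biology": 0.6}
    print(pachinko_search(toy_tree, lambda c, d: toy_scores[c], "some page"))
```

Only the children of one category are scored per level, so the number of classifier evaluations stays small; the trade-off is that a wrong choice near the root cannot be corrected further down.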
Complexity of SVM Classification with Threshold Tuning
  • SCut
    • Tune a threshold for each category until the classifier's optimal performance is obtained for that category
    • Fix the per-category thresholds when applying the classifier to new documents in the test set
  • RCut
    • For each document, sort categories by score and assign YES to each of the t top-ranking categories
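The two thresholding strategies can be sketched as below. The function names are illustrative, and two choices are assumptions rather than details from the slides: F1 is used as the performance measure that SCut optimizes, and the candidate thresholds are the scores observed on the validation documents.

```python
def rcut(scores, t):
    """RCut: say YES to the t top-scoring categories for one document.

    scores: dict mapping category -> classifier score for the document.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:t])


def f1(predicted, actual):
    """F1 of two equal-length lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def scut(val_scores, val_truth):
    """SCut for one category: return the threshold with the best
    validation F1 (assumed criterion). Ties go to the lowest threshold."""
    return max(
        sorted(set(val_scores)),
        key=lambda th: f1([s >= th for s in val_scores], val_truth),
    )


if __name__ == "__main__":
    print(rcut({"a": 0.2, "b": 0.9, "c": 0.5}, t=2))
    print(scut([0.9, 0.8, 0.4, 0.1], [True, True, False, False]))
```

RCut needs only a sort per document, while SCut repeats a search like this once per category, which is where its extra tuning cost comes from.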
Effectiveness Analysis
  • Compared to scalability, classification effectiveness is not as clear and predictable
    • It is affected by many other factors
  • Potential problems for SVMs
    • Noisy, imbalanced training data
  • The performance of hierarchical SVMs cannot be expected to be very good
Experimental Results
  • 10 machines, each with four 3GHz CPUs and 4 GB of memory
Conclusions
  • The paper addresses scaling text categorization algorithms to very large problems, especially large-scale Web taxonomies
Conclusions
  • Drawback
    • Lower performance at deep levels
  • Application
    • Combine SVMs with a concept hierarchy tree
    • Application to text or other domains
    • Pachinko-machine search…
  • Future Work
    • Learn an SVM kernel for the implementation?