
Support Vector Machines Classification with A Very Large-scale Taxonomy

Advisor: Dr. Hsu. Presenter: Chien-Shing Chen. Authors: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma. SIGKDD, 2004.


Presentation Transcript


  1. Support Vector Machines Classification with A Very Large-scale Taxonomy • Advisor: Dr. Hsu • Presenter: Chien-Shing Chen • Authors: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma • SIGKDD, 2004

  2. Outline • Motivation • Objective • Introduction • Dataset characteristics • Complexity Analysis • Effectiveness Analysis • Experimental Settings • Conclusions • Personal Opinion

  3. Motivation • Very large-scale classification taxonomies • Hundreds of thousands of categories • Deep hierarchies • Skewed category distribution over documents • It is an open question whether state-of-the-art text categorization technologies can scale to, and perform well on, such taxonomies • Evaluation of SVMs in web-page classification over the full taxonomy of the Yahoo! categories

  4. Objective • Scalability and effectiveness • A data analysis of the Yahoo! taxonomy • Development of a scalable system for large-scale text categorization • Theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical settings for classification • Threshold tuning algorithms with respect to the time complexity and accuracy of SVMs

  5. Introduction • TC (text categorization): SVMs, kNN, NB, … • In recent years, the scale of TC problems has become larger and larger • Can existing methods cope? Answer this question from the views of scalability and effectiveness

  6. SVM • Flat SVMs, hierarchical SVMs • Structure of the taxonomy tree

  7. SVM

  8. SVM

  9. SVM • The optimal separating hyperplane between the two classes is found by maximizing the margin between the classes' closest points

  10. SVM

  11. SVM • Multi-class classification • Basically, SVMs can only solve binary classification problems, so a multi-class problem is handled by fitting a set of binary sub-classifiers • One-against-all: N two-class classifiers (true class vs. false class) • One-against-one: N(N-1)/2 classifiers
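
The two decomposition strategies on this slide differ sharply in how many binary classifiers they need. The following is a minimal Python sketch of that arithmetic; the function names are illustrative and not taken from the paper.

```python
def one_against_all_count(n_classes: int) -> int:
    """One binary classifier per class (true class vs. all others)."""
    return n_classes


def one_against_one_count(n_classes: int) -> int:
    """One binary classifier per unordered pair of classes: N(N-1)/2."""
    return n_classes * (n_classes - 1) // 2


if __name__ == "__main__":
    # 132,199 is the category count reported for the filtered Yahoo! data set.
    n = 132_199
    print(one_against_all_count(n))
    print(one_against_one_count(n))
```

For the 132,199 categories in the filtered data set, one-against-one would need roughly 8.7 billion classifiers, which suggests why a one-against-rest decomposition is the practical choice at this scale.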

  12. SVM • A set of n labeled training documents • A linear discriminant function • The corresponding classification function • The margin of a weight vector
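
The formulas on this slide did not survive in the transcript. As a sketch, the standard textbook definitions that match the four items listed are given below; the symbols $x_i$, $y_i$, $w$, $b$, and $\gamma$ are assumptions and may differ from the slide's notation.

```latex
% Assumed notation: standard linear SVM definitions, not copied from the slide.
\begin{align*}
  D &= \{(x_i, y_i)\}_{i=1}^{n}, \quad y_i \in \{-1, +1\}
      && \text{(labeled training documents)} \\
  f(x) &= w \cdot x + b
      && \text{(linear discriminant function)} \\
  h(x) &= \operatorname{sign}\bigl(f(x)\bigr)
      && \text{(classification function)} \\
  \gamma(w, b) &= \min_{1 \le i \le n} \frac{y_i \,(w \cdot x_i + b)}{\lVert w \rVert}
      && \text{(geometric margin of the weight vector)}
\end{align*}
```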

  13. SVM • Optimal separation • soft-margin multiclass formulation
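
The slide's multiclass formulation is not recoverable from the transcript. For orientation only, the familiar binary soft-margin problem is sketched below; it is an assumed stand-in, not the formulation shown on the slide, and the symbols $C$ and $\xi_i$ are the usual textbook ones.

```latex
% Standard binary soft-margin SVM (assumed; the slide's multiclass form is unknown).
\begin{align*}
  \min_{w,\, b,\, \xi} \quad & \tfrac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
  \text{subject to} \quad & y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \\
                          & \xi_i \ge 0, \qquad i = 1, \dots, n
\end{align*}
```

Here $C$ trades off margin width against training errors, and each slack variable $\xi_i$ measures how far document $i$ falls on the wrong side of the margin.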

  14. SVM

  15. SVM

  16. DATABASE-first characteristic • The full domain of the Yahoo! Directory • 292,216 categories • 792,601 documents

  17. DATABASE-second characteristic • Over 76% of the Yahoo! categories have fewer than 5 labeled documents • The share of such "rare categories" increases at deeper hierarchy levels • 36% are rare categories at deep levels

  18. DATABASE-third characteristic • Many documents have multiple labels • A document has 2.23 labels on average • The largest number of labels for a single document is 31

  19. DATABASE • The Yahoo! Directory was split into a training set and a testing set with a ratio of 7:3 • Categories containing only one labeled document were removed • 132,199 categories • 492,617 training documents • 275,364 testing documents

  20. Complexity and Effectiveness • Flat SVMs with the one-against-rest strategy • N is the number of training documents • M is the number of categories • The analysis also uses the average training time per SVM model

  21. Complexity and Effectiveness • Hierarchical SVMs • mi is the number of categories defined at the i-th level • j is the size-based rank of a category • nij is the number of training documents for the j-th category at the i-th level • ni1 is the number of training documents for the most common category at the i-th level • The model also has a level-specific parameter
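
The symbols above are consistent with a Zipf-style power law for category sizes within one level of the hierarchy. The slide's own formula did not survive in the transcript, so the form below and the name $\beta_i$ for the level-specific parameter are assumptions.

```latex
% Assumed power-law model of category size by rank at level i.
\[
  n_{ij} \approx n_{i1} \, j^{-\beta_i},
  \qquad j = 1, \dots, m_i
\]
```

Under this assumed model, a larger $\beta_i$ means the training documents at level $i$ are concentrated in a few large categories, with a long tail of rare ones.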

  22. Complexity and Effectiveness • An approximation was used for the number of categories at the i-th level

  23. Complexity and Effectiveness

  24. Complexity and Effectiveness

  25. Complexity and Effectiveness

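  26. Complexity and Effectiveness • For the testing phase of hierarchical SVMs • Pachinko-machine search: start from the root; at each step, choose the most likely child category of the current category and descend into it, until a leaf is reached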
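
The Pachinko-machine search on this slide is a greedy top-down walk of the taxonomy. The following is a minimal Python sketch under stated assumptions: the taxonomy is a plain dictionary from a category to its children, and `score` is a hypothetical callable standing in for a per-category classifier's output on the document being classified. Neither name comes from the paper.

```python
from typing import Callable, Dict, List


def pachinko_search(
    root: str,
    children: Dict[str, List[str]],
    score: Callable[[str], float],
) -> List[str]:
    """Greedy root-to-leaf walk: at each node, follow the best-scoring child.

    `children` maps a category to its child categories (leaves are absent
    or map to an empty list). `score` is a hypothetical stand-in for the
    classifier output of a category on the current document.
    """
    path = [root]
    node = root
    while children.get(node):  # stop when the current node has no children
        node = max(children[node], key=score)
        path.append(node)
    return path


if __name__ == "__main__":
    taxonomy = {"root": ["Arts", "Science"], "Science": ["Physics", "Biology"]}
    scores = {"Arts": 0.2, "Science": 0.7, "Physics": 0.4, "Biology": 0.6}
    print(pachinko_search("root", taxonomy, lambda c: scores[c]))
```

Only the children of the nodes on the chosen path are scored, so the number of classifier evaluations depends on the branching along that path rather than on the total number of categories. The price is that a wrong choice near the root cannot be corrected further down.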

  27. Complexity of SVM Classification with Threshold Tuning

  28. Complexity of SVM Classification with Threshold Tuning • SCut: tune a threshold for each category until the classifier's performance for that category is optimal, then fix the per-category thresholds when applying the classifier to new documents in the test set • RCut: sort the categories by score and assign YES to each of the t top-ranking categories
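
The two thresholding strategies can be sketched in a few lines of Python. This is an illustration, not the authors' implementation: it assumes F1 as the per-category performance measure for SCut and tries each observed validation score as a candidate threshold. All function names are hypothetical.

```python
from typing import Dict, List, Set


def rcut(scores: Dict[str, float], t: int) -> Set[str]:
    """RCut: rank categories by score for one document, say YES to the top t."""
    ranked = sorted(scores, key=lambda c: scores[c], reverse=True)
    return set(ranked[:t])


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def scut_threshold(val_scores: List[float], val_labels: List[bool]) -> float:
    """SCut for one category: pick the threshold with the best validation F1.

    F1 is an assumed performance measure. If no candidate yields a positive
    F1, the threshold stays at +inf, i.e. the category is never assigned.
    """
    best_threshold, best_f1 = float("inf"), 0.0
    for candidate in sorted(set(val_scores)):
        pairs = list(zip(val_scores, val_labels))
        tp = sum(1 for s, y in pairs if s >= candidate and y)
        fp = sum(1 for s, y in pairs if s >= candidate and not y)
        fn = sum(1 for s, y in pairs if s < candidate and y)
        current = f1(tp, fp, fn)
        if current > best_f1:
            best_threshold, best_f1 = candidate, current
    return best_threshold


if __name__ == "__main__":
    print(rcut({"a": 0.2, "b": 0.9, "c": 0.5}, t=2))
    print(scut_threshold([0.9, 0.8, 0.3, 0.1], [True, True, False, False]))
```

SCut needs one tuning pass per category over held-out documents, whereas RCut needs only a sort per document. That difference is what makes threshold tuning relevant to the time-complexity analysis.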

  29. Effectiveness Analysis • Compared to scalability analysis, classification effectiveness is not as clear and predictable • It is affected by many other factors • Potential problems for SVMs: noisy and imbalanced data • We cannot expect the performance of hierarchical SVMs to be very good

  30. Experimental Results • 10 machines, each with four 3GHz CPUs and 4 GB of memory

  31. Experimental Results

  32. Experimental Results

  33. Experimental Results

  34. Experimental Results

  35. Conclusions • Scaling text categorization algorithms to very large problems, especially large-scale Web taxonomies

  36. Conclusions • Drawback • Lower performance at deep levels • Application • Combine SVMs with a concept hierarchy tree • Application to text or other domains • Pachinko-machine search… • Future Work • Learn an SVM kernel to implement?
