
Externally Enhanced Classifiers and Application in Web Page Classification


Presentation Transcript


  1. Externally Enhanced Classifiers and Application in Web Page Classification Jyh-Jong Tsay National Chung Cheng University Joint work with Chi-Feng Chang and Hsuan-Yu Chen This research is supported in part by the National Science Council, Taiwan.

  2. Outline • Introduction • Externally Enhanced Classifiers • Enhanced NB • Topic Restriction • Conclusion

  3. Classification: Definition • assignment of objects into a set of predefined categories (classes) • classification of applicants into risk levels • classification of web pages into topics • classification of protein sequences into families • topic-specific retrieval, information filter, recommendation, …

  4. Classification: Task • Input: a training set of examples, each labeled with one class label • Output: a model (classifier) that assigns a class label to each instance based on the other attributes • The model can be used to predict the class of new instances, for which the class label is missing or unknown

  5. Train and Test • example = instance + class label • Examples are divided into training set + test set • Classification model is built in two steps: • training - build the model from the training set • test - check the accuracy of the model using the test set

  6. Train and Test • Kinds of models: • if - then rules • decision trees • joint probabilities • decision surfaces • Accuracy of models: • the known class of each test sample is matched against the class predicted by the model • accuracy rate = % of test set samples correctly classified by the model
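The two-step procedure on these slides can be sketched in a few lines of Python. The toy "model" below (a majority-class predictor) and all names are illustrative, not from the presentation; the point is the split into a training step that builds the model and a test step that computes the accuracy rate.

```python
# Minimal sketch of the train/test procedure: the "model" here simply
# predicts the most frequent class seen during training (illustrative only).
from collections import Counter

def train(examples):
    """Training step: build a model from (instance, class_label) examples."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    return lambda instance: majority  # predicts the same class for any instance

def accuracy(model, test_set):
    """Test step: accuracy rate = fraction of test samples classified correctly."""
    correct = sum(1 for inst, label in test_set if model(inst) == label)
    return correct / len(test_set)

examples = [({"age": 25}, "high"), ({"age": 45}, "low"), ({"age": 30}, "high"),
            ({"age": 50}, "high"), ({"age": 22}, "low"), ({"age": 60}, "high")]
training, test = examples[:4], examples[4:]   # divide into training + test set
model = train(training)
print(accuracy(model, test))  # → 0.5
```

A real system would replace the majority-class predictor with one of the model kinds listed above (decision trees, probabilistic models, decision surfaces), but the train/test skeleton stays the same.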

  7. Training step • [Diagram: a classification algorithm builds a classifier (model) from training data; each training example carries a class label.] • Example rule learned: if age < 31 or Car Type = Sports then Risk = High

  8. Test step • [Diagram: the classifier (model) is applied to test data to check its accuracy.]

  9. Classification (prediction) • [Diagram: the classifier (model) assigns class labels to new data.]

  10. Classification Techniques • Decision Tree Classification • Bayesian Classifiers • Hidden Markov Models (HMM) • Neural Networks • Support Vector Machines (SVM) • k-nearest neighbor classifiers (kNN) • Genetic Algorithms • Rough Set Approach

  11. Web Page Classification • automatically assign the document to a predefined category (topic) • Topic Specific Retrieval, Filter, Recommendation, …

  12. External Annotations • [Diagram: a source hierarchy S and a target hierarchy T (www.yam.com and www.openfind.com.tw), with topics (classes) S1–S6 and T1–T9 and documents attached to topics.] • Users browse to find information of interest; filtering and categorization follow user interests. • Information from other, related categories can help classification. • use external annotations to enhance classification of documents categorized in one topic hierarchy (source) into another one (target).

  13. Examples • web directories • Google, Yahoo, ProFusion, … • domain-specific channels • music, sports, … • product catalogs • expert annotations

  14. Learning Approaches • internal learning • produces traditional classifiers from internal information • large amount of internal information • external learning • produces external enhancers or reducers from external information • heterogeneous, sparse, dynamic

  15. External Learning • Probabilistic Enhancement • use a probabilistic enhancer to improve probabilistic classifiers • Naïve Bayes, Hidden Markov Models, … • Topic Restriction • cascade a reducer to reduce the set of candidate classes • kNN, SVM, Neural Nets, …

  16. Externally Enhanced Classifiers • [Architecture diagram: an annotated instance passes through reducers (Topic Restriction) or enhancers (Probabilistic Enhancement) combined with base classifiers (kNN, SVM, NB, HMM) to produce the predicted class.]

  17. Summary • Traditional Classifiers (Yam.BusinessAndEconomics → Openfind.BusinessAndEconomics): • Naïve Bayes: 55% • SVM: 57% • Enhanced Classifiers: • Enhanced Naïve Bayes: 66% • Topic Restricted SVM: 67%

  18. Proposed Approaches • Probabilistic Enhancement that uses class information to enhance probabilistic classifiers such as Naïve Bayes and HMM • Topic Restriction that uses class information to restrict the set of candidate classes, and can be used to extend any classifier such as SVM and kNN

  19. Probabilistic Methods • Probabilistic Classifier • Probabilistic Enhancement: applied when external information is available
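The formulas on this slide did not survive transcription. A hedged reconstruction, consistent with the estimation of P(vt|s) on the next slide, is:

```latex
% Hedged reconstruction -- the slide's original formula images were lost.
% Probabilistic classifier: pick the most likely target class v_t for document d
v^{*} = \operatorname*{arg\,max}_{v_t} \; P(v_t)\, P(d \mid v_t)
% Probabilistic enhancement: d arrives annotated with a source class s, so the
% class prior P(v_t) is replaced by the externally informed prior P(v_t \mid s)
v^{*} = \operatorname*{arg\,max}_{v_t} \; P(v_t \mid s)\, P(d \mid v_t)
```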

  20. Estimation of P(vt|s) • straightforward estimation: the fraction of documents in source class s that fall in target class vt • more robust estimation: smoothed variants for the cases when the external information for s is sparse or when s does not intersect vt

  21. NB-Based Methods (Agrawal and Srikant, 2001)

  22. Data Sets • Data set I • source hierarchy: Yam • target hierarchy: Openfind • Data set II • source hierarchy: Yam.BusinessAndEconomics • target hierarchy: Openfind.BusinessAndEconomics • Data set III • source hierarchy: Google.Business • target hierarchy: Yahoo.BusinessAndEconomics

  23. Comparison of NB-Based Methods

  24. Class-Level Comparison

  25. Topic Restriction (TR) • TR uses class information to reduce the set of candidate classes, and can be used with any traditional classifier such as SVM and kNN • Static Topic Restriction • Most source classes are related to a small number of target classes • Consider only those target classes that intersect the source class • Dynamic Topic Restriction • Simple classifiers achieve a very high top-k measure for small k • Consider only the top k classes ranked by a simple classifier
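Both restriction strategies above can be sketched as candidate-set filters placed in front of an expensive classifier. All names below are illustrative, and the scoring functions are stand-ins for real classifiers such as SVM.

```python
# Sketch of static and dynamic topic restriction (names are illustrative).
# Static: keep only target classes that intersect the document's source class.
# Dynamic: keep only the top-k classes ranked by a simple (cheap) classifier.
# An expensive classifier then chooses among the surviving candidates.

def static_restriction(source_class, overlap):
    """overlap[s] = set of target classes containing documents of source class s."""
    return overlap.get(source_class, set())

def dynamic_restriction(doc, simple_scores, k):
    """simple_scores(doc) -> {target_class: score} from a cheap classifier."""
    ranked = sorted(simple_scores(doc).items(), key=lambda kv: -kv[1])
    return {cls for cls, _ in ranked[:k]}

def restricted_predict(doc, candidates, expensive_scores):
    """Run the expensive classifier only over the restricted candidate set."""
    scores = expensive_scores(doc)
    return max(candidates, key=lambda cls: scores.get(cls, float("-inf")))

overlap = {"S1": {"T1", "T2"}}
cheap = lambda doc: {"T1": 0.5, "T2": 0.3, "T3": 0.2}  # stand-in simple classifier
svm = lambda doc: {"T1": 0.1, "T2": 0.9, "T3": 0.4}    # stand-in for SVM scores
cands = static_restriction("S1", overlap) & dynamic_restriction("doc", cheap, 2)
print(restricted_predict("doc", cands, svm))  # → T2 (T3 was pruned away)
```

Because the expensive classifier scores only the restricted candidates, both running time and accuracy can improve when the full set of topic classes is large, which is the effect the conclusion and "Further Remarks" slides report.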

  26. Static Topic Restriction

  27. Dynamic Topic Restriction Data Set II

  28. Conclusion • We propose probabilistic enhancement to enhance Naïve Bayes. • We propose a topic restriction method to extend SVM. • We carry out extensive experiments on text collections from Google and Yahoo, and from Openfind and Yam. • Experiments show that our approaches significantly improve on traditional approaches.

  29. Further Remarks • Topic restriction is a general idea for cascading simpler classifiers, such as NB and linear classifiers, with more complicated classifiers, such as SVM and kNN • Cascading improves both the running time and the classification accuracy of SVM and kNN, especially when the number of topic classes is large. • Further study of topic restriction is ongoing.

  30. Cascaded SVM • Web Directory Data (Openfind)

  31. Cascaded SVM • CNA news collection
