Externally Enhanced Classifiers and Application in Web Page Classification

Jyh-Jong Tsay, National Chung Cheng University • Joint work with Chi-Feng Chang and Hsuan-Yu Chen • This research is supported in part by National Science Council, Taiwan, under.

Outline • Introduction • Externally Enhanced Classifiers • Enhanced NB • Topic Restriction • Conclusion

Classification: Definition • assignment of objects to a set of predefined categories (classes) • classification of applicants into risk levels • classification of web pages into topics • classification of protein sequences into families • topic-specific retrieval, information filtering, recommendation, …

Classification: Task • Input: a training set of examples, each labeled with one class label • Output: a model (classifier) that assigns a class label to each instance based on the other attributes • The model can be used to predict the class of new instances, for which the class label is missing or unknown

Train and Test • example = instance + class label • Examples are divided into a training set and a test set • The classification model is built in two steps: • training: build the model from the training set • test: check the accuracy of the model using the test set

Train and Test • Kinds of models: • if-then rules • decision trees • joint probabilities • decision surfaces • Accuracy of models: • the known class of each test sample is matched against the class predicted by the model • accuracy rate = % of test set samples correctly classified by the model

Training step • Figure: training data (with class labels) is fed to a classification algorithm, which produces a classifier (model), for example the rule: if age < 31 or Car Type = Sports then Risk = High

Test step • Figure: test data is fed to the classifier (model)
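As a minimal sketch of the test step, the rule shown on the training-step slide can be treated as the classifier and scored on a labeled test set. The default class "Low", the field names `age` and `car_type`, and the four test examples are assumptions made up for illustration; they do not come from the slides.

```python
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, object]
Example = Tuple[Instance, str]  # (instance, class label)


def rule_classifier(x: Instance) -> str:
    """Rule from the training-step slide; returning "Low" otherwise is an assumption."""
    if x["age"] < 31 or x["car_type"] == "Sports":
        return "High"
    return "Low"


def accuracy(model: Callable[[Instance], str], test_set: List[Example]) -> float:
    """Fraction of test examples whose predicted class matches the known class."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)


# Hypothetical test set: three examples agree with the rule, one does not.
test_set: List[Example] = [
    ({"age": 25, "car_type": "Family"}, "High"),
    ({"age": 45, "car_type": "Sports"}, "High"),
    ({"age": 50, "car_type": "Family"}, "Low"),
    ({"age": 28, "car_type": "Truck"}, "Low"),  # the rule predicts "High" here
]

print(accuracy(rule_classifier, test_set))  # 3 of 4 correct -> 0.75
```

Keeping the test set separate from the training set is what makes this number an estimate of how the model behaves on new instances.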

Classification (prediction) • Figure: new data is fed to the classifier (model)

Classification Techniques • Decision Tree Classification • Bayesian Classifiers • Hidden Markov Models (HMM) • Neural Networks • Support Vector Machines (SVM) • k-nearest neighbor classifiers (KNN) • Genetic Algorithms • Rough Set Approach

Web Page Classification • automatically assign each document to a predefined category (topic) • topic-specific retrieval, filtering, recommendation, …

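External Annotations • Figure: a source hierarchy S (topics S1 to S6) and a target hierarchy T (topics T1 to T9), illustrated with www.openfind.com.tw and www.yam.com; the legend distinguishes topics (classes) from documents. • Users browse to find information of interest; filtering and categorization are performed according to user interests, and information from other related categories is used to help categorization. • Use external annotations to enhance the classification of documents categorized in one topic hierarchy (source) into another one (target).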

Examples • web directories • Google, Yahoo, ProFusion, … • domain-specific channels • music, sports, … • product catalogs • expert annotations

Learning Approaches • internal learning • produces traditional classifiers from internal information • large amount of internal information • external learning • produces external enhancers or reducers from external information • external information is heterogeneous, sparse, and dynamic

External Learning • Probabilistic Enhancement • use a probabilistic enhancer to improve probabilistic classifiers • Naïve Bayes, Hidden Markov Models, … • Topic Restriction • cascade a reducer to reduce the set of candidate classes • KNN, SVM, Neural Nets, …

Externally Enhanced Classifiers • Figure: an annotated instance is passed to an externally enhanced classifier, which outputs a predicted class. Reducers (Topic Restriction) are paired with KNN and SVM; enhancers (Probabilistic Enhancement) are paired with NB and HMM.

Summary • Traditional classifiers (Yam.BusinessAndEconomics to Openfind.BusinessAndEconomics) • Naïve Bayes: 55% • SVM: 57% • Enhanced classifiers • Enhanced Naïve Bayes: 66% • Topic-Restricted SVM: 67%

Proposed Approaches • Probabilistic Enhancement that uses class information to enhance probabilistic classifiers such as Naïve Bayes and HMM • Topic Restriction that uses class information to restrict the set of candidate classes, and can be used to extend any classifier such as SVM and kNN

Probabilistic Methods • Probabilistic Classifier • When external information is available, • Probabilistic Enhancement
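The formulas on this slide did not survive extraction. As a hedged sketch only, one common way to write the idea is: a plain probabilistic classifier picks the class c maximizing P(c) P(d | c), and an enhanced one replaces the prior P(c) with P(c | s), where s is the source class of document d. This assumes d and s are conditionally independent given c. The function name, the `weight` parameter, and the toy numbers are hypothetical and are not taken from the slides.

```python
import math
from typing import Dict, Optional


def enhanced_predict(
    log_p_doc_given_class: Dict[str, float],
    p_class_given_source: Dict[str, float],
    weight: float = 1.0,
) -> Optional[str]:
    """Return argmax over c of  log P(d | c) + weight * log P(c | s).

    Classes with P(c | s) == 0 are skipped; None is returned if no class
    has positive probability under the source annotation.
    """
    best_class, best_score = None, -math.inf
    for c, log_likelihood in log_p_doc_given_class.items():
        prior = p_class_given_source.get(c, 0.0)
        if prior <= 0.0:
            continue
        score = log_likelihood + weight * math.log(prior)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy numbers: the text alone slightly prefers T3,
# but the source class is strongly associated with T2.
log_likelihood = {"T1": -10.0, "T2": -10.5, "T3": -9.8}
p_given_source = {"T1": 0.1, "T2": 0.8, "T3": 0.1}

print(enhanced_predict(log_likelihood, p_given_source))              # T2
print(enhanced_predict(log_likelihood, p_given_source, weight=0.0))  # T3
```

Setting `weight` to 0 ignores the external annotation entirely, which makes it easy to compare the enhanced decision against the unenhanced one on the same scores.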

Estimation of P(vt|s) • straightforward estimation • more robust estimation • when • when
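The two estimators named on this slide lost their formulas in extraction. The sketch below assumes that the straightforward estimate is a relative frequency over documents that appear in both hierarchies, and it substitutes add-alpha (Laplace) smoothing as one standard way to make the estimate more robust for sparse source classes. The smoothing choice, the function name, and the toy counts are assumptions, not the authors' stated method.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def estimate_p_target_given_source(
    pairs: Iterable[Tuple[str, str]],
    target_classes: List[str],
    alpha: float = 0.0,
) -> Dict[str, Dict[str, float]]:
    """Estimate P(target | source) from (source class, target class) pairs.

    alpha = 0 gives the relative frequency count(s, t) / count(s);
    alpha > 0 gives (count(s, t) + alpha) / (count(s) + alpha * K),
    where K is the number of target classes.
    """
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    source_counts = Counter(source for source, _ in pairs)
    k = len(target_classes)
    table: Dict[str, Dict[str, float]] = {}
    for source, n_source in source_counts.items():
        table[source] = {
            t: (pair_counts[(source, t)] + alpha) / (n_source + alpha * k)
            for t in target_classes
        }
    return table


# Toy data: four documents of source class S1, three in T1 and one in T2.
pairs = [("S1", "T1")] * 3 + [("S1", "T2")]
targets = ["T1", "T2", "T3"]

raw = estimate_p_target_given_source(pairs, targets)
smooth = estimate_p_target_given_source(pairs, targets, alpha=1.0)
print(raw["S1"])     # T3 gets probability 0.0
print(smooth["S1"])  # T3 gets 1/7, so it is never ruled out entirely
```

With few documents per source class, the unsmoothed estimate assigns zero probability to any target class not yet observed with that source class, which is the failure a more robust estimate is meant to avoid.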

NB-Based Methods (Agrawal and Srikant, 2001)

Data Sets • Data set I • source hierarchy: Yam • target hierarchy: Openfind • Data set II • source hierarchy: Yam.BusinessAndEconomics • target hierarchy: Openfind.BusinessAndEconomics • Data set III • source hierarchy: Google.Business • target hierarchy: Yahoo.BusinessAndEconomics

Comparison of NB-Based Method

Class-Level Comparison

Topic Restriction (TR) • TR uses class information to reduce the set of candidate classes, and can be used with any traditional classifier such as SVM or kNN • Static Topic Restriction • Most source classes are related to a small number of target classes • Consider only those target classes that intersect the source class • Dynamic Topic Restriction • Simple classifiers achieve very high top-k accuracy for small k • Consider only the top k classes ranked by a simple classifier
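Both variants can be sketched as a candidate filter placed in front of a stronger classifier. In this minimal sketch, classifiers are represented as scoring functions over (document, class); "intersect" is interpreted as sharing at least one training document. All function names and the toy scores are hypothetical, and a real system would plug in an actual NB, SVM, or kNN scorer.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

Scorer = Callable[[str, str], float]  # (document, target class) -> score


def static_candidates(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Static TR: for each source class, the target classes that share
    at least one training document with it."""
    candidates: Dict[str, Set[str]] = defaultdict(set)
    for source, target in pairs:
        candidates[source].add(target)
    return dict(candidates)


def dynamic_candidates(doc: str, classes: Iterable[str], simple: Scorer, k: int) -> List[str]:
    """Dynamic TR: the top k classes as ranked by a simple classifier."""
    ranked = sorted(classes, key=lambda c: simple(doc, c), reverse=True)
    return ranked[:k]


def restricted_predict(doc: str, candidates: Iterable[str], strong: Scorer) -> str:
    """Let the stronger classifier choose, but only among the candidates."""
    return max(candidates, key=lambda c: strong(doc, c))


# Toy scores standing in for a simple and a stronger classifier.
simple_scores = {"T1": 0.5, "T2": 0.4, "T3": 0.1}
strong_scores = {"T1": 0.2, "T2": 0.9, "T3": 0.95}


def simple(doc: str, c: str) -> float:
    return simple_scores[c]


def strong(doc: str, c: str) -> float:
    return strong_scores[c]


static = static_candidates([("S1", "T1"), ("S1", "T2"), ("S2", "T3")])
top2 = dynamic_candidates("doc", ["T1", "T2", "T3"], simple, k=2)

print(restricted_predict("doc", static["S1"], strong))  # T2
print(restricted_predict("doc", top2, strong))          # T2
```

In both cases the stronger scorer is evaluated on two classes instead of three; the saving grows with the number of target classes, since only the cheap filter has to look at all of them.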

Static Topic Restriction

Dynamic Topic Restriction Data Set II

Conclusion • We propose probabilistic enhancement to enhance Naïve Bayes. • We propose a topic restriction method to extend SVM. • We carry out extensive experiments on text collections from Google and Yahoo, and from Openfind and Yam. • Experiments show that our approaches significantly improve on traditional approaches.

Further Remarks • Topic restriction is a general idea for cascading simpler classifiers, such as NB and linear classifiers, with more complicated classifiers, such as SVM and kNN. • Cascading improves both the running time and the classification accuracy of SVM and kNN, especially when the number of topic classes is large. • Further study of topic restriction is ongoing.

Cascaded SVM • Web Directory Data (Openfind)

Cascaded SVM • CNA news collection
