Combining Unsupervised Feature Selection Strategy for Automatic Text Categorization

Ping-Tsun Chang
Intelligent Systems Laboratory, Computer Science and Information Engineering, National Taiwan University

Introduction
• In recent research
• The limits of using statistical or computational approaches for natural language understanding
• The development of machine learning techniques has almost reached its bound
• Natural language is infinite and nonlinear!
• Unsupervised Feature Selection

Background Knowledge: Text Categorization
[Figure: pattern-classification pipeline with stages labeled Sensing, Segmentation, Feature Extraction, Classification, Post-Processing, and Decision]
• Problem Definition: Text categorization is the problem of assigning labels to a large number of documents whose labels are unknown, based on a large amount of text data.

Background Knowledge: Machine Learning
• Instance-Based Learning
• K-Nearest Neighbors
• Neural Networks
• Support Vector Machine
• Using computers to help us perform induction from large amounts of complex pattern data
• Bayesian Learning

Background Knowledge: Feature Selection
• Information Gain
• Mutual Information
• Chi-Square
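The three measures listed above can each be computed from a 2x2 table of document counts for one term and one category. The following is a minimal sketch, not the author's implementation. The cell names `a`, `b`, `c`, `d` and the function names are assumptions made for illustration: `a` counts documents that contain the term and belong to the category, `b` those that contain the term but belong elsewhere, `c` those in the category without the term, and `d` those with neither. Mutual information is shown in its pointwise form, and information gain is shown for the binary case (category versus not category).

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic of a term/category 2x2 contingency table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A zero marginal means the statistic is undefined; report 0.0 here.
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def pointwise_mutual_information(a, b, c, d):
    """log( P(term, category) / (P(term) * P(category)) )."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term and category never co-occur
    return math.log(a * n / ((a + c) * (a + b)))


def information_gain(a, b, c, d):
    """Reduction in category entropy from knowing whether the term occurs."""
    n = a + b + c + d
    cells = (
        (a, a + b, a + c),  # term present, in category
        (b, a + b, b + d),  # term present, not in category
        (c, c + d, a + c),  # term absent, in category
        (d, c + d, b + d),  # term absent, not in category
    )
    total = 0.0
    for joint, term_margin, cat_margin in cells:
        if joint:  # empty cells contribute nothing
            total += (joint / n) * math.log(joint * n / (term_margin * cat_margin))
    return total
```

A term that is spread evenly across categories scores zero on all three measures, while a term that occurs only inside one category scores high, which is what makes these scores usable for ranking candidate features.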

Bayesian Classifier
• Recent research
• Naïve Bayes classifiers are competitive with other techniques in accuracy
• Fast: a single pass, and new documents are classified quickly
• ATHENA: EDBT 2000
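To illustrate why a Naïve Bayes classifier needs only a single pass over the training data, here is a minimal sketch. It assumes a multinomial model over tokenized documents with add-one (Laplace) smoothing; neither assumption comes from the slides, and the class and method names are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.doc_counts = Counter()               # documents per category
        self.term_counts = defaultdict(Counter)   # term counts per category
        self.vocab = set()

    def fit(self, labeled_docs):
        # Training is one pass: only counts are accumulated.
        for tokens, label in labeled_docs:
            self.doc_counts[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label, count in self.doc_counts.items():
            total_terms = sum(self.term_counts[label].values())
            score = math.log(count / n_docs)  # log prior
            for term in tokens:
                if term in self.vocab:  # ignore terms never seen in training
                    smoothed = (self.term_counts[label][term] + 1) / (total_terms + vocab_size)
                    score += math.log(smoothed)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Classifying a new document costs one lookup per token per category, which matches the "fast" claim on the slide.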

Machine Learning Approaches: kNN Classifier
[Figure: a document d whose category is unknown ("?")]
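The kNN idea for a new document d can be sketched as follows. This is a minimal example under stated assumptions, not the method from the slides: documents are bags of words, similarity is cosine similarity over raw term counts, and the k nearest neighbors vote with their similarity as weight. The function names and the default `k=3` are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two term-count mappings."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(tokens, training, k=3):
    """Label a tokenized document by a similarity-weighted vote of its k nearest neighbors."""
    query = Counter(tokens)
    # Score every training document against the query, most similar first.
    scored = sorted(
        ((cosine(query, Counter(doc)), label) for doc, label in training),
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]
```

The function expects a non-empty training list of `(tokens, label)` pairs. Unlike Naïve Bayes, all the work happens at classification time, since every training document is compared with d.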

Machine Learning Approaches: Support Vector Machine
• Basic hypotheses: consistent hypotheses of the version space
• Project the original training data in space X to a higher-dimensional feature space F via a Mercer kernel operator K
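The projection from X to F never has to be computed explicitly, because a Mercer kernel returns the inner product in F directly from the original vectors. The sketch below checks this for one kernel chosen purely for illustration, the degree-2 polynomial kernel on 2-dimensional inputs; the slides do not say which kernel was used, and the function names are hypothetical.

```python
import math


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated entirely in the input space X."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map from 2-D X into 3-D F for the kernel above."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, 4.0)
# Both routes give the same value up to floating-point rounding.
print(poly_kernel(x, z), dot(phi(x), phi(z)))
```

Evaluating K costs one inner product in X, while the explicit route has to build the longer vectors in F first; that gap grows quickly with the input dimension and the kernel degree.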

What is Certainty?
• Rule for kNN
• Rule for SVM

Algorithm for Two-Stage Automatic Text Categorization

ALGORITHM Two-Stage-Text-Categorization (input: document d) returns category C
  Static:
    Trained classifier: Traditional-Classifier
    The feature set: F
    The new feature set from user feedback: Ui for related category Ci
  For new document d
    C ← Traditional-Classifier(d)
    If NOT satisfy the rule of uncertainty
      Return C
    Else
      For all category Ci
        If d has the feature in F
          C ← Ci
          Return C
        End If
      Cj ← User-Input
      Uj ← Uj + User-Selected
      C ← Cj
    End If
    Return C
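The control flow of the two-stage algorithm can be sketched in Python as below. Several parts are assumptions, because the slides do not preserve them: the uncertainty rule is replaced by a hypothetical margin test (the top two classifier scores differ by less than `threshold`), the feature check is read as "d contains a term from the user-feedback set Ui of category Ci", and `score_fn` and `ask_user` are hypothetical callbacks standing in for the trained classifier and the user interaction.

```python
from typing import Callable, Dict, List, Set, Tuple


def is_uncertain(scores: Dict[str, float], threshold: float) -> bool:
    """Assumed rule: the two best category scores are closer than threshold."""
    ranked = sorted(scores.values(), reverse=True)
    return len(ranked) >= 2 and ranked[0] - ranked[1] < threshold


def two_stage_categorize(
    tokens: List[str],
    score_fn: Callable[[List[str]], Dict[str, float]],
    user_features: Dict[str, Set[str]],
    ask_user: Callable[[List[str]], Tuple[str, Set[str]]],
    threshold: float = 0.1,
) -> str:
    # Stage 1: the traditional classifier decides when it is certain.
    scores = score_fn(tokens)
    category = max(scores, key=scores.get)
    if not is_uncertain(scores, threshold):
        return category

    # Stage 2a: reuse features collected from earlier user feedback.
    terms = set(tokens)
    for candidate, features in user_features.items():
        if terms & features:
            return candidate

    # Stage 2b: ask the user, and remember the features they selected.
    chosen, selected = ask_user(tokens)
    user_features.setdefault(chosen, set()).update(selected)
    return chosen
```

Because `user_features` is updated in place, a term selected by the user once resolves later uncertain documents that contain it without asking again.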

Determining the Threshold of the Rule

Experiments

