Presenter: Yu-Ting LU. Authors: Harun Uğuz. 2011, KBS

A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm

Outlines • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments

Motivation • A major problem of text categorization is its large number of features. • Most of those are irrelevant noise that can mislead the classifier.

Objectives • A two-stage approach combining feature selection and feature extraction is used to improve the performance of text categorization.

Methodology

Methodology – pre-processing • Removal of stop-words (e.g., a, an, and, because, can, do, every, the, …) • Stemming (e.g., computer, computing, computation, computes → comput) • Term weighting • Pruning of the words: terms of the document collection that appear fewer than two times in the documents are pruned.
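The four pre-processing steps can be illustrated with a minimal sketch. Several parts are assumptions and not from the slides: the suffix list in `toy_stem` is a toy rule set (not the stemmer used by the authors), the weighting is assumed to be a plain tf-idf, and "appear less than two times" is read as total occurrences across the collection. The names `toy_stem` and `preprocess` are hypothetical.

```python
import math
from collections import Counter

# Stop words taken from the slide's example list (a real list would be longer).
STOP_WORDS = {"a", "an", "and", "because", "can", "do", "every", "the"}
# Toy suffix rules for illustration only; this is NOT a real stemming algorithm.
SUFFIXES = ("ation", "ing", "er", "es", "e", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(docs, min_count=2):
    """Return (vocabulary, document-term matrix) with assumed tf-idf weights."""
    # 1) stop-word removal and 2) stemming
    tokenized = [
        [toy_stem(w) for w in doc.lower().split() if w not in STOP_WORDS]
        for doc in docs
    ]
    # 4) pruning: drop terms occurring fewer than min_count times overall
    totals = Counter(t for doc in tokenized for t in doc)
    vocab = sorted(t for t, c in totals.items() if c >= min_count)
    # 3) term weighting: tf * log(N / df), an assumed weighting scheme
    n_docs = len(docs)
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


docs = ["the computer can do computing", "every computation computes", "an apple"]
vocab, matrix = preprocess(docs)
```

Each row of `matrix` corresponds to one document and each column to one surviving term, which is the shape of the document-term matrix used in the later stages.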

Methodology – feature ranking with information gain • Using the IG method, the terms in the text are ranked in decreasing order of their importance for the classification.
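As a sketch of how such a ranking might be computed, the block below uses the standard definition of information gain for a term (class entropy minus the entropy after splitting documents on presence or absence of the term). Representing each document as a set of terms, the function names, and the alphabetical tie-breaking are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) H(C | t) - P(not t) H(C | not t)."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    return (
        entropy(labels)
        - len(with_t) / n * entropy(with_t)
        - len(without_t) / n * entropy(without_t)
    )


def rank_terms(docs, labels):
    """Terms sorted by decreasing IG; ties are broken alphabetically."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))


# Toy data: each document is a set of terms.
docs = [{"wheat", "price"}, {"wheat", "farm"}, {"oil", "price"}, {"oil", "barrel"}]
labels = ["grain", "grain", "crude", "crude"]
ranked = rank_terms(docs, labels)
```

In this toy data, "wheat" and "oil" separate the two classes perfectly and rank first, while "price" occurs equally in both classes and ranks last.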

Methodology – dimension reduction methods • Principal component analysis (p ≦ m) • Genetic algorithm for feature selection: individual's encoding (e.g., 11011, 00110, 01110, 11110), fitness function, selection, mutation, crossover
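The genetic-algorithm components named on this slide (bit-string encoding, fitness function, selection, crossover, mutation) can be sketched as follows. This is a generic illustration, not the authors' configuration: tournament selection, one-point crossover, bit-flip mutation, elitism, all parameter values, and the toy fitness function are assumptions.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Evolve 0/1 masks over features; returns (best mask, best-fitness history)."""
    rng = random.Random(seed)
    # Individual's encoding: bit i is 1 if feature i is selected, e.g. 11011.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        elite = max(population, key=fitness)
        history.append(fitness(elite))
        next_pop = [elite[:]]  # elitism (assumed): keep the best individual
        while len(next_pop) < pop_size:
            # Selection: binary tournament (an assumed selection scheme).
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            child = p1[:]
            # Crossover: one-point, applied with probability crossover_rate.
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)
                child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit independently with probability mutation_rate.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_pop.append(child)
        population = next_pop
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


# Hypothetical fitness: reward "useful" features, penalise the others.
USEFUL = {0, 2, 3}


def toy_fitness(mask):
    return sum(1 if i in USEFUL else -1 for i, bit in enumerate(mask) if bit)


best, history = genetic_feature_selection(5, toy_fitness)
```

In a text-categorization setting the fitness function would instead score a mask by how well a classifier performs on the selected features; the toy function here only keeps the example self-contained.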

Methodology – text categorization methods • KNN classifier • C4.5 decision tree classifier
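For readers unfamiliar with KNN, a minimal sketch of the classifier is given below. The Euclidean distance, the value k = 3, and the function name `knn_predict` are assumptions for illustration; the slides do not state which distance measure or k was used. C4.5 is not sketched here.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training vectors."""
    # Euclidean distance is an assumed choice of distance measure.
    neighbours = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Toy feature vectors (e.g. two reduced features per document).
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["a", "a", "a", "b", "b", "b"]
near_a = knn_predict(train_X, train_y, [0.2, 0.1])
near_b = knn_predict(train_X, train_y, [5.5, 5.2])
```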

Methodology – evaluation of the performance

Experiments – datasets • Reuters-21578 dataset • Classic3 dataset

Experiments – Reuters-21578 A document-term matrix is acquired with a dimension of 8158 × 7542 at the end of pre-processing.

Experiments – Reuters-21578

Experiments – Classic3 A document-term matrix is acquired with a dimension of 3891 × 6679 at the end of pre-processing.

Experiments – Classic3

Conclusions • Text categorization with the C4.5 decision tree and KNN algorithms is more successful using the fewer features selected via IG-PCA and IG-GA than using the features selected via IG alone. • Two-stage feature selection methods can improve the performance of text categorization.

Comments • Advantages - understand the basic methods • Applications - text categorization

