
Presenter: Yu-Ting LU. Authors: Harun Uğuz. 2011, KBS.

A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm.


Presentation Transcript


  1. A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm • Presenter: Yu-Ting LU • Authors: Harun Uğuz • 2011, KBS

  2. Outlines • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments

  3. Motivation • A major problem of text categorization is its large number of features. • Most of those are irrelevant noise that can mislead the classifier.

  4. Objectives • A two-stage approach combining feature selection and feature extraction is used to improve the performance of text categorization.

  5. Methodology

  6. Methodology – pre-processing • Removal of stop-words (e.g. a, an, and, because, can, do, every, the…) • Stemming (e.g. computer, computing, computation, computes → comput) • Term weighting of the terms of the document collection • Pruning of words: words that appear fewer than two times in the documents are pruned.
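
The pre-processing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: the stop-word list is just the slide's examples, `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer, pruning is assumed to count total occurrences across the collection, and term weighting is left out because the slide does not say which scheme is used.

```python
from collections import Counter

# Assumption: a tiny illustrative stop-word list taken from the slide's examples.
STOP_WORDS = {"a", "an", "and", "because", "can", "do", "every", "the"}

# Assumption: a crude suffix stripper standing in for a real stemming algorithm.
SUFFIXES = ("ation", "ing", "er", "es", "e", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(docs, min_count=2):
    """Tokenize, drop stop-words, stem, then prune terms seen fewer than min_count times."""
    tokenized = [
        [crude_stem(w) for w in doc.lower().split() if w not in STOP_WORDS]
        for doc in docs
    ]
    # Count each stemmed term over the whole collection before pruning.
    counts = Counter(term for doc in tokenized for term in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]
```

With this sketch, "computer", "computing", "computation" and "computes" all reduce to "comput", matching the stemming example on the slide.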

  7. Methodology – feature ranking with information gain • Each term in the text is ranked, in decreasing order, by its importance for classification using the IG method.
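
As a rough illustration of IG ranking, the sketch below scores each term by how much knowing whether it is present in a document reduces the entropy of the class label, then sorts terms in decreasing order. It assumes binary term presence and represents each document as a set of terms; the function names are hypothetical and the slide does not give the exact formulation used.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of a term: class entropy minus entropy conditioned on term presence."""
    present = [y for doc, y in zip(docs, labels) if term in doc]
    absent = [y for doc, y in zip(docs, labels) if term not in doc]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def rank_terms(docs, labels):
    """Return (term, IG) pairs sorted by decreasing information gain."""
    vocab = sorted({t for doc in docs for t in doc})
    scored = [(t, information_gain(docs, labels, t)) for t in vocab]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A term that appears only in documents of one class scores highest, while a term spread evenly across classes scores close to zero and ends up at the bottom of the ranking.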

  8. Methodology – dimension reduction methods • Principal component analysis (p ≦ m) • Genetic algorithm for feature selection: individual's encoding (e.g. 11011, 00110, 01110, 11110), fitness function, selection, crossover, mutation
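
The genetic algorithm components named on this slide (bit-string encoding, fitness function, selection, crossover, mutation) fit together roughly as in the sketch below. It is a generic GA, not the paper's configuration: tournament selection, single-point crossover, the rates, the population size and the `fitness` callback are all assumptions made for illustration. In practice the fitness would measure how well a classifier does on the features whose bit is 1.

```python
import random


def genetic_select(n_features, fitness, pop_size=20, generations=30,
                   crossover_rate=0.9, mutation_rate=0.05, seed=0):
    """Evolve bit-string masks; bit i == 1 means feature i is kept."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def tournament():
        # Selection: the fitter of two random individuals becomes a parent.
        a, b = rng.choice(pop), rng.choice(pop)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if n_features > 1 and rng.random() < crossover_rate:
                # Crossover: single cut point, head of p1 joined to tail of p2.
                cut = rng.randint(1, n_features - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        pop = children
        # Keep the best individual seen so far across generations.
        best = max(pop + [best], key=fitness)
    return best
```

A call such as `genetic_select(5, fitness)` returns a mask like the slide's `11011`, where each 1 marks a feature that is kept.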

  9. Methodology – text categorization methods • KNN classifier • C4.5 decision tree classifier
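
For reference, a KNN classifier can be sketched in a few lines. The version below assumes Euclidean distance, majority voting and k = 3, none of which are stated on the slide, and `knn_predict` is a hypothetical name. C4.5 is omitted because a faithful decision-tree implementation would not fit in a short example.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

Because every prediction computes a distance to each training document, reducing the number of features directly reduces the cost of classification.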

  10. Methodology – evaluation of the performance
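
The transcript does not preserve which measures this slide listed. Assuming the measures commonly reported for text categorization (precision, recall and F-measure, computed per category), a minimal sketch would be:

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall and F1 for one category treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a category is never predicted or never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The choice of these measures is an assumption here; how scores are averaged across categories is likewise not stated in the transcript.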

  11. Experiments – datasets • Reuters-21578 dataset • Classic3 dataset

  12. Experiments – Reuters-21578 A document-term matrix is acquired with a dimension of 8158 × 7542 at the end of pre-processing.

  13. Experiments – Reuters-21578

  14. Experiments – Classic3 A document-term matrix is acquired with a dimension of 3891 × 6679 at the end of pre-processing.

  15. Experiments – Classic3

  16. Conclusions • With both the C4.5 decision tree and KNN algorithms, text categorization using the smaller feature sets selected via IG-PCA and IG-GA is more successful than categorization using the features selected via IG alone. • Two-stage feature selection methods can improve the performance of text categorization.

  17. Comments • Advantages - understand the basic methods • Applications - text categorization
