
OCFS: Optimal Orthogonal Centroid Feature Selection for Text Categorization

Jun Yan, Ning Liu, Benyu Zhang, Shuicheng Yan, Zheng Chen, and Weiguo Fan. Microsoft Research Asia · Peking University · Tsinghua University · Chinese University of Hong Kong · Virginia Polytechnic Institute and State University



  1. OCFS: Optimal Orthogonal Centroid Feature Selection for Text Categorization Jun Yan, Ning Liu, Benyu Zhang, Shuicheng Yan, Zheng Chen, and Weiguo Fan Microsoft Research Asia Peking University Tsinghua University Chinese University of Hong Kong Virginia Polytechnic Institute and State University

  2. Outline • Motivation • Problem formulation • Related work • The OCFS algorithm • Experiments • Conclusion and future work

  3. Motivation • Dimension reduction (DR) is highly desired for web-scale text data • DR can improve both efficiency and effectiveness • Feature selection (FS) is more applicable than feature extraction (FE) • Most FS algorithms are greedy • Goal: a simple, effective, efficient, and optimal FS algorithm

  4. Outline • Motivation • Problem formulation • Related work • The OCFS algorithm • Experiments • Conclusion and future work

  5. Problem Formulation Dimension reduction: map data from R^d to R^p with p << d. Consider the linear case: suppose y = W^T x, where x ∈ R^d, W ∈ R^{d×p}, and y ∈ R^p.

  6. Problem Formulation FS: We denote the discrete solution space as H = {W ∈ R^{d×p} : w_ij ∈ {0, 1}, each column of W has exactly one nonzero entry}. The problem is: given a set of labeled training documents X, learn a transformation matrix W such that it is optimal according to some criterion J in the space H.
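To make the discrete solution space concrete, here is a minimal illustrative sketch (not from the paper) of feature selection as multiplication by a 0/1 selection matrix W:

```python
import numpy as np

# Feature selection as a projection y = W^T x: W is d x p, each column
# has exactly one entry equal to 1, so W^T x simply picks p coordinates of x.
d, p = 4, 2
W = np.zeros((d, p))
W[0, 0] = 1.0   # keep feature 0
W[2, 1] = 1.0   # keep feature 2

x = np.array([3.0, 1.0, 7.0, 5.0])
y = W.T @ x     # picks out x[0] and x[2]
```

Every W in the space H has this form, which is why FS is a combinatorial rather than a continuous optimization problem.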

  7. Outline • Motivation • Problem formulation • Related work • The OCFS algorithm • Experiments • Conclusion and future work

  8. Related Work – IG Information gain aims to select a group of p optimal features maximizing the summed scores, where each term t is scored by IG(t) = −Σ_{i=1}^c P(c_i) log P(c_i) + P(t) Σ_{i=1}^c P(c_i|t) log P(c_i|t) + P(t̄) Σ_{i=1}^c P(c_i|t̄) log P(c_i|t̄). Finding the globally optimal feature subset is NP-hard, so IG is computed greedily, one term at a time.
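The greedy per-term IG scoring can be sketched as follows (an illustrative sketch assuming binary term occurrence; `information_gain` and the array layout are my own names, not the paper's):

```python
import numpy as np

def information_gain(X, y):
    """Per-feature information gain for binary term-occurrence data.

    X: (n_docs, n_terms) 0/1 array; y: (n_docs,) class labels.
    Returns an array of IG scores, one per term (column).
    """
    n, d = X.shape
    classes = np.unique(y)
    eps = 1e-12                                   # avoid log(0)
    p_c = np.array([(y == c).mean() for c in classes])
    h_c = -(p_c * np.log2(p_c + eps)).sum()       # class entropy H(C)

    p_t = X.mean(axis=0)                          # P(t) per term
    scores = np.zeros(d)
    for j in range(d):
        present = X[:, j] == 1
        for p_side, mask in ((p_t[j], present), (1 - p_t[j], ~present)):
            if mask.sum() == 0:
                continue
            p_c_t = np.array([(y[mask] == c).mean() for c in classes])
            scores[j] += p_side * (p_c_t * np.log2(p_c_t + eps)).sum()
    return h_c + scores                           # IG = H(C) - H(C|t)
```

A perfectly class-predictive term receives IG equal to the class entropy, while a term independent of the labels receives IG near zero.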

  9. Related Work – CHI CHI aims to select a group of features by the statistic χ²(t, c_i) = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), where A (resp. B) is the number of documents in class c_i (resp. not in c_i) containing term t, C (resp. D) is the number of documents in class c_i (resp. not in c_i) without t, and N is the total number of documents; per-term scores are combined as χ²_avg(t) = Σ_{i=1}^c P(c_i) χ²(t, c_i) or χ²_max(t) = max_i χ²(t, c_i).
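The χ² statistic with the max combination can be sketched as follows (an illustrative sketch over binary term occurrence; `chi_square` is my own name):

```python
import numpy as np

def chi_square(X, y):
    """Per-term chi-square statistic, maximized over classes.

    X: (n_docs, n_terms) 0/1 array; y: (n_docs,) class labels.
    Returns one chi^2_max score per term (column).
    """
    n, d = X.shape
    scores = np.zeros(d)
    for c in np.unique(y):
        in_c = (y == c)
        A = X[in_c].sum(axis=0).astype(float)    # term present, class c
        B = X[~in_c].sum(axis=0).astype(float)   # term present, other classes
        C = in_c.sum() - A                       # term absent, class c
        D = (~in_c).sum() - B                    # term absent, other classes
        num = n * (A * D - C * B) ** 2
        den = (A + C) * (B + D) * (A + B) * (C + D)
        chi = np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)
        scores = np.maximum(scores, chi)         # chi^2_max over classes
    return scores
```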

  10. Outline • Motivation • Problem formulation • Related work • The OCFS algorithm • Experiments • Conclusion and future work

  11. Orthogonal Centroid Algorithm • Orthogonal Centroid (OC): an FE algorithm • Effective for DR in text classification problems • Its computation is based on QR matrix decomposition • Theorem: the solution of the OC algorithm equals the solution of the optimization problem W* = arg max_{W^T W = I} trace(W^T S_b W), where S_b = Σ_{i=1}^c (n_i/n)(m_i − m)(m_i − m)^T is the between-class scatter matrix, m_i and n_i are the centroid and size of class i, m is the global centroid, and n is the number of documents.

  12. Intuition of Our Work OC from the FE perspective: maximize the criterion J(W) = trace(W^T S_b W) over continuous matrices W ∈ R^{d×p} with W^T W = I.

  13. Intuition of Our Work By FE, W ranges over all continuous orthonormal matrices; by FS, W is restricted to the discrete space H. OC from the FS perspective: how to optimize J in the discrete space H?

  14. The OCFS Algorithm FS problem: suppose we want to preserve the mth and nth features of x and discard the others; then W is the d×2 matrix whose two columns are the standard basis vectors e_m and e_n.

  15. The OCFS Algorithm Optimization: for such a W, J(W) = trace(W^T S_b W) = Σ_{k∈{m,n}} Σ_{i=1}^c (n_i/n)(m_i^k − m^k)², so J decomposes into a sum of independent per-feature terms.

  16. The OCFS Algorithm OCFS: score each feature k by s(k) = Σ_{i=1}^c (n_i/n)(m_i^k − m^k)². Solution: select the p features with the p largest scores s(k).

  17. The OCFS Algorithm • Step 1: compute the centroid m_i of each class i = 1, …, c • Step 2: compute the global centroid m of all training samples • Step 3: compute the score s(k) = Σ_{i=1}^c (n_i/n)(m_i^k − m^k)² for every feature k • Step 4: select the indices of the p largest scores s(k)
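The OCFS scoring and selection is simple enough to sketch in a few lines (an illustrative implementation assuming a dense document-term matrix; the function name is mine):

```python
import numpy as np

def ocfs(X, y, p):
    """OCFS feature selection: score each feature by the class-size-weighted
    squared deviation of class centroids from the global centroid, keep top p.

    X: (n_docs, n_terms) array; y: (n_docs,) labels.
    Returns the indices of the p selected features.
    """
    n, d = X.shape
    m = X.mean(axis=0)                       # global centroid
    scores = np.zeros(d)
    for c in np.unique(y):
        Xc = X[y == c]
        mi = Xc.mean(axis=0)                 # class centroid
        scores += (len(Xc) / n) * (mi - m) ** 2
    return np.argsort(scores)[::-1][:p]      # indices of the p largest scores
```

One pass over the class centroids per feature is all that is needed, which is where the O(cd) complexity on the next slide comes from.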

  18. Algorithm Analysis • The number of selected features p is subject to E(p) ≥ η, where the energy function E(p) = (Σ_{k=1}^p s_(k)) / (Σ_{k=1}^d s_(k)), with s_(1) ≥ s_(2) ≥ … ≥ s_(d) the sorted feature scores and η a preset threshold

  19. Algorithm Analysis • Complexity: • Time complexity is O(cd) for c classes and d features • OCFS only computes the simple square function, instead of costlier functional computations such as the logarithm in IG

  20. Outline • Motivation • Problem formulation • Related work • The OCFS algorithm • Experiments • Conclusion and future work

  21. Experiments Setup • Datasets: • 20 Newsgroups (5 classes; 5,000 documents; 131,072 dimensions) • Reuters Corpus Volume 1 (4 classes; 800,000 documents; 500,000 dimensions) • Open Directory Project (13 classes) • Baselines: IG and CHI • Performance measurements: • CPU runtime • F1 • Classifier: SVM (SMO)

  22. Experimental Results – 20NG [charts: CPU runtime and F1 on 20NG]

  23. Experimental Results – 20NG

  24. Experimental Results – RCV1 [charts: CPU runtime and F1 on RCV1]

  25. Experimental Results – ODP [chart: F1 on ODP]

  26. Results Analysis • OCFS performs better than IG and CHI • It needs only about half their runtime • Its advantage is largest when the selected dimension is small • When the dimension is small, the optimal solution clearly outperforms greedy ones • As more features are selected, feature saturation makes additional features less valuable

  27. Outline • Motivation • Problem formulation • Related work • The OCFS algorithm • Experiments • Conclusion and future work

  28. Conclusion • We proposed a novel, efficient, and effective feature selection algorithm for text categorization • Main advantages: • optimal under its criterion • better classification performance • more efficient

  29. Future Work • Handle unbalanced data • Combine with other approaches, e.g. OCFS + PCA

  30. The End Thanks! Q & A
