
Feature Selection on Chinese Text Classification Using Character N-grams



  1. Feature Selection on Chinese Text Classification Using Character N-grams Tongji University, Key Laboratory "Embedded System and Service Computing", Ministry of Education Ph.D. candidate: Zhihua WEI Supervisors: Duoqian MIAO, Jean-Hugues CHAUCHAT

  2. Outline • Text representation • What is an n-gram? Why do we use n-grams? • Feature selection • Introduction of the experiments • Results and discussion

  3. Problems of Chinese text representation The problems of Chinese text processing: • Chinese words are not separated by explicit boundaries (there are no spaces between words). • Word segmentation is therefore necessary before any other preprocessing, and it requires a dictionary. • Word sense disambiguation (WSD) and unknown-word recognition are two main difficulties of word segmentation.

  4. An example: two segmentation possibilities Sentence: 流感到冬天很普遍。 It can be segmented as: • 流感 / 到 / 冬天 / 很 / 普遍 / 。 • 流 / 感到 / 冬天 / 很 / 普遍 / 。

  5. An example: two segmentation possibilities • 流感 / 到 / 冬天 / 很 / 普遍 / 。 (correct) Gloss: flu / arrive / winter / very / common, i.e. "Flu is common in winter." • *流 / 感到 / 冬天 / 很 / 普遍 / 。 (wrong) Gloss: flow / feel / winter / very / common, i.e. "The flow feels that winter is universal." Here "流感" (flu) is closely related to the document theme (e.g. medicine).

  6. Outline • Text representation • What is an n-gram? Why do we use n-grams? • Feature selection • Introduction of the experiments • Results and discussion

  7. What is a word n-gram? There are two kinds of n-gram definitions (word and character n-grams). Sentence: 我们明天去北京。 After word segmentation: 我们 | 明天 | 去 | 北京 1-grams: {我们; 明天; 去; 北京} 2-grams: {我们明天; 明天去; 去北京} 3-grams: {我们明天去; 明天去北京}

  8. What is a character n-gram? Character n-grams of 我们明天去北京。: 1-grams: {我; 们; 明; 天; 去; 北; 京} 2-grams: {我们; 们明; 明天; 天去; 去北; 北京} 3-grams: {我们明; 们明天; 明天去; 天去北; 去北京} In our work, we use character n-grams.
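The following minimal Python sketch (ours, not from the slides; the function names char_ngrams and word_ngrams are illustrative) shows how the two kinds of n-grams above can be extracted. Word n-grams need an already-segmented token list, while character n-grams are taken directly from the raw string.

```python
def char_ngrams(text, n):
    """Return all character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n):
    """Return all word n-grams of an already-segmented token list."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "我们明天去北京"  # example sentence, final punctuation removed
print(char_ngrams(sentence, 2))                       # ['我们', '们明', '明天', '天去', '去北', '北京']
print(word_ngrams(["我们", "明天", "去", "北京"], 2))  # ['我们明天', '明天去', '去北京']
```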

  9. Advantages and disadvantages of character n-grams Advantages: • Language-independent • Avoids the word segmentation problem Disadvantage: • Yields a very large number of n-grams

  10. Text representation by vectors We adopt the VSM (vector space model): each document is represented as a vector of n-gram features.
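As an illustration of the vector space model, the sketch below (our own; scikit-learn is an assumption, the presentation itself uses TANAGRA) builds a document-by-feature count matrix from character 1- and 2-grams:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["流感到冬天很普遍", "我们明天去北京"]  # toy corpus of two documents

# Character 1- and 2-grams as features; each document becomes a row of counts.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse document-by-feature matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())
```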

  11. Choosing "n" in n-grams • Results of Zhou Shuigeng et al. (2001), from best to worst: 1-,2-,3-,4-grams (best); 1-,2-grams; 2-grams; 2-,3-,4-grams; 1-grams (worst). • Result of Lelu et al. (1998): 2-grams.

  12. Choosing "n" in n-grams • "n" selected in our experiments: (1) 1-,2-grams; (2) 1-,2-,3-grams.

  13. Outline • Text representation • What is an n-gram? Why do we use n-grams? • Feature selection • Introduction of the experiments • Results and discussion

  14. Some definitions Each text in corpus D belongs to one class c_i, where c_i ∈ C and C = {c_1, c_2, …, c_n} is the class set defined before classification. • Text_freq_ij = the number of texts of class c_i that contain n-gram j. • Text_freq_relative_ij = Text_freq_ij / N_i, where N_i is the number of texts of class c_i in the training set. • Gram_freq_ij = the number of occurrences of n-gram j in all texts of class c_i in the training set. • Gram_freq_relative_ij = Gram_freq_ij / N'_i, where N'_i is the total number of n-gram occurrences in all texts of class c_i in the training set.
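A small sketch (ours; the function and variable names are hypothetical) of how the four statistics above could be computed from a training set given as (n-gram list, class label) pairs:

```python
from collections import Counter, defaultdict

def class_statistics(training_set):
    """training_set: list of (ngram_list, class_label) pairs.
    Returns the four statistics of this slide, keyed by class and by n-gram."""
    text_freq = defaultdict(Counter)   # Text_freq_ij : texts of class i containing n-gram j
    gram_freq = defaultdict(Counter)   # Gram_freq_ij : occurrences of n-gram j in class i
    n_texts = Counter()                # N_i  : number of texts per class
    n_grams = Counter()                # N'_i : total n-gram occurrences per class

    for ngrams, c in training_set:
        n_texts[c] += 1
        n_grams[c] += len(ngrams)
        gram_freq[c].update(ngrams)
        text_freq[c].update(set(ngrams))   # each n-gram counted once per text

    text_freq_rel = {c: {g: f / n_texts[c] for g, f in cnt.items()}
                     for c, cnt in text_freq.items()}
    gram_freq_rel = {c: {g: f / n_grams[c] for g, f in cnt.items()}
                     for c, cnt in gram_freq.items()}
    return text_freq, text_freq_rel, gram_freq, gram_freq_rel
```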

  15. Feature selection • Two steps: • Inter-class feature number reduction • Cross-class feature selection

  16. Inter-class feature number reduction (1) The relative text frequency is used as the threshold. Before selection: more than 15,000 features per class. After thresholding on Text_freq_relative_ij: with threshold 0.02, about 7,000 features per class remain; with threshold 0.03, about 4,000 features per class remain.
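Reading the figures above, the retained features appear to be those whose relative text frequency reaches the threshold (a higher threshold keeps fewer features). Using the text_freq_rel dictionary from the previous sketch, the per-class reduction could look like this (our sketch; the exact procedure on the slide may differ):

```python
def reduce_per_class(text_freq_rel, threshold=0.02):
    """Keep, for each class, only the n-grams whose relative text frequency
    reaches the threshold (assumed direction; higher threshold keeps fewer features)."""
    return {c: {g for g, f in stats.items() if f >= threshold}
            for c, stats in text_freq_rel.items()}
```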

  17. Inter-class feature number reduction (2)

  18. Cross-class feature selection (1) • Cross-class feature selection is done with the Chi-square statistic, where the observed value O_ij is selected as one of: 1. Text_freq_relative_ij; 2. Gram_freq_relative_ij; 3. Text_freq_ij; 4. Gram_freq_ij.
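The chi-square formula itself was an image and is not in the transcript; assuming the usual statistic χ²(j) = Σ_i (O_ij − E_ij)² / E_ij, with E_ij the expected value under class-feature independence, a per-feature score could be computed roughly as follows (our sketch, not the authors' code):

```python
def chi_square_scores(observed):
    """observed[c][g] = O_ij for class c and n-gram g (e.g. Gram_freq_ij).
    Returns one chi-square score per n-gram, summed over classes."""
    classes = list(observed)
    grams = {g for c in classes for g in observed[c]}
    class_total = {c: sum(observed[c].values()) for c in classes}
    grand_total = sum(class_total.values())

    scores = {}
    for g in grams:
        gram_total = sum(observed[c].get(g, 0) for c in classes)
        score = 0.0
        for c in classes:
            expected = class_total[c] * gram_total / grand_total  # E_ij under independence
            if expected > 0:
                score += (observed[c].get(g, 0) - expected) ** 2 / expected
        scores[g] = score
    return scores  # keep the n-grams with the highest scores across classes
```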

  19. Cross-class feature selection (2)

  20. Outline • Text representation • What is an n-gram? Why do we use n-grams? • Feature selection • Introduction of the experiments • Results and discussion

  21. Experiment dataset: the distribution of the TanCorp-12 corpus (M = megabytes)

  22. Experiment scenarios

  23. Classifier • C-SVC classifier (from LIBSVM), the SVM formulation used here for the multi-class classification task. • Platform: TANAGRA (developed by Ricco Rakotomalala).
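The presentation uses the C-SVC implementation from LIBSVM inside TANAGRA; a rough Python equivalent (our substitution, not the setup actually used; labels and parameters are illustrative) is scikit-learn's SVC, which wraps LIBSVM:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Character 1-,2-gram counts fed into a C-SVC (multi-class handled one-vs-one by LIBSVM).
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 2)),
    SVC(C=1.0, kernel="linear"),
)
model.fit(["流感到冬天很普遍", "我们明天去北京"], ["medicine", "travel"])
print(model.predict(["明天流感很普遍"]))
```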

  24. Evaluation F1 measure for a binary classifier: F1 = 2 · P · R / (P + R), where P is precision and R is recall. F1 in our work (more than two classes): • Micro-F1: computed over all documents and classes pooled together. • Macro-F1: the average of the per-category F1 values.
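A minimal sketch of both averaging schemes using scikit-learn (our illustration; the class labels and predictions are made up):

```python
from sklearn.metrics import f1_score

# Toy multi-class predictions (labels are illustrative only).
y_true = ["sport", "sport", "finance", "health", "health", "health"]
y_pred = ["sport", "finance", "finance", "health", "sport", "health"]

print("Micro-F1:", f1_score(y_true, y_pred, average="micro"))  # pooled over all documents
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))  # mean of per-class F1 values
```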

  25. Outline • Text representation • What is an n-gram? Why do we use n-grams? • Feature selection • Introduction of the experiments • Results and discussion

  26. 1-,2-grams vs. 1-,2-,3-grams: Ex_3 > Ex_5 and Ex_2 > Ex_6, so 1-,2-grams are better!

  27. N-gram frequency vs. text frequency: Ex_1 > Ex_2, Ex_3 > Ex_4 and Ex_5 > Ex_6, so n-gram frequency is better!

  28. Absolute freq. vs. relative freq.: Ex_1 ≈ Ex_3 and Ex_2 ≈ Ex_4, so the two are similar!

  29. Sparseness comparison [Šilić et al., 2007] show that the computational time is linked more to the number of non-zero values in the document-by-feature cross-table than to its number of columns (features).

  30. Conclusion • Feature selection methods based on n-gram frequency (absolute or relative) consistently give better results than those based on text frequency (absolute or relative). • Relative frequency is not better than absolute frequency. • Methods based on n-gram frequency also produce denser document-by-feature matrices.

  31. Confusion Matrix

  32. Future work • Develop better methods for n-gram feature selection. • Test the approach on hierarchical classification tasks.

  33. Thank you! Questions?
