
Feature Selection on Chinese Text Classification Using Character N-grams



  1. Feature Selection on Chinese Text Classification Using Character N-grams Tongji University, Key Laboratory "Embedded System and Service Computing", Ministry of Education Ph.D. candidate: Zhihua WEI Supervisors: Duoqian MIAO, Jean-Hugues CHAUCHAT

  2. Outline • Text representation • What is an n-gram? Why do we use n-grams? • Feature selection • Introduction of the experiments • Results and discussion

  3. Problems of Chinese text representation The problems of Chinese text processing: • Chinese words are not separated by explicit boundaries (there are no spaces between words). • Word segmentation is therefore necessary before any other preprocessing, and it requires a dictionary. • Word sense disambiguation (WSD) and unknown-word recognition are two main difficulties of word segmentation.

  4. An example: two segmentation possibilities Sentence: 流感到冬天很普遍。 It can be segmented as: • 流感 / 到 / 冬天 / 很 / 普遍 / 。 • 流 / 感到 / 冬天 / 很 / 普遍 / 。

  5. An example: two segmentation possibilities • 流感 / 到 / 冬天 / 很 / 普遍 / 。 (correct) Gloss: flu / arrive / winter / very / common, i.e. "Flu is common in winter." • *流 / 感到 / 冬天 / 很 / 普遍 / 。 (wrong) Gloss: flow / feel / winter / very / common, i.e. "The flow feels that winter is universal." Here "流感" (flu) is closely related to the document theme (e.g. medicine).

  6. Outline • Text representation • What is an n-gram? Why do we use n-grams? • Feature selection • Introduction of the experiments • Results and discussion

  7. What is a word n-gram? There are two kinds of n-gram definitions (word and character n-grams). Sentence: 我们明天去北京。 After word segmentation: 我们 | 明天 | 去 | 北京 1-grams: {我们; 明天; 去; 北京} 2-grams: {我们明天; 明天去; 去北京} 3-grams: {我们明天去; 明天去北京}

  8. What is a character n-gram? Character n-grams of 我们明天去北京。: 1-grams: {我; 们; 明; 天; 去; 北; 京} 2-grams: {我们; 们明; 明天; 天去; 去北; 北京} 3-grams: {我们明; 们明天; 明天去; 天去北; 去北京} In our work, we use character n-grams.
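The following minimal Python sketch (ours, not from the slides; the function names char_ngrams and word_ngrams are illustrative) shows how the two kinds of n-grams above can be extracted. Word n-grams need an already-segmented token list, while character n-grams are taken directly from the raw string.

```python
def char_ngrams(text, n):
    """Return all character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n):
    """Return all word n-grams of an already-segmented token list."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "我们明天去北京"  # example sentence, final punctuation removed
print(char_ngrams(sentence, 2))                       # ['我们', '们明', '明天', '天去', '去北', '北京']
print(word_ngrams(["我们", "明天", "去", "北京"], 2))  # ['我们明天', '明天去', '去北京']
```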

  9. Advantages and disadvantages of character n-grams Advantages: • Language-independent • Avoids the word segmentation problem Disadvantage: • Yields a very large number of n-grams

  10. Text representation by vectors We adopt the VSM (vector space model): each document is represented as a vector of n-gram features.
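As an illustration of the vector space model, the sketch below (our own; scikit-learn is an assumption, the presentation itself uses TANAGRA) builds a document-by-feature count matrix from character 1- and 2-grams:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["流感到冬天很普遍", "我们明天去北京"]  # toy corpus of two documents

# Character 1- and 2-grams as features; each document becomes a row of counts.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse document-by-feature matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())
```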

  11. Choosing "n" in n-grams • Results of Zhou Shuigeng et al. (2001), from best to worst: 1-,2-,3-,4-grams (best); 1-,2-grams; 2-grams; 2-,3-,4-grams; 1-grams (worst). • Result of Lelu et al. (1998): 2-grams.

  12. Choosing "n" in n-grams • "n" selected in our experiments: (1) 1-,2-grams; (2) 1-,2-,3-grams.

  13. Outline • Text representation • What is an n-gram? Why do we use n-grams? • Feature selection • Introduction of the experiments • Results and discussion

  14. Some definitions Each text in corpus D belongs to one class c_i, where c_i ∈ C and C = {c_1, c_2, …, c_n} is the class set defined before classification. • Text_freq_ij = the number of texts of class c_i that contain n-gram j. • Text_freq_relative_ij = Text_freq_ij / N_i, where N_i is the number of texts of class c_i in the training set. • Gram_freq_ij = the number of occurrences of n-gram j in all texts of class c_i in the training set. • Gram_freq_relative_ij = Gram_freq_ij / N'_i, where N'_i is the total number of n-gram occurrences in all texts of class c_i in the training set.
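A small sketch (ours; the function and variable names are hypothetical) of how the four statistics above could be computed from a training set given as (n-gram list, class label) pairs:

```python
from collections import Counter, defaultdict

def class_statistics(training_set):
    """training_set: list of (ngram_list, class_label) pairs.
    Returns the four statistics of this slide, keyed by class and by n-gram."""
    text_freq = defaultdict(Counter)   # Text_freq_ij : texts of class i containing n-gram j
    gram_freq = defaultdict(Counter)   # Gram_freq_ij : occurrences of n-gram j in class i
    n_texts = Counter()                # N_i  : number of texts per class
    n_grams = Counter()                # N'_i : total n-gram occurrences per class

    for ngrams, c in training_set:
        n_texts[c] += 1
        n_grams[c] += len(ngrams)
        gram_freq[c].update(ngrams)
        text_freq[c].update(set(ngrams))   # each n-gram counted once per text

    text_freq_rel = {c: {g: f / n_texts[c] for g, f in cnt.items()}
                     for c, cnt in text_freq.items()}
    gram_freq_rel = {c: {g: f / n_grams[c] for g, f in cnt.items()}
                     for c, cnt in gram_freq.items()}
    return text_freq, text_freq_rel, gram_freq, gram_freq_rel
```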

  15. Feature selection • Two steps: • Inter-class feature number reduction • Cross-class feature selection

  16. Inter-class feature number reduction (1) The relative text frequency is used as the threshold. Before selection: more than 15,000 features per class. After thresholding on Text_freq_relative_ij: with threshold 0.02, about 7,000 features per class remain; with threshold 0.03, about 4,000 features per class remain.
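Reading the figures above, the retained features appear to be those whose relative text frequency reaches the threshold (a higher threshold keeps fewer features). Using the text_freq_rel dictionary from the previous sketch, the per-class reduction could look like this (our sketch; the exact procedure on the slide may differ):

```python
def reduce_per_class(text_freq_rel, threshold=0.02):
    """Keep, for each class, only the n-grams whose relative text frequency
    reaches the threshold (assumed direction; higher threshold keeps fewer features)."""
    return {c: {g for g, f in stats.items() if f >= threshold}
            for c, stats in text_freq_rel.items()}
```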

  17. Inter-class feature number reduction (2)

  18. Cross-class feature selection (1) • Cross-class feature selection is done with the Chi-square statistic, where the observed value O_ij is selected as one of: 1. Text_freq_relative_ij; 2. Gram_freq_relative_ij; 3. Text_freq_ij; 4. Gram_freq_ij.
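The chi-square formula itself was an image and is not in the transcript; assuming the usual statistic χ²(j) = Σ_i (O_ij − E_ij)² / E_ij, with E_ij the expected value under class-feature independence, a per-feature score could be computed roughly as follows (our sketch, not the authors' code):

```python
def chi_square_scores(observed):
    """observed[c][g] = O_ij for class c and n-gram g (e.g. Gram_freq_ij).
    Returns one chi-square score per n-gram, summed over classes."""
    classes = list(observed)
    grams = {g for c in classes for g in observed[c]}
    class_total = {c: sum(observed[c].values()) for c in classes}
    grand_total = sum(class_total.values())

    scores = {}
    for g in grams:
        gram_total = sum(observed[c].get(g, 0) for c in classes)
        score = 0.0
        for c in classes:
            expected = class_total[c] * gram_total / grand_total  # E_ij under independence
            if expected > 0:
                score += (observed[c].get(g, 0) - expected) ** 2 / expected
        scores[g] = score
    return scores  # keep the n-grams with the highest scores across classes
```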

  19. Cross-class feature selection (2)

  20. Outline • Text representation • What is an n-gram? Why do we use n-grams? • Feature selection • Introduction of the experiments • Results and discussion

  21. Experiment dataset: the distribution of the TanCorp-12 corpus (M = megabytes)

  22. Experiment scenarios

  23. Classifier • C-SVC classifier (from LIBSVM), the SVM formulation used here for the multi-class classification task. • Platform: TANAGRA (developed by Ricco Rakotomalala).
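The presentation uses the C-SVC implementation from LIBSVM inside TANAGRA; a rough Python equivalent (our substitution, not the setup actually used; labels and parameters are illustrative) is scikit-learn's SVC, which wraps LIBSVM:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Character 1-,2-gram counts fed into a C-SVC (multi-class handled one-vs-one by LIBSVM).
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 2)),
    SVC(C=1.0, kernel="linear"),
)
model.fit(["流感到冬天很普遍", "我们明天去北京"], ["medicine", "travel"])
print(model.predict(["明天流感很普遍"]))
```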

  24. Evaluation F1 measure for a binary classifier: F1 = 2 · P · R / (P + R), where P is precision and R is recall. F1 in our work (more than two classes): • Micro-F1: computed over all documents and classes pooled together. • Macro-F1: the average of the per-category F1 values.
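A minimal sketch of both averaging schemes using scikit-learn (our illustration; the class labels and predictions are made up):

```python
from sklearn.metrics import f1_score

# Toy multi-class predictions (labels are illustrative only).
y_true = ["sport", "sport", "finance", "health", "health", "health"]
y_pred = ["sport", "finance", "finance", "health", "sport", "health"]

print("Micro-F1:", f1_score(y_true, y_pred, average="micro"))  # pooled over all documents
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))  # mean of per-class F1 values
```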

  25. Outline • Text representation • What is an n-gram? Why do we use n-grams? • Feature selection • Introduction of the experiments • Results and discussion

  26. 1-,2-grams vs. 1-,2-,3-grams: Ex_3 > Ex_5 and Ex_2 > Ex_6, so 1-,2-grams are better!

  27. N-gram frequency vs. text frequency: Ex_1 > Ex_2, Ex_3 > Ex_4 and Ex_5 > Ex_6, so n-gram frequency is better!

  28. Absolute freq. vs. relative freq.: Ex_1 ≈ Ex_3 and Ex_2 ≈ Ex_4, so the two are similar!

  29. Sparseness comparison [Šilić et al., 2007] show that the computational time is linked more to the number of non-zero values in the document-by-feature cross-table than to its number of columns (features).

  30. Conclusion • Feature selection methods based on n-gram frequency (absolute or relative) consistently give better results than those based on text frequency (absolute or relative). • Relative frequency is not better than absolute frequency. • Methods based on n-gram frequency also produce denser document-by-feature matrices.

  31. Confusion Matrix

  32. Future work • Develop better methods for n-gram feature selection. • Test the approach on hierarchical classification tasks.

  33. Thank you! Questions?
