Extracting key-substring-group features for text classification

Extracting key-substring-group features for text classification. Advisor: Dr. Hsu. Graduate: Chen, Shao-Pei. Authors: Dell Zhang, Wee Sun Lee. 2006.SIGKDD.10. Outline: Motivation, Objective, Methodology, Experimental Results, Conclusion.

Presentation Transcript


  1. Extracting key-substring-group features for text classification. Advisor: Dr. Hsu. Graduate: Chen, Shao-Pei. Authors: Dell Zhang, Wee Sun Lee. 2006.SIGKDD.10

  2. Outline • Motivation • Objective • Methodology • Experimental Results • Conclusion

  3. Motivation
     • The huge number of substrings in the training corpus D is an obstacle to using discriminative learning methods for text classification.
     • A document of length |d| has at least |d|(|d|+1)/2 substring matches with itself. For example, the string "abc" (|d| = 3) has 3·4/2 = 6 substrings: a, b, c, ab, bc, abc.
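The quadratic growth mentioned on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `substring_occurrences` is made up here and is not part of the presented method.

```python
def substring_occurrences(d: str) -> list[str]:
    """Return every non-empty slice d[i:j] of d, repeats included."""
    return [d[i:j] for i in range(len(d)) for j in range(i + 1, len(d) + 1)]


doc = "abc"
subs = substring_occurrences(doc)
n = len(doc)

# A string of length n has n(n+1)/2 non-empty slices.
assert len(subs) == n * (n + 1) // 2
print(subs)  # ['a', 'ab', 'abc', 'b', 'bc', 'c']

# Even a short 1000-character document already yields 500500 slices.
print(len(substring_occurrences("x" * 1000)))  # 500500
```

Enumerating substrings explicitly like this is exactly what becomes impractical on a real corpus, which is why the deck turns to suffix trees.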

  4. Objective
     • We propose a key-substring-group feature extraction technique that addresses the effectiveness and efficiency problems of the string kernel and opens numerous promising directions.
     • We propose a suffix-tree-based algorithm that can extract such features in linear time.
     (Figure: the suffixes of "abc": abc, bc, c.)

  5. Methodology: suffix tree
     Example: the suffixes of S = xabxa$, numbered by start position: 1 xabxa$, 2 abxa$, 3 bxa$, 4 xa$, 5 a$, 6 $. Grouped by shared prefix, as in the suffix tree: {1 xabxa$, 4 xa$}, {2 abxa$, 5 a$}, 3 bxa$, 6 $.
     • Theorem 2. Given an arbitrary string P, we can locate P in S in O(|P|) time using the suffix tree T for S (enumerating all k occurrences then takes O(|P| + k) time).
     • Theorem 3. If P is a substring of S, the occurrence frequency of P in S equals the number of leaves in the subtree of T rooted at P.
     • Theorem 5. Ukkonen's algorithm can construct the suffix tree T for a string S of length n, along with all its suffix links, in O(n) time.
     • Theorem 8. Suppose the corpus is of size n. Then there are n trivial groups, whose substrings occur only once in D, and at most n-1 non-trivial groups.
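Theorem 3 can be illustrated on the slide's example string xabxa$. The sketch below is an assumption-laden toy, not the presented algorithm: it builds an uncompressed suffix trie by naive insertion in O(n²) time rather than a suffix tree with Ukkonen's O(n) construction, and the names `Node`, `build_suffix_trie`, `count_leaves` and `frequency` are hypothetical. The leaf-counting property is the same in both structures.

```python
class Node:
    """One trie node; children are keyed by a single character."""

    def __init__(self):
        self.children = {}


def build_suffix_trie(s: str) -> Node:
    """Insert every suffix of s + '$' into a trie (naive, O(n^2))."""
    assert "$" not in s, "'$' is reserved as the unique terminator"
    s += "$"
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
    return root


def count_leaves(node: Node) -> int:
    """Number of leaves below node (a node without children is a leaf)."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children.values())


def frequency(root: Node, p: str) -> int:
    """Occurrences of p in the indexed string: walk down p, then count leaves."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return 0  # p is not a substring
    return count_leaves(node)


root = build_suffix_trie("xabxa")
print(frequency(root, "xa"))   # 2 (suffixes 1 and 4)
print(frequency(root, "a"))    # 2 (suffixes 2 and 5)
print(frequency(root, "abx"))  # 1
print(frequency(root, "ba"))   # 0
```

The walk down `p` costs O(|P|) steps, matching the lookup cost in Theorem 2; the leaf count here is recomputed on each query, whereas a real implementation would typically store it per node.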

  6. Methodology: algorithm
     • l: the minimum frequency.
     • h: the maximum frequency.
     • b: the minimum number of branches (children).
     • p: the maximum parent-child conditional probability.
     • q: the maximum suffix-link conditional probability.
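One way to read the five parameters is as a filter over per-node statistics of the suffix tree. The sketch below rests on several assumptions that the slide does not state: that the parent-child conditional probability is freq(node) / freq(parent), that the suffix-link conditional probability is freq(node) / freq(suffix-link target), and that "default" means l = 2, h = ∞, b = 2, p = q = 1. The names `GroupStats` and `is_key_group`, and the example numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    freq: int              # occurrences of the group's substrings in the corpus
    branches: int          # number of children of the suffix-tree node
    parent_freq: int       # frequency of the parent node
    suffix_link_freq: int  # frequency of the node reached via the suffix link


def is_key_group(g: GroupStats, l: int, h: float, b: int, p: float, q: float) -> bool:
    """Keep a substring group only if it passes all five thresholds."""
    if not (l <= g.freq <= h):
        return False  # too rare or too common
    if g.branches < b:
        return False  # too few distinct continuations
    if g.freq / g.parent_freq > p:
        return False  # assumed definition: almost implied by its parent
    if g.freq / g.suffix_link_freq > q:
        return False  # assumed definition: almost implied by its suffix
    return True


# Statistics for two internal nodes of the xabxa$ example (6 leaves at the root).
groups = {
    "xa": GroupStats(freq=2, branches=2, parent_freq=6, suffix_link_freq=2),
    "a": GroupStats(freq=2, branches=2, parent_freq=6, suffix_link_freq=6),
}

# Assumed defaults: nothing non-trivial is filtered out.
print([s for s, g in groups.items() if is_key_group(g, 2, float("inf"), 2, 1.0, 1.0)])
# ['xa', 'a']

# Tightening q drops "xa", which occurs exactly as often as its suffix "a".
print([s for s, g in groups.items() if is_key_group(g, 2, float("inf"), 2, 1.0, 0.9)])
# ['a']
```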

  7. Experimental Results
     • Parameters l, h, b, p and q: initially we took the default values, which do not filter out any non-trivial substring group. Then we gradually reduced the number of features by adjusting the parameters.
     • English Text Topic Classification
     • Chinese Text Topic Classification: 87.3%
     • Greek Text Authorship Classification: 92%
     • Greek Text Genre Classification: 94%

  8. Conclusion
     • Our proposed key-substring-group feature extraction technique addresses the effectiveness and efficiency problems of the string kernel and opens numerous promising directions.
