Using Error-Correcting Codes For Text Classification


Presentation Transcript


  1. Using Error-Correcting Codes For Text Classification Rayid Ghani rayid@cs.cmu.edu This presentation can be accessed at http://www.cs.cmu.edu/~rayid/talks/

  2. Outline • Introduction to ECOC • Intuition & Motivation • Some Questions? • Experimental Results • Semi-Theoretical Model • Types of Codes • Drawbacks • Conclusions

  3. Introduction • Decompose a multiclass classification problem into multiple binary problems • One-Per-Class Approach (moderately expensive) • All-Pairs (very expensive) • Distributed Output Code (efficient but what about performance?) • Error-Correcting Output Codes (?)
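
  To make the cost comparison concrete, a small back-of-the-envelope sketch in Python (the class count of 105 is taken from the Industry Sector dataset used later on slide 14; the 63-bit code length is only an illustrative choice):

    # Number of binary classifiers each decomposition needs for m classes.
    m = 105                       # classes in the Industry Sector dataset (slide 14)
    one_per_class = m             # one binary problem per class
    all_pairs = m * (m - 1) // 2  # one binary problem per pair of classes
    ecoc_bits = 63                # ECOC: whatever code length n is chosen, e.g. a 63-bit code
    print(one_per_class, all_pairs, ecoc_bits)   # 105 5460 63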

  4. Is it a good idea? • Larger margin for error since errors can now be “corrected” • One-per-class is a code with minimum Hamming distance (HD) = 2 • Distributed codes have low HD • The individual binary problems can be harder than before • Useless unless the number of classes is > 5

  5. Training ECOC • Given m distinct classes • Create an m x n binary matrix M. • Each class is assigned ONE row of M. • Each column of the matrix divides the classes into TWO groups. • Train the Base classifiers to learn the n binary problems.
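
  A minimal Python sketch of this training step, assuming a document-term matrix X, integer class labels y, a Naive Bayes base learner, and (for simplicity) a random code matrix standing in for the BCH codes used later in the talk:

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB   # assumed base learner

    def train_ecoc(X, y, n_bits, n_classes, seed=0):
        """Build an m x n code matrix and train one binary classifier per column."""
        rng = np.random.default_rng(seed)
        code = rng.integers(0, 2, size=(n_classes, n_bits))  # row i = codeword of class i
        classifiers = []
        for j in range(n_bits):
            bits_j = code[y, j]            # relabel each example with bit j of its class
            classifiers.append(MultinomialNB().fit(X, bits_j))
        return code, classifiers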

  6. Testing ECOC • To test a new instance • Apply each of the n classifiers to the new instance • Combine the predictions to obtain a binary string (codeword) for the new point • Classify to the class with the nearest codeword (usually Hamming distance is used as the distance measure)
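
  Continuing the sketch above, decoding a new instance amounts to predicting the n bits and picking the class whose codeword is nearest in Hamming distance:

    import numpy as np

    def predict_ecoc(X, code, classifiers):
        """Predict each bit, then assign each instance to the class with the nearest codeword."""
        bits = np.column_stack([clf.predict(X) for clf in classifiers])   # n_samples x n_bits
        hamming = (bits[:, None, :] != code[None, :, :]).sum(axis=2)      # distance to every codeword
        return hamming.argmin(axis=1)                                     # index of nearest class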

  7-10. ECOC - Picture • A worked example, built up over four slides: four classes A-D, a 5-bit code (columns f1-f5), and a test instance X

          f1 f2 f3 f4 f5
       A   0  0  1  1  0
       B   1  0  1  0  0
       C   0  1  1  1  0
       D   0  1  0  0  1
       X   1  1  1  1  0

  X's predicted codeword (1 1 1 1 0) is closest in Hamming distance to the codeword of class C.

  11. Questions? • How well does it work? • How long should the code be? • Do we need a lot of training data? • What kind of codes can we use? • Are there intelligent ways of creating the code?

  12. Previous Work • Combine with Boosting – ADABOOST.OC (Schapire, 1997), (Guruswami & Sahai, 1999) • Local Learners (Ricci & Aha, 1997) • Text Classification (Berger, 1999)

  13. Experimental Setup • Generate the code • BCH Codes • Choose a Base Learner • Naive Bayes Classifier as used in text classification tasks (McCallum & Nigam 1998)
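
  For readers who want to reproduce a similar setup today, scikit-learn ships an output-code wrapper; note that it draws random codes rather than the BCH codes used in the talk, so this is only an approximate stand-in:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OutputCodeClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # code_size is n / m, so 0.6 on 105 classes gives a 63-bit (random) code.
    model = make_pipeline(
        CountVectorizer(stop_words="english"),
        OutputCodeClassifier(MultinomialNB(), code_size=0.6, random_state=0),
    )
    # model.fit(train_docs, train_labels); model.predict(test_docs)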

  14. Dataset • Industry Sector Dataset • Consists of company web pages classified into 105 economic sectors • Standard stoplist • No Stemming • Skip all MIME headers and HTML tags • Experimental approach similar to McCallum et al. (1998) for comparison purposes.

  15. Results • Industry Sector Data Set • ECOC reduces the error of the Naïve Bayes classifier by 66% (comparison results taken from McCallum et al. 1998 and Nigam et al. 1999)

  16. The Longer the Better! • Longer codes mean larger codeword separation • The minimum Hamming distance of a code C is the smallest distance between any pair of distinct codewords in C • If the minimum Hamming distance is h, then the code can correct ⌊(h-1)/2⌋ errors • Table 2: Average classification accuracy on 5 random 50-50 train-test splits of the Industry Sector dataset with a vocabulary size of 10,000 words selected using Information Gain.
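
  As a sanity check, the minimum Hamming distance and the number of correctable errors are easy to compute directly; run on the one-per-class (identity) code for four classes, this returns h = 2 and 0 correctable errors, which is why slide 4 treats one-per-class as a weak code:

    import numpy as np

    def min_hamming_distance(code):
        """Smallest Hamming distance between any pair of distinct codewords."""
        m = len(code)
        return min(int(np.sum(code[i] != code[j]))
                   for i in range(m) for j in range(i + 1, m))

    one_per_class = np.eye(4, dtype=int)       # 4-class one-per-class code
    h = min_hamming_distance(one_per_class)
    print(h, (h - 1) // 2)                     # 2 0  -> corrects no errors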

  17. Size Matters?

  18. Size does NOT matter!

  19-21. Semi-Theoretical Model • Model ECOC by a Binomial distribution B(n, p), where n = length of the code and p = probability of each bit being classified incorrectly
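
  Under this model, a lower bound on classification accuracy is the probability that at most ⌊(h-1)/2⌋ of the n bits are wrong. A small sketch (the code length, minimum distance, and bit-error rate below are made-up illustrative numbers, not results from the talk):

    from math import comb

    def ecoc_accuracy_bound(n, p, h):
        """P(at most floor((h-1)/2) of the n independent bits are wrong) under B(n, p)."""
        t = (h - 1) // 2
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(t + 1))

    print(ecoc_accuracy_bound(n=63, p=0.20, h=31))   # hypothetical 63-bit code, 20% bit error rate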

  22. Types of Codes • Data-Independent: Hand-Constructed, Algebraic, Random • Data-Dependent: Adaptive

  23. What is a Good Code? • Row Separation • Column Separation (Independence of errors for each binary classifier) • Efficiency (for long codes)
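
  An illustrative helper (not from the talk) showing how the first two criteria can be measured on a code matrix; column separation also checks complements, since a column that is identical or complementary to another gives the base learners the same binary problem:

    import numpy as np

    def row_column_separation(code):
        """Return (row separation, column separation) of an m x n binary code matrix."""
        m, n = code.shape
        row_sep = min(int(np.sum(code[i] != code[j]))
                      for i in range(m) for j in range(i + 1, m))
        col_sep = min(min(int(np.sum(code[:, i] != code[:, j])),   # distance to the column
                          int(np.sum(code[:, i] == code[:, j])))   # distance to its complement
                      for i in range(n) for j in range(i + 1, n))
        return row_sep, col_sep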

  24. Choosing Codes

  25. Experimental Results

  26. Drawbacks • Can be computationally expensive • Random Codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems

  27. Future Work • Combine ECOC with Co-Training • Automatically construct optimal / adaptive codes

  28. Conclusion • Improves classification accuracy considerably! • Can be used when training data is sparse • Algebraic codes perform better than random codes for a given code length • Hand-constructed codes are not the answer
