
On Compression-Based Text Classification

On Compression-Based Text Classification. Yuval Marton (1), Ning Wu (2) and Lisa Hellerstein (2). 1) University of Maryland; 2) Polytechnic University. ECIR-05, Santiago de Compostela, Spain. March 2005.





Presentation Transcript


  1. On Compression-Based Text Classification Yuval Marton (1), Ning Wu (2) and Lisa Hellerstein (2). 1) University of Maryland; 2) Polytechnic University. ECIR-05, Santiago de Compostela, Spain. March 2005

  2. Compression for Text Classification?? • Proposed over the last ~10 years; not well understood why it works. • Compression is stupid! Slow! Non-standard! • But using compression tools is easy… • Does it work? (Controversy. Mess.)

  3. Overview • What is text classification? (problem setting) • Compression-based text classification • Classification procedures ( + Do it yourself! ) • Compression methods (RAR, LZW, and gzip) • Experimental evaluation • Why?? (Compression as a char-based method) • Influence of sub/super/non-word features • Conclusions and future work

  4. Text Classification • Given training corpus (labeled documents). • Learn how to label new (test) documents. • Our setting: • Single-class: document belongs to exactly one class. • 3 topic classification and 3 authorship attribution tasks.

  5. Classification by Compression • Compression programs build a model or dictionary of their input (language modeling). • Better model → better compression. • Idea: compress a document using different class models; label it with the class achieving the highest compression rate. • Minimum Description Length (MDL) principle: select the model minimizing the length of model + data.

  6. Standard MDL (Teahan & Harper) • Concatenate the training data of each class i (docs D1, D2, D3, …) into Ai. • Compress each Ai to build its model Mi, then FREEZE the model. • Compress T using each frozen Mi. • Assign T to its best compressor …and the winner is…
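The frozen-model idea above can be sketched in a few lines of Python. This is an illustrative approximation, not the authors' implementation: zlib's preset-dictionary feature stands in for a frozen class model (zlib keeps at most the last 32 KB of the dictionary), and the function and variable names are hypothetical.

```python
import zlib

def mdl_classify(class_corpora, test_doc):
    # class_corpora: {label: [training documents]}; returns the best label for test_doc.
    t = test_doc.encode("utf-8")
    best_label, best_size = None, float("inf")
    for label, docs in class_corpora.items():
        # "Model" M_i: the class's training text as a preset dictionary,
        # frozen before T is seen (zlib uses at most the last 32 KB of it).
        zdict = "".join(docs).encode("utf-8")[-32768:]
        comp = zlib.compressobj(level=9, zdict=zdict)
        size = len(comp.compress(t) + comp.flush())
        if size < best_size:  # the best compressor wins
            best_label, best_size = label, size
    return best_label
```

Note the 32 KB cap mirrors gzip's sliding-window limitation discussed later; a PPM-style compressor like RAR would not truncate the model this way.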

  7. Do it yourself Five minutes on how to classify text documents e.g., according to their topic or author, using only off-the-shelf compression tools (such as WinZip or RAR)…

  8. AMDL (Khmelev / Kukushkina et al. 2001) • Concatenate the training data of each class i (docs D1, D2, D3, …) into Ai. • Concatenate Ai and T into AiT. • Compress each Ai and each AiT. • Subtract compressed file sizes: vi = |AiT| - |Ai|. • Assign T to the class i with minimum vi.
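The AMDL steps can indeed be done with off-the-shelf tools, as the "do it yourself" slide promises. A minimal sketch, with zlib standing in for gzip/RAR and illustrative names:

```python
import zlib

def amdl_classify(class_corpora, test_doc):
    # AMDL: v_i = |compress(A_i + T)| - |compress(A_i)|; pick the class minimizing v_i.
    t = test_doc.encode("utf-8")
    best_label, best_v = None, float("inf")
    for label, docs in class_corpora.items():
        a = "".join(docs).encode("utf-8")  # concatenated training data A_i
        v = len(zlib.compress(a + t, 9)) - len(zlib.compress(a, 9))
        if v < best_v:
            best_label, best_v = label, v
    return best_label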

  9. BCN (Benedetto et al. 2002) • Like AMDL, but concatenate each doc Dj with T into DjT. • Compress each Dj and each DjT. • Subtract compressed file sizes: vDjT = |DjT| - |Dj|. • Assign T to the class i of the doc Dj with minimum vDjT.
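BCN differs from AMDL only in granularity: the subtraction is done per training document, and T takes the class of the single nearest document. A hedged sketch under the same assumptions as before (zlib in place of gzip, illustrative names):

```python
import zlib

def bcn_classify(class_corpora, test_doc):
    # BCN: per document, v = |compress(D_j + T)| - |compress(D_j)|;
    # assign T the class of the nearest single document (1-NN).
    t = test_doc.encode("utf-8")
    best_label, best_v = None, float("inf")
    for label, docs in class_corpora.items():
        for doc in docs:
            d = doc.encode("utf-8")
            v = len(zlib.compress(d + t, 9)) - len(zlib.compress(d, 9))
            if v < best_v:
                best_label, best_v = label, v
    return best_label
```

One compression call per document (rather than per class) is what makes BCN markedly slower in practice, as the results slides note.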

  10. Compression Methods • Gzip: Lempel-Ziv compression (LZ77). Dictionary-based; sliding window typically 32K. • LZW (Lempel-Ziv-Welch): dictionary-based (16-bit codes); dictionary fills up on big corpora (typically after ~300KB). • RAR (proprietary shareware): PPMII variant on text; Markov model over n-gram frequencies; (almost) unlimited context.

  11. Previous Work • Khmelev et al. (+ Kukushkina): Russian authors. • Thaper: LZ78, char- and word-based PPM. • Frank et al.: compression (PPM) bad for topic. • Teahan and Harper: compression (PPM) good. • Benedetto et al.: gzip good for authors. • Goodman: gzip bad! • Khmelev and Teahan: RAR (PPM). • Peng et al.: Markov language models.

  12. Compression Good or Bad? • Scoring: we measured accuracy = (total # correct classifications) / (total # tests), i.e., micro-averaged accuracy. • Why? Single-class labels, no tuning parameters.
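The scoring rule above is just one division; a tiny sketch (illustrative names) makes it concrete:

```python
def micro_accuracy(predictions, gold_labels):
    # Micro-averaged accuracy: total # correct classifications / total # tests.
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)
```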

  13. AMDL Results

  14. RAR is a Star! • RAR is the best-performing method on all but the small Reuters-9 corpus. • Poor performance of gzip on large corpora is due to its 32K sliding window. • Poor performance of LZW: its dictionary fills up after ~300KB, among other reasons.

  15. RAR on Standard Corpora: Comparison • 90.5% for RAR on 20news, vs.: 89.2% Language Modeling (Peng et al. 2004); 86.2% Extended NB (Rennie et al. 2003); 82.1% PPMC (Teahan and Harper 2001). • 89.6% for RAR on Sector, vs.: 93.6% SVM (Zhang and Oles 2001); 92.3% Extended NB (Rennie et al. 2003); 64.5% Multinomial NB (Ghani 2001).

  16. AMDL vs. BCN • Gzip / BCN good, due to processing each doc separately with T (1-NN). • Gzip / AMDL bad. • BCN was slow, probably due to more sys calls and disk I/O.

  17. Why Good?! • Compression tools are character-based. (Stupid, remember?) • Better than word-based? WHY? Can they capture • sub-word • word • super-word • non-word features?

  18. Pre-processing (“the more – the better!”) • STD: no change to input. • NoP: remove punctuation; replace white space (tab, line, paragraph & page breaks) with spaces (“the more the better”). • WOS: NoP + Word Order Scrambling (“better the the more”). • RSW: NoP + random-string words (“dqf tmdw dqf lkwe”). • …and more…
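The three transforms above can be sketched in Python. This is an assumed reading of the slide (function names and the exact regex are illustrative, not the authors' scripts):

```python
import random
import re

def nop(text):
    # NoP: strip punctuation; collapse all white space (tabs, line, paragraph
    # and page breaks) into single spaces.
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def wos(text, seed=0):
    # WOS: NoP + Word Order Scrambling (destroys only super-word relations).
    words = nop(text).split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def rsw(text, seed=0):
    # RSW: NoP + replace each distinct word with a random string of the same
    # length, consistently across the whole text.
    rng = random.Random(seed)
    mapping, out = {}, []
    for w in nop(text).split():
        if w not in mapping:
            mapping[w] = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in w)
        out.append(mapping[w])
    return " ".join(out)
```

Feeding the STD, NoP, WOS, and RSW versions of a corpus to the same classifier isolates which feature level (sub-word, word, super-word, non-word) the compressor is exploiting.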

  19. Non-words: Punctuation • Intuition: • punctuation usage is characteristic of writing style (authorship attribution). • Results: • Accuracy remained the same, or even increased, in many cases. • RAR insensitive to punctuation removal.

  20. Super-words: Word Order Scrambling (WOS) • WOS removes punctuation and scrambles word order. • WOS leaves sub-word and word info intact; destroys super-word relations. • RAR: accuracy declined on all but one corpus → seems to exploit word sequences (n-grams?). An advantage over state-of-the-art SVM bag-of-words methods. • LZW & gzip: no consistent accuracy decline.

  21. Summary • Compared effectiveness of compression for text classification (compression methods × classification procedures). • RAR (PPM) is a star, under AMDL. BCN (1-NN) slow(er) and never better in accuracy. Compression good (Teahan and Harper); character-based Markov models good (Peng et al.). • Introduced pre-processing testing techniques: novel ways to test how compression (and other character-based methods) exploits sub/super/non-word features. RAR benefits from super-word info, which suggests word-based methods might benefit from it too.

  22. Future Research • Test / confirm results on more and bigger corpora. • Compare to state-of-the-art techniques: other compression / character-based methods; SVM; word-based n-gram language modeling (Peng et al.). • Word-based compression? • Use Standard MDL (Teahan and Harper): faster, better insight. • Sensitivity to class training-data imbalance. • When is throwing away data desirable for compression?

  23. Thank you!
