This paper explores the novel approach of compression-based text classification, highlighting its effectiveness and challenges. It provides an overview of various compression methods (such as RAR, LZW, and gzip) and outlines how these can be employed for classifying documents by topic or authorship. The study presents experimental evaluations and key findings, demonstrating the relationship between compression rates and classification accuracy. Insights into preprocessing techniques and model selection based on the Minimum Description Length (MDL) principle are also discussed, paving the way for future advancements in text classification methodologies.
On Compression-Based Text Classification
Yuval Marton1, Ning Wu2 and Lisa Hellerstein2
1) University of Maryland and 2) Polytechnic University
ECIR-05, Santiago de Compostela, Spain. March 2005
Compression for Text Classification?? • Proposed over the last ~10 years; why it works is not well understood. • Compression is stupid! Slow! Non-standard! • But using compression tools is easy… • Does it work? (Controversy. Mess.)
Overview • What is text classification (problem setting) • Compression-based text classification • Classification procedures ( + Do it yourself !) • Compression methods (RAR, LZW, and gzip) • Experimental evaluation • Why?? (Compression as a char-based method) • Influence of sub/super/non-word features • Conclusions and future work
Text Classification • Given training corpus (labeled documents). • Learn how to label new (test) documents. • Our setting: • Single-class: document belongs to exactly one class. • 3 topic classification and 3 authorship attribution tasks.
Classification by Compression • Compression programs build a model or dictionary of their input (language modeling). • Better model → better compression. • Idea: • Compress a document using different class models. • Label with the class achieving the highest compression rate. • Minimum Description Length (MDL) principle: select the model minimizing the combined length of model + data.
Standard MDL (Teahan &amp; Harper) • For each class i, concatenate its training documents (D1, D2, D3, …) into Ai. • Compress each Ai to build a model Mi, then FREEZE the model. • Compress T using each frozen Mi. • Assign T to its best compressor: "…and the winner is…" the class whose model Mi compresses T best.
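The frozen-model idea can be sketched with off-the-shelf tools. A minimal Python illustration, using zlib's preset dictionary (`zdict`) as a crude stand-in for a frozen per-class model; zlib consults only the last 32 KB of the dictionary, so this mimics the procedure rather than the paper's PPM setup, and the class corpora are invented:

```python
import zlib

def compressed_size(text: bytes, model: bytes) -> int:
    """Bytes needed to compress `text` against a frozen per-class dictionary.

    zlib's preset dictionary plays the role of the frozen model Mi:
    strings that occur in the class corpus compress to short back-references.
    """
    c = zlib.compressobj(9, zdict=model)
    return len(c.compress(text) + c.flush())

def classify_mdl(test_doc: bytes, class_corpora: dict[str, bytes]) -> str:
    # Assign T to the class whose (frozen) model compresses it best.
    return min(class_corpora,
               key=lambda cls: compressed_size(test_doc, class_corpora[cls]))
```

A test document compresses to fewer bytes against a dictionary built from its own class's training text than against another class's, which is exactly the decision rule above.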
Do it yourself Five minutes on how to classify text documents e.g., according to their topic or author, using only off-the-shelf compression tools (such as WinZip or RAR)…
AMDL (Khmelev / Kukushkina et al. 2001) • For each class i, concatenate its training documents (D1, D2, D3, …) into Ai. • Concatenate Ai and T into AiT. • Compress each Ai and each AiT. • Subtract compressed file sizes: vi = |AiT| − |Ai|. • Assign T to the class i with minimum vi.
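The AMDL steps above can be reproduced with any stock compressor. A hedged sketch, using gzip in place of the compressors the paper benchmarks; the class names and corpora below are invented:

```python
import gzip

def gzip_size(data: bytes) -> int:
    # mtime=0 keeps the gzip output deterministic across runs
    return len(gzip.compress(data, compresslevel=9, mtime=0))

def classify_amdl(test_doc: bytes, class_corpora: dict[str, bytes]) -> str:
    """AMDL: vi = |C(Ai T)| - |C(Ai)|; assign T to the class with minimum vi."""
    return min(class_corpora,
               key=lambda cls: gzip_size(class_corpora[cls] + test_doc)
                               - gzip_size(class_corpora[cls]))
```

The difference vi measures how many extra bytes T costs once the compressor has adapted to class i's training text.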
BCN (Benedetto et al. 2002) • Like AMDL, but concatenate each individual document Dj with T into DjT. • Compress each Dj and each DjT. • Subtract compressed file sizes: vj = |DjT| − |Dj|. • Assign T to the class of the document Dj with minimum vj.
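The per-document variant amounts to a one-nearest-neighbour rule. A sketch under the same assumptions as before (gzip standing in for the compressors studied; toy documents):

```python
import gzip

def gzip_size(data: bytes) -> int:
    return len(gzip.compress(data, compresslevel=9, mtime=0))

def classify_bcn(test_doc: bytes, docs_by_class: dict[str, list[bytes]]) -> str:
    """BCN: score every single document Dj by |C(Dj T)| - |C(Dj)|
    and return the class of the best-scoring document (1-NN)."""
    scores = {}
    for cls, docs in docs_by_class.items():
        for d in docs:
            scores[(cls, d)] = gzip_size(d + test_doc) - gzip_size(d)
    best_cls, _ = min(scores, key=scores.get)
    return best_cls
```

Note the cost: BCN makes one pair of compression calls per document rather than per class, which is consistent with the slowdown reported on the AMDL vs. BCN slide.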
Compression Methods • Gzip: Lempel-Ziv compression (LZ77). - "Dictionary"-based. - Sliding window, typically 32K. • LZW (Lempel-Ziv-Welch). - Dictionary-based (16-bit codes). - Dictionary fills up on big corpora (typically after ~300KB). • RAR (proprietary shareware). - PPMII variant on text: Markov model over n-gram frequencies. - (Almost) unlimited model size.
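The 32K sliding-window limitation is easy to demonstrate: a repeated block is only recognized if its first copy is still inside the window. A small illustration using Python's zlib (the same DEFLATE algorithm as gzip); the sizes are approximate:

```python
import random
import zlib

rng = random.Random(0)
far = bytes(rng.getrandbits(8) for _ in range(40_000))   # incompressible filler
near = bytes(rng.getrandbits(8) for _ in range(10_000))

# Second copy starts 40 KB back: beyond the 32 KB window, so no match is found
# and the duplicate must be re-encoded almost in full.
z_far = zlib.compress(far + far, 9)

# Second copy starts 10 KB back: inside the window, so it collapses
# into a handful of back-references.
z_near = zlib.compress(near + near, 9)
```

Here `z_far` stays close to the full 80 KB while `z_near` shrinks to little more than the first 10 KB copy. A class corpus larger than 32 KB is therefore mostly invisible to gzip when the test document is appended, which is the explanation given later for gzip's poor results on large corpora.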
Previous Work • Khmelev et al. (+ Kukushkina): Russian authors. • Thaper: LZ78, char- and word-based PPM. • Frank et al.: compression (PPM) bad for topic. Teahan and Harper: compression (PPM) good. • Benedetto et al.: gzip good for authors. Goodman: gzip bad! • Khmelev and Teahan: RAR (PPM). • Peng et al.: Markov Language Models.
Compression Good or Bad? • Scoring: we measured accuracy = (total # correct classifications) / (total # tests), i.e. micro-averaged accuracy. • Why? Single-class labels, no tuning parameters.
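Micro-averaged accuracy pools all test cases before dividing, so large classes dominate the score. A small sketch contrasting it with a macro (per-class) average, using invented labels:

```python
def micro_accuracy(gold: list[str], pred: list[str]) -> float:
    """Micro average: count every test case equally, then divide."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_accuracy(gold: list[str], pred: list[str]) -> float:
    """Macro average (for contrast): per-class accuracy, then unweighted mean."""
    per_class = []
    for c in set(gold):
        idx = [i for i, g in enumerate(gold) if g == c]
        per_class.append(sum(gold[i] == pred[i] for i in idx) / len(idx))
    return sum(per_class) / len(per_class)
```

With 8 test documents of class "a" (all correct) and 2 of class "b" (all wrong), micro accuracy is 0.8 while the macro average is only 0.5, which is why the averaging choice is worth stating.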
RAR is a Star! • RAR is the best-performing method on all but the small Reuters-9 corpus. • Poor performance of gzip on large corpora: due to its 32K sliding window. • Poor performance of LZW: its dictionary fills up after ~300KB, among other reasons.
RAR on Standard Corpora - Comparison • 90.5% for RAR on 20news: - 89.2% Language Modeling (Peng et al. 2004) - 86.2% Extended NB (Rennie et al. 2003) - 82.1% PPMC (Teahan and Harper 2001) • 89.6% for RAR on Sector: - 93.6% SVM (Zhang and Oles 2001) - 92.3% Extended NB (Rennie et al. 2003) - 64.5% Multinomial NB (Ghani 2001)
AMDL vs. BCN • Gzip / BCN good: it processes each document separately with T (1-NN behavior). • Gzip / AMDL bad. • BCN was slow, probably due to more system calls and disk I/O.
Why Good?! • Compression tools are character-based. (Stupid, remember?) • Better than word-based? WHY? Can they capture • sub-word • word • super-word • non-word features?
Pre-processing "the more – the better!" • STD: no change to input → "the more – the better!" • NoP: remove punctuation; replace white spaces (tab, line, paragraph &amp; page breaks) with spaces → "the more the better" • WOS: NoP + Word Order Scrambling → "better the the more" • RSW: NoP + random-string words → "dqf tmdw dqf lkwe" …and more…
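These transformations are easy to reproduce. A sketch, assuming a consistent random mapping per word for RSW (the slide's example maps "the" to the same string both times); the paper's exact punctuation and whitespace handling may differ:

```python
import random
import re
import string

def nop(text: str) -> str:
    """NoP: strip punctuation, collapse all whitespace runs to single spaces."""
    return " ".join(re.sub(r"[^\w\s]", "", text).split())

def wos(text: str, seed: int = 0) -> str:
    """WOS: NoP + scramble word order (destroys super-word relations)."""
    words = nop(text).split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def rsw(text: str, seed: int = 0) -> str:
    """RSW: NoP + replace each distinct word by a random letter string,
    consistently, so word identity survives but sub-word info does not."""
    rng = random.Random(seed)
    mapping: dict[str, str] = {}
    out = []
    for w in nop(text).split():
        if w not in mapping:
            mapping[w] = "".join(rng.choice(string.ascii_lowercase)
                                 for _ in range(len(w)))
        out.append(mapping[w])
    return " ".join(out)
```

Each variant keeps a different slice of the sub-word / word / super-word information, which is what lets the experiments isolate which features a compressor exploits.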
Non-words: Punctuation • Intuition: • punctuation usage is characteristic of writing style (authorship attribution). • Results: • Accuracy remained the same, or even increased, in many cases. • RAR insensitive to punctuation removal.
Super-words: Word Order Scrambling (WOS) • WOS removes punctuation and scrambles word order. • WOS leaves sub-word and word info intact; destroys super-word relations. • RAR: accuracy declined on all but one corpus, so RAR seems to exploit word sequences (n-grams?). An advantage over state-of-the-art bag-of-words methods such as SVM. • LZW &amp; gzip: no consistent accuracy decline.
Summary • Compared effectiveness of compression for text classification (compression methods × classification procedures). • RAR (PPM) is a star – under AMDL. - BCN (1-NN) slow(er) and never better in accuracy. - Compression good (Teahan and Harper). - Character-based Markov models good (Peng et al.). • Introduced pre-processing testing techniques: novel ways to test how compression (and other character-based methods) exploit sub/super/non-word features. - RAR benefits from super-word info. - Suggests word-based methods might benefit from it too.
Future Research • Test / confirm results on more and bigger corpora. • Compare to state-of-the-art techniques: • Other compression / character-based methods. • SVM. • Word-based n-gram language modeling (Peng et al.). • Word-based compression? • Use Standard MDL (Teahan and Harper): faster, better insight. • Sensitivity to class training-data imbalance. • When is throwing away data desirable for compression?