
On Compression-Based Text Classification

On Compression-Based Text Classification. Yuval Marton (1), Ning Wu (2) and Lisa Hellerstein (2). 1) University of Maryland, 2) Polytechnic University. ECIR-05, Santiago de Compostela, Spain. March 2005.


Presentation Transcript


  1. On Compression-Based Text Classification Yuval Marton (1), Ning Wu (2) and Lisa Hellerstein (2). 1) University of Maryland and 2) Polytechnic University. ECIR-05, Santiago de Compostela, Spain. March 2005

  2. Compression for Text Classification?? • Proposed in the last ~10 years; not well understood why it works. • Compression is stupid! Slow! Non-standard! • But using compression tools is easy… • Does it work? (Controversy. Mess.) Yuval Marton, Ning Wu, and Lisa Hellerstein ECIR-05

  3. Overview • What's text classification (problem setting) • Compression-based text classification • Classification procedures (+ do it yourself!) • Compression methods (RAR, LZW, and gzip) • Experimental evaluation • Why?? (Compression as a character-based method) • Influence of sub/super/non-word features • Conclusions and future work

  4. Text Classification • Given a training corpus (labeled documents). • Learn how to label new (test) documents. • Our setting: • Single-class: each document belongs to exactly one class. • 3 topic classification and 3 authorship attribution tasks.

  5. Classification by Compression • Compression programs build a model or dictionary of their input (language modeling). • Better model → better compression. • Idea: • Compress a document using different class models. • Label with the class achieving the highest compression rate. • Minimum Description Length (MDL) principle: select the model minimizing the total length of model + data.

  6. Standard MDL (Teahan & Harper) • Concatenate each class's training documents D1, D2, D3, … into Ai. • Compress each Ai to build a class model Mi; freeze the model. • Compress the test document T using each frozen model Mi. • Assign T to the class of its best compressor… and the winner is the class whose Mi compresses T best.
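A minimal sketch of the standard-MDL procedure above, assuming Python's zlib as the compressor. The function names are mine, and zlib's 32 KB preset dictionary (zdict) is only a rough stand-in for the frozen PPM models the slide describes: the last 32 KB of training data act as the "frozen" class model, and T is scored by its compressed size under each one.

```python
import zlib

def freeze_model(training_text: str) -> bytes:
    # "Freeze" a class model: zlib accepts a preset dictionary (zdict),
    # of which only the last 32 KB are usable.
    return training_text.encode()[-32768:]

def size_under_model(test_text: str, model: bytes) -> int:
    # Compressed size of the test document under a frozen class model.
    c = zlib.compressobj(level=9, zdict=model)
    return len(c.compress(test_text.encode()) + c.flush())

def mdl_classify(test_text: str, models: dict) -> str:
    # Assign T to the class whose frozen model compresses it best.
    return min(models, key=lambda name: size_under_model(test_text, models[name]))
```

Usage: build `models = {name: freeze_model(text) for name, text in training.items()}` once, then call `mdl_classify` per test document; unlike AMDL below, the training data is never re-compressed.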

  7. Do It Yourself Five minutes on how to classify text documents, e.g. according to their topic or author, using only off-the-shelf compression tools (such as WinZip, gzip, or RAR)…

  8. AMDL (Khmelev / Kukushkina et al. 2001) • Concatenate each class's training data D1, D2, D3, … into Ai; concatenate Ai and T into AiT. • Compress each Ai and each AiT. • Subtract compressed file sizes: vi = |AiT| - |Ai|. • Assign T to the class i with minimal vi.

  9. BCN (Benedetto et al. 2002) • Like AMDL, but concatenate each individual document Dj with T into DjT. • Compress each Dj and each DjT. • Subtract compressed file sizes: vj = |DjT| - |Dj|. • Assign T to the class of the document Dj with minimal vj.

  10. Compression Methods • Gzip: Lempel-Ziv compression (LZ77); "dictionary"-based; sliding window typically 32K. • LZW (Lempel-Ziv-Welch): dictionary-based (16-bit); dictionary fills up on big corpora (typically after ~300KB). • RAR (proprietary shareware): PPMII variant on text; Markov model over n-gram frequencies; (almost) unlimited memory.

  11. Previous Work • Khmelev et al. (+ Kukushkina): Russian authors. • Thaper: LZ78, char- and word-based PPM. • Frank et al.: compression (PPM) bad for topic. • Teahan and Harper: compression (PPM) good. • Benedetto et al.: gzip good for authors. • Goodman: gzip bad! • Khmelev and Teahan: RAR (PPM). • Peng et al.: Markov language models.

  12. Compression Good or Bad? Scoring: we measured accuracy = (total # correct classifications) / (total # tests), i.e. micro-averaged accuracy. Why? Single-class labels, no tuning parameters.
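The scoring measure is simple enough to state in code; a minimal sketch (function name is mine):

```python
def micro_accuracy(predicted, gold):
    # Micro-averaged accuracy: pool every test across all classes,
    # then divide total correct by total number of tests.
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Because every document carries exactly one label, this single ratio summarizes performance with no per-class weighting or threshold to tune.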

  13. AMDL Results

  14. RAR Is a Star! • RAR is the best-performing method on all but the small Reuters-9 corpus. • Poor performance of gzip on large corpora is due to its 32K sliding window. • Poor performance of LZW: its dictionary fills up after ~300KB, among other reasons.

  15. RAR on Standard Corpora - Comparison • 90.5% for RAR on 20news: • 89.2% Language Modeling (Peng et al. 2004) • 86.2% Extended NB (Rennie et al. 2003) • 82.1% PPMC (Teahan and Harper 2001) • 89.6% for RAR on Sector: • 93.6% SVM (Zhang and Oles 2001) • 92.3% Extended NB (Rennie et al. 2003) • 64.5% Multinomial NB (Ghani 2001)

  16. AMDL vs. BCN • Gzip / BCN good: due to processing each doc separately with T (1-NN). • Gzip / AMDL bad. • BCN was slow, probably due to more sys calls and disk I/O.

  17. Why Good?! • Compression tools are character-based. (Stupid, remember?) • Better than word-based? WHY? Can they capture • sub-word • word • super-word • non-word features?

  18. Pre-processing "the more – the better!" • STD: no change to input. • NoP: remove punctuation; replace white spaces (tab, line, paragraph & page breaks) with spaces → "the more the better". • WOS: NoP + word order scrambling → "better the the more". • RSW: NoP + random-string words → "dqf tmdw dqf lkwe". …and more…
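These transformations can be sketched directly; this is my reading of the slide's descriptions (ASCII punctuation only, and the random-string lengths in RSW are an arbitrary detail of mine), not the paper's exact preprocessing code:

```python
import random
import string

def nop(text: str) -> str:
    # NoP: strip (ASCII) punctuation, map all whitespace (tabs,
    # line/paragraph/page breaks) to single spaces.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())

def wos(text: str, seed: int = 0) -> str:
    # WOS: NoP plus word-order scrambling. Sub-word and word features
    # survive; super-word (word-sequence) features are destroyed.
    words = nop(text).split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def rsw(text: str, seed: int = 0) -> str:
    # RSW: NoP plus replacing each distinct word with a fixed random
    # string (same word -> same string), as in "the more the better"
    # -> "dqf tmdw dqf lkwe".
    rng = random.Random(seed)
    table = {}
    out = []
    for w in nop(text).split():
        if w not in table:
            table[w] = "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 6)))
        out.append(table[w])
    return " ".join(out)
```

Feeding a classifier STD vs. NoP vs. WOS vs. RSW versions of the same corpus isolates which feature level (non-word, super-word, word, sub-word) each compressor actually exploits.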

  19. Non-words: Punctuation • Intuition: punctuation usage is characteristic of writing style (authorship attribution). • Results: accuracy remained the same, or even increased, in many cases. • RAR insensitive to punctuation removal.

  20. Super-words: Word Seq. Order Scrambling (WOS) • WOS removes punctuation and scrambles word order. • WOS leaves sub-word and word info intact; destroys super-word relations. • RAR: accuracy declined in all but one corpus → seems to exploit word sequences (n-grams?). Advantage over state-of-the-art bag-of-words methods such as SVM. • LZW & gzip: no consistent accuracy decline.

  21. Summary • Compared effectiveness of compression for text classification (compression methods × classification procedures). • RAR (PPM) is a star - under AMDL. • BCN (1-NN) slower and never better in accuracy. • Compression good (Teahan and Harper); character-based Markov models good (Peng et al.). • Introduced pre-processing testing techniques: novel ways to test how compression (and other character-based methods) exploits sub/super/non-word features. • RAR benefits from super-word info. • Suggests word-based methods might benefit from it too.

  22. Future Research • Test / confirm results on more and bigger corpora. • Compare to state-of-the-art techniques: • other compression / character-based methods • SVM • word-based n-gram language modeling (Peng et al.) • word-based compression? • Use standard MDL (Teahan and Harper): faster, better insight. • Sensitivity to class training-data imbalance. • When is throwing away data desirable for compression?

  23. Thank you!
