
Using IR techniques to improve Automated Text Classification

NLDB04 – 9th International Conference on Applications of Natural Language to Information Systems. Using IR techniques to improve Automated Text Classification. Teresa Gonçalves, Paulo Quaresma (tcg@di.uevora.pt, pq@di.uevora.pt), Departamento de Informática, Universidade de Évora, Portugal.

Presentation Transcript


  1. NLDB04 – 9th International Conference on Applications of Natural Language to Information Systems Using IR techniques to improve Automated Text Classification Teresa Gonçalves, Paulo Quaresma tcg@di.uevora.pt, pq@di.uevora.pt Departamento de Informática Universidade de Évora, Portugal

  2. Overview
  • Application area: Text Categorisation
  • Research area: Machine Learning
  • Paradigm: Support Vector Machine
  • Written language: European Portuguese and English
  • Study: Evaluation of Preprocessing Techniques

  3. Datasets
  • Multilabel classification datasets
    • Each document can be classified into multiple concepts
  • Written language
    • European Portuguese: PAGOD dataset
    • English: Reuters dataset

  4. PAGOD dataset
  • Represents the decisions of the Portuguese Attorney General’s Office since 1940
  • Characteristics
    • 8151 documents, 96 Mbytes of characters
    • 68886 distinct words
    • Average document: 1339 words, 306 distinct
  • Taxonomy of 6000 concepts, of which around 3000 are used
  • 5 most used concepts (number of documents): 909, 680, 497, 410, 409

  5. Reuters-21578 dataset
  • Originally collected by the Carnegie Group from the Reuters newswire in 1987
  • Characteristics
    • 9603 train documents, 3299 test documents (ModApté split)
    • 31715 distinct words
    • Average document: 126 words, 70 distinct
  • Taxonomy of 135 concepts, 90 of which appear in the train/test sets
  • 5 most used concepts (number of documents)
    • Train set: 2861, 1648, 534, 428, 385
    • Test set: 1080, 718, 179, 148, 186

  6. Experiments
  • Document representation
    • Bag-of-words
    • Retain words’ frequency
    • Discard words that contain digits
  • Algorithm
    • Linear SVM (WEKA software package)
  • Classes of preprocessing experiments
    • Feature reduction/construction
    • Feature subset selection
    • Term weighting
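The slides use the linear SVM from the WEKA package; purely as an illustration (an assumption, not the authors' actual setup), the same representation can be sketched in Python with scikit-learn: bag-of-words counts, tokens containing digits discarded, and a linear SVM. The toy documents and labels are hypothetical, and the example is single-label for brevity although the paper's datasets are multilabel.

```python
# Minimal sketch (assumption: scikit-learn stand-in for the WEKA linear SVM used in the slides).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Bag-of-words: keep raw term frequencies; tokens containing digits never match the pattern.
token_pattern = r"(?u)\b[^\W\d]{2,}\b"

pipeline = make_pipeline(
    CountVectorizer(lowercase=True, token_pattern=token_pattern),
    LinearSVC(C=1.0),                    # linear SVM, default regularisation
)

# Hypothetical toy data, for illustration only.
docs = ["court decision on pensions", "grain exports rose in 1987"]
labels = [1, 0]
pipeline.fit(docs, labels)
print(pipeline.predict(["new decision on grain exports"]))
```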

  7. Feature reduction/construction
  • Uses linguistic information
    • red1: no use of linguistic information
    • red2: remove a list of non-relevant words (articles, pronouns, adverbs and prepositions)
    • red3: remove the red2 word list and transform each word into its lemma (its stem for the English dataset)
  • Portuguese
    • POLARIS, a Portuguese lexical database
  • English
    • FreeWAIS stop-list
    • Porter algorithm
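For the English dataset, red2 and red3 amount to stop-word removal followed by Porter stemming. A minimal sketch with NLTK, assuming its English stop-word list as a stand-in for the FreeWAIS stop-list (the Portuguese side, which relies on the POLARIS lexical database, is not shown):

```python
# Sketch of red2/red3 for English: stop-word removal, then Porter stemming.
# Assumption: NLTK's English stop-word list stands in for the FreeWAIS stop-list.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def red2(tokens):
    """Remove non-relevant words (articles, pronouns, adverbs, prepositions)."""
    return [t for t in tokens if t.lower() not in stop_words]

def red3(tokens):
    """red2 plus reduction of each remaining word to its stem."""
    return [stemmer.stem(t) for t in red2(tokens)]

print(red3(["The", "courts", "were", "deciding", "on", "pensions"]))
# e.g. ['court', 'decid', 'pension']
```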

  8. Feature subset selection
  • Uses a filtering approach
    • Keeps the features (words) that receive higher scores
  • Scoring functions
    • scr1: Term frequency
    • scr2: Mutual information
    • scr3: Gain ratio
  • Threshold value
    • scr1: the number of times each word appears in all documents
    • scr2, scr3: the same number of features as scr1
  • Experiments
    • sel1, sel50, sel100, sel200, sel400, sel800, sel1200, sel1600
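As a concrete illustration of the filtering approach above, here is a minimal sketch of scr1 with a frequency threshold. It assumes (my reading of the slides) that selN keeps every word whose total count over the whole collection is at least N; the function name and toy data are hypothetical.

```python
# Sketch of feature subset selection by filtering on term frequency (scr1).
# A word is kept when its total count over all documents reaches the threshold,
# e.g. sel50 would keep words appearing at least 50 times in the collection.
from collections import Counter

def select_by_term_frequency(tokenised_docs, threshold):
    counts = Counter()
    for doc in tokenised_docs:
        counts.update(doc)                       # raw term frequency over the collection
    return {w for w, c in counts.items() if c >= threshold}

docs = [["grain", "exports", "grain"], ["grain", "court"], ["court"]]
print(select_by_term_frequency(docs, threshold=2))   # {'grain', 'court'}
```

For scr2 and scr3 the slides keep the same number of features as scr1, i.e. the words are ranked by score and the top-k are retained instead of applying a count threshold.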

  9. Number of attributes (with respect to threshold values)

  10. Term weighting
  • Uses the document, collection and normalisation components
    • wgt1: binary representation with no collection component, normalised to unit length
    • wgt2: raw term frequency with no collection or normalisation component
    • wgt3: term frequency with no collection component, normalised to unit length
    • wgt4: term frequency divided by the collection component and normalised to unit length
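A minimal sketch of the four schemes, assuming the usual SMART-style reading of the components: term frequency for the document component, inverse document frequency (idf) for the collection component, and unit-length (cosine) normalisation. In particular, wgt4 is implemented here as tf × idf, which is an assumption about what "divided by the collection component" means on the slide.

```python
# Sketch of the four term-weighting schemes (wgt1-wgt4), assuming standard
# tf / idf / unit-length-normalisation components.
import math

def weight(tf_doc, df, n_docs, scheme):
    """tf_doc: {word: raw count} for one document; df: {word: document frequency}."""
    if scheme == "wgt1":      # binary representation, no collection component
        w = {t: 1.0 for t in tf_doc}
    elif scheme == "wgt2":    # raw term frequency, no collection or normalisation component
        return {t: float(f) for t, f in tf_doc.items()}
    elif scheme == "wgt3":    # term frequency, no collection component
        w = {t: float(f) for t, f in tf_doc.items()}
    elif scheme == "wgt4":    # term frequency combined with idf (assumed collection component)
        w = {t: f * math.log(n_docs / df[t]) for t, f in tf_doc.items()}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}    # normalise to unit length

doc = {"grain": 3, "court": 1}
print(weight(doc, df={"grain": 2, "court": 3}, n_docs=3, scheme="wgt4"))
```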

  11. Experimental results
  • Method
    • PAGOD: 10-fold cross-validation
    • Reuters: train and test sets (ModApté split)
  • Measures
    • Precision, recall and F1
    • Micro- and macro-averaging over the top 5 concepts
    • Significance tests at 95% confidence
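Micro-averaging pools the binary decisions over all concepts before computing precision, recall and F1, while macro-averaging computes the measures per concept and then averages them. A minimal sketch with scikit-learn on hypothetical multilabel predictions, for illustration only:

```python
# Sketch of micro- vs macro-averaged precision, recall and F1 for a
# multilabel problem (hypothetical indicator matrices).
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])   # rows: documents, columns: concepts
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

for avg in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```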

  12. PAGOD dataset

  13. Reuters dataset

  14. Results
  • PAGOD
    • Best combination: scr1 – red2 – wgt3 – sel1
    • Worst values: scr3 and wgt2 experiments
  • Reuters
    • Best combination: scr2 – (red1, red3) – (wgt3, wgt4) – sel400
    • Worst values: scr3 and wgt2 experiments

  15. Results – discussion
  • Worse values for PAGOD: written language? more difficult concepts to learn? more imbalanced dataset?
  • Best experiments differ between the two datasets: written language? area of the written documents?
  • SVM deals well with non-informative and non-independent features in different languages

  16. Future work
  • Explore
    • the impact of the imbalanced nature of the datasets
    • the use of morpho-syntactical information
    • other datasets
  • Try
    • more powerful document representations

  17. Scoring functions
  • scr1: Term frequency
    • The score is the number of times the feature appears in the dataset
  • scr2: Mutual information
    • Evaluates the worth of a feature A by measuring its mutual information, I(C;A), with respect to the class C
  • scr3: Gain ratio
    • The worth is the attribute’s gain ratio with respect to the class
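The slide names these measures without formulas; under the standard definitions (an assumption about the exact formulation used), they are:

```latex
% Standard definitions assumed for scr2 (mutual information) and scr3 (gain ratio).
% H denotes entropy; A is the attribute, C the class.
I(C;A) = H(C) - H(C \mid A)
       = \sum_{c}\sum_{a} P(c,a)\,\log\frac{P(c,a)}{P(c)\,P(a)}

\mathrm{GainRatio}(C,A) = \frac{H(C) - H(C \mid A)}{H(A)},
\qquad H(X) = -\sum_{x} P(x)\,\log P(x)
```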
