





Presentation Transcript


  1. Concept Indexing for Automated Text Categorization Enrique Puertas Sanz epuertas@uem.es Universidad Europea de Madrid

  2. OUTLINE • Motivation • Concept indexing with WordNet synsets • Concept indexing in ATC • Experiments set-up • Summary of results & discussion • Updated results • Conclusions & current work

  3. MOTIVATION • Most popular & effective model for thematic ATC • IR-like text representation • ML feature selection, learning classifiers [Pipeline diagram: pre-classified documents → representation & learning → classifier(s); new document instances → representation → classification → new documents categorized]

  4. MOTIVATION • Bag of Words • Binary • TF • TF*IDF • Stoplist • Stemming • Feature Selection
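The bag-of-words weighting schemes listed above (binary, TF, TF*IDF, with an optional stoplist) can be sketched in plain Python. This is a minimal illustration; the function name and toy documents are our own, not from the talk, and stemming is assumed to have been applied to the tokens beforehand if desired.

```python
import math
from collections import Counter

def bag_of_words(docs, scheme="tfidf", stoplist=None):
    """Build bag-of-words vectors with binary, TF, or TF*IDF weights.

    `docs` is a list of token lists (already tokenized, and stemmed
    if the BS* representations are wanted).
    """
    stoplist = stoplist or set()
    tokenized = [[t for t in d if t not in stoplist] for d in docs]
    vocab = sorted({t for d in tokenized for t in d})
    n = len(tokenized)
    # Document frequency of each term, needed for the IDF factor.
    df = Counter(t for d in tokenized for t in set(d))
    vectors = []
    for d in tokenized:
        tf = Counter(d)
        vec = {}
        for term in tf:
            if scheme == "binary":
                vec[term] = 1.0
            elif scheme == "tf":
                vec[term] = float(tf[term])
            else:  # tf*idf
                vec[term] = tf[term] * math.log(n / df[term])
        vectors.append(vec)
    return vocab, vectors
```

With TF*IDF, a term occurring in every document gets weight zero, which is exactly why stop-listing matters less under that scheme.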

  5. MOTIVATION • Text representation requirements in thematic ATC • Semantic characterization of text content • Words convey an important part of the meaning • But we must deal with polysemy and synonymy • Must allow effective learning • Thousands to tens of thousands of attributes → noise (effectiveness) & lack of efficiency

  6. CONCEPT INDEXING WITH WORDNET SYNSETS • Using vectors of synsets instead of word stems • Ambiguous words mapped to correct senses • Synonyms mapped to same synsets [Example: "car", "automobile", "wagon" (vehicle sense) → synset N036030448 {automobile, car, wagon}; "wagon" (railway sense) → synset N206726781 {train wagon, wagon}]
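The mapping from words to synset identifiers can be sketched as follows. The synset inventory here is a toy stand-in for WordNet (the IDs come from the slide's example but the dictionary itself is illustrative, as is the trivial "first sense" disambiguator):

```python
# Toy synset inventory standing in for WordNet (two senses of "wagon",
# using the illustrative IDs from the slide).
SYNSETS = {
    "N036030448": {"automobile", "car", "wagon"},  # motor-vehicle sense
    "N206726781": {"train wagon", "wagon"},        # railway sense
}

def index_by_synset(tokens, disambiguate):
    """Replace each token with the synset ID chosen by `disambiguate`.

    `disambiguate(word, candidates)` picks one ID among the synsets
    containing `word`; tokens outside the inventory are kept as-is.
    """
    out = []
    for tok in tokens:
        candidates = [sid for sid, words in SYNSETS.items() if tok in words]
        out.append(disambiguate(tok, candidates) if candidates else tok)
    return out

# A trivial placeholder strategy: always pick the lowest-numbered sense.
def first_sense(word, candidates):
    return sorted(candidates)[0]
```

Note how the synonyms "car" and "automobile" collapse onto the same indexing unit, which is the whole point of concept indexing.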

  7. CONCEPT INDEXING WITH WORDNET SYNSETS • Considerable controversy in IR • Assumed potential for improving text representation • Mixed experimental results, ranging from • Very good [Gonzalo et al. 98] to bad [Voorhees 98] • Recent review in [Stokoe et al. 03] • A problem of state-of-the-art WSD effectiveness • But ATC is different!!!

  8. CONCEPT INDEXING IN ATC • Beyond the potential... • We have much more information about ATC categories than about IR queries • WSD's limited effectiveness can be less harmful because of term (feature) selection • But we have new problems!!! • Data sparseness & noise • Most terms are rare (Zipf's Law) → bad estimates • Categories with few documents → bad estimates, lack of information

  9. CONCEPT INDEXING IN ATC • Concept indexing helps to solve IR & new ATC problems • Text ambiguity in IR & ATC • Data sparseness & noise in ATC • Fewer indexing units of higher quality (selection) → probably better estimates • Categories with few documents → why not enrich the representation with WordNet semantic relations? • Hypernymy, meronymy, etc.
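Enriching a synset representation with semantic relations can be sketched like this. The hypernym table is a toy stand-in for WordNet's hierarchy (the IDs and chain are our own illustration):

```python
# Toy hypernym table standing in for WordNet relations (illustrative IDs).
HYPERNYM = {
    "car.n.01": "motor_vehicle.n.01",
    "motor_vehicle.n.01": "vehicle.n.01",
}

def expand_with_hypernyms(synsets, depth=2):
    """Add each synset's hypernym chain (up to `depth` levels).

    Categories with few documents then share features with documents
    about related concepts, easing the sparseness problem.
    """
    expanded = list(synsets)
    for s in synsets:
        cur = s
        for _ in range(depth):
            cur = HYPERNYM.get(cur)
            if cur is None:
                break
            expanded.append(cur)
    return expanded
```

A document mentioning only "car" and one mentioning only "truck" would, after expansion, both contain the shared "vehicle" feature.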

  10. CONCEPT INDEXING IN ATC • Literature review • As in IR, mixed results, ranging from • Good [Fukumoto & Suzuki, 01] to bad [Scott, 98] • Notably, researchers use words in synsets instead of the synset codes themselves • Still lacking: a concept indexing evaluation in ATC over a representative range of selection strategies and learning algorithms

  11. EXPERIMENTS SETUP • Primary goal • Comparing terms vs. correct synsets as indexing units • Requires a perfectly disambiguated collection (SemCor) • Secondary goals • Comparing perfect WSD with simple methods • More scalability, less accuracy • Comparing terms with/without stemming and stop-listing • Nature of SemCor (genre + topic classification)

  12. EXPERIMENTS SETUP • Overview of parameters • Binary classifiers vs. multi-class classifiers • Three concept indexing representations • Correct WSD (CD) • WSD by POS Tagger (CF) • WSD by corpus frequency (CA)
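Of the three disambiguation strategies above, the simplest scalable one, WSD by corpus frequency (CA), can be sketched as a most-frequent-sense lookup learned from sense-tagged data. This is a minimal sketch under that interpretation; the function names and sense labels are our own:

```python
from collections import Counter, defaultdict

def train_frequency_wsd(tagged_corpus):
    """Learn a most-frequent-sense table from (word, sense) pairs,
    in the spirit of the corpus-frequency disambiguator (CA)."""
    counts = defaultdict(Counter)
    for word, sense in tagged_corpus:
        counts[word][sense] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def disambiguate(word, table):
    # Fall back to the surface word when no sense was ever observed.
    return table.get(word, word)
```

This trades accuracy for scalability: every occurrence of a word gets the same sense, regardless of context.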

  13. EXPERIMENTS SETUP • Overview of parameters • Four term indexing representations • No Stemming, No Stoplist (BNN) • No Stemming, With Stoplist (BNS) • With Stemming, No Stoplist (BSN) • With Stemming, With Stoplist (BSS)

  14. EXPERIMENTS SETUP • Levels of selection with IG • No selection (NOS) • top 1% (S01) • top 10% (S10) • IG>0 (S00)
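The IG-based selection levels above can be sketched in plain Python. Information gain of a binary term feature is H(C) − H(C | term present/absent); the function names and toy data are our own illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term, docs, labels):
    """IG of a binary term feature: H(C) - H(C | term in doc)."""
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    cond = sum(len(p) / n * entropy(p) for p in (present, absent) if p)
    return entropy(labels) - cond

def select_top(docs, labels, fraction):
    """Keep the top fraction of the vocabulary by IG
    (e.g. 0.01 for S01, 0.10 for S10)."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(t, docs, labels),
                    reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]
```

The S00 level would instead keep every term with `information_gain(...) > 0`, and NOS skips the filter entirely.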

  15. EXPERIMENTS SETUP • Learning algorithms • Naïve Bayes • kNN • C4.5 • PART • SVMs • Adaboost+Naïve Bayes • Adaboost+C4.5
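The first learner in the list, Naïve Bayes, is simple enough to sketch end-to-end. This is a generic multinomial Naïve Bayes with Laplace smoothing, not the talk's exact implementation; names and toy data are our own:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Multinomial Naive Bayes over token lists, with class priors
    and per-class term counts."""
    vocab = {t for d in docs for t in d}
    prior = Counter(labels)
    term_counts = defaultdict(Counter)
    for d, lab in zip(docs, labels):
        term_counts[lab].update(d)
    return vocab, prior, term_counts, len(labels)

def classify_nb(doc, model):
    """Pick the class maximizing log P(c) + sum log P(t|c),
    with add-one (Laplace) smoothing."""
    vocab, prior, term_counts, n = model
    best, best_score = None, float("-inf")
    for c in prior:
        total = sum(term_counts[c].values())
        score = math.log(prior[c] / n)
        for t in doc:
            if t in vocab:
                score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best
```

Because the smoothed estimates depend directly on per-class term counts, this is exactly the kind of learner whose estimates suffer from the sparseness problem discussed on slide 8.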

  16. EXPERIMENTS SETUP • Evaluation metrics • F1 (harmonic mean of recall & precision) • Macroaverage • Microaverage • K-fold cross validation (k=10 in our experiments)
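The metrics above can be made concrete with a short sketch: per-category F1, its macroaverage (mean over categories), and the microaverage (F1 over the pooled contingency counts). Function names and toy labels are our own illustration:

```python
def f1_scores(gold, pred, categories):
    """Per-category F1 plus macro- and micro-averages for
    single-label multi-class predictions."""
    per_cat = {}
    tp_all = fp_all = fn_all = 0
    for c in categories:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_cat[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        tp_all += tp; fp_all += fp; fn_all += fn
    # Macroaverage: every category counts equally (sensitive to rare ones).
    macro = sum(per_cat.values()) / len(categories)
    # Microaverage: every decision counts equally (dominated by frequent ones).
    micro_p = tp_all / (tp_all + fp_all) if tp_all + fp_all else 0.0
    micro_r = tp_all / (tp_all + fn_all) if tp_all + fn_all else 0.0
    micro = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    return per_cat, macro, micro
```

Reporting both averages matters here precisely because categories with few documents (slide 8) drag the macroaverage down while barely moving the microaverage.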

  17. SUMMARY OF RESULTS & DISCUSSION • Overview of results (binary classification vs. multi-class classification)

  18. SUMMARY OF RESULTS & DISCUSSION • CD > C* weakly supports that accurate WSD is required • BNN > B* does not support that stemming & stop-listing are NOT required (genre/topic orientation) • Most importantly, CD > B* does not support that synsets are better indexing units than words (stemmed & stop-listed or not)

  19. UPDATED RESULTS • Recent results combining synsets & words (no stemming, no stop-listing, binary problem) • NB → S00, C4.5 → S00, S01, S10 • SVM → S01, ABNB → S00, S00, S10

  20. CONCLUSIONS & CURRENT WORK • Synsets are NOT a better representation, but they IMPROVE the bag-of-words representation • We are testing semantic relations (hypernymy) on SemCor • More work is required on Reuters-21578 • We will have to address WSD, initially with the approaches described in this work

  21. THANK YOU !
