
Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge

Evgeniy Gabrilovich and Shaul Markovitch

American Association for Artificial Intelligence 2006

Prepared by Qi Li


Paper Structure

  • Introduction

  • Feature Generation with Wikipedia

    • Wikipedia as a knowledge repository

    • Feature Construction

    • Feature generator design

    • Using the link structure

  • Empirical Evaluation

    • Implementation Details

    • Experimental Methodology

    • The effect of feature generation

    • Classifying short documents

  • Conclusions and Future Work


Introduction

  • Text categorization

    • Deals with automatic assignment of category labels to natural language documents

    • Represents documents as bags of words (BOW)

    • Features from words

    • Categorization based on features

    • Limitation of BOW:

      • Limited by the individual word occurrences seen in the training set

        • Wal-Mart supply chain goes real time

        • Wal-Mart manages its stock with RFID technology

      • Effective for medium-difficulty categorization, but performs poorly on small categories or short documents

  • Use an encyclopedia to endow the machine with the breadth of knowledge available to humans
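The two Wal-Mart titles above can be used to see the limitation concretely. The following is a minimal sketch, assuming a deliberately naive tokenizer (lowercasing and whitespace splitting); the function name `bag_of_words` is chosen here for illustration and does not come from the paper.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences after lowercasing and whitespace splitting."""
    return Counter(text.lower().split())

title_a = "Wal-Mart supply chain goes real time"
title_b = "Wal-Mart manages its stock with RFID technology"

# Words that occur in both titles: the only evidence a pure BOW model
# has that the two titles are related.
shared = set(bag_of_words(title_a)) & set(bag_of_words(title_b))
print(sorted(shared))  # ['wal-mart']
```

Apart from the company name, the two titles have no words in common, so a classifier that sees only word occurrences has little basis for relating them.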


  • Auxiliary text classifier:

    • Matches documents with the most relevant articles of Wikipedia

    • Augments the conventional bag of words with new features

  • Example of the auxiliary text classifier idea:

    • Input text: “Bernanke takes charge”

    • Generated concepts: BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM, …

  • Using Wikipedia

    • Use text similarity algorithms to automatically identify encyclopedia articles relevant to each document

    • Leverage the knowledge gained from these articles


Feature Generation with Wikipedia

  • Extend the representation of documents for text categorization with knowledge concepts relevant to the document text.

  • Wikipedia

    • Largest knowledge repository

    • Large-scale hierarchies

    • High-quality, standard written English


Feature Construction

  • Receive a text fragment and map it to the most relevant Wikipedia articles

    • E.g. “Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge”

    • ENCYCLOPEDIA, WIKIPEDIA, ENTERPRISE CONTENT MANAGEMENT, BOTTLENECK, PERFORMANCE PROBLEM, HERMENEUTICS

  • Training documents -> features -> Wikipedia concepts -> augment the bag of words
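The mapping from a text fragment to concepts, followed by augmenting the bag of words, can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the three-article `CONCEPTS` dictionary and its texts are invented, cosine similarity over TF.IDF vectors stands in for the unspecified text similarity algorithm, and the `CONCEPT:` feature prefix is a naming choice made here.

```python
import math
from collections import Counter

# Hypothetical miniature encyclopedia: concept name -> invented article text.
CONCEPTS = {
    "BEN BERNANKE": "ben bernanke economist chairman of the federal reserve",
    "FEDERAL RESERVE": "federal reserve central bank of the united states "
                       "chairman bernanke greenspan",
    "JAGUAR": "jaguar big cat panthera onca feline",
}

def tokenize(text):
    return text.lower().split()

def build_tfidf(articles):
    """Return (concept -> TF.IDF vector, term -> IDF weight)."""
    counts = {name: Counter(tokenize(text)) for name, text in articles.items()}
    n_articles = len(counts)
    # Document frequency: number of articles containing each term.
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(n_articles / d) + 1.0 for term, d in df.items()}
    vectors = {name: {t: tf * idf[t] for t, tf in c.items()}
               for name, c in counts.items()}
    return vectors, idf

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_concepts(fragment, vectors, idf, k=2):
    """Names of the k concepts most similar to the fragment (score > 0)."""
    counts = Counter(tokenize(fragment))
    query = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = [(cosine(query, vec), name) for name, vec in vectors.items()]
    scored = [(s, name) for s, name in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]

def augment(fragment, vectors, idf, k=2):
    """Bag of words of the fragment plus one feature per generated concept."""
    features = Counter(tokenize(fragment))
    for name in top_concepts(fragment, vectors, idf, k):
        features["CONCEPT:" + name] += 1
    return features

VECTORS, IDF = build_tfidf(CONCEPTS)
print(top_concepts("Bernanke takes charge", VECTORS, IDF))
# ['BEN BERNANKE', 'FEDERAL RESERVE']
```

The augmented feature set keeps the original words and adds the concept names, so a downstream classifier can use either kind of evidence.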


Feature Construction (cont.)

  • Unit for feature generation?

    • Word, sentence, paragraph, document?

  • Multi-resolution approach

    • Features are generated for

      • Individual words

      • Sentences

      • Paragraphs

      • Entire document

    • A polysemous word is mapped to the concepts that correspond to the sense shared by the context words
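The multi-resolution idea can be sketched as a function that extracts the text units at each level, each of which would then be passed to the feature generator. The splitting rules below (blank lines for paragraphs, end punctuation for sentences) are simplifying assumptions made for this example; the slides do not specify how the units are segmented.

```python
import re

def contexts(document):
    """Return the text units at each resolution of the document."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    sentences = [s.strip()
                 for p in paragraphs
                 for s in re.split(r"(?<=[.!?])\s+", p)
                 if s.strip()]
    words = [w for s in sentences for w in re.findall(r"[A-Za-z0-9'-]+", s)]
    return {
        "word": words,
        "sentence": sentences,
        "paragraph": paragraphs,
        "document": [document.strip()],
    }

sample = ("Jaguar unveiled new car models. Sales rose.\n\n"
          "The jaguar is a big cat.")
units = contexts(sample)
for level, items in units.items():
    print(level, len(items))
```

Generating features for a whole sentence or paragraph, rather than for the word "jaguar" alone, is what lets the surrounding words select the intended sense.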


Feature Construction Example

  • For “jaguar car models”, the Wikipedia-based feature generator returns:

    • JAGUAR (CAR),

    • DAIMLER and BRITISH LEYLAND MOTOR CORPORATION (companies merged with Jaguar),

    • V12 (Jaguar’s engine),

    • JAGUAR E-TYPE

    • JAGUAR XJ.

  • For “jaguar Panthera onca”, it returns:

    • JAGUAR,

    • FELIDAE (feline species family),

    • related felines such as LEOPARD, PUMA and BLACK PANTHER,

    • as well as KINKAJOU


Feature generator design

  • A set of simple heuristics for pruning the set of concepts (Wikipedia articles):

    • Discard articles:

      • with fewer than 100 non-stop words (too short)

      • with fewer than 5 incoming and outgoing links

      • that are disambiguation pages

    • Each concept is represented as an attribute vector of words, with weights assigned using a TF.IDF scheme
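The pruning heuristics above can be sketched as a simple filter. The `Article` record, the tiny `STOP_WORDS` set, and the function name `keep_article` are hypothetical and exist only for this example. The slide does not say whether the link threshold applies to the sum of incoming and outgoing links or to each count separately; this sketch assumes the sum.

```python
from dataclasses import dataclass

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

@dataclass
class Article:
    title: str
    text: str
    incoming_links: int
    outgoing_links: int
    is_disambiguation: bool = False

def keep_article(article, min_words=100, min_links=5):
    """Apply the three pruning heuristics; True means the concept is kept."""
    if article.is_disambiguation:
        return False
    non_stop = [w for w in article.text.lower().split() if w not in STOP_WORDS]
    if len(non_stop) < min_words:
        return False
    # Assumption: the threshold is applied to incoming + outgoing links.
    if article.incoming_links + article.outgoing_links < min_links:
        return False
    return True

articles = [
    Article("Long and well linked", "word " * 120, 10, 10),
    Article("Too short", "word " * 20, 10, 10),
    Article("Poorly linked", "word " * 120, 1, 2),
    Article("Disambiguation", "word " * 120, 10, 10, is_disambiguation=True),
]
kept = [a.title for a in articles if keep_article(a)]
print(kept)  # ['Long and well linked']
```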


Using the link structure

  • Links and anchor text:

    • Anchor text may be identical to the canonical name of the target article

    • Different anchor texts can refer to the same article: alternative names, variant spellings, and related phrases

    • Incoming links indicate the significance of an article

    • Problem: taking all articles pointed to from a concept is ill-advised, because it brings in a lot of weakly related material

    • Pursue this direction in future work


Empirical Evaluation

  • Wikipedia snapshot: November 5, 2005

  • 1.8 GB of text in 910,989 articles

    • After removing small and overly specific concepts, 171,332 articles remain

    • Stop words and rare words are removed

    • Remaining words are stemmed

    • 296,157 distinct terms are used to represent concepts


Experimental Methodology

  • 1 Reuters-21578

  • 2 Reuters Corpus Volume I (RCV1)

  • 3 OHSUMED

  • 4 20 Newsgroups (20NG)

  • 5 Movie Reviews (Movies)

  • Method: SVM with a linear kernel

  • Metrics:

    • precision-recall break-even point (BEP)

    • Reuters and OHSUMED: micro- and macro-averaged BEP

    • 20NG and Movies: 4-fold cross-validation
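The metrics above can be illustrated with a small sketch. It assumes one common way to compute the break-even point for a single category: rank documents by classifier score and accept the top k, with k equal to the number of truly positive documents, at which point precision and recall coincide. The second part contrasts micro- and macro-averaging on precision computed from made-up counts; all numbers are invented and are not results from the paper.

```python
def break_even_point(scores, labels):
    """Precision (= recall) when the top-k documents are accepted,
    where k is the number of positive documents in the category."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("category has no positive documents")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = sum(label for _, label in ranked[:n_pos])
    return true_pos / n_pos

def macro_precision(counts):
    """Average the per-category precision: every category counts equally."""
    return sum(tp / (tp + fp) for tp, fp in counts) / len(counts)

def micro_precision(counts):
    """Pool the decisions of all categories: large categories dominate."""
    total_tp = sum(tp for tp, _ in counts)
    total_fp = sum(fp for _, fp in counts)
    return total_tp / (total_tp + total_fp)

# One category: classifier scores and true labels (1 = in category).
bep = break_even_point([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 1, 0])

# Two categories as (true positives, false positives): one large, one small.
counts = [(90, 10), (1, 1)]
print(round(bep, 3), round(macro_precision(counts), 3),
      round(micro_precision(counts), 3))
```

Because the macro-average weights the small category as heavily as the large one, it is the more sensitive of the two measures to performance on small categories.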


More effective in small categories

Feature generation improves results more on small categories.


Experiment on short documents

Only the titles of the articles are used for classification.


Conclusions and Future Work

  • Feature generator:

    • Identifies the most relevant encyclopedia articles

    • Creates new features from them

  • Add semantics to conventional BOW

    • Latent semantic indexing (LSI)

    • LSI + SVM: does not perform well

    • Wikipedia + SVM: improves performance

  • Information retrieval

