Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge

Evgeniy Gabrilovich and Shaul Markovitch

American Association for Artificial Intelligence 2006

Prepared by Qi Li


Paper Structure

  • Introduction

  • Feature Generation with Wikipedia

    • Wikipedia as a Knowledge Repository

    • Feature Construction

    • Feature generator design

    • Using the link structure

  • Empirical Evaluation

    • Implementation Details

    • Experimental Methodology

    • The effect of feature generation

    • Classifying short documents

  • Conclusions and Future Work


Introduction

  • Text categorization

    • Deals with the automatic assignment of category labels to natural language documents

    • Documents are represented as bags of words (BOW)

    • Features are derived from words

    • Categorization is based on these features

    • Limitation of BOW:

      • It is restricted to the individual word occurrences seen in the training set; e.g., these two related headlines share only one word:

        • “Wal-Mart supply chain goes real time”

        • “Wal-Mart manages its stock with RFID technology”

      • Effective for categorization tasks of medium difficulty, but performs poorly on small categories and short documents

  • Proposed approach: use an encyclopedia to endow the machine with the breadth of knowledge available to humans
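
The BOW limitation can be illustrated with a minimal Python sketch. The tokenizer below is an assumption made for illustration (it is not the preprocessing used in the paper); it shows that the two Wal-Mart headlines overlap in a single token, so a classifier that sees only word counts has little evidence that they are related.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its tokens (hyphenated words stay whole)."""
    return Counter(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))


train = bag_of_words("Wal-Mart supply chain goes real time")
test = bag_of_words("Wal-Mart manages its stock with RFID technology")

# Tokens that occur in both headlines.
shared = set(train) & set(test)
print(shared)  # {'wal-mart'}
```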


Introduction (cont.)

  • Auxiliary text classifier:

    • Matches documents with the most relevant Wikipedia articles

    • Augments the conventional bag of words with new features

  • Example of the auxiliary text classifier idea:

    • Input: “Bernanke takes charge”

    • Generated features: BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM, …

  • Using Wikipedia

    • Use text similarity algorithms to automatically identify the encyclopedia articles relevant to each document

    • Leverage the knowledge gained from these articles
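
One possible way to "identify encyclopedia articles relevant to each document" is TF-IDF weighting with cosine similarity. The sketch below is only an illustration under that assumption: the slides do not say which similarity algorithm is used, and the article texts in `ARTICLES` are invented toy snippets, not real Wikipedia content. The function names are hypothetical.

```python
import math
import re
from collections import Counter

# Toy stand-ins for encyclopedia articles (invented text, for illustration only).
ARTICLES = {
    "BEN BERNANKE": "bernanke economist chairman of the federal reserve",
    "FEDERAL RESERVE": "federal reserve central bank of the united states chairman",
    "JAGUAR": "jaguar large cat panthera onca felidae",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return ({term: tf * idf} per document, idf table), with idf = log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log(len(docs) / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vectors, idf


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_concepts(document, articles, k=2):
    """Rank article titles by cosine similarity to the document; drop zero scores."""
    titles = list(articles)
    vectors, idf = tfidf_vectors([articles[t] for t in titles])
    # Terms unseen in the articles get weight 0 and do not affect the ranking.
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(document)).items()}
    scored = sorted(
        ((title, cosine(query, vec)) for title, vec in zip(titles, vectors)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(title, score) for title, score in scored[:k] if score > 0]


result = top_concepts("Bernanke takes charge", ARTICLES)
print([title for title, _ in result])
```

With this toy data only the BEN BERNANKE article shares a term with the headline, so it is the only concept returned. A real system would index the full article texts.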


Feature Generation with Wikipedia

  • Extend the representation of documents for text categorization with knowledge concepts relevant to the document text.

  • Wikipedia

    • Largest knowledge repository

    • Large-scale hierarchies

    • High-quality text in standard written English


Feature Construction

  • Receive a text fragment and map it to the most relevant Wikipedia articles

    • E.g., for the title “Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge”

    • the generated features are ENCYCLOPEDIA, WIKIPEDIA, ENTERPRISE CONTENT MANAGEMENT, BOTTLENECK, PERFORMANCE PROBLEM, HERMENEUTICS

  • Training documents -> features -> Wikipedia concepts -> augment the bag of words
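
The last step of this pipeline, augmenting the bag of words, might look like the following minimal sketch. The `WIKI:` prefix and the function name are hypothetical choices made here to keep concept features distinct from ordinary words; the slides do not specify how the two feature types are stored.

```python
from collections import Counter


def augment_bag_of_words(tokens, concepts, prefix="WIKI:"):
    """Combine word counts and concept features in a single feature dictionary."""
    features = Counter(tokens)
    for concept in concepts:
        # The prefix keeps a concept such as JAGUAR apart from the word "jaguar".
        features[prefix + concept] += 1
    return dict(features)


tokens = ["bernanke", "takes", "charge"]
concepts = ["BEN BERNANKE", "FEDERAL RESERVE"]  # as returned by a feature generator
features = augment_bag_of_words(tokens, concepts)
print(sorted(features))
```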


Feature Construction (cont.)

  • Unit for feature generation?

    • Word, sentence, paragraph, document?

  • Multi-resolution approach

    • Features are generated for

      • Individual words

      • Sentences

      • Paragraphs

      • Entire document

    • A polysemous word is mapped to the concepts that correspond to the sense shared by its context words


Feature Construction Example

  • For “jaguar car models”, the Wikipedia-based feature generator returns:

    • JAGUAR (CAR)

    • DAIMLER and BRITISH LEYLAND MOTOR CORPORATION (companies merged with Jaguar)

    • V12 (Jaguar’s engine)

    • JAGUAR E-TYPE

    • JAGUAR XJ

  • For “jaguar Panthera onca”, it returns:

    • JAGUAR

    • FELIDAE (feline species family)

    • Related felines such as LEOPARD, PUMA and BLACK PANTHER

    • KINKAJOU
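
The multi-resolution approach and the "jaguar" example can be sketched together. In the sketch below, `toy_lookup` is a hypothetical stand-in for the Wikipedia-based feature generator: it is a hand-written rule, not the paper's method, and exists only to show how a larger context can resolve a word that is ambiguous on its own.

```python
import re


def contexts(document):
    """Yield the document at four resolutions: words, sentences, paragraphs, whole text."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    sentences = [
        s.strip()
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p)
        if s.strip()
    ]
    yield from re.findall(r"[A-Za-z]+", document)
    yield from sentences
    yield from paragraphs
    yield document


def toy_lookup(text):
    """Hypothetical concept lookup: resolves 'jaguar' only when context words help."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    if "jaguar" not in tokens:
        return set()
    if tokens & {"car", "models"}:
        return {"JAGUAR (CAR)"}
    if tokens & {"panthera", "onca"}:
        return {"JAGUAR"}
    return set()  # the word alone is ambiguous, so no concept is emitted


def generate_features(document, lookup):
    """Collect the concepts returned for every context of the document."""
    concepts = set()
    for context in contexts(document):
        concepts.update(lookup(context))
    return concepts


print(generate_features("jaguar car models", toy_lookup))
print(generate_features("jaguar Panthera onca", toy_lookup))
```

The single word "jaguar" yields nothing here, while the sentence-level context selects the car sense or the animal sense.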


Feature Generator Design

  • A set of simple heuristics prunes the set of Wikipedia concepts:

    • Discard:

      • Articles with fewer than 100 non-stop words (too short)

      • Articles with fewer than 5 incoming and outgoing links

      • Disambiguation pages

    • Each concept is represented as an attribute vector whose entries are weighted using a TF.IDF scheme
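
The pruning heuristics can be written as a simple filter. This is a sketch under stated assumptions: the `Article` record and the tiny `STOP_WORDS` list are hypothetical, and the link threshold is interpreted here as the total of incoming plus outgoing links, which the slide does not state explicitly.

```python
from dataclasses import dataclass

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = frozenset({"a", "an", "the", "of", "and", "in", "to", "is"})


@dataclass
class Article:
    title: str
    text: str
    incoming_links: int
    outgoing_links: int
    is_disambiguation: bool = False


def keep_article(article, min_words=100, min_links=5):
    """Return True if the article survives all three pruning heuristics."""
    non_stop = [w for w in article.text.lower().split() if w not in STOP_WORDS]
    if len(non_stop) < min_words:
        return False  # too short
    # Assumption: the threshold applies to incoming + outgoing links combined.
    if article.incoming_links + article.outgoing_links < min_links:
        return False
    return not article.is_disambiguation


long_text = "wikipedia " * 120
candidates = [
    Article("Kept", long_text, 3, 4),
    Article("Too short", "a short stub", 10, 10),
    Article("Few links", long_text, 1, 2),
    Article("Disambiguation", long_text, 10, 10, is_disambiguation=True),
]
kept = [a.title for a in candidates if keep_article(a)]
print(kept)  # ['Kept']
```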


Using the Link Structure

  • Links and anchor text:

    • The anchor text of a link may be identical to the canonical name of the target article

    • Different anchor texts can refer to the same article: alternative names, variant spellings, and related phrases

    • Incoming links indicate the significance of an article

    • Problem: taking all articles pointed to from a concept is ill-advised, since it brings in a lot of weakly related material

    • This direction is left for future work


Empirical Evaluation

  • Wikipedia snapshot: November 5, 2005

  • 1.8 GB of text in 910,989 articles

    • After removing small and overly specific concepts, 171,332 articles remain

    • Stop words and rare words are removed

    • The remaining words are stemmed

    • 296,157 distinct terms are used to represent the concepts


Experimental Methodology

  • Datasets:

    • Reuters-21578

    • Reuters Corpus Volume I (RCV1)

    • OHSUMED

    • 20 Newsgroups (20NG)

    • Movie Reviews (Movies)

  • Method: SVM with a linear kernel

  • Metrics:

    • Precision-recall break-even point (BEP)

    • Reuters and OHSUMED: micro- and macro-averaged BEP

    • 20NG and Movies: 4-fold cross-validation
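
The break-even point and its two averages can be sketched as follows. This uses one common simplification, stated here as an assumption: precision and recall are equal when exactly R documents are accepted, where R is the number of positive documents in the category. The scores and labels are made-up toy data, and the paper's exact computation (for example, interpolation) may differ.

```python
def break_even_point(scores, labels):
    """Return (BEP, hits, positives), accepting the top-R documents by score."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("category has no positive documents")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:positives])
    # With R accepted documents, precision = recall = hits / R.
    return hits / positives, hits, positives


def macro_micro_bep(categories):
    """Macro: mean of per-category BEPs. Micro: pooled hits over pooled positives."""
    results = [break_even_point(scores, labels) for scores, labels in categories]
    macro = sum(bep for bep, _, _ in results) / len(results)
    micro = sum(hits for _, hits, _ in results) / sum(pos for _, _, pos in results)
    return macro, micro


# Toy data: one larger category and one small category (labels: 1 = positive).
large = ([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 1, 0, 1, 0])
small = ([0.9, 0.2, 0.1], [0, 1, 0])
macro, micro = macro_micro_bep([large, small])
print(macro, micro)
```

In this toy data the small category is classified badly, which pulls the macro average down much more than the micro average. This is why macro-averaged results are the more sensitive indicator for small categories.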


The effect of feature generation

Feature generation is more effective in small categories, which show larger improvements.


Experiment on short documents

Only the titles of the articles are used for classification.


Conclusions and Future Work

  • Feature generator:

    • Identifies the most relevant encyclopedia articles

    • Creates new features from them

  • Adds semantics to the conventional BOW

    • Alternative approach: latent semantic indexing (LSI)

    • LSI + SVM: does not perform well

    • Wikipedia + SVM: improves performance

  • Future work: information retrieval

