slide1

Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge

Evgeniy Gabrilovich and Shaul Markovitch

American Association for Artificial Intelligence 2006

Prepared by Qi Li

paper structure
Paper Structure
  • Introduction
  • Feature Generation with Wikipedia
    • Wikipedia as a Knowledge Repository
    • Feature Construction
    • Feature generator design
    • Using the link structure
  • Empirical Evaluation
    • Implementation Details
    • Experimental Methodology
    • The effect of feature generation
    • Classifying short documents
  • Conclusions and Future Work
introduction
Introduction
  • Text categorization
    • Deals with the automatic assignment of category labels to natural language documents
    • Represents each document as a bag of words (BOW)
    • Features are derived from the words
    • Categorization is based on these features
    • Limitation of BOW:
      • The classifier's knowledge is limited to the individual word occurrences seen in the training set, so related documents with little shared vocabulary are not connected:
        • Wal-Mart supply chain goes real time
        • Wal-Mart manages its stock with RFID technology
      • Effective for medium-difficulty categorization, but performs poorly on small categories and short documents (see the sketch after this list)
  • Proposal: use an encyclopedia to endow the machine representation of documents with the breadth of knowledge available to humans
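A minimal illustration of this brittleness (pure Python, toy example): the two Wal-Mart headlines above share essentially no content words, so a plain bag-of-words representation cannot connect them.

```python
# Toy illustration of BOW brittleness: two headlines about the same topic
# share almost no vocabulary, so word-overlap features cannot link them.

def bag_of_words(text):
    """Lowercase the text, split hyphens and whitespace, return a term set."""
    return set(text.lower().replace("-", " ").split())

doc_a = bag_of_words("Wal-Mart supply chain goes real time")
doc_b = bag_of_words("Wal-Mart manages its stock with RFID technology")

print(doc_a & doc_b)  # {'wal', 'mart'} -- no topical overlap beyond the name
```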
slide4
  • Auxiliary text classifier:
    • Matches documents with the most relevant Wikipedia articles
    • Representation: conventional bag of words + new concept features
  • Example of the auxiliary classifier idea:
    • Input: “Bernanke takes charge”
    • Generated concepts: BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM, …
  • Using Wikipedia:
    • Use text similarity algorithms to automatically identify the encyclopedia articles relevant to each document (see the sketch below)
    • Leverage the knowledge gained from these articles
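A hedged sketch of the matching step, assuming a simple TF-IDF cosine-similarity matcher (the paper's actual similarity algorithm may differ in detail); the three-article corpus is a hypothetical stand-in for Wikipedia:

```python
# Hedged sketch: rank "Wikipedia articles" by TF-IDF cosine similarity
# to an input document. The mini-corpus below is illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = {  # hypothetical mini-encyclopedia
    "BEN BERNANKE": "Ben Bernanke chairman Federal Reserve economist",
    "FEDERAL RESERVE": "Federal Reserve central bank United States monetary policy",
    "JAGUAR (CAR)": "Jaguar British luxury car manufacturer",
}

vectorizer = TfidfVectorizer()
article_matrix = vectorizer.fit_transform(articles.values())

def relevant_concepts(text, top_k=2):
    """Return the titles of the top_k most similar articles."""
    sims = cosine_similarity(vectorizer.transform([text]), article_matrix)[0]
    ranked = sorted(zip(articles, sims), key=lambda p: p[1], reverse=True)
    return [title for title, _ in ranked[:top_k]]

print(relevant_concepts("Bernanke takes charge"))
# -> ['BEN BERNANKE', 'FEDERAL RESERVE']
```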
feature generation with wikipedia
Feature Generation with Wikipedia
  • Extend the representation of documents for text categorization with knowledge concepts relevant to the document text.
  • Wikipedia
    • Largest knowledge repository
    • Large-scale hierarchies
    • High-quality, standard written English
feature construction
Feature Construction
  • Receives a text fragment and maps it to the most relevant Wikipedia articles
    • E.g., “Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge” yields:
    • ENCYCLOPEDIA, WIKIPEDIA, ENTERPRISE CONTENT MANAGEMENT, BOTTLENECK, PERFORMANCE PROBLEM, HERMENEUTICS
  • Pipeline: training documents -> generated features -> Wikipedia concepts -> augmented bag of words (see the sketch below)
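A minimal sketch of the augmentation step, with a hypothetical concept generator standing in for the Wikipedia matcher:

```python
# Hedged sketch: append generated Wikipedia concepts to the ordinary
# bag of words. `toy_concept_generator` is a stand-in for the real matcher.

def augment_bag_of_words(document, concept_generator, top_k=5):
    """Return BOW terms plus concept features for one training document."""
    bow = document.lower().split()
    concepts = concept_generator(document, top_k)
    # Prefix concept features so they never collide with ordinary words.
    return bow + ["CONCEPT=" + c for c in concepts]

def toy_concept_generator(text, top_k):
    """Hypothetical matcher; the real one ranks articles by similarity."""
    return ["WIKIPEDIA", "TEXT CATEGORIZATION"][:top_k]

print(augment_bag_of_words("enhancing text categorization", toy_concept_generator))
```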
feature construction cont
Feature Construction (cont.)
  • Unit for feature generation?
    • Word, sentence, paragraph, document?
  • Multi-resolution approach
    • Features are generated for
      • Individual words
      • Sentences
      • Paragraphs
      • Entire document
    • Polysemous words are mapped to the concepts that correspond to the sense shared by their context words (see the sketch below)
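A hedged sketch of the multi-resolution idea; the sentence and paragraph splitting rules are simplifying assumptions, not the paper's exact procedure:

```python
# Hedged sketch of multi-resolution feature generation: contexts of several
# sizes are fed to the concept generator, so a polysemous word is interpreted
# through the sense supported by its surrounding words.
import re

def contexts(document):
    """Yield word-, sentence-, paragraph-, and document-level contexts."""
    yield from document.split()                      # individual words
    yield from re.split(r"(?<=[.!?])\s+", document)  # naive sentence split
    yield from document.split("\n\n")                # blank-line paragraphs
    yield document                                   # the entire document

def multi_resolution_features(document, concept_generator):
    """Union of the concepts generated at every resolution."""
    features = set()
    for ctx in contexts(document):
        features.update(concept_generator(ctx))
    return features
```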
feature construction example
Feature Construction example
  • For “jaguar car models”, the Wikipedia-based feature generator returns:
    • JAGUAR (CAR),
    • DAIMLER and BRITISH LEYLAND MOTOR CORPORATION (companies merged with Jaguar),
    • V12 (Jaguar’s engine),
    • JAGUAR E-TYPE,
    • JAGUAR XJ.
  • For “jaguar Panthera onca”, it returns:
    • JAGUAR,
    • FELIDAE (feline species family), related felines such as LEOPARD,
    • PUMA and BLACK PANTHER, as well as KINKAJOU
feature generator design
Feature generator design
  • A set of simple heuristics for pruning the set of Wikipedia concepts (see the sketch below):
    • Discarding articles that:
      • have fewer than 100 non-stop words (too short)
      • have fewer than 5 incoming and outgoing links
      • are disambiguation pages
  • Each concept is represented as an attribute vector of its words, with weights assigned using a TF-IDF scheme
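A hedged sketch of the pruning heuristics above. The article statistics are assumed to be precomputed, the dict field names are illustrative, and treating either link count below 5 as grounds for removal is an interpretation of the slide:

```python
# Hedged sketch of the pruning heuristics: drop articles that are too
# short, weakly linked, or disambiguation pages.

def keep_article(article):
    """Return True iff the article survives all pruning heuristics."""
    if article["non_stop_words"] < 100:
        return False
    if article["incoming_links"] < 5 or article["outgoing_links"] < 5:
        return False
    return not article["is_disambiguation"]

articles = [
    {"non_stop_words": 450, "incoming_links": 30, "outgoing_links": 12,
     "is_disambiguation": False},
    {"non_stop_words": 40, "incoming_links": 2, "outgoing_links": 1,
     "is_disambiguation": False},
]
print([keep_article(a) for a in articles])  # [True, False]
```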
using the link structure
Using the link structure
  • Links and anchor text:
    • Anchor text is often identical to the canonical name of the target article
    • Different anchor texts pointing to the same article supply alternative names, variant spellings, and related phrases
    • The number of incoming links indicates the significance of an article
    • Problem: taking all articles linked from a concept is ill-advised, since it pulls in a lot of weakly related material
    • The authors leave this direction to future work (a minimal sketch follows)
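A hedged sketch of the two link-structure signals mentioned above; the triple format for the link dump is an assumption made for illustration:

```python
# Hedged sketch: aggregate anchor texts per target article (alternative
# names and spellings) and count incoming links as a significance signal.
from collections import defaultdict

links = [  # (source article, anchor text, target article) -- assumed format
    ("Monetarism", "the Fed", "Federal Reserve"),
    ("Alan Greenspan", "Federal Reserve System", "Federal Reserve"),
    ("Ben Bernanke", "Federal Reserve", "Federal Reserve"),
]

anchors = defaultdict(set)
incoming = defaultdict(int)
for _source, anchor, target in links:
    anchors[target].add(anchor)  # variant names for the target concept
    incoming[target] += 1        # more incoming links => more significant

print(anchors["Federal Reserve"], incoming["Federal Reserve"])
```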
empirical evaluation
Empirical Evaluation
  • Wikipedia snapshot: November 5, 2005
  • 1.8 GB of text in 910,989 articles
    • Removing small and overly specific concepts leaves 171,332 articles
    • Stop words and rare words removed
    • Remaining words stemmed
    • 296,157 distinct terms represent the concepts (see the preprocessing sketch below)
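A hedged sketch of the preprocessing pipeline just described. The stop list and rarity threshold are assumptions, and the paper does not name its exact stemmer (Porter is a common choice):

```python
# Hedged sketch: remove stop words, remove rare words, then stem.
from collections import Counter
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "with"}  # toy list

def preprocess(documents, min_freq=3):
    """Return stemmed token lists with stop words and rare words removed."""
    stemmer = PorterStemmer()
    tokenized = [[w for w in doc.lower().split() if w not in STOP_WORDS]
                 for doc in documents]
    counts = Counter(w for doc in tokenized for w in doc)
    return [[stemmer.stem(w) for w in doc if counts[w] >= min_freq]
            for doc in tokenized]
```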
experimental methodology
Experimental Methodology
  • Datasets:
    • Reuters-21578
    • Reuters Corpus Volume I (RCV1)
    • OHSUMED
    • 20 Newsgroups (20NG)
    • Movie Reviews (Movies)
  • Method: SVM with a linear kernel
  • Metrics:
    • Precision-recall break-even point (BEP); a sketch follows this list
    • Reuters and OHSUMED: micro- and macro-averaged BEP
    • 20NG and Movies: 4-fold cross-validation
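A hedged sketch of the BEP metric: the value at which precision equals recall as the decision threshold sweeps. Here it is approximated by picking the sweep point minimizing |precision - recall|, a common simplification:

```python
# Hedged sketch of the precision-recall break-even point (BEP).
import numpy as np
from sklearn.metrics import precision_recall_curve

def break_even_point(y_true, scores):
    """Approximate BEP: the curve point where precision is closest to recall."""
    precision, recall, _ = precision_recall_curve(y_true, scores)
    idx = np.argmin(np.abs(precision - recall))
    return (precision[idx] + recall[idx]) / 2

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
print(break_even_point(y_true, scores))  # -> 0.75 (precision = recall = 0.75)
```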
experiment on short documents
Experiment on short documents

Classification is performed using only the titles of the articles

conclusion and future work
Conclusion and Future Work
  • Feature generator:
    • Identifies the most relevant encyclopedia articles
    • Creates new features from them
  • Adding semantics to the conventional BOW:
    • Latent semantic indexing (LSI) is an alternative
    • LSI + SVM: does not perform well
    • Wikipedia + SVM: improves performance
  • Future work: applying the approach to information retrieval