Entity Categorization Over Large Document Collections

Presenter: Shu-Ya Li
Authors: Venkatesh Ganti, Arnd Christian König, Rares Vernica
KDD, 2008

Outline • Motivation • Objective • Methodology • Experiments and Results • Conclusion • Comments

Motivation
• Going from unstructured data to structured data
• Extracting entities (people, movies) from documents and identifying their categories (painter, writer, actor)
• Prior approaches: most (unary relation extraction) analyzed only the local document context within which entities occur.
• But… given is-a-researcher (Donald_Knuth), does is-a-researcher (Entity) hold?
[Figure: an entity and its contexts, e.g. "… [Entity] present results …", "… Donald Knuth works in research …", "[Entity] publish"; other context words shown: companies, newspapers]

Objectives
• In this paper, we improve the accuracy of entity categorization by
  • considering an entity's context across multiple documents
  • exploiting existing large lists of related entities
[Figure: the contexts "…[Entity]'s paper…", "…[Entity] gave a talk…" and "…[Entity] published…" yield the features ([Entity], 'paper'), ([Entity], 'talk') and ([Entity], 'published'), which a Multi-Feature Relation Extractor maps to ([Entity], is-a-researcher)]
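The idea of pooling an entity's context across documents can be sketched as follows. This is only an illustration, not the authors' extractor: the function names, the token window of 3, and the toy sentences are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def context_features(text, entity, window=3):
    """Return the words within `window` tokens of each mention of `entity`.

    Hypothetical helper: the window size and the tokenization are assumptions.
    """
    tokens = re.findall(r"\w+", text.lower())
    ent = entity.lower().split()
    n = len(ent)
    feats = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == ent:
            feats.extend(tokens[max(0, i - window):i])   # left context
            feats.extend(tokens[i + n:i + n + window])   # right context
    return feats


def aggregate_features(docs, entities, window=3):
    """Pool context features per entity over all documents."""
    agg = defaultdict(Counter)
    for doc in docs:
        for entity in entities:
            agg[entity].update(context_features(doc, entity, window))
    return agg


docs = [
    "Donald Knuth's paper on algorithms",
    "Donald Knuth gave a talk",
    "Donald Knuth published a book",
]
features = aggregate_features(docs, ["Donald Knuth"])
```

A classifier that sees 'paper', 'talk' and 'published' together for one entity has more evidence than one that sees each document in isolation, which is the point of the aggregate context.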

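Methodology
• Actor-List Feature: co-occurrence between an entity and an actor name in its context.
• Ex: Extraction of the is-a-movie relation
  • Context: "… Julia Roberts starred in Pretty Woman in 1988 …"
  • Actor list: Alan Alba, Richard Gere, Julia Roberts, …
  • An actor name (Julia Roberts) co-occurs with the entity (Pretty Woman): (Pretty Woman, is-a-movie)
• Another relation shown on the slide: (Yao_Ming, is-a-athlete)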
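A list co-occurrence feature of this kind could be computed roughly as below. It is a minimal sketch under assumptions: the function name, the 5-token window, and the simple word tokenization are illustrative choices, not details taken from the paper.

```python
import re


def list_cooccurrence(text, entity, list_members, window=5):
    """Count list members found within `window` tokens of an entity mention.

    Hypothetical helper: `window` and the tokenization are assumptions.
    """
    tokens = re.findall(r"\w+", text.lower())
    ent = entity.lower().split()
    members = [m.lower().split() for m in list_members]
    count = 0
    for i in range(len(tokens) - len(ent) + 1):
        if tokens[i:i + len(ent)] != ent:
            continue
        # Context window around this mention of the entity.
        lo = max(0, i - window)
        hi = min(len(tokens), i + len(ent) + window)
        ctx = tokens[lo:hi]
        for m in members:
            if any(ctx[j:j + len(m)] == m for j in range(len(ctx) - len(m) + 1)):
                count += 1
    return count


actors = ["Alan Alba", "Richard Gere", "Julia Roberts"]
sentence = "Julia Roberts starred in Pretty Woman in 1988"
movie_score = list_cooccurrence(sentence, "Pretty Woman", actors)
other_score = list_cooccurrence(sentence, "Yao Ming", actors)
```

Here `movie_score` is nonzero because an actor name appears next to "Pretty Woman", while `other_score` is zero because "Yao Ming" is never mentioned; such counts would be one input among several to the classifier.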

Methodology - Processing large Document Collections
• Inputs
  • Document Corpus D: 3.2 million documents from Wiki
  • List corpus L and a Synopsis of L (retaining the most important list members)
  • a known set of directors (as ε); a list of actors
• Example entities: E1: Pretty Woman, E2: Mystic Pizza, E3: Doubt, E4: Duplicity, E5: Enchanted, …
• Example actors list: Amy Adams, Elizabeth Reaser, Julia Roberts, Tara Reid, Judy Reyes, …
• Pipeline components shown on the slide
  • n-gram Extraction and Rule-based Extraction
  • List-Member Detection
  • Entity – Candidate Context Pairs
  • Context Feature Extraction, producing Entity-Feature Pairs
  • List-Member Extraction, producing Entity-List Pairs
  • Verification (delete false positives)
  • Co-Occurrence
  • Aggregation
  • Classification with Classifiers C

Methodology - Processing large Document Collections
• Example context: "… Julia Roberts starred in Pretty Woman in 1988 …"
• n-gram Extraction, scanning D once: {Julia, Roberts, starred, Pretty, Woman, Julia Roberts, Pretty Woman, … }
• Problems with producing Entity-List Pairs directly:
  1. the large amount of data written
  2. most candidates are not expected to contain an entity that is a member of a list
• Our Approach – Bloom Filter: List-Member Detection against the Synopsis of L, followed by Verification (delete false positives)
• Candidate contexts: {starred, Pretty, Woman, Pretty Woman, … }
• Entity – Candidate Context Pairs: (Julia Roberts, starred), (Julia Roberts, Pretty), (Julia Roberts, Woman), (Julia Roberts, Pretty Woman)
• Remaining components as on the previous slide: Rule-based Extraction, Context Feature Extraction (Entity-Feature Pairs), List-Member Extraction (Entity-List Pairs), Co-Occurrence, Aggregation, Classification with Classifiers C
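The Bloom-filter step named on the slide can be illustrated with a small sketch: n-grams from one scan of a document are tested against a compact synopsis of the list, and the few that pass are then verified against the real list to delete false positives. The class, the filter size, the number of hash functions, and the use of SHA-256 are all assumptions for this example; the slides do not give the authors' actual parameters.

```python
import hashlib
import re


class BloomFilter:
    """Tiny Bloom filter; sizes and hashing scheme are illustrative assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def ngrams(tokens, max_n=2):
    """Yield all word n-grams of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def detect_list_members(text, members, bloom, max_n=2):
    """Filter n-grams through the Bloom filter, then verify against the exact list."""
    tokens = re.findall(r"\w+", text)
    candidates = [g for g in ngrams(tokens, max_n) if g in bloom]
    return [g for g in candidates if g in members]  # verification step


actors = {"Julia Roberts", "Richard Gere"}
synopsis = BloomFilter()
for name in actors:
    synopsis.add(name)

found = detect_list_members(
    "Julia Roberts starred in Pretty Woman in 1988", actors, synopsis
)
```

A Bloom filter never misses a true list member but may admit a few non-members, which is why the verification pass is needed; the benefit is that most n-grams are discarded before any data is written out.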

Experiments

Conclusion
• Studied the effect of aggregate context in relation extraction.
• Proposed efficient processing techniques for large text corpora.
• Both aggregate and co-occurrence features provide a significant increase in extraction accuracy compared to single-context classifiers.

Comments • Advantage • The first half of this paper is clear. • Drawback • But the first half of this paper isn’t clear. • Application • Entity categorization
