120 likes | 254 Views
This paper presents an improved method for entity categorization, focusing on analyzing the context across multiple documents, which builds on prior approaches that typically examine only local document contexts. By leveraging extensive lists of related entities, the proposed Multi-Feature Relation Extractor enhances accuracy in extracting entities such as actors from large document collections. The methodology, supported by extensive experiments, highlights the significance of aggregate context and co-occurrence features, resulting in notable improvements over traditional classifiers.
E N D
Entity Categorization Over Large Document Collections Presenter : Shu-Ya Li Authors : VenkateshGanti, Arnd Christian König, RaresVernica KDD, 2008
Outline • Motivation • Objective • Methodology • Experiments and Results • Conclusion • Comments
Motivation • Prior approaches • But… Entity • companies • [Entity] • present results • … Donald Knuth • works in research … is-a-researcher (Donald_Knuth) is-a-researcher (Entity)? Context [Entity] publish • newspapers • Going from unstructured data to structured data • Extracting entities (people, movies) from documents and identifying the categories (painter, writer, actor) • Most prior approaches (unary relation extraction) • only analyzed the local document context within which entities occur.
Objectives } “…[Entity]’s paper…” [Entity], ‘paper’ [Entity], ‘talk’ [Entity], ‘published’ ([Entity], is-a-researcher) “…[Entity] gave a talk…” “…[Entity] published…” Multi-Feature Relation Extractor • In this paper, we improve the accuracy of entity categorization by • considering an entity’s context across multiple documents • exploiting existing large lists of related entities
Methodology … Julia Roberts starred in Pretty Woman in 1988 … (Yao_Ming, is-a-athlete) Actor-List Feature: Co-occurrence between entityand actor name in context. Ex: Extraction of is-a-movie relation Alan Alba Richard Gere Julia Roberts … actor name Entity (Pretty Woman , is-a-movie)
Methodology - Processing large Document Collections Classification Classifiers C Aggregation List-Member Extraction Context Feature Extraction Entity-List Pairs .retaining the most important list members Verification (Delete false Positives) Entity-Feature Pairs a known set of directors (as ε) a list of actors (as ) 3.2 million documents from Wiki Entity – Candidate Context Pairs } E1: Pretty Woman E2: Mystic Pizza E3:Doubt E4: Duplicity E5:Enchanted … Amy Adams ElizabethReaser JuliaRoberts TaraReid JudyReyes … Actors list n-gram Extraction Rule-based Extraction List-Member Detection wiki Co-Occurrence List corpus L Document Corpus D Synopsis of L
Methodology - Processing large Document Collections Classification Classifiers C … Julia Roberts starred in Pretty Woman in 1988 … Aggregation List-Member Extraction Context Feature Extraction .Scanning D once {Julia, Roberts, starred, Pretty, Woman, Julia Roberts, Pretty Woman, … } Entity-List Pairs 1. the large amount of data written 2. not expected to contain an entity is a member of a list Verification (Delete false Positives) .Our Approach – Bloom Filter {starred, Pretty, Woman, Pretty Woman, … } Entity-Feature Pairs Entity – Candidate Context Pairs (Julia Robert, starred) (Julia Robert, Pretty) (Julia Robert, Woman) (Julia Robert, Pretty Woman) Verification n-gram Extraction Rule-based Extraction List-Member Detection Co-Occurrence List corpus L Document Corpus D Synopsis of L
Conclusion Studied the effect of aggregate context in relation extraction. Proposed efficient processing techniques for large text corpora. Both aggregate and co-occurrence features provide significant increase in extraction accuracy compared to single-context classifiers.
Comments • Advantage • The first half of this paper is clear. • Drawback • But the first half of this paper isn’t clear. • Application • Entity categorization