Entity categorization over large document collections

Arnd Christian König

Venkatesh Ganti

Rares Vernica

Microsoft Research



Relationship Extraction from Text

Task: Given a corpus of documents and entity-recognition logic, extract structured relations between entities from text.

Entity

  • “… Donald Knuth works in research …”

is-a-researcher(Donald_Knuth)

Context

  • “… Yao Ming plays for the Houston Rockets …”

works-for(Yao_Ming, Houston_Rockets)

Motivation: Going from unstructured data to structured data

  • Applications in search, business intelligence, etc.

  • Focus:

    • Open relationship extraction vs. targeted extraction

    • Large document collections (> 10^7 documents)


Using Aggregate Context

Single-context extraction:

  • Extraction logic: ‘[E] works … research’

  • “…[Entity] works in research…”

([Entity], is-a-researcher)

Multi-context extraction:

  • “…[Entity]’s paper…”

  • “…[Entity] gave a talk…”

  • “…[Entity] published…”

Aggregate Context Features

  • [Entity], ‘paper’

  • [Entity], ‘talk’

  • [Entity], ‘published’

Multi-Feature Relation Extractor

([Entity], is-a-researcher)

We track an entity across contexts, allowing us to combine less predictive features.
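
The multi-context idea can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name, the cue-word set, and the sample sentences are assumptions made for the example, built around the three features named on the slide.

```python
import re
from collections import defaultdict

# Cue words taken from the slide; treating them as a fixed set is an assumption.
CUE_WORDS = {"paper", "talk", "published"}

def aggregate_context_features(contexts):
    """Map each entity to the set of cue words seen in any of its contexts.

    `contexts` is an iterable of (entity, text) pairs, one pair per mention.
    """
    features = defaultdict(set)
    for entity, text in contexts:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        features[entity] |= tokens & CUE_WORDS
    return dict(features)

# Illustrative mentions (made up for this sketch).
mentions = [
    ("Donald_Knuth", "Donald Knuth's paper on sorting"),
    ("Donald_Knuth", "Donald Knuth gave a talk"),
    ("Donald_Knuth", "Donald Knuth published a book"),
    ("Yao_Ming", "Yao Ming plays for the Houston Rockets"),
]
agg = aggregate_context_features(mentions)
```

No single mention above matches the single-context rule ‘[E] works … research’, but the aggregated feature set for Donald_Knuth contains all three weaker cues, which a multi-feature extractor could then combine.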


Using Co-occurrence Features

Leverage co-occurrence of entity classes (e.g., directors are likely to co-occur with actors) for extraction.

Example: Extraction of the is-a-director relation:

  • Actor list: Alan Alba, Richard Gere, Julia Roberts

  • Text: “… Julia Roberts starred in a Robert Altman film in 1994 …”

  • Aggregate context feature: Robert_Altman, co-occurs with actor name

  • Co-occurrence features can be between

    • Entities of different classes.

    • Entities of one class.

  • Combination with text features is possible, e.g., ‘[Entity] plays for [Team_Name]’.

Two questions:

1. What difference do the aggregate contexts make for extraction accuracy?

2. This means keeping track of contexts across documents - can we make this efficient?
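
As a rough illustration of the is-a-director example, the sketch below emits a co-occurrence feature whenever a candidate entity shares a context with a member of a class list. The function name and the naive substring matching are assumptions made for brevity; the slides do not specify how mentions are matched.

```python
# Actor list as it appears on the slide.
ACTOR_LIST = {"Alan Alba", "Richard Gere", "Julia Roberts"}

def cooccurrence_features(candidates, text, class_list, label):
    """Return (entity, feature) pairs for candidates that appear in `text`
    together with at least one other member of `class_list`."""
    # Naive substring matching; a real system would use entity recognition.
    members_present = [m for m in class_list if m in text]
    pairs = []
    for entity in candidates:
        if entity in text and any(m != entity for m in members_present):
            pairs.append((entity, f"co-occurs with {label} name"))
    return pairs

text = "Julia Roberts starred in a Robert Altman film in 1994"
pairs = cooccurrence_features(["Robert Altman"], text, ACTOR_LIST, "actor")
no_pairs = cooccurrence_features(["Robert Altman"], "a Robert Altman film",
                                 ACTOR_LIST, "actor")
```

Here `pairs` holds the single feature for Robert Altman, while `no_pairs` is empty because no list member appears in the second context.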


Processing Large Document Collections

[Figure: architecture diagram with panels “Architecture” and “New Architecture”. Labels: Document Corpus D, Co-Occurrence List corpus L, Single-Context Extraction, Agg. Feature Extraction, Co-Occurrence Detection, Entity-Relation Pairs, Entity-Feature Pairs, Context Feature Extraction, Aggregation, Classification, COUNT(entity, relation) > Δ.]

  • Duplicated overhead from

    • Document scanning

    • Document processing

    • Entity extraction
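
One of the diagram labels on this slide is the thresholding rule COUNT(entity, relation) > Δ. A minimal sketch of such a rule follows, assuming it is applied to (entity, relation) pairs collected across documents; the function name and the sample pairs are hypothetical.

```python
from collections import Counter

def classify_by_count(pairs, delta):
    """Keep the (entity, relation) pairs that were extracted more than `delta` times."""
    counts = Counter(pairs)
    return {pair for pair, n in counts.items() if n > delta}

# Illustrative extractions, one tuple per matching context.
extracted = [
    ("Donald_Knuth", "is-a-researcher"),
    ("Donald_Knuth", "is-a-researcher"),
    ("Donald_Knuth", "is-a-researcher"),
    ("Yao_Ming", "is-a-researcher"),
]
accepted = classify_by_count(extracted, delta=2)
```

With `delta=2`, only the pair seen three times is accepted; the single spurious extraction is dropped.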


New Architecture

  • The frequency distribution of entities is very skewed.

  • Pruning is based on retaining the most frequent entities and list members in memory.

  • Challenge: determining frequencies online.

  • => Compact hash synopses of frequencies (CM-Sketch) perform well.
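
For readers unfamiliar with the CM-Sketch (Count-Min sketch), the following is a minimal generic version, not the implementation used in this work. The width, depth, and choice of salted BLAKE2b hashing are assumptions made for the example.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency synopsis; estimates never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-looking hash per row, derived by salting with the row id.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(2, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best estimate.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

sketch = CountMinSketch()
for name in ["George Bush"] * 5 + ["Alan Alba"]:
    sketch.add(name)
```

Because the table size is fixed, frequencies can be tracked online in bounded memory, at the cost of possible overestimates for items that collide with frequent ones.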

Challenges:

1. Fast & accurate co-occurrence detection using the synopsis.

2. Pruning of redundant output.

  • Fast identification of candidate matches through 2-stage filtering.

  • Use of Bloom filters to trade off memory footprint against false-positive rate.

  • Potentially very large output:

    • Duplication, e.g. Entity: “George Bush”, Feature: ‘President’

    • Duplication via very many co-occurrences, e.g. actor-actor.

[Figure: new architecture diagram. Labels: Document Corpus D, Co-Occurrence List corpus L, Synopsis of L, Rule-based Extraction, Agg. Feature Extraction, Co-Occurrence Detection, List-Member Extraction, Entity – Candidate Context Pairs, Entity-List Pairs, Delete false positives, Entity-Feature Pairs, Context Feature Extraction, Aggregation, Classification.]
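
The slides do not spell out the two filtering stages. One plausible reading, sketched below as an assumption, is a cheap Bloom-filter membership test followed by an exact check that deletes the false positives. The class, the function name, and all parameters are hypothetical.

```python
import hashlib

class BloomFilter:
    """Compact set synopsis: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=8192, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                     salt=i.to_bytes(2, "little")).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def find_list_members(candidates, bloom, exact_members):
    """Stage 1: Bloom-filter test. Stage 2: exact lookup removes false positives."""
    maybe = [c for c in candidates if c in bloom]
    return [c for c in maybe if c in exact_members]

members = {"Alan Alba", "Richard Gere", "Julia Roberts"}
bloom = BloomFilter()
for m in members:
    bloom.add(m)
found = find_list_members(["Julia Roberts", "Robert Altman"], bloom, members)
```

More bits per stored string lower the false-positive rate, so `num_bits` is the knob that trades memory footprint against the amount of work left for the exact second stage.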



Experimental Evaluation

  • Task: Categorization of entities into professions (actor, writer, painter, etc.)

  • Document corpus: 3.2 million Wikipedia pages

  • Training data generated using Wikipedia lists of famous painters, writers, etc.

  • Aggregate-Context Classifier: linear SVM using text n-gram & co-occurrence features (binary)

  • Single-Context classifier: 100K extraction rules (incl. gaps) derived from training data (algorithm of [König and Brill, KDD’06]).

  • Co-occurrence list: contains 10% of entity strings in training data.
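
To make the aggregate-context classifier setup concrete, the sketch below shows how an entity's aggregated features could be encoded as a binary vector and scored by a linear decision function, which is the form of model a linear SVM produces. The vocabulary, weights, and bias are hypothetical values chosen for illustration; they are not learned from data and do not come from the experiments.

```python
def to_binary_vector(entity_features, vocabulary):
    """Encode features as 0/1: 1 if the entity has the feature in any context."""
    return [1 if f in entity_features else 0 for f in vocabulary]

def linear_decision(x, weights, bias):
    """Return True when w.x + b is positive (the linear-classifier decision rule)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return score > 0

# Hypothetical vocabulary mixing text and co-occurrence features.
vocabulary = ["paper", "talk", "published", "co-occurs with actor name"]
weights = [1.0, 0.5, 1.0, -1.0]  # hypothetical, not trained
bias = -1.2                      # hypothetical, not trained

x = to_binary_vector({"paper", "published"}, vocabulary)
is_researcher = linear_decision(x, weights, bias)
```

With these illustrative weights, no single feature is enough to cross the threshold, but two moderately predictive features together are.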



Experimental Evaluation: Overhead

  • Main remaining overhead: writing of entity-feature pairs.

  • Simple caching strategy reduces this overhead by an order of magnitude.
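
The slides do not describe the caching strategy. One simple possibility, shown here purely as an assumption, is a bounded LRU cache of recently written entity-feature pairs that suppresses repeated writes. Since the features are binary, dropping a repeated pair loses no information. The class name and capacity are hypothetical.

```python
from collections import OrderedDict

class PairWriteCache:
    """Bounded LRU set that skips writes of recently seen (entity, feature) pairs."""

    def __init__(self, capacity, sink):
        self.capacity = capacity
        self.sink = sink          # any object with .append(), e.g. a list
        self.seen = OrderedDict()

    def write(self, entity, feature):
        key = (entity, feature)
        if key in self.seen:
            self.seen.move_to_end(key)     # refresh recency, skip the write
            return False
        self.seen[key] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)  # evict the least recently used pair
        self.sink.append(key)
        return True

out = []
cache = PairWriteCache(capacity=2, sink=out)
for pair in [("George Bush", "President")] * 3 + [("Julia Roberts", "actor")]:
    cache.write(*pair)
```

Given the skewed entity frequencies noted earlier, a small cache of this kind would be expected to absorb most repeated pairs, although pairs evicted from the cache can still be written more than once.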


Conclusions

  • Studied the effect of aggregate context in relation extraction.

  • Proposed efficient processing techniques for large text corpora.

  • Both aggregate and co-occurrence features provide a significant increase in extraction accuracy compared to single-context classifiers.

  • The use of pruning techniques and approximate filters results in a significant reduction in the overall extraction overhead.


