
Presentation Transcript



Information Extraction: What has Worked, What hasn't, and What has Promise for the Future

Ralph Weischedel

BBN Technologies

7 November 2000



Outline

  • Information extraction tasks & past performance

  • Two approaches

  • Learning to extract relations

  • Our view of the future



Tasks and Performance in Information Extraction



MUC Tasks

Generic tasks:

  • Named Entity (NE): names only of persons, organizations, and locations

  • Template Element (TE): all names, a description (if any), and type of organizations and persons; name and type of a named location

  • Template Relations (TR): who works for what organization; where an organization is located; what an organization produces

Domain-specific task:

  • Scenario Template (ST)



Scenario Template Example

  • Terrorism Event

    • Location: Georgia

    • Date: 09/06/95

    • Type: bombing_event

    • Instrument: a bomb

    • Victim: Georgian leader Eduard Shevardnadze

    • Injury: nothing worse than cuts and bruises

    • Accused: a group of people with plans of the parliament building

    • Accuser: Officials investigating the bombing

Georgian leader Eduard Shevardnadze suffered nothing worse than cuts and bruises when a bomb exploded yesterday near the parliament building. Officials investigating the bombing said they are blaming a group of people with plans of the parliament building.
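To make the shape of a filled template concrete, here is a minimal sketch of one way to hold it in code. The field names follow the slide; the dataclass itself is illustrative and is not a MUC answer-key format.

```python
# Illustrative only: a filled scenario template as a simple data structure.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TerrorismEvent:
    location: str
    date: str
    event_type: str
    instrument: Optional[str] = None
    victim: Optional[str] = None
    injury: Optional[str] = None
    accused: Optional[str] = None
    accuser: Optional[str] = None

event = TerrorismEvent(
    location="Georgia",
    date="09/06/95",
    event_type="bombing_event",
    instrument="a bomb",
    victim="Georgian leader Eduard Shevardnadze",
    injury="nothing worse than cuts and bruises",
    accused="a group of people with plans of the parliament building",
    accuser="Officials investigating the bombing",
)
```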


Best Performance in Scenario Template

[Chart: best F score on the scenario template task, plotted by year, for MUC-3 through MUC-7]

  • No discernible progress on the domain specific task of scenario templates



Problems with Scenario Template Task

  • Templates are too domain dependent

    • not reusable or extensible to new domains

  • Answer keys are inappropriate for machine learning

    • inadequate information content

      • many facts are omitted due to relevancy filtering

      • weak association between facts and texts

    • insufficient quantity -- too expensive to produce

  • Scenario template conflates too many phenomena

    • entity and descriptor finding

    • sentence understanding

    • co-reference

    • world knowledge / inference

    • relevancy filtering


Named Entity

  • Within a document, identify every

    • name mention of locations, persons, and organizations

    • mention of dates, times, monetary amounts, and percentages

Example (names highlighted on the slide by category: persons, organizations, locations):

The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.



Template Entity Task

Find all names of an organization/person/location, one description of the organization/person, and a classification for the organization/person/location.

“...according to the report by Edwin Dorn, under secretary of defense for personnel and readiness. … Dorn's conclusion that Washington…”

<ENTITY-9601020516-13> :=

ENT_NAME: "Edwin Dorn"

"Dorn"

ENT_TYPE: PERSON

ENT_DESCRIPTOR: "under secretary of defense for personnel and readiness"

ENT_CATEGORY: PER_CIV



Template Relation Task

Determine who works for what organization, where an organization is located, and what an organization produces.

“Donald M. Goldstein, a historian at the University of Pittsburgh who helped write…”

<EMPLOYEE_OF-9601020516-5> :=

PERSON: <ENTITY-9601020516-18>

ORGANIZATION: <ENTITY-9601020516-9>

<ENTITY-9601020516-9> :=

ENT_NAME: "University of Pittsburgh"

ENT_TYPE: ORGANIZATION

ENT_CATEGORY: ORG_CO

<ENTITY-9601020516-18> :=

ENT_NAME: "Donald M. Goldstein"

ENT_TYPE: PERSON

ENT_DESCRIPTOR: "a historian at the University of Pittsburgh"


Performance in MUC/Broadcast News

[Chart: scores across the MUC-6, MUC-7, and Broadcast News (BN) evaluations]



Performance in MUC/BN Tasks

  • Clear progress in named entity for broadcast news

  • Promising progress in template element task for newswire

  • Promising start on three template relations



Overview of Approaches



Existing Approaches

  • Manually constructed rules

    • New rules required for each domain, relation/template type, divergent source, new language

      • Written by an expert computational linguist (not just any computational linguist)

    • Cascaded components

    • Adequate performance apparently only for named entity recognition

  • Learning algorithms

    • Require manually annotated training data

    • Integrated search space offers the potential of reduced errors

    • Adequate performance apparently only for named entity recognition


Named Entity (NE) Extraction

[Diagram: a training program learns NE models from training sentences and their answers; at run time, an extractor applies the models to text or to speech-recognition output, identifying every name mention of locations, persons, and organizations (entities).]

  • Up to 1996 - no learning approach competitive with hand-built rules

  • Since 1997 - Statistical approaches (BBN, NYU, MITRE) achieve state-of-the-art performance

  • By 1998 - Performance on automatically transcribed broadcast news of interest


Traditional (Rule-Based) Architecture

[Diagram (typical architecture): Message → Text Finder → Morphological Analyzer → Lexical Pattern Matcher → Output]

Morphological analysis may determine part of speech

Lots of manually constructed patterns:

  • <NNP>+ [“Inc” | “Ltd” | “GmbH” ...]

  • <NNP> [“Power & Light”]

  • <NNP>+ [“City” | “River” | “Valley”, …]

  • <title> <NNP>+
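As a rough illustration of how one such pattern could be realized over part-of-speech-tagged tokens, here is a minimal sketch. The token representation, the suffix lists, and the match_org_names helper are assumptions made for this example, not the rule formalism of any MUC-era system.

```python
# Sketch of a lexical pattern matcher over POS-tagged tokens (illustrative,
# not an actual MUC-era rule language).  Tokens are (word, tag) pairs.

ORG_SUFFIXES = {"Inc", "Ltd", "GmbH"}   # lexical triggers, as in <NNP>+ ["Inc" | "Ltd" | "GmbH"]

def match_org_names(tagged_tokens):
    """Find maximal <NNP>+ runs whose last word is a company suffix,
    approximating the pattern <NNP>+ ["Inc" | "Ltd" | "GmbH"]."""
    spans = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        if tagged_tokens[i][1] != "NNP":
            i += 1
            continue
        j = i
        while j < n and tagged_tokens[j][1] == "NNP":
            j += 1
        # the run is tokens[i:j]; require at least one name word before the suffix
        if j - i >= 2 and tagged_tokens[j - 1][0] in ORG_SUFFIXES:
            spans.append((i, j))
        i = j
    return spans

tokens = [("Acme", "NNP"), ("Widgets", "NNP"), ("Inc", "NNP"), (".", ".")]
print(match_org_names(tokens))   # [(0, 3)]
```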


Name Extraction via IdentiFinder™

Structure of IdentiFinder’s Model

  • One language model for each category plus one for other (not-a-name)

  • Bi-gram transition probabilities

  • The number of categories is learned from training
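A much-simplified sketch of this kind of model is shown below. The CLASSES list, the toy probability tables, and the Viterbi decoder are stand-ins for illustration; IdentiFinder's actual model structure, estimation, and smoothing differ in detail.

```python
# Simplified IdentiFinder-style sketch: one state per name class plus NOT-A-NAME,
# bigram-style probabilities conditioned on the previous word, Viterbi decoding.
import math
from collections import defaultdict

CLASSES = ["PERSON", "ORGANIZATION", "LOCATION", "NOT-A-NAME"]

# Toy tables: P(class_i | class_{i-1}, prev_word) and P(word_i | class_i, prev_word),
# stored as log probabilities with a crude uniform / floor back-off.
# A real system estimates these from annotated training data.
trans = defaultdict(lambda: math.log(1.0 / len(CLASSES)))
emit = defaultdict(lambda: math.log(1e-6))

# Hand-set entries so the example below produces a sensible labeling.
emit[("PERSON", None, "Michael")] = math.log(0.2)
emit[("PERSON", "Michael", "Rose")] = math.log(0.3)
emit[("NOT-A-NAME", "Rose", "went")] = math.log(0.1)
emit[("NOT-A-NAME", "went", "to")] = math.log(0.1)
emit[("LOCATION", "to", "Pale")] = math.log(0.2)
emit[("NOT-A-NAME", "Pale", ".")] = math.log(0.1)

def viterbi(words):
    """Return the most probable class sequence for a (non-empty) word sequence."""
    # Initialization: start transition plus emission of the first word.
    best = {c: (trans[("<START>", None, c)] + emit[(c, None, words[0])], [c])
            for c in CLASSES}
    for i in range(1, len(words)):
        prev_word, w = words[i - 1], words[i]
        new_best = {}
        for c in CLASSES:
            score, path = max((best[p][0] + trans[(p, prev_word, c)], best[p][1])
                              for p in CLASSES)
            new_best[c] = (score + emit[(c, prev_word, w)], path + [c])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]

print(viterbi("Michael Rose went to Pale .".split()))
# ['PERSON', 'PERSON', 'NOT-A-NAME', 'NOT-A-NAME', 'LOCATION', 'NOT-A-NAME']
```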


Effect of Speech Recognition Errors

[Chart: results on the 1998 HUB-4 broadcast news evaluation]

SER(speech) ≈ SER(text) + WER
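As an illustrative reading of this rule of thumb (the numbers here are invented for the example, not figures from the evaluation): a system with a 10% slot error rate on clean text, run on recognizer output with a 15% word error rate, would be expected to show roughly a 25% slot error rate.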


Traditional (Rule-based) Architecture (beyond names)

BBN Architecture in MUC-6 (1995)

[Diagram: Message → Text Finder → Part of Speech → Named Entity Extraction → (Chunk) Parser → (Chunk) Semantics → Sentence-Level Pattern Matcher → Coref, Merging, & Inference → Template Generator → Output. After 1995, part of speech and named entity extraction use HMMs and the (chunk) parser uses an LPCFG-HD model; the earlier stages operate at the clause/sentence level, the later stages at the discourse/document level.]

Waterfall architecture

  • Errors in early processes propagate

  • Little chance of correcting errors

  • Learning rules for one component at a time



Rule-based Extraction Examples

Determining which person holds what office in what organization

  • [person] , [office] of [org]

    • Vuk Draskovic, leader of the Serbian Renewal Movement

  • [org] (named, appointed, etc.) [person] P [office], where P is a preposition

    • NATO appointed Wesley Clark as Commander in Chief

Determining where an organization is located

  • [org] in [loc]

    • NATO headquarters in Brussels

  • [org] [loc] (division, branch, headquarters, etc.)

    • KFOR Kosovo headquarters
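For illustration, the first employment pattern might be applied over text in which a name tagger has already marked entities. The <PER>/<ORG> markup, the regular expression, and the extract_office_relations helper are assumptions for this sketch, not an actual MUC rule language.

```python
import re

# Illustrative rendering of the pattern  [person] , [office] of [org]
# over text whose entities were tagged by an earlier stage.
PATTERN = re.compile(
    r"<PER>(?P<person>[^<]+)</PER>\s*,\s*(?P<office>[^,<]+?)\s+of\s+<ORG>(?P<org>[^<]+)</ORG>"
)

def extract_office_relations(tagged_text):
    """Return (person, office, organization) triples matched by the pattern."""
    return [(m.group("person"), m.group("office").strip(), m.group("org"))
            for m in PATTERN.finditer(tagged_text)]

text = "<PER>Vuk Draskovic</PER>, leader of <ORG>the Serbian Renewal Movement</ORG>, said ..."
print(extract_office_relations(text))
# [('Vuk Draskovic', 'leader', 'the Serbian Renewal Movement')]
```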



Learning to Extract Relations



Motivation for a New Approach

  • Breakthrough in parsing technology achieved in the mid-90s

  • Few attempts to embed the technology in a task

  • Information extraction tasks in MUC-7 (1998) offered an opportunity

    • How can the (limited) semantic interpretation required for MUC be integrated with parsing?

    • Since the source documents were NYT newswire, rather than WSJ, would we need to treebank NYT data?

    • Would computational linguists be required as semantic annotators?


New Approach

[Diagram: a trainer learns sentence-level (syntax, semantics) and discourse-level models from language data and answers; at run time a message is processed by fact identification and a template generator to produce output.]

  • SIFT, statistical processing of

    • Name finding

    • Part of speech

    • Parsing

  • Penn TREEBANK for data about syntax

  • Core semantics of descriptions


A TREEBANK Skeletal Parse

[Figure: a skeletal parse (S, NP, VP, SBAR, PP, WHNP nodes) of the sentence "Nawaz Sharif, who led Pakistan, was ousted October 12 by Pakistani Army General Pervez Muscharraf."]


Integrated Syntactic-Semantic Parsing: Semantic Annotation Required

[Figure: the sentence "Nance, who is also a paid consultant to ABC News, said ..." annotated with a person ("Nance"), a person-descriptor ("a paid consultant to ABC News"), an organization ("ABC News"), coreference between the person and the descriptor, and an employee relation.]

Semantic training data consists ONLY of

  • Named entities (as in NE)

  • Descriptor phrases (as in MUC TE)

  • Descriptor references (as in MUC TE)

  • Relation/events to be extracted (as in MUC TR)



Automatic Augmentation of Parse Trees

  • Add nodes for names and descriptors not bracketed in Treebank, e.g. Lt. Cmdr. Edwin Lewis

  • Attach semantics to noun phrases corresponding to entities (persons, organizations, descriptors)

  • Insert a node indicating the relation between entities (where one entity is a modifier of another)

  • Attach semantics indicating relation to lowermost ancestor node of related entities (where one entity is not a modifier of another)

  • Add pointer semantics to intermediate nodes for entities not immediately dominated by the relation node
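A minimal sketch of two of these steps (attaching a semantic label to an entity-bearing node, and adding pointer semantics along the path between a relation node and an argument) is shown below, assuming a toy tree representation. The Node class and helper functions are illustrative, not BBN's data structures.

```python
# Illustrative sketch (not BBN's code) of two of the augmentation steps.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    syntax: str                          # e.g. "np", "vp", "sbar"
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None           # filled in at leaves
    semantics: Optional[str] = None      # e.g. "per", "org", "emp-of"

    @property
    def label(self) -> str:
        # combined label as in the augmented trees, e.g. "per/np"
        return f"{self.semantics}/{self.syntax}" if self.semantics else self.syntax

def attach_semantics(node: Node, semantic_label: str) -> None:
    """Attach a semantic label to a noun phrase that corresponds to an entity."""
    node.semantics = semantic_label

def path_to(root: Node, target: Node) -> Optional[List[Node]]:
    """Return the list of nodes from root down to target, or None if not dominated."""
    if root is target:
        return [root]
    for child in root.children:
        sub = path_to(child, target)
        if sub:
            return [root] + sub
    return None

def add_pointer_semantics(relation_node: Node, entity_node: Node, pointer: str) -> None:
    """Mark intermediate nodes between the relation node and an argument entity
    with pointer semantics (e.g. "org-ptr"), as in the augmented trees."""
    path = path_to(relation_node, entity_node)
    assert path is not None, "the entity must be dominated by the relation node"
    for node in path[1:-1]:              # skip the relation node and the entity itself
        if node.semantics is None:
            node.semantics = pointer
```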


Augmented Semantic Tree

[Figure: the parse tree for "Nance, who is also a paid consultant to ABC News, said ...", in which each node carries a combined label of the form semantic-label/syntax-label: e.g. per/np over "Nance", per-desc/np over the descriptor noun phrase, org-r/np over "ABC News" (with org/nnp and org-c/nnp leaves), emp-of/pp-lnk and org-ptr/pp linking the descriptor to the organization, and per-desc-of/sbar-lnk, per-desc-ptr/sbar, and per-desc-ptr/vp connecting the relative clause back to the person node (per-r/np).]



Do we need to treebank NYT data? - No

  • Key idea is to exploit the Penn Treebank

  • Train the sentence-level model on syntactic trees from Treebank

  • For each sentence in the semantically annotated corpus

    • Parse the sentence constraining the search to find parses that are consistent with semantics

    • Augment the syntactic parse with semantic structure

  • Result is a corpus that is annotated both semantically and syntactically


Lexicalized Probabilistic CFG Model

(1) Choose the tree that maximizes the product, over tree nodes, of P(node | history)

P(node | history) decomposes into:

  • Head category: P(ch | cp), e.g. P(vp | s)

  • Left modifier categories: PL(cm | cp, chp, cm-1, wp), e.g. PL(per/np | s, vp, null, said)

  • Right modifier categories: PR(cm | cp, chp, cm-1, wp), e.g. PR(emp-of/pp-lnk | per-desc-r/np, per-desc/np, null, consultant)

  • Head part-of-speech: P(tm | cm, th, wh), e.g. P(per/nnp | per/np, vbd, said)

  • Head word: P(wm | cm, tm, th, wh), e.g. P(nance | per/np, per/nnp, vbd, said)

  • Head word features: P(fm | cm, tm, th, wh, known(wm)), e.g. P(cap | per/np, per/nnp, vbd, said, true)
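Read as a scoring function, the decomposition can be sketched as follows. The table layout, the record of generation decisions, and the crude probability floor are assumptions made for illustration, not BBN's estimation or smoothing scheme.

```python
import math

FLOOR = 1e-9  # crude stand-in for the real smoothing / back-off scheme

def prob(table, key):
    return table.get(key, FLOOR)

def log_prob_expansion(e, t):
    """log probability of expanding one constituent, as the sum (in log space)
    of the factors on the slide.  `e` records the generation decisions:
    cp = parent category, ch = head child category (the slide's chp),
    wp / tp = parent head word and tag, mods = modifier tuples
    (side, cm, cm_prev, tm, wm, fm, known).  `t` holds the probability tables."""
    lp = math.log(prob(t["head_cat"], (e["ch"], e["cp"])))                           # P(ch | cp)
    for side, cm, cm_prev, tm, wm, fm, known in e["mods"]:
        mod_table = t["left_mod"] if side == "L" else t["right_mod"]
        lp += math.log(prob(mod_table, (cm, e["cp"], e["ch"], cm_prev, e["wp"])))    # P(cm | cp, chp, cm-1, wp)
        lp += math.log(prob(t["head_tag"],  (tm, cm, e["tp"], e["wp"])))             # P(tm | cm, th, wh)
        lp += math.log(prob(t["head_word"], (wm, cm, tm, e["tp"], e["wp"])))         # P(wm | cm, tm, th, wh)
        lp += math.log(prob(t["head_feat"], (fm, cm, tm, e["tp"], e["wp"], known)))  # P(fm | cm, tm, th, wh, known(wm))
    return lp

def log_prob_tree(expansions, tables):
    """Equation (1): the decoder seeks the tree maximizing the product over
    nodes of P(node | history); in log space that is a sum over expansions."""
    return sum(log_prob_expansion(e, tables) for e in expansions)
```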



A Generative Model

  • Trees are generated top-down, except

    • immediately upon generating each node, its head word and part-of-speech tag are generated

  • For each node, child nodes are constructed in three steps

    (1) the head node is generated

    (2) premodifier nodes, if any, are generated

    (3) postmodifier nodes, if any, are generated


Tree Generation Example

[Figure: step-by-step generation of the augmented semantic tree for "Nance, who is also a paid consultant to ABC News, said ...", showing each node's semantic label and syntax label as it is generated.]



Searching the Model

  • CKY bottom-up search of top-down model

  • Dynamic programming

    • keep only the most likely constituent if several are equivalent relative to all future decisions

  • Constituents are considered identical if

    • They have identical category labels.

    • Their head constituents have identical labels.

    • They have the same head word.

    • Their leftmost modifiers have identical labels.

    • Their rightmost modifiers have identical labels.
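A sketch of the pruning step implied by these equivalence conditions is given below, assuming constituents are simple records; the Constituent fields are an illustrative simplification.

```python
# Dynamic-programming step: constituents covering the same span are merged if
# they are equivalent with respect to all future decisions, keeping the most
# probable one.  The Constituent record is illustrative, not the actual code.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Constituent:
    label: str                      # combined semantic/syntactic label, e.g. "per/np"
    head_label: str                 # label of the head constituent
    head_word: str
    left_mod_label: Optional[str]   # label of the leftmost modifier, if any
    right_mod_label: Optional[str]  # label of the rightmost modifier, if any
    log_prob: float

def equivalence_key(c: Constituent) -> Tuple:
    """Two constituents with the same key are interchangeable for all
    future parsing decisions (the five conditions on the slide)."""
    return (c.label, c.head_label, c.head_word, c.left_mod_label, c.right_mod_label)

def prune(chart_cell):
    """Keep only the most probable constituent in each equivalence class."""
    best = {}
    for c in chart_cell:
        k = equivalence_key(c)
        if k not in best or c.log_prob > best[k].log_prob:
            best[k] = c
    return list(best.values())
```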



Cross Sentence (Merging) Model

  • Classifier model applied to entity pairs

    • whose types fit the relation

    • first argument not already related

  • Feature-based model

    • structural features, e.g. distance between closest references

    • content features, e.g. similar relations found in training

    • feature probabilities estimated from annotated training

  • Compute the odds ratio: [p(rel) p(feat1|rel) p(feat2|rel) …] / [p(~rel) p(feat1|~rel) p(feat2|~rel) …]

  • Create new relation if odds ratio > 1.0
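The decision rule reads as a naive-Bayes-style odds ratio over features of a candidate entity pair; a minimal sketch follows, assuming the probability estimates come from the annotated training data.

```python
# Minimal sketch of the merging decision: an odds ratio over features of a
# candidate entity pair.  The probability dictionaries are assumed to be
# estimated from annotated training data; the tiny floor is a stand-in for smoothing.

def odds_ratio(features, p_rel, p_not_rel, p_feat_given_rel, p_feat_given_not_rel):
    """odds = [p(rel) * prod_i p(feat_i | rel)] / [p(~rel) * prod_i p(feat_i | ~rel)]"""
    num, den = p_rel, p_not_rel
    for f in features:
        num *= p_feat_given_rel.get(f, 1e-6)
        den *= p_feat_given_not_rel.get(f, 1e-6)
    return num / den

def should_merge(features, p_rel, p_not_rel, p_feat_given_rel, p_feat_given_not_rel):
    """Create the new relation only if the odds ratio exceeds 1.0."""
    return odds_ratio(features, p_rel, p_not_rel,
                      p_feat_given_rel, p_feat_given_not_rel) > 1.0
```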



Performance on MUC-7 Test Data


Issues and Answers

How can the (limited) semantic interpretation required for MUC be integrated with parsing?

Integrate syntax and semantics by training on and then generating parse trees augmented with semantic labels.

Since the source documents were NYT newswire, rather than WSJ, would we need to treebank NYT data?

No. First train the parser on WSJ. Then constrain the parser on NYT to produce trees consistent with the semantic annotation. Retrain the parser to produce augmented syntax/semantics trees on the NYT data.

Must computational linguists be the semantic annotators?

No, college students from various majors are sufficient.



Issues and Answers (cont.)

  • LPCFG can be effectively applied to information extraction

  • A single model performed all necessary sentential processing

  • Much future work required for successful deployment

    • Statistical modeling of co-reference

    • Improved performance

    • Cross-document tracking of entities, facts, and events

    • Robust handling of noisy input (speech recognition and OCR)



Pronoun Resolution

  • Statistical model attempts to resolve pronouns to

    • A previous mention of an entity (person, organization, geo-political entity, location, facility), or

    • An arbitrary noun phrase, for cases where the pronoun resolves to a non-entity, or

    • Null, for cases where the pronoun is unresolvable (such as “it is raining”)

  • This generative model depends on

    • All previous noun phrases and pronouns

    • Syntactically local lexical environment

    • Tree distance (similar to Hobbs ‘76)

    • Number and gender
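Below is a deliberately simplified stand-in for the candidate ranking (the model described above is generative, and its features are richer). The Mention record and the hand-set scores are toy placeholders, not learned probabilities.

```python
# Toy candidate ranking for pronoun resolution: candidates are previous
# mentions plus None (unresolvable).  Scores are illustrative placeholders.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mention:
    text: str
    number: str         # "sg" or "pl"
    gender: str         # "m", "f", or "n"
    tree_distance: int  # distance from the pronoun in the parse / document

def score(pronoun: Mention, candidate: Optional[Mention]) -> float:
    if candidate is None:
        return 0.05                                            # prior for "it is raining" cases
    s = 1.0 / (1 + candidate.tree_distance)                    # closer antecedents preferred
    s *= 1.0 if candidate.number == pronoun.number else 0.1    # number agreement
    s *= 1.0 if candidate.gender == pronoun.gender else 0.1    # gender agreement
    return s

def resolve(pronoun: Mention, previous_mentions):
    candidates = list(previous_mentions) + [None]
    return max(candidates, key=lambda c: score(pronoun, c))

he = Mention("he", "sg", "m", 0)
mentions = [Mention("Nance", "sg", "m", 2), Mention("ABC News", "sg", "n", 1)]
best = resolve(he, mentions)
print(best.text if best else "unresolved")   # Nance
```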



Our View of the Future



Status

  • Named entity extraction is mature enough for technology transfer

    • In multiple languages

    • On online text or automatically recognized text (speech or OCR)

  • Fact extraction would benefit from further R&D

    • To increase accuracy from 70 - 75% on newswire to 85-95% on WWW, newswire, audio, video, or printed matter

    • To reduce training requirements from two person months to two person days

    • To correlate facts about entities and events across time and across sources for update of a relational data base



Key Effective Ideas in the 90s

  • Recent results in learning algorithms

    • Named entity recognition via hidden Markov models

    • Lexicalized probabilistic context-free grammars

    • Pronoun resolution

    • Co-training

  • New training data -- TREEBANK data (parse trees, pronoun co-reference, ...)

  • A recipe for progress

    • Corpus of annotated data

    • An appropriate model for the data

    • Automatic learning techniques

    • Recognition search algorithm

    • Metric-based evaluation


Our Vision

[Diagram: a training program learns models from training sentences and answers; an information extraction engine (the extractor) applies them to populate a database of entities, relations, and events, which in turn feeds link analysis, tables, a geographic display, and a time line.]

