Information Extraction

Jordi Turmo

TALP Research Centre

Dep. Llenguatges i Sistemes Informàtics

Universitat Politècnica de Catalunya

turmo@lsi.upc.edu

http://www.lsi.upc.edu/~turmo

Adaptive Information Extraction

Summary

  • Information Extraction Systems
      • Introduction
      • Historical framework
      • Architecture
      • Knowledge specific for IE
      • Examples
  • Evaluation
  • Multilinguality
  • Adaptability

Introduction

Definition

  • Goal: Localization and extraction, in a specific format, of the relevant information included in a collection of documents
  • Input requirements: scenario of extraction and document collection
  • Output requirements: output format

Introduction

Typology

  • Different points of view:
    • conceptual coverage: restricted-domain IE vs. open-domain IE
    • language coverage: monolingual IE vs. multilingual IE
    • media coverage: written text IE, speech IE, image IE, multimedia IE
    • document type: IE from free text, from semi-structured documents, from structured documents (including Web pages in HTML and XML)
    • task: TE (template element), TR (template relation), ST (scenario template), others

Introduction

Example 1: Structured documents

  • Web pages
  • A list of members of an organization per document
  • English
  • Scenario of Extraction
    • Name, degree, school and affiliation of the member

Introduction

Example 1: Structured documents

Name        Degree  School   Affiliation
WL Hsu      PhD     Cornell  IIS, Sinica
CS Ho       PhD     NTU      EE,NTIT
C.Chen      PhD     SUNY     EE,NTIT
C.Wu        PhD     Utexas   Cedu,NNU
Mark Liao   PhD     NWU      IIS, Sinica
CJ Liau     PhD     NTU      IIS, Sinica
WK Cheng    PhD     TKU      Tunghai
WC Wang     MS      Syracus  FIT
...

Introduction

Example 2: Semi-structured documents

  • 485 seminar announcements
  • A description of one seminar per document
  • English
  • Scenario of Extraction
    • Speaker, location, start time and end time of the seminar

Introduction

Example 3: Free text

  • 318 Wall Street Journal articles
  • A description of an incident per document
  • English
  • Scenario of Extraction
    • Type of incident, perpetrator, target, date, location, effects and instrument

Introduction

Example 3: Free text

A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650.

Incident type: bombing
Date: March 19
Location: El Salvador: San Salvador (city)
Perpetrator: urban guerrilla commandos
Physical target: power tower
Human target: -
Effect on physical target: destroyed
Effect on human target: no injury or death
Instrument: bomb
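
As an illustration of what "output format" can mean for this scenario, the template above could be held in a small record type. This is only a sketch: the class name, the field names and the helper `unfilled_slots` are assumptions made for the example, not part of any MUC specification.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class IncidentTemplate:
    """Hypothetical container for the slots of the terrorism scenario."""
    incident_type: str
    date: Optional[str] = None
    location: Optional[str] = None
    perpetrator: Optional[str] = None
    physical_target: Optional[str] = None
    human_target: Optional[str] = None
    effect_on_physical_target: Optional[str] = None
    effect_on_human_target: Optional[str] = None
    instrument: Optional[str] = None


def unfilled_slots(template: IncidentTemplate) -> List[str]:
    """Names of the slots for which nothing was extracted."""
    return [name for name, value in asdict(template).items() if value is None]


# The filled template shown on the slide ("Human target: -" stays empty).
example = IncidentTemplate(
    incident_type="bombing",
    date="March 19",
    location="El Salvador: San Salvador (city)",
    perpetrator="urban guerrilla commandos",
    physical_target="power tower",
    effect_on_physical_target="destroyed",
    effect_on_human_target="no injury or death",
    instrument="bomb",
)
```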

Introduction

Example 4: Free text

  • 78 documents
  • A description of one mushroom per document
  • Spanish
  • Scenario of Extraction
    • colors of parts of mushrooms and the circumstances in which they occur

Introduction

Example 4: Free text

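El color blanco de su sombrero pasa a amarillo crema al corte. (The white color of its cap turns creamy yellow when cut.)

El sombrero ennegrece si se corta. (The cap blackens if it is cut.)

color_1
  base: blanco (white)
  tono (tone): indef
  luz (light): indef

Sombrero_1 (cap)
  color:

virar_1 (turn)
  inicio (start):
  final (end):
  causa (cause): corte (cut)

color_2
  base: amarillo (yellow)
  tono: crema (cream)
  luz: indef

Sombrero_2
  color:

virar_2
  inicio: indef
  final:
  causa: corte

color_3
  base: indef
  tono: negro (black)
  luz: indef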

Introduction

Example 5: Combination

  • 78 documents
  • A description of one mushroom per document
  • Spanish
  • Scenario of Extraction
    • Names of the mushroom in different languages, etymology
    • colors of parts of mushrooms and the circumstances in which they occur

Introduction

Applications

  • IE from the Web
  • Building of news DBs
  • Information Integration
  • Support for QA and Summarization
  • Limitation when P < 80%

Introduction

References

  • D.E. Appelt, D.J. Israel, 1999
  • E. Hovy, 1999
  • R.J. Mooney, C. Cardie, 1999
  • Muslea, 1999
  • J. Cowie, Y. Wilks, 2000
  • M.T. Pazienza, 2003
  • Turmo, 2003
  • Turmo et al. 2005

Introduction

Recent events

  • IJCAI 2001 Workshop on Adaptive Text Extraction and Mining (ATEM-2001)
  • ECML 03/PKDD Workshop on Adaptive Text Extraction and Mining (ATEM-2003)
  • AAAI 04 Workshop on Adaptive Text Extraction and Mining (ATEM-2004)
  • EACL 06 Workshop on Adaptive Text Extraction and Mining (ATEM-2006)
  • COLING-ACL 06 Workshop on Information Extraction Beyond the Document
  • ACE conferences

Historical framework

Origin of IE

  • Acquisition of the relevant information involved in knowledge-based systems
  • Traditionally: a manual process carried out by experts on the domain (high human cost)

Historical framework

Origin of IE

  • Acquisition of the relevant information involved in knowledge-based systems
  • 80’s: Text-based Intelligent Systems obtain the relevant information from text sources

Historical framework

Origin of IE

  • Text-Based Intelligent Systems (TBIS)
      • Information Retrieval
      • Information Integration
      • Information Filtering
      • Information Routing
      • Information Extraction
      • Document Classification
      • Question Answering
      • Automatic Summarization
      • Topic Detection & Tracking

...

Historical framework

Relevant Historical Programs

  • Precedents: LSP (Sager, 81), FRUMP (DeJong, 82), JASPER (Hayes, 86)
  • in USA
    • (1987-1991): MUC [US Navy]
    • TIPSTER (1991-1998): MUC [DARPA]
    • TIDES (1999-): ACE [NIST]
  • in Europe
    • LRE (1993-1996): TREE, AVENTINUS, FACILE, ECRAN, SPARKLE
    • PASCAL excellence network (2003-)

Historical framework

MUC Evolution

  • MUC-1 (1987)
    • naval operations
    • auto-definition of scenarios
    • auto-evaluation
  • MUC-2 (1989)
    • naval operations
    • output structure with 10 attributes (type of event, agent, place, ...)
    • auto-evaluation

Historical framework

MUC Evolution

  • MUC-3 (1991)
    • Latin-American terrorism
    • output structure with 18 attributes (type of incident, date, place, ...)
    • recall and precision measures

[Diagram: regions a, b, c, d, e, f over the sets "extracted", "partially extracted" and "relevant"]

extracted = a + b + e + f

relevant = a + f + d

recall = (a + 0.5 f) / (a + f + d)

precision = (a + 0.5 f) / (a + f + b + e)
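
The MUC-3 measures above can be checked with a few lines of code. The sketch below only restates the two formulas; the function name and the sample counts are made up for illustration, and region c of the diagram is not used because neither formula refers to it.

```python
def muc3_scores(a, b, d, e, f):
    """Recall and precision with half credit for partial extractions (region f)."""
    extracted = a + b + e + f
    relevant = a + f + d
    credit = a + 0.5 * f
    recall = credit / relevant if relevant else 0.0
    precision = credit / extracted if extracted else 0.0
    return recall, precision


# Hypothetical counts: 6 correct, 2 partial, 2 missed, 2 spurious.
recall, precision = muc3_scores(a=6, b=2, d=2, e=0, f=2)
print(recall, precision)
```

With f = 0 and e = 0 the same function reduces to recall = a / (a + d) and precision = a / (a + b).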

Historical framework

MUC Evolution

  • MUC-4 (1992),
    • Latin-American terrorism
    • 24 attributes
    • F-score (harmonic average)
  • MUC-5 (1993),
    • Financial news, microelectronics
    • English, Japanese
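
The F-score introduced in MUC-4 combines precision and recall through a harmonic average. A minimal sketch follows; the weighting parameter `beta` is an assumption taken from the usual general form of the measure (beta = 1 gives the plain harmonic mean), since the slide does not state which weighting was used.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta = 1: balanced)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


balanced = f_score(0.8, 0.4)  # the harmonic mean penalizes the lower of the two values
```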

Historical framework

MUC Evolution

  • MUC-6 (1995),
    • financial news
    • subtasks: NE, coreference
    • tasks: TE (template element), ST (scenario template)
  • MUC-7 (1998),
    • air crashes
    • new task: TR (template relation)

Historical framework

MUC Evolution

  • MUC-6, MUC-7
    • Partial extractions are discarded

[Diagram: regions a, b, c, d over the sets "extracted" and "relevant"]

extracted = a + b

relevant = a + d

recall = a / (a + d)

precision = a / (a + b)

Architecture

General Architecture

  • Hobbs,93:
    • Cascade of transducers (or modules) that add structure to text and, often, drop out irrelevant information by applying rules
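
The idea of a cascade of transducers can be sketched as a chain of functions, each of which takes the output of the previous one, adds structure, and may drop irrelevant units. The three stages below (`tokenize`, `keep_relevant`, `mark_locations`), their keyword sets and the dictionary layout are hypothetical toy stand-ins, not modules of any real system.

```python
from functools import reduce


def tokenize(units):
    # Adds structure: a token list for every text unit.
    return [dict(u, tokens=u["text"].split()) for u in units]


def keep_relevant(units):
    # Drops units that mention none of the (hypothetical) scenario keywords.
    keywords = {"bomb", "bombing"}
    return [u for u in units if keywords & {t.lower() for t in u["tokens"]}]


def mark_locations(units):
    # Adds structure: tokens found in a tiny, made-up list of place names.
    places = {"Salvador"}
    return [dict(u, locations=[t for t in u["tokens"] if t in places]) for u in units]


def cascade(stages, units):
    """Apply the stages in order, feeding each one the previous output."""
    return reduce(lambda acc, stage: stage(acc), stages, units)


units = [
    {"text": "A bomb went off near a power tower in San Salvador"},
    {"text": "The weather was mild this morning"},
]
result = cascade([tokenize, keep_relevant, mark_locations], units)
```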

Architecture

Traditional Architecture

  • Document Preprocessing
    • Text Control
    • Lexical Analysis
    • Syntactic Analysis
  • Pattern Matching
  • Postprocess
    • Discourse Analysis
    • Output Template Generation
  • Knowledge resources: Conceptual Hierarchy, Pattern Base, Output Format

Architecture

Text control

  • Filtering relevant documents
  • Guessing the language of the documents
  • Splitting documents into textual zones
  • Filtering relevant zones
  • Splitting text into appropriate units (e.g. sentences)
  • Filtering relevant units
  • Tokenizing units
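
Three of the text control steps listed above (splitting into units, filtering relevant units, tokenizing) can be sketched with regular expressions. The splitting rule, the keyword filter and the function names are simplifying assumptions; real systems need more careful handling of abbreviations, numbers and so on.

```python
import re


def split_sentences(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    # Words and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)


def text_control(text, keywords):
    """Return the token lists of the sentences that contain a keyword."""
    units = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        if keywords & {t.lower() for t in tokens}:
            units.append(tokens)
    return units


text = ("A bomb went off this morning. The weather was mild. "
        "No casualties have been reported.")
units = text_control(text, {"bomb", "casualties"})
```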

Architecture

Text control

  • Example

<Sombrero bastante carnoso de 4 a 8 cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .> <Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .> <Con la edad generalmente palidece sus tonos .>

<Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga . ><Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas.>

Architecture

Lexical analysis

  • Identifying morpho-syntactic categories and semantic categories of words
      • General lexicon
  • Recognizing terminology words
      • Specific dictionaries
  • Recognizing time expressions, quantities, abbreviations, …
  • Expanding abbreviations
      • Lists of abbreviations + expansions

Architecture

Lexical analysis

  • Recognizing and classifying proper nouns (Named Entity Recognition and Classification, NERC)
      • Gazetteers
      • Patterns
  • Dealing with unknown words
  • Dealing with lexical ambiguities
      • POS taggers
      • WSD (???)
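
The two NERC resources mentioned above, gazetteers and patterns, can be illustrated with a toy recognizer. The one-entry gazetteer, the time pattern and the label names are assumptions made for this sketch only.

```python
import re

# Hypothetical gazetteer: lower-cased name -> entity class.
GAZETTEER = {"san salvador": "LOCATION"}

# Hypothetical pattern: four-digit 24-hour times such as 0650.
TIME_PATTERN = re.compile(r"\b(?:[01]\d|2[0-3])[0-5]\d\b")


def recognize_entities(text):
    """Return (start offset, surface string, class) triples, sorted by offset."""
    entities = []
    lowered = text.lower()  # assumes lower-casing keeps offsets (true for ASCII)
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), lowered):
            entities.append((m.start(), text[m.start():m.end()], label))
    for m in TIME_PATTERN.finditer(text):
        entities.append((m.start(), m.group(), "TIME"))
    return sorted(entities)


sentence = "blew up a power tower in the northwestern part of San Salvador at 0650"
found = recognize_entities(sentence)
```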

Architecture

Lexical analysis

  • Example 1

Categories marked in the text: time expressions, mushroom names, abbreviations, numbers, morphological parts. Which categories are marked depends on the scenario.

<Sombrero bastante carnoso de 4 a 8 cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .> <Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .> <Con la edad generalmente palidece sus tonos .>

<Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga . ><Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas.>

Architecture

Lexical analysis

  • Example 2

Categories marked in the text: time expressions, locations, organizations, persons.

<A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy , but no casualties have been reported .>

<According to unofficial sources , the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650 .>

Architecture

Syntactic analysis

  • Full parsing (Lolita, LaSIE, LaSIE-II)
    • inefficient, because of the size of the grammars
    • lack of robustness (out-of-vocabulary words)
    • treebank grammars
    • cascaded grammars
      • Solves some problems related to tuning and incompleteness

Architecture

Syntactic analysis

  • Partial parsing
    • the most commonly used approach
    • chunks or phrasal trees (noun, verb, prepositional, adjective and adverb phrases)
    • absence of global dependencies

Architecture

Semantic interpretation

  • Compositional semantics
    • full parsing + λ-expressions
        • LaSIE, LaSIE-II
        • Entries with λ-expressions in the lexicons
    • partial parsing + grammatical relations [Vilain,99]
    • output = logical forms

Architecture

Semantic interpretation

  • Compositional semantics (example 1)

A bomb went off this morning near a power tower in San Salvador …

[Parse tree over the sentence with nodes s, vp, pp and np]

Lexicon entries:

go_off → λ(t) λ(s) λ(r) λ(z) λ(y) λ(x) (bombing(x,y,z,r,s,t))

power_tower → λ(x) (power_tower(x))

Resulting logical form:

λ(z) λ(y) λ(x) (bombing(x,y,z,bomb,today_morning,power_tower(San_Salvador)))
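
The way the lexicon entries combine into the partial logical form can be mimicked with curried functions. The tuple encoding of predicates and the order in which arguments are supplied are assumptions of this sketch; the slides do not show how LaSIE implements λ-expressions.

```python
def go_off(t):
    # Mirrors: go_off → λ(t) λ(s) λ(r) λ(z) λ(y) λ(x) (bombing(x,y,z,r,s,t))
    return lambda s: lambda r: lambda z: lambda y: lambda x: (
        "bombing", x, y, z, r, s, t)


def power_tower(x):
    # Mirrors: power_tower → λ(x) (power_tower(x))
    return ("power_tower", x)


# Supplying target, time and instrument leaves a function that still expects
# z, y and x, like the λ(z) λ(y) λ(x) form obtained for the sentence.
partial_form = go_off(power_tower("San_Salvador"))("today_morning")("bomb")

# Placeholder values show where the remaining arguments end up.
full_form = partial_form("Z")("Y")("X")
```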

Architecture

Semantic interpretation

  • Compositional semantics (example 2)

A bomb went off this morning near a power tower in San Salvador …

[Diagram: grammatical relation arcs over the sentence, labeled subj, time, place and location_of]

event(bombing , E)
subj(bomb , E)
time(today_morning , E)
place(power_tower, E)
location_of(power_tower, San_Salvador)

Architecture

Semantic interpretation

  • Pattern matching
    • after partial parsing + SVO dependencies
    • the most widespread approach
    • patterns can be implemented in different ways
    • scenario-driven approach (TE, TR, ST, …)
    • output = partial templates

Architecture

Semantic interpretation

  • Pattern matching (example)

A bomb went off this morning near a power tower in San Salvador …

np(C-instrument) … vp(go_off) … np(C-time) … “near” np(C-place) “in” np(C-location)

INSTRUMENT := C-instrument

DATE := C-time

PHIS_TARGET := C-place

LOCATION := C-location
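
One of the "different ways" to implement such a pattern is a sequential matcher over the chunks produced by partial parsing. In the sketch below the chunk triples, the `match` function and the treatment of "…" (any number of chunks may be skipped before each pattern element, a simplification of the rule) are assumptions; only the slot names come from the rule above.

```python
# Chunked sentence: (chunk type, semantic class or head, text). Hypothetical input.
chunks = [
    ("np", "instrument", "A bomb"),
    ("vp", "go_off", "went off"),
    ("np", "time", "this morning"),
    ("word", "near", "near"),
    ("np", "place", "a power tower"),
    ("word", "in", "in"),
    ("np", "location", "San Salvador"),
]

# Pattern elements: (chunk type, class or head, slot to fill or None).
PATTERN = [
    ("np", "instrument", "INSTRUMENT"),
    ("vp", "go_off", None),
    ("np", "time", "DATE"),
    ("word", "near", None),
    ("np", "place", "PHIS_TARGET"),
    ("word", "in", None),
    ("np", "location", "LOCATION"),
]


def match(pattern, chunks):
    """Return the partial template, or None if the pattern does not match."""
    template = {}
    pos = 0
    for kind, label, slot in pattern:
        # Skip chunks until one fits the current pattern element.
        while pos < len(chunks) and chunks[pos][:2] != (kind, label):
            pos += 1
        if pos == len(chunks):
            return None
        if slot is not None:
            template[slot] = chunks[pos][2]
        pos += 1
    return template


partial_template = match(PATTERN, chunks)
```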

Architecture

Discourse analysis

  • Inter-sentence analysis
    • Co-reference resolution
    • Ellipsis resolution
    • Alias resolution
    • Traditional semantic interpretation procedures
    • Template merging procedures
  • Inference procedures
    • Open-domain and domain-specific knowledge for inferences

Architecture

Discourse analysis

  • Example

A bomb went off this morning near a power tower in San Salvador …, but no casualties have been reported

λ(y) λ(x) (bombing(x,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))

According to unofficial sources , the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650

λ(z) λ(y) (bombing(urban_guerrilla_comandos,y,z,bomb,0650,power_tower(the_northwestern_part_of_San_Salvador)))

Architecture

Discourse analysis

  • Example

λ(y) λ(x) (bombing(x,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))

λ(z) λ(y) (bombing(urban_guerrilla_comandos,y,z,bomb,0650,power_tower(the_northwestern_part_of_San_Salvador)))

Unification & inference

λ(y) (bombing(urban_guerrilla_comandos,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))

Inference (blew_up → destroyed)

bombing(urban_guerrilla_comandos,destroyed,no_casualties,bomb,today_morning,power_tower(San_Salvador))
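
The unification step can be sketched as a slot-by-slot merge of two partial templates. Everything named here is an assumption: the slot names replace the positional arguments of the slide, and the hand-written `COMPATIBLE` set stands in for the inference that lets "0650" unify with "today_morning" and the two descriptions of the power tower unify with each other.

```python
# Pairs of values that a (hypothetical) inference component treats as compatible;
# on such a pair the value of the first template is kept.
COMPATIBLE = {
    ("today_morning", "0650"),
    ("power_tower(San_Salvador)",
     "power_tower(the_northwestern_part_of_San_Salvador)"),
}


def unify(first, second):
    """Merge two partial templates; return None when a slot truly conflicts."""
    merged = {}
    for slot in sorted(first.keys() | second.keys()):
        v1, v2 = first.get(slot), second.get(slot)
        if v1 is None or v2 is None:
            merged[slot] = v2 if v1 is None else v1
        elif v1 == v2 or (v1, v2) in COMPATIBLE:
            merged[slot] = v1
        else:
            return None
    return merged


first = {"human_effect": "no_casualties", "instrument": "bomb",
         "time": "today_morning", "target": "power_tower(San_Salvador)"}
second = {"perpetrator": "urban_guerrilla_comandos", "instrument": "bomb",
          "time": "0650",
          "target": "power_tower(the_northwestern_part_of_San_Salvador)"}
merged = unify(first, second)
```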

Architecture

Output template generation

  • Mapping of the extracted pieces into the desired output format
  • Specific inferences:
    • Normalization to predefined values of slots
    • Mandatory slots
    • Extracted information that implies different slot values

Architecture

Output template generation

  • Example

bombing(urban_guerrilla_comandos,destroyed,no_casualties,bomb,today_morning,power_tower(San_Salvador))

today_morning → March_19

no_casualties = no_injuries_or_death

Incident type: bombing

date: March 19

Location: El Salvador: San Salvador (city)

Perpetrator: urban guerrilla commandos

Physical target: power tower

Human target: -

Effect on physical target: destroyed

Effect on human target: no injury or death

Instrument: bomb
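
The normalization and mandatory-slot checks of output template generation can be sketched as follows. The lookup table `NORMALIZE` hard-codes the two mappings of the example (a real system would derive the date from the document date), and `MANDATORY` and `generate_template` are hypothetical names introduced for this illustration.

```python
# Hypothetical normalization table; the date is hard-coded for this example only.
NORMALIZE = {
    "today_morning": "March 19",
    "no_casualties": "no injury or death",
}

# Hypothetical list of slots that every output template must contain.
MANDATORY = ["Incident type"]


def generate_template(slots):
    """Normalize slot values and render them as 'Slot: value' lines."""
    normalized = {name: NORMALIZE.get(value, value) for name, value in slots.items()}
    missing = [name for name in MANDATORY if name not in normalized]
    if missing:
        raise ValueError("missing mandatory slots: " + ", ".join(missing))
    return "\n".join(f"{name}: {value}" for name, value in normalized.items())


rendered = generate_template({
    "Incident type": "bombing",
    "Date": "today_morning",
    "Effect on human target": "no_casualties",
})
```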

Knowledge specific for IE

Characteristics of IE systems

  • Strong dependence on the domain
    • Scenario of extraction
    • Semantics vs. syntax
    • Discourse analysis
  • Strong dependence on the text structure
    • Sublanguages
    • Meta-information
  • Strong dependence on the output format
    • DBs
    • annotations

Knowledge specific for IE

Characteristics of IE systems

  • Importance of portability and tuning
  • Importance of knowledge engineering
    • Modularity
      • Basic tasks and specific tasks
    • Use of weak and local knowledge
  • Importance of NL resources
    • MRDs, ontologies, general lexicons, specific dictionaries, …

Knowledge specific for IE

Knowledge resources

  • Knowledge more or less stable
    • general lexicon
    • general grammar
    • basic NL processors: segmenters, taggers, parsers, …
  • Domain dependent knowledge
    • Domain specific vocabularies, terminology
    • gazetteers and patterns for NERC
    • IE patterns (knowledge specifically used for IE)

Knowledge specific for IE

Types of IE patterns

  • Viewpoint 1: type of representation
    • rules

np(C-instrument) … vp(go_off) … np(C-time) … “near” np(C-place) “in” np(C-location)

Event:INSTRUMENT := C-instrument

Event:DATE := C-time

Event:PHIS_TARGET := C-place

Event:LOCATION := C-location

Knowledge specific for IE

Types of IE patterns

  • Viewpoint 1: type of representation
    • statistical models (BNs, HMMs, ME, Hyperplanes, …)

[Figure: HMM whose states emit words such as "who", "speaker", "dr.", "professor", "robert", "michael", "seminar", "reminder", with transition probabilities such as 1.0, 0.99, 0.76, 0.56 and 0.24]

Knowledge specific for IE

Types of IE patterns

  • Viewpoint 2: type of values extracted
    • slot filler extraction patterns (the HMM presented before)
    • event extraction patterns (the rule presented before)
    • relation extraction patterns

np(C-person) … vp(is) pron(C-his) “wife”

Married_with:HUSBAND := C-his

Married_with:WIFE := C-person

Knowledge specific for IE

Types of IE patterns

  • Viewpoint 3: number of slot fillers extracted
    • single-slot IE patterns (the HMM presented before)
    • multi-slot IE patterns (both rules presented before)

Examples of IE systems

Methodologies [Turmo,2002]

System     Reference
LaSIE      Gaizauskas et al, 1995
LaSIE-II   Humphreys et al, 1998
LOLITA     Garigliano et al, 1998
CIRCUS     Lehnert et al, 1991
FASTUS     Hobbs et al, 1993
BADGER     Fisher et al, 1995
HASTEN     Krupka, 1995
PROTEUS    Grishman, 1995
ALEMBIC    Aberdeen et al, 1993
PIE        Lin, 1995
TURBIO     Turmo, 2002
PLUM       Weischedel et al, 1995
IE2        Aone et al, 1998
LOUELLA    Childs et al, 1995
SIFT       Miller et al, 1998

Cells of the Parsing, Semantics and Discourse columns, in slide order (row alignment not recoverable):

  • in-depth understanding
  • template merging
  • chunking | pattern matching | -
  • gramm. relations interp. | semantic interpretation procedures
  • partial parsing | pattern matching
  • pattern matching | template merging
  • -
  • syntactico-semantic parsing

Examples of IE systems

Knowledge [Turmo,2002]

Columns: System, Parsing, Semantics, Discourse

Systems: LaSIE, LaSIE-II, LOLITA, CIRCUS, FASTUS, BADGER, HASTEN, PROTEUS, ALEMBIC, TURBIO, PIE, PLUM, IE2, LOUELLA, SIFT

Cells in slide order (row alignment not recoverable):

  • Treebank grammar | λ-expressions
  • hand-crafted stratified general grammar
  • General grammar | semantic network
  • concept nodes (AutoSlog)
  • hand-crafted IE rules
  • concept nodes (CRYSTAL) | decision trees
  • Phrasal grammar | E-graphs
  • IE rules (ExDISCO) | hand-crafted gram. relations
  • IE rules (EVIUS)
  • General grammar | hand-crafted IE rules
  • hand-crafted rules
  • hand-crafted IE rules | decision trees
  • Statistical models for syntactic-semantic parsing & coreference resolution learned from PTB and on-domain annotated texts

Examples of IE systems

LaSIE-II system

[Architecture diagram. Modules: Gazetteer lookup, Sentence splitter, Brill tagger, Tagged morph, Buchart parser, Name matcher, Discourse interpreter, Template writer. Resources: gazetteers, Lexicon, Stratified grammar, Conceptual hierarchy. Outputs: TE, TR, ST]

  • Preprocessing
    • NERC preprocess via gazetteers and keyword lists
    • Root form and inflectional suffix for verbs, nouns and adjectives found in sentences

According_to-adv unofficial-adj source[s]-n , the-det bomb-n – allegedly-adv detonate[ed]-v by-prep urban-adj guerrilla-n commando[s]-n - blow_up-v a-det power_tower-n in-prep the-det northwestern-adj part-n of-prep San Salvador-loc at-prep 0650

  • Syntactico-semantic interpretation
    • bottom-up chart parser
    • cascade of NERC grammars (e.g. aircraft, person, money, time, timex)

According_to-adv unofficial-adj source[s]-n , the-det bomb-n – allegedly-adv detonate[ed]-v by-prep urban-adj guerrilla-n commando[s]-n - blow_up-v a-det power_tower-n in-prep the-det northwestern part of San Salvador-loc at-prep 0650-time

NE1: northwestern part of San Salvador-loc; NE2: 0650-time

    • cascade of partial grammars (NPs, PPs, complex NPs, VPs, complex VPs, RelClauses, Sentence)

S(According_to-adv NP(unofficial-adj source[s]-n) , NP(the-det bomb-n) – allegedly-adv VP(detonate[ed]-v) PP(by-prep NP(urban-adj guerrilla-n commando[s]-n)) - VP(blow_up-v) PP(NP(a-det power_tower-n) PP(in-prep NP(the-det NE1-loc))) PP(at-prep NP(NE2-time)))

    • QLFs (Note: the real implementation of QLFs is not specified)

Event(E1), detonate(E1,Y,X), urban_guerrilla_comando(X), bomb(Y),
Event(E2), blow_up(E2,Y,Z), power_tower(Z), location_of(Z,NE1), time_of(E2,NE2)

  • Discourse analysis
    • Name matcher: matches variants of NEs across the text
    • Discourse interpreter:
      • adds QLF representation to a semantic net (links)
      • adds presuppositions
      • coreference resolution

Event(E1), detonate(E1,Y,X), urban_guerrilla_comando(X), bomb(Y),
Event(E2), blow_up(E2,Y,Z), power_tower(Z), location_of(Z,NE1), time_of(E2,NE2)

[Diagram: semantic net linking the QLF through "isa" and "implies" arcs to the nodes "bombing event", "location of event" and "destroy"]

  • Output template generation
    • procedure that writes the templates in the desired format

Incident type: bombing

date: March 19

Location: El Salvador: San Salvador (city)

Perpetrator: urban guerrilla commandos

Physical target: power tower

Human target: -

Effect on physical target: destroyed

Effect on human target: no injury or death

Instrument: bomb

Examples of IE systems

PROTEUS system

[Architecture diagram. Modules: Lexical Analyzer, NERC, Partial parsing, Scenario Patterns, Coreference resolution, Discourse Analysis, Output generator. Resources: Lexicon, NERC Rules, Chunk grammar, IE-Rules, Conceptual hierarchy, Inference Rules, Format Rules. Outputs: TE, TR, ST]

  • Preprocessing

According_to-adv unofficial-adj sources-n , the-det bomb-n – allegedly-adv detonated-v by-prep urban-adj guerrilla-n commandos-n - blew_up-v a-det power_tower-n in-prep the-det northwestern part of San Salvador-loc at-prep 0650-time

NE1: northwestern part of San Salvador-loc; NE2: 0650-time

Adaptive Information Extraction

slide77

TE TR ST

Examples of IE systems

PROTEUS system

Lexicon

Chunk

grammar

NERC

Rules

IE-Rules

Conceptual

hierarchy

Format

Rules

Inference

Rules

Partial

parsing

Lexical

Analizer

Coreference

resolution

Discourse

Analysis

Scenario

Patterns

Output

generator

NERC

  • Sintactico-semantic interpretation
  • basic VP and NP chunks+head_semantics
  • semantics refer to types of slot fillers (Conceptual hierarchy)

According_to-advNP(unofficial-adjsources-n-s1) , NP(the-detbomb-n-artifact)– allegedly-advVP(detonated-v-s3) by-prepNP(urban-adj guerrilla-ncommandos-n-person) – VP(blew_up-v-s4)NP(a-detpower_tower-n-building) in-prep NP(NE1-location) at-prep NP(NE2-time)

Adaptive Information Extraction

slide78

TE TR ST

Examples of IE systems

PROTEUS system

Lexicon

Chunk

grammar

NERC

Rules

IE-Rules

Conceptual

hierarchy

Format

Rules

Inference

Rules

Partial

parsing

Lexical

Analizer

Coreference

resolution

Discourse

Analysis

Scenario

Patterns

Output

generator

NERC

  • Sintactico-semantic interpretation
  • basic VP and NP chunks+head_semantics
  • IE-rules for relations (appositions, PP-attachments, limited conjunctions)
    • NP(A-person) , B-integer years old , → instance(X,person), name_of(X,A), age_of(X,B)
    • NP(A-position) of NP(B-company) → instance(X,person), position_of(X,A), company_of(X,B)

Slot

Value

Class

person

Real implementation

as objects

name

A

age

B

slide79

TE TR ST

Examples of IE systems

PROTEUS system

  • Syntactico-semantic interpretation
  • basic VP and NP chunks + head semantics
  • IE-rules for relations (appositions, PP-attachments, limited conjunctions)
  • IE-rules for events (PET interface or ExDISCO)
    • NP(A-artifact) v-s4 NP(B-building) → instance(E1,s4), instrument_of(E1,A), physical_target_of(E1,B)

According_to-adv NP(unofficial-adj sources-n-s1) , NP(the-det bomb-n-artifact) – allegedly-adv VP(detonated-v-s3) by-prep NP(urban-adj guerrilla-n commandos-n-person) – VP(blew_up-v-s4) NP(a-det power_tower-n-building) in-prep NP(NE1-location) at-prep NP(NE2-time)
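A minimal Python sketch of how the event rule could fire on the chunked sentence. The chunk triples, the function name `match_event_rule`, and in particular the decision to skip the parenthetical between the subject NP and the verb are assumptions of this sketch; the slides do not say how PROTEUS handles the intervening material.

```python
def match_event_rule(chunks, verb_class="s4"):
    """chunks: (kind, text, sem) triples in sentence order.

    Implements NP(A-artifact) v-s4 NP(B-building) ->
    instance(E,s4), instrument_of(E,A), physical_target_of(E,B).
    """
    facts = []
    for i, (kind, _text, sem) in enumerate(chunks):
        if kind != "VP" or sem != verb_class or i + 1 >= len(chunks):
            continue
        # Object: the chunk right after the verb must be NP(building).
        okind, otext, osem = chunks[i + 1]
        if okind != "NP" or osem != "building":
            continue
        # Subject: nearest preceding NP(artifact). Skipping the chunks in
        # between is an assumption of this sketch.
        subject = next((t for k, t, s in reversed(chunks[:i])
                        if k == "NP" and s == "artifact"), None)
        if subject is None:
            continue
        event = f"E{len(facts) // 3 + 1}"
        facts += [("instance", event, verb_class),
                  ("instrument_of", event, subject),
                  ("physical_target_of", event, otext)]
    return facts


sentence = [
    ("TOK", "According_to", ""), ("NP", "unofficial sources", "s1"),
    ("NP", "the bomb", "artifact"), ("VP", "detonated", "s3"),
    ("NP", "urban guerrilla commandos", "person"),
    ("VP", "blew_up", "s4"), ("NP", "a power_tower", "building"),
    ("NP", "NE1", "location"), ("NP", "NE2", "time"),
]
```

Calling `match_event_rule(sentence)` binds A to "the bomb" and B to "a power_tower".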

slide80

TE TR ST

Examples of IE systems

PROTEUS system

  • Discourse analysis
  • antecedents are found by searching in sequential order
  • constraints:
    • instance of a hyperclass
    • same number
    • share arguments
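The antecedent search can be sketched as a scan over earlier mentions that returns the first candidate passing all three constraints. Everything concrete here is an assumption: the toy hierarchy `PARENT`, the dictionary layout of mentions, the most-recent-first scan direction, the reading of "instance of a hyperclass" as "the anaphor's class is the candidate's class or one of its ancestors", and the reading of "share arguments" as "no conflicting values on common slots".

```python
# Toy conceptual hierarchy: child class -> parent class (hypothetical).
PARENT = {"company": "organization", "organization": "entity",
          "person": "entity"}


def is_a(cls, ancestor):
    """True if `ancestor` is `cls` itself or one of its hyperclasses."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def compatible(anaphor, candidate):
    same_number = anaphor["number"] == candidate["number"]
    class_ok = is_a(candidate["cls"], anaphor["cls"])
    shared = set(anaphor["args"]) & set(candidate["args"])
    args_ok = all(anaphor["args"][k] == candidate["args"][k] for k in shared)
    return same_number and class_ok and args_ok


def find_antecedent(anaphor, previous_mentions):
    """previous_mentions is in text order; scan the most recent first."""
    for candidate in reversed(previous_mentions):
        if compatible(anaphor, candidate):
            return candidate
    return None
```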

slide81

TE TR ST

Examples of IE systems

PROTEUS system

  • Discourse analysis
  • QLFs + inference rules = more complex QLFs
  • conversion of date expressions
  • inference of slot values from the QLFs already obtained
  • inference of events from others explicitly described
    • Fred, the president of Cuban Cigar Corp., was appointed vice president of Microsoft
    • implies
    • Fred left the Cuban Cigar Corp.
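The succession example can be written as a single inference rule over facts. This is a hedged sketch: the predicate names `position_of`, `appointed` and `left`, and the tuple layout, are invented for the illustration and are not PROTEUS's actual QLF notation.

```python
def infer_departures(facts):
    """If P holds a position at C1 and is appointed at C2 != C1, infer left(P, C1)."""
    held = [f for f in facts if f[0] == "position_of"]
    appointed = [f for f in facts if f[0] == "appointed"]
    inferred = []
    for _, person, _, old_company in held:
        for _, appointee, _, new_company in appointed:
            if person == appointee and old_company != new_company:
                new_fact = ("left", person, old_company)
                if new_fact not in facts and new_fact not in inferred:
                    inferred.append(new_fact)
    return inferred


facts = [
    ("position_of", "Fred", "president", "Cuban Cigar Corp."),
    ("appointed", "Fred", "vice president", "Microsoft"),
]
```

Here `infer_departures(facts)` yields the implied event that Fred left the Cuban Cigar Corp.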

slide82

TE TR ST

Examples of IE systems

PROTEUS system

  • Output template generation
  • use of rules to build the templates with the desired format

slide83

TE TR ST

Examples of IE systems

IE2 system

[Figure: IE2 architecture. Modules: NetOwl Extractor 3.0, Custom NameTag, PhraseTag, EventTag, Discourse Module, TempGen. Resources shown: hand-crafted rules, decision tree.]

slide84

TE TR ST

Examples of IE systems

IE2 system

  • Preprocessing
  • only NERC
  • SGML-tagged
  • general NE types and subtypes
  • restricted-domain NE types and subtypes

<person id=1>Jeff Bantle</person>, <entity id=2>NASA</entity>’s mission operations directorate representative for the shuttle flight

slide85

TE TR ST

Examples of IE systems

IE2 system

  • Syntactico-semantic interpretation
  • SGML-tagging of phrases that are values of slots
  • NPs denoting persons (PNP), organizations (ENP), artifacts (ANP), …
  • local links (location-of, employee-of, owner-of, …)

<person id=1>Jeff Bantle</person>, <PNP affil=2><entity id=2>NASA</entity>’s mission operations directorate representative for the shuttle flight</PNP>

slide86

TE TR ST

Examples of IE systems

IE2 system

  • Syntactico-semantic interpretation
  • SGML-tagging of phrases that are values of slots in templates
  • NPs
  • local semantic relations (employee-of, location-of, product-of, …)
  • event IE-rules (note: the real implementation is not specified)
    • $Vehicle + LaunchN → launch_event::vehicle_info := $Vehicle

<launch_event id=2 vehicle_info=1><ANP> The <vehicle id=1>Ariane 5</vehicle> launch</ANP> was successfully achieved at 6am
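Since the real EventTag implementation is not specified, the following is only one possible reading of the rule `$Vehicle + LaunchN`, written as a regular expression over NameTag-style SGML output. The pattern, the function name `find_launch_events` and the dictionary output are assumptions of this sketch.

```python
import re

# $Vehicle + LaunchN: a tagged vehicle immediately followed by a launch noun.
VEHICLE_LAUNCH = re.compile(
    r"<vehicle id=(?P<vid>\d+)>(?P<name>[^<]+)</vehicle>\s+launch(?:es)?\b",
    re.IGNORECASE,
)


def find_launch_events(tagged_text, first_event_id=1):
    """Return one launch_event per match, with vehicle_info := id of $Vehicle."""
    events = []
    for n, match in enumerate(VEHICLE_LAUNCH.finditer(tagged_text)):
        events.append({
            "type": "launch_event",
            "id": first_event_id + n,
            "vehicle_info": int(match.group("vid")),
        })
    return events
```

The event stores the vehicle's id rather than its surface string, which matches the `vehicle_info=1` attribute in the tagged example.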

slide87

TE TR ST

Examples of IE systems

IE2 system

  • Discourse analysis
  • Three coreference resolution methods
    • Rule based
    • Machine learning based
    • Hybrid
  • Name alias resolution in addition to that performed by NetOwl
  • Definite NPs
  • Singular personal pronouns

<person id=1>Jeff Bantle</person>, <PNP ref=1 affil=2><entity id=2>NASA</entity>’s mission operations directorate representative for the shuttle flight</PNP>

slide88

TE TR ST

Examples of IE systems

IE2 system

  • Output template generation
  • Translates SGML output into templates in the desired format
  • Resolves and normalizes time expressions
  • Performs event merging

slide89

TE TR

Examples of IE systems

SIFT system

[Figure: SIFT architecture. Modules: IdentiFinder™, sentence-level processing, cross-sentence-level processing, output generator. Knowledge source: statistical models.]

slide90

TE TR

Examples of IE systems

SIFT system

  • Preprocessing
  • NERC using an HMM [Bikel et al. 97] + Viterbi decoding, maximizing Pr(W,F,C)
  • each word is tagged with one NE class

[Figure: HMM state diagram with states start-sentence, person, organization, location, not-a-name, end-sentence]
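Viterbi decoding over name-class states can be sketched as follows. This is a plain first-order HMM with per-state word emissions, a deliberate simplification of the model in [Bikel et al. 97]; the function signature, the probability floor for unseen events and the log-space formulation are assumptions of the sketch, and any probabilities passed in are up to the caller.

```python
import math

STATES = ["person", "organization", "location", "not-a-name"]


def viterbi(words, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely name-class sequence for `words`.

    start_p[s], trans_p[r][s] and emit_p[s][word] are probabilities;
    missing entries are replaced by `floor` to avoid log(0).
    """
    if not words:
        return []

    def lp(p):
        return math.log(max(p, floor))

    # best[t][s] = (log-prob of the best path ending in s at t, previous state)
    best = [{s: (lp(start_p.get(s, 0)) + lp(emit_p[s].get(words[0], 0)), None)
             for s in STATES}]
    for word in words[1:]:
        row = {}
        for s in STATES:
            prev, score = max(
                ((r, best[-1][r][0] + lp(trans_p[r].get(s, 0))) for r in STATES),
                key=lambda pair: pair[1],
            )
            row[s] = (score + lp(emit_p[s].get(word, 0)), prev)
        best.append(row)

    # Backtrace from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]
```

Each word receives exactly one name class, as stated on the slide; the start-sentence and end-sentence states of the diagram are folded into `start_p` and omitted at the end for brevity.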

slide91

TE TR

Examples of IE systems

SIFT system

  • Syntactico-semantic interpretation
  • properties of NEs (TE) and relations (TR)
  • generative statistical model [Miller et al. 98, 00]
  • search for the most likely augmented parse tree (bottom-up, chart-based)
  • pruning of low-probability constituents

slide92

TE TR

Examples of IE systems

SIFT system

Syntactico-semantic interpretation

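[Figure: augmented parse tree (each node pairs a semantic label with a syntactic category) for the fragment below. Node labels from the root down: per/np; per-desc-r/np; emp-of/pp-lnk; org-ptr/pp; per-r/np, per-desc/np, org-r/np.]

Preterminals: per/nnp , det vbn per-desc/nn to org’/nnp org/nnp ,

Words: Nance , a paid consultant to ABC News , …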

slide93

TE TR

Examples of IE systems

SIFT system

  • Syntactico-semantic interpretation
  • relations between NEs across sentences
  • statistical model [Miller et al. 98]
  • classifier of pairs of entities
    • entities in different sentences
    • entities do not take part in local relations
    • their types are compatible with some relation
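The three conditions define which entity pairs reach the cross-sentence classifier. The sketch below covers only that candidate selection, not the statistical model itself. The relation signatures in `RELATION_TYPES`, the entity dictionaries, and the reading of the second condition as "neither entity already takes part in a sentence-level relation" are assumptions.

```python
from itertools import combinations

# Hypothetical relation signatures: relation -> (type of arg1, type of arg2).
RELATION_TYPES = {
    "employee_of": ("person", "organization"),
    "location_of": ("organization", "location"),
}


def compatible_relations(e1, e2):
    """Relations whose argument types match the two entity types, in either order."""
    pair = (e1["type"], e2["type"])
    return [name for name, (t1, t2) in RELATION_TYPES.items()
            if pair in ((t1, t2), (t2, t1))]


def candidate_pairs(entities, in_local_relation):
    """entities: dicts with 'id', 'type', 'sent'.
    in_local_relation: ids of entities already linked at sentence level."""
    pairs = []
    for e1, e2 in combinations(entities, 2):
        if e1["sent"] == e2["sent"]:
            continue  # same sentence: handled by the sentence-level model
        if e1["id"] in in_local_relation or e2["id"] in in_local_relation:
            continue
        relations = compatible_relations(e1, e2)
        if relations:
            pairs.append((e1["id"], e2["id"], relations))
    return pairs
```

Each surviving pair would then be scored by the classifier; the filter only keeps the number of pairs it has to consider small.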

slide94

TE TR

Examples of IE systems

TURBIO system

[Figure: TURBIO architecture. Modules: Lexical Analyzer, NERC, Partial parsing, controller, IE-rule set processor, Output generator. Knowledge sources: Lexicon, NERC Rules, Partial-tree grammar, IE-rule sets, IE-rule set scheduling.]

slide95

TE TR

Examples of IE systems

TURBIO system

  • Preprocessing
  • WordNet synsets, lemmas, POS tags
  • NERC
  • parse trees of noun, verb, and adjectival phrases

slide96

TE TR

Examples of IE systems

TURBIO system

  • Syntactico-semantic interpretation
  • Hypothesis: relations among NEs depend on one another
  • Iterative execution of IE-rule sets depending on the scheduling
  • Example:
    • Scenario: mushroom parts, their possible colors, and the circumstances under which those colors are produced
    • There are colors in the documents that are not related to any mushroom part, but all colors related to a circumstance are colors related to mushroom parts.
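One way the dependency in the mushroom example could be exploited is to schedule the color/circumstance rule set first and let the part/color rule set consult its results. This is a speculative sketch, not TURBIO's controller: the function `run_schedule`, the rule-set names, the toy document layout and the filtering strategy are all assumptions.

```python
def run_schedule(doc, schedule, rule_sets):
    """Run IE-rule sets in scheduled order; each sees the relations found so far."""
    found = {}
    for name in schedule:
        found[name] = rule_sets[name](doc, found)
    return found


def color_circumstance_rules(doc, found):
    # Toy stand-in: the document already lists (color, circumstance) candidates.
    return set(doc["color_circumstance_candidates"])


def part_color_rules(doc, found):
    # Keep a (part, color) candidate only if its color was already linked to
    # a circumstance, since such colors are known to describe mushroom parts.
    trusted = {color for color, _ in found.get("color_circumstance", set())}
    return {(part, color) for part, color in doc["part_color_candidates"]
            if color in trusted}


RULE_SETS = {
    "color_circumstance": color_circumstance_rules,
    "part_color": part_color_rules,
}
```

Reversing the schedule would leave `part_color_rules` with no trusted colors, which is why the order of execution matters in this sketch.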
