Information extraction
Information Extraction

Jordi Turmo

TALP Research Centre

Dep. Llenguatges i Sistemes Informàtics

Universitat Politècnica de Catalunya

[email protected]

http://www.lsi.upc.edu/~turmo

Adaptive Information Extraction


Summary

  • Information Extraction Systems

  • Evaluation

  • Multilinguality

  • Adaptability


Summary

  • Information Extraction Systems

    • Introduction

    • Historical framework

    • Architecture

    • Knowledge specific for IE

    • Examples

  • Evaluation

  • Multilinguality

  • Adaptability


    Introduction

    Definition

    • Goal: Localization and extraction, in a specific format, of the relevant information included in a collection of documents

    • Input requirements: scenario of extraction and document collection

    • Output requirements: output format


    Introduction

    Typology

    • Different points of view:

      • conceptual coverage: restricted-domain IE vs. open-domain IE

      • language coverage: monolingual IE vs. multilingual IE

      • media coverage: written text IE, speech IE, image IE, multimedia IE

      • document type: IE from free text, from semi-structured documents, from structured documents (including Web pages in HTML and XML)

      • task: TE (template element), TR (template relation), ST (scenario template), others


    Introduction

    Example 1: Structured documents

    • Web pages

    • A list of members of an organization per document

    • English

    • Scenario of Extraction

      • Name, degree, school and affiliation of the member


    Introduction

    Example 1: Structured documents

    Name        Degree   School    Affiliation
    WL Hsu      PhD      Cornell   IIS, Sinica
    CS Ho       PhD      NTU       EE, NTIT
    C. Chen     PhD      SUNY      EE, NTIT
    C. Wu       PhD      Utexas    Cedu, NNU
    Mark Liao   PhD      NWU       IIS, Sinica
    CJ Liau     PhD      NTU       IIS, Sinica
    WK Cheng    PhD      TKU       Tunghai
    WC Wang     MS       Syracus   FIT

    ...


    Introduction

    Example 2: Semi-structured documents

    • 485 seminar announcements

    • A description of one seminar per document

    • English

    • Scenario of Extraction

      • Speaker, location, start time and end time of the seminar


    Introduction

    Example 3: Free text

    • 318 Wall Street Journal articles

    • A description of an incident per document

    • English

    • Scenario of Extraction

      • Type of incident, perpetrator, target, date, location, effects and instrument


    Introduction

    Example 3: Free text

    A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650.

    Incident type: bombing

    Date: March 19

    Location: El Salvador: San Salvador (city)

    Perpetrator: urban guerrilla commandos

    Physical target: power tower

    Human target: -

    Effect on physical target: destroyed

    Effect on human target: no injury or death

    Instrument: bomb


    Introduction

    Example 4: Free text

    • 78 documents

    • A description of a mushroom per document

    • Spanish

    • Scenario of Extraction

      • colors of parts of mushrooms and the circumstances in which they occur


    Introduction

    Example 4: Free text

    El color blanco de su sombrero pasa a amarillo crema al corte. (The white color of its cap turns creamy yellow when cut.)

    El sombrero ennegrece si se corta. (The cap blackens if it is cut.)

    color_1: base: blanco, tono: indef, luz: indef

    Sombrero_1: color:

    virar_1: inicio: , final: , causa: corte

    color_2: base: amarillo, tono: crema, luz: indef

    Sombrero_2: color:

    virar_2: inicio: indef, final: , causa: corte

    color_3: base: indef, tono: negro, luz: indef


    Introduction

    Example 5: Combination

    • 78 documents

    • A description of a mushroom per document

    • Spanish

    • Scenario of Extraction

      • Names of the mushroom in different languages, etymology

      • colors of parts of mushrooms and the circumstances in which they occur


    Introduction

    Applications

    • IE from the Web

    • Building of news DBs

    • Information Integration

    • Support for QA and Summarization

    • Limitation when P < 80%


    Introduction

    References

    • D.E. Appelt, D.J. Israel, 1999

    • E. Hovy, 1999

    • R.J. Mooney, C. Cardie, 1999

    • Muslea, 1999

    • J. Cowie, Y. Wilks, 2000

    • M.T. Pazienza, 2003

    • Turmo, 2003

    • Turmo et al. 2005


    Introduction

    Recent events

    • IJCAI 2001 Workshop on Adaptive Text Extraction and Mining (ATEM-2001)

    • ECML 03/PKDD Workshop on Adaptive Text Extraction and Mining (ATEM-2003)

    • AAAI 04 Workshop on Adaptive Text Extraction and Mining (ATEM-2004)

    • EACL 06 Workshop on Adaptive Text Extraction and Mining (ATEM-2006)

    • COLING-ACL 06 Workshop on Information Extraction Beyond the Document

    • ACE conferences


    Historical framework

    Origin of IE

    • Acquisition of the relevant information involved in knowledge-based systems

    • Traditionally: manual process (high human cost)

    Diagram labels: Experts on the Domain, Manual Process, Relevant Information


    Historical framework

    Origin of IE

    • Acquisition of the relevant information involved in knowledge-based systems

    • 1980s (text sources)

    Diagram labels: Text-based Intelligent Systems, Relevant Information


    Historical framework

    Origin of IE

    • Text-Based Intelligent Systems (TBIS)

      • Information Retrieval

      • Information Integration

      • Information Filtering

      • Information Routing

      • Information Extraction

      • Document Classification

      • Question Answering

      • Automatic Summarization

      • Topic Detection & Tracking

        ...


    Historical framework

    Relevant Historical Programs

    • Precedents: LSP (Sager, 81), FRUMP (DeJong, 82), JASPER (Hayes, 86)

    • in USA

      • (1987-1991): MUC [US Navy]

      • TIPSTER (1991-1998): MUC [DARPA]

      • TIDES (1999-): ACE [NIST]

    • in Europe

      • LRE (1993-1996): TREE, AVENTINUS, FACILE, ECRAN, SPARKLE

      • PASCAL excellence network (2003-)


    Historical framework

    MUC Evolution

    • MUC-1 (1987)

      • naval operations

      • auto-definition of scenarios

      • auto-evaluation

    • MUC-2 (1989)

      • naval operations

      • output structure with 10 attributes (type of event, agent, place, ...)

      • auto-evaluation


    Historical framework

    MUC Evolution

    • MUC-3 (1991)

      • Latin-American terrorism

      • output structure with 18 attributes (type of incident, date, place, ...)

      • recall and precision measures

    extracted = a + b + e + f

    relevant = a + f + d

    recall = (a + 0.5 f) / (a + f + d)

    precision = (a + 0.5 f) / (a + f + b + e)

    Diagram labels: extracted, partially extracted, relevant; regions a, b, c, d, e, f
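    The MUC-3 measures above can be checked with a few lines of Python. This is only a sketch: the function name `muc3_scores` and the example counts are hypothetical, and reading f as the partially extracted region is an assumption based on its half weight in the formulas.

```python
def muc3_scores(a: int, b: int, d: int, e: int, f: int) -> tuple[float, float]:
    """Return (recall, precision) with half credit for region f.

    Region names follow the slide: a and f belong to both sets,
    d is relevant only, b and e are extracted only.
    """
    credit = a + 0.5 * f          # full credit for a, half credit for f
    relevant = a + f + d          # relevant = a + f + d
    extracted = a + f + b + e     # extracted = a + b + e + f
    recall = credit / relevant if relevant else 0.0
    precision = credit / extracted if extracted else 0.0
    return recall, precision


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
recall, precision = muc3_scores(a=6, b=1, d=2, e=1, f=2)
print(recall, precision)
```

    With these counts the credit is 6 + 0.5 × 2 = 7 and both denominators are 10.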


    Historical framework

    MUC Evolution

    • MUC-4 (1992),

      • Latin-American terrorism

      • 24 attributes

      • F-score (harmonic mean of precision and recall)

    • MUC-5 (1993),

      • Financial news, microelectronics

      • English, Japanese


    Historical framework

    MUC Evolution

    • MUC-6 (1995),

      • financial news

      • subtasks: NE, coreference

      • tasks: TE (template element), ST (scenario template)

    • MUC-7 (1998),

      • air crashes

      • new task: TR (template relation)


    Historical framework

    MUC Evolution

    • MUC-6, MUC-7

      • Partial extractions are discarded

    extracted = a + b

    relevant = a + d

    recall = a / (a + d)

    precision = a / (a + b)

    Diagram labels: extracted, relevant; regions a, b, c, d
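    A short sketch of the strict MUC-6/MUC-7 measures, combined with the F-score introduced at MUC-4. The function names and the `beta` weighting parameter are assumptions for illustration; the slides only state that the F-score is a harmonic average.

```python
def strict_scores(a: int, b: int, d: int) -> tuple[float, float]:
    """Return (recall, precision) when partial extractions are discarded."""
    extracted = a + b
    relevant = a + d
    recall = a / relevant if relevant else 0.0
    precision = a / extracted if extracted else 0.0
    return recall, precision


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 weights them equally)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 8 correct, 2 spurious, 8 missed.
recall, precision = strict_scores(a=8, b=2, d=8)
print(recall, precision, f_score(precision, recall))
```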


    Architecture

    General Architecture

    • Hobbs,93:

      • Cascade of transducers (or modules) that add structure to text and, often, drop out irrelevant information by applying rules
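    The cascade idea can be sketched as a chain of functions over a labelled token list. Everything here (the stage names, the `"O"` label for unlabelled tokens, the tiny keyword table) is hypothetical and only illustrates the "add structure, then drop irrelevant information" pattern, not any particular system.

```python
from functools import reduce
from typing import Callable, List, Tuple

Token = Tuple[str, str]                      # (word, label); "O" means unlabelled
Transducer = Callable[[List[Token]], List[Token]]


def label_times(tokens: List[Token]) -> List[Token]:
    """Add structure: mark four-digit tokens as times."""
    return [(w, "TIME") if w.isdigit() and len(w) == 4 else (w, lab)
            for w, lab in tokens]


def label_keywords(tokens: List[Token]) -> List[Token]:
    """Add structure: mark domain keywords from a small hypothetical table."""
    keywords = {"bomb": "INSTRUMENT"}
    return [(w, keywords.get(w.lower(), lab)) for w, lab in tokens]


def drop_unlabelled(tokens: List[Token]) -> List[Token]:
    """Drop irrelevant information: keep only labelled tokens."""
    return [(w, lab) for w, lab in tokens if lab != "O"]


def cascade(*stages: Transducer) -> Transducer:
    """Compose stages so each one consumes the previous stage's output."""
    def run(tokens: List[Token]) -> List[Token]:
        return reduce(lambda acc, stage: stage(acc), stages, tokens)
    return run


pipeline = cascade(label_times, label_keywords, drop_unlabelled)
tokens = [(w, "O") for w in "the bomb blew up a power tower at 0650".split()]
result = pipeline(tokens)
print(result)
```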


    Architecture

    Traditional Architecture

    Modules: Document Preprocessing, Pattern Matching, Postprocess

    Knowledge resources: Conceptual Hierarchy, Pattern Base


    Architecture

    Traditional Architecture

    Modules: Text Control, Lexical Analysis, Syntactic Analysis, Pattern Matching, Postprocess

    Knowledge resources: Conceptual Hierarchy, Pattern Base


    Architecture

    Traditional Architecture

    Modules: Text Control, Lexical Analysis, Syntactic Analysis, Pattern Matching, Discourse Analysis, Output Template Generation

    Knowledge resources: Conceptual Hierarchy, Pattern Base, Output Format


    Architecture

    Text control

    • Filtering relevant documents

    • Guessing the language of the documents

    • Splitting documents into textual zones

    • Filtering relevant zones

    • Splitting text into appropriate units (e.g. sentences)

    • Filtering relevant units

    • Tokenizing units
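    A minimal sketch of three of the text-control steps listed above (splitting into units, filtering relevant units, tokenizing), assuming punctuation-based sentence splitting and keyword-based relevance. The function names and the keyword set are hypothetical; real systems use far more robust splitters and filters.

```python
import re
from typing import List, Set


def split_units(text: str) -> List[str]:
    """Split text into sentence-like units after '.', '!' or '?'."""
    return [u for u in re.split(r"(?<=[.!?])\s+", text.strip()) if u]


def is_relevant(unit: str, keywords: Set[str]) -> bool:
    """Keep a unit if it mentions at least one scenario keyword."""
    lowered = unit.lower()
    return any(k in lowered for k in keywords)


def tokenize(unit: str) -> List[str]:
    """Split a unit into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", unit)


def text_control(text: str, keywords: Set[str]) -> List[List[str]]:
    """Split, filter and tokenize in sequence."""
    return [tokenize(u) for u in split_units(text) if is_relevant(u, keywords)]


doc = ("A bomb went off this morning. The weather was mild. "
       "No casualties have been reported.")
units = text_control(doc, {"bomb", "casualties"})
print(units)
```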


    Architecture

    Text control

    • Example

    <Sombrero bastante carnoso de 4 a 8 cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .> <Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .> <Con la edad generalmente palidece sus tonos .>

    <Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga . ><Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas.>


    Architecture

    Lexical analysis

    • Identifying morpho-syntactic categories and semantic categories of words

      • General lexicon

    • Recognizing terminology words

      • Specific dictionaries

    • Recognizing time expressions, quantities, abbreviations, …

    • Expanding abbreviations

      • Lists of abbrev. + expansion



    Architecture

    Lexical analysis

    • Recognizing and classifying proper nouns (Named Entities, NERC)

      • Gazetteers

      • Patterns

    • Dealing with unknown words

    • Dealing with lexical ambiguities

      • POS taggers

      • WSD (???)
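    As a rough illustration of NERC with gazetteers and patterns, the sketch below looks up known names in a dictionary and applies one regular expression for times. The gazetteer entry, the pattern and the labels are hypothetical stand-ins, not resources from any of the systems discussed.

```python
import re
from typing import List, Tuple

# Hypothetical gazetteer: known names mapped to entity classes.
GAZETTEER = {"San Salvador": "LOCATION"}

# Hypothetical pattern: a four-digit token is read as a time expression.
PATTERNS = [(re.compile(r"\b\d{4}\b"), "TIME")]


def recognise_entities(text: str) -> List[Tuple[str, str]]:
    """Return (mention, class) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), m.group(), label))
    for pattern, label in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), m.group(), label))
    return [(mention, label) for _, mention, label in sorted(found)]


sentence = "The bomb blew up a power tower in San Salvador at 0650."
entities = recognise_entities(sentence)
print(entities)
```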


    Architecture

    Lexical analysis

    • Example 1

    Categories highlighted in the text (which ones are needed depends on the scenario): time expressions, mushroom names, abbreviations, numbers, morphologic parts

    <Sombrero bastante carnoso de 4 a 8cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .> <Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .> <Con la edad generalmente palidece sus tonos .>

    <Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga . ><Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas.>


    Architecture

    Lexical analysis

    • Example 2

    Categories highlighted in the text: time expressions, locations, organizations, persons

    <A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy , but no casualties have been reported .>

    <According to unofficial sources , the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650 .>


    Architecture

    Syntactic analysis

    • Full parsing (Lolita, LaSIE, LaSIE-II)

      • inefficient because of the size of the grammars

      • lack of robustness (out-of-vocabulary words)

      • treebank grammars

      • cascaded grammars

        • Solve some problems related to tuning and incompleteness


    Architecture

    Syntactic analysis

    • Partial parsing

      • the most commonly used

      • chunks or phrasal trees (noun phrases, verbal phrases, prep phrases, adj phrases, adv phrases)

      • absence of global dependencies


    Architecture

    Semantic interpretation

    • Compositional semantics

      • full parsing + λ-expressions

        • LaSIE, LaSIE-II

        • Entries with λ-expressions in the lexicons

      • partial parsing + grammatical relations [Vilain, 99]

      • output = logical forms


    Architecture

    Semantic interpretation

    • Compositional semantics (example 1)

    A bomb went off this morning near a power tower in San Salvador …

    Parse tree node labels: s, vp, pp, np, np, np, pp

    Lexicon entries:

    go_off → λ(t) λ(s) λ(r) λ(z) λ(y) λ(x) (bombing(x,y,z,r,s,t))

    power_tower → λ(x) (power_tower(x))

    Result:

    λ(z) λ(y) λ(x) (bombing(x,y,z,bomb,today_morning,power_tower(San_Salvador)))
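    The λ-expressions of this example can be mimicked with curried Python functions. This is a sketch of the composition only: representing logical forms as nested tuples is an assumption, and the order in which arguments are supplied simply follows the order of the λ-binders in the `go_off` entry.

```python
def go_off(t):
    """Curried form of λ(t) λ(s) λ(r) λ(z) λ(y) λ(x) bombing(x,y,z,r,s,t)."""
    return lambda s: lambda r: lambda z: lambda y: lambda x: (
        "bombing", x, y, z, r, s, t)


def power_tower(x):
    """λ(x) power_tower(x), with the logical form kept as a tuple."""
    return ("power_tower", x)


# Supply place (t), time (s) and instrument (r); x, y and z stay open,
# which matches the partially applied form shown for the sentence.
partial = go_off(power_tower("San_Salvador"))("today_morning")("bomb")

# Filling the remaining arguments with placeholder names shows the slot order.
print(partial("z")("y")("x"))
```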


    Architecture

    Semantic interpretation

    • Compositional semantics (example 2)

    A bomb went off this morning near a power tower in San Salvador …

    Grammatical relations labelled in the figure: subj, time, place, location_of

    event(bombing , E)

    subj(bomb , E)

    time(today_morning , E)

    place(power_tower, E)

    location_of(power_tower, San_Salvador)


    Architecture

    Semantic interpretation

    • Pattern matching

      • after partial parsing + SVO dependencies

      • the most widespread approach

      • patterns can be implemented in different ways

      • scenario driven approach (TE, TR, ST, …)

      • Output = partial templates


    Architecture

    Semantic interpretation

    • Pattern matching (example)

    A bomb went off this morning near a power tower in San Salvador …

    np(C-instrument) … vp(go_off) … np(C-time) … “near” np(C-place) “in” np(C-location)

    INSTRUMENT := C-instrument

    DATE := C-time

    PHIS_TARGET := C-place

    LOCATION := C-location
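    The rule above can be approximated by matching its elements, in order, against a chunked sentence. The sketch makes two simplifying assumptions: chunks arrive as hypothetical `(kind, concept, text)` triples from an earlier partial parser, and gaps are allowed between all elements, whereas the slide writes “…” only in some positions.

```python
from typing import Dict, List, Optional, Tuple

Chunk = Tuple[str, str, str]                  # (kind, concept or word, surface text)
Element = Tuple[str, str, Optional[str]]      # (kind, concept or word, slot to fill)

PATTERN: List[Element] = [
    ("np", "C-instrument", "INSTRUMENT"),
    ("vp", "go_off", None),
    ("np", "C-time", "DATE"),
    ("word", "near", None),
    ("np", "C-place", "PHIS_TARGET"),
    ("word", "in", None),
    ("np", "C-location", "LOCATION"),
]


def match(pattern: List[Element], chunks: List[Chunk]) -> Optional[Dict[str, str]]:
    """Return the filled slots, or None if the pattern does not match."""
    slots: Dict[str, str] = {}
    pos = 0
    for kind, value, slot in pattern:
        # Skip chunks until one matches the current pattern element.
        while pos < len(chunks) and chunks[pos][:2] != (kind, value):
            pos += 1
        if pos == len(chunks):
            return None
        if slot is not None:
            slots[slot] = chunks[pos][2]
        pos += 1
    return slots


chunks: List[Chunk] = [
    ("np", "C-instrument", "a bomb"),
    ("vp", "go_off", "went off"),
    ("np", "C-time", "this morning"),
    ("word", "near", "near"),
    ("np", "C-place", "a power tower"),
    ("word", "in", "in"),
    ("np", "C-location", "San Salvador"),
]
print(match(PATTERN, chunks))
```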


    Architecture

    Discourse analysis

    • Inter-sentence analysis

      • Co-reference resolution

      • Ellipsis resolution

      • Alias resolution

      • Traditional semantic interpretation procedures

      • Template merging procedures

    • Inference procedures

      • Open-domain and domain-specific knowledge for inferences


    Architecture

    Discourse analysis

    • Example

    A bomb went off this morning near a power tower in San Salvador …, but no casualties have been reported

    λ(y) λ(x) (bombing(x,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))

    According to unofficial sources , the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650

    λ(z) λ(y) (bombing(urban_guerrilla_comandos,y,z,bomb,0650,power_tower(the_northwestern_part_of_San_Salvador)))


    Architecture

    Discourse analysis

    • Example

    λ(y) λ(x) (bombing(x,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))

    λ(z) λ(y) (bombing(urban_guerrilla_comandos,y,z,bomb,0650,power_tower(the_northwestern_part_of_San_Salvador)))

    Unification & inference

    λ(y) (bombing(urban_guerrilla_comandos,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))

    Inference (blew_up → destroyed)

    bombing(urban_guerrilla_comandos,destroyed,no_casualties,bomb,today_morning,power_tower(San_Salvador))
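    The unification step can be sketched as a slot-by-slot merge of two partial templates, where `None` plays the role of an unbound λ-variable. The slot names are hypothetical, and the sketch covers plain unification only: the slide additionally reconciles the differing time and place values through inference, which is left out here.

```python
from typing import Dict, Optional

Template = Dict[str, Optional[str]]


def unify(t1: Template, t2: Template) -> Optional[Template]:
    """Merge two partial templates; return None if two bound slots clash."""
    merged: Template = {}
    for slot in t1.keys() | t2.keys():
        v1, v2 = t1.get(slot), t2.get(slot)
        if v1 is None:
            merged[slot] = v2
        elif v2 is None or v1 == v2:
            merged[slot] = v1
        else:
            return None               # both bound to different values
    return merged


# Partial results of the two sentences, restricted to compatible slots.
first = {"perpetrator": None, "physical_effect": None,
         "human_effect": "no_casualties", "instrument": "bomb"}
second = {"perpetrator": "urban_guerrilla_comandos", "physical_effect": None,
          "human_effect": None, "instrument": "bomb"}

merged = unify(first, second)
print(merged)
```

    The `physical_effect` slot stays unbound after merging; in the slide it is filled afterwards by the inference blew_up → destroyed.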


    Architecture

    Output template generation

    • Mapping of the extracted pieces into the desired output format

    • Specific inferences:

      • Normalization to predefined values of slots

      • Mandatory slots

      • Extracted information that implies different slot values


    Architecture

    Output template generation

    • Example

    bombing(urban_guerrilla_comandos,destroyed,no_casualties,bomb,today_morning,power_tower(San_Salvador))

    today_morning → March_19

    no_casualties = no_injuries_or_death

    Incident type: bombing

    Date: March 19

    Location: El Salvador: San Salvador (city)

    Perpetrator: urban guerrilla commandos

    Physical target: power tower

    Human target: -

    Effect on physical target: destroyed

    Effect on human target: no injury or death

    Instrument: bomb
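    A minimal sketch of output template generation for this example, assuming the extracted values arrive as a dictionary keyed by slot name. The normalization table reuses the two mappings from the slide plus a location entry taken from the final template; the function name and the `-` default for empty slots are illustrative choices.

```python
from typing import Dict, Optional

SLOT_ORDER = [
    "Incident type", "Date", "Location", "Perpetrator", "Physical target",
    "Human target", "Effect on physical target", "Effect on human target",
    "Instrument",
]

# Normalization of extracted values to predefined slot values.
NORMALISE = {
    "today_morning": "March 19",
    "no_casualties": "no injury or death",
    "San_Salvador": "El Salvador: San Salvador (city)",
}


def generate_template(extracted: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Fill every slot in order, normalizing values and marking gaps with '-'."""
    template: Dict[str, str] = {}
    for slot in SLOT_ORDER:
        value = extracted.get(slot)
        template[slot] = "-" if value is None else NORMALISE.get(value, value)
    return template


extracted = {
    "Incident type": "bombing",
    "Date": "today_morning",
    "Location": "San_Salvador",
    "Perpetrator": "urban guerrilla commandos",
    "Physical target": "power tower",
    "Effect on physical target": "destroyed",
    "Effect on human target": "no_casualties",
    "Instrument": "bomb",
}
for slot, value in generate_template(extracted).items():
    print(f"{slot}: {value}")
```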


    Knowledge specific for IE

    Characteristics of IE systems

    • Strong dependence on the domain

      • Scenario of extraction

      • Semantics vs. syntax

      • Discourse analysis

    • Strong dependence on the text structure

      • Sublanguages

      • Meta-information

    • Strong dependence on the output format

      • DBs

      • annotations


    Knowledge specific for IE

    Characteristics of IE systems

    • Importance of portability and tuning

    • Importance of knowledge engineering

      • Modularity

        • Basic tasks and specific tasks

      • Use of weak and local knowledge

    • Importance of NL resources

      • MRDs, ontologies, general lexicons, specific dictionaries, …


    Knowledge specific for IE

    Knowledge resources

    • Knowledge more or less stable

      • general lexicon

      • general grammar

      • basic NL processors: segmenters, taggers, parsers, …

    • Domain dependent knowledge

      • Domain specific vocabularies, terminology

      • gazetteers and patterns for NERC

      • IE patterns (knowledge specifically used for IE)


    Knowledge specific for IE

    Types of IE patterns

    • Viewpoint 1: type of representation

      • rules

    np(C-instrument) … vp(go_off) … np(C-time) … “near” np(C-place) “in” np(C-location)

    Event:INSTRUMENT := C-instrument

    Event:DATE := C-time

    Event:PHIS_TARGET := C-place

    Event:LOCATION := C-location

    Knowledge specific for IE

    Types of IE patterns

    • Viewpoint 1: type of representation

      • statistical models (BNs, HMMs, ME, Hyperplanes, …)

    [Figure: HMM whose states are annotated with tokens such as "who", "speaker", "dr.", "professor", "robert", "michael", "stevens", "seminar" and "reminder", and whose transitions carry probabilities such as 1.0, 0.99, 0.76, 0.56 and 0.24]
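    To make the HMM option concrete, here is a small Viterbi decoder that labels tokens as background or speaker. It is a sketch under loud assumptions: the two states, all probabilities and the emission tables are invented for illustration and are not the values of the figure; unseen tokens receive a small floor probability.

```python
import math
from typing import Dict, List

STATES = ["bg", "speaker"]
START = {"bg": 0.9, "speaker": 0.1}
TRANS = {"bg": {"bg": 0.7, "speaker": 0.3},
         "speaker": {"bg": 0.4, "speaker": 0.6}}
EMIT = {"bg": {"who": 0.3, ":": 0.3, "will": 0.2, "speak": 0.2},
        "speaker": {"dr.": 0.4, "stevens": 0.4, "robert": 0.2}}


def viterbi(tokens: List[str], floor: float = 1e-6) -> List[str]:
    """Return the most probable state sequence for the tokens (log space)."""
    def emit(state: str, token: str) -> float:
        return math.log(EMIT[state].get(token, floor))

    best: Dict[str, float] = {s: math.log(START[s]) + emit(s, tokens[0])
                              for s in STATES}
    back: List[Dict[str, str]] = []
    for token in tokens[1:]:
        prev_best, best, pointers = best, {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            p = max(STATES, key=lambda q: prev_best[q] + math.log(TRANS[q][s]))
            best[s] = prev_best[p] + math.log(TRANS[p][s]) + emit(s, token)
            pointers[s] = p
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


tokens = ["who", ":", "dr.", "stevens", "will", "speak"]
labels = viterbi(tokens)
speaker = [t for t, lab in zip(tokens, labels) if lab == "speaker"]
print(labels, speaker)
```

    The tokens decoded in the speaker state form the slot filler, which is why such a model acts as a single-slot extraction pattern.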


    Knowledge specific for IE

    Types of IE patterns

    • Viewpoint 2: type of values extracted

      • slot filler extraction patterns (the HMM presented before)

      • event extraction patterns (the rule presented before)

    np(C-instrument) … vp(go_off) … np(C-time) … “near” np(C-place) “in” np(C-location)

    Event:INSTRUMENT := C-instrument

    Event:DATE := C-time

    Event:PHIS_TARGET := C-place

    Event:LOCATION := C-location

    Knowledge specific for IE

    Types of IE patterns

    • Viewpoint 2: type of values extracted

      • slot filler extraction patterns (the HMM presented before)

      • event extraction patterns (the rule presented before)

      • relation extraction patterns

    np(C-person) … vp(is) pron(C-his) “wife”

    Married_with:HUSBAND := C-his

    Married_with:WIFE := C-person


    Knowledge specific for IE

    Types of IE patterns

    • Viewpoint 3: number of slot fillers extracted

      • single-slot IE patterns

        (the HMM presented before)

      • multi-slot IE patterns

        (both rules presented before)


    Examples of IE systems

    Methodologies [Turmo,2002]

    System     Reference
    LaSIE      Gaizauskas et al, 1995
    LaSIE-II   Humphreys et al, 1998
    LOLITA     Garigliano et al, 1998
    CIRCUS     Lehnert et al, 1991
    FASTUS     Hobbs et al, 1993
    BADGER     Fisher et al, 1995
    HASTEN     Krupka, 1995
    PROTEUS    Grishman, 1995
    ALEMBIC    Aberdeen et al, 1993
    PIE        Lin, 1995
    TURBIO     Turmo, 2002
    PLUM       Weischedel et al, 1995
    IE2        Aone et al, 1998
    LOUELLA    Childs et al, 1995
    SIFT       Miller et al, 1998

    Values of the Parsing, Semantics and Discourse columns (their assignment to rows is not recoverable from the transcript): in-depth understanding; chunking; partial parsing; pattern matching; grammatical relations interpretation; semantic interpretation procedures; template merging; syntactico-semantic parsing; -


    Examples of IE systems

    Knowledge [Turmo,2002]

    Columns: System, Parsing, Semantics, Discourse

    Systems: LaSIE, LaSIE-II, LOLITA, CIRCUS, FASTUS, BADGER, HASTEN, PROTEUS, ALEMBIC, TURBIO, PIE, PLUM, IE2, LOUELLA, SIFT

    Knowledge listed in the table cells (their assignment to rows is not recoverable from the transcript):

    • Treebank grammar; λ-expressions

    • hand-crafted stratified general grammar

    • General grammar; semantic network

    • concept nodes (AutoSlog)

    • hand-crafted IE rules

    • concept nodes (CRYSTAL); decision trees

    • Phrasal grammar; E-graphs

    • IE rules (ExDISCO); hand-crafted gram relations

    • IE rules (EVIUS)

    • General grammar; hand-crafted IE rules

    • hand-crafted rules

    • hand-crafted IE rules; decision trees

    • Statistical models for syntactic-semantic parsing & coreference resolution learned from PTB and on-domain annotated texts


    Examples of IE systems

    LaSIE-II system

    Modules: Gazetteer lookup, Sentence splitter, Brill tagger, Tagged morph, Buchart parser, Name matcher, Discourse interpreter, Template writer

    Knowledge resources: gazetteers, Lexicon, Stratified grammar, Conceptual hierarchy

    Outputs: TE, TR, ST


    • Preprocessing

      • NERC preprocess via gazetteers and keyword lists

      • Root form and inflexional suffix for verbs, nouns and adjs found in sentences

    According_to-adv unofficial-adj source[s]-n , the-det bomb-n – allegedly-adv detonate[ed]-v by-prep urban-adj guerrilla-n commando[s]-n - blow_up-v a-det power_tower-n in-prep the-det northwestern-adj part-n of-prep San Salvador-loc at-prep 0650


    • Syntactico-semantic interpretation

      • bottom-up chart parser

      • cascade of NERC grammars (e.g. aircraft, person, money, time, timex)

    According_to-adv unofficial-adj source[s]-n , the-det bomb-n – allegedly-adv detonate[ed]-v by-prep urban-adj guerrilla-n commando[s]-n - blow_up-v a-det power_tower-n in-prep the-det northwestern part of San Salvador-loc at-prep 0650-time

    NE1 = northwestern part of San Salvador (loc); NE2 = 0650 (time)


    • Syntactico-semantic interpretation

      • bottom-up chart parser

      • cascade of NERC grammars (e.g. aircraft, person, money, time)

      • cascade of partial grammars (NPs, PPs, complex NP, VPs, complex VPs, RelClauses, Sentence)

    S(According_to-adv NP(unofficial-adj source[s]-n) , NP(the-det bomb-n) – allegedly-adv VP(detonate[ed]-v) PP(by-prep NP(urban-adj guerrilla-n commando[s]-n)) - VP(blow_up-v) PP(NP(a-det power_tower-n) PP(in-prep NP(the-det NE1-loc))) PP(at-prep NP(NE2-time)))


    • Syntactico-semantic interpretation

    • bottom-up chart parser

    • cascade of NERC grammars (e.g. aircraft, person, money, time)

    • cascade of partial grammars (NPs, PPs, complex NPs, VPs, complex VPs, RelClauses, Sentence)

    • QLFs, i.e. quasi-logical forms (Note: the real implementation of QLFs is not specified)

    Event(E1), detonate(E1,Y,X), urban_guerrilla_commando(X), bomb(Y),

    Event(E2), blow_up(E2,Y,Z), power_tower(Z), location_of(Z,NE1), time_of(E2,NE2)
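Since the slide leaves the implementation of QLFs unspecified, the following is only a minimal sketch of one possible encoding: a flat list of predicate tuples. The names `Pred` and `mentions` are hypothetical and are not part of LaSIE-II.

```python
from typing import NamedTuple


class Pred(NamedTuple):
    """One QLF predicate: a name plus its argument variables or constants."""
    name: str
    args: tuple


# QLF of the example sentence, following the slide (a flat conjunction of predicates).
qlf = [
    Pred("Event", ("E1",)),
    Pred("detonate", ("E1", "Y", "X")),
    Pred("urban_guerrilla_commando", ("X",)),
    Pred("bomb", ("Y",)),
    Pred("Event", ("E2",)),
    Pred("blow_up", ("E2", "Y", "Z")),
    Pred("power_tower", ("Z",)),
    Pred("location_of", ("Z", "NE1")),
    Pred("time_of", ("E2", "NE2")),
]


def mentions(qlf, var):
    """Return the names of all predicates that take `var` as an argument."""
    return [p.name for p in qlf if var in p.args]


# The shared variable Y ties the bomb to both events.
print(mentions(qlf, "Y"))
```

A flat encoding like this makes it easy to see that the two events share the variable Y, which is what later stages rely on when linking them.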

    Examples of IE systems

    LaSIE-II system

    • Discourse analysis

    • Name matcher: Matches variants of NEs across the text

    • Discourse interpreter:

      • adds QLF representation to a semantic net (links)

      • adds presuppositions

      • coreference resolution

    [Semantic net figure: the QLF below is attached to the net nodes "bombing event", "destroy" and "location of event" through links labelled "implies" and "isa".]

    Event(E1), detonate(E1,Y,X), urban_guerrilla_commando(X), bomb(Y),

    Event(E2), blow_up(E2,Y,Z), power_tower(Z), location_of(Z,NE1), time_of(E2,NE2)

    Examples of IE systems

    LaSIE-II system

    • Output template generation

    • procedure that writes the templates in the desired format

    Incident type: bombing

    Date: March 19

    Location: El Salvador: San Salvador (city)

    Perpetrator: urban guerrilla commandos

    Physical target: power tower

    Human target: -

    Effect on physical target: destroyed

    Effect on human target: no injury or death

    Instrument: bomb
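As an illustration of the template-writing step, here is a minimal sketch that renders filled slots in the layout shown above. The function name `write_template` and the dictionary encoding are assumptions made for this example, not the LaSIE-II procedure.

```python
# Slot names taken from the example template; their order is fixed by the output format.
SLOT_ORDER = [
    "Incident type", "Date", "Location", "Perpetrator", "Physical target",
    "Human target", "Effect on physical target", "Effect on human target",
    "Instrument",
]


def write_template(values):
    """Render one 'Slot: value' line per slot; unfilled slots are written as '-'."""
    return "\n".join(f"{slot}: {values.get(slot, '-')}" for slot in SLOT_ORDER)


# Slot fillers for the bombing example ("Human target" is deliberately left unfilled).
filled = {
    "Incident type": "bombing",
    "Date": "March 19",
    "Location": "El Salvador: San Salvador (city)",
    "Perpetrator": "urban guerrilla commandos",
    "Physical target": "power tower",
    "Effect on physical target": "destroyed",
    "Effect on human target": "no injury or death",
    "Instrument": "bomb",
}

print(write_template(filled))
```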

    Examples of IE systems

    PROTEUS system

    [Architecture diagram (tasks: TE, TR, ST). Modules: Lexical Analyzer, NERC, Partial parsing, Scenario Patterns, Coreference resolution, Discourse Analysis, Output generator. Resources: Lexicon, NERC Rules, Chunk grammar, IE-Rules, Conceptual hierarchy, Inference Rules, Format Rules.]

    Examples of IE systems

    PROTEUS system

    Preprocessing

    According_to-adv unofficial-adj sources-n , the-det bomb-n – allegedly-adv detonated-v by-prep urban-adj guerrilla-n commandos-n - blew_up-v a-det power_tower-n in-prep the-det [northwestern part of San Salvador]-loc at-prep [0650]-time

    NE1 = northwestern part of San Salvador (location); NE2 = 0650 (time)

    Examples of IE systems

    PROTEUS system

    • Syntactico-semantic interpretation

    • basic VP and NP chunks+head_semantics

    • semantics refer to types of slot fillers (Conceptual hierarchy)

    According_to-adv NP(unofficial-adj sources-n-s1) , NP(the-det bomb-n-artifact) – allegedly-adv VP(detonated-v-s3) by-prep NP(urban-adj guerrilla-n commandos-n-person) – VP(blew_up-v-s4) NP(a-det power_tower-n-building) in-prep NP(NE1-location) at-prep NP(NE2-time)

    Examples of IE systems

    PROTEUS system

    • Syntactico-semantic interpretation

    • basic VP and NP chunks+head_semantics

    • IE-rules for relations (appositions, PP-attachments, limited conjunctions)

      • NP(A-person) , B-integer years old , → instance(X,person), name_of(X,A), age_of(X,B)

      • NP(A-position) of NP(B-company) → instance(X,person), position_of(X,A), company_of(X,B)

    Real implementation as objects:

    Slot     Value
    Class    person
    name     A
    age      B
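The first relation rule above can be sketched as a pattern over a sequence of typed chunks. This is only an illustration under stated assumptions: the chunk encoding as (text, semantic type) pairs, the function `apply_age_rule` and the sample sentence are hypothetical and do not reproduce PROTEUS code.

```python
def apply_age_rule(chunks):
    """Apply: NP(A-person) , B-integer years old ,  ->  instance/name_of/age_of facts.

    `chunks` is a list of (text, semantic_type) pairs; plain tokens have type None.
    """
    facts = []
    for i in range(len(chunks) - 5):
        (a, a_type), comma1, (b, b_type), years, old, comma2 = chunks[i:i + 6]
        if (a_type == "person" and comma1[0] == "," and b_type == "integer"
                and years[0] == "years" and old[0] == "old" and comma2[0] == ","):
            x = f"X{len(facts) // 3 + 1}"  # fresh variable for each matched person
            facts += [("instance", x, "person"), ("name_of", x, a), ("age_of", x, int(b))]
    return facts


# Invented sample input: "John Smith , 42 years old , resigned"
chunks = [
    ("John Smith", "person"), (",", None), ("42", "integer"),
    ("years", None), ("old", None), (",", None), ("resigned", None),
]
print(apply_age_rule(chunks))
```

The resulting facts correspond to the object shown in the table: class person, name A, age B.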

    Examples of IE systems

    PROTEUS system

    • Syntactico-semantic interpretation

    • basic VP and NP chunks+head_semantics

    • IE-rules for relations (appositions, PP-attachments, limited conjunctions)

    • IE-rules for events (PET interface or ExDISCO)

      • NP(A-artifact) v-s4 NP(B-building) → instance(E1,s4), instrument_of(E1,A), physical_target_of(E1,B)

    According_to-adv NP(unofficial-adj sources-n-s1) , NP(the-det bomb-n-artifact) – allegedly-adv VP(detonated-v-s3) by-prep NP(urban-adj guerrilla-n commandos-n-person) – VP(blew_up-v-s4) NP(a-det power_tower-n-building) in-prep NP(NE1-location) at-prep NP(NE2-time)

    Examples of IE systems

    PROTEUS system

    • Discourse analysis

    • antecedents are sought in sequential order

    • constraints:

      • instance of a hyperclass

      • same number

      • share arguments
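The constraint-based antecedent search can be sketched as follows. Several points are assumptions made for this example rather than details given on the slide: candidates are scanned from the most recent mention backwards, the hierarchy fragment `HIERARCHY` is invented, and "share arguments" is read as "no conflicting values for arguments that both mentions define".

```python
# Hypothetical fragment of a conceptual hierarchy: child class -> parent class.
HIERARCHY = {"company": "organization", "organization": "entity", "person": "entity"}


def is_a(cls, ancestor):
    """True if `cls` equals `ancestor` or is one of its descendants in HIERARCHY."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = HIERARCHY.get(cls)
    return False


def find_antecedent(anaphor, previous_mentions):
    """Return the first earlier mention that satisfies all three constraints, or None."""
    for cand in reversed(previous_mentions):  # assumed order: most recent first
        same_number = cand["number"] == anaphor["number"]
        class_ok = is_a(cand["cls"], anaphor["cls"])
        args_ok = all(cand["args"][k] == v
                      for k, v in anaphor["args"].items() if k in cand["args"])
        if same_number and class_ok and args_ok:
            return cand
    return None


mentions = [
    {"text": "Microsoft", "cls": "company", "number": "sg", "args": {}},
    {"text": "the analysts", "cls": "person", "number": "pl", "args": {}},
]
anaphor = {"text": "the organization", "cls": "organization", "number": "sg", "args": {}}
print(find_antecedent(anaphor, mentions)["text"])
```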

    Examples of IE systems

    PROTEUS system

    • Discourse analysis

    • QLFs + inference rules = more complex QLFs

    • conversion of date expressions.

    • inference of slot values from the QLFs already obtained

    • inference of events from others explicitly described

      • Fred, the president of Cuban Cigar Corp., was appointed vice president of Microsoft

      • implies

      • Fred left the Cuban Cigar Corp.
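The "appointed implies left" example can be written as a small inference rule over extracted facts. The fact format and the names `holds`, `appointed` and `leaves` are hypothetical choices for this sketch; the slides do not show how PROTEUS encodes its inference rules.

```python
def infer_leave_events(facts):
    """If a person holds a post at one company and is appointed at another, infer a leave event.

    Each fact is a tuple (kind, person, position, company).
    """
    inferred = []
    for kind, person, _, new_company in facts:
        if kind != "appointed":
            continue
        for kind2, person2, _, old_company in facts:
            if kind2 == "holds" and person2 == person and old_company != new_company:
                inferred.append(("leaves", person, old_company))
    return inferred


# Facts extracted from the example sentence on the slide.
facts = [
    ("holds", "Fred", "president", "Cuban Cigar Corp."),
    ("appointed", "Fred", "vice president", "Microsoft"),
]
print(infer_leave_events(facts))
```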

    Examples of IE systems

    PROTEUS system

    • Output template generation

    • use of rules to build the templates with the desired format

    Examples of IE systems

    IE2 system

    [Architecture diagram (tasks: TE, TR, ST). Modules: NetOwl Extractor 3.0, Custom NameTag, PhraseTag, EventTag, Discourse Module, TempGen. Other labels: Hand-crafted rules, Decision tree.]

    Examples of IE systems

    IE2 system

    • Preprocessing

    • only NERC

    • SGML-tagged

    • general NE types and subtypes

    • restricted-domain NE types and subtypes

    <person id=1>Jeff Bantle</person>, <entity id=2>NASA</entity>’s mission operations directorate representative for the shuttle flight

    Examples of IE systems

    IE2 system

    • Syntactico-semantic interpretation

    • SGML-tagging of phrases that are values of slots

    • NPs denoting persons (PNP), organizations (ENP), artifacts (ANP), …

    • local links (location-of, employee-of, owner-of, …)

    <person id=1>Jeff Bantle</person>, <PNP affil=2><entity id=2>NASA</entity>’s mission operations directorate representative for the shuttle flight</PNP>

    Examples of IE systems

    IE2 system

    • Syntactico-semantic interpretation

    • SGML-tagging of phrases that are values of slots in templates

    • NPs

    • local semantic relations (employee-of, location-of, product-of, …)

    • event IE-rules (note: the real implementation is not specified)

      • $Vehicle + LaunchN → launch_event::vehicle_info := $Vehicle

    <launch_event id=2 vehicle_info=1><ANP>The <vehicle id=1>Ariane 5</vehicle> launch</ANP> was successfully achieved at 6am

    Examples of IE systems

    IE2 system

    • Discourse analysis

    • Three coreference resolution methods

      • Rule based

      • Machine learning based

      • Hybrid

    • Name alias resolution in addition to that performed by NetOwl

    • Definite NPs

    • Singular personal pronouns

    <person id=1>Jeff Bantle</person>, <PNP ref=1 affil=2><entity id=2>NASA</entity>’s mission operations directorate representative for the shuttle flight</PNP>

    Examples of IE systems

    IE2 system

    • Output template generation

    • Translates SGML output into templates in the desired format

    • Resolves and normalizes time expressions

    • Performs event merging

    Examples of IE systems

    SIFT system

    [Architecture diagram (tasks: TE, TR). Modules: IdentiFinder™, Sentence level, Cross-sentence level, Output generator. Other label: Statistical models.]

    Examples of IE systems

    SIFT system

    • Preprocessing

    • NERC using an HMM [Bikel et al. 97] + Viterbi maximizing Pr(W,F,C)

    • each word is tagged with one NE class

    HMM states: start-sentence, not-a-name, person, organization, location, end-sentence
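To make the Viterbi step concrete, here is a minimal sketch of Viterbi decoding over NE classes. It is a plain first-order HMM with invented toy probabilities; it ignores the word features F and the start/end-sentence states, so it is a simplification and not the IdentiFinder model itself.

```python
import math

STATES = ["not-a-name", "person", "organization", "location"]


def viterbi(words, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state sequence for `words` under a first-order HMM."""
    def emit(state, word):
        # Unseen words get a small floor probability instead of zero.
        return math.log(emit_p[state].get(word, floor))

    # best[i][s] = (log-probability of the best path ending in s at word i, previous state)
    best = [{s: (math.log(start_p[s]) + emit(s, words[0]), None) for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in STATES),
                key=lambda item: item[1],
            )
            column[s] = (score + emit(s, word), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Toy parameters, invented for illustration only.
start_p = {"not-a-name": 0.7, "person": 0.1, "organization": 0.1, "location": 0.1}
trans_p = {s: {t: (0.7 if s == t else 0.1) for t in STATES} for s in STATES}
emit_p = {
    "not-a-name": {"visited": 0.3, "in": 0.3},
    "person": {"Nance": 0.5},
    "organization": {"ABC": 0.4, "News": 0.4},
    "location": {"Boston": 0.5},
}

print(viterbi(["Nance", "visited", "Boston"], start_p, trans_p, emit_p))
```

Working in log space avoids numerical underflow on long sentences, which is the usual reason to sum log-probabilities instead of multiplying probabilities.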

    Examples of IE systems

    SIFT system

    • Syntactico-semantic interpretation

    • properties of NEs (TE) and relations (TR)

    • generative statistical model [Miller et al. 98, 00]

    • search for the most likely augmented parse tree (bottom-up chart based)

    • pruning of low-probability constituents

    Examples of IE systems

    SIFT system

    Syntactico-semantic interpretation

    Augmented parse tree for "Nance , a paid consultant to ABC News , …":

    per/np
      per-r/np
        per/nnp  Nance
      ,
      per-desc-r/np
        per-desc/np
          det  a
          vbn  paid
          per-desc/nn  consultant
        emp-of/pp-lnk
          org-ptr/pp
            to  to
            org-r/np
              org’/nnp  ABC
              org/nnp  News
      ,

    Examples of IE systems

    SIFT system

    • Syntactico-semantic interpretation

    • relations between NEs across sentences

    • statistical model [Miller et al. 98]

    • classifier of pairs of entities

      • entities in different sentences

      • entities do not take part in local relations

      • their types are compatible with any relation
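The three conditions above define which entity pairs are passed to the cross-sentence classifier. A minimal sketch of that candidate filter follows; the relation signatures, the entity encoding and the reading of the second condition (neither entity already takes part in a sentence-level relation) are assumptions for this example, and the statistical classifier itself is not shown.

```python
from itertools import combinations

# Hypothetical relation signatures: relation name -> (type of argument 1, type of argument 2).
RELATION_SIGNATURES = {
    "employee_of": ("person", "organization"),
    "location_of": ("organization", "location"),
}


def candidate_pairs(entities, in_local_relation):
    """Return id pairs to classify: different sentences, no local relation, compatible types.

    `entities` is a list of dicts with keys id, type and sentence;
    `in_local_relation` is the set of entity ids already linked at sentence level.
    """
    signatures = set(RELATION_SIGNATURES.values())
    pairs = []
    for a, b in combinations(entities, 2):
        if a["sentence"] == b["sentence"]:
            continue
        if a["id"] in in_local_relation or b["id"] in in_local_relation:
            continue
        types = (a["type"], b["type"])
        if types in signatures or types[::-1] in signatures:
            pairs.append((a["id"], b["id"]))
    return pairs


entities = [
    {"id": 1, "type": "person", "sentence": 0},
    {"id": 2, "type": "organization", "sentence": 1},
    {"id": 3, "type": "location", "sentence": 1},
    {"id": 4, "type": "person", "sentence": 2},
]
print(candidate_pairs(entities, in_local_relation={4}))
```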

    Examples of IE systems

    TURBIO system

    [Architecture diagram (tasks: TE, TR). Modules: Lexical Analyzer, NERC, Partial parsing, IE-Rule set processor, IE-rule set scheduling, controller, Output generator. Resources: Lexicon, NERC Rules, Partial-tree grammar, IE-Rule sets.]

    Examples of IE systems

    TURBIO system

    • Preprocessing

    • WordNet synsets, lemmas, POS tags

    • NERC

    • parsed trees of noun, verbal, and adjectival phrases

    Examples of IE systems

    TURBIO system

    • Syntactico-semantic interpretation

    • Hypothesis: dependence among relations of NEs

    • Iterative execution of IE-rule sets depending on the scheduling

    • Example:

      • Scenario = Mushroom parts, their possible colors and the circumstances by which they are produced

      • There are colors in the documents that are not related to any mushroom part, but all colors related with a circumstance are colors related to mushroom parts.
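The mushroom example suggests running the color/circumstance rules first and then using their results when relating colors to mushroom parts. The sketch below illustrates that scheduling idea only; the sentence encoding, the two rule-set functions and the relation names are hypothetical and are not taken from TURBIO.

```python
def color_circumstance_rules(sentences, found):
    """Rule set 1: relate each color to a circumstance mentioned in the same sentence."""
    return {("color_when", c, k)
            for s in sentences for c in s["colors"] for k in s["circumstances"]}


def part_color_rules(sentences, found):
    """Rule set 2: relate parts to colors, keeping only colors already tied to a circumstance."""
    trusted = {c for (rel, c, _) in found if rel == "color_when"}
    return {("color_of", p, c)
            for s in sentences for p in s["parts"] for c in s["colors"] if c in trusted}


def run_schedule(sentences, schedule):
    """Run the rule sets in scheduled order; each one sees the relations found so far."""
    found = set()
    for rule_set in schedule:
        found |= rule_set(sentences, found)
    return found


# Invented annotations: "green" never occurs with a circumstance, so it is not trusted.
sentences = [
    {"parts": {"cap"}, "colors": {"brown"}, "circumstances": set()},
    {"parts": set(), "colors": {"brown"}, "circumstances": {"when bruised"}},
    {"parts": {"stem"}, "colors": {"green"}, "circumstances": set()},
]
print(run_schedule(sentences, [color_circumstance_rules, part_color_rules]))
```

Reversing the schedule would leave the second rule set with no trusted colors, which is why the execution order matters under the dependence hypothesis.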
