Information Extraction from Scientific Texts - PowerPoint PPT Presentation

Information extraction from scientific texts
Download
1 / 114

Information Extraction from Scientific Texts. Junichi Tsujii Graduate School of Science University of Tokyo Japan. Texts are one of the major sources of information and knowledge. However, they are not transparent. They have to be systematically integrated with

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Presentation

Information Extraction from Scientific Texts

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Information extraction from scientific texts

Information ExtractionfromScientific Texts

Junichi Tsujii

Graduate School of Science

University of Tokyo

Japan


Information extraction

Texts are one of the major sources of information and knowledge.

However, they are not transparent.

They have to be systematically integrated with

the other sources like data bases, numerical data,

etc.

Natural Language Processing--IE


Overview of genia system

Retrieval Module

Corpus

  • Request enhancement

  • Spawn request

  • Classify documents

Information Extraction

Module

  • Identify & classify terms

  • Identify events

Interface Module

Database

  • GUI

  • HTML conversion

  • System integration

Ontology

Markup

language

Data model

Raw(OCR)

Text

Structure

Annotated

Background Knowledge

Document

Named-Entity

Event

Overview of GENIA System

MEDLINE

Corpus Module

  • Markup generation / compilation

  • Annotated corpus construction

  • User

  • IR Request

  • Abstract

  • Full Paper

Security

Database Module

Concept Module

  • DB design / access / management

  • DB construction

  • BK design / construction / compilation


Information extraction

Plan

  • What is IE ?

  • General Framework of NLP

  • Basic IE techniques

  • IE in Biology

Automatic Term Recognition

(S. Ananiadou)


Information extraction

What is IE ?


Information extraction

Application Tasks of NLP

(1)Information Retrieval/Detection

To search and retrieve documents in response to queries

for information

(2)Passage Retrieval

To search and retrieve part of documents in response

to queries for information

(3)Information Extraction

To extract information that fits pre-defined database schemas

or templates, specifying the output formats

(4) Question/Answering Tasks

To answer general questions by using texts as knowledge

base: Fact retrieval, combination of IR and IE

(5)Text Understanding

To understand texts as people do: Artificial Intelligence


Information extraction

Ranges of Queries

(1)Information Retrieval/Detection

(2)Passage Retrieval

Pre-Defined: Fixed aspects

of information carried in texts

(3)Information Extraction

(4) Question/Answering Tasks

(5)Text Understanding


Information extraction

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)


Information extraction

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)


Information extraction

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)


Information extraction

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)


Information extraction

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)


Information extraction

production of

20, 000 iron and

metal wood clubs

[company]

[set up]

[Joint-Venture]

with

[company]

FASTUS

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

set up

new Twaiwan dallors

2.Basic Phrases:

Simple noun groups, verb groups and particles

a Japanese trading house

had set up

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Information extraction

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)


Information extraction

Information Extraction

……….

Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the

second floor of his Nanjing home early on Sunday.

The deputy general manager of Yaxing Benz, a Sino-German

joint venture that makes buses and bus chassis in nearby Yangzhou,

was hacked to death with 45 cm watermelon knives.

……….

Name of the Venture: Yaxing Benz

Products: buses and bus chassis

Location: Yangzhou,China

Companies involved: (1)Name: X?

Country: German

(2)Name: Y?

Country: China


Information extraction

Information Extraction

A German vehicle-firm executive was stabbed to death ….

……….

Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the

second floor of his Nanjing home early on Sunday.

The deputy general manager of Yaxing Benz, a Sino-German

joint venture that makes buses and bus chassis in nearby Yangzhou,

was hacked to death with 45 cm watermelon knives.

……….

Crime-Type: Murder

Type: Stabbing

The killed: Name: Jurgen Pfrang

Age: 51

Profession: Deputy general manager

Location: Nanjing, China

Different template

for crimes


Information extraction

User

User

System

System

System

Interpretation of Texts

(1)Information Retrieval/Detection

(2)Passage Retrieval

(3)Information Extraction

(4) Question/Answering Tasks

(5)Text Understanding


Information extraction

Characterization of Texts

Queries

IR System

Collection of Texts


Information extraction

Knowledge

Interpretation

Characterization of Texts

Queries

IR System

Collection of Texts


Information extraction

Knowledge

Interpretation

Characterization of Texts

Queries

Passage

IR System

Collection of Texts


Information extraction

Knowledge

Interpretation

Characterization of Texts

Queries

Structures

of

Sentences

NLP

Templates

Passage

IR System

IE System

Collection of Texts

Texts


Information extraction

Knowledge

Interpretation

IE System

Templates

Texts


Information extraction

Knowledge

General Framework

of

NLP/NLU

IE as

compromise NLP

Interpretation

IE System

Templates

Texts

Predefined


Information extraction

Rather clear

A bit vague

Rather clear

A bit vague

Very vague

Performance Evaluation

(1)Information Retrieval/Detection

(2)Passage Retrieval

(3)Information Extraction

(4) Question/Answering Tasks

(5)Text Understanding


Information extraction

Query

N: Correct Documents

M:Retrieved Documents

C: Correct Documents that are

actually retrieved

N

M

C

C

Precision:

Recall:

P

M

N

C

R

2P・R

F-Value:

P+R

Collection of Documents


Information extraction

Query

N: Correct Templates

M:Retrieved Templates

C: Correct Templates that are

actually retrieved

N

M

C

C

Precision:

Recall:

P

M

N

C

R

2P・R

F-Value:

P+R

Collection of Documents

More complicated due to partially

filled templates


Information extraction

General Framework of NLP


Information extraction

General Framework of NLP

John runs.

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


Information extraction

General Framework of NLP

John runs.

Morphological and

Lexical Processing

John run+s.

P-N V 3-pre

N plu

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


Information extraction

General Framework of NLP

John runs.

Morphological and

Lexical Processing

John run+s.

P-N V 3-pre

N plu

S

Syntactic Analysis

NP

VP

P-N

V

Semantic Analysis

John

run

Context processing

Interpretation


Information extraction

Pred: RUN

Agent:John

General Framework of NLP

John runs.

Morphological and

Lexical Processing

John run+s.

P-N V 3-pre

N plu

S

Syntactic Analysis

NP

VP

P-N

V

Semantic Analysis

John

run

Context processing

Interpretation


Information extraction

Pred: RUN

Agent:John

General Framework of NLP

John runs.

Morphological and

Lexical Processing

John run+s.

P-N V 3-pre

N plu

S

Syntactic Analysis

NP

VP

P-N

V

Semantic Analysis

John

run

Context processing

Interpretation

John is a student.

He runs.


Information extraction

General Framework of NLP

Tokenization

Morphological and

Lexical Processing

Part of Speech Tagging

Inflection/Derivation

Compounding

Syntactic Analysis

Term recognition

(Ananiadou)

Semantic Analysis

Context processing

Interpretation

Domain Analysis

Appelt:1999


Information extraction

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


Information extraction

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Incomplete Lexicons

Open class words

Terms

Term recognition

Named Entities

Company names

Locations

Numerical expressions

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


Information extraction

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Incomplete Grammar

Syntactic Coverage

Domain Specific

Constructions

Ungrammatical

Constructions

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


Information extraction

Predefined

Aspects of

Information

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Incomplete

Domain Knowledge

Interpretation Rules

Context processing

Interpretation


Information extraction

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

(2) Ambiguities:

Combinatorial

Explosion

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


Information extraction

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Most words in English

are ambiguous in terms

of their part of speeches.

runs: v/3pre, n/plu

clubs: v/3pre, n/plu

and two meanings

Morphological and

Lexical Processing

(2) Ambiguities:

Combinatorial

Explosion

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


Information extraction

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

(2) Ambiguities:

Combinatorial

Explosion

Syntactic Analysis

Structural Ambiguities

Predicate-argument

Ambiguities

Semantic Analysis

Context processing

Interpretation


Information extraction

Semantic Ambiguities(1)

John bought a car with Mary.

$3000 can buy a nice car.

Semantic Ambiguities(2)

Every man loves a woman.

Co-reference Ambiguities

Structural Ambiguities

(1)Attachment Ambiguities

John bought a carwith large seats.

John bought a car with $3000.

The manager of Yaxing Benz, a Sino-German joint venture

The manager of Yaxing Benz, Mr. John Smith

(2) Scope Ambiguities

young women and men in the room

(3)Analytical Ambiguities

Visiting relatives can be boring.


Information extraction

Combinatorial

Explosion

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

(2) Ambiguities:

Combinatorial

Explosion

Syntactic Analysis

Structural Ambiguities

Predicate-argument

Ambiguities

Semantic Analysis

Context processing

Interpretation


Information extraction

Note:

Ambiguities vs Robustness

More comprehensive knowledge: More Robust

big dictionaries

comprehensive grammar

More comprehensive knowledge: More ambiguities

Adaptability: Tuning, Learning


Information extraction

Framework of IE

IE as compromise NLP


Information extraction

Predefined

Aspects of

Information

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Incomplete

Domain Knowledge

Interpretation Rules

Context processing

Interpretation


Information extraction

Predefined

Aspects of

Information

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Incomplete

Domain Knowledge

Interpretation Rules

Context processing

Interpretation


Information extraction

Techniques in IE

(1) Domain Specific Partial Knowledge:

Knowledge relevant to information to be extracted

(2) Ambiguities:

Ignoring irrelevant ambiguities

Simpler NLP techniques

(3) Robustness:

Coping with Incomplete dictionaries

(open class words)

Ignoring irrelevant parts of sentences

(4) Adaptation Techniques:

Machine Learning, Trainable systems


Information extraction

95 %

FSA rules

Statistic taggers

Part of Speech Tagger

Local Context

Statistical Bias

F-Value

90

Domain

Dependent

General Framework of NLP

Open class words:

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Anaysis

Domain specific rules:

<Word><Word>, Inc.

Mr. <Cpt-L>. <Word>

Machine Learning:

HMM, Decision Trees

Rules + Machine Learning

Context processing

Interpretation


Information extraction

FASTUS

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Anaysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Information extraction

FASTUS

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Anaysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Information extraction

FASTUS

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Analysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Information extraction

Computationally more complex, Less Efficiency

Chomsky Hierarchy Hierarchy

of Grammar of Automata

Regular Grammar Finite State Automata

Context Free Grammar Push Down Automata

Context Sensitive Grammar Linear Bounded Automata

Type 0 Grammar Turing Machine


Information extraction

n

n

A B

Computationally more complex, Less Efficiency

Chomsky Hierarchy Hierarchy

of Grammar of Automata

Regular Grammar Finite State Automata

Context Free Grammar Push Down Automata

Context Sensitive Grammar Linear Bounded Automata

Type 0 Grammar Turing Machine


Information extraction

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN


Information extraction

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN


Information extraction

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN


Information extraction

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN


Information extraction

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

book with a nice cover

P

4

PN


Information extraction

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwith a nice cover

P

4

PN


Information extraction

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwith a nice cover

P

4

PN


Information extraction

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwitha nice cover

P

4

PN


Information extraction

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwithanice cover

P

4

PN


Information extraction

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwithanicecover

P

4

PN


Information extraction

Pattern-maching

PN ’s (ADJ)* N P Art (ADJ)* N

{PN ’s/ Art}(ADJ)* N(P Art (ADJ)* N)*

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwithanicecover

P

4

PN


Information extraction

FASTUS

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Analysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Information extraction

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Attachment

Ambiguities

are not made

explicit

Example of IE: FASTUS(1993)

1.Complex words

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location


Information extraction

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Example of IE: FASTUS(1993)

{{

}}

1.Complex words

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

a Japanese tea house

a [Japanese tea] house

a Japanese [tea house]


Information extraction

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Structural

Ambiguities of

NP are ignored

Example of IE: FASTUS(1993)

1.Complex words

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location


Information extraction

Bridgestone Sports Co. said Friday it had set upa joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubsa month.

Example of IE: FASTUS(1993)

3.Complex Phrases

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location


Information extraction

[COMPNY] said Friday it [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] and [COMPNY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE], [COMPNY], capitalized at 20 million

[CURRENCY-UNIT][START] production in [TIME]

with production of 20,000 [PRODUCT] a month.

Example of IE: FASTUS(1993)

3.Complex Phrases

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

Some syntactic structures

like …


Information extraction

[COMPNY] said Friday it [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE] capitalized at [CURRENCY][START] production in [TIME]

with production of [PRODUCT] a month.

Example of IE: FASTUS(1993)

3.Complex Phrases

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

Syntactic structures relevant

to information to be extracted

are dealt with.


Information extraction

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint

venture with Toyota.


Information extraction

S

NP

VP

GM

[SET-UP]

V

NP

signed

VP

N

agreement

V

setting up

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint

venture with Toyota.

GM plans to set up a joint venture with Toyota.

GM expects to set up a joint venture with Toyota.


Information extraction

[SET-UP]

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint

venture with Toyota.

S

NP

VP

GM

V

set up

GM plans to set up a joint venture with Toyota.

GM expects to set up a joint venture with Toyota.


Information extraction

[COMPNY] [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE] capitalized at [CURRENCY][START] production in [TIME]

with production of [PRODUCT] a month.

The attachment positions of PP are determined at this stage.

Irrelevant parts of sentences are ignored.

Example of IE: FASTUS(1993)

3.Complex Phrases

4.Domain Events

[COMPANY][SET-UP][JOINT-VENTURE]with[COMPNY]

[COMPANY][SET-UP][JOINT-VENTURE] (others)* with[COMPNY]


Information extraction

Complications caused by syntactic variations

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG] Relpro {NG/others}* [VG]


Information extraction

Complications caused by syntactic variations

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG]Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG]Relpro {NG/others}*[VG]


Information extraction

Patterns used

by Domain Event

Surface Pattern

Generator

Basic patterns

Relative clause construction

Passivization, etc.

Complications caused by syntactic variations

Relative clause

The mayor, who waskidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG] Relpro {NG/others}* [VG]


Information extraction

Piece-wise recognition

of basic templates

Reconstructing information

carried via syntactic structures

by merging basic templates

FASTUS

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Information extraction

FASTUS

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Piece-wise recognition

of basic templates

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

Reconstructing information

carried via syntactic structures

by merging basic templates


Information extraction

FASTUS

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Piece-wise recognition

of basic templates

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

Reconstructing information

carried via syntactic structures

by merging basic templates


Information extraction

Current state of the arts of IE

  • Carefully constructed IE systems

  • F-60 level (interannotater agreement: 60-80%)

  • Domain: telegraphic messages about naval operation

  • (MUC-1:87, MUC-2:89)

  • news articles and transcriptions of radio broadcasts

  • Latin American terrorism (MUC-3:91, MUC-4:1992)

  • News articles about joint ventures (MUC-5, 93)

  • News articles about management changes (MUC-6, 95)

  • News articles about space vehicle (MUC-7, 97)

  • Handcrafted rules (named entity recognition, domain events, etc)

Automatic learning from texts:

Supervised learning : corpus preparation

Non-supervised, or controlled learning


Information extraction

IE in Biology


Csndb national institute of health sciences

CSNDB(National Institute of Health Sciences)

  • A data- and knowledge- base for signaling pathways of human cells.

    • It compiles the information on biological molecules, sequences, structures, functions, and biological reactions which transfer the cellular signals.

    • Signaling pathways are compiled as binary relationships of biomolecules and represented by graphs drawn automatically.

    • CSNDB is constructed on ACEDB and inference engine CLIPS, and has a linkage to TRANSFAC.

    • Final goal is to make a computerized model for various biological phenomena.


Example 1

Example. 1

Excerpted @[Takai98]

  • Signal_Reaction:

    “EGF receptor  Grb2”

  • From_molecule “EGF receptor”

  • To_molecule “Grb2”

  • Tissue “liver”

  • Effect “activation”

  • Interaction

    “SH2+phosphorylated Tyr”

  • Reference [Yamauchi_1997]

  • A Standard Reaction


Example 3

Example. 3

Excerpted @[Takai98]

  • Signal_Reaction:

    “Ah receptor + HSP90 ”

  • Component “Ah receptor” “HSP90”

  • Effect “activation dissociation”

  • Interaction

    “PAS domain”

    “of Ah receptor”

  • Activity

    “inactivation of Ah receptor”

  • Reference [Powell-Coffman_1998]

  • A Polymerization Reaction


Information extraction

FASTUS

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

2.Basic Phrases:

Simple noun groups, verb groups and particles

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Information extraction

FASTUS

Is separation of

stages possible ?

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

2.Basic Phrases:

Simple noun groups, verb groups and particles

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Information extraction

FASTUS

Is separation of

stages possible ?

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Open word classes:

techical terms

very long

specific formation rules

many semantic classes

acronyms

variants

fairly ambiguous

[[Term recognition]]

Coordination across word

formation

A or B and C D

2.Basic Phrases:

Simple noun groups, verb groups and particles

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Information extraction

FASTUS

Is separation of

stages possible ?

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

2.Basic Phrases:

Simple noun groups, verb groups and particles

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Information extraction

Syntax/Semantics

An active phorbol ester must therefore, presumably

by activation of protein kinase C, cause dissociation

of a cytoplasmic complex of NF-kappa B and I kappa B

by modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.


Information extraction

Syntax/Semantics

An active phorbol ester must therefore, presumably

by activation of protein kinase C, cause dissociation

of a cytoplasmic complex of NF-kappa B and I kappa B

by modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.


Information extraction

Part-Whole

Syntax/Semantics

An active phorbol ester must therefore, presumably

by activation of protein kinase C, cause dissociation

of a cytoplasmic complex of NF-kappa B and I kappa B

by modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.

E3: It dissociates a cytoplasmic complex of NF-kappa B

and I kappa B.


Information extraction

Syntax/Semantics

An active phorbol ester must therefore, presumably

by activation of protein kinase C, cause dissociation

of a cytoplasmic complex of NF-kappa B and I kappa B

by modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.

E3: It dissociates a cytoplasmic complex of NF-kappa B

and I kappa B.

Part-Whole


Information extraction

Full parser based on good grammar formalisms

  • Several attempts of using full parsers :

  • To improve the Precision

  • Systematic treatment of interaction of the

  • different phases :

  • Unification-based grammar formalisms

  • The two papers in the NLP session of PSB 2001


Information extraction

180 sentences from abstracts in MEDLINE

Theaverage parse time per sentence: 2.7 sec by a naïve parser

(This can be improved by the multi-stage parser by 50 times)

Experiment

(A.Yakushiji et.al, PSB2001)

XHPSG: HPSG-like Grammar translated from

XTAG of U-Penn (Y.Tateishi, TAG+ workshop 98)

Terms (Compound nouns) are chunked beforehand.

Automatic conversion: Detailed, empirical comparison of

grammars of different formalisms (+LFG)


Information extraction

68%

Argument Frame Extractor

133 argument structures, marked by a domain specialist

in 97 sentences among the 180 sentences

Extracted Uniquely

31

Extracted with ambiguity

32

Extractable from pp’s

26

Parsing

Failures

Not extractable

27

Memory limitation,etc

17


Information extraction

Open class words:

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Ontology: Knowledge of the Domain

More refined semantic

classes with part-whole

relationships, properties,

Etc.

Acronyms, variants,

Etc.


Information extraction

Open class words:

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Ontology: Knowledge of the Domain

More refined semantic

classes with part-whole

relationships, properties,

Etc.

Acronyms, variants,

Etc.


Information extraction

T

B

B

Bio Term Bank

Biological ontology committee Japan organized by T. Takagi and T. Takai, U.Tokyo in Genome Projects of MESSC (2000.4~2005.3)

  • A database for all sort of biological terms collected from genome databases and biological texts.

  • It will contain 2 million terms in 2001 and 5 million terms until 2005.

  • Terms are classified by biochemical and terminological attributes, grounded on their resources.


Information extraction

Open class words:

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Ontology: Knowledge of the Domain

More refined semantic

classes with part-whole

relationships, properties,

Etc.

Acronyms, variants,

Etc.


Genia ontology current version

GENIA ontology (current version)

+-name-+-source-+-natural-+-organism-+-multi-cell organism

| | | +-mono-cell organism

| | | +-virus

| | +-tissue

| | +-cell type

| | +-sub-location of cells

| +-artificial-+-cell line

|

+-substance-+-compound-+-organic-+-amino-+-protein-+-protein family or group

| |+-protein complex

| |+-individual protein molecule

| | +-subunit of protein complex

| | +-substructure of protein

| | +-domain or region of protein

| +-peptide

| +-amino acid monomer

|

+-nucleic-+-DNA-+-DNA family or group

| +-individual DNA molecule

| +-domain or region of DNA

|

+-RNA-+-RNA family or group

+-individual RNA molecule

+-domain or region of RNA


Expansion of genia ontology

Expansion of GENIA Ontology

  • Try to tag all NPs in some MEDLINE abstracts and find the classes that appears in abstracts but not in current ontology

  • Find frequent verbs and what class of arguments they take


Expansion of genia ontology1

Expansion of GENIA Ontology

  • Chemical class of substance and their substrucutres

  • Sources

  • Biological role, or function, of substances

  • Reaction

    • Biological reaction

    • Pathway

    • Disease

  • Structure themselves

  • Experiment , experimental results, and researchers

  • Measure


Example of entities in expanded

Example of Entities in Expanded

  • Biological role, or function, of substances

    • receptor, inhibitor, …

  • Biological reaction

    • activation, binding, inhibition, apoptosis, G2 arrest

    • pathway, signal

    • immune dysfunction, Ataxia telangiectasia (AT)

  • Structure themselves

    • alpha-helix,

  • Experiment, experimental results, researchers

    • our results, these studies, we


Verbs related to biological events frequent verbs in 100 medline abstracts

Verbs Related to Biological EventsFrequent Verbs in 100 MEDLINE Abstracts


Verbs related to biological events verbs that take biological entities as arguments

Verbs Related to Biological EventsVerbs that take biological entities as arguments

  • induce

    • noun BE INDUCED BY nounactivation of these PROTEINwas induced byPROTEIN

    • noun INDUCE nounPROTEINinducedthe tyrosine phosphorylation

  • bind

    • noun BIND TO nounthe drugsbind totwo different PROTEIN

    • noun BIND nounmotifs previously found tobindthe cellular factors

    • noun BINDING noun theTATA-box binding protein

    • the BINDING of noun the binding of PROTEIN

semantic class: substancestructuresourceexperimentfact reaction


Verbs related to biological events verbs that take description entities

Verbs Related to Biological EventsVerbs that take description entities

  • report

    • noun REPORT that-clausewereport here that PROTEIN is activated by PROTEIN

    • noun REPORT noun we report the characterization of PROTEIN

    • noun REPORT noun wereporta novel structureof PROTEIN

semantic class: substancestructure sourceexperimentfact reaction


Verbs related to biological events verbs whose arguments depend on syntactic patterns

Verbs Related to Biological EventsVerbs whose arguments depend on syntactic patterns

  • show

    • noun BE SHOWN to-infinitive PROTEIN has been shown totrigger cellular PROTEIN activity

    • noun SHOW that-clause the datashow thatPROTEIN stimulation is also not sufficient

    • noun SHOW noun SOURCEshoweda dose-dependent inhibition of PROTEIN activity

semantic class: substancesourceexperimentfact


Verbs related to biological events verbs that take both entities

Verbs Related to Biological EventsVerbs that take both entities

  • indicate

    • noun INDICATE that-clause the dataindicate that PROTEIN is required in CELL prolifiration

    • noun INDICATE noun these findingsindicatean unexpected role of DNA

    • noun INDICATE that-clause the structureindicatesthat it represents a unique class of PROTEIN

    • noun INDICATE noun the structureindicatesmechanismsforallosteric effector action

semantic class: substancestructure sourceexperimentfact reaction role


Example of ne annotation

Example of NE Annotation

UI - 85146267

TI - Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" mt="SV" subclass="family_or_group" unsure="Class" cmt="">aldosterone binding sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.

AB - <NE ti="4" class="protein" nm="Aldosterone binding sites" mt="SV" subclass="family_or_group" unsure="Class" cmt="">Aldosterone binding sites</NE ti="4"> in <NE ti="1" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="1"> were characterized after separation of cells from blood by a Percoll gradient. After washing and resuspension in <NE ti="5" class="other_organic_compounds" nm="RPMI-1640 medium" mt="SV" unsure="OK" cmt="">RPMI-1640 medium</NE ti="5">, cells were incubated at 37 degrees C for 1 h with different concentrations of <NE ti="6" class="other_organic_compounds" nm="[3H]aldosterone" mt="SV" unsure="OK" cmt="">[3H]aldosterone</NE ti="6"> plus a 100-fold concentration of <NE ti="7" class="other_organic_compounds" nm="RU-26988" mt="SV" unsure="OK" cmt="">RU-26988 </NE ti="7">(<NE ti=“17" class="other_organic_compounds" nm="11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one" mt="SV" unsure="OK" cmt="">11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one</NE ti=“17">), with or without an excess of unlabeled <NE ti="8" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="8">. <NE ti="9" class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK" cmt="">Aldosterone</NE ti="9"> binds to a single class of <NE ti="10" class="protein" nm="receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">receptors</NE ti="10"> with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity data show a hierarchy of affinity of <NE ti="11" class="other_organic_compounds" nm="desoxycorticosterone" mt="SV" unsure="OK" cmt="">desoxycorticosterone</NE ti="11"> = <NE ti="12" class="other_organic_compounds" nm="corticosterone" mt="SV" unsure="OK" cmt="">corticosterone</NE ti="12"> = <NE ti="13" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="13"> greater than <NE ti="14" class="other_organic_compounds" nm="hydrocortisone" mt="SV" unsure="OK" cmt="">hydrocortisone</NE ti="14"> greater than <NE ti="15" class="other_organic_compounds" nm="dexamethasone" mt="SV" unsure="OK" cmt="">dexamethasone</NE ti="15">. The results indicate that <NE ti="17" class="cell_type" nm="mononuclear leukocyte" mt="SV" unsure="OK" cmt="">mononuclear leukocytes</NE ti="17"> could be useful for studying the physiological significance of these <NE ti="16" class="protein" nm="mineralocorticoid receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">mineralocorticoid receptors</NE ti="16"> and their regulation in humans.


Information extraction

Available from our website:

Definition of ontological classes

Manual of GMPL: extention of XML to annonate texts

Manual of Text Annotation

Soon: Annotated texts (1000 abstracts) by the end of March


Information extraction

  • IE can contribute to Bio-informatics significantly.

2. However, the domains in Bio-chemistry seem more

structurally rich than the domains we have dealt with so

far.

Term formation, rich ontologies, complex syntactic

structures.

3. It requires substantial efforts in resource building.

4. However, those resources can contribute to other

applications :

Knowledge sharing, Intelligent IR, Knowledge discovery

One of the crucial techniques is ATR ….


Overview of genia system1

Retrieval Module

Corpus

  • Request enhancement

  • Spawn request

  • Classify documents

Information Extraction

Module

  • Identify & classify terms

  • Identify events

Interface Module

Database

  • GUI

  • HTML conversion

  • System integration

Ontology

Markup

language

Data model

Raw(OCR)

Text

Structure

Annotated

Background Knowledge

Document

Named-Entity

Event

Overview of GENIA System

MEDLINE

Corpus Module

  • Markup generation / compilation

  • Annotated corpus construction

  • User

  • IR Request

  • Abstract

  • Full Paper

Security

Database Module

Concept Module

  • DB design / access / management

  • DB construction

  • BK design / construction / compilation


  • Login