information extraction from scientific texts
Download
Skip this Video
Download Presentation
Information Extraction from Scientific Texts

Loading in 2 Seconds...

play fullscreen
1 / 114

Information Extraction from Scientific Texts - PowerPoint PPT Presentation


  • 417 Views
  • Uploaded on

Information Extraction from Scientific Texts. Junichi Tsujii Graduate School of Science University of Tokyo Japan. Texts are one of the major sources of information and knowledge. However, they are not transparent. They have to be systematically integrated with

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Information Extraction from Scientific Texts' - andrew


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
information extraction from scientific texts

Information ExtractionfromScientific Texts

Junichi Tsujii

Graduate School of Science

University of Tokyo

Japan

slide2
Texts are one of the major sources of information and knowledge.

However, they are not transparent.

They have to be systematically integrated with

the other sources like data bases, numerical data,

etc.

Natural Language Processing--IE

overview of genia system
Retrieval Module

Corpus

  • Request enhancement
  • Spawn request
  • Classify documents

Information Extraction

Module

  • Identify & classify terms
  • Identify events

Interface Module

Database

  • GUI
  • HTML conversion
  • System integration

Ontology

Markup

language

Data model

Raw(OCR)

Text

Structure

Annotated

Background Knowledge

Document

Named-Entity

Event

Overview of GENIA System

MEDLINE

Corpus Module

  • Markup generation / compilation
  • Annotated corpus construction
  • User
  • IR Request
  • Abstract
  • Full Paper

Security

Database Module

Concept Module

  • DB design / access / management
  • DB construction
  • BK design / construction / compilation
slide4
Plan
  • What is IE ?
  • General Framework of NLP
  • Basic IE techniques
  • IE in Biology

Automatic Term Recognition

(S. Ananiadou)

slide6
Application Tasks of NLP

(1)Information Retrieval/Detection

To search and retrieve documents in response to queries

for information

(2)Passage Retrieval

To search and retrieve part of documents in response

to queries for information

(3)Information Extraction

To extract information that fits pre-defined database schemas

or templates, specifying the output formats

(4) Question/Answering Tasks

To answer general questions by using texts as knowledge

base: Fact retrieval, combination of IR and IE

(5)Text Understanding

To understand texts as people do: Artificial Intelligence

slide7
Ranges of Queries

(1)Information Retrieval/Detection

(2)Passage Retrieval

Pre-Defined: Fixed aspects

of information carried in texts

(3)Information Extraction

(4) Question/Answering Tasks

(5)Text Understanding

slide8
Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)

slide9
Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)

slide10
Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)

slide11
Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)

slide12
Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)

slide13
production of

20, 000 iron and

metal wood clubs

[company]

[set up]

[Joint-Venture]

with

[company]

FASTUS

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

set up

new Twaiwan dallors

2.Basic Phrases:

Simple noun groups, verb groups and particles

a Japanese trading house

had set up

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide14
Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)

slide15
Information Extraction

……….

Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the

second floor of his Nanjing home early on Sunday.

The deputy general manager of Yaxing Benz, a Sino-German

joint venture that makes buses and bus chassis in nearby Yangzhou,

was hacked to death with 45 cm watermelon knives.

……….

Name of the Venture: Yaxing Benz

Products: buses and bus chassis

Location: Yangzhou,China

Companies involved: (1)Name: X?

Country: German

(2)Name: Y?

Country: China

slide16
Information Extraction

A German vehicle-firm executive was stabbed to death ….

……….

Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the

second floor of his Nanjing home early on Sunday.

The deputy general manager of Yaxing Benz, a Sino-German

joint venture that makes buses and bus chassis in nearby Yangzhou,

was hacked to death with 45 cm watermelon knives.

……….

Crime-Type: Murder

Type: Stabbing

The killed: Name: Jurgen Pfrang

Age: 51

Profession: Deputy general manager

Location: Nanjing, China

Different template

for crimes

slide17
User

User

System

System

System

Interpretation of Texts

(1)Information Retrieval/Detection

(2)Passage Retrieval

(3)Information Extraction

(4) Question/Answering Tasks

(5)Text Understanding

slide18
Characterization of Texts

Queries

IR System

Collection of Texts

slide19
Knowledge

Interpretation

Characterization of Texts

Queries

IR System

Collection of Texts

slide20
Knowledge

Interpretation

Characterization of Texts

Queries

Passage

IR System

Collection of Texts

slide21
Knowledge

Interpretation

Characterization of Texts

Queries

Structures

of

Sentences

NLP

Templates

Passage

IR System

IE System

Collection of Texts

Texts

slide22
Knowledge

Interpretation

IE System

Templates

Texts

slide23
Knowledge

General Framework

of

NLP/NLU

IE as

compromise NLP

Interpretation

IE System

Templates

Texts

Predefined

slide24
Rather clear

A bit vague

Rather clear

A bit vague

Very vague

Performance Evaluation

(1)Information Retrieval/Detection

(2)Passage Retrieval

(3)Information Extraction

(4) Question/Answering Tasks

(5)Text Understanding

slide25
Query

N: Correct Documents

M:Retrieved Documents

C: Correct Documents that are

actually retrieved

N

M

C

C

Precision:

Recall:

P

M

N

C

R

2P・R

F-Value:

P+R

Collection of Documents

slide26
Query

N: Correct Templates

M:Retrieved Templates

C: Correct Templates that are

actually retrieved

N

M

C

C

Precision:

Recall:

P

M

N

C

R

2P・R

F-Value:

P+R

Collection of Documents

More complicated due to partially

filled templates

slide28
General Framework of NLP

John runs.

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation

slide29
General Framework of NLP

John runs.

Morphological and

Lexical Processing

John run+s.

P-N V 3-pre

N plu

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation

slide30
General Framework of NLP

John runs.

Morphological and

Lexical Processing

John run+s.

P-N V 3-pre

N plu

S

Syntactic Analysis

NP

VP

P-N

V

Semantic Analysis

John

run

Context processing

Interpretation

slide31
Pred: RUN

Agent:John

General Framework of NLP

John runs.

Morphological and

Lexical Processing

John run+s.

P-N V 3-pre

N plu

S

Syntactic Analysis

NP

VP

P-N

V

Semantic Analysis

John

run

Context processing

Interpretation

slide32
Pred: RUN

Agent:John

General Framework of NLP

John runs.

Morphological and

Lexical Processing

John run+s.

P-N V 3-pre

N plu

S

Syntactic Analysis

NP

VP

P-N

V

Semantic Analysis

John

run

Context processing

Interpretation

John is a student.

He runs.

slide33
General Framework of NLP

Tokenization

Morphological and

Lexical Processing

Part of Speech Tagging

Inflection/Derivation

Compounding

Syntactic Analysis

Term recognition

(Ananiadou)

Semantic Analysis

Context processing

Interpretation

Domain Analysis

Appelt:1999

slide34
Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation

slide35
Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Incomplete Lexicons

Open class words

Terms

Term recognition

Named Entities

Company names

Locations

Numerical expressions

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation

slide36
Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Incomplete Grammar

Syntactic Coverage

Domain Specific

Constructions

Ungrammatical

Constructions

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation

slide37
Predefined

Aspects of

Information

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Incomplete

Domain Knowledge

Interpretation Rules

Context processing

Interpretation

slide38
Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

(2) Ambiguities:

Combinatorial

Explosion

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation

slide39
Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Most words in English

are ambiguous in terms

of their part of speeches.

runs: v/3pre, n/plu

clubs: v/3pre, n/plu

and two meanings

Morphological and

Lexical Processing

(2) Ambiguities:

Combinatorial

Explosion

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation

slide40
Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

(2) Ambiguities:

Combinatorial

Explosion

Syntactic Analysis

Structural Ambiguities

Predicate-argument

Ambiguities

Semantic Analysis

Context processing

Interpretation

slide41
Semantic Ambiguities(1)

John bought a car with Mary.

$3000 can buy a nice car.

Semantic Ambiguities(2)

Every man loves a woman.

Co-reference Ambiguities

Structural Ambiguities

(1)Attachment Ambiguities

John bought a carwith large seats.

John bought a car with $3000.

The manager of Yaxing Benz, a Sino-German joint venture

The manager of Yaxing Benz, Mr. John Smith

(2) Scope Ambiguities

young women and men in the room

(3)Analytical Ambiguities

Visiting relatives can be boring.

slide42
Combinatorial

Explosion

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

(2) Ambiguities:

Combinatorial

Explosion

Syntactic Analysis

Structural Ambiguities

Predicate-argument

Ambiguities

Semantic Analysis

Context processing

Interpretation

slide43
Note:

Ambiguities vs Robustness

More comprehensive knowledge: More Robust

big dictionaries

comprehensive grammar

More comprehensive knowledge: More ambiguities

Adaptability: Tuning, Learning

slide44
Framework of IE

IE as compromise NLP

slide45
Predefined

Aspects of

Information

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Incomplete

Domain Knowledge

Interpretation Rules

Context processing

Interpretation

slide46
Predefined

Aspects of

Information

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Incomplete

Domain Knowledge

Interpretation Rules

Context processing

Interpretation

slide47
Techniques in IE

(1) Domain Specific Partial Knowledge:

Knowledge relevant to information to be extracted

(2) Ambiguities:

Ignoring irrelevant ambiguities

Simpler NLP techniques

(3) Robustness:

Coping with Incomplete dictionaries

(open class words)

Ignoring irrelevant parts of sentences

(4) Adaptation Techniques:

Machine Learning, Trainable systems

slide48
95 %

FSA rules

Statistic taggers

Part of Speech Tagger

Local Context

Statistical Bias

F-Value

90

Domain

Dependent

General Framework of NLP

Open class words:

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Anaysis

Domain specific rules:

, Inc.

Mr. .

Machine Learning:

HMM, Decision Trees

Rules + Machine Learning

Context processing

Interpretation

slide49
FASTUS

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Anaysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide50
FASTUS

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Anaysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide51
FASTUS

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Analysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide52
Computationally more complex, Less Efficiency

Chomsky Hierarchy Hierarchy

of Grammar of Automata

Regular Grammar Finite State Automata

Context Free Grammar Push Down Automata

Context Sensitive Grammar Linear Bounded Automata

Type 0 Grammar Turing Machine

slide53
n

n

A B

Computationally more complex, Less Efficiency

Chomsky Hierarchy Hierarchy

of Grammar of Automata

Regular Grammar Finite State Automata

Context Free Grammar Push Down Automata

Context Sensitive Grammar Linear Bounded Automata

Type 0 Grammar Turing Machine

slide54
1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN

slide55
1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN

slide56
1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN

slide57
1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN

slide58
1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

book with a nice cover

P

4

PN

slide59
1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwith a nice cover

P

4

PN

slide60
1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwith a nice cover

P

4

PN

slide61
1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwitha nice cover

P

4

PN

slide62
1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwithanice cover

P

4

PN

slide63
1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwithanicecover

P

4

PN

slide64
Pattern-maching

PN ’s (ADJ)* N P Art (ADJ)* N

{PN ’s/ Art}(ADJ)* N(P Art (ADJ)* N)*

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwithanicecover

P

4

PN

slide65
FASTUS

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Analysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide66
Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Attachment

Ambiguities

are not made

explicit

Example of IE: FASTUS(1993)

1.Complex words

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

slide67
Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Example of IE: FASTUS(1993)

{{

}}

1.Complex words

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

a Japanese tea house

a [Japanese tea] house

a Japanese [tea house]

slide68
Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Structural

Ambiguities of

NP are ignored

Example of IE: FASTUS(1993)

1.Complex words

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

slide69
Bridgestone Sports Co. said Friday it had set upa joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubsa month.

Example of IE: FASTUS(1993)

3.Complex Phrases

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

slide70
[COMPNY] said Friday it [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] and [COMPNY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE], [COMPNY], capitalized at 20 million

[CURRENCY-UNIT][START] production in [TIME]

with production of 20,000 [PRODUCT] a month.

Example of IE: FASTUS(1993)

3.Complex Phrases

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

Some syntactic structures

like …

slide71
[COMPNY] said Friday it [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE] capitalized at [CURRENCY][START] production in [TIME]

with production of [PRODUCT] a month.

Example of IE: FASTUS(1993)

3.Complex Phrases

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

Syntactic structures relevant

to information to be extracted

are dealt with.

slide72
Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint

venture with Toyota.

slide73
S

NP

VP

GM

[SET-UP]

V

NP

signed

VP

N

agreement

V

setting up

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint

venture with Toyota.

GM plans to set up a joint venture with Toyota.

GM expects to set up a joint venture with Toyota.

slide74
[SET-UP]

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint

venture with Toyota.

S

NP

VP

GM

V

set up

GM plans to set up a joint venture with Toyota.

GM expects to set up a joint venture with Toyota.

slide75
[COMPNY] [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE] capitalized at [CURRENCY][START] production in [TIME]

with production of [PRODUCT] a month.

The attachment positions of PP are determined at this stage.

Irrelevant parts of sentences are ignored.

Example of IE: FASTUS(1993)

3.Complex Phrases

4.Domain Events

[COMPANY][SET-UP][JOINT-VENTURE]with[COMPNY]

[COMPANY][SET-UP][JOINT-VENTURE] (others)* with[COMPNY]

slide76
Complications caused by syntactic variations

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG] Relpro {NG/others}* [VG]

slide77
Complications caused by syntactic variations

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG]Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG]Relpro {NG/others}*[VG]

slide78
Patterns used

by Domain Event

Surface Pattern

Generator

Basic patterns

Relative clause construction

Passivization, etc.

Complications caused by syntactic variations

Relative clause

The mayor, who waskidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG] Relpro {NG/others}* [VG]

slide79
Piece-wise recognition

of basic templates

Reconstructing information

carried via syntactic structures

by merging basic templates

FASTUS

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide80
FASTUS

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Piece-wise recognition

of basic templates

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

Reconstructing information

carried via syntactic structures

by merging basic templates

slide81
FASTUS

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Piece-wise recognition

of basic templates

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

Reconstructing information

carried via syntactic structures

by merging basic templates

slide82
Current state of the arts of IE
  • Carefully constructed IE systems
  • F-60 level (interannotater agreement: 60-80%)
  • Domain: telegraphic messages about naval operation
  • (MUC-1:87, MUC-2:89)
  • news articles and transcriptions of radio broadcasts
  • Latin American terrorism (MUC-3:91, MUC-4:1992)
  • News articles about joint ventures (MUC-5, 93)
  • News articles about management changes (MUC-6, 95)
  • News articles about space vehicle (MUC-7, 97)
  • Handcrafted rules (named entity recognition, domain events, etc)

Automatic learning from texts:

Supervised learning : corpus preparation

Non-supervised, or controlled learning

csndb national institute of health sciences
CSNDB(National Institute of Health Sciences)
  • A data- and knowledge- base for signaling pathways of human cells.
    • It compiles the information on biological molecules, sequences, structures, functions, and biological reactions which transfer the cellular signals.
    • Signaling pathways are compiled as binary relationships of biomolecules and represented by graphs drawn automatically.
    • CSNDB is constructed on ACEDB and inference engine CLIPS, and has a linkage to TRANSFAC.
    • Final goal is to make a computerized model for various biological phenomena.
example 1
Example. 1

Excerpted @[Takai98]

  • Signal_Reaction:

“EGF receptor  Grb2”

  • From_molecule “EGF receptor”
  • To_molecule “Grb2”
  • Tissue “liver”
  • Effect “activation”
  • Interaction

“SH2+phosphorylated Tyr”

  • Reference [Yamauchi_1997]
  • A Standard Reaction
example 3
Example. 3

Excerpted @[Takai98]

  • Signal_Reaction:

“Ah receptor + HSP90 ”

  • Component “Ah receptor” “HSP90”
  • Effect “activation dissociation”
  • Interaction

“PAS domain”

“of Ah receptor”

  • Activity

“inactivation of Ah receptor”

  • Reference [Powell-Coffman_1998]
  • A Polymerization Reaction
slide87
FASTUS

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

2.Basic Phrases:

Simple noun groups, verb groups and particles

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide88
FASTUS

Is separation of

stages possible ?

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

2.Basic Phrases:

Simple noun groups, verb groups and particles

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide89
FASTUS

Is separation of

stages possible ?

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Open word classes:

techical terms

very long

specific formation rules

many semantic classes

acronyms

variants

fairly ambiguous

[[Term recognition]]

Coordination across word

formation

A or B and C D

2.Basic Phrases:

Simple noun groups, verb groups and particles

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide90
FASTUS

Is separation of

stages possible ?

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

2.Basic Phrases:

Simple noun groups, verb groups and particles

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide91
Syntax/Semantics

An active phorbol ester must therefore, presumably

by activation of protein kinase C, cause dissociation

of a cytoplasmic complex of NF-kappa B and I kappa B

by modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

slide92
Syntax/Semantics

An active phorbol ester must therefore, presumably

by activation of protein kinase C, cause dissociation

of a cytoplasmic complex of NF-kappa B and I kappa B

by modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.

slide93
Part-Whole

Syntax/Semantics

An active phorbol ester must therefore, presumably

by activation of protein kinase C, cause dissociation

of a cytoplasmic complex of NF-kappa B and I kappa B

by modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.

E3: It dissociates a cytoplasmic complex of NF-kappa B

and I kappa B.

slide94
Syntax/Semantics

An active phorbol ester must therefore, presumably

by activation of protein kinase C, cause dissociation

of a cytoplasmic complex of NF-kappa B and I kappa B

by modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.

E3: It dissociates a cytoplasmic complex of NF-kappa B

and I kappa B.

Part-Whole

slide95
Full parser based on good grammar formalisms
  • Several attempts of using full parsers :
  • To improve the Precision
  • Systematic treatment of interaction of the
  • different phases :
  • Unification-based grammar formalisms
  • The two papers in the NLP session of PSB 2001
slide96
180 sentences from abstracts in MEDLINE

Theaverage parse time per sentence: 2.7 sec by a naïve parser

(This can be improved by the multi-stage parser by 50 times)

Experiment

(A.Yakushiji et.al, PSB2001)

XHPSG: HPSG-like Grammar translated from

XTAG of U-Penn (Y.Tateishi, TAG+ workshop 98)

Terms (Compound nouns) are chunked beforehand.

Automatic conversion: Detailed, empirical comparison of

grammars of different formalisms (+LFG)

slide97
68%

Argument Frame Extractor

133 argument structures, marked by a domain specialist

in 97 sentences among the 180 sentences

Extracted Uniquely

31

Extracted with ambiguity

32

Extractable from pp’s

26

Parsing

Failures

Not extractable

27

Memory limitation,etc

17

slide98
Open class words:

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Ontology: Knowledge of the Domain

More refined semantic

classes with part-whole

relationships, properties,

Etc.

Acronyms, variants,

Etc.

slide99
Open class words:

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Ontology: Knowledge of the Domain

More refined semantic

classes with part-whole

relationships, properties,

Etc.

Acronyms, variants,

Etc.

slide100
T

B

B

Bio Term Bank

Biological ontology committee Japan organized by T. Takagi and T. Takai, U.Tokyo in Genome Projects of MESSC (2000.4~2005.3)

  • A database for all sort of biological terms collected from genome databases and biological texts.
  • It will contain 2 million terms in 2001 and 5 million terms until 2005.
  • Terms are classified by biochemical and terminological attributes, grounded on their resources.
slide101
Open class words:

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Ontology: Knowledge of the Domain

More refined semantic

classes with part-whole

relationships, properties,

Etc.

Acronyms, variants,

Etc.

genia ontology current version
GENIA ontology (current version)

+-name-+-source-+-natural-+-organism-+-multi-cell organism

| | | +-mono-cell organism

| | | +-virus

| | +-tissue

| | +-cell type

| | +-sub-location of cells

| +-artificial-+-cell line

|

+-substance-+-compound-+-organic-+-amino-+-protein-+-protein family or group

| |+-protein complex

| |+-individual protein molecule

| | +-subunit of protein complex

| | +-substructure of protein

| | +-domain or region of protein

| +-peptide

| +-amino acid monomer

|

+-nucleic-+-DNA-+-DNA family or group

| +-individual DNA molecule

| +-domain or region of DNA

|

+-RNA-+-RNA family or group

+-individual RNA molecule

+-domain or region of RNA

expansion of genia ontology
Expansion of GENIA Ontology
  • Try to tag all NPs in some MEDLINE abstracts and find the classes that appears in abstracts but not in current ontology
  • Find frequent verbs and what class of arguments they take
expansion of genia ontology1
Expansion of GENIA Ontology
  • Chemical class of substance and their substrucutres
  • Sources
  • Biological role, or function, of substances
  • Reaction
    • Biological reaction
    • Pathway
    • Disease
  • Structure themselves
  • Experiment , experimental results, and researchers
  • Measure
example of entities in expanded
Example of Entities in Expanded
  • Biological role, or function, of substances
    • receptor, inhibitor, …
  • Biological reaction
    • activation, binding, inhibition, apoptosis, G2 arrest
    • pathway, signal
    • immune dysfunction, Ataxia telangiectasia (AT)
  • Structure themselves
    • alpha-helix,
  • Experiment, experimental results, researchers
    • our results, these studies, we
verbs related to biological events verbs that take biological entities as arguments
Verbs Related to Biological EventsVerbs that take biological entities as arguments
  • induce
    • noun BE INDUCED BY nounactivation of these PROTEINwas induced byPROTEIN
    • noun INDUCE nounPROTEINinducedthe tyrosine phosphorylation
  • bind
    • noun BIND TO nounthe drugsbind totwo different PROTEIN
    • noun BIND nounmotifs previously found tobindthe cellular factors
    • noun BINDING noun theTATA-box binding protein
    • the BINDING of noun the binding of PROTEIN

semantic class: substancestructuresourceexperimentfact reaction

verbs related to biological events verbs that take description entities
Verbs Related to Biological EventsVerbs that take description entities
  • report
    • noun REPORT that-clause wereport here that PROTEIN is activated by PROTEIN
    • noun REPORT noun we report the characterization of PROTEIN
    • noun REPORT noun wereporta novel structureof PROTEIN

semantic class: substancestructure sourceexperimentfact reaction

verbs related to biological events verbs whose arguments depend on syntactic patterns
Verbs Related to Biological EventsVerbs whose arguments depend on syntactic patterns
  • show
    • noun BE SHOWN to-infinitive PROTEIN has been shown totrigger cellular PROTEIN activity
    • noun SHOW that-clause the datashow thatPROTEIN stimulation is also not sufficient
    • noun SHOW noun SOURCEshoweda dose-dependent inhibition of PROTEIN activity

semantic class: substancesourceexperimentfact

verbs related to biological events verbs that take both entities
Verbs Related to Biological EventsVerbs that take both entities
  • indicate
    • noun INDICATE that-clause the dataindicate that PROTEIN is required in CELL prolifiration
    • noun INDICATE noun these findingsindicatean unexpected role of DNA
    • noun INDICATE that-clause the structureindicatesthat it represents a unique class of PROTEIN
    • noun INDICATE noun the structureindicatesmechanismsforallosteric effector action

semantic class: substancestructure sourceexperimentfact reaction role

example of ne annotation
Example of NE Annotation

UI - 85146267

TI - Characterization of aldosterone binding sites in circulating human mononuclear leukocytes.

AB - Aldosterone binding sites in human mononuclear leukocytes were characterized after separation of cells from blood by a Percoll gradient. After washing and resuspension in RPMI-1640 medium, cells were incubated at 37 degrees C for 1 h with different concentrations of [3H]aldosterone plus a 100-fold concentration of RU-26988 (11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one), with or without an excess of unlabeled aldosterone. Aldosterone binds to a single class of receptors with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity data show a hierarchy of affinity of desoxycorticosterone = corticosterone = aldosterone greater than hydrocortisone greater than dexamethasone. The results indicate that mononuclear leukocytes could be useful for studying the physiological significance of these mineralocorticoid receptors and their regulation in humans.

slide112
Available from our website:

Definition of ontological classes

Manual of GMPL: extention of XML to annonate texts

Manual of Text Annotation

Soon: Annotated texts (1000 abstracts) by the end of March

slide113
IE can contribute to Bio-informatics significantly.

2. However, the domains in Bio-chemistry seem more

structurally rich than the domains we have dealt with so

far.

Term formation, rich ontologies, complex syntactic

structures.

3. It requires substantial efforts in resource building.

4. However, those resources can contribute to other

applications :

Knowledge sharing, Intelligent IR, Knowledge discovery

One of the crucial techniques is ATR ….

overview of genia system1
Retrieval Module

Corpus

  • Request enhancement
  • Spawn request
  • Classify documents

Information Extraction

Module

  • Identify & classify terms
  • Identify events

Interface Module

Database

  • GUI
  • HTML conversion
  • System integration

Ontology

Markup

language

Data model

Raw(OCR)

Text

Structure

Annotated

Background Knowledge

Document

Named-Entity

Event

Overview of GENIA System

MEDLINE

Corpus Module

  • Markup generation / compilation
  • Annotated corpus construction
  • User
  • IR Request
  • Abstract
  • Full Paper

Security

Database Module

Concept Module

  • DB design / access / management
  • DB construction
  • BK design / construction / compilation
ad