Information extraction from scientific texts
Download
1 / 114

Information Extraction - PowerPoint PPT Presentation


  • 415 Views
  • Updated On :

Information Extraction from Scientific Texts. Junichi Tsujii Graduate School of Science University of Tokyo Japan. Texts are one of the major sources of information and knowledge. However, they are not transparent. They have to be systematically integrated with

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Information Extraction' - andrew


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Information extraction from scientific texts

Information ExtractionfromScientific Texts

Junichi Tsujii

Graduate School of Science

University of Tokyo

Japan


Texts are one of the major sources of information and knowledge.

However, they are not transparent.

They have to be systematically integrated with

the other sources like data bases, numerical data,

etc.

Natural Language Processing--IE


Overview of genia system

Retrieval Module knowledge.

Corpus

  • Request enhancement

  • Spawn request

  • Classify documents

Information Extraction

Module

  • Identify & classify terms

  • Identify events

Interface Module

Database

  • GUI

  • HTML conversion

  • System integration

Ontology

Markup

language

Data model

Raw(OCR)

Text

Structure

Annotated

Background Knowledge

Document

Named-Entity

Event

Overview of GENIA System

MEDLINE

Corpus Module

  • Markup generation / compilation

  • Annotated corpus construction

  • User

  • IR Request

  • Abstract

  • Full Paper

Security

Database Module

Concept Module

  • DB design / access / management

  • DB construction

  • BK design / construction / compilation


Plan knowledge.

  • What is IE ?

  • General Framework of NLP

  • Basic IE techniques

  • IE in Biology

Automatic Term Recognition

(S. Ananiadou)


What is IE ? knowledge.


Application Tasks of NLP knowledge.

(1)Information Retrieval/Detection

To search and retrieve documents in response to queries

for information

(2)Passage Retrieval

To search and retrieve part of documents in response

to queries for information

(3)Information Extraction

To extract information that fits pre-defined database schemas

or templates, specifying the output formats

(4) Question/Answering Tasks

To answer general questions by using texts as knowledge

base: Fact retrieval, combination of IR and IE

(5)Text Understanding

To understand texts as people do: Artificial Intelligence


Ranges of Queries knowledge.

(1)Information Retrieval/Detection

(2)Passage Retrieval

Pre-Defined: Fixed aspects

of information carried in texts

(3)Information Extraction

(4) Question/Answering Tasks

(5)Text Understanding


Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)


Bridgestone Sports Co. venture said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)


Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)


Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)


Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)


production of venture

20, 000 iron and

metal wood clubs

[company]

[set up]

[Joint-Venture]

with

[company]

FASTUS

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

set up

new Twaiwan dallors

2.Basic Phrases:

Simple noun groups, verb groups and particles

a Japanese trading house

had set up

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)


Information Extraction venture

……….

Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the

second floor of his Nanjing home early on Sunday.

The deputy general manager of Yaxing Benz, a Sino-German

joint venture that makes buses and bus chassis in nearby Yangzhou,

was hacked to death with 45 cm watermelon knives.

……….

Name of the Venture: Yaxing Benz

Products: buses and bus chassis

Location: Yangzhou,China

Companies involved: (1)Name: X?

Country: German

(2)Name: Y?

Country: China


Information Extraction venture

A German vehicle-firm executive was stabbed to death ….

……….

Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the

second floor of his Nanjing home early on Sunday.

The deputy general manager of Yaxing Benz, a Sino-German

joint venture that makes buses and bus chassis in nearby Yangzhou,

was hacked to death with 45 cm watermelon knives.

……….

Crime-Type: Murder

Type: Stabbing

The killed: Name: Jurgen Pfrang

Age: 51

Profession: Deputy general manager

Location: Nanjing, China

Different template

for crimes


User venture

User

System

System

System

Interpretation of Texts

(1)Information Retrieval/Detection

(2)Passage Retrieval

(3)Information Extraction

(4) Question/Answering Tasks

(5)Text Understanding


Characterization of Texts venture

Queries

IR System

Collection of Texts


Knowledge venture

Interpretation

Characterization of Texts

Queries

IR System

Collection of Texts


Knowledge venture

Interpretation

Characterization of Texts

Queries

Passage

IR System

Collection of Texts


Knowledge venture

Interpretation

Characterization of Texts

Queries

Structures

of

Sentences

NLP

Templates

Passage

IR System

IE System

Collection of Texts

Texts


Knowledge venture

Interpretation

IE System

Templates

Texts


Knowledge venture

General Framework

of

NLP/NLU

IE as

compromise NLP

Interpretation

IE System

Templates

Texts

Predefined


Rather clear venture

A bit vague

Rather clear

A bit vague

Very vague

Performance Evaluation

(1)Information Retrieval/Detection

(2)Passage Retrieval

(3)Information Extraction

(4) Question/Answering Tasks

(5)Text Understanding


Query venture

N: Correct Documents

M:Retrieved Documents

C: Correct Documents that are

actually retrieved

N

M

C

C

Precision:

Recall:

P

M

N

C

R

2P・R

F-Value:

P+R

Collection of Documents


Query venture

N: Correct Templates

M:Retrieved Templates

C: Correct Templates that are

actually retrieved

N

M

C

C

Precision:

Recall:

P

M

N

C

R

2P・R

F-Value:

P+R

Collection of Documents

More complicated due to partially

filled templates



General Framework of NLP venture

John runs.

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


General Framework of NLP venture

John runs.

Morphological and

Lexical Processing

John run+s.

P-N V 3-pre

N plu

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


General Framework of NLP venture

John runs.

Morphological and

Lexical Processing

John run+s.

P-N V 3-pre

N plu

S

Syntactic Analysis

NP

VP

P-N

V

Semantic Analysis

John

run

Context processing

Interpretation


Pred: RUN venture

Agent:John

General Framework of NLP

John runs.

Morphological and

Lexical Processing

John run+s.

P-N V 3-pre

N plu

S

Syntactic Analysis

NP

VP

P-N

V

Semantic Analysis

John

run

Context processing

Interpretation


Pred: RUN venture

Agent:John

General Framework of NLP

John runs.

Morphological and

Lexical Processing

John run+s.

P-N V 3-pre

N plu

S

Syntactic Analysis

NP

VP

P-N

V

Semantic Analysis

John

run

Context processing

Interpretation

John is a student.

He runs.


General Framework of NLP venture

Tokenization

Morphological and

Lexical Processing

Part of Speech Tagging

Inflection/Derivation

Compounding

Syntactic Analysis

Term recognition

(Ananiadou)

Semantic Analysis

Context processing

Interpretation

Domain Analysis

Appelt:1999


Difficulties of NLP venture

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


Difficulties of NLP venture

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Incomplete Lexicons

Open class words

Terms

Term recognition

Named Entities

Company names

Locations

Numerical expressions

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


Difficulties of NLP venture

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Incomplete Grammar

Syntactic Coverage

Domain Specific

Constructions

Ungrammatical

Constructions

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


Predefined venture

Aspects of

Information

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Incomplete

Domain Knowledge

Interpretation Rules

Context processing

Interpretation


Difficulties of NLP venture

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

(2) Ambiguities:

Combinatorial

Explosion

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


Difficulties of NLP venture

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Most words in English

are ambiguous in terms

of their part of speeches.

runs: v/3pre, n/plu

clubs: v/3pre, n/plu

and two meanings

Morphological and

Lexical Processing

(2) Ambiguities:

Combinatorial

Explosion

Syntactic Analysis

Semantic Analysis

Context processing

Interpretation


Difficulties of NLP venture

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

(2) Ambiguities:

Combinatorial

Explosion

Syntactic Analysis

Structural Ambiguities

Predicate-argument

Ambiguities

Semantic Analysis

Context processing

Interpretation


Semantic Ambiguities(1) venture

John bought a car with Mary.

$3000 can buy a nice car.

Semantic Ambiguities(2)

Every man loves a woman.

Co-reference Ambiguities

Structural Ambiguities

(1)Attachment Ambiguities

John bought a carwith large seats.

John bought a car with $3000.

The manager of Yaxing Benz, a Sino-German joint venture

The manager of Yaxing Benz, Mr. John Smith

(2) Scope Ambiguities

young women and men in the room

(3)Analytical Ambiguities

Visiting relatives can be boring.


Combinatorial venture

Explosion

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

(2) Ambiguities:

Combinatorial

Explosion

Syntactic Analysis

Structural Ambiguities

Predicate-argument

Ambiguities

Semantic Analysis

Context processing

Interpretation


Note: venture

Ambiguities vs Robustness

More comprehensive knowledge: More Robust

big dictionaries

comprehensive grammar

More comprehensive knowledge: More ambiguities

Adaptability: Tuning, Learning


Framework of IE venture

IE as compromise NLP


Predefined venture

Aspects of

Information

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Incomplete

Domain Knowledge

Interpretation Rules

Context processing

Interpretation


Predefined venture

Aspects of

Information

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Incomplete

Domain Knowledge

Interpretation Rules

Context processing

Interpretation


Techniques in IE venture

(1) Domain Specific Partial Knowledge:

Knowledge relevant to information to be extracted

(2) Ambiguities:

Ignoring irrelevant ambiguities

Simpler NLP techniques

(3) Robustness:

Coping with Incomplete dictionaries

(open class words)

Ignoring irrelevant parts of sentences

(4) Adaptation Techniques:

Machine Learning, Trainable systems


95 % venture

FSA rules

Statistic taggers

Part of Speech Tagger

Local Context

Statistical Bias

F-Value

90

Domain

Dependent

General Framework of NLP

Open class words:

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Anaysis

Domain specific rules:

<Word><Word>, Inc.

Mr. <Cpt-L>. <Word>

Machine Learning:

HMM, Decision Trees

Rules + Machine Learning

Context processing

Interpretation


FASTUS venture

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Anaysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


FASTUS venture

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Anaysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


FASTUS venture

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Analysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Computationally more complex, Less Efficiency venture

Chomsky Hierarchy Hierarchy

of Grammar of Automata

Regular Grammar Finite State Automata

Context Free Grammar Push Down Automata

Context Sensitive Grammar Linear Bounded Automata

Type 0 Grammar Turing Machine


n venture

n

A B

Computationally more complex, Less Efficiency

Chomsky Hierarchy Hierarchy

of Grammar of Automata

Regular Grammar Finite State Automata

Context Free Grammar Push Down Automata

Context Sensitive Grammar Linear Bounded Automata

Type 0 Grammar Turing Machine


1 venture

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN


1 venture

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN


1 venture

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN


1 venture

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN


1 venture

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

book with a nice cover

P

4

PN


1 venture

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwith a nice cover

P

4

PN


1 venture

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwith a nice cover

P

4

PN


1 venture

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwitha nice cover

P

4

PN


1 venture

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwithanice cover

P

4

PN


1 venture

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwithanicecover

P

4

PN


Pattern-maching venture

PN ’s (ADJ)* N P Art (ADJ)* N

{PN ’s/ Art}(ADJ)* N(P Art (ADJ)* N)*

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwithanicecover

P

4

PN


FASTUS venture

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Analysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Bridgestone Sports Co. venture said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Attachment

Ambiguities

are not made

explicit

Example of IE: FASTUS(1993)

1.Complex words

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location


Bridgestone Sports Co. venture said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Example of IE: FASTUS(1993)

{{

}}

1.Complex words

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

a Japanese tea house

a [Japanese tea] house

a Japanese [tea house]


Bridgestone Sports Co. venture said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Structural

Ambiguities of

NP are ignored

Example of IE: FASTUS(1993)

1.Complex words

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location


Bridgestone Sports Co. venture said Friday it had set upa joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubsa month.

Example of IE: FASTUS(1993)

3.Complex Phrases

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location


[COMPNY] venturesaid Friday it [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] and [COMPNY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE], [COMPNY], capitalized at 20 million

[CURRENCY-UNIT][START] production in [TIME]

with production of 20,000 [PRODUCT] a month.

Example of IE: FASTUS(1993)

3.Complex Phrases

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

Some syntactic structures

like …


[COMPNY] venturesaid Friday it [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE] capitalized at [CURRENCY][START] production in [TIME]

with production of [PRODUCT] a month.

Example of IE: FASTUS(1993)

3.Complex Phrases

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

Syntactic structures relevant

to information to be extracted

are dealt with.


Syntactic variations venture

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint

venture with Toyota.


S venture

NP

VP

GM

[SET-UP]

V

NP

signed

VP

N

agreement

V

setting up

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint

venture with Toyota.

GM plans to set up a joint venture with Toyota.

GM expects to set up a joint venture with Toyota.


[SET-UP] venture

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint

venture with Toyota.

S

NP

VP

GM

V

set up

GM plans to set up a joint venture with Toyota.

GM expects to set up a joint venture with Toyota.


[COMPNY] venture [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE] capitalized at [CURRENCY][START] production in [TIME]

with production of [PRODUCT] a month.

The attachment positions of PP are determined at this stage.

Irrelevant parts of sentences are ignored.

Example of IE: FASTUS(1993)

3.Complex Phrases

4.Domain Events

[COMPANY][SET-UP][JOINT-VENTURE]with[COMPNY]

[COMPANY][SET-UP][JOINT-VENTURE] (others)* with[COMPNY]


Complications caused by syntactic variations venture

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG] Relpro {NG/others}* [VG]


Complications caused by syntactic variations venture

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG]Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG]Relpro {NG/others}*[VG]


Patterns used venture

by Domain Event

Surface Pattern

Generator

Basic patterns

Relative clause construction

Passivization, etc.

Complications caused by syntactic variations

Relative clause

The mayor, who waskidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG] Relpro {NG/others}* [VG]


Piece-wise recognition venture

of basic templates

Reconstructing information

carried via syntactic structures

by merging basic templates

FASTUS

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


FASTUS venture

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Piece-wise recognition

of basic templates

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

Reconstructing information

carried via syntactic structures

by merging basic templates


FASTUS venture

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Piece-wise recognition

of basic templates

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

Reconstructing information

carried via syntactic structures

by merging basic templates


Current state of the arts of IE venture

  • Carefully constructed IE systems

  • F-60 level (interannotater agreement: 60-80%)

  • Domain: telegraphic messages about naval operation

  • (MUC-1:87, MUC-2:89)

  • news articles and transcriptions of radio broadcasts

  • Latin American terrorism (MUC-3:91, MUC-4:1992)

  • News articles about joint ventures (MUC-5, 93)

  • News articles about management changes (MUC-6, 95)

  • News articles about space vehicle (MUC-7, 97)

  • Handcrafted rules (named entity recognition, domain events, etc)

Automatic learning from texts:

Supervised learning : corpus preparation

Non-supervised, or controlled learning


IE in Biology venture


Csndb national institute of health sciences
CSNDB venture(National Institute of Health Sciences)

  • A data- and knowledge- base for signaling pathways of human cells.

    • It compiles the information on biological molecules, sequences, structures, functions, and biological reactions which transfer the cellular signals.

    • Signaling pathways are compiled as binary relationships of biomolecules and represented by graphs drawn automatically.

    • CSNDB is constructed on ACEDB and inference engine CLIPS, and has a linkage to TRANSFAC.

    • Final goal is to make a computerized model for various biological phenomena.


Example 1
Example. 1 venture

Excerpted @[Takai98]

  • Signal_Reaction:

    “EGF receptor  Grb2”

  • From_molecule “EGF receptor”

  • To_molecule “Grb2”

  • Tissue “liver”

  • Effect “activation”

  • Interaction

    “SH2+phosphorylated Tyr”

  • Reference [Yamauchi_1997]

  • A Standard Reaction


Example 3
Example. 3 venture

Excerpted @[Takai98]

  • Signal_Reaction:

    “Ah receptor + HSP90 ”

  • Component “Ah receptor” “HSP90”

  • Effect “activation dissociation”

  • Interaction

    “PAS domain”

    “of Ah receptor”

  • Activity

    “inactivation of Ah receptor”

  • Reference [Powell-Coffman_1998]

  • A Polymerization Reaction


FASTUS venture

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

2.Basic Phrases:

Simple noun groups, verb groups and particles

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


FASTUS venture

Is separation of

stages possible ?

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

2.Basic Phrases:

Simple noun groups, verb groups and particles

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


FASTUS venture

Is separation of

stages possible ?

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Open word classes:

techical terms

very long

specific formation rules

many semantic classes

acronyms

variants

fairly ambiguous

[[Term recognition]]

Coordination across word

formation

A or B and C D

2.Basic Phrases:

Simple noun groups, verb groups and particles

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


FASTUS venture

Is separation of

stages possible ?

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

2.Basic Phrases:

Simple noun groups, verb groups and particles

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.


Syntax/Semantics venture

An active phorbol ester must therefore, presumably

by activation of protein kinase C, cause dissociation

of a cytoplasmic complex of NF-kappa B and I kappa B

by modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.


Syntax/Semantics venture

An active phorbol ester must therefore, presumably

by activation of protein kinase C, cause dissociation

of a cytoplasmic complex of NF-kappa B and I kappa B

by modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.


Part-Whole venture

Syntax/Semantics

An active phorbol ester must therefore, presumably

by activation of protein kinase C, cause dissociation

of a cytoplasmic complex of NF-kappa B and I kappa B

by modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.

E3: It dissociates a cytoplasmic complex of NF-kappa B

and I kappa B.


Syntax/Semantics venture

An active phorbol ester must therefore, presumably

by activation of protein kinase C, cause dissociation

of a cytoplasmic complex of NF-kappa B and I kappa B

by modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.

E3: It dissociates a cytoplasmic complex of NF-kappa B

and I kappa B.

Part-Whole


Full parser based on good grammar formalisms venture

  • Several attempts of using full parsers :

  • To improve the Precision

  • Systematic treatment of interaction of the

  • different phases :

  • Unification-based grammar formalisms

  • The two papers in the NLP session of PSB 2001


180 sentences from abstracts in MEDLINE venture

Theaverage parse time per sentence: 2.7 sec by a naïve parser

(This can be improved by the multi-stage parser by 50 times)

Experiment

(A.Yakushiji et.al, PSB2001)

XHPSG: HPSG-like Grammar translated from

XTAG of U-Penn (Y.Tateishi, TAG+ workshop 98)

Terms (Compound nouns) are chunked beforehand.

Automatic conversion: Detailed, empirical comparison of

grammars of different formalisms (+LFG)


68% venture

Argument Frame Extractor

133 argument structures, marked by a domain specialist

in 97 sentences among the 180 sentences

Extracted Uniquely

31

Extracted with ambiguity

32

Extractable from pp’s

26

Parsing

Failures

Not extractable

27

Memory limitation,etc

17


Open class words: venture

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Ontology: Knowledge of the Domain

More refined semantic

classes with part-whole

relationships, properties,

Etc.

Acronyms, variants,

Etc.


Open class words: venture

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Ontology: Knowledge of the Domain

More refined semantic

classes with part-whole

relationships, properties,

Etc.

Acronyms, variants,

Etc.


T venture

B

B

Bio Term Bank

Biological ontology committee Japan organized by T. Takagi and T. Takai, U.Tokyo in Genome Projects of MESSC (2000.4~2005.3)

  • A database for all sort of biological terms collected from genome databases and biological texts.

  • It will contain 2 million terms in 2001 and 5 million terms until 2005.

  • Terms are classified by biochemical and terminological attributes, grounded on their resources.


Open class words: venture

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Ontology: Knowledge of the Domain

More refined semantic

classes with part-whole

relationships, properties,

Etc.

Acronyms, variants,

Etc.


Genia ontology current version
GENIA ontology venture(current version)

+-name-+-source-+-natural-+-organism-+-multi-cell organism

| | | +-mono-cell organism

| | | +-virus

| | +-tissue

| | +-cell type

| | +-sub-location of cells

| +-artificial-+-cell line

|

+-substance-+-compound-+-organic-+-amino-+-protein-+-protein family or group

| |+-protein complex

| |+-individual protein molecule

| | +-subunit of protein complex

| | +-substructure of protein

| | +-domain or region of protein

| +-peptide

| +-amino acid monomer

|

+-nucleic-+-DNA-+-DNA family or group

| +-individual DNA molecule

| +-domain or region of DNA

|

+-RNA-+-RNA family or group

+-individual RNA molecule

+-domain or region of RNA


Expansion of genia ontology
Expansion of GENIA Ontology venture

  • Try to tag all NPs in some MEDLINE abstracts and find the classes that appears in abstracts but not in current ontology

  • Find frequent verbs and what class of arguments they take


Expansion of genia ontology1
Expansion of GENIA Ontology venture

  • Chemical class of substance and their substrucutres

  • Sources

  • Biological role, or function, of substances

  • Reaction

    • Biological reaction

    • Pathway

    • Disease

  • Structure themselves

  • Experiment , experimental results, and researchers

  • Measure


Example of entities in expanded
Example of Entities in Expanded venture

  • Biological role, or function, of substances

    • receptor, inhibitor, …

  • Biological reaction

    • activation, binding, inhibition, apoptosis, G2 arrest

    • pathway, signal

    • immune dysfunction, Ataxia telangiectasia (AT)

  • Structure themselves

    • alpha-helix,

  • Experiment, experimental results, researchers

    • our results, these studies, we


Verbs related to biological events frequent verbs in 100 medline abstracts
Verbs Related to Biological Events ventureFrequent Verbs in 100 MEDLINE Abstracts


Verbs related to biological events verbs that take biological entities as arguments
Verbs Related to Biological Events ventureVerbs that take biological entities as arguments

  • induce

    • noun BE INDUCED BY nounactivation of these PROTEINwas induced byPROTEIN

    • noun INDUCE nounPROTEINinducedthe tyrosine phosphorylation

  • bind

    • noun BIND TO nounthe drugsbind totwo different PROTEIN

    • noun BIND nounmotifs previously found tobindthe cellular factors

    • noun BINDING noun theTATA-box binding protein

    • the BINDING of noun the binding of PROTEIN

semantic class: substancestructuresourceexperimentfact reaction


Verbs related to biological events verbs that take description entities
Verbs Related to Biological Events ventureVerbs that take description entities

  • report

    • noun REPORT that-clause wereport here that PROTEIN is activated by PROTEIN

    • noun REPORT noun we report the characterization of PROTEIN

    • noun REPORT noun wereporta novel structureof PROTEIN

semantic class: substancestructure sourceexperimentfact reaction


Verbs related to biological events verbs whose arguments depend on syntactic patterns
Verbs Related to Biological Events ventureVerbs whose arguments depend on syntactic patterns

  • show

    • noun BE SHOWN to-infinitive PROTEIN has been shown totrigger cellular PROTEIN activity

    • noun SHOW that-clause the datashow thatPROTEIN stimulation is also not sufficient

    • noun SHOW noun SOURCEshoweda dose-dependent inhibition of PROTEIN activity

semantic class: substancesourceexperimentfact


Verbs related to biological events verbs that take both entities
Verbs Related to Biological Events ventureVerbs that take both entities

  • indicate

    • noun INDICATE that-clause the dataindicate that PROTEIN is required in CELL prolifiration

    • noun INDICATE noun these findingsindicatean unexpected role of DNA

    • noun INDICATE that-clause the structureindicatesthat it represents a unique class of PROTEIN

    • noun INDICATE noun the structureindicatesmechanismsforallosteric effector action

semantic class: substancestructure sourceexperimentfact reaction role


Example of ne annotation
Example of NE Annotation venture

UI - 85146267

TI - Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" mt="SV" subclass="family_or_group" unsure="Class" cmt="">aldosterone binding sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.

AB - <NE ti="4" class="protein" nm="Aldosterone binding sites" mt="SV" subclass="family_or_group" unsure="Class" cmt="">Aldosterone binding sites</NE ti="4"> in <NE ti="1" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="1"> were characterized after separation of cells from blood by a Percoll gradient. After washing and resuspension in <NE ti="5" class="other_organic_compounds" nm="RPMI-1640 medium" mt="SV" unsure="OK" cmt="">RPMI-1640 medium</NE ti="5">, cells were incubated at 37 degrees C for 1 h with different concentrations of <NE ti="6" class="other_organic_compounds" nm="[3H]aldosterone" mt="SV" unsure="OK" cmt="">[3H]aldosterone</NE ti="6"> plus a 100-fold concentration of <NE ti="7" class="other_organic_compounds" nm="RU-26988" mt="SV" unsure="OK" cmt="">RU-26988 </NE ti="7">(<NE ti=“17" class="other_organic_compounds" nm="11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one" mt="SV" unsure="OK" cmt="">11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one</NE ti=“17">), with or without an excess of unlabeled <NE ti="8" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="8">. <NE ti="9" class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK" cmt="">Aldosterone</NE ti="9"> binds to a single class of <NE ti="10" class="protein" nm="receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">receptors</NE ti="10"> with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity data show a hierarchy of affinity of <NE ti="11" class="other_organic_compounds" nm="desoxycorticosterone" mt="SV" unsure="OK" cmt="">desoxycorticosterone</NE ti="11"> = <NE ti="12" class="other_organic_compounds" nm="corticosterone" mt="SV" unsure="OK" cmt="">corticosterone</NE ti="12"> = <NE ti="13" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="13"> greater than <NE ti="14" class="other_organic_compounds" nm="hydrocortisone" mt="SV" unsure="OK" cmt="">hydrocortisone</NE ti="14"> greater than <NE ti="15" class="other_organic_compounds" nm="dexamethasone" mt="SV" unsure="OK" cmt="">dexamethasone</NE ti="15">. The results indicate that <NE ti="17" class="cell_type" nm="mononuclear leukocyte" mt="SV" unsure="OK" cmt="">mononuclear leukocytes</NE ti="17"> could be useful for studying the physiological significance of these <NE ti="16" class="protein" nm="mineralocorticoid receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">mineralocorticoid receptors</NE ti="16"> and their regulation in humans.


Available from our website: venture

Definition of ontological classes

Manual of GMPL: extention of XML to annonate texts

Manual of Text Annotation

Soon: Annotated texts (1000 abstracts) by the end of March


2. However, the domains in Bio-chemistry seem more

structurally rich than the domains we have dealt with so

far.

Term formation, rich ontologies, complex syntactic

structures.

3. It requires substantial efforts in resource building.

4. However, those resources can contribute to other

applications :

Knowledge sharing, Intelligent IR, Knowledge discovery

One of the crucial techniques is ATR ….


Overview of genia system1

Retrieval Module venture

Corpus

  • Request enhancement

  • Spawn request

  • Classify documents

Information Extraction

Module

  • Identify & classify terms

  • Identify events

Interface Module

Database

  • GUI

  • HTML conversion

  • System integration

Ontology

Markup

language

Data model

Raw(OCR)

Text

Structure

Annotated

Background Knowledge

Document

Named-Entity

Event

Overview of GENIA System

MEDLINE

Corpus Module

  • Markup generation / compilation

  • Annotated corpus construction

  • User

  • IR Request

  • Abstract

  • Full Paper

Security

Database Module

Concept Module

  • DB design / access / management

  • DB construction

  • BK design / construction / compilation


ad