Information extraction



Information Extraction

Adapted from slides by Junichi Tsujii, Ronen Feldman and others


Most Data are Unstructured (Text) or Semi-Structured…

  • Email
  • Insurance claims
  • News articles
  • Web pages
  • Patent portfolios
  • Customer complaint letters
  • Contracts
  • Transcripts of phone calls with customers
  • Technical documents

Text data mining has become more and more important…

(Adapted from J. Dorre et al. “Text Mining: Finding Nuggets in Mountains of Textual Data”)


Application Tasks of NLP

(1) Information Retrieval/Detection
To search and retrieve documents in response to queries for information

(2) Passage Retrieval
To search and retrieve part of documents in response to queries for information

(3) Information Extraction
To extract information that fits pre-defined database schemas or templates, specifying the output formats

(4) Question/Answering Tasks
To answer general questions by using texts as knowledge base: Fact retrieval, combination of IR and IE

(5) Text Understanding
To understand texts as people do: Artificial Intelligence


Information Extraction: A Pragmatic Approach

  • Let application requirements drive semantic analysis

  • Identify the types of entities that are relevant to a particular task

  • Identify the range of facts that one is interested in for those entities

  • Ignore everything else


IE definitions

  • Entity: an object of interest such as a person or organization

  • Attribute: A property of an entity such as name, alias, descriptor or type

  • Fact: A relationship held between two or more entities such as Position of Person in Company

  • Event: An activity involving several entities such as terrorist act, airline crash, product information
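The four definitions above can be read as a small data model. The following is a minimal sketch, assuming Python dataclasses; the class and field names, and the alias "GE", are illustrative choices rather than anything prescribed by the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """An object of interest, such as a person or organization."""
    entity_type: str           # e.g. "Person", "Organization"
    name: str
    attributes: tuple = ()     # (attribute, value) pairs: alias, descriptor, type


@dataclass(frozen=True)
class Fact:
    """A relationship held between two or more entities."""
    relation: str
    arguments: tuple           # the participating Entity objects


@dataclass(frozen=True)
class Event:
    """An activity involving several entities, each in a named role."""
    event_type: str
    participants: tuple        # (role, Entity) pairs


# Illustrative values only; the alias "GE" is an assumption.
welch = Entity("Person", "Jack Welch", (("descriptor", "CEO"),))
ge = Entity("Organization", "General Electric", (("alias", "GE"),))
position = Fact("Position of Person in Company", (welch, ge))
```

Keeping attributes on the entity and relations in separate records mirrors the ordering in the accuracy figures: each level builds on the one before it.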


IE accuracy: typical figures by information type

  • Entity: 90-98%

  • Attribute: 80%

  • Fact: 60-70%

  • Event: 50-60%


MUC conferences

  • MUC 1 to MUC 7

  • 1987 to 1997

  • Topics:

    • Naval operations (2)

    • Terrorist Activity (2)

    • Joint venture and microelectronics

    • Management changes

    • Space Vehicles and Missile launches


MUC and Scenario Templates

  • Define a set of “interesting entities”

    • Persons, organizations, locations…

  • Define a complex scenario involving interesting events and relations over entities

    • Example: management succession: persons, companies, positions, reasons for succession

  • This collection of entities and relations is called a “scenario template.”


Problems with Scenario Template

  • Encouraged development of highly domain specific ontologies, rule systems, heuristics, etc.

  • Most of the effort expended on building a scenario template system was not directly applicable to a different scenario template.


Addressing the Problem

  • Address a large number of smaller, more focused scenario templates (Event-99)

  • Develop a more systematic ground-up approach to semantics by focusing on elementary entities, relations, and events (ACE)


The ACE Evaluation

  • The ACE program – challenge of extracting content from human language. Research effort directed to master

    • first the extraction of “entities”

    • Then the extraction of “relations” among these entities

    • Finally the extraction of “events” that are causally related sets of relations

  • After two years, top systems successfully capture well over 50 % of the value at the entity level


The ACE Program

  • “Automated Content Extraction”

  • Develop core information extraction technology by focusing on extracting specific semantic entities and relations over a very wide range of texts.

  • Corpora: Newswire and broadcast transcripts, but broad range of topics and genres.

    • Third person reports

    • Interviews

    • Editorials

    • Topics: foreign relations, significant events, human interest, sports, weather

  • Discourage highly domain- and genre-dependent solutions


Applications of IE

  • Routing of information

  • Infrastructure for IR and categorization (higher level features)

  • Event based summarization

  • Automatic creation of databases and knowledge bases


Where would IE be useful?

  • Semi-structured text

  • Generic documents like news articles

  • Most of the information in the doc is centered around a set of easily identifiable entities


The Problem

Date

Time: Start - End

Location

Speaker

Person


What is “Information Extraction”

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Courtesy of William W. Cohen


What is “Information Extraction”

Information Extraction =

segmentation + classification + association + clustering

Segments found in the article above:

  • Microsoft Corporation
  • CEO
  • Bill Gates
  • Microsoft
  • Gates
  • Microsoft
  • Bill Veghte
  • Microsoft
  • VP
  • Richard Stallman
  • founder
  • Free Software Foundation

aka “named entity extraction”

Courtesy of William W. Cohen




What is “Information Extraction”

Information Extraction =

segmentation + classification + association + clustering

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Courtesy of William W. Cohen
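The four steps can be sketched end to end on a few sentences abridged from the article. This is a toy illustration under strong assumptions: the lexicon is hand-written, association simply pairs the first name, title and organization found in one sentence, and clustering groups mentions whose words are a subset of a longer mention. None of this is claimed to be the method behind the slides.

```python
import re

# Hypothetical hand-written lexicon (segmentation + classification in one lookup).
LEXICON = {
    "Microsoft Corporation": "ORG", "Microsoft": "ORG",
    "Free Software Foundation": "ORG",
    "CEO": "TITLE", "VP": "TITLE", "founder": "TITLE",
    "Bill Gates": "NAME", "Gates": "NAME",
    "Bill Veghte": "NAME", "Richard Stallman": "NAME",
}
# Longest alternatives first, so "Microsoft Corporation" wins over "Microsoft".
MENTION = re.compile(r"\b(?:%s)\b" % "|".join(
    re.escape(p) for p in sorted(LEXICON, key=len, reverse=True)))

# Sentences abridged from the article for illustration.
SENTENCES = [
    "For years, Microsoft Corporation CEO Bill Gates railed against open-source software.",
    "Gates himself says Microsoft will gladly disclose its crown jewels.",
    '"We love the concept of shared source," said Bill Veghte, a Microsoft VP.',
    "Richard Stallman, founder of the Free Software Foundation, countered saying...",
]


def segment_and_classify(sentence):
    """Return (mention, label) pairs in order of appearance."""
    return [(m.group(), LEXICON[m.group()]) for m in MENTION.finditer(sentence)]


def associate(sentence):
    """Pair the first NAME, TITLE and ORG of a sentence into one record."""
    first = {}
    for text, label in segment_and_classify(sentence):
        first.setdefault(label, text)
    if all(k in first for k in ("NAME", "TITLE", "ORG")):
        return (first["NAME"], first["TITLE"], first["ORG"])
    return None


def cluster(mentions):
    """Group mentions whose words are a subset of a longer mention's words."""
    clusters = []
    for text in sorted(set(mentions), key=len, reverse=True):
        for group in clusters:
            if any(set(text.split()) <= set(other.split()) for other in group):
                group.add(text)
                break
        else:
            clusters.append({text})
    return clusters


records = [r for r in map(associate, SENTENCES) if r]
clusters = cluster([t for s in SENTENCES
                    for t, label in segment_and_classify(s) if label != "TITLE"])
```

With these inputs, `records` holds one (name, title, organization) triple per sentence that mentions all three, and `clusters` places "Gates" with "Bill Gates" and "Microsoft" with "Microsoft Corporation".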


Landscape of IE Tasks: Single Field/Record

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity (“named entity” extraction)

  • Person: Jack Welch
  • Person: Jeffrey Immelt
  • Location: Connecticut

Binary relationship

  • Relation: Person-Title; Person: Jack Welch; Title: CEO
  • Relation: Company-Location; Company: General Electric; Location: Connecticut

N-ary record

  • Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt


Landscape of IE Techniques

Running example: Abraham Lincoln was born in Kentucky.

  • Lexicons: is the candidate a member of a list (Alabama, Alaska, Wisconsin, Wyoming)?
  • Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to
  • Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes
  • Boundary Models: classifiers mark BEGIN and END positions
  • Finite State Machines: most likely state sequence?
  • Context Free Grammars: most likely parse? (tags NNP, NNP, V, V, P, NP; constituents NP, PP, VP, VP, S)

Courtesy of William W. Cohen
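Two of the simpler techniques, lexicon lookup and the sliding window, combine naturally: slide windows of several sizes over the tokens and test each window for lexicon membership. The sketch below assumes a tiny set of state names (with "Kentucky" and "New York" added for illustration) and a hypothetical function name.

```python
# Hypothetical lexicon of US state names; only a handful are listed here.
STATES = {"Alabama", "Alaska", "Kentucky", "New York", "Wisconsin", "Wyoming"}


def sliding_window_lexicon(tokens, lexicon, max_len=3):
    """Return (start, end, text) for every window whose text is in the lexicon."""
    hits = []
    for size in range(1, max_len + 1):           # try alternate window sizes
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            if candidate in lexicon:             # "member?" test
                hits.append((start, start + size, candidate))
    return hits


tokens = "Abraham Lincoln was born in Kentucky .".split()
hits = sliding_window_lexicon(tokens, STATES)
```

A learned classifier could replace the membership test without changing the window loop, which is what separates the sliding-window approach from plain lexicon lookup.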


IE with Hidden Markov Models

Given a sequence of observations:

Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM with states:

  • person name
  • location name
  • background

Find the most likely state sequence (Viterbi):

Yesterday [Pedro Domingos] spoke this example sentence.

Any words said to be generated by the designated “person name” state are extracted as a person name:

Person name: Pedro Domingos
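The decoding step can be sketched with a small Viterbi implementation. The start, transition and emission probabilities below are invented for illustration, not trained values; a real HMM would estimate them from annotated text.

```python
import math

STATES = ("background", "person", "location")

# All probabilities are hand-set assumptions for this example.
START = {"background": 0.8, "person": 0.1, "location": 0.1}
TRANS = {
    "background": {"background": 0.8, "person": 0.1, "location": 0.1},
    "person":     {"background": 0.4, "person": 0.5, "location": 0.1},
    "location":   {"background": 0.5, "person": 0.1, "location": 0.4},
}
EMIT = {
    "background": {"yesterday": 0.2, "spoke": 0.2, "this": 0.2,
                   "example": 0.2, "sentence": 0.2},
    "person":     {"pedro": 0.5, "domingos": 0.5},
    "location":   {"kentucky": 0.5, "taiwan": 0.5},
}
UNSEEN = 1e-6  # small probability for words a state has never emitted


def viterbi(tokens):
    """Return the most likely state sequence for the tokens (log space)."""
    def emit(state, token):
        return math.log(EMIT[state].get(token.lower(), UNSEEN))

    best = {s: math.log(START[s]) + emit(s, tokens[0]) for s in STATES}
    back = []                                   # one back-pointer table per step
    for token in tokens[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            p = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            best[s] = prev[p] + math.log(TRANS[p][s]) + emit(s, token)
            pointers[s] = p
        back.append(pointers)

    state = max(STATES, key=best.get)           # best final state
    path = [state]
    for pointers in reversed(back):             # follow back-pointers
        state = pointers[state]
        path.append(state)
    return path[::-1]


tokens = "Yesterday Pedro Domingos spoke this example sentence".split()
path = viterbi(tokens)
person_name = " ".join(t for t, s in zip(tokens, path) if s == "person")
```

Tokens that the decoder assigns to the "person" state are joined into the extracted name, as in the slide.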


HMM for Segmentation

  • Simplest Model: One state per entity type


Discriminative Approaches

Yesterday Pedro Domingos spoke this example sentence.

Is this phrase (X) a name? Y=1 (yes); Y=0 (no)

Learn from many examples to predict Y from X

parameters

Maximum Entropy, Logistic Regression:

Features (e.g., is the phrase capitalized?)

More sophisticated: Consider dependency between different labels

(e.g. Conditional Random Fields)
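A minimal sketch of the discriminative setup, assuming plain logistic regression trained by gradient ascent in pure Python. The three features and the seven training phrases are made up for illustration; a real system would use many more of both.

```python
import math


def features(phrase, sentence_initial):
    """X: bias term, 'all words capitalized?', 'starts the sentence?'."""
    words = phrase.split()
    return [1.0,
            float(all(w[0].isupper() for w in words)),
            float(sentence_initial)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, steps=500, lr=0.1):
    """Learn the parameters by gradient ascent on the log-likelihood."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x, y in examples:
            error = y - sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i, xi in enumerate(x):
                grad[i] += error * xi
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w


# Toy training set (assumed): (phrase, sentence_initial, Y).
DATA = [
    ("Pedro Domingos", False, 1), ("Abraham Lincoln", True, 1),
    ("Bill Gates", False, 1), ("Yesterday", True, 0),
    ("spoke this", False, 0), ("example sentence", False, 0),
    ("was born", False, 0),
]
weights = train([(features(p, init), y) for p, init, y in DATA])


def is_name_probability(phrase, sentence_initial=False):
    x = features(phrase, sentence_initial)
    return sigmoid(sum(wi * xi for wi, xi in zip(weights, x)))
```

Each phrase is scored independently here. Conditional Random Fields extend the same idea by scoring whole label sequences, so that neighbouring labels can influence each other.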


Example of IE: FASTUS (1993)

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1

  • Relationship: TIE-UP
  • Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
  • Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  • Activity: ACTIVITY-1
  • Amount: NT$20000000

ACTIVITY-1

  • Activity: PRODUCTION
  • Company: “Bridgestone Sports Taiwan Co.”
  • Product: “iron and ‘metal wood’ clubs”
  • Start Date: DURING: January 1990




FASTUS

Based on finite state automata (FSA)

1. Complex Words: Recognition of multi-words and proper names (e.g. “set up”, “new Taiwan dollars”)

2. Basic Phrases: Simple noun groups, verb groups and particles (e.g. “a Japanese trading house”, “had set up”)

3. Complex Phrases: Complex noun groups and verb groups (e.g. “production of 20,000 iron and metal wood clubs”)

4. Domain Events: Patterns for events of interest to the application (e.g. [company] [set up] [Joint-Venture] with [company]). Basic templates are to be built.

5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.




Information Extraction

……….

Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday.

The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives.

……….

  • Name of the Venture: Yaxing Benz
  • Products: buses and bus chassis
  • Location: Yangzhou, China
  • Companies involved:
      (1) Name: X?  Country: Germany
      (2) Name: Y?  Country: China


Information Extraction

A German vehicle-firm executive was stabbed to death ….

……….

Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday.

The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives.

……….

Different template for crimes:

  • Crime-Type: Murder
  • Type: Stabbing
  • The killed: Name: Jurgen Pfrang; Age: 51; Profession: Deputy general manager
  • Location: Nanjing, China


Interpretation of Texts

(1) Information Retrieval/Detection: User

(2) Passage Retrieval: User

(3) Information Extraction: System

(4) Question/Answering Tasks: System

(5) Text Understanding: System


[Diagram: IR System. Labels: Queries, Characterization of Texts, Collection of Texts, Passage, Texts, Knowledge, Interpretation]

[Diagram: IE System. Labels: Texts, NLP, Structures of Sentences, Templates (Predefined), Knowledge, Interpretation]

General Framework of NLP/NLU

IE as compromise NLP


Performance Evaluation

(1) Information Retrieval/Detection: Rather clear

(2) Passage Retrieval: A bit vague

(3) Information Extraction: Rather clear

(4) Question/Answering Tasks: A bit vague

(5) Text Understanding: Very vague


Query over a Collection of Documents

N: Correct Documents

M: Retrieved Documents

C: Correct Documents that are actually retrieved

Precision: $P = \frac{C}{M}$

Recall: $R = \frac{C}{N}$

F-Value: $F = \frac{2 P \cdot R}{P + R}$


Query over a Collection of Documents (IE)

N: Correct Templates

M: Retrieved Templates

C: Correct Templates that are actually retrieved

Precision: $P = \frac{C}{M}$

Recall: $R = \frac{C}{N}$

F-Value: $F = \frac{2 P \cdot R}{P + R}$

More complicated due to partially filled templates
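The three measures translate directly into code. This is a minimal sketch assuming that items (documents or whole templates) are compared by exact identity; the function name is hypothetical, and partial credit for partially filled templates is not handled.

```python
def precision_recall_f(correct, retrieved):
    """Compute (P, R, F) from the set of correct items and the set retrieved."""
    correct, retrieved = set(correct), set(retrieved)
    c = len(correct & retrieved)                     # C: correct and retrieved
    p = c / len(retrieved) if retrieved else 0.0     # P = C / M
    r = c / len(correct) if correct else 0.0         # R = C / N
    f = 2 * p * r / (p + r) if p + r else 0.0        # F = 2PR / (P + R)
    return p, r, f


# Made-up example: N = 4 correct items, M = 3 retrieved, C = 2 in common.
p, r, f = precision_recall_f({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"})
```

Here P = 2/3, R = 1/2 and F = 4/7.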


Framework of IE

IE as compromise NLP


General Framework of NLP

  • Morphological and Lexical Processing
  • Syntactic Analysis
  • Semantic Analysis
  • Context processing / Interpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge (incomplete domain knowledge, interpretation rules)

Predefined Aspects of Information




Approaches for building IE systems

  • Knowledge Engineering Approach

    • Rules crafted by linguists in cooperation with domain experts

    • Most of the work done by inspecting a set of relevant documents


Approaches for building IE systems

  • Automatically trainable systems

    • Techniques based on statistics and almost no linguistic knowledge

    • Language independent

    • Main input: annotated corpus

    • Small effort for creating rules, but creating the annotated corpus is laborious


Techniques in IE

(1) Domain Specific Partial Knowledge: Knowledge relevant to information to be extracted

(2) Ambiguities: Ignoring irrelevant ambiguities; simpler NLP techniques

(3) Robustness: Coping with incomplete dictionaries (open class words); ignoring irrelevant parts of sentences

(4) Adaptation Techniques: Machine Learning, trainable systems


General Framework of NLP: Morphological and Lexical Processing

Part of Speech Tagger

  • FSA rules
  • Statistic taggers (local context, statistical bias)
  • 95 %

Open class words: Named entity recognition (domain dependent)

  • (ex) Locations, Persons, Companies, Organizations, Position names
  • Domain specific rules: <Word><Word>, Inc.; Mr. <Cpt-L>. <Word>
  • Machine Learning: HMM, Decision Trees
  • Rules + Machine Learning
  • F-Value: 90

Later stages: Syntactic Analysis, Semantic Analysis, Context processing, Interpretation
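The two domain specific rules, `<Word><Word>, Inc.` and `Mr. <Cpt-L>. <Word>`, can be approximated with regular expressions. In this sketch `<Word>` is taken to mean a capitalized word and `<Cpt-L>` a single capital letter; both readings are assumptions, and the test sentence is invented.

```python
import re

# <Word><Word>, Inc.  ->  two capitalized words followed by ", Inc."
COMPANY_RULE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+, Inc\.")

# Mr. <Cpt-L>. <Word>  ->  "Mr.", one capital-letter initial, a capitalized word
PERSON_RULE = re.compile(r"\bMr\. [A-Z]\. [A-Z][a-z]+")

text = "Mr. J. Smith joined Acme Widgets, Inc. on Friday."  # invented sentence
companies = COMPANY_RULE.findall(text)
persons = PERSON_RULE.findall(text)
```

Rules like these are precise but brittle: a company with one word or three words in its name is missed, which is one reason the slide pairs rules with machine learning.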


FASTUS and the General Framework of NLP

Based on finite state automata (FSA)

Morphological and Lexical Processing

1. Complex Words: Recognition of multi-words and proper names

Syntactic Analysis

2. Basic Phrases: Simple noun groups, verb groups and particles

3. Complex Phrases: Complex noun groups and verb groups

Semantic Analysis

4. Domain Events: Patterns for events of interest to the application. Basic templates are to be built.

Context processing / Interpretation

5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.




Chomsky Hierarchy

Hierarchy of Grammar         Hierarchy of Automata
Regular Grammar              Finite State Automata
Context Free Grammar         Push Down Automata
Context Sensitive Grammar    Linear Bounded Automata
Type 0 Grammar               Turing Machine

Moving down the table: computationally more complex, less efficiency.

Example: $A^n B^n$ (context free, but not regular)


Pattern-matching

[Diagram: finite state automaton with states 0 to 4 and transitions labelled PN, ’s, Art, ADJ, N, P, stepped through the example phrase one word at a time]

Example phrase: John’s interesting book with a nice cover

PN ’s (ADJ)* N P Art (ADJ)* N

{PN ’s / Art} (ADJ)* N (P Art (ADJ)* N)*
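Because the noun-group pattern is regular, it can be checked with an ordinary regular expression over the sequence of part-of-speech tags. The sketch below assumes the phrase has already been tagged by hand, and uses `POS` as a stand-in tag name for ’s.

```python
import re

# {PN 's / Art} (ADJ)* N (P Art (ADJ)* N)*  written over space-separated tags.
# "POS" stands in for the possessive 's (an assumed tag name).
NOUN_GROUP = re.compile(r"(?:PN POS|Art)(?: ADJ)* N(?: P Art(?: ADJ)* N)*")

# Hand-tagged example phrase: John's interesting book with a nice cover
tagged = [("John", "PN"), ("'s", "POS"), ("interesting", "ADJ"), ("book", "N"),
          ("with", "P"), ("a", "Art"), ("nice", "ADJ"), ("cover", "N")]

tags = " ".join(tag for _, tag in tagged)
is_noun_group = NOUN_GROUP.fullmatch(tags) is not None
```

Python's `re` module handles such patterns without needing a parser, which reflects the efficiency argument for staying at the finite-state level of the hierarchy.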




Example of IE: FASTUS (1993)

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 “metal wood” clubs a month.

1. Complex words

2. Basic Phrases:

  • Bridgestone Sports Co.: Company name
  • said: Verb Group
  • Friday: Noun Group
  • it: Noun Group
  • had set up: Verb Group
  • a joint venture: Noun Group
  • in: Preposition
  • Taiwan: Location

Attachment ambiguities are not made explicit.

Structural ambiguities of NP are ignored: “a Japanese tea house” could be read as a [Japanese tea] house or as a Japanese [tea house].


Slide73 l.jpg

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Example of IE: FASTUS (1993)

3. Complex Phrases

2. Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location


Slide74 l.jpg

[COMPANY] said Friday it [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] and [COMPANY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE], [COMPANY], capitalized at 20 million

[CURRENCY-UNIT] [START] production in [TIME]

with production of 20,000 [PRODUCT] a month.

Example of IE: FASTUS (1993)

3. Complex Phrases

2. Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

Some syntactic structures like …


Slide75 l.jpg

[COMPANY] said Friday it [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE] capitalized at [CURRENCY] [START] production in [TIME]

with production of [PRODUCT] a month.

Example of IE: FASTUS (1993)

3. Complex Phrases

2. Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

Syntactic structures relevant to the information to be extracted are dealt with.


Slide76 l.jpg

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint venture with Toyota.


Slide77 l.jpg

[Parse tree: (S (NP GM) (VP (V signed) (NP (N agreement) (VP (V setting up))))), with the VP treated as a single [SET-UP] verb group]

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint venture with Toyota.

GM plans to set up a joint venture with Toyota.

GM expects to set up a joint venture with Toyota.


Slide78 l.jpg

[SET-UP]

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint venture with Toyota.

[Parse tree: (S (NP GM) (VP (V set up)))]

GM plans to set up a joint venture with Toyota.

GM expects to set up a joint venture with Toyota.
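The way these variants collapse onto one [SET-UP] verb group can be sketched with a single regular expression. The sub-pattern names and the normalize function are assumptions made for illustration, and the expression covers only the six sentences listed here. The sketch also discards the difference between an actual and a planned event ("plans to", "expects to"), which a real system would need to record.

```python
import re

# Optional prefixes that wrap the core verb "set up" in the listed variants.
REPORT = r"(?:announced it was )?"
INTENT = r"(?:(?:plans|expects) to )?"
AGREEMENT = r"(?:sign(?:ed|ing) an agreement (?:to )?)?"
CORE = r"set(?:ting)? up"

SET_UP_GROUP = re.compile(REPORT + INTENT + AGREEMENT + CORE)

VARIANTS = [
    "GM set up a joint venture with Toyota.",
    "GM announced it was setting up a joint venture with Toyota.",
    "GM signed an agreement setting up a joint venture with Toyota.",
    "GM announced it was signing an agreement to set up a joint venture with Toyota.",
    "GM plans to set up a joint venture with Toyota.",
    "GM expects to set up a joint venture with Toyota.",
]


def normalize(sentence):
    """Replace the complex verb group with the [SET-UP] marker."""
    return SET_UP_GROUP.sub("[SET-UP]", sentence, count=1)


for sentence in VARIANTS:
    print(normalize(sentence))
```

Every variant reduces to the same string, so a single domain pattern downstream is enough to cover all of them.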


Slide79 l.jpg

[COMPANY] [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE] capitalized at [CURRENCY] [START] production in [TIME]

with production of [PRODUCT] a month.

The attachment positions of PP are determined at this stage.

Irrelevant parts of sentences are ignored.

Example of IE: FASTUS (1993)

3. Complex Phrases

4. Domain Events

[COMPANY] [SET-UP] [JOINT-VENTURE] with [COMPANY]

[COMPANY] [SET-UP] [JOINT-VENTURE] (others)* with [COMPANY]
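A domain-event pattern of this shape can be matched with a simple scan over the labeled phrases. The sketch below is only an illustration: the (label, text) input format, the "WORD" label for unclassified words, the slot names "company" and "partner", and the function name match_joint_venture are all assumptions, not part of FASTUS.

```python
def match_joint_venture(phrases):
    """Match [COMPANY] [SET-UP] [JOINT-VENTURE] (others)* with [COMPANY].

    phrases is a list of (label, text) pairs from the earlier stages.
    Returns a small template dict, or None when the pattern does not match.
    """
    n = len(phrases)
    for i in range(n - 2):
        labels = [label for label, _ in phrases[i:i + 3]]
        if labels != ["COMPANY", "SET-UP", "JOINT-VENTURE"]:
            continue
        # (others)*: skip anything until "with" directly followed by a company.
        for j in range(i + 3, n - 1):
            if phrases[j] == ("WORD", "with") and phrases[j + 1][0] == "COMPANY":
                return {"company": phrases[i][1], "partner": phrases[j + 1][1]}
    return None


PHRASES = [
    ("COMPANY", "Bridgestone Sports Co."),
    ("SET-UP", "had set up"),
    ("JOINT-VENTURE", "a joint venture"),
    ("WORD", "in"),
    ("LOCATION", "Taiwan"),
    ("WORD", "with"),
    ("COMPANY", "a local concern"),
]

print(match_joint_venture(PHRASES))
```

The inner loop is what the (others)* part of the pattern buys: "in Taiwan" is skipped without being analyzed, so irrelevant material between the event and its partner does not block the match.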


Slide80 l.jpg

Complications caused by syntactic variations

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG] Relpro {NG/others}* [VG]
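The two patterns can be read as two passes over the same phrase sequence: one pairs the head noun group with the verb group of the relative clause, the other pairs it with the main verb group. The following sketch assumes the phrases are already labeled; the one-letter codes and the function name subject_verb_pairs are assumptions made for illustration.

```python
import re

# One letter per phrase label so the patterns can run over the label sequence.
CODES = {"NG": "N", "Relpro": "R", "VG": "V", "other": "O"}

# [NG] Relpro {NG/others}* [VG]: subject and verb of the relative clause.
RELATIVE = re.compile(r"(N)R[NO]*(V)")
# [NG] Relpro {NG/others}* [VG] {NG/others}* [VG]: subject and main verb.
MAIN = re.compile(r"(N)R[NO]*V[NO]*(V)")


def subject_verb_pairs(phrases):
    """Return (subject, verb group) pairs found by the two patterns."""
    codes = "".join(CODES[label] for label, _ in phrases)
    pairs = []
    for pattern in (RELATIVE, MAIN):
        match = pattern.search(codes)
        if match:
            subject = phrases[match.start(1)][1]
            verb = phrases[match.start(2)][1]
            pairs.append((subject, verb))
    return pairs


SENTENCE = [
    ("NG", "The mayor"), ("Relpro", "who"), ("VG", "was kidnapped"),
    ("NG", "yesterday"), ("VG", "was found"), ("other", "dead"), ("NG", "today"),
]

print(subject_verb_pairs(SENTENCE))
```

Both pairs share the subject "The mayor", which is how a flat pattern matcher recovers the two events without building a full parse of the relative clause.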


Slide82 l.jpg

Basic patterns → Surface Pattern Generator (relative clause construction, passivization, etc.) → Patterns used by Domain Event

Complications caused by syntactic variations

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG] Relpro {NG/others}* [VG]


Slide83 l.jpg

FASTUS

Based on finite-state automata (FSA)

NP, who was kidnapped, was found.

1. Complex Words:

2. Basic Phrases:

3. Complex Phrases:

4. Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Piece-wise recognition of basic templates

5. Merging Structures:

Templates from different parts of the texts are merged if they provide information about the same entity or event.

Reconstructing information carried via syntactic structures by merging basic templates
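The merging step can be sketched as a slot-by-slot unification of two partial templates. This is a minimal illustration, not the FASTUS merging algorithm: templates are plain dicts, the slot names are hypothetical, and the two templates are assumed to have already been linked to the same joint venture (in the running example that link itself requires resolving "the joint venture" in the second sentence).

```python
def merge_templates(a, b):
    """Merge two partial templates, or return None if they are incompatible.

    Two templates are merged only when they share at least one filled slot
    and agree on every slot that both of them fill.
    """
    shared = [k for k in a if k in b and a[k] is not None and b[k] is not None]
    if not shared or any(a[k] != b[k] for k in shared):
        return None
    merged = dict(a)
    for slot, value in b.items():
        if merged.get(slot) is None:
            merged[slot] = value  # fill gaps with information from b
    return merged


# Hypothetical templates built from the two sentences of the example text.
tie_up = {
    "joint_venture": "Bridgestone Sports Taiwan Co.",
    "partner": "Bridgestone Sports Co.",
    "product": "golf clubs",
    "start_date": None,
}
activity = {
    "joint_venture": "Bridgestone Sports Taiwan Co.",
    "start_date": "January 1990",
    "capital": "20 million new Taiwan dollars",
}

print(merge_templates(tie_up, activity))
```

Each template is recognized piece-wise from one clause, and the merged result carries the information that the sentence structure spread over several clauses.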


Slide87 l.jpg

Current state of the art of IE

  • Carefully constructed IE systems

  • F-measure at roughly the 60% level (inter-annotator agreement: 60-80%)

  • Domain: telegraphic messages about naval operations (MUC-1: 1987, MUC-2: 1989)

  • News articles and transcriptions of radio broadcasts about Latin American terrorism (MUC-3: 1991, MUC-4: 1992)

  • News articles about joint ventures (MUC-5: 1993)

  • News articles about management changes (MUC-6: 1995)

  • News articles about space vehicles (MUC-7: 1997)

  • Handcrafted rules (named entity recognition, domain events, etc.)

Automatic learning from texts:

Supervised learning: corpus preparation

Non-supervised, or controlled learning
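For reference, the F-measure quoted for these systems combines precision and recall. The sketch below uses the standard weighted harmonic mean; the function name and the slot counts in the example are hypothetical and are not MUC results.

```python
def f_measure(correct, extracted, expected, beta=1.0):
    """F-measure from slot counts.

    correct:   slots the system filled correctly
    extracted: slots the system filled at all
    expected:  slots filled in the answer key
    beta:      weight of recall relative to precision (1.0 = balanced)
    """
    if correct == 0 or extracted == 0 or expected == 0:
        return 0.0
    precision = correct / extracted
    recall = correct / expected
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# Hypothetical counts: 60 correct out of 100 extracted and 100 expected slots.
score = f_measure(correct=60, extracted=100, expected=100)
print(f"F = {score:.2f}")
```

With an inter-annotator agreement of 60-80%, a system score at this level is not far below what human annotators agree on.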

