information extraction l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Information Extraction PowerPoint Presentation
Download Presentation
Information Extraction

Loading in 2 Seconds...

play fullscreen
1 / 87

Information Extraction - PowerPoint PPT Presentation


  • 195 Views
  • Uploaded on

Information Extraction. Adapted from slides by Junichi Tsujii, Ronen Feldman and others. Email Insurance claims News articles Web pages Patent portfolios …. Customer complaint letters Contracts Transcripts of phone calls with customers Technical documents ….

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Information Extraction' - khanh


Download Now An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
information extraction

Information Extraction

Adapted from slides by Junichi Tsujii, Ronen Feldman and others

most data are unstructured text or semi structured
Email

Insurance claims

News articles

Web pages

Patent portfolios

Customer complaint letters

Contracts

Transcripts of phone calls with customers

Technical documents

Most Data are Unstructured (Text) or Semi-Structured…

Text data mining has become more and more important…

(Adapted from J. Dorre et al. “Text Mining: Finding Nuggets in Mountains of Textual Data”)

slide3

Application Tasks of NLP

(1)Information Retrieval/Detection

To search and retrieve documents in response to queries

for information

(2)Passage Retrieval

To search and retrieve part of documents in response

to queries for information

(3)Information Extraction

To extract information that fits pre-defined database schemas

or templates, specifying the output formats

(4) Question/Answering Tasks

To answer general questions by using texts as knowledge

base: Fact retrieval, combination of IR and IE

(5)Text Understanding

To understand texts as people do: Artificial Intelligence

information extraction a pragmatic approach
Information Extraction:A Pragmatic Approach
  • Let application requirements drive semantic analysis
  • Identify the types of entities that are relevant to a particular task
  • Identify the range of facts that one is interested in for those entities
  • Ignore everything else
ie definitions
IE definitions
  • Entity: an object of interest such as a person or organization
  • Attribute: A property of an entity such as name, alias, descriptor or type
  • Fact: A relationship held between two or more entities such as Position of Person in Company
  • Event: An activity involving several entities such as terrorist act, airline crash, product information
ie accuracy typical figures by information type
IE accuracy typical figures by information type
  • Entity: 90-98%
  • Attribute: 80%
  • Fact: 60-70%
  • Event: 50-60%
muc conferences
MUC conferences
  • MUC 1 to MUC 7
  • 1987 to 1997
  • Topics:
    • Naval operations (2)
    • Terrorist Activity (2)
    • Joint venture and microelectronics
    • Management changes
    • Space Vehicles and Missile launches
muc and scenario templates
MUC and Scenario Templates
  • Define a set of “interesting entities”
    • Persons, organizations, locations…
  • Define a complex scenario involving interesting events and relations over entities
    • Example: management succession: persons, companies, positions, reasons for succession
  • This collection of entities and relations is called a “scenario template.”
problems with scenario template
Problems with Scenario Template
  • Encouraged development of highly domain specific ontologies, rule systems, heuristics, etc.
  • Most of the effort expended on building a scenario template system was not directly applicable to a different scenario template.
addressing the problem
Addressing the Problem
  • Address a large number of smaller, more focused scenario templates (Event-99)
  • Develop a more systematic ground-up approach to semantics by focusing on elementary entities, relations, and events (ACE)
the ace evaluation
The ACE Evaluation
  • The ACE program – challenge of extracting content from human language. Research effort directed to master
    • first the extraction of “entities”
    • Then the extraction of “relations” among these entities
    • Finally the extraction of “events” that are causally related sets of relations
  • After two years, top systems successfully capture well over 50 % of the value at the entity level
the ace program
The ACE Program
  • “Automated Content Extraction”
  • Develop core information extraction technology by focusing on extracting specific semantic entities and relations over a very wide range of texts.
  • Corpora: Newswire and broadcast transcripts, but broad range of topics and genres.
    • Third person reports
    • Interviews
    • Editorials
    • Topics: foreign relations, significant events, human interest, sports, weather
  • Discourage highly domain- and genre-dependent solutions
applications of ie
Applications of IE
  • Routing of information
  • Infrastructure for IR and categorization (higher level features)
  • Event based summarization
  • Automatic creation of databases and knowledge bases
where would ie be useful
Where would IE be useful?
  • Semi-structured text
  • Generic documents like news articles
  • Most of the information in the doc is centered around a set of easily identifiable entities
the problem
The Problem

Date

Time: Start - End

Location

Speaker

Person

what is information extraction
What is “Information Extraction”

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

Courtesy of William W. Cohen

what is information extraction17
What is “Information Extraction”

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE

NAME TITLE ORGANIZATION

Bill GatesCEOMicrosoft

Bill VeghteVPMicrosoft

Richard StallmanfounderFree Soft..

Courtesy of William W. Cohen

what is information extraction18
What is “Information Extraction”

Information Extraction =

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation

aka “named entity extraction”

Courtesy of William W. Cohen

what is information extraction19
What is “Information Extraction”

Information Extraction =

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation

Courtesy of William W. Cohen

what is information extraction20
What is “Information Extraction”

Information Extraction =

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation

Courtesy of William W. Cohen

what is information extraction21

NAME

TITLE ORGANIZATION

Bill Gates

CEO

Microsoft

Bill

Veghte

VP

Microsoft

Stallman

founder

Free Soft..

Richard

What is “Information Extraction”

Information Extraction =

segmentation + classification + association+ clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation

*

*

*

*

Courtesy of William W. Cohen

landscape of ie tasks single field record
Landscape of IE Tasks:Single Field/Record

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity

Binary relationship

N-ary record

Person: Jack Welch

Relation: Person-Title

Person: Jack Welch

Title: CEO

Relation: Succession

Company: General Electric

Title: CEO

Out: Jack Welsh

In: Jeffrey Immelt

Person: Jeffrey Immelt

Relation: Company-Location

Company: General Electric

Location: Connecticut

Location: Connecticut

“Named entity” extraction

landscape of ie techniques

Classify Pre-segmentedCandidates

Sliding Window

Abraham Lincoln was born in Kentucky.

Abraham Lincoln was born in Kentucky.

Classifier

Classifier

which class?

which class?

Try alternatewindow sizes:

Context Free Grammars

Boundary Models

Finite State Machines

Abraham Lincoln was born in Kentucky.

Abraham Lincoln was born in Kentucky.

Abraham Lincoln was born in Kentucky.

BEGIN

Most likely state sequence?

NNP

NNP

V

V

P

NP

Most likely parse?

Classifier

PP

which class?

VP

NP

VP

BEGIN

END

BEGIN

END

S

Landscape of IE Techniques

Lexicons

Abraham Lincoln was born in Kentucky.

member?

Alabama

Alaska

Wisconsin

Wyoming

Courtesy of William W. Cohen

ie with hidden markov models
IE with Hidden Markov Models

Given a sequence of observations:

Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM:

person name

location name

background

Find the most likely state sequence: (Viterbi)

YesterdayPedro Domingosspoke this example sentence.

Any words said to be generated by the designated “person name”

state extract as a person name:

Person name: Pedro Domingos

hmm for segmentation
HMM for Segmentation
  • Simplest Model: One state per entity type
discriminative approaches
Discriminative Approaches

Yesterday Pedro Domingos spoke this example sentence.

Is this phrase (X) a name? Y=1 (yes); Y=0 (no)

Learn from many examples to predict Y from X

parameters

Maximum Entropy, Logistic Regression:

Features (e.g., is the phrase capitalized?)

More sophisticated: Consider dependency between different labels

(e.g. Conditional Random Fields)

slide27

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)

slide28

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)

slide29

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)

slide30

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)

slide31

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)

slide32

production of

20, 000 iron and

metal wood clubs

[company]

[set up]

[Joint-Venture]

with

[company]

FASTUS

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

set up

new Twaiwan dallors

2.Basic Phrases:

Simple noun groups, verb groups and particles

a Japanese trading house

had set up

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide33

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 iron and “metal wood” clubs a month.

ACTIVITY-1

Activity: PRODUCTION

Company:

“Bridgestone Sports Taiwan Co.”

Product:

“iron and ‘metal wood’ clubs”

Start Date:

DURING: January 1990

TIE-UP-1

Relationship: TIE-UP

Entities: “Bridgestone Sport Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company:

“Bridgestone Sports Taiwan Co.”

Activity: ACTIVITY-1

Amount: NT$200000000

Example of IE: FASTUS(1993)

slide34

Information Extraction

……….

Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the

second floor of his Nanjing home early on Sunday.

The deputy general manager of Yaxing Benz, a Sino-German

joint venture that makes buses and bus chassis in nearby Yangzhou,

was hacked to death with 45 cm watermelon knives.

……….

Name of the Venture: Yaxing Benz

Products: buses and bus chassis

Location: Yangzhou,China

Companies involved: (1)Name: X?

Country: German

(2)Name: Y?

Country: China

slide35

Information Extraction

A German vehicle-firm executive was stabbed to death ….

……….

Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the

second floor of his Nanjing home early on Sunday.

The deputy general manager of Yaxing Benz, a Sino-German

joint venture that makes buses and bus chassis in nearby Yangzhou,

was hacked to death with 45 cm watermelon knives.

……….

Crime-Type: Murder

Type: Stabbing

The killed: Name: Jurgen Pfrang

Age: 51

Profession: Deputy general manager

Location: Nanjing, China

Different template

for crimes

slide36

User

User

System

System

System

Interpretation of Texts

(1)Information Retrieval/Detection

(2)Passage Retrieval

(3)Information Extraction

(4) Question/Answering Tasks

(5)Text Understanding

slide37

Characterization of Texts

Queries

IR System

Collection of Texts

slide38

Knowledge

Interpretation

Characterization of Texts

Queries

IR System

Collection of Texts

slide39

Knowledge

Interpretation

Characterization of Texts

Queries

Passage

IR System

Collection of Texts

slide40

Knowledge

Interpretation

Characterization of Texts

Queries

Structures

of

Sentences

NLP

Templates

Passage

IR System

IE System

Collection of Texts

Texts

slide41

Knowledge

Interpretation

IE System

Templates

Texts

slide42

Knowledge

General Framework

of

NLP/NLU

IE as

compromise NLP

Interpretation

IE System

Templates

Texts

Predefined

slide43

Rather clear

A bit vague

Rather clear

A bit vague

Very vague

Performance Evaluation

(1)Information Retrieval/Detection

(2)Passage Retrieval

(3)Information Extraction

(4) Question/Answering Tasks

(5)Text Understanding

slide44

Query

N: Correct Documents

M:Retrieved Documents

C: Correct Documents that are

actually retrieved

N

M

C

C

Precision:

Recall:

P

M

N

C

R

2P・R

F-Value:

P+R

Collection of Documents

slide45

Query

N: Correct Templates

M:Retrieved Templates

C: Correct Templates that are

actually retrieved

N

M

C

C

Precision:

Recall:

P

M

N

C

R

2P・R

F-Value:

P+R

Collection of Documents

More complicated due to partially

filled templates

slide46

Framework of IE

IE as compromise NLP

slide47

Predefined

Aspects of

Information

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Incomplete

Domain Knowledge

Interpretation Rules

Context processing

Interpretation

slide48

Predefined

Aspects of

Information

Difficulties of NLP

General Framework of NLP

(1) Robustness:

Incomplete Knowledge

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Analysis

Incomplete

Domain Knowledge

Interpretation Rules

Context processing

Interpretation

approaches for building ie systems
Approaches for building IE systems
  • Knowledge Engineering Approach
    • Rules crafted by linguists in cooperation with domain experts
    • Most of the work done by insoecting a set of relevant documents
approaches for building ie systems50
Approaches for building IE systems
  • Automatically trainable systems
    • Techniques based on statistics and almost no linguistic knowledge
    • Language independent
    • Main input – annotated corpus
    • Small effort for creating rules, but crating annotated corpus laborious
slide51

Techniques in IE

(1) Domain Specific Partial Knowledge:

Knowledge relevant to information to be extracted

(2) Ambiguities:

Ignoring irrelevant ambiguities

Simpler NLP techniques

(3) Robustness:

Coping with Incomplete dictionaries

(open class words)

Ignoring irrelevant parts of sentences

(4) Adaptation Techniques:

Machine Learning, Trainable systems

slide52

95 %

FSA rules

Statistic taggers

Part of Speech Tagger

Local Context

Statistical Bias

F-Value

90

Domain

Dependent

General Framework of NLP

Open class words:

Named entity recognition

(ex) Locations

Persons

Companies

Organizations

Position names

Morphological and

Lexical Processing

Syntactic Analysis

Semantic Anaysis

Domain specific rules:

<Word><Word>, Inc.

Mr. <Cpt-L>. <Word>

Machine Learning:

HMM, Decision Trees

Rules + Machine Learning

Context processing

Interpretation

slide53

FASTUS

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Anaysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide54

FASTUS

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Anaysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide55

FASTUS

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Analysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide56

Computationally more complex, Less Efficiency

Chomsky Hierarchy Hierarchy

of Grammar of Automata

Regular Grammar Finite State Automata

Context Free Grammar Push Down Automata

Context Sensitive Grammar Linear Bounded Automata

Type 0 Grammar Turing Machine

slide57

n

n

A B

Computationally more complex, Less Efficiency

Chomsky Hierarchy Hierarchy

of Grammar of Automata

Regular Grammar Finite State Automata

Context Free Grammar Push Down Automata

Context Sensitive Grammar Linear Bounded Automata

Type 0 Grammar Turing Machine

slide58

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN

slide59

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN

slide60

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN

slide61

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’s interesting

book with a nice cover

P

4

PN

slide62

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

book with a nice cover

P

4

PN

slide63

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwith a nice cover

P

4

PN

slide64

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwith a nice cover

P

4

PN

slide65

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwitha nice cover

P

4

PN

slide66

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwithanice cover

P

4

PN

slide67

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwithanicecover

P

4

PN

slide68

Pattern-maching

PN ’s (ADJ)* N P Art (ADJ)* N

{PN ’s/ Art}(ADJ)* N(P Art (ADJ)* N)*

1

’s

PN

Art

2

0

ADJ

N

’s

Art

3

John’sinteresting

bookwithanicecover

P

4

PN

slide69

FASTUS

General Framework of NLP

Based on finite states automata (FSA)

1.Complex Words:

Recognition of multi-words and proper names

Morphological and

Lexical Processing

2.Basic Phrases:

Simple noun groups, verb groups and particles

Syntactic Analysis

3.Complex phrases:

Complex noun groups and verb groups

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Semantic Analysis

Context processing

Interpretation

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide70

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Attachment

Ambiguities

are not made

explicit

Example of IE: FASTUS(1993)

1.Complex words

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

slide71

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Example of IE: FASTUS(1993)

{{

}}

1.Complex words

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

a Japanese tea house

a [Japanese tea] house

a Japanese [tea house]

slide72

Bridgestone Sports Co. said Friday it had set up a joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubs a month.

Structural

Ambiguities of

NP are ignored

Example of IE: FASTUS(1993)

1.Complex words

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

slide73

Bridgestone Sports Co. said Friday it had set upa joint venture

in Taiwan with a local concern and a Japanese trading house to

produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20

million new Taiwan dollars, will start production in January 1990

with production of 20,000 “metal wood” clubsa month.

Example of IE: FASTUS(1993)

3.Complex Phrases

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

slide74

[COMPNY] said Friday it [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] and [COMPNY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE], [COMPNY], capitalized at 20 million

[CURRENCY-UNIT][START] production in [TIME]

with production of 20,000 [PRODUCT] a month.

Example of IE: FASTUS(1993)

3.Complex Phrases

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

Some syntactic structures

like …

slide75

[COMPNY] said Friday it [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE] capitalized at [CURRENCY][START] production in [TIME]

with production of [PRODUCT] a month.

Example of IE: FASTUS(1993)

3.Complex Phrases

2.Basic Phrases:

Bridgestone Sports Co.: Company name

said : Verb Group

Friday : Noun Group

it : Noun Group

had set up : Verb Group

a joint venture : Noun Group

in : Preposition

Taiwan : Location

Syntactic structures relevant

to information to be extracted

are dealt with.

slide76

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint

venture with Toyota.

slide77

S

NP

VP

GM

[SET-UP]

V

NP

signed

VP

N

agreement

V

setting up

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint

venture with Toyota.

GM plans to set up a joint venture with Toyota.

GM expects to set up a joint venture with Toyota.

slide78

[SET-UP]

Syntactic variations

GM set up a joint venture with Toyota.

GM announced it was setting up a joint venture with Toyota.

GM signed an agreement setting up a joint venture with Toyota.

GM announced it was signing an agreement to set up a joint

venture with Toyota.

S

NP

VP

GM

V

set up

GM plans to set up a joint venture with Toyota.

GM expects to set up a joint venture with Toyota.

slide79

[COMPNY] [SET-UP] [JOINT-VENTURE]

in [LOCATION] with [COMPANY] to

produce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE] capitalized at [CURRENCY][START] production in [TIME]

with production of [PRODUCT] a month.

The attachment positions of PP are determined at this stage.

Irrelevant parts of sentences are ignored.

Example of IE: FASTUS(1993)

3.Complex Phrases

4.Domain Events

[COMPANY][SET-UP][JOINT-VENTURE]with[COMPNY]

[COMPANY][SET-UP][JOINT-VENTURE] (others)* with[COMPNY]

slide80

Complications caused by syntactic variations

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG] Relpro {NG/others}* [VG]

slide81

Complications caused by syntactic variations

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG]Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG]Relpro {NG/others}*[VG]

slide82

Patterns used

by Domain Event

Surface Pattern

Generator

Basic patterns

Relative clause construction

Passivization, etc.

Complications caused by syntactic variations

Relative clause

The mayor, who waskidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG]

[NG] Relpro {NG/others}* [VG]

slide83

Piece-wise recognition

of basic templates

Reconstructing information

carried via syntactic structures

by merging basic templates

FASTUS

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

slide84

FASTUS

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Piece-wise recognition

of basic templates

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

Reconstructing information

carried via syntactic structures

by merging basic templates

slide85

FASTUS

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Piece-wise recognition

of basic templates

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

Reconstructing information

carried via syntactic structures

by merging basic templates

slide86

FASTUS

Based on finite states automata (FSA)

NP, who was kidnapped, was found.

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:

Patterns for events of interest to the application

Basic templates are to be built.

Piece-wise recognition

of basic templates

5. Merging Structures:

Templates from different parts of the texts are

merged if they provide information about the

same entity or event.

Reconstructing information

carried via syntactic structures

by merging basic templates

slide87

Current state of the arts of IE

  • Carefully constructed IE systems
  • F-60 level (interannotater agreement: 60-80%)
  • Domain: telegraphic messages about naval operation
  • (MUC-1:87, MUC-2:89)
  • news articles and transcriptions of radio broadcasts
  • Latin American terrorism (MUC-3:91, MUC-4:1992)
  • News articles about joint ventures (MUC-5, 93)
  • News articles about management changes (MUC-6, 95)
  • News articles about space vehicle (MUC-7, 97)
  • Handcrafted rules (named entity recognition, domain events, etc)

Automatic learning from texts:

Supervised learning : corpus preparation

Non-supervised, or controlled learning