Toward Unified Models of Information Extraction and Data Mining
Presentation Transcript

Toward Unified Models of Information Extraction and Data Mining

Andrew McCallum

Information Extraction and Synthesis Laboratory

Computer Science Department

University of Massachusetts Amherst

Joint work with

Aron Culotta, Charles Sutton, Ben Wellner, Khashayar Rohanimanesh, Wei Li


Goal:

Improving our ability to mine actionable knowledge from unstructured text.


Pages Containing the Phrase "high tech job openings"


Extracting Job Openings from the Web

foodscience.com-Job2

JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1




Job Openings:

Category = High Tech

Keyword = Java

Location = U.S.



IE from Chinese Documents Regarding Weather

Department of Terrestrial System, Chinese Academy of Sciences

200k+ documents

several millennia old

- Qing Dynasty Archives

- memos

- newspaper articles

- diaries


What is "Information Extraction"

As a family of techniques:

Information Extraction =

segmentation + classification + clustering + association

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted mentions: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation


What is "Information Extraction"

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

(The same news-article example is repeated in this step of the build.)


What is "Information Extraction"

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

(The same news-article example is repeated in this step of the build.)


What is "Information Extraction"

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

(The same news-article example is repeated; this final step of the build shows the resulting database table, with columns NAME, TITLE, and ORGANIZATION. Recoverable entries include Bill Gates / CEO / Microsoft, Veghte / VP / Microsoft, and Stallman / founder / Free Soft.., with asterisks marking some cells.)


Larger Context

(Pipeline diagram:)

Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)


Problem:

Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities.

KD begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.

IE is unaware of emerging patterns and regularities in the DB.

The accuracy of both suffers, and significant mining of complex text sources is beyond reach.


Solution:

(Same pipeline diagram as before, with two new connections:)

Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)

IE passes Uncertainty Info forward to Data Mining; Data Mining feeds Emerging Patterns back to IE.


Solution: Unified Model

Discriminatively-trained undirected graphical models

Conditional Random Fields [Lafferty, McCallum, Pereira]

Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]

Complex Inference and Learning: just what we researchers like to sink our teeth into!

(Pipeline diagram: Document collection → Spider → Filter → a single Probabilistic Model spanning IE (Segment, Classify, Associate, Cluster) and Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support))


Outline

  • The need for unified IE and DM.

  • Review of Conditional Random Fields for IE.

  • Preliminary steps toward unification:

    • Joint Co-reference Resolution (Graph Partitioning)

    • Joint Labeling of Cascaded Sequences (Belief Propagation)

    • Joint Segmentation and Co-ref (Iterated Conditional Sampling)

  • Conclusions


Hidden Markov Models

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …

(Diagram: finite-state / graphical model with hidden states S_{t-1}, S_t, S_{t+1}, … linked by transitions and emitting observations O_{t-1}, O_t, O_{t+1}, …; the model generates a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.)

Parameters: for all states S = {s1, s2, …}

Start state probabilities: P(s_t)

Transition probabilities: P(s_t | s_{t-1})

Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet

Training: maximize probability of training observations (w/ prior)
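Not shown on the slide, but for reference: these parameters define the usual HMM joint probability, which training maximizes over the observed sequences,

    P(s_{1:T}, o_{1:T}) \;=\; P(s_1)\, P(o_1 \mid s_1) \prod_{t=2}^{T} P(s_t \mid s_{t-1})\, P(o_t \mid s_t).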


IE with Hidden Markov Models

Given a sequence of observations:

Yesterday Rich Caruana spoke this example sentence.

and a trained HMM with states for:

person name
location name
background

Find the most likely state sequence (Viterbi):

Yesterday [Rich Caruana] spoke this example sentence.

Any words said to be generated by the designated "person name" state are extracted as a person name:

Person name: Rich Caruana
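A minimal sketch of the Viterbi decoding step described above, with toy hand-set states and probabilities (the parameters here are illustrative, not from the talk):

    import math

    # Toy HMM: states and illustrative (not real) parameters.
    states = ["background", "person", "location"]
    start_p = {"background": 0.8, "person": 0.1, "location": 0.1}
    trans_p = {s: {s2: (0.8 if s == s2 else 0.1) for s2 in states} for s in states}

    def emit_p(state, word):
        # Illustrative emission model: capitalized words are more likely person/location.
        if word[0].isupper():
            return 0.3 if state in ("person", "location") else 0.05
        return 0.05 if state in ("person", "location") else 0.3

    def viterbi(words):
        # V[t][s] = best log-prob of any state sequence ending in state s at position t.
        V = [{s: math.log(start_p[s]) + math.log(emit_p(s, words[0])) for s in states}]
        back = []
        for t in range(1, len(words)):
            V.append({})
            back.append({})
            for s in states:
                best_prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
                back[-1][s] = best_prev
                V[t][s] = V[t - 1][best_prev] + math.log(trans_p[best_prev][s]) + math.log(emit_p(s, words[t]))
        # Trace back the best path.
        last = max(states, key=lambda s: V[-1][s])
        path = [last]
        for b in reversed(back):
            path.append(b[path[-1]])
        return list(reversed(path))

    print(viterbi("Yesterday Rich Caruana spoke this example sentence .".split()))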


We Want More than an Atomic View of Words

Would like richer representation of text: many arbitrary, overlapping features of the words, e.g.

  • identity of word
  • ends in "-ski"
  • is capitalized
  • is part of a noun phrase
  • is in a list of city names
  • is under node X in WordNet
  • is in bold font
  • is indented
  • is in hyperlink anchor
  • last person name was female
  • next two words are "and Associates"

(Diagram: the same state sequence S_{t-1}, S_t, S_{t+1}, with each observation now described by overlapping features such as is "Wisniewski", part of noun phrase, ends in "-ski".)
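A small sketch of what such overlapping, non-independent word features might look like as binary feature functions (this particular feature set is illustrative, not the one used in the talk):

    CITY_NAMES = {"Amherst", "Boston", "Tulsa"}  # illustrative lexicon

    def word_features(words, t):
        """Return a dict of binary features for the word at position t."""
        w = words[t]
        feats = {
            "word=" + w.lower(): 1,
            "is_capitalized": int(w[0].isupper()),
            "ends_in_ski": int(w.lower().endswith("ski")),
            "in_city_list": int(w in CITY_NAMES),
            # Features may freely look at other positions in the observation sequence.
            "next_two=and_Associates": int(words[t + 1:t + 3] == ["and", "Associates"]),
        }
        return feats

    print(word_features("John Wisniewski and Associates".split(), 1))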


Problems with Richer Representation and a Joint Model

These arbitrary features are not independent:

  • Multiple levels of granularity (chars, words, phrases)

  • Multiple dependent modalities (words, formatting, layout)

  • Past & future

Two choices:

Ignore the dependencies. This causes "over-counting" of evidence (a la naïve Bayes). Big problem when combining evidence, as in Viterbi!

Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!



Conditional Sequence Models

  • We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).

    • Can examine features, but not responsible for generating them.

    • Don't have to explicitly model their dependencies.

    • Don't "waste modeling effort" trying to generate what we are given at test time anyway.


From HMMs to Conditional Random Fields

[Lafferty, McCallum, Pereira 2001]

Joint (HMM): the states S_{t-1}, S_t, S_{t+1}, … generate the observations O_{t-1}, O_t, O_{t+1}, …

Conditional: the same chain over S_{t-1}, S_t, S_{t+1}, …, but conditioned on the observations O_{t-1}, O_t, O_{t+1}, … rather than generating them.

(A super-special case of Conditional Random Fields.)

Set parameters by maximum likelihood, using an optimization method on the gradient of the log-likelihood.
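The conditional distribution itself did not survive the transcript; in the standard linear-chain form from the cited paper it is

    P_\Lambda(s \mid o) \;=\; \frac{1}{Z(o)} \prod_{t} \exp\Big(\sum_k \lambda_k\, f_k(s_{t-1}, s_t, o, t)\Big),
    \qquad Z(o) \;=\; \sum_{s'} \prod_{t} \exp\Big(\sum_k \lambda_k\, f_k(s'_{t-1}, s'_t, o, t)\Big).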


Conditional Random Fields

[Lafferty, McCallum, Pereira 2001]

1. FSM special case: linear chain among unknowns, parameters tied across time steps.

   (Diagram: states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4} conditioned on O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}.)

2. In general: CRFs = "conditionally-trained Markov network", arbitrary structure among unknowns.

3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: parameters tied across hits from SQL-like queries ("clique templates").


Training CRFs

The gradient of the log-likelihood with respect to each parameter is

(feature count using correct labels) - (expected feature count using predicted labels) - (smoothing penalty):

    \frac{\partial L}{\partial \lambda_k} \;=\; \sum_i f_k(s^{(i)}, o^{(i)}) \;-\; \sum_i \sum_{s} P_\Lambda(s \mid o^{(i)})\, f_k(s, o^{(i)}) \;-\; \frac{\lambda_k}{\sigma^2}


Linear-chain CRFs vs. HMMs

  • Comparable computational efficiency for inference

  • Features may be arbitrary functions of any or all observations

    • Parameters need not fully specify generation of observations; can require less training data

    • Easy to incorporate domain knowledge


Main Point #1

Conditional probability sequence models

give great flexibility regarding features used,

and have efficient dynamic-programming-based algorithms for inference.


Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was

slightly below 1994. Producer returns averaged $12.93 per hundredweight,

$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,

1 percent above 1994. Marketings include whole milk sold to plants and dealers

as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced,

8 percent less than 1994. Calves were fed 78 percent of this milk with the

remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat:

United States, 1993-95

--------------------------------------------------------------------------------

: : Production of Milk and Milkfat 2/

: Number :-------------------------------------------------------

Year : of : Per Milk Cow : Percentage : Total

:Milk Cows 1/:-------------------: of Fat in All :------------------

: : Milk : Milkfat : Milk Produced : Milk : Milkfat

--------------------------------------------------------------------------------

: 1,000 Head --- Pounds --- Percent Million Pounds

:

1993 : 9,589 15,704 575 3.66 150,582 5,514.4

1994 : 9,500 16,175 592 3.66 153,664 5,623.7

1995 : 9,461 16,451 602 3.66 155,644 5,694.3

--------------------------------------------------------------------------------

1/ Average number during year, excluding heifers not yet fresh.

2/ Excludes milk sucked by calves.


Table Extraction from Government Reports

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

100+ documents from www.fedstats.gov

Labels:

CRF

  • Non-Table

  • Table Title

  • Table Header

  • Table Data Row

  • Table Section Data Row

  • Table Footnote

  • ... (12 in all)

(Same government report text as on the previous slide.)

Features:

  • Percentage of digit chars

  • Percentage of alpha chars

  • Indented

  • Contains 5+ consecutive spaces

  • Whitespace in this line aligns with prev.

  • ...

  • Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}. (A small sketch of a few such line features follows below.)
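A rough sketch of how a few of the per-line features listed above might be computed; the exact definitions and thresholds here are illustrative assumptions, not the ones used in the paper:

    def line_features(line, prev_line=""):
        """Illustrative versions of a few of the per-line features listed above."""
        n = max(len(line), 1)
        digits = sum(c.isdigit() for c in line)
        alphas = sum(c.isalpha() for c in line)
        feats = {
            "pct_digit_chars": digits / n,
            "pct_alpha_chars": alphas / n,
            "indented": int(line.startswith(" ")),
            "has_5_consecutive_spaces": int("     " in line),
            # Crude stand-in for "whitespace in this line aligns with prev.":
            "aligns_with_prev": int(
                bool(line) and bool(prev_line)
                and [i for i, c in enumerate(line) if c == " "][:3]
                    == [i for i, c in enumerate(prev_line) if c == " "][:3]
            ),
        }
        return feats

    print(line_features(" 1994 :   9,500   16,175", " 1995 :   9,461   16,451"))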


Table Extraction Experimental Results

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

                            Line labels,      Table segments,
                            percent correct   F1
  HMM                       65 %              64 %
  Stateless MaxEnt          85 %              -
  CRF w/out conjunctions    52 %              68 %
  CRF                       95 %              92 %

  Δerror = 85% (line labels) and Δerror = 77% (table segments), CRF relative to HMM.


IE from Research Papers

[McCallum et al ‘99]


IE from Research Papers

Field-level F1:

  Hidden Markov Models (HMMs)        75.6   [Seymore, McCallum, Rosenfeld, 1999]
  Support Vector Machines (SVMs)     89.7   [Han, Giles, et al, 2003]
  Conditional Random Fields (CRFs)   93.9   [Peng, McCallum, 2004]

  Δerror = 40%


Main Point #2

Conditional Random Fields were more accurate in practice than a generative model

... on a research paper extraction task,

... and others, including

- a table extraction task

- noun phrase segmentation

- named entity extraction

- …


Outline

  • The need for unified IE and DM.

  • Review of Conditional Random Fields for IE.

  • Preliminary steps toward unification:

    • Joint Co-reference Resolution (Graph Partitioning)

    • Joint Labeling of Cascaded Sequences (Belief Propagation)

    • Joint Segmentation and Co-ref (Iterated Conditional Sampling)

  • Conclusions


IE in Context

(Pipeline diagram; components:) Document collection; Create ontology; Spider; Filter by relevance; IE (Segment, Classify, Associate, Cluster); Load DB; Database; Query, Search; Data mining (Prediction, Outlier detection, Decision support); Label training data; Train extraction models.




Coreference Resolution

AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"

Input: a news article with named-entity "mentions" tagged:

Today Secretary of State Colin Powell met with . . . he . . . Condoleezza Rice . . . Mr Powell . . . she . . . Powell . . . President Bush . . . Rice . . . Bush . . .

Output: number of entities, N = 3

#1: Secretary of State Colin Powell, he, Mr. Powell, Powell

#2: Condoleezza Rice, she, Rice

#3: President Bush, Bush


Inside the Traditional Solution

Pair-wise Affinity Metric: Mention (3) vs. Mention (4), ". . . Powell . . ." / ". . . Mr Powell . . ." → Y/N?

  N  Two words in common                               29
  Y  One word in common                                13
  Y  "Normalized" mentions are string identical        39
  Y  Capitalized word in common                         17
  Y  > 50% character tri-gram overlap                   19
  N  < 25% character tri-gram overlap                  -34
  Y  In same sentence                                    9
  Y  Within two sentences                                8
  N  Further than 3 sentences apart                     -1
  Y  "Hobbs Distance" < 3                               11
  N  Number of entities in between two mentions = 0     12
  N  Number of entities in between two mentions > 4     -3
  Y  Font matches                                        1
  Y  Default                                           -19

  OVERALL SCORE = 98  >  threshold = 0
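A sketch of how such a pair-wise affinity score might be computed as a weighted sum of binary features compared against a threshold; the feature functions and weights below are illustrative, not the learned ones from the slide:

    # Illustrative feature functions over a pair of mention strings.
    def words_in_common(m1, m2):
        return len(set(m1.lower().split()) & set(m2.lower().split()))

    FEATURES = [
        # (name, test, weight) -- weights here are made up for illustration.
        ("one_word_in_common",  lambda a, b: words_in_common(a, b) >= 1,  13),
        ("two_words_in_common", lambda a, b: words_in_common(a, b) >= 2,  29),
        ("capitalized_word_in_common",
         lambda a, b: bool({w for w in a.split() if w[0].isupper()} &
                           {w for w in b.split() if w[0].isupper()}),     17),
        ("default",             lambda a, b: True,                       -19),
    ]

    def affinity(m1, m2):
        """Weighted sum of the features that fire on this mention pair."""
        return sum(w for _, test, w in FEATURES if test(m1, m2))

    THRESHOLD = 0
    score = affinity("Powell", "Mr Powell")
    print(score, "-> merge" if score > THRESHOLD else "-> keep separate")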


The Problem

Pair-wise merging decisions are being made independently from each other.

(Example: ". . . Mr Powell . . ." vs. ". . . Powell . . ." has affinity = 98, decision Y; ". . . Powell . . ." vs. ". . . she . . ." has affinity = -104, decision N; ". . . Mr Powell . . ." vs. ". . . she . . ." has affinity = 11, decision Y; taken together, these three decisions are inconsistent.)

They should be made in relational dependence with each other.

Affinity measures are noisy and imperfect.


A Generative Model Solution

[Russell 2001], [Pasula et al 2002]

(Applied to citation matching, and object correspondence in vision)

(Diagram: a generative model over N entities, with per-entity variables such as id, surname, gender, age, and per-mention variables such as context, words, distance, fonts, …)

Issues:

1) A generative model makes it difficult to use complex features.

2) The number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus we must modify model structure during inference (MCMC).


A Markov Random Field for Co-reference (MRF)

[McCallum & Wellner, 2003]

Make pair-wise merging decisions in dependent relation to each other by

  - calculating a joint prob.
  - including all edge weights
  - adding dependence on consistent triangles.

(Diagram: three mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." connected by Y/N decision variables, with edge weights 45 between Mr Powell and Powell, -30 between Powell and she, and 11 between Mr Powell and she.)


A Markov Random Field for Co-reference (MRF)

[McCallum & Wellner, 2003]

Three example configurations of the Y/N variables on the same triangle, and their (unnormalized) scores:

  Mr Powell=Powell: N, Powell=she: N, Mr Powell=she: Y   →   -(45) - (-30) + (11) = -4

  Mr Powell=Powell: Y, Powell=she: N, Mr Powell=she: Y   →   an inconsistent triangle, score = -infinity

  Mr Powell=Powell: Y, Powell=she: N, Mr Powell=she: N   →   +(45) - (-30) - (11) = 64


Inference in these MRFs = Graph Partitioning

[Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

(Diagram: four mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . Condoleezza Rice . . .", ". . . she . . ." connected by edges with weights 45, -106, -30, -134, 11, and 10.)


(The same graph is then shown under two different partitionings of the mentions, with objective values -22 and 314.)
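The arithmetic on these slides is consistent with a correlation-clustering style objective: choose the partition that maximizes agreement with the edge weights. As a reconstruction (not copied from the slides), this can be written

    \hat{y} \;=\; \arg\max_{y} \sum_{i<j} w_{ij}\, y_{ij}, \qquad
    y_{ij} = \begin{cases} +1 & \text{if mentions } i \text{ and } j \text{ are in the same partition} \\ -1 & \text{otherwise,} \end{cases}

with y constrained to be transitive, i.e. to define a valid partition.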


Markov Random Fields for Co-reference

  • Train edge weight function by maximum likelihood

    • (Can approximate gradient by Gibbs sampling, or by stochastic gradient ascent, e.g. voted perceptron).

    • Given labeled training data in which partitions are given, learn an affinity measure for which graph partitioning will reproduce those partitions.

  • Interested in better algorithms for graph partitioning

    • Standard algorithms (e.g. Fiduccia-Mattheyses) do not apply with negative edge weights.

    • The action is in the interplay between positive and negative edges.

    • Currently using a modified version of "Correlation Clustering" [Bansal, Blum, Chawla, 2002], a very simple greedy algorithm (a sketch of the greedy idea follows below).
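A minimal sketch of a greedy agglomerative scheme in that spirit: repeatedly merge the pair of clusters whose connecting edges have the largest positive total weight, and stop when no merge improves the objective. This illustrates the general idea only, not the exact modified algorithm used in the talk; the edge-weight assignment in the example is also illustrative.

    def greedy_partition(n_mentions, weights):
        """weights: dict mapping (i, j) with i < j to an edge weight (may be negative)."""
        clusters = [{i} for i in range(n_mentions)]

        def merge_gain(a, b):
            # Total weight of edges that would become within-cluster edges.
            return sum(weights.get((min(i, j), max(i, j)), 0.0) for i in a for j in b)

        while True:
            best = None
            for x in range(len(clusters)):
                for y in range(x + 1, len(clusters)):
                    gain = merge_gain(clusters[x], clusters[y])
                    if gain > 0 and (best is None or gain > best[0]):
                        best = (gain, x, y)
            if best is None:
                break  # no merge with positive gain remains
            _, x, y = best
            clusters[x] |= clusters[y]
            del clusters[y]
        return clusters

    # Example: the four mentions from the slides, with illustrative edge weights.
    mentions = ["Mr Powell", "Powell", "Condoleezza Rice", "she"]
    w = {(0, 1): 45, (0, 2): -106, (1, 2): -30, (0, 3): 11, (1, 3): -134, (2, 3): 10}
    print([{mentions[i] for i in c} for c in greedy_partition(4, w)])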


Co-reference Experimental Results

[McCallum & Wellner, 2003]

Proper noun co-reference, among nouns having coreferents

DARPA ACE broadcast news transcripts, 117 stories (MUC-style F1):

  Single-link threshold       91.65 %
  Best prev match [Morton]    90.98 %
  MRFs                        93.96 %    (Δerror = 28%)

DARPA MUC-6 newswire article corpus, 30 stories (MUC-style F1):

  Single-link threshold       60.83 %
  Best prev match [Morton]    88.83 %
  MRFs                        91.59 %    (Δerror = 24%)


Outline

  • The need for unified IE and DM.

  • Review of Conditional Random Fields for IE.

  • Preliminary steps toward unification:

    • Joint Co-reference Resolution (Graph Partitioning)

    • Joint Labeling of Cascaded Sequences (Belief Propagation)

    • Joint Segmentation and Co-ref (Iterated Conditional Sampling)

  • Conclusions


Cascaded Predictions

Named-entity tag

Part-of-speech

Segmentation

(output prediction)

Chinese character

(input observation)


Cascaded Predictions

Named-entity tag

Part-of-speech

(output prediction)

Segmentation

(input observation)

Chinese character

(input observation)


Cascaded Predictions

Named-entity tag

(output prediction)

Part-of-speech

(input observation)

Segmentation

(input observation)

Chinese character

(input observation)


Joint Prediction: Cross-Product over Labels

O(|V| × 1485²) parameters

O(|o| × 1485²) running time

3 x 45 x 11 = 1485 possible states

e.g.: state label = (Wordbeg, Noun, Person)

Segmentation+POS+NE

(output prediction)

Chinese character

(input observation)


Joint Prediction: Factorial CRF

O(|V| × 2785) parameters

Named-entity tag

(output prediction)

Part-of-speech

(output prediction)

Segmentation

(output prediction)

Chinese character

(input observation)


Linear-Chain to Factorial CRFs: Model Definition

Linear-chain: a single chain of labels y over the observation sequence x.

Factorial: several coupled chains of labels u, v, w over the observation sequence x,

where the conditional distributions take the forms given below.
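The formulas themselves did not survive the transcript; in standard notation (a reconstruction consistent with the Dynamic CRF formulation), they are

    p(y \mid x) \;=\; \frac{1}{Z(x)} \prod_{t} \Phi(y_t, y_{t+1}, x, t)

for the linear chain, and

    p(u, v, w \mid x) \;=\; \frac{1}{Z(x)} \prod_{t} \Phi_u(u_t, u_{t+1}, x, t)\, \Phi_v(v_t, v_{t+1}, x, t)\, \Phi_w(w_t, w_{t+1}, x, t)\, \Psi(u_t, v_t, x, t)\, \Psi'(v_t, w_t, x, t)

for the factorial case, where each potential is the exponential of a weighted feature sum, e.g. \Phi(y_t, y_{t+1}, x, t) = \exp\big(\sum_k \lambda_k f_k(y_t, y_{t+1}, x, t)\big).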


Dynamic CRFs: Undirected, conditionally-trained analogue to Dynamic Bayes Nets (DBNs)

Factorial

Higher-Order

Hierarchical


Training CRFs and DCRFs

As before, the gradient for each parameter is

(feature count using correct labels) - (expected feature count using predicted labels) - (smoothing penalty).

Same form as general CRFs.


Inference (Exact): Junction Tree

Max-clique: 3 × 45 × 45 = 6075 assignments   (NP, POS)


Inference (Exact): Junction Tree

Max-clique: 3 × 45 × 45 × 11 = 66825 assignments   (NER, POS, SEG)


Inference (Approximate): Loopy Belief Propagation

(Diagram: nodes v1 … v6 in a loopy graph, exchanging messages m_i(v_j) along the edges.)
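For reference (not shown on the slide), the sum-product message update that loopy belief propagation iterates on such a graph is

    m_{i \to j}(v_j) \;\propto\; \sum_{v_i} \psi_i(v_i)\, \psi_{ij}(v_i, v_j) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(v_i),

with node beliefs proportional to \psi_i(v_i) \prod_{k \in N(i)} m_{k \to i}(v_i).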


Inference (Approximate): Tree Re-parameterization

[Wainwright, Jaakkola, Willsky 2001]




Experiments: Simultaneous noun-phrase & part-of-speech tagging

  NP tags:  B I I B I I O O O
  POS tags: N N N O N N V O V
  Words:    Rockwell International Corp. 's Tulsa unit said it signed

  NP tags:  B I I O B I O B I
  POS tags: O J N V O N O N N
  Words:    a tentative agreement extending its contract with Boeing Co.

  • Data from CoNLL Shared Task 2000 (Newswire)

    • 8936 training instances

    • 45 POS tags, 3 NP tags

    • Features: word identity, capitalization, regex’s, lexicons…


Experiments: Simultaneous noun-phrase & part-of-speech tagging

(Same tagged example sentences as on the previous slide.)

Two experiments

  • Compare exact and approximate inference

  • Compare Noun Phrase Segmentation F1 of

    • Cascaded CRF+CRF

    • Cascaded Brill+CRF

    • Joint Factorial DCRFs




Outline

  • The need for unified IE and DM.

  • Review of Conditional Random Fields for IE.

  • Preliminary steps toward unification:

    • Joint Co-reference Resolution (Graph Partitioning)

    • Joint Labeling of Cascaded Sequences (Belief Propagation)

    • Joint Segmentation and Co-ref (Iterated Conditional Sampling)

  • Conclusions


Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , B. Laurel (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Laurel , The Art of Human-Computer Interface Design , 355-366 ,

1990 .


Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , B. Laurel (ed) ,
Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in
Laurel , The Art of Human-Computer Interface Design , 355-366 ,
1990 .

  • Segment citation fields


Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , B. Laurel (ed) ,
Addison-Wesley , 1990 .

Y/N

Brenda Laurel . Interface Agents: Metaphors with Character , in
Laurel , The Art of Human-Computer Interface Design , 355-366 ,
1990 .

  • Segment citation fields

  • Resolve coreferent papers


Incorrect Segmentation Hurts Coreference

Laurel, B. Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , B. Laurel (ed) ,
Addison-Wesley , 1990 .

?

Brenda Laurel . Interface Agents: Metaphors with Character , in
Laurel , The Art of Human-Computer Interface Design , 355-366 ,
1990 .


Incorrect Segmentation Hurts Coreference

Laurel, B. Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , B. Laurel (ed) ,
Addison-Wesley , 1990 .

?

Brenda Laurel . Interface Agents: Metaphors with Character , in
Laurel , The Art of Human-Computer Interface Design , 355-366 ,
1990 .

Solution: Perform segmentation and coreference jointly.
Use segmentation uncertainty to improve coreference
and use coreference to improve segmentation.


Segmentation + Coreference Model

(Built up over four slides.) The model variables are:

  - o: observed citation

  - s: CRF segmentation of the citation

  - c: citation attributes derived from the segmentation

  - y: pairwise coreference variables between citations

Each observed citation o has a CRF segmentation s; citation attributes c are computed from the segmentation; pairs of citations are linked by pairwise coreference variables y.



Approximate Inference 1

Exact inference in the joint model is intractable, so…

  • Loopy Belief Propagation: messages m_i(v_j) passed between nodes.

(Diagram: nodes v1 … v6 exchanging messages along edges.)


Approximate Inference 1

  • Loopy Belief Propagation: messages passed between nodes.

  • Generalized Belief Propagation: messages passed between regions of nodes (v1 … v9).

Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with region size!


Approximate Inference 2

  • Iterated Conditional Modes (ICM) [Besag 1986]

    Holding all other variables constant, update one variable at a time to its most probable value, e.g.

      v6^(i+1) = argmax_{v6} P(v6 | v \ v6)
      v5^(j+1) = argmax_{v5} P(v5 | v \ v5)
      v4^(k+1) = argmax_{v4} P(v4 | v \ v4)

But greedy, and easily falls into local minima.
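A small sketch of the ICM update loop described above, over a toy pairwise model; the graph and potentials are illustrative placeholders, not the citation model from the talk:

    # Toy pairwise model: binary variables v1..v4 on a cycle, with edge potentials
    # that prefer equal neighbors. Purely illustrative.
    EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]

    def local_score(assign, i, value):
        """Unnormalized log-score of variable i taking `value`, given the others."""
        s = 0.0
        for a, b in EDGES:
            if i in (a, b):
                other = assign[b] if i == a else assign[a]
                s += 1.0 if value == other else -1.0
        return s

    def icm(assign, domains, sweeps=10):
        for _ in range(sweeps):
            changed = False
            for i in range(len(assign)):
                # Update v_i to its argmax given all other variables held constant.
                best = max(domains[i], key=lambda val: local_score(assign, i, val))
                if best != assign[i]:
                    assign[i] = best
                    changed = True
            if not changed:
                break  # reached a local optimum (possibly not the global one)
        return assign

    print(icm([0, 1, 0, 1], domains=[[0, 1]] * 4))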


Approximate Inference 2

  • Iterated Conditional Modes (ICM) [Besag 1986]

      v4^(k+1) = argmax_{v4} P(v4 | v \ v4)

  • Iterated Conditional Sampling (ICS) (our proposal; related work?)

    Instead of passing only the argmax, pass a sample of argmaxes of P(v4 | v \ v4), i.e. an N-best list (the top N values).

Can use a "generalized version" of this, doing exact inference on a region of several nodes at once.

Here, a "message" grows only linearly with region size and N!


Sample = N-best List from CRF Segmentation

Do exact inference over the linear-chain segmentation regions, then pass an N-best list of segmentations to coreference.

(Diagram: observed citations o with segmentations s, citation attributes c, prototype p, and pairwise coreference variables y.)


Sample = N-best List from Viterbi

The coreference factors are parameterized by the N-best lists of segmentations.

(Diagram: pairwise coreference variable y between two citations, each with segmentation s, attributes c, and observation o.)


Sample = N-best List from Viterbi

When calculating similarity with another citation, there is more opportunity to find the correct, matching fields.

(Same diagram as above.)


Results on 4 Sections of CiteSeer Citations

Coreference F1 performance:

  • Average error reduction is 35%.

  • "Optimal" makes best use of the N-best list by using true labels.

  • This indicates that even more improvement can be obtained.


Outline

  • The need for unified IE and DM.

  • Review of Conditional Random Fields for IE.

  • Preliminary steps toward unification:

    • Joint Co-reference Resolution (Graph Partitioning)

    • Joint Labeling of Cascaded Sequences (Belief Propagation)

    • Joint Segmentation and Co-ref (Iterated Conditional Sampling)

  • Conclusions


Conclusions

  • Conditional Random Fields combine the benefits of

    • Conditional probability models (arbitrary features)

    • Markov models (for sequences or other relations)

  • Success in

    • Coreference analysis

    • Factorial finite state models

    • Segmentation uncertainty aiding coreference

  • Future work:

    • Structure learning, semi-supervised learning.

    • Further tight integration of IE and Data Mining

    • Application to Social Network Analysis.


End of Talk