
Information Extraction and Integration: an Overview

William W. Cohen

Carnegie Mellon University

April 26, 2004


Example: The Problem

[Slide shows three web pages: Martin Baker, a person; a genomics job; an employer's job posting form.]



Extracting Job Openings from the Web

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

ContactPhone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1




Job Openings:

Category = Food Services

Keyword = Baker

Location = Continental U.S.



What is “Information Extraction”?

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION


What is “Information Extraction”?

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
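
To make the task concrete, here is a minimal sketch of slot filling over the excerpt above, using two hand-written regular expressions. The patterns are illustrative assumptions, not anything from the tutorial; real systems learn extractors rather than hard-coding them.

```python
import re

text = ("For years, Microsoft Corporation CEO Bill Gates railed against the "
        "economic philosophy of open-source software... said Bill Veghte, a "
        "Microsoft VP. Richard Stallman, founder of the Free Software "
        "Foundation, countered saying...")

# Two toy surface patterns (assumptions, not learned):
#   "<Org> <Title> <Name>"           e.g. "Microsoft Corporation CEO Bill Gates"
#   "<Name>, <title> of the <Org>"   e.g. "Richard Stallman, founder of the ..."
patterns = [
    re.compile(r"(?P<org>(?:[A-Z][a-z]+ )+)(?P<title>CEO|VP)\s+(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"),
    re.compile(r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+), (?P<title>founder) of the (?P<org>(?:[A-Z][a-z]+ ?)+)"),
]

for pat in patterns:
    for m in pat.finditer(text):
        print(f"{m['name']:<18} {m['title']:<8} {m['org'].strip()}")
```

The pattern-based approach breaks quickly (it already misses Bill Veghte); much of the rest of the tutorial is about learning extractors instead.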


What is “Information Extraction”?

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

[Same news excerpt as above.]

Extracted segments: Microsoft Corporation | CEO | Bill Gates | Microsoft | Gates | Microsoft | Bill Veghte | Microsoft | VP | Richard Stallman | founder | Free Software Foundation

aka “named entity extraction”


What is “Information Extraction”?

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

[Same news excerpt and extracted segment list as above.]


What is “Information Extraction”?

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

[Same news excerpt and extracted segment list as above.]


What is “Information Extraction”?

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

[Same news excerpt and extracted segment list as above.]

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..


Tutorial Outline

  • IE History

  • Landscape of problems and solutions

  • Models for named entity recognition:

    • Sliding window

    • Boundary finding

    • Finite state machines

  • Overview of related problems and solutions

    • Association, Clustering

    • Integration with Data Mining


IE History

Pre-Web

  • Mostly news articles

    • De Jong’s FRUMP [1982]

      • Hand-built system to fill Schank-style “scripts” from news wire

    • Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96]

  • Early work dominated by hand-built models

    • E.g. SRI’s FASTUS, hand-built FSMs.

    • But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]

Web

  • AAAI ’94 Spring Symposium on “Software Agents”

    • Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.

  • Tom Mitchell’s WebKB, ‘96

    • Build KB’s from the Web.

  • Wrapper Induction

    • Initially hand-built, then ML: [Soderland ’96], [Kushmerick ’97],…

    • Citeseer; Cora; FlipDog; contEd courses, corpInfo, …


IE History

Biology

  • Gene/protein entity extraction

  • Protein/protein interaction facts

  • Automated curation/integration of databases

    • At CMU: SLIF (Murphy et al, subcellular information from images + text in journal articles)

Email

  • EPCA, PAL, RADAR, CALO: intelligent office assistant that “understands” some part of email

    • At CMU: web site update requests, office-space requests; calendar scheduling requests; social network analysis of email.


IE is different in different domains!

Example: on the web there is less grammar, but more formatting & linking

Newswire

Web

www.apple.com/retail

Apple to Open Its First Retail Store in New York City

MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.

"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

www.apple.com/retail/soho

www.apple.com/retail/soho/theatre.html

The directory structure, link structure, formatting & layout of the Web is its own new grammar.


Landscape of IE Tasks (1/4): Degree of Formatting

Text paragraphs without formatting

Grammatical sentences and some formatting & links

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Non-grammatical snippets, rich formatting & links

Tables


Landscape of IE Tasks (2/4): Intended Breadth of Coverage

  • Web site specific: formatting (e.g., Amazon.com book pages)

  • Genre specific: layout (e.g., resumes)

  • Wide, non-specific: language (e.g., university names)


Landscape of IE Tasks (3/4): Complexity

E.g. word patterns:

Closed set: U.S. states
  “He was born in Alabama…”
  “The big Wyoming sky…”

Regular set: U.S. phone numbers
  “Phone: (413) 545-1323”
  “The CALD main office can be reached at 412-268-1299”

Complex pattern: U.S. postal addresses
  “University of Arkansas, P.O. Box 140, Hope, AR 71802”
  “Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210”

Ambiguous patterns, needing context and many sources of evidence: person names
  “…was among the six houses sold by Hope Feldman that year.”
  “Pawel Opalinski, Software Engineer at WhizBang Labs.”
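
As a quick illustration of the "regular set" end of this spectrum, a single hand-written pattern (a toy assumption, not from the tutorial) already covers both phone-number examples above, while the ambiguous person-name cases have no comparably simple pattern:

```python
import re

# Simplified U.S. phone-number pattern; an illustrative assumption,
# not a complete grammar of phone formats.
PHONE = re.compile(r"\(?\d{3}\)?[ -]\d{3}-\d{4}")

for s in ["Phone: (413) 545-1323",
          "The CALD main office can be reached at 412-268-1299"]:
    print(PHONE.findall(s))   # ['(413) 545-1323'], ['412-268-1299']
```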


Landscape of IE Tasks (4/4): Single Field/Record

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity (“named entity” extraction):

  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut

Binary relationship:

  Relation: Person-Title
    Person: Jack Welch
    Title: CEO

  Relation: Company-Location
    Company: General Electric
    Location: Connecticut

N-ary record:

  Relation: Succession
    Company: General Electric
    Title: CEO
    Out: Jack Welch
    In: Jeffrey Immelt



Landscape of IE Techniques (1/1): Models

[Figure: five model families, each illustrated on the sentence “Abraham Lincoln was born in Kentucky.”]

  • Lexicons: is a phrase a member of a list? (Alabama, Alaska, …, Wisconsin, Wyoming)

  • Classify pre-segmented candidates: a classifier assigns a class to each candidate phrase (“which class?”)

  • Sliding window: a classifier labels each window, trying alternate window sizes

  • Boundary models: classifiers mark BEGIN and END positions

  • Finite state machines: find the most likely state sequence

  • Context free grammars: find the most likely parse (NNP, V, P, NP, PP, VP, S)

Any of these models can be used to capture words, formatting or both.



Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

School of Computer Science

Carnegie Mellon University

3:30 pm

7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement


[The same announcement is repeated on three further slides as the window slides across the text.]


A “Naïve Bayes” Sliding Window Model

[Freitag 1997]

00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun

w(t-m) … w(t-1) [ w(t) … w(t+n) ] w(t+n+1) … w(t+n+m)

(prefix | contents | suffix)

Estimate Pr(LOCATION|window) using Bayes rule

Try all “reasonable” windows (vary length, position)

Assume independence for length, prefix words, suffix words, content words

Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)

If Pr(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
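
A minimal sketch of this scoring rule, with invented probability estimates; the tutorial specifies the model only at the level above, so the table contents and smoothing constant here are assumptions:

```python
import math

def window_log_score(tokens, i, j, model, m=2):
    """log Pr(LOCATION | window) up to a constant, under the independence
    assumptions above: length, prefix words, content words, suffix words."""
    def lp(table, w):                       # smoothed log-probability lookup
        return math.log(table.get(w, 1e-6))
    score = lp(model["len"], j - i)
    score += sum(lp(model["prefix"], w) for w in tokens[max(0, i - m):i])
    score += sum(lp(model["content"], w) for w in tokens[i:j])
    score += sum(lp(model["suffix"], w) for w in tokens[j:j + m])
    return score

# Invented estimates, e.g. Pr("Place" in prefix | LOCATION) is high.
model = {"len":     {3: 0.3, 4: 0.5},
         "prefix":  {"Place": 0.8, ":": 0.7},
         "content": {"Wean": 0.5, "Hall": 0.5, "Rm": 0.4, "5409": 0.1},
         "suffix":  {"Speaker": 0.6, ":": 0.7}}

tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
# Try all "reasonable" windows; keep the best-scoring one.
spans = [(i, j) for i in range(len(tokens))
         for j in range(i + 1, min(i + 5, len(tokens) + 1))]
best = max(spans, key=lambda s: window_log_score(tokens, *s, model))
print(best, tokens[best[0]:best[1]])        # the "Wean Hall Rm 5409" span wins
```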


“Naïve Bayes” Sliding Window Results

Domain: CMU UseNet Seminar Announcements

[Same CMU UseNet seminar announcement as above.]

Field        F1
Person Name  30%
Location     61%
Start Time   98%


SRV: a realistic sliding-window-classifier IE system

[Freitag AAAI ‘98]

  • What windows to consider?

    • all windows containing as many tokens as the shortest example, but no more tokens than the longest example

  • How to represent a classifier? It might:

    • Restrict the length of window;

    • Restrict the vocabulary or formatting used before/after/inside window;

    • Restrict the relative order of tokens;

    • Use inductive logic programming techniques to express all these…

<title>Course Information for CS213</title>

<h1>CS 213 C++ Programming</h1>


SRV: a rule-learner for sliding-window classification

  • Primitive predicates used by SRV:

    • token(X,W), allLowerCase(W), numerical(W), …

    • nextToken(W,U), previousToken(W,V)

  • HTML-specific predicates:

    • inTitleTag(W), inH1Tag(W), inEmTag(W),…

    • emphasized(W) = “inEmTag(W) or inBTag(W) or …”

    • tableNextCol(W,U) = “U is some token in the column after the column W is in”

    • tablePreviousCol(W,V), tableRowHeader(W,T),…


SRV: a rule-learner for sliding-window classification

  • Non-primitive “conditions” used by SRV:

    • every(+X, f, c) = for all W in X: f(W)=c

    • some(+X, W, <f1,…,fk>, g, c) = exists W: g(fk(…f1(W)…))=c

    • tokenLength(+X, relop, c)

    • position(+W, direction, relop, c)

      • e.g., tokenLength(X,>,4), position(W,fromEnd,<,2)

Example rule (non-primitive conditions make greedy search easier):

courseNumber(X) :-
  tokenLength(X,=,2),
  every(X, inTitle, false),
  some(X, A, <previousToken>, inTitle, true),
  some(X, B, <>, tripleton, true)



Rule-learning approaches to sliding-window classification: Summary

  • SRV, Rapier, and WHISK [Soderland KDD ‘97]

    • Representations for classifiers allow restriction of the relationships between tokens, etc

    • Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)

    • Use of these “heavyweight” representations is complicated, but seems to pay off in results

  • Some questions to consider:

    • Can simpler, propositional representations for classifiers work? (see Roth and Yih)

    • Which learning methods should we consider? (NB, ILP, boosting, semi-supervised; see Collins & Singer)

    • When do we want to use this method vs fancier ones?


BWI: Learning to detect boundaries

[Freitag & Kushmerick, AAAI 2000]

  • Another formulation: learn three probabilistic classifiers:

    • START(i) = Prob( position i starts a field)

    • END(j) = Prob( position j ends a field)

    • LEN(k) = Prob( an extracted field has length k)

  • Then score a possible extraction (i,j) by

    START(i) * END(j) * LEN(j-i)

  • LEN(k) is estimated from a histogram
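
A minimal sketch of this scoring rule; the START/END/LEN values below are made up for illustration:

```python
# Score every candidate span (i, j) by START(i) * END(j) * LEN(j - i),
# exactly the formulation above; probabilities here are invented.
def best_extraction(start_p, end_p, len_p, max_len=6):
    """Return the (i, j) span maximizing START(i) * END(j) * LEN(j - i)."""
    best, best_span = 0.0, None
    for i, si in enumerate(start_p):
        for j in range(i + 1, min(i + max_len, len(end_p))):
            score = si * end_p[j] * len_p.get(j - i, 0.0)
            if score > best:
                best, best_span = score, (i, j)
    return best_span, best

start_p = [0.01, 0.02, 0.90, 0.05, 0.01, 0.01]   # position 2 likely starts a field
end_p   = [0.01, 0.01, 0.02, 0.10, 0.85, 0.02]   # position 4 likely ends it
len_p   = {1: 0.2, 2: 0.6, 3: 0.2}               # LEN estimated from a histogram
print(best_extraction(start_p, end_p, len_p))    # -> ((2, 4), 0.90 * 0.85 * 0.6)
```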


BWI: Learning to detect boundaries

Field        F1
Person Name  30%
Location     61%
Start Time   98%


Problems with Sliding Windows and Boundary Finders

  • Decisions in neighboring parts of the input are made independently from each other.

    • Expensive for long entity names

    • Sliding Window may predict a “seminar end time” before the “seminar start time”.

    • It is possible for two overlapping windows to both be above threshold.

    • In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.



IE with Hidden Markov Models

Given a sequence of observations:

Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM with states: person name, location name, background

Find the most likely state sequence: (Viterbi)

Yesterday Pedro Domingos spoke this example sentence.

Any words said to be generated by the designated “person name” state are extracted as a person name:

Person name: Pedro Domingos
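
A minimal Viterbi sketch for the three-state HMM above; the states match the slide, but all probabilities are invented for illustration (a real model would be trained on labeled text):

```python
import math

STATES = ["person name", "location name", "background"]

def viterbi(obs, start, trans, emit):
    """Return the most likely state sequence for the observed tokens."""
    V = [{s: math.log(start[s]) + math.log(emit[s].get(obs[0], 1e-6))
          for s in STATES}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: V[-1][r] + math.log(trans[r][s]))
            col[s] = (V[-1][best] + math.log(trans[best][s])
                      + math.log(emit[s].get(o, 1e-6)))
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    path = [max(STATES, key=lambda s: V[-1][s])]
    for ptr in reversed(back):              # trace back-pointers
        path.append(ptr[path[-1]])
    return path[::-1]

start = {"person name": 0.1, "location name": 0.1, "background": 0.8}
trans = {r: {"person name": 0.2, "location name": 0.1, "background": 0.7}
         for r in STATES}
emit = {"person name":   {"Pedro": 0.4, "Domingos": 0.4},
        "location name": {"Pittsburgh": 0.4},
        "background":    {w: 0.1 for w in
                          "Yesterday spoke this example sentence .".split()}}

obs = "Yesterday Pedro Domingos spoke this example sentence .".split()
print(list(zip(obs, viterbi(obs, start, trans, emit))))
```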


HMM for Segmentation

  • Simplest Model: One state per entity type


What is a “symbol” ???

Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?

4601 => “4601”, “9999”, “9+”, “number”, … ?

Datamold: choose best abstraction level using holdout set
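
A tiny sketch of such abstraction levels; the function and collapse rule are illustrative assumptions (Datamold itself chooses the level per field using a holdout set, and uses variants such as "9+" and "number"):

```python
import re

def abstractions(w):
    """Word plus progressively coarser abstraction levels of it."""
    shape = re.sub(r"[A-Z]", "X",
            re.sub(r"[a-z]", "x",
            re.sub(r"\d", "9", w)))
    short = re.sub(r"(.)\1+", r"\1", shape)   # collapse runs: "Xxxxx" -> "Xx"
    return [w, w.lower(), shape, short]

print(abstractions("Cohen"))   # ['Cohen', 'cohen', 'Xxxxx', 'Xx']
print(abstractions("4601"))    # ['4601', '4601', '9999', '9']
```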


HMM Example: “Nymble”

[Bikel, et al 1998], [BBN “IdentiFinder”]

Task: Named Entity Extraction

[Figure: state diagram with start-of-sentence and end-of-sentence states, plus Person, Org, (five other name classes), and Other.]

Transition probabilities: P(st | st-1, ot-1), backing off to P(st | st-1), then P(st).

Observation probabilities: P(ot | st, st-1) or P(ot | st, ot-1), backing off to P(ot | st), then P(ot).

Train on ~500k words of news wire text.

Results:

Case   Language  F1
Mixed  English   93%
Upper  English   91%
Mixed  Spanish   90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]


What is a symbol?

Bikel et al mix symbols from two abstraction levels


What is a symbol?

Ideally we would like to use many, arbitrary, overlapping features of words.

[Figure: a linear-chain model with states S(t-1), S(t), S(t+1) over observations O(t-1), O(t), O(t+1). The word “Wisniewski” carries many overlapping features: identity of word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor.]

Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …


What is a symbol?

[Same figure as above.]

Idea: replace generative model in HMM with a maxent model, where state depends on observations


What is a symbol?

[Same figure as above.]

Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state


What is a symbol?

[Same figure as above.]

Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state history
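
A sketch of that idea: a "maxent" (logistic) local model P(s_t | s_{t-1}, o_t) over overlapping features. Feature names and weights below are invented; a real tagger learns the weights by maximum-likelihood training.

```python
import math

STATES = ["name", "other"]

def features(prev_state, word):
    # Overlapping, non-independent features, as in the figure above.
    return {f"prev={prev_state}": 1.0,
            f"ends-ski={word.endswith('ski')}": 1.0,
            f"cap={word[:1].isupper()}": 1.0}

def p_state(prev_state, word, w):
    score = {s: math.exp(sum(w.get((s, f), 0.0) * v
                             for f, v in features(prev_state, word).items()))
             for s in STATES}
    z = sum(score.values())            # note: normalized locally, per position
    return {s: v / z for s, v in score.items()}

# Invented weights favoring "name" for capitalized words ending in "-ski".
w = {("name", "ends-ski=True"): 2.0, ("name", "cap=True"): 1.0,
     ("other", "cap=False"): 1.5, ("name", "prev=name"): 0.5}
print(p_state("other", "Wisniewski", w))   # mostly "name"
```

The per-position normalization in this sketch is exactly what the label bias discussion below revisits.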


Ratnaparkhi’s MXPOST

  • Sequential learning problem: predict POS tags of words.

  • Uses MaxEnt model described above.

  • Rich feature set.

  • To smooth, discard features occurring < 10 times.


Conditional Markov Models (CMMs) aka MEMMs aka Maxent Taggers vs HMMs

[Figure: two chain models over states S(t-1), S(t), S(t+1) and observations O(t-1), O(t), O(t+1): in the HMM the states generate the observations; in the CMM the states are conditioned on the observations.]


Label Bias Problem

  • Consider this MEMM, and enough training data to perfectly model it:

Pr(0123|rib) = 1 and Pr(0453|rob) = 1

But at test time:

Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1

Pr(0453|rib) = Pr(4|0,r)/Z1’ * Pr(5|4,i)/Z2’ * Pr(3|5,b)/Z3’ = 0.5 * 1 * 1

Because states 1 and 4 each have only one outgoing transition, per-state normalization forces Pr(2|1,o) = Pr(5|4,i) = 1 regardless of the observation: the middle symbol is ignored, so the MEMM cannot prefer the correct path.


Another view of label bias [Sha & Pereira]

So what’s the alternative?


CMMs to CRFs

New model:
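
For reference, the linear-chain CRF of [Lafferty, McCallum & Pereira 2001] (cited in the references) replaces the CMM's per-position models with one globally normalized model:

```latex
P(\mathbf{y} \mid \mathbf{x}) \;=\;
  \frac{1}{Z(\mathbf{x})}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) \;=\;
  \sum_{\mathbf{y}'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Because Z(x) sums over whole label sequences, there are no per-state normalizers Z1, Z2, Z3; this is what avoids the label bias problem above.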


What’s the new model look like? What’s independent?

[Figure: labels y1, y2, y3 linked in a chain, each connected to the observations x1, x2, x3.]


Graphical comparison among HMMs, MEMMs and CRFs

[Figure: the three graphical structures side by side: HMM (directed, generative), MEMM (directed, conditional), CRF (undirected, conditional).]


Sha & Pereira results

CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron


Sha & Pereira results

[Table of training times, in minutes, on 375k examples.]



Problems with Tagging-based NER

  • Decisions are made token-by-token

  • Feature engineering can be awkward:

    • How long is the extracted name?

    • Does the extracted name appear in a dictionary?

    • Is the extracted name similar to (an acronym for, a short version of...) a name appearing earlier in the document?

  • “Tag engineering” is also crucial:

    • Do we use one state per entity? A special state for the beginning of an entity? A special state for the end?


Semi-Markov models for IE

with Sunita Sarawagi, IIT Bombay

  • Train on sequences of labeled segments, not labeled words: S = (start, end, label)

  • Build a probability model of segment sequences, not word sequences

  • Define features f of segments, e.g. f(S) = words xt...xu, length, previous words, case information, ..., distance to known name

  • (Approximately) optimize feature weights on training data to maximize:
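
The objective itself is not spelled out on the slide; a plausible form, borrowing the segment-level conditional likelihood that Sarawagi & Cohen later published as semi-Markov CRFs (an assumption, not the slide's own equation):

```latex
P(\mathbf{S} \mid \mathbf{x}) \;=\;
  \frac{1}{Z(\mathbf{x})} \exp\Big( \mathbf{w} \cdot \sum_{j} \mathbf{f}(S_j, \mathbf{x}) \Big),
\qquad
\mathbf{w}^{*} \;=\; \arg\max_{\mathbf{w}} \sum_{i} \log P\big(\mathbf{S}^{(i)} \mid \mathbf{x}^{(i)}\big)
```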


Segments vs tagging

[Figure: tagging models score features f(xt, yt) at each word position t; segment models score features f over whole segments (t, u) of x and their labels y.]


A training algorithm for CSMMs (1)

Review: Collins’ perceptron training algorithm

[Figure: compare the correct tags with the Viterbi tags under the current weights.]
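
A sketch of that training loop as described above; `viterbi_decode` and `feats` are hypothetical stand-ins for the decoder and feature extractor, which the tutorial does not name:

```python
def perceptron_train(examples, viterbi_decode, feats, epochs=5):
    """Collins-style structured perceptron: decode, then update on mistakes."""
    weights = {}
    for _ in range(epochs):
        for x, y_correct in examples:
            y_pred = viterbi_decode(x, weights)          # Viterbi tags
            if y_pred != y_correct:                      # mistake-driven update:
                for f, v in feats(x, y_correct).items(): # boost correct tags
                    weights[f] = weights.get(f, 0.0) + v
                for f, v in feats(x, y_pred).items():    # penalize predicted tags
                    weights[f] = weights.get(f, 0.0) - v
    return weights
```

The "voted" variant on the next two slides keeps the intermediate weight vectors and combines their predictions rather than using only the final weights.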


A training algorithm for CSMMs (2)

Variant of Collins’ perceptron training algorithm: a voted perceptron learner for T_TRANS, with Viterbi-like decoding.


A training algorithm for CSMMs (3)

Variant of Collins’ perceptron training algorithm: a voted perceptron learner for T_SEGTRANS, with Viterbi-like decoding.


Results





Broader View

Up to now we have been focused on segmentation and classification.

[Figure: an end-to-end IE pipeline: create ontology; spider a document collection; filter by relevance; IE (segment, classify, associate, cluster); load DB; database; query, search; data mine; with side loops to label training data and train extraction models.]


Broader View

Now touch on some other issues:

[Figure: the same pipeline (with tokenization shown inside IE), with numbered topics: (1) association, (2) clustering, (3) creating the ontology, (4) query and search, (5) data mining.]


(1) Association as Binary Classification

Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair.

(Person: Christos Faloutsos, Ted Senator; Role: the KDD 2003 General Chair)

Person-Role(Christos Faloutsos, KDD 2003 General Chair) → NO

Person-Role(Ted Senator, KDD 2003 General Chair) → YES

Do this with SVMs and tree kernels over parse trees.

[Zelenko et al, 2002]
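
A sketch of the formulation: generate candidate (Person, Role) pairs and score each with a binary classifier. In [Zelenko et al 2002] the classifier is an SVM with tree kernels over parse trees; the toy apposition rule below is only a stand-in:

```python
from itertools import product

persons = ["Christos Faloutsos", "Ted Senator"]
roles = ["the KDD 2003 General Chair"]
sentence = "Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair."

def is_related(person, role, text):
    # Toy feature (an assumption): apposition, i.e. the role directly
    # follows "<person>, " in the text.
    return f"{person}, {role}" in text

for p, r in product(persons, roles):
    print(f"Person-Role({p}, {r}) -> {'YES' if is_related(p, r, sentence) else 'NO'}")
```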


(1) Association with Finite State Machines

[Ray & Craven, 2001]

… This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol. …

DET this | N enzyme | N ubc6 | V localizes | PREP to | ART the | ADJ endoplasmic | N reticulum | PREP with | ART the | ADJ catalytic | N domain | V facing | ART the | N cytosol

Subcellular-localization (UBC6, endoplasmic reticulum)


(1) Association with Graphical Models

[Roth & Yih 2002]

Capture arbitrary-distance dependencies among predictions.

[Figure: one random variable over the class of entity #2 (e.g., over {person, location, …}) and one over the class of the relation between entity #2 and #1 (e.g., over {lives-in, is-boss-of, …}); local language models contribute evidence to entity and relation classification; dependencies between classes of entities and relations; inference with loopy belief propagation.]


(1) Association with Graphical Models

[Roth & Yih 2002]

Also capture long-distance dependencies among predictions.

[Same figure as above, instantiated with entity #1 = person, entity #2 = location, relation = lives-in.]


Broader View

[Same pipeline figure as above.]

When do two extracted strings refer to the same object?


(2) Information Integration

[Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001],

[Richardson & Domingos 2003]

Goal might be to merge results of two IE systems:


(2) Other Information Integration Issues

  • Distance metrics for text – which work well?

    • [Cohen, Ravikumar, Fienberg, 2003]

  • Finessing integration by soft database operations based on similarity

    • [Cohen, 2000]

  • Integration of complex structured databases: (capture dependencies among multiple merges)

    • [Cohen, McAllester, Kautz KDD 2000; Pasula, Marthi, Milch, Russell, Shpitser, NIPS 2002; McCallum and Wellner, KDD WS 2003]


IE Resources

  • Data

    • RISE, http://www.isi.edu/~muslea/RISE/index.html

    • Linguistic Data Consortium (LDC)

      • Penn Treebank, Named Entities, Relations, etc.

    • http://www.biostat.wisc.edu/~craven/ie

  • Papers, tutorials, lectures, code

    • http://www.cs.cmu.edu/~wcohen/10-707


References

  • [Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In Proceedings of ANLP’97, p194-201.

  • [Califf & Mooney 1999], Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).

  • [Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002)

  • [Cohen, Kautz, McAllester 2000] Cohen, W; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).

  • [Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, in Proceedings of ACM SIGMOD-98.

  • [Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language, ACM Transactions on Information Systems, 18(3).

  • [Cohen, 2000b] Cohen, W. Automatically Extracting Features for Concept Learning from the Web, Machine Learning: Proceedings of the Seventeeth International Conference (ML-2000).

  • [Collins & Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.

  • [De Jong 1982] De Jong, G. An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, 149-176.

  • [Freitag 98] Freitag, D: Information extraction from HTML: application of a general machine learning approach, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).

  • [Freitag, 1999], Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.

  • [Freitag 2000], Freitag, D: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101 (2000).

  • [Freitag & Kushmerick 2000] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).

  • [Freitag & McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinkage. In Proceedings AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.

  • [Kushmerick, 2000] Kushmerick, N: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118(pp 15-68).

  • [Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001.

  • [Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master’s thesis. UC San Diego.

  • [McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira. F., Maximum entropy Markov models for information extraction and segmentation, In Proceedings of ICML-2000

  • [Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226 - 233.


References (continued)

  • [Muslea et al, 1999] Muslea, I.; Minton, S.; Knoblock, C. A.: A Hierarchical Approach to Wrapper Induction. Proceedings of Autonomous Agents-99.

  • [Muslea et al, 2000] Muslea, I.; Minton, S.; and Knoblock, C. Hierarchical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems.

  • [Nahm & Mooney, 2000] Nahm, Y.; and Mooney, R. A mutually beneficial integration of data mining and information extraction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 627--632, Austin, TX.

  • [Punyakanok & Roth 2001] Punyakanok, V.; and Roth, D. The use of classifiers in sequential inference. Advances in Neural Information Processing Systems 13.

  • [Ratnaparkhi 1996] Ratnaparkhi, A., A maximum entropy part-of-speech tagger, in Proc. Empirical Methods in Natural Language Processing Conference, p133-141.

  • [Ray & Craven 2001] Ray, S.; and Craven, M. Representing Sentence Structure in Hidden Markov Models for Information Extraction. Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, WA. Morgan Kaufmann.

  • [Soderland 1997]: Soderland, S.: Learning to Extract Text-Based Information from the World Wide Web. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97).

  • [Soderland 1999] Soderland, S. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1/3):233-277.

