information extraction from the world wide web
Download
Skip this Video
Download Presentation
Information Extraction from the World Wide Web

Loading in 2 Seconds...

play fullscreen
1 / 64

Information Extraction from the World Wide Web - PowerPoint PPT Presentation


  • 147 Views
  • Uploaded on

Information Extraction from the World Wide Web. CSE 454 Based on Slides by William W. Cohen Carnegie Mellon University Andrew McCallum University of Massachusetts Amherst From KDD 2003. Course Overview. Introduction. Machine Learning. Scaling. Bag of Words. Hadoop. Naïve Bayes .

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Information Extraction from the World Wide Web' - aaliyah


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
information extraction from the world wide web

Information Extractionfrom the World Wide Web

CSE 454

Based on Slides by

William W. Cohen

Carnegie Mellon University

Andrew McCallum

University of Massachusetts Amherst

From KDD 2003

course overview
Course Overview

Introduction

Machine Learning

Scaling

Bag of Words

Hadoop

Naïve Bayes

Search

Info Extraction

Info Extraction

Self-Supervised

Overfitting,

Ensembles,

Cotraining

Clustering

Topics

administrivia
Administrivia

© Daniel S. Weld

  • Proposals
    • Due Thursday 10/22
    • Pre-screen
    • Time Estimation / Pert Chart
  • Scaling Up: Way Up
    • Today 3:30 – EE 105
    • David “Pablo” Cohen – Google News
    • “Content + Connection” Algorithms; Attribute Factoring
example the problem
Example: The Problem

Martin Baker, a person

Genomics job

Employers job posting form

Slides from Cohen & McCallum

example a solution
Example: A Solution

Slides from Cohen & McCallum

extracting job openings from the web
foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1

Extracting Job Openings from the Web

Slides from Cohen & McCallum

slide7
Job Openings:

Category = Food Services

Keyword = Baker

Location = Continental U.S.

Slides from Cohen & McCallum

what is information extraction
What is “Information Extraction”

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

Slides from Cohen & McCallum

what is information extraction1
What is “Information Extraction”

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE

NAME TITLE ORGANIZATION

Bill GatesCEOMicrosoft

Bill VeghteVPMicrosoft

Richard StallmanfounderFree Soft..

Slides from Cohen & McCallum

what is information extraction2
What is “Information Extraction”

As a familyof techniques:

Information Extraction =

segmentation + classification + clustering + association

October 14, 2002, 4:00 a.m. PT

For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation

Slides from Cohen & McCallum

what is information extraction3
What is “Information Extraction”

As a familyof techniques:

Information Extraction =

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation

Slides from Cohen & McCallum

what is information extraction4
What is “Information Extraction”

As a familyof techniques:

Information Extraction =

segmentation + classification+ association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation

Slides from Cohen & McCallum

what is information extraction5
NAME

TITLE ORGANIZATION

Bill Gates

CEO

Microsoft

Bill

Veghte

VP

Microsoft

Free Soft..

Richard

Stallman

founder

What is “Information Extraction”

As a familyof techniques:

Information Extraction =

segmentation + classification+ association+ clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation

*

*

*

*

Slides from Cohen & McCallum

ie in context
IE in Context

Create ontology

Spider

Filter by relevance

IE

Segment

Classify

Associate

Cluster

Database

Load DB

Documentcollection

Train extraction models

Query,

Search

Data mine

Label training data

Slides from Cohen & McCallum

ie history
IE History

Pre-Web

  • Mostly news articles
    • De Jong’s FRUMP [1982]
      • Hand-built system to fill Schank-style “scripts” from news wire
    • Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96]
  • Most early work dominated by hand-built models
    • E.g. SRI’s FASTUS, hand-built FSMs.
    • But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]

Web

  • AAAI ’94 Spring Symposium on “Software Agents”
    • Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
  • Tom Mitchell’s WebKB, ‘96
    • Build KB’s from the Web.
  • Wrapper Induction
    • First by hand, then ML: [Doorenbos ‘96], Soderland ’96], [Kushmerick ’97],…

Slides from Cohen & McCallum

what makes ie from the web different
What makes IE from the Web Different?

Less grammar, but more formatting & linking

Newswire

Web

www.apple.com/retail

Apple to Open Its First Retail Store

in New York City

MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.

"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

www.apple.com/retail/soho

www.apple.com/retail/soho/theatre.html

The directory structure, link structure, formatting & layout of the Web is its own new grammar.

Slides from Cohen & McCallum

landscape of ie tasks 1 4 pattern feature domain
Landscape of IE Tasks (1/4):Pattern Feature Domain

Text paragraphs

without formatting

Grammatical sentencesand some formatting & links

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Non-grammatical snippets,rich formatting & links

Tables

Slides from Cohen & McCallum

landscape of ie tasks 2 4 pattern scope
Landscape of IE Tasks (2/4):Pattern Scope

Web site specific

Genre specific

Wide, non-specific

Formatting

Layout

Language

Amazon.com Book Pages

Resumes

University Names

Slides from Cohen & McCallum

landscape of ie tasks 3 4 pattern complexity
Landscape of IE Tasks (3/4):Pattern Complexity

E.g. word patterns:

Regular set

Closed set

U.S. phone numbers

U.S. states

Phone: (413) 545-1323

He was born in Alabama…

The CALD main office can be reached at 412-268-1299

The big Wyoming sky…

Ambiguous patterns,needing context andmany sources of evidence

Complex pattern

U.S. postal addresses

Person names

University of Arkansas

P.O. Box 140

Hope, AR 71802

…was among the six houses sold by Hope Feldman that year.

Pawel Opalinski, SoftwareEngineer at WhizBang Labs.

Headquarters:

1128 Main Street, 4th Floor

Cincinnati, Ohio 45210

Slides from Cohen & McCallum

landscape of ie tasks 4 4 pattern combinations
Landscape of IE Tasks (4/4):Pattern Combinations

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity

Binary relationship

N-ary record

Person: Jack Welch

Relation: Person-Title

Person: Jack Welch

Title: CEO

Relation: Succession

Company: General Electric

Title: CEO

Out: Jack Welsh

In: Jeffrey Immelt

Person: Jeffrey Immelt

Relation: Company-Location

Company: General Electric

Location: Connecticut

Location: Connecticut

“Named entity” extraction

Slides from Cohen & McCallum

evaluation of single entity extraction
Evaluation of Single Entity Extraction

TRUTH:

Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.

PRED:

Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by RichardM. Karpe and Martin Cooke.

# correctly predicted segments 2

Precision = =

# predicted segments 6

# correctly predicted segments 2

Recall = =

# true segments 4

1

F1 = Harmonic mean of Precision & Recall =

((1/P) + (1/R)) / 2

Slides from Cohen & McCallum

state of the art performance
State of the Art Performance
  • Named entity recognition
    • Person, Location, Organization, …
    • F1 in high 80’s or low- to mid-90’s
  • Binary relation extraction
    • Contained-in (Location1, Location2)Member-of (Person1, Organization1)
    • F1 in 60’s or 70’s or 80’s
  • Wrapper induction
    • Extremely accurate performance obtainable
    • Human effort (~30min) required on each site

Slides from Cohen & McCallum

landscape of ie techniques 1 1 models
Classify Pre-segmentedCandidates

Sliding Window

Abraham Lincoln was born in Kentucky.

Abraham Lincoln was born in Kentucky.

Classifier

Classifier

which class?

which class?

Try alternatewindow sizes:

Context Free Grammars

Boundary Models

Finite State Machines

Abraham Lincoln was born in Kentucky.

Abraham Lincoln was born in Kentucky.

Abraham Lincoln was born in Kentucky.

BEGIN

Most likely state sequence?

NNP

NNP

V

V

P

NP

Most likely parse?

Classifier

PP

which class?

VP

NP

VP

BEGIN

END

BEGIN

END

S

Landscape of IE Techniques (1/1):Models

Lexicons

Abraham Lincoln was born in Kentucky.

member?

Alabama

Alaska

Wisconsin

Wyoming

…and beyond

Any of these models can be used to capture words, formatting or both.

Slides from Cohen & McCallum

landscape focus of this tutorial
Landscape:Focus of this Tutorial

Pattern complexity

closed set

regular

complex

ambiguous

Pattern feature domain

words

words + formatting

formatting

Pattern scope

site-specific

genre-specific

general

Pattern combinations

entity

binary

n-ary

Models

lexicon

regex

window

boundary

FSM

CFG

Slides from Cohen & McCallum

sliding windows
Sliding Windows

Slides from Cohen & McCallum

extraction by sliding window
Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

School of Computer Science

Carnegie Mellon University

3:30 pm

7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

Slides from Cohen & McCallum

extraction by sliding window1
Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

School of Computer Science

Carnegie Mellon University

3:30 pm

7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

Slides from Cohen & McCallum

extraction by sliding window2
Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

School of Computer Science

Carnegie Mellon University

3:30 pm

7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

Slides from Cohen & McCallum

extraction by sliding window3
Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

School of Computer Science

Carnegie Mellon University

3:30 pm

7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

Slides from Cohen & McCallum

a na ve bayes sliding window model
A “Naïve Bayes” Sliding Window Model

[Freitag 1997]

00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun

w t-m

w t-1

w t

w t+n

w t+n+1

w t+n+m

prefix

contents

suffix

Estimate Pr(LOCATION|window) using Bayes rule

Try all “reasonable” windows (vary length, position)

Assume independence for length, prefix words, suffix words, content words

Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)

If P(“Wean Hall Rm 5409” = LOCATION)is above some threshold, extract it.

Other examples of sliding window: [Baluja et al 2000]

(decision tree over individual words & their context)

Slides from Cohen & McCallum

na ve bayes sliding window results
“Naïve Bayes” Sliding Window Results

Domain: CMU UseNet Seminar Announcements

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

School of Computer Science

Carnegie Mellon University

3:30 pm

7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Field F1

Person Name: 30%

Location: 61%

Start Time: 98%

Slides from Cohen & McCallum

realistic sliding window classifier ie
Realistic sliding-window-classifier IE
  • What windows to consider?
    • all windows containing as many tokens as the shortest example, but no more tokens than the longest example
  • How to represent a classifier? It might:
    • Restrict the length of window;
    • Restrict the vocabulary or formatting used before/after/inside window;
    • Restrict the relative order of tokens, etc.
  • Learning Method
    • SRV: Top-Down Rule Learning [Frietag AAAI ‘98]
    • Rapier: Bottom-Up [Califf & Mooney, AAAI ‘99]

Slides from Cohen & McCallum

rapier results precision recall
Rapier: results – precision/recall

Slides from Cohen & McCallum

rule learning approaches to sliding window classification summary
Rule-learning approaches to sliding-window classification: Summary
  • SRV, Rapier, and WHISK [Soderland KDD ‘97]
    • Representations for classifiers allow restriction of the relationships between tokens, etc
    • Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
    • Use of these “heavyweight” representations is complicated, but seems to pay off in results
  • Can simpler representations for classifiers work?

Slides from Cohen & McCallum

bwi learning to detect boundaries
BWI: Learning to detect boundaries

[Freitag & Kushmerick, AAAI 2000]

  • Another formulation: learn three probabilistic classifiers:
    • START(i) = Prob( position i starts a field)
    • END(j) = Prob( position j ends a field)
    • LEN(k) = Prob( an extracted field has length k)
  • Then score a possible extraction (i,j) by

START(i) * END(j) * LEN(j-i)

  • LEN(k) is estimated from a histogram

Slides from Cohen & McCallum

bwi learning to detect boundaries1
BWI: Learning to detect boundaries
  • BWI uses boosting to find “detectors” for START and END
  • Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i).
  • Each “pattern” is a sequence of
    • tokens and/or
    • wildcards like: anyAlphabeticToken, anyNumber, …
  • Weak learner for “patterns” uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE,AFTER patterns

Slides from Cohen & McCallum

bwi learning to detect boundaries2
BWI: Learning to detect boundaries

Field F1

Person Name: 30%

Location: 61%

Start Time: 98%

Slides from Cohen & McCallum

problems with sliding windows and boundary finders
Problems with Sliding Windows and Boundary Finders
  • Decisions in neighboring parts of the input are made independently from each other.
    • Naïve Bayes Sliding Window may predict a “seminar end time” before the “seminar start time”.
    • It is possible for two overlapping windows to both be above threshold.
    • In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.

Slides from Cohen & McCallum

finite state machines
Finite State Machines

Slides from Cohen & McCallum

hidden markov models hmms
Hidden Markov Models (HMMs)

standard sequence model in genomics, speech, NLP, …

Graphical model

Finite state model

S

S

S

transitions

t

-

1

t

t+1

...

...

observations

...

O

O

O

t

-

t

+1

Generates:

State

sequence

Observation

sequence

t

1

o1 o2 o3 o4 o5 o6 o7 o8

Parameters: for all states S={s1,s2,…}

Start state probabilities: P(st )

Transition probabilities: P(st|st-1 )

Observation (emission) probabilities: P(ot|st )

Training:

Maximize probability of training observations (w/ prior)

Usually a multinomial over atomic, fixed alphabet

Slides from Cohen & McCallum

ie with hidden markov models
IE with Hidden Markov Models

Given a sequence of observations:

Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM:

person name

location name

background

Find the most likely state sequence: (Viterbi)

YesterdayPedro Domingosspoke this example sentence.

Any words said to be generated by the designated “person name”

state extract as a person name:

Person name: Pedro Domingos

Slides from Cohen & McCallum

hmm example nymble
HMM Example: “Nymble”

[Bikel, et al 1998], [BBN “IdentiFinder”]

Task: Named Entity Extraction

Transitionprobabilities

Observationprobabilities

Person

end-of-sentence

P(ot | st , st-1)

P(st | st-1, ot-1)

start-of-sentence

Org

P(ot | st , ot-1)

or

(Five other name classes)

Back-off to:

Back-off to:

P(st | st-1 )

P(ot | st )

Other

P(st )

P(ot )

Train on ~500k words of news wire text.

Results:

Case Language F1 .

Mixed English 93%

Upper English 91%

Mixed Spanish 90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]

Slides from Cohen & McCallum

we want more than an atomic view of words
We want More than an Atomic View of Words

Would like richer representation of text:

many arbitrary, overlapping features of the words.

S

S

S

identity of word

ends in “-ski”

is capitalized

is part of a noun phrase

is in a list of city names

is under node X in WordNet

is in bold font

is indented

is in hyperlink anchor

last person name was female

next two words are “and Associates”

t

-

1

t

t+1

is “Wisniewski”

part ofnoun phrase

ends in

“-ski”

O

O

O

t

-

t

+1

t

1

Slides from Cohen & McCallum

problems with richer representation and a joint model
Problems with Richer Representationand a Joint Model

These arbitrary features are not independent.

  • Multiple levels of granularity (chars, words, phrases)
  • Multiple dependent modalities (words, formatting, layout)
  • Past & future

Two choices:

Ignore the dependencies.

This causes “over-counting” of evidence (ala naïve Bayes). Big problem when combining evidence, as in Viterbi!

Model the dependencies.

Each state would have its own Bayes Net. But we are already starved for training data!

S

S

S

S

S

S

t

-

1

t

t+1

t

-

1

t

t+1

O

O

O

O

O

O

Slides from Cohen & McCallum

t

-

t

+1

t

-

t

+1

t

1

t

1

conditional sequence models
Conditional Sequence Models
  • We prefer a model that is trained to maximize a conditional probability rather than joint probability:P(s|o) instead of P(s,o):
    • Can examine features, but not responsible for generating them.
    • Don’t have to explicitly model their dependencies.
    • Don’t “waste modeling effort” trying to generate what we are given at test time anyway.

Slides from Cohen & McCallum

from hmms to crfs
St-1

St

St+1

...

Ot-1

Ot

Ot+1

...

From HMMs to CRFs

Conditional Finite State Sequence Models

[McCallum, Freitag & Pereira, 2000]

[Lafferty, McCallum, Pereira 2001]

St-1

St

St+1

...

Joint

...

Ot-1

Ot

Ot+1

Conditional

where

(A super-special case of

Conditional Random Fields.)

Slides from Cohen & McCallum

conditional random fields
Conditional Random Fields

[Lafferty, McCallum, Pereira 2001]

1. FSM special-case:

linear chain among unknowns, parameters tied across time steps.

St

St+1

St+2

St+3

St+4

O = Ot, Ot+1, Ot+2, Ot+3, Ot+4

2. In general:

CRFs = "Conditionally-trained Markov Network"

arbitrary structure among unknowns

3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]:

Parameters tied across hits from SQL-like queries ("clique templates")

4. Markov Logic Networks [Richardson & Domingos]:

Weights on arbitrary logical sentences

Slides from Cohen & McCallum

feature functions
Feature Functions

o =

Yesterday Pedro Domingos spoke this example sentence.

o1 o2 o3 o4 o5 o6 o7

s1

s2

s3

s4

Slides from Cohen & McCallum

learning parameters of crfs
Learning Parameters of CRFs

Maximize log-likelihood of parameters L= {lk} given training data D

Log-likelihood gradient:

  • Methods:
  • iterative scaling (quite slow)
  • conjugate gradient (much faster)
  • limited-memory quasi-Newton methods, BFGS (super fast)

[Sha & Pereira 2002] & [Malouf 2002]

Slides from Cohen & McCallum

general crfs vs hmms
General CRFs vs. HMMs
  • More general and expressive modeling technique
  • Comparable computational efficiency
  • Features may be arbitrary functions of any or all observations
  • Parameters need not fully specify generation of observations; require less training data
  • Easy to incorporate domain knowledge
  • State means only “state of process”, vs“state of process” and “observational history I’m keeping”

Slides from Cohen & McCallum

person name extraction
Person name Extraction

[McCallum 2001, unpublished]

Slides from Cohen & McCallum

person name extraction1
Person name Extraction

Slides from Cohen & McCallum

features in experiment
Capitalized Xxxxx

Mixed Caps XxXxxx

All Caps XXXXX

Initial Cap X….

Contains Digit xxx5

All lowercase xxxx

Initial X

Punctuation .,:;!(), etc

Period .

Comma ,

Apostrophe ‘

Dash -

Preceded by HTML tag

Character n-gram classifier says string is a person name (80% accurate)

In stopword list(the, of, their, etc)

In honorific list(Mr, Mrs, Dr, Sen, etc)

In person suffix list(Jr, Sr, PhD, etc)

In name particle list (de, la, van, der, etc)

In Census lastname list;segmented by P(name)

In Census firstname list;segmented by P(name)

In locations lists(states, cities, countries)

In company name list(“J. C. Penny”)

In list of company suffixes(Inc, & Associates, Foundation)

Features in Experiment

Hand-built FSM person-name extractor says yes, (prec/recall ~ 30/95)

Conjunctions of all previous feature pairs, evaluated at the current time step.

Conjunctions of all previous feature pairs, evaluated at current step and one step ahead.

All previous features, evaluated two steps ahead.

All previous features, evaluated one step behind.

Total number of features = ~500k

Slides from Cohen & McCallum

training and testing
Training and Testing
  • Trained on 65k words from 85 pages, 30 different companies’ web sites.
  • Training takes 4 hours on a 1 GHz Pentium.
  • Training precision/recall is 96% / 96%.
  • Tested on different set of web pages with similar size characteristics.
  • Testing precision is 92 – 95%, recall is 89 – 91%.

Slides from Cohen & McCallum

table extraction from government reports
Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was

slightly below 1994. Producer returns averaged $12.93 per hundredweight,

$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,

1 percent above 1994. Marketings include whole milk sold to plants and dealers

as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced,

8 percent less than 1994. Calves were fed 78 percent of this milk with the

remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat:

United States, 1993-95

--------------------------------------------------------------------------------

: : Production of Milk and Milkfat 2/

: Number :-------------------------------------------------------

Year : of : Per Milk Cow : Percentage : Total

:Milk Cows 1/:-------------------: of Fat in All :------------------

: : Milk : Milkfat : Milk Produced : Milk : Milkfat

--------------------------------------------------------------------------------

: 1,000 Head --- Pounds --- Percent Million Pounds

:

1993 : 9,589 15,704 575 3.66 150,582 5,514.4

1994 : 9,500 16,175 592 3.66 153,664 5,623.7

1995 : 9,461 16,451 602 3.66 155,644 5,694.3

--------------------------------------------------------------------------------

1/ Average number during year, excluding heifers not yet fresh.

2/ Excludes milk sucked by calves.

Slides from Cohen & McCallum

table extraction from government reports1
Table Extraction from Government Reports

[Pinto, McCallum, Wei, Croft, 2003]

100+ documents from www.fedstats.gov

Labels:

CRF

  • Non-Table
  • Table Title
  • Table Header
  • Table Data Row
  • Table Section Data Row
  • Table Footnote
  • ... (12 in all)

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was

slightly below 1994. Producer returns averaged $12.93 per hundredweight,

$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,

1 percent above 1994. Marketings include whole milk sold to plants and dealers

as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced,

8 percent less than 1994. Calves were fed 78 percent of this milk with the

remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat:

United States, 1993-95

--------------------------------------------------------------------------------

: : Production of Milk and Milkfat 2/

: Number :-------------------------------------------------------

Year : of : Per Milk Cow : Percentage : Total

:Milk Cows 1/:-------------------: of Fat in All :------------------

: : Milk : Milkfat : Milk Produced : Milk : Milkfat

--------------------------------------------------------------------------------

: 1,000 Head --- Pounds --- Percent Million Pounds

:

1993 : 9,589 15,704 575 3.66 150,582 5,514.4

1994 : 9,500 16,175 592 3.66 153,664 5,623.7

1995 : 9,461 16,451 602 3.66 155,644 5,694.3

--------------------------------------------------------------------------------

1/ Average number during year, excluding heifers not yet fresh.

2/ Excludes milk sucked by calves.

Features:

  • Percentage of digit chars
  • Percentage of alpha chars
  • Indented
  • Contains 5+ consecutive spaces
  • Whitespace in this line aligns with prev.
  • ...
  • Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.

Slides from Cohen & McCallum

table extraction experimental results
Table Extraction Experimental Results

[Pinto, McCallum, Wei, Croft, 2003]

Line labels,

percent correct

HMM

65 %

Stateless

MaxEnt

85 %

CRF w/out

conjunctions

52 %

CRF

95 %

Slides from Cohen & McCallum

named entity recognition
Named Entity Recognition

Reuters stories on international news Train on ~300k words

CRICKET -

MILLNS SIGNS FOR BOLAND

CAPE TOWN 1996-08-22

South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract.

Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels: Examples:

PER Yayuk Basuki

Innocent Butare

ORG 3M

KDP

Leicestershire

LOC Leicestershire

Nirmal Hriday

The Oval

MISC Java

Basque

1,000 Lakes Rally

Slides from Cohen & McCallum

automatically induced features
Automatically Induced Features

[McCallum 2003]

Index Feature

0 inside-noun-phrase (ot-1)

5 stopword (ot)

20 capitalized (ot+1)

75 word=the (ot)

100 in-person-lexicon (ot-1)

200 word=in (ot+2)

500 word=Republic (ot+1)

711 word=RBI (ot) & header=BASEBALL

1027 header=CRICKET (ot) & in-English-county-lexicon (ot)

1298 company-suffix-word (firstmentiont+2)

4040 location (ot) & POS=NNP (ot) & capitalized (ot) & stopword (ot-1)

4945 moderately-rare-first-name (ot-1) & very-common-last-name (ot)

4474 word=the (ot-2) & word=of (ot)

Slides from Cohen & McCallum

named entity extraction results
Named Entity Extraction Results

[McCallum & Li, 2003]

Method F1 # parameters

BBN's Identifinder, word features 79% ~500k

CRFs word features, 80% ~500k

w/out Feature Induction

CRFs many features, 75% ~3 million

w/out Feature Induction

CRFs many candidate features 90% ~60k

with Feature Induction

Slides from Cohen & McCallum

inducing state transition structure
Inducing State-Transition Structure

[Chidlovskii, 2000]

K-reversiblegrammars

Structure learning forHMMs + IE

[Seymore et al 1999]

[Frietag & McCallum 2000]

Slides from Cohen & McCallum

limitations of finite state models
Limitations of Finite State Models
  • Finite state models have a linear structure
  • Web documents have a hierarchical structure
    • Are we suffering by not modeling this structure more explicitly?
  • How can one learn a hierarchical extraction model?

Slides from Cohen & McCallum

ie resources
IE Resources
  • Data
    • RISE, http://www.isi.edu/~muslea/RISE/index.html
    • Linguistic Data Consortium (LDC)
      • Penn Treebank, Named Entities, Relations, etc.
    • http://www.biostat.wisc.edu/~craven/ie
    • http://www.cs.umass.edu/~mccallum/data
  • Code
    • TextPro, http://www.ai.sri.com/~appelt/TextPro
    • MALLET, http://www.cs.umass.edu/~mccallum/mallet
    • SecondString, http://secondstring.sourceforge.net/
  • Both
    • http://www.cis.upenn.edu/~adwait/penntools.html

Slides from Cohen & McCallum

references
References
  • [Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In Proceedings of ANLP’97, p194-201.
  • [Califf & Mooney 1999], Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
  • [Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002)
  • [Cohen, Kautz, McAllester 2000] Cohen, W; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
  • [Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, in Proceedings of ACM SIGMOD-98.
  • [Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language, ACM Transactions on Information Systems, 18(3).
  • [Cohen, 2000b] Cohen, W. Automatically Extracting Features for Concept Learning from the Web, Machine Learning: Proceedings of the Seventeeth International Conference (ML-2000).
  • [Collins & Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
  • [De Jong 1982] De Jong, G. An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Larence Erlbaum, 1982, 149-176.
  • [Freitag 98] Freitag, D: Information extraction from HTML: application of a general machine learning approach, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
  • [Freitag, 1999], Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.
  • [Freitag 2000], Freitag, D: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101 (2000).
  • Freitag & Kushmerick, 1999] Freitag, D; Kushmerick, D.: Boosted Wrapper Induction. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99)
  • [Freitag & McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinakge. In Proceedings AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
  • [Kushmerick, 2000] Kushmerick, N: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118(pp 15-68).
  • [Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001.
  • [Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master’s thesis. UC San Diego.
  • [McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira. F., Maximum entropy Markov models for information extraction and segmentation, In Proceedings of ICML-2000
  • [Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226 - 233.

Slides from Cohen & McCallum

ad