
Information Extraction from the World Wide Web

CSE 454

Based on Slides by

William W. Cohen

Carnegie Mellon University

Andrew McCallum

University of Massachusetts Amherst

From KDD 2003



Course Overview

Introduction

Machine Learning

Scaling

Bag of Words

Hadoop

Naïve Bayes

Search

Info Extraction

Info Extraction

Self-Supervised

Overfitting,

Ensembles,

Cotraining

Clustering

Topics



Administrivia

© Daniel S. Weld

  • Proposals

    • Due Thursday 10/22

    • Pre-screen

    • Time Estimation / Pert Chart

  • Scaling Up: Way Up

    • Today 3:30 – EE 105

    • David “Pablo” Cohen – Google News

    • “Content + Connection” Algorithms; Attribute Factoring



Example: The Problem

Martin Baker, a person

Genomics job

Employer's job posting form

Slides from Cohen & McCallum



Example: A Solution

Slides from Cohen & McCallum


Extracting Job Openings from the Web

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1


Slides from Cohen & McCallum



Job Openings:

Category = Food Services

Keyword = Baker

Location = Continental U.S.

Slides from Cohen & McCallum



What is “Information Extraction”

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

Slides from Cohen & McCallum



What is “Information Extraction”

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

Slides from Cohen & McCallum



What is “Information Extraction”

As a family of techniques:

Information Extraction =

segmentation + classification + clustering + association

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation

Slides from Cohen & McCallum



What is “Information Extraction”

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation

Slides from Cohen & McCallum




NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

What is “Information Extraction”

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation


Slides from Cohen & McCallum



IE in Context

Pipeline: create ontology → spider the document collection → filter by relevance → IE (segment, classify, associate, cluster) → load database → query / search and data mine. Supporting steps: label training data and train the extraction models.

Slides from Cohen & McCallum



IE History

Pre-Web

  • Mostly news articles

    • De Jong’s FRUMP [1982]

      • Hand-built system to fill Schank-style “scripts” from news wire

    • Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96]

  • Most early work dominated by hand-built models

    • E.g. SRI’s FASTUS, hand-built FSMs.

    • But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]

      Web

  • AAAI ’94 Spring Symposium on “Software Agents”

    • Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.

  • Tom Mitchell’s WebKB, ‘96

    • Build KB’s from the Web.

  • Wrapper Induction

    • First by hand, then ML: [Doorenbos ‘96], [Soderland ’96], [Kushmerick ’97], …

Slides from Cohen & McCallum



What makes IE from the Web Different?

Less grammar, but more formatting & linking

Newswire

Web

www.apple.com/retail

Apple to Open Its First Retail Store in New York City

MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.

"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

www.apple.com/retail/soho

www.apple.com/retail/soho/theatre.html

The directory structure, link structure, formatting & layout of the Web is its own new grammar.

Slides from Cohen & McCallum



Landscape of IE Tasks (1/4): Pattern Feature Domain

Text paragraphs without formatting

Grammatical sentences and some formatting & links

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Non-grammatical snippets, rich formatting & links

Tables

Slides from Cohen & McCallum



Landscape of IE Tasks (2/4): Pattern Scope

Web site specific (formatting cues), e.g. Amazon.com book pages

Genre specific (layout cues), e.g. resumes

Wide, non-specific (language cues), e.g. university names

Slides from Cohen & McCallum



Landscape of IE Tasks (3/4): Pattern Complexity

E.g. word patterns:

Closed set (e.g. U.S. states): “He was born in Alabama…”; “The big Wyoming sky…”

Regular set (e.g. U.S. phone numbers): “Phone: (413) 545-1323”; “The CALD main office can be reached at 412-268-1299”

Complex pattern (e.g. U.S. postal addresses): “University of Arkansas, P.O. Box 140, Hope, AR 71802”; “Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210”

Ambiguous patterns, needing context and many sources of evidence (e.g. person names): “…was among the six houses sold by Hope Feldman that year.”; “Pawel Opalinski, Software Engineer at WhizBang Labs.”

Slides from Cohen & McCallum



Landscape of IE Tasks (4/4): Pattern Combinations

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity (“named entity” extraction):
Person: Jack Welch
Person: Jeffrey Immelt
Location: Connecticut

Binary relationship:
Relation: Person-Title; Person: Jack Welch; Title: CEO
Relation: Company-Location; Company: General Electric; Location: Connecticut

N-ary record:
Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt

Slides from Cohen & McCallum



Evaluation of Single Entity Extraction

TRUTH:

Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.

PRED:

Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.

Precision = (# correctly predicted segments) / (# predicted segments) = 2 / 6

Recall = (# correctly predicted segments) / (# true segments) = 2 / 4

F1 = harmonic mean of Precision & Recall = 1 / ( ((1/P) + (1/R)) / 2 )
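As a concrete illustration of these definitions (a sketch added here, not part of the original slides), segment-level precision, recall and F1 can be computed over exact span matches:

```python
def prf1(true_segments, predicted_segments):
    # Segments are (start, end) spans; a prediction counts as correct only
    # if it matches a true span exactly, as in the example above
    # (2 of 6 predictions correct, 2 of 4 true segments found).
    true_set, pred_set = set(true_segments), set(predicted_segments)
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```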

Slides from Cohen & McCallum



State of the Art Performance

  • Named entity recognition

    • Person, Location, Organization, …

    • F1 in high 80’s or low- to mid-90’s

  • Binary relation extraction

    • Contained-in (Location1, Location2)Member-of (Person1, Organization1)

    • F1 in 60’s or 70’s or 80’s

  • Wrapper induction

    • Extremely accurate performance obtainable

    • Human effort (~30min) required on each site

Slides from Cohen & McCallum


Landscape of IE Techniques (1/1): Models

Each model family is illustrated on the sentence “Abraham Lincoln was born in Kentucky.”:

  • Lexicons: is the candidate string a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
  • Sliding window: a classifier asks “which class?” for each window, trying alternate window sizes.
  • Classify pre-segmented candidates: a classifier asks “which class?” for each candidate segment.
  • Boundary models: classifiers place BEGIN and END marks, which are later paired.
  • Finite state machines: find the most likely state sequence.
  • Context free grammars: find the most likely parse (NNP, V, P, NP, VP, PP, S).
  • …and beyond

Any of these models can be used to capture words, formatting or both.

Slides from Cohen & McCallum



Landscape: Focus of this Tutorial

Pattern complexity: closed set, regular, complex, ambiguous
Pattern feature domain: words, words + formatting, formatting
Pattern scope: site-specific, genre-specific, general
Pattern combinations: entity, binary, n-ary
Models: lexicon, regex, window, boundary, FSM, CFG

Slides from Cohen & McCallum



Sliding Windows

Slides from Cohen & McCallum



Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

School of Computer Science

Carnegie Mellon University

3:30 pm

7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

Slides from Cohen & McCallum





A “Naïve Bayes” Sliding Window Model

[Freitag 1997]

00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun

prefix: w t-m … w t-1
contents: w t … w t+n
suffix: w t+n+1 … w t+n+m

Estimate Pr(LOCATION|window) using Bayes rule

Try all “reasonable” windows (vary length, position)

Assume independence for length, prefix words, suffix words, content words

Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)

If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
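A minimal sketch of such a window scorer (not code from the tutorial): it assumes precomputed probability tables for length, prefix words, content words and suffix words, and returns an unnormalized Naïve Bayes log-score for a candidate window.

```python
import math

def nb_window_score(prefix, contents, suffix, stats):
    # stats is an assumed dict of probability tables estimated from
    # labeled training windows: p_length, p_prefix, p_contents, p_suffix.
    score = math.log(stats["p_length"].get(len(contents), 1e-6))
    for w in prefix:
        score += math.log(stats["p_prefix"].get(w, 1e-6))
    for w in contents:
        score += math.log(stats["p_contents"].get(w, 1e-6))
    for w in suffix:
        score += math.log(stats["p_suffix"].get(w, 1e-6))
    return score

def extract_locations(tokens, stats, threshold, max_len=5, ctx=2):
    # Try all "reasonable" windows (vary position and length) and keep
    # those whose score clears the threshold.
    hits = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            s = nb_window_score(tokens[max(0, i - ctx):i], tokens[i:j],
                                tokens[j:j + ctx], stats)
            if s > threshold:
                hits.append((i, j, s))
    return hits
```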

Other examples of sliding window: [Baluja et al 2000]

(decision tree over individual words & their context)

Slides from Cohen & McCallum



“Naïve Bayes” Sliding Window Results

Domain: CMU UseNet Seminar Announcements

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

School of Computer Science

Carnegie Mellon University

3:30 pm

7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Field           F1
Person Name     30%
Location        61%
Start Time      98%

Slides from Cohen & McCallum



Realistic sliding-window-classifier IE

  • What windows to consider?

    • all windows containing as many tokens as the shortest example, but no more tokens than the longest example

  • How to represent a classifier? It might:

    • Restrict the length of window;

    • Restrict the vocabulary or formatting used before/after/inside window;

    • Restrict the relative order of tokens, etc.

  • Learning Method

    • SRV: Top-Down Rule Learning [Freitag AAAI ‘98]

    • Rapier: Bottom-Up [Califf & Mooney, AAAI ‘99]

Slides from Cohen & McCallum



Rapier: results – precision/recall

Slides from Cohen & McCallum



Rule-learning approaches to sliding-window classification: Summary

  • SRV, Rapier, and WHISK [Soderland KDD ‘97]

    • Representations for classifiers allow restriction of the relationships between tokens, etc

    • Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)

    • Use of these “heavyweight” representations is complicated, but seems to pay off in results

  • Can simpler representations for classifiers work?

Slides from Cohen & McCallum



BWI: Learning to detect boundaries

[Freitag & Kushmerick, AAAI 2000]

  • Another formulation: learn three probabilistic classifiers:

    • START(i) = Prob( position i starts a field)

    • END(j) = Prob( position j ends a field)

    • LEN(k) = Prob( an extracted field has length k)

  • Then score a possible extraction (i,j) by

    START(i) * END(j) * LEN(j-i)

  • LEN(k) is estimated from a histogram
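A minimal sketch of this scoring rule (table names and layout are assumptions, not from the original slides): START and END are per-position probabilities from the boundary detectors, and LEN comes from the length histogram.

```python
def bwi_score(start_prob, end_prob, len_hist, i, j):
    # Score candidate extraction (i, j) as START(i) * END(j) * LEN(j - i).
    return start_prob[i] * end_prob[j] * len_hist.get(j - i, 0.0)

def best_extraction(start_prob, end_prob, len_hist, max_len=10):
    # Enumerate candidate (i, j) pairs and keep the highest-scoring one.
    best_span, best = None, 0.0
    n = len(start_prob)
    for i in range(n):
        for j in range(i + 1, min(i + 1 + max_len, n)):
            s = bwi_score(start_prob, end_prob, len_hist, i, j)
            if s > best:
                best_span, best = (i, j), s
    return best_span, best
```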

Slides from Cohen & McCallum



BWI: Learning to detect boundaries

  • BWI uses boosting to find “detectors” for START and END

  • Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i).

  • Each “pattern” is a sequence of

    • tokens and/or

    • wildcards like: anyAlphabeticToken, anyNumber, …

  • Weak learner for “patterns” uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE,AFTER patterns
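A sketch of how one such weak detector might be applied (wildcard names follow the slide; the implementation is illustrative, not BWI's actual code):

```python
WILDCARDS = {
    "anyAlphabeticToken": str.isalpha,
    "anyNumber": lambda t: t.replace(".", "", 1).isdigit(),
}

def pattern_matches(pattern, window):
    # A pattern is a sequence of literal tokens and/or wildcards.
    if len(window) != len(pattern):
        return False
    for p, tok in zip(pattern, window):
        test = WILDCARDS.get(p)
        if test is not None:
            if not test(tok):
                return False
        elif p != tok:
            return False
    return True

def weak_detector_fires(before, after, tokens, i):
    # A weak START detector fires at position i when its BEFORE pattern
    # matches the tokens just before i and its AFTER pattern matches the
    # tokens from i onward, e.g. before=["Speaker", ":"],
    # after=["anyAlphabeticToken"].
    return (i - len(before) >= 0
            and i + len(after) <= len(tokens)
            and pattern_matches(before, tokens[i - len(before):i])
            and pattern_matches(after, tokens[i:i + len(after)]))
```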

Slides from Cohen & McCallum



BWI: Learning to detect boundaries

Field           F1
Person Name     30%
Location        61%
Start Time      98%

Slides from Cohen & McCallum



Problems with Sliding Windows and Boundary Finders

  • Decisions in neighboring parts of the input are made independently from each other.

    • Naïve Bayes Sliding Window may predict a “seminar end time” before the “seminar start time”.

    • It is possible for two overlapping windows to both be above threshold.

    • In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.

Slides from Cohen & McCallum



Finite State Machines

Slides from Cohen & McCallum



Hidden Markov Models (HMMs)

standard sequence model in genomics, speech, NLP, …

Finite state model / graphical model: a chain of states S t-1, S t, S t+1, … linked by transitions, each emitting an observation O t-1, O t, O t+1, …

Generates a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.

Parameters: for all states S={s1,s2,…}

Start state probabilities: P(st )

Transition probabilities: P(st|st-1 )

Observation (emission) probabilities: P(ot|st )

Training:

Maximize probability of training observations (w/ prior)

Usually a multinomial over atomic, fixed alphabet

Slides from Cohen & McCallum



IE with Hidden Markov Models

Given a sequence of observations:

Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM:

person name

location name

background

Find the most likely state sequence: (Viterbi)

Yesterday Pedro Domingos spoke this example sentence.

Any words said to be generated by the designated “person name” state extract as a person name:

Person name: Pedro Domingos
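A standard Viterbi sketch for finding that most likely state sequence (the log-probability tables are assumed inputs; this is not code from the tutorial):

```python
def viterbi(obs, states, log_start, log_trans, log_emit):
    # log_start[s], log_trans[s1][s2] and log_emit[s][word] are assumed
    # log-probability tables; unseen words get a small floor value.
    V = [{s: log_start[s] + log_emit[s].get(obs[0], -20.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(((p, V[t - 1][p] + log_trans[p][s])
                               for p in states), key=lambda x: x[1])
            V[t][s] = score + log_emit[s].get(obs[t], -20.0)
            back[t][s] = prev
    # Trace back the best path from the best final state.
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```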

Slides from Cohen & McCallum



HMM Example: “Nymble”

[Bikel, et al 1998], [BBN “IdentiFinder”]

Task: Named Entity Extraction

One state per name class (Person, Org, and five other name classes) plus Other, with start-of-sentence and end-of-sentence states.

Transition probabilities: P(st | st-1, ot-1), backing off to P(st | st-1), then P(st).

Observation probabilities: P(ot | st, st-1) or P(ot | st, ot-1), backing off to P(ot | st), then P(ot).
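One simple way to realize such a back-off chain is to fall back to the less specific distribution whenever the more specific context is unseen; a sketch under assumed count tables (Nymble's actual weighting and discounting scheme is more involved):

```python
def backoff_emission(o_t, s_t, o_prev, counts):
    # counts["context_pairs"][(s_t, o_prev)], counts["states"][s_t] and
    # counts["unigrams"] are assumed nested count tables.
    pair = counts["context_pairs"].get((s_t, o_prev), {})
    if o_t in pair:
        return pair[o_t] / sum(pair.values())
    state = counts["states"].get(s_t, {})
    if o_t in state:
        return state[o_t] / sum(state.values())
    total = sum(counts["unigrams"].values())
    return counts["unigrams"].get(o_t, 0) / total if total else 0.0
```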

Train on ~500k words of news wire text.

Results:

Case     Language    F1
Mixed    English     93%
Upper    English     91%
Mixed    Spanish     90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]

Slides from Cohen & McCallum



We want More than an Atomic View of Words

Would like richer representation of text:

many arbitrary, overlapping features of the words.

Candidate features of the word at position t include: identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; last person name was female; next two words are “and Associates”.

For example, the token “Wisniewski” is part of a noun phrase and ends in “-ski”.

Slides from Cohen & McCallum



Problems with Richer Representationand a Joint Model

These arbitrary features are not independent.

  • Multiple levels of granularity (chars, words, phrases)

  • Multiple dependent modalities (words, formatting, layout)

  • Past & future

    Two choices:

Ignore the dependencies.

This causes “over-counting” of evidence (ala naïve Bayes). Big problem when combining evidence, as in Viterbi!

Model the dependencies.

Each state would have its own Bayes Net. But we are already starved for training data!

Slides from Cohen & McCallum



Conditional Sequence Models

  • We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o):

    • Can examine features, but not responsible for generating them.

    • Don’t have to explicitly model their dependencies.

    • Don’t “waste modeling effort” trying to generate what we are given at test time anyway.

Slides from Cohen & McCallum



From HMMs to CRFs

Conditional Finite State Sequence Models

[McCallum, Freitag & Pereira, 2000]

[Lafferty, McCallum, Pereira 2001]

Joint (HMM): the state chain S t-1, S t, S t+1, … generates the observations O t-1, O t, O t+1, …

Conditional: the same state chain is instead conditioned on the observations.

(A super-special case of Conditional Random Fields.)
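The equations on this slide did not survive the transcript; as a reconstruction of the standard forms being contrasted (not copied from the slide), with s the state sequence and o the observation sequence:

```latex
% Joint model (HMM):
P(\mathbf{s}, \mathbf{o}) = \prod_{t} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)

% Conditional chain model (a super-special case of CRFs):
P(\mathbf{s} \mid \mathbf{o}) = \prod_{t} P(s_t \mid s_{t-1}, o_t),
\qquad
P(s_t \mid s_{t-1}, o_t) =
  \frac{\exp\!\big(\textstyle\sum_k \lambda_k f_k(s_t, o_t)\big)}{Z(o_t, s_{t-1})}
```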

Slides from Cohen & McCallum



Conditional Random Fields

[Lafferty, McCallum, Pereira 2001]

1. FSM special-case:

linear chain among unknowns, parameters tied across time steps.

States St, St+1, St+2, St+3, St+4 form the linear chain, conditioned on O = Ot, Ot+1, Ot+2, Ot+3, Ot+4.

2. In general:

CRFs = "Conditionally-trained Markov Network"

arbitrary structure among unknowns

3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]:

Parameters tied across hits from SQL-like queries ("clique templates")

4. Markov Logic Networks [Richardson & Domingos]:

Weights on arbitrary logical sentences

Slides from Cohen & McCallum



Feature Functions

o =

Yesterday Pedro Domingos spoke this example sentence.

o1 o2 o3 o4 o5 o6 o7

s1

s2

s3

s4
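The feature-function formula on this slide is not preserved in the transcript; a hedged sketch of what such functions look like over this example (the state names "background" and "person-name" and the weights are assumptions):

```python
import math

# The observation sequence from the slide (o1 ... o7).
o = ["Yesterday", "Pedro", "Domingos", "spoke", "this", "example", "sentence."]

def f_capitalized_person(s_prev, s_curr, o, t):
    # Fires when a capitalized word is labeled as part of a person name.
    return 1.0 if s_curr == "person-name" and o[t][0].isupper() else 0.0

def f_background_to_person(s_prev, s_curr, o, t):
    # Fires on a transition from background text into a person name.
    return 1.0 if s_prev == "background" and s_curr == "person-name" else 0.0

def unnormalized_score(s, o, features, weights):
    # exp( sum over t and k of lambda_k * f_k(s_{t-1}, s_t, o, t) );
    # dividing by the normalizer Z(o) would give the CRF's P(s | o).
    total = 0.0
    for t in range(1, len(o)):
        for f, lam in zip(features, weights):
            total += lam * f(s[t - 1], s[t], o, t)
    return math.exp(total)
```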

Slides from Cohen & McCallum



Learning Parameters of CRFs

Maximize log-likelihood of parameters Λ = {λk} given training data D

Log-likelihood gradient:
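The gradient expression itself is missing from this transcript; the standard form for a linear-chain CRF with a Gaussian prior (a reconstruction, not the slide's own image) is:

```latex
\frac{\partial \mathcal{L}}{\partial \lambda_k}
  = \sum_{d \in D} C_k(\mathbf{s}^{(d)}, \mathbf{o}^{(d)})
  - \sum_{d \in D} \sum_{\mathbf{s}} P_\Lambda(\mathbf{s} \mid \mathbf{o}^{(d)}) \, C_k(\mathbf{s}, \mathbf{o}^{(d)})
  - \frac{\lambda_k}{\sigma^2},
\qquad
C_k(\mathbf{s}, \mathbf{o}) = \sum_t f_k(s_{t-1}, s_t, \mathbf{o}, t)
```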

  • Methods:

  • iterative scaling (quite slow)

  • conjugate gradient (much faster)

  • limited-memory quasi-Newton methods, BFGS (super fast)

[Sha & Pereira 2002] & [Malouf 2002]

Slides from Cohen & McCallum



General CRFs vs. HMMs

  • More general and expressive modeling technique

  • Comparable computational efficiency

  • Features may be arbitrary functions of any or all observations

  • Parameters need not fully specify generation of observations; require less training data

  • Easy to incorporate domain knowledge

  • State means only “state of process”, vs “state of process” and “observational history I’m keeping”

Slides from Cohen & McCallum



Person name Extraction

[McCallum 2001, unpublished]

Slides from Cohen & McCallum



Person name Extraction

Slides from Cohen & McCallum


Features in Experiment

Capitalized: Xxxxx
Mixed Caps: XxXxxx
All Caps: XXXXX
Initial Cap: X….
Contains Digit: xxx5
All lowercase: xxxx
Initial: X
Punctuation: . , : ; ! ( ) etc.
Period: .
Comma: ,
Apostrophe: ‘
Dash: -
Preceded by HTML tag
Character n-gram classifier says string is a person name (80% accurate)
In stopword list (the, of, their, etc.)
In honorific list (Mr, Mrs, Dr, Sen, etc.)
In person suffix list (Jr, Sr, PhD, etc.)
In name particle list (de, la, van, der, etc.)
In Census lastname list; segmented by P(name)
In Census firstname list; segmented by P(name)
In locations lists (states, cities, countries)
In company name list (“J. C. Penny”)
In list of company suffixes (Inc, & Associates, Foundation)


Hand-built FSM person-name extractor says yes, (prec/recall ~ 30/95)

Conjunctions of all previous feature pairs, evaluated at the current time step.

Conjunctions of all previous feature pairs, evaluated at current step and one step ahead.

All previous features, evaluated two steps ahead.

All previous features, evaluated one step behind.

Total number of features = ~500k
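A sketch of how a few of these binary features and their offset conjunctions might be computed for one token (lexicon contents, feature names and offsets are illustrative, not the experiment's exact feature set):

```python
import re

HONORIFICS = {"Mr", "Mrs", "Dr", "Sen"}
NAME_PARTICLES = {"de", "la", "van", "der"}

def token_features(tok):
    # A handful of the surface and lexicon tests listed above.
    tests = {
        "capitalized": bool(re.match(r"^[A-Z][a-z]+$", tok)),
        "all-caps": tok.isupper(),
        "contains-digit": any(c.isdigit() for c in tok),
        "all-lowercase": tok.islower(),
        "initial": bool(re.match(r"^[A-Z]\.?$", tok)),
        "punctuation": bool(re.match(r"^[.,:;!()'-]+$", tok)),
        "in-honorific-list": tok.rstrip(".") in HONORIFICS,
        "in-name-particle-list": tok.lower() in NAME_PARTICLES,
    }
    return {name for name, fired in tests.items() if fired}

def features_at(tokens, t):
    # Offset the basic features (-1, 0, +1) and conjoin feature pairs,
    # mirroring the conjunctions described above.
    feats = {f"{name}@0" for name in token_features(tokens[t])}
    if t > 0:
        feats |= {f"{name}@-1" for name in token_features(tokens[t - 1])}
    if t + 1 < len(tokens):
        feats |= {f"{name}@+1" for name in token_features(tokens[t + 1])}
    feats |= {f"{a}&{b}" for a in feats for b in feats if a < b}
    return feats
```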

Slides from Cohen & McCallum



Training and Testing

  • Trained on 65k words from 85 pages, 30 different companies’ web sites.

  • Training takes 4 hours on a 1 GHz Pentium.

  • Training precision/recall is 96% / 96%.

  • Tested on different set of web pages with similar size characteristics.

  • Testing precision is 92 – 95%, recall is 89 – 91%.

Slides from Cohen & McCallum



Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was

slightly below 1994. Producer returns averaged $12.93 per hundredweight,

$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,

1 percent above 1994. Marketings include whole milk sold to plants and dealers

as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced,

8 percent less than 1994. Calves were fed 78 percent of this milk with the

remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat:

United States, 1993-95

--------------------------------------------------------------------------------

: : Production of Milk and Milkfat 2/

: Number :-------------------------------------------------------

Year : of : Per Milk Cow : Percentage : Total

:Milk Cows 1/:-------------------: of Fat in All :------------------

: : Milk : Milkfat : Milk Produced : Milk : Milkfat

--------------------------------------------------------------------------------

: 1,000 Head --- Pounds --- Percent Million Pounds

:

1993 : 9,589 15,704 575 3.66 150,582 5,514.4

1994 : 9,500 16,175 592 3.66 153,664 5,623.7

1995 : 9,461 16,451 602 3.66 155,644 5,694.3

--------------------------------------------------------------------------------

1/ Average number during year, excluding heifers not yet fresh.

2/ Excludes milk sucked by calves.

Slides from Cohen & McCallum



Table Extraction from Government Reports

[Pinto, McCallum, Wei, Croft, 2003]

100+ documents from www.fedstats.gov

Labels:


  • Non-Table

  • Table Title

  • Table Header

  • Table Data Row

  • Table Section Data Row

  • Table Footnote

  • ... (12 in all)


Features:

  • Percentage of digit chars

  • Percentage of alpha chars

  • Indented

  • Contains 5+ consecutive spaces

  • Whitespace in this line aligns with prev.

  • ...

  • Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
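A sketch of how such per-line features might be computed (thresholds and the alignment test are illustrative assumptions, not the paper's exact definitions):

```python
import re

def line_features(line, prev_line=""):
    n = max(len(line), 1)
    return {
        "pct-digit-chars": sum(c.isdigit() for c in line) / n,
        "pct-alpha-chars": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith("    "),
        "has-5plus-consecutive-spaces": "     " in line,
        # Crude alignment test: a run of whitespace in this line starts at
        # a column that is also whitespace in the previous line.
        "aligns-with-prev": bool(prev_line) and any(
            m.start() < len(prev_line) and prev_line[m.start()] == " "
            for m in re.finditer(r"\s{2,}", line)),
    }
```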

Slides from Cohen & McCallum



Table Extraction Experimental Results

[Pinto, McCallum, Wei, Croft, 2003]

Line labels, percent correct:

HMM                       65%
Stateless MaxEnt          85%
CRF w/out conjunctions    52%
CRF                       95%

Slides from Cohen & McCallum



Named Entity Recognition

Reuters stories on international news. Train on ~300k words.

CRICKET - MILLNS SIGNS FOR BOLAND

CAPE TOWN 1996-08-22

South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract.

Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels and examples:

PER: Yayuk Basuki, Innocent Butare
ORG: 3M, KDP, Leicestershire
LOC: Leicestershire, Nirmal Hriday, The Oval
MISC: Java, Basque, 1,000 Lakes Rally

Slides from Cohen & McCallum



Automatically Induced Features

[McCallum 2003]

Index   Feature
0       inside-noun-phrase (ot-1)
5       stopword (ot)
20      capitalized (ot+1)
75      word=the (ot)
100     in-person-lexicon (ot-1)
200     word=in (ot+2)
500     word=Republic (ot+1)
711     word=RBI (ot) & header=BASEBALL
1027    header=CRICKET (ot) & in-English-county-lexicon (ot)
1298    company-suffix-word (firstmention t+2)
4040    location (ot) & POS=NNP (ot) & capitalized (ot) & stopword (ot-1)
4945    moderately-rare-first-name (ot-1) & very-common-last-name (ot)
4474    word=the (ot-2) & word=of (ot)

Slides from Cohen & McCallum



Named Entity Extraction Results

[McCallum & Li, 2003]

Method                                                   F1     # parameters
BBN's IdentiFinder, word features                        79%    ~500k
CRFs, word features, w/out feature induction             80%    ~500k
CRFs, many features, w/out feature induction             75%    ~3 million
CRFs, many candidate features, with feature induction    90%    ~60k

Slides from Cohen & McCallum



Inducing State-Transition Structure

[Chidlovskii, 2000]

K-reversible grammars

Structure learning for HMMs + IE

[Seymore et al 1999]

[Freitag & McCallum 2000]

Slides from Cohen & McCallum



Limitations of Finite State Models

  • Finite state models have a linear structure

  • Web documents have a hierarchical structure

    • Are we suffering by not modeling this structure more explicitly?

  • How can one learn a hierarchical extraction model?

Slides from Cohen & McCallum



IE Resources

  • Data

    • RISE, http://www.isi.edu/~muslea/RISE/index.html

    • Linguistic Data Consortium (LDC)

      • Penn Treebank, Named Entities, Relations, etc.

    • http://www.biostat.wisc.edu/~craven/ie

    • http://www.cs.umass.edu/~mccallum/data

  • Code

    • TextPro, http://www.ai.sri.com/~appelt/TextPro

    • MALLET, http://www.cs.umass.edu/~mccallum/mallet

    • SecondString, http://secondstring.sourceforge.net/

  • Both

    • http://www.cis.upenn.edu/~adwait/penntools.html

Slides from Cohen & McCallum



References

  • [Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In Proceedings of ANLP’97, p194-201.

  • [Califf & Mooney 1999], Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).

  • [Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002)

  • [Cohen, Kautz, McAllester 2000] Cohen, W; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).

  • [Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, in Proceedings of ACM SIGMOD-98.

  • [Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language, ACM Transactions on Information Systems, 18(3).

  • [Cohen, 2000b] Cohen, W. Automatically Extracting Features for Concept Learning from the Web, Machine Learning: Proceedings of the Seventeeth International Conference (ML-2000).

  • [Collins & Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.

  • [De Jong 1982] De Jong, G. An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, 149-176.

  • [Freitag 98] Freitag, D: Information extraction from HTML: application of a general machine learning approach, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).

  • [Freitag, 1999], Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.

  • [Freitag 2000], Freitag, D: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101 (2000).

  • Freitag & Kushmerick, 1999] Freitag, D; Kushmerick, D.: Boosted Wrapper Induction. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99)

  • [Freitag & McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinakge. In Proceedings AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.

  • [Kushmerick, 2000] Kushmerick, N: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118(pp 15-68).

  • [Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001.

  • [Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master’s thesis. UC San Diego.

  • [McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira. F., Maximum entropy Markov models for information extraction and segmentation, In Proceedings of ICML-2000

  • [Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226 - 233.

Slides from Cohen & McCallum

