Information Extraction



Information Extraction

CS511 - UIUC - Fall 2005 Group Presentation: Hieu Le Nhung Nguyen Mayssam Sayyadian

(With most of slides from AnHai Doan, D.W. Embley, D.M. Campbell, Y.S. Jiang, Y.-K. Ng, R.D. Smith, S.W. Liddle, D.W. Quass, William W. Cohen & Scott Wen-tau Yih)


Road map

Road Map

  • Motivating Problem

  • What is Information Extraction (IE)?

  • IE History

  • Landscape of problems and solutions

  • Techniques in IE

  • Conclusion



Example: The Problem

Martin Baker, a person

Genomics job

Employer's job posting form



Example: A Solution


Extracting Job Openings from the Web

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1




Job Openings:

Category = Food Services

Keyword = Baker

Location = Continental U.S.



IE from Research Papers



What is “Information Extraction”

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION



What is “Information Extraction”

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..



What is “Information Extraction”

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation

aka “named entity extraction”



What is “Information Extraction”

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft Corporation

CEO

Bill Gates

Microsoft

Gates

Microsoft

Bill Veghte

Microsoft

VP

Richard Stallman

founder

Free Software Foundation



What is “Information Extraction”

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…




NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

What is “Information Extraction”

As a family of techniques:

Information Extraction =

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…





IE History

Pre-Web

  • Mostly news articles

    • De Jong’s FRUMP [1982]

      • Hand-built system to fill Schank-style “scripts” from news wire

    • Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96]

  • Early work dominated by hand-built models

    • E.g. SRI’s FASTUS, hand-built FSMs.

    • But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]

      Web

  • AAAI ’94 Spring Symposium on “Software Agents”

    • Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.

  • Tom Mitchell’s WebKB, ‘96

    • Build KB’s from the Web.

  • Wrapper Induction

    • Initially hand-built, then ML: [Soderland ’96], [Kushmerick ’97], …

    • Citeseer; Cora; FlipDog; contEd courses, corpInfo, …



IE History

Biology

  • Gene/protein entity extraction

  • Protein/protein interaction facts

  • Automated curation/integration of databases

    • At CMU: SLIF (Murphy et al, subcellular information from images + text in journal articles)

      Email

  • EPCA, PAL, RADAR, CALO: intelligent office assistant that “understands” some part of email

    • At CMU: web site update requests, office-space requests; calendar scheduling requests; social network analysis of email.



IE is different in different domains!

Example: on the Web there is less grammar, but more formatting & linking

Newswire

Web

www.apple.com/retail

Apple to Open Its First Retail Store in New York City

MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.

"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

www.apple.com/retail/soho

www.apple.com/retail/soho/theatre.html

The directory structure, link structure, formatting & layout of the Web is its own new grammar.



Landscape of IE Tasks (1/4): Degree of Formatting

Text paragraphs without formatting

Grammatical sentences and some formatting & links

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Non-grammatical snippets, rich formatting & links

Tables



Landscape of IE Tasks (2/4): Intended Breadth of Coverage

Web site specific    Formatting   Amazon.com Book Pages
Genre specific       Layout       Resumes
Wide, non-specific   Language     University Names



Landscape of IE Tasks (3/4): Complexity

E.g. word patterns:

Closed set (U.S. states):
  He was born in Alabama…
  The big Wyoming sky…

Regular set (U.S. phone numbers):
  Phone: (413) 545-1323
  The CALD main office can be reached at 412-268-1299

Complex pattern (U.S. postal addresses):
  University of Arkansas
  P.O. Box 140
  Hope, AR 71802

  Headquarters:
  1128 Main Street, 4th Floor
  Cincinnati, Ohio 45210

Ambiguous patterns, needing context and many sources of evidence (person names):
  …was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.
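The two easy ends of this spectrum can be made concrete: a closed set is a lookup, and a regular set yields to a hand-written pattern. A minimal sketch (the regex and the abbreviated state list are illustrative, not complete):

```python
import re

# Regular set: a hand-written pattern for U.S. phone numbers.
PHONE = re.compile(r"\(?\b(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})\b")
# Closed set: membership in a fixed list (abbreviated here).
STATES = {"Alabama", "Wyoming"}

def extract(text):
    phones = ["-".join(m.groups()) for m in PHONE.finditer(text)]
    states = [s for s in STATES if s in text]
    return phones, states

print(extract("Phone: (413) 545-1323. He was born in Alabama..."))
# → (['413-545-1323'], ['Alabama'])
```

Complex and ambiguous patterns resist this treatment, which is why the learning techniques later in the talk are needed.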



Landscape of IE Tasks (4/4): Single Field/Record

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity (“named entity” extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut

Binary relationship:
  Relation: Person-Title
  Person: Jack Welch
  Title: CEO

  Relation: Company-Location
  Company: General Electric
  Location: Connecticut

N-ary record:
  Relation: Succession
  Company: General Electric
  Title: CEO
  Out: Jack Welch
  In: Jeffrey Immelt



Why is IE Difficult?

  • Different Languages

    • Morphology is very easy in English, much harder in German and Hebrew.

    • Identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese and Japanese.

    • Some languages offer orthographic cues such as capitalization (like English), while others (like Hebrew and Arabic) do not.

  • Different types of style

    • Scientific papers

    • Newspapers

    • Memos

    • Emails

    • Speech transcripts

  • Type of Document

    • Tables

    • Graphics

    • Small messages vs. Books



Techniques in IE



Overview of Techniques in IE

  • Wrapper Construction

  • NLP oriented

  • Conceptual Model Based



Overview: Wrapper Construction Approach

  • Depends heavily on the structure of HTML pages

  • Different wrappers for different sources even of the same domain

  • Wrapper maintenance is a big problem



Overview: NLP Approach

  • Based heavily on NLP & machine learning techniques

  • Acquires a deep understanding of the sentence.

  • Prefers complete sentences over partial ones



Overview: Conceptual Model Based

  • New

  • Depends on AI techniques: knowledge representation.

  • More detail is coming



Overview of some ML techniques in IE

  • Rote Learning

  • Sliding window*

  • Naïve Bayes

  • HMM

  • Maximum Entropy (ME)

  • SRV

  • Multistrategy*



Rote Learning

  • Remember everything

  • Good for Person Name, City Name, Country Name



Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

School of Computer Science

Carnegie Mellon University

3:30 pm

7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement





Naïve Bayes (1)

  • Simple but powerful: highly recommended

  • Makes a strong assumption: all features contributing to a label are conditionally independent



A “Naïve Bayes” Sliding Window Model

[Freitag 1997]

…:00 pm  Place: Wean Hall Rm 5409  Speaker: Sebastian Thrun

prefix: w_{t-m} … w_{t-1}    contents: w_t … w_{t+n}    suffix: w_{t+n+1} … w_{t+n+m}

Estimate Pr(LOCATION|window) using Bayes rule

Try all “reasonable” windows (vary length, position)

Assume independence for length, prefix words, suffix words, content words

Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)

If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
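The scoring recipe above can be sketched in a few lines. All probabilities below are invented for illustration; Freitag's system estimates them from labeled training data, and a full version also models window length and suffix words:

```python
import math

# Toy Naive-Bayes window scoring: log Pr(LOCATION) plus independent
# log-probability terms for each prefix word and each content word.
prefix_p  = {"Place": 0.5, ":": 0.3}   # Pr(word in prefix | LOCATION)
content_p = {"Wean": 0.2, "Hall": 0.2, "Rm": 0.1, "5409": 0.05}
DEFAULT   = 1e-4                        # smoothed value for unseen words

def window_score(prefix, content):
    score = math.log(0.1)               # prior Pr(LOCATION)
    for w in prefix:
        score += math.log(prefix_p.get(w, DEFAULT))
    for w in content:
        score += math.log(content_p.get(w, DEFAULT))
    return score

tokens = "pm Place : Wean Hall Rm 5409 Speaker".split()
# slide a fixed 4-token window (with a 2-token prefix) over the text
best_i = max(range(2, len(tokens) - 3),
             key=lambda i: window_score(tokens[i-2:i], tokens[i:i+4]))
print(tokens[best_i:best_i+4])
# → ['Wean', 'Hall', 'Rm', '5409']
```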



“Naïve Bayes” Sliding Window Results

Domain: CMU UseNet Seminar Announcements

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

School of Computer Science

Carnegie Mellon University

3:30 pm

7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Field         F1
Person Name:  30%
Location:     61%
Start Time:   98%



Hidden Markov Model (HMM)

  • A very popular ML algorithm in NLP and Speech Recognition.

  • Assumption: the label of a state depends only on the observation at that state and on the states of a limited number of previous states.



IE with Hidden Markov Models

Given a sequence of observations:

Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM:

person name

location name

background

Find the most likely state sequence: (Viterbi)

Yesterday Pedro Domingos spoke this example sentence.

Any words said to be generated by the designated “person name” state are extracted as a person name:

Person name: Pedro Domingos
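The Viterbi decode above can be illustrated with a toy two-state model. All parameters here are invented for illustration; a real HMM would be trained, and the name lexicon stands in for proper emission distributions:

```python
# Toy two-state Viterbi decode: "bg" (background) vs. "person".
STATES = ["bg", "person"]
START  = {"bg": 0.8, "person": 0.2}
TRANS  = {"bg": {"bg": 0.8, "person": 0.2},
          "person": {"bg": 0.3, "person": 0.7}}

def emit(state, word):
    known = {"Pedro", "Domingos"}       # stand-in for a name lexicon
    if state == "person":
        return 0.4 if word in known else 0.01
    return 0.02 if word in known else 0.1

def viterbi(words):
    # V[t][s] = (best probability of any path ending in s, that path)
    V = [{s: (START[s] * emit(s, words[0]), [s]) for s in STATES}]
    for w in words[1:]:
        row = {}
        for s in STATES:
            p, path = max(((V[-1][r][0] * TRANS[r][s] * emit(s, w), V[-1][r][1])
                           for r in STATES), key=lambda x: x[0])
            row[s] = (p, path + [s])
        V.append(row)
    return max(V[-1].values(), key=lambda x: x[0])[1]

words = "Yesterday Pedro Domingos spoke this example sentence".split()
tags = viterbi(words)
print([w for w, t in zip(words, tags) if t == "person"])
# → ['Pedro', 'Domingos']
```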



HMM for Segmentation

  • Simplest Model: One state per entity type



HMM Example: “Nymble”

[Bikel, et al 1998], [BBN “IdentiFinder”]

Task: Named Entity Extraction

States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.

Transition probabilities: P(st | st-1, ot-1)
  Back-off to: P(st | st-1), then P(st)

Observation probabilities: P(ot | st, st-1) or P(ot | st, ot-1)
  Back-off to: P(ot | st), then P(ot)

Train on ~500k words of news wire text.

Results:

Case    Language   F1
Mixed   English    93%
Upper   English    91%
Mixed   Spanish    90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]



Maximum Entropy – Algorithm

  • Learning algorithm : Maximum entropy classifier

    • Estimates probabilities based on the principle of making as few assumptions as possible

    • Pick the probability distribution that has the highest entropy



SRV: a realistic sliding-window-classifier IE system

[Freitag AAAI ’98]

  • What windows to consider?

    • all windows containing as many tokens as the shortest example, but no more tokens than the longest example

  • How to represent a classifier? It might:

    • Restrict the length of window;

    • Restrict the vocabulary or formatting used before/after/inside window;

    • Restrict the relative order of tokens;

    • Use inductive logic programming techniques to express all these…

<title>Course Information for CS213</title>

<h1>CS 213 C++ Programming</h1>



SRV: a rule-learner for sliding-window classification

  • Primitive predicates used by SRV:

    • token(X,W), allLowerCase(W), numerical(W), …

    • nextToken(W,U), previousToken(W,V)

  • HTML-specific predicates:

    • inTitleTag(W), inH1Tag(W), inEmTag(W),…

    • emphasized(W) = “inEmTag(W) or inBTag(W) or …”

    • tableNextCol(W,U) = “U is some token in the column after the column W is in”

    • tablePreviousCol(W,V), tableRowHeader(W,T),…



courseNumber(X) :-
    tokenLength(X,=,2),
    every(X, inTitle, false),
    some(X, A, <previousToken>, inTitle, true),
    some(X, B, <>, tripleton, true)

Non-primitive conditions make greedy search easier

SRV: a rule-learner for sliding-window classification

  • Non-primitive “conditions” used by SRV:

    • every(+X,f, c) = for all W in X : f(W)=c

    • some(+X, W, <f1,…,fk>, g, c)= exists W: g(fk(…(f1(W)…))=c

    • tokenLength(+X, relop,c):

    • position(+W,direction,relop, c):

      • e.g., tokenLength(X,>,4), position(W,fromEnd,<,2)



Multistrategy: consulting many experts

  • Real-world experience shows that there is no dominant ML algorithm.

  • Each ML algorithm has its own bias

  • Combining learners in a meta-learner can outperform each individual learner (though not necessarily in the worst case).



Conceptual Model Based



Motivation

  • Database-style queries are effective

    • Find red cars, 1993 or newer, < $5,000

      • Select * From Car Where Color=“red” And Year >= 1993 And Price < 5000

  • Web is not a database

    • Uses keyword search

    • Retrieves documents, not records

    • Assuming we have a range operator:

      • “red” and (1993 to 1998) and (1 to 5000)



GOAL: Query the Web like we query a database

Example: Get the year, make, model, and price for

1987 or later cars that are red or white.

Year  Make   Model     Price
----------------------------
97    CHEVY  Cavalier  11,995
94    DODGE            4,995
94    DODGE  Intrepid  10,000
91    FORD   Taurus    3,500
90    FORD   Probe
88    FORD   Escort    1,000



PROBLEM: The Web is not structured like a database.

Example:

<html>

<head>

<title>The Salt Lake Tribune Classified’s</title>

</head>

<hr>

<h4> ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her.

Previous owner heart broken! Asking only $11,995. #1415

JERRY SEINER MIDVALE, 566-3800 or 566-3888</h4>

<hr>

</html>



Making the Web Look Like a Database

  • Web Query Languages

    • Treat the Web as a graph (pages = nodes, links = edges).

    • Query the graph (e.g., Find all pages within one hop of pages with the words “Cars for Sale”).

  • Wrappers

    • Find page of interest.

    • Parse page to extract attribute-value pairs and insert them into a database.

      • Write parser by hand.

      • Use syntactic clues to generate parser semi-automatically.

    • Query the database.



Automatic Wrapper Generation

for a page of unstructured documents, rich in data and narrow in ontological breadth

Application Ontology → Ontology Parser → Constant/Keyword Matching Rules + Record-Level Objects, Relationships, and Constraints + Database Scheme

Web Page → Record Extractor → Unstructured Record Documents

Unstructured Record Documents + Constant/Keyword Matching Rules → Constant/Keyword Recognizer → Data-Record Table

Data-Record Table + Record-Level Objects, Relationships, and Constraints + Database Scheme → Database-Instance Generator → Populated Database



Application Ontology: Object-Relationship Model Instance

Car [-> object];

Car [0..1] has Model [1..*];

Car [0..1] has Make [1..*];

Car [0..1] has Year [1..*];

Car [0..1] has Price [1..*];

Car [0..1] has Mileage [1..*];

PhoneNr [1..*] is for Car [0..1];

PhoneNr [0..1] has Extension [1..*];

Car [0..*] has Feature [1..*];



Application Ontology: Data Frames

Make matches [10] case insensitive

constant

{ extract “chev”; }, { extract “chevy”; }, { extract “dodge”; },

end;

Model matches [16] case insensitive

constant

{ extract “88”; context “\bolds\S*\s*88\b”; },

end;

Mileage matches [7] case insensitive

constant

{ extract “[1-9]\d{0,2}k”; substitute “k” -> “,000”; },

keyword

“\bmiles\b”, “\bmi\b”, “\bmi\.\b”;

end;

...
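A data-frame recognizer of this shape can be sketched in a few lines. The pattern, substitution, and sample text follow the Mileage frame above; the function name and the (string, start, end) output format, which mirrors the Descriptor/String/Position table shown later, are ours:

```python
import re

# Sketch of a data-frame recognizer in the spirit of the Mileage frame:
# an extract pattern plus a substitution, reporting character positions.
MILEAGE = re.compile(r"[1-9]\d{0,2}k", re.IGNORECASE)

def recognize_mileage(text):
    hits = []
    for m in MILEAGE.finditer(text):
        value = m.group().lower().replace("k", ",000")  # substitute "k" -> ",000"
        hits.append((value, m.start(), m.end() - 1))    # (string, start, end)
    return hits

print(recognize_mileage("Red, 5 spd, only 7k miles on her."))
# → [('7,000', 17, 18)]
```

In the full system a nearby keyword match (e.g. “miles”) raises confidence that the hit really is a mileage.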



Ontology Parser

create table Car (

Car integer,

Year varchar(2),

… );

create table CarFeature (

Car integer,

Feature varchar(10));

...

Make : chevy

KEYWORD(Mileage) : \bmiles\b

...

Object: Car;

...

Car: Year [0..1];

Car: Make [0..1];

CarFeature: Car [0..*] has Feature [1..*];



Record Extractor

<html>

<h4> ‘97 CHEVY Cavalier, Red, 5 spd, … </h4>

<hr>

<h4> ‘89 CHEVY Corsica Sdn teal, auto, … </h4>

<hr>

….

</html>

#####

‘97 CHEVY Cavalier, Red, 5 spd, …

#####

‘89 CHEVY Corsica Sdn teal, auto, …

#####

...



Record Extractor: High Fan-Out Heuristic

<html>

<head>

<title>The Salt Lake Tribune … </title>

</head>

<body bgcolor=“#FFFFFF”>

<h1 align=”left”>Domestic Cars</h1>

<hr>

<h4> ‘97 CHEVY Cavalier, Red, … </h4>

<hr>

<h4> ‘89 CHEVY Corsica Sdn … </h4>

<hr>

</body>

</html>

html
├── head
│   └── title
└── body
    ├── h1
    └── … hr h4 hr h4 hr …
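The heuristic can be sketched with Python's standard html.parser. This is a simplification of the idea, not the paper's implementation: nodes are merged by tag name, and text nodes are ignored:

```python
from html.parser import HTMLParser

# High fan-out heuristic: the element with the most child elements
# is the likely record region (here, <body>).
class FanOut(HTMLParser):
    VOID = {"hr", "br", "img", "meta", "link", "input"}
    def __init__(self):
        super().__init__()
        self.stack, self.counts = [], {}
    def handle_starttag(self, tag, attrs):
        if self.stack:                     # count this element under its parent
            parent = self.stack[-1]
            self.counts[parent] = self.counts.get(parent, 0) + 1
        if tag not in self.VOID:           # void tags never get a closing tag
            self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

page = ("<html><head><title>The Salt Lake Tribune</title></head>"
        "<body><h1>Domestic Cars</h1>"
        "<hr><h4>'97 CHEVY Cavalier, Red</h4>"
        "<hr><h4>'89 CHEVY Corsica Sdn</h4><hr></body></html>")
parser = FanOut()
parser.feed(page)
print(max(parser.counts, key=parser.counts.get))
# → body
```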



Record Extractor: Record-Separator Heuristics

  • Identifiable “separator” tags

  • Highest-count tag(s)

  • Interval standard deviation

  • Ontological match

  • Repeating tag patterns

Example:

<hr>

<h4> ‘97 CHEVY Cavalier, Red, 5 spd, <i>only 7,000 miles</i>

on her. Asking <i>only $11,995</i>. … </h4>

<hr>

<h4> ‘89 CHEV Corsica Sdn teal, auto, air, <i>trouble free</i>.

Only $8,995 … </h4>

<hr>

...
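The "highest-count tag" heuristic above can be sketched in a few lines (a simplification that counts opening tags only and ignores nesting):

```python
import re
from collections import Counter

# Rank candidate record separators by how often each opening tag
# occurs in the region of interest.
def separator_candidates(html):
    return Counter(re.findall(r"<\s*(\w+)", html)).most_common()

region = ("<hr><h4>'97 CHEVY Cavalier, Red, 5 spd</h4>"
          "<hr><h4>'89 CHEV Corsica Sdn teal, auto, air</h4><hr>")
print(separator_candidates(region))
# → [('hr', 3), ('h4', 2)]
```

In the full system this ranking is only one vote; the consensus heuristic on the next slide combines it with the others.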



Record Extractor: Consensus Heuristic

Certainty is a generalization of: C(E1) + C(E2) - C(E1)C(E2).

C denotes certainty and Ei is the evidence for an observation.

Our certainties are based on observations from 10 different sites for 2 different applications (car ads and obituaries).

Correct Tag Rank
Heuristic   1     2     3     4
IT          96%   4%
HT          49%   33%   16%   2%
SD          66%   22%   12%
OM          85%   12%   2%    1%
RP          78%   12%   9%    1%
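The pairwise rule C(E1) + C(E2) - C(E1)C(E2) extends to any number of pieces of evidence by folding it in repeatedly; a small sketch (the three input certainties are just sample values):

```python
# Combine independent certainties: each new piece of evidence e updates
# the running certainty c via c + e - c*e.
def combine(certainties):
    c = 0.0
    for e in certainties:
        c = c + e - c * e
    return c

# e.g. three heuristics supporting the same separator tag:
print(round(combine([0.96, 0.49, 0.66]), 6))
# → 0.993064
```

Note the combined certainty only grows as evidence accumulates, which is why consensus can outscore every individual heuristic.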



Record Extractor: Results

Heuristic   Success Rate
IT          95%
HT          45%
SD          65%
OM          80%
RP          75%
Consensus   100%

4 different applications (car ads, job ads, obituaries, university courses) with 5 new/different sites for each application



Constant/Keyword Recognizer

‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her.

Previous owner heart broken! Asking only $11,995. #1415

JERRY SEINER MIDVALE, 566-3800 or 566-3888

Descriptor/String/Position (start/end)

Year|97|2|3

Make|CHEV|5|8

Make|CHEVY|5|9

Model|Cavalier|11|18

Feature|Red|21|23

Feature|5 spd|26|30

Mileage|7,000|38|42

KEYWORD(Mileage)|miles|44|48

Price|11,995|100|105

Mileage|11,995|100|105

PhoneNr|566-3800|136|143

PhoneNr|566-3888|148|155



Heuristics

  • Keyword proximity

  • Subsumed and overlapping constants

  • Functional relationships

  • Nonfunctional relationships

  • First occurrence without constraint violation



Keyword Proximity

Year|97|2|3

Make|CHEV|5|8

Make|CHEVY|5|9

Model|Cavalier|11|18

Feature|Red|21|23

Feature|5 spd|26|30

Mileage|7,000|38|42

KEYWORD(Mileage)|miles|44|48

Price|11,995|100|105

Mileage|11,995|100|105

PhoneNr|566-3800|136|143

PhoneNr|566-3888|148|155

D = 2

D = 52

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888
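The distances D = 2 and D = 52 above come from comparing each candidate's span to the keyword's span; a sketch using the offsets from the table (the function name is ours):

```python
# Keyword-proximity heuristic: "11,995" is both a Price and a Mileage
# candidate; the candidate whose span lies closest to the Mileage
# keyword "miles" wins.
def closest_to_keyword(candidates, keyword_span):
    def distance(c):
        _, start, end = c
        return min(abs(keyword_span[0] - end), abs(start - keyword_span[1]))
    return min(candidates, key=distance)

mileage_candidates = [("7,000", 38, 42), ("11,995", 100, 105)]
miles_keyword = (44, 48)   # KEYWORD(Mileage): "miles"
print(closest_to_keyword(mileage_candidates, miles_keyword))
# → ('7,000', 38, 42)   (D = 2, versus D = 52 for "11,995")
```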



Subsumed/Overlapping Constants

Year|97|2|3

Make|CHEV|5|8

Make|CHEVY|5|9

Model|Cavalier|11|18

Feature|Red|21|23

Feature|5 spd|26|30

Mileage|7,000|38|42

KEYWORD(Mileage)|miles|44|48

Price|11,995|100|105

Mileage|11,995|100|105

PhoneNr|566-3800|136|143

PhoneNr|566-3888|148|155

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888



Functional Relationships

Year|97|2|3

Make|CHEV|5|8

Make|CHEVY|5|9

Model|Cavalier|11|18

Feature|Red|21|23

Feature|5 spd|26|30

Mileage|7,000|38|42

KEYWORD(Mileage)|miles|44|48

Price|11,995|100|105

Mileage|11,995|100|105

PhoneNr|566-3800|136|143

PhoneNr|566-3888|148|155

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888



Nonfunctional Relationships

Year|97|2|3

Make|CHEV|5|8

Make|CHEVY|5|9

Model|Cavalier|11|18

Feature|Red|21|23

Feature|5 spd|26|30

Mileage|7,000|38|42

KEYWORD(Mileage)|miles|44|48

Price|11,995|100|105

Mileage|11,995|100|105

PhoneNr|566-3800|136|143

PhoneNr|566-3888|148|155

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888



First Occurrence without Constraint Violation

Year|97|2|3

Make|CHEV|5|8

Make|CHEVY|5|9

Model|Cavalier|11|18

Feature|Red|21|23

Feature|5 spd|26|30

Mileage|7,000|38|42

KEYWORD(Mileage)|miles|44|48

Price|11,995|100|105

Mileage|11,995|100|105

PhoneNr|566-3800|136|143

PhoneNr|566-3888|148|155

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888



Database-Instance Generator

Year|97|2|3

Make|CHEV|5|8

Make|CHEVY|5|9

Model|Cavalier|11|18

Feature|Red|21|23

Feature|5 spd|26|30

Mileage|7,000|38|42

KEYWORD(Mileage)|miles|44|48

Price|11,995|100|105

Mileage|11,995|100|105

PhoneNr|566-3800|136|143

PhoneNr|566-3888|148|155

insert into Car values(1001, "97", "CHEVY", "Cavalier",
"7,000", "11,995", "566-3800")

insert into CarFeature values(1001, "Red")

insert into CarFeature values(1001, "5 spd")
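The generation step can be sketched as follows, assuming the resolved record is a dict of functional attributes and the multi-valued features a list. The table layout follows the slide's example; the function name is hypothetical:

```python
def generate_inserts(record_id, record, features):
    # Car holds the single-valued (functional) attributes; CarFeature
    # holds one row per multi-valued feature, keyed by the record id.
    cols = ("Year", "Make", "Model", "Mileage", "Price", "PhoneNr")
    vals = ", ".join('"%s"' % record.get(col, "") for col in cols)
    stmts = ["insert into Car values(%d, %s)" % (record_id, vals)]
    for feat in features:
        stmts.append('insert into CarFeature values(%d, "%s")' % (record_id, feat))
    return stmts
```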


Recall & Precision

N = number of facts in source

C = number of facts declared correctly

I = number of facts declared incorrectly

Recall = C / N (of facts available, how many did we find?)

Precision = C / (C + I) (of facts retrieved, how many were relevant?)
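From these definitions, recall is C/N and precision is C/(C+I). A minimal executable check, with illustrative numbers (not from the slides):

```python
def recall(C, N):
    # Of the facts available in the source, how many did we find?
    return C / N

def precision(C, I):
    # Of the facts we declared, how many were correct?
    return C / (C + I)

# e.g. 90 of 100 source facts found correctly, 10 declared incorrectly:
# recall = 0.90, precision = 0.90
```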


Results: Car Ads

Salt Lake Tribune

            Recall %   Precision %
Year          100          100
Make           97          100
Model          82          100
Mileage        90          100
Price         100          100
PhoneNr        94          100
Extension      50          100
Feature        91           99

Training set for tuning ontology: 100

Test set: 116


Car Ads: Comments

  • Unbounded sets

    • missed: MERC, Town Car, 98 Royale

    • could use lexicon of makes and models

  • Unspecified variation in lexical patterns

    • missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)

    • could adjust lexical patterns

  • Misidentification of attributes

    • classified AUTO in AUTO SALES as automatic transmission

    • could adjust exceptions in lexical patterns

  • Typographical errors

    • “Chrystler”, “DODG ENeon”, “I-15566-2441”

    • could look for spelling variations and common typos


Results: Computer Job Ads

Los Angeles Times

         Recall %   Precision %
Degree     100          100
Skill       74          100
Email       91           83
Fax        100          100
Voice       79           92

Training set for tuning ontology: 50

Test set: 50


Obituaries (A More Demanding Application)

Challenges: multiple dates, names, family relationships, multiple viewings, addresses.

Our beloved Brian Fielding Frost, age 41, passed away Saturday morning, March 7, 1998, due to injuries sustained in an automobile accident. He was born August 4, 1956 in Salt Lake City, to Donald Fielding and Helen Glade Frost. He married Susan Fox on June 1, 1981. He is survived by Susan; sons Jordan (9), Travis (8), Bryce (6); parents, three brothers, Donald Glade (Lynne), Kenneth Wesley (Ellen), …

Funeral services will be held at 12 noon Friday, March 13, 1998 in the Howard Stake Center, 350 South 1600 East. Friends may call 5-7 p.m. Thursday at Wasatch Lawn Mortuary, 3401 S. Highland Drive, and at the Stake Center from 10:45-11:45 a.m.


Obituary Ontology

(partial)


Data Frames: Lexicons & Specializations

Name matches [80] case sensitive
constant
{ extract First, "\s+", Last; },
{ extract "[A-Z][a-zA-Z]*\s+([A-Z]\.\s+)?", Last; },
lexicon
{ First case insensitive; filename "first.dict"; },
{ Last case insensitive; filename "last.dict"; };
end;

Relative Name matches [80] case sensitive
constant
{ extract First, "\s+\(", First, "\)\s+", Last; substitute "\s*\([^)]*\)" -> ""; }
end;

...
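A rough Python analogue of the Name data frame above: a constant regex pattern plus lexicon checks. The tiny inline sets stand in for the first.dict/last.dict lexicon files, and the positions follow the slides' convention of inclusive start/end offsets:

```python
import re

# Tiny inline stand-ins for the first.dict / last.dict lexicon files.
FIRST = {"brian", "susan", "donald", "helen"}
LAST = {"frost", "fox", "glade"}

# Mirrors the data frame's constant rule: First, whitespace, optional
# middle initial, Last -- with both name parts validated by the lexicons.
NAME_RE = re.compile(r"\b([A-Z][a-z]+)\s+(?:[A-Z]\.\s+)?([A-Z][a-z]+)\b")

def extract_names(text):
    hits = []
    for m in NAME_RE.finditer(text):
        if m.group(1).lower() in FIRST and m.group(2).lower() in LAST:
            # Inclusive end offset, as in the slides' Name|...|start|end rows.
            hits.append((m.group(0), m.start(), m.end() - 1))
    return hits
```

This sketch handles only two-part names; the real data frames also cover middle names and the parenthesized-spouse "Relative Name" variant.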


Keyword Heuristics: Singleton Items

RelativeName|Brian Fielding Frost|16|35

DeceasedName|Brian Fielding Frost|16|35

KEYWORD(Age)|age|38|40

Age|41|42|43

KEYWORD(DeceasedName)|passed away|46|56

KEYWORD(DeathDate)|passed away|46|56

BirthDate|March 7, 1998|76|88

DeathDate|March 7, 1998|76|88

IntermentDate|March 7, 1998|76|98

FuneralDate|March 7, 1998|76|98

ViewingDate|March 7, 1998|76|98

...


Keyword Heuristics: Multiple Items

KEYWORD(Relationship)|born … to|152|192

Relationship|parent|152|192

KEYWORD(BirthDate)|born|152|156

BirthDate|August 4, 1956|157|170

DeathDate|August 4, 1956|157|170

IntermentDate|August 4, 1956|157|170

FuneralDate|August 4, 1956|157|170

ViewingDate|August 4, 1956|157|170

BirthDate|August 4, 1956|157|170

RelativeName|Donald Fielding|194|208

DeceasedName|Donald Fielding|194|208

RelativeName|Helen Glade Frost|214|230

DeceasedName|Helen Glade Frost|214|230

KEYWORD(Relationship)|married|237|243

...
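The idea behind these candidate lists can be sketched as nearest-preceding-keyword disambiguation: a date string is a candidate for every date attribute, and the keyword occurring closest before it decides which attribute the date actually fills. This is an illustrative simplification with hypothetical names, not the system's actual data-frame machinery:

```python
# A date near "born" is a BirthDate; near "passed away", a DeathDate; etc.
KEYWORDS = {
    "BirthDate": ("born",),
    "DeathDate": ("passed away", "died"),
    "FuneralDate": ("funeral services",),
}

def classify_date(text, date_start):
    lowered = text.lower()
    best_attr, best_dist = None, None
    for attr, words in KEYWORDS.items():
        for word in words:
            pos = lowered.rfind(word, 0, date_start)
            if pos >= 0:
                # Distance from the end of the keyword to the date.
                dist = date_start - (pos + len(word))
                if best_dist is None or dist < best_dist:
                    best_attr, best_dist = attr, dist
    return best_attr
```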


Results: Obituaries

Arizona Daily Star

                 Recall %   Precision %
DeceasedName*      100          100
Age                 86           98
BirthDate           96           96
DeathDate           84           99
FuneralDate         96           93
FuneralAddress      82           82
FuneralTime         92           87
Relationship        92           97
RelativeName*       95           74

*partial or full name

Training set for tuning ontology: ~ 24

Test set: 90


Results: Obituaries

Salt Lake Tribune

                 Recall %   Precision %
DeceasedName*      100          100
Age                 91           95
BirthDate          100           97
DeathDate           94          100
FuneralDate         92          100
FuneralAddress      96           96
FuneralTime         97          100
Relationship        81           93
RelativeName*       88           71

*partial or full name

Training set for tuning ontology: ~ 12

Test set: 38


Conclusions on Conceptual-Model-Based Extraction

  • Given an ontology and a Web page with multiple records, it is possible to extract and structure the data automatically.

  • Recall and Precision results are encouraging.

    • Car Ads: ~ 94% recall and ~ 99% precision

    • Job Ads: ~ 84% recall and ~ 98% precision

    • Obituaries: ~ 90% recall and ~ 95% precision (except on names: ~ 73% precision)


Conclusion

  • We have covered the what, why, where, and how of IE.

  • IE remains a challenging problem; there is no sign that a perfect solution is near.

  • This presentation is just a starting point for you.

