Semiautomatic Generation of Resilient Data-Extraction Ontologies

Yihong Ding

Data Extraction Group

Brigham Young University

Sponsored by NSF


Wrapper-Driven Data Extraction

  • Web data extraction

    • Obtain user-specified information from Web documents

  • Wrapper

    • Convert implicit HTML data into explicit formatted data

    • Data-source-specific, high performance

  • Examples:

    • SoftMealy, STALKER, WIEN, Omini, ROADRUNNER, …


Common Problem of Wrappers

SoftMealy

[Figure: finite-state transducer with states b, _U, U, _N, N, etc. Transition labels include s<b,U> / “U=” + next_token, s<U,N> / “N=” + next_token, s<b,N> / “N=” + next_token, s<U,U> / ε, s<N,N> / ε, ? / ε, and ? / next_token.]

<LI> <A HREF="…"> Mani Chandy </A>, <I>Professor of Computer Science</I> and <I>Executive Officer for Computer Science</I>

  • Resiliency

    • fixed domain

    • changeable layout

  • Scalability

    • unchanged existing wrapper

    • extendable domain and functions


Data-Extraction Ontology

  • Structure

    • Object sets

    • Relationship sets

    • Participation constraints

    • Data frames

  • Pros: resilient and scalable

  • Cons: hard to create

    • Knowledge requirements

    • Tedious and error-prone work

Car [-> object];
Car [0:1] has Make [1:*];
Make matches [10]
  constant { extract "\baudi\b"; };
end;
Car [0:1] has Model [1:*];
Model matches [25]
  constant { extract "80";
             context "\baudi\S*\s*80\b"; };
end;
Car [0:1] has Mileage [1:*];
Mileage matches [8]
  constant { extract "\b[1-9]\d{0,2}k";
             substitute "[kK]" -> "000"; };
end;
Car [0:1] has Price [1:*];
Price matches [8]
  constant { extract "[1-9]\d{3,6}";
             context "\$[1-9]\d{3,6}"; };
end;
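As a rough illustration of how the data frames in the Car ontology above could drive extraction, the following Python sketch reuses the four regular expressions from the slide. The `DataFrame` class, the `recognize` method, and the sample ad are hypothetical, and the semantics are assumptions: `context` is taken to narrow the text before `extract` is applied, `substitute` is taken to rewrite the extracted value, and matching is taken to be case-insensitive. The slides do not spell these details out.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class DataFrame:
    """Hypothetical stand-in for one 'constant { ... }' data-frame entry."""
    extract: str
    context: Optional[str] = None
    substitute: Optional[Tuple[str, str]] = None

    def recognize(self, text: str) -> Optional[str]:
        scope = text
        if self.context is not None:
            # Assumption: the context expression narrows the search scope.
            ctx = re.search(self.context, text, re.IGNORECASE)
            if ctx is None:
                return None
            scope = ctx.group(0)
        hit = re.search(self.extract, scope, re.IGNORECASE)
        if hit is None:
            return None
        value = hit.group(0)
        if self.substitute is not None:
            # Assumption: substitution normalizes the extracted value.
            pattern, replacement = self.substitute
            value = re.sub(pattern, replacement, value)
        return value


# Regular expressions copied from the Car ontology on the slide.
CAR_FRAMES: Dict[str, DataFrame] = {
    "Make": DataFrame(extract=r"\baudi\b"),
    "Model": DataFrame(extract=r"80", context=r"\baudi\S*\s*80\b"),
    "Mileage": DataFrame(extract=r"\b[1-9]\d{0,2}k", substitute=(r"[kK]", "000")),
    "Price": DataFrame(extract=r"[1-9]\d{3,6}", context=r"\$[1-9]\d{3,6}"),
}


def extract_car(ad: str) -> Dict[str, Optional[str]]:
    """Apply every data frame to one ad and collect the recognized values."""
    return {name: frame.recognize(ad) for name, frame in CAR_FRAMES.items()}


if __name__ == "__main__":
    print(extract_car("Audi 80, 95k miles, $4500"))  # made-up ad for illustration
```

Because every expression is tied to one make and one value format, writing such frames by hand for a whole domain is the tedious part that the rest of the talk tries to automate.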


Motif of Ontology Generation

[Figure: diagram with the labels Sample Documents, Human Brain, Knowledge Base, Concepts of Interest, Concepts with Relations, and Data-Extraction Ontology.]


Thesis Statement

  • Given: knowledge base

  • Input: sample Web pages of interest

  • Output: a data-extraction ontology for the domain of interest

  • Between input and output: this is the work of this thesis


Ontology-Generation Procedure

[Figure: flowchart with the components training documents, test documents, pre-processing, clean records, Knowledge Sources, Integrated Knowledge Base, Concept Selection, Relation Retrieval, Constraint Discovery, Data Extraction Ontology (interact if necessary), Extraction Processing, Results Storage, and Result Evaluation.]


Primary Knowledge Source

  • Requirements

    • Available

    • General in coverage

    • Rich in meaningful relationships

    • Encoded in or easily converted to XML

  • Mikrokosmos (μK) Ontology

    • Developed by NMSU jointly with U.S. DoD

    • Contains over 5000 concepts

    • Averages 14 links per concept

    • Represented in XML format


Integrated Knowledge Base

[Figure: KNOWLEDGE BASE containing the Mikrokosmos (μK) Ontology, Lexicons, the Data-Frame Library, and a Synonym Dictionary (WordNet).]



Domain Specification

  • Training documents

    • Data-rich

    • Narrow in topic breadth

  • Preprocessing


Example – Car Advertisement

Record 1:

00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446

Record 2:

02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250

Record 3:

02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755

Record 4:

00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah



Concept Selection

  • Selection strategies

    • Compare a string with the name of a concept

    • Compare a string with the values belonging to a concept

    • Apply data-frame recognizers to recognize a string

[Figure: the knowledge base (KB) labels a string in the record below as <PHONE-NR>.]

00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
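The three selection strategies can be sketched in Python roughly as follows. The miniature knowledge base (`CONCEPT_NAMES`, `CONCEPT_VALUES`, `RECOGNIZERS`), its entries, and the `select_concepts` function are all hypothetical stand-ins for illustration. The real knowledge base is the integrated one described earlier and is far larger.

```python
import re
from typing import Set

# Hypothetical miniature knowledge base; every entry is an assumption.
CONCEPT_NAMES = {            # strategy 1: string equals a concept name
    "CENTURY": ["century"],
    "PRICE": ["price"],
    "MILEAGE": ["mileage"],
}
CONCEPT_VALUES = {           # strategy 2: string is a known value of a concept
    "AUTOMOBILE": ["buick", "audi"],
    "AUDIO-MEDIA-ARTIFACT": ["cd"],
}
RECOGNIZERS = {              # strategy 3: data-frame recognizers (regexes)
    "PHONE-NR": r"\b(?:1-\d{3}-)?\d{3}-\d{4}\b",
    "PRICE": r"\$\d{1,3}(?:,\d{3})*",
}


def select_concepts(record: str) -> Set[str]:
    """Return every concept that at least one strategy finds in the record."""
    words = set(re.findall(r"[a-z]+", record.lower()))
    found = set()
    for concept, names in CONCEPT_NAMES.items():
        if words & set(names):
            found.add(concept)
    for concept, values in CONCEPT_VALUES.items():
        if words & set(values):
            found.add(concept)
    for concept, pattern in RECOGNIZERS.items():
        if re.search(pattern, record):
            found.add(concept)
    return found


if __name__ == "__main__":
    record = ("00 Buick Century Stk# HU7159 Green $9,319, 714-2200 "
              "To Apply By Phone, 1-877-228-9486, OREM Utah")
    print(sorted(select_concepts(record)))
```

Note that plain name matching picks up CENTURY from the model name "Century". Spurious selections like this are what the conflict-resolution rules and the later updating step have to handle.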


Concept Selection

  • Reasons for conflict

    • Synonymy

    • Polysemy

  • Conflict resolution

    • Same string, only one meaning

    • Favor longer over shorter

    • Context decides meaning

[Figure: KB applied to the record below, with the labels <PRICE> and <MILEAGE>, the note "by keyword identification", and the keyword "price".]

02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250.
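A minimal sketch of the three conflict-resolution rules, under stated assumptions: candidates are `(start, end, concept)` spans, "favor longer over shorter" is read as dropping any span that lies inside a strictly longer one, and "context decides meaning" is read as counting hypothetical keywords near the span. The keyword lists, the window size, and the function name are illustrative only.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical context keywords per concept.
CONTEXT_KEYWORDS = {
    "PRICE": {"price", "retail"},
    "MILEAGE": {"mi", "miles", "mileage"},
}

Candidate = Tuple[int, int, str]  # (start, end, concept)


def resolve_conflicts(record: str, candidates: List[Candidate],
                      window: int = 20) -> Dict[str, str]:
    """Map each surviving string to exactly one concept."""
    def inside_longer(c: Candidate) -> bool:
        # Favor longer over shorter: c is covered by a strictly longer span.
        return any(o[0] <= c[0] and c[1] <= o[1]
                   and (o[1] - o[0]) > (c[1] - c[0]) for o in candidates)

    by_span: Dict[Tuple[int, int], List[str]] = {}
    for start, end, concept in candidates:
        if not inside_longer((start, end, concept)):
            by_span.setdefault((start, end), []).append(concept)

    resolved = {}
    for (start, end), concepts in by_span.items():
        # Context decides meaning: count keyword hits around the span.
        nearby = record[max(0, start - window):end + window].lower()
        words = set(re.findall(r"[a-z]+", nearby))
        best = max(concepts,
                   key=lambda c: len(words & CONTEXT_KEYWORDS.get(c, set())))
        # Same string, only one meaning: keep a single concept per string.
        resolved[record[start:end]] = best
    return resolved


if __name__ == "__main__":
    record = "02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250"
    s = record.index("13,695")
    candidates = [(s, s + 6, "MILEAGE"), (s, s + 6, "PRICE"),
                  (s + 3, s + 6, "MILEAGE")]
    print(resolve_conflicts(record, candidates))
```

Partially overlapping spans (neither inside the other) are not handled by this sketch.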



Relationship Retrieval

[Figure: the KB with the selected concepts <AUTOMOBILE>, <MILEAGE>, <YEAR>, <PRICE>, <PHONE-NR>, <AUDIO-MEDIA-ARTIFACT>, and <CENTURY>.]
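One plausible reading of relationship retrieval is a path search between selected concepts in the knowledge-base graph, where the path itself names the relationship set (compare the generated name IsA.ARTIFACT.CostOfProduction later in the talk). The sketch below uses breadth-first search for this. The search strategy, the hop limit, and the tiny `KB_EDGES` graph are assumptions and are not taken from the slides.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]  # (link label, target concept)

# Hypothetical fragment of a knowledge-base graph.
KB_EDGES: Dict[str, List[Edge]] = {
    "AUTOMOBILE": [("IsA", "ARTIFACT")],
    "ARTIFACT": [("CostOfProduction", "PRICE"), ("IsA", "PHYSICALOBJECT")],
    "PRICE": [("IsA", "SCALARATTRIBUTE")],
}


def find_path(source: str, target: str, max_hops: int = 4) -> Optional[List[Edge]]:
    """Breadth-first search for a shortest labeled path from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        if len(path) >= max_hops:
            continue
        for label, nxt in KB_EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(label, nxt)]))
    return None


def path_name(path: List[Edge]) -> str:
    """Join link labels and intermediate concepts into a relationship name."""
    if not path:
        return ""
    parts: List[str] = []
    for label, node in path[:-1]:
        parts += [label, node]
    parts.append(path[-1][0])
    return ".".join(parts)


if __name__ == "__main__":
    path = find_path("AUTOMOBILE", "PRICE")
    print(path_name(path) if path else "no path")
```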



Constraint Discovery

[Figure: the two records below, annotated with <AUTOMOBILE> and <PRICE>.]

02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755

00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

AUTOMOBILE [0:1] IsA.ARTIFACT.CostOfProduction PRICE [1:1]
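A simple way to picture constraint discovery is to count how often each concept occurs per training record and report the observed minimum and maximum. The sketch below does only that. The function name, the labeled records, and the rule that any maximum above one is written as `*` are assumptions, and the sketch does not model which side of a relationship set a participation constraint attaches to.

```python
from collections import Counter
from typing import List


def observed_range(labeled_records: List[List[str]], concept: str) -> str:
    """Format the min/max occurrences of a concept per record as [min:max]."""
    if not labeled_records:
        raise ValueError("need at least one training record")
    counts = [Counter(labels)[concept] for labels in labeled_records]
    low, high = min(counts), max(counts)
    # Assumption: more than one occurrence is generalized to '*'.
    return f"[{low}:{'*' if high > 1 else high}]"


if __name__ == "__main__":
    # Hypothetical concept labels for three training records.
    records = [
        ["AUTOMOBILE", "PRICE", "PHONE-NR"],
        ["AUTOMOBILE", "PRICE", "PHONE-NR", "PHONE-NR"],
        ["AUTOMOBILE", "PRICE", "PHONE-NR", "MILEAGE"],
    ]
    for concept in ("PRICE", "PHONE-NR", "MILEAGE"):
        print(concept, observed_range(records, concept))
```

With so few records the inferred ranges are fragile, which is the concern the discussion slides raise about the number of training examples.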



Ontology Generation

concept nodes → object sets

paths → relationship sets

discovered constraints → participation constraints

concept recognizers → data frames
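The mapping above can be pictured as a small serializer that turns the collected pieces into ontology declarations. The sketch below is hypothetical: the tuple layout and function name are invented for illustration, and data frames are omitted. The output mimics the declaration style of the Car ontology shown earlier.

```python
from typing import List, Tuple

# (subject, subject constraint, relationship name, object, object constraint)
Relationship = Tuple[str, str, str, str, str]


def emit_ontology(primary: str, relationships: List[Relationship]) -> str:
    """Serialize object and relationship sets as ontology declarations."""
    lines = [f"{primary} [-> object];"]
    for subj, subj_c, name, obj, obj_c in relationships:
        lines.append(f"{subj} [{subj_c}] {name} {obj} [{obj_c}];")
    return "\n".join(lines)


if __name__ == "__main__":
    print(emit_ontology("Automobile", [
        ("Automobile", "0:1", "has", "Mileage", "1:1"),
        ("Automobile", "0:1", "IsA.ARTIFACT.CostOfProduction", "Price", "1:1"),
    ]))
```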


Automatically Generated Ontology -- Car Advertisement

(01) {Automobile [-> object];}

(02) {Automobile [0:1] has Mileage [1:1];}

(03) {Automobile [0:1] IsA.ARTIFACT.CostOfProduction Price [1:1];}

(12) {Price [1:1] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year [0:*];}

(20) {Automobile [0:1] relatesTo PhoneNr [1:*] relatesTo ArtifactPart [1:*] relatesTo Mileage [1:*] relatesTo Truck [1:*] relatesTo AudioMediaArtifact [1:*] relatesTo CommunicationDevice [1:*] relatesTo ControlEvent [1:*] relatesTo TravelEvent [1:*];}



Updating Strategies

  • Remove all bad relationship sets

  • Modify remaining incorrect relationship sets

    • Substitute incorrect object sets

    • Reduce long n-ary relationship sets

    • Fix participation constraints

  • Adjust names or re-arrange sequences

  • Add new relationship sets


Final Ontology

Car [-> object]

Car [0:1] has Year [1:*]

Car [0:1] has Mileage [1:*]

Car [0:1] has Price [1:*]

PhoneNr [1:*] is for Car [0:1]

PhoneNr [0:1] has Extension [1:*]

Car [0:*] has Feature [1:*]

Car [0:1] has Make [1:*]

Car [0:1] has Model [1:*]


Evaluation Criteria

  • Basic measures

    • POG (Precision of Ontology Generation)

    • ROG (Recall of Ontology Generation)

  • Human constraints

    • PROG (Pseudo-ROG)

    • Comparing with an expert-created ontology

  • Knowledge base constraints

    • EPROG (Effective-PROG)

  • Correctness dependency

    • DEPROG (Dependent-EPROG)

    • For example: relationship sets depend on object sets
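The slides name the measures without defining them. Assuming POG and ROG follow the usual precision and recall definitions (correct generated components over all generated components, and over all components in the expert-created ontology, respectively), they could be computed as in the sketch below. The function name and the sample component sets are hypothetical, and the adjusted measures PROG, EPROG, and DEPROG are not modelled because their exact definitions are not given here.

```python
from typing import Iterable, Tuple


def pog_rog(generated: Iterable[str], expert: Iterable[str]) -> Tuple[float, float]:
    """Precision and recall of generated components against an expert ontology."""
    generated_set, expert_set = set(generated), set(expert)
    correct = generated_set & expert_set
    pog = len(correct) / len(generated_set) if generated_set else 0.0
    rog = len(correct) / len(expert_set) if expert_set else 0.0
    return pog, rog


if __name__ == "__main__":
    # Hypothetical relationship sets, written as plain strings.
    generated = ["Car has Year", "Car has Price", "Car has Mileage", "Price relatesTo Year"]
    expert = ["Car has Year", "Car has Price", "Car has Mileage",
              "Car has Make", "Car has Model"]
    print(pog_rog(generated, expert))
```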


Evaluation Results


Discussion of Results

  • Bottleneck: the system cannot generate what is not in the knowledge base

  • Object sets

    • Concept-selection procedure works well

    • Desired concept not shown in training records

      • Rarely occurring concept → not severe even if we don’t fix the error

      • Example: extension

    • Aggregation and union

      • USAddressCity, USAddressState, USAddressZipCode → Location

      • CropPlant, AnimalProduct, FruitFoodStuff → AgriculturalProduct

    • Close-meaning concepts: FurniturePart → Furnished


Discussion of Results

  • Relationship sets

    • Binary relationship sets over 95%

    • Most errors due to incorrectly generated object sets

    • Semantically incorrect relationship sets

      • Price IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year

    • n-ary relationship sets (usually huge)

  • Participation constraints

    • Errors due to lack of training examples

    • How much is enough?


Knowledge Base Extensibility

  • Add SALT -- a new knowledge source

  • Successfully integrated into existing KB

  • Sample new relationship set (DOE abstract domain)

    • CrudeOil IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation


Conclusion

  • Experimented with knowledge-base construction and extension

  • Standardized application domain specification

  • Generated data-extraction ontologies from a specified domain and an integrated knowledge base

  • Showed DEPROG results of more than 70% on average and over 90% for well-defined domains


Future Work

  • Build a general-purpose knowledge source for data-extraction usage

  • Study more about data frames

    • Can a system correctly identify concepts with data frames?

    • Can a system update a data frame to fit a special situation?

    • Can a system generate a data frame from a collection of information of interest?

