
# Recognizing Ontology-Applicable Multiple-Record Web Documents

Recognizing Ontology-Applicable Multiple-Record Web Documents. David W. Embley Dennis Ng Li Xu. Brigham Young University. Problem: Recognizing Applicable Documents. Document 1: Car Ads. Document 2: Items for Sale or Rent. A Conceptual Modeling Solution. Car-Ads Ontology. Car [->object];


### Recognizing Ontology-Applicable Multiple-Record Web Documents

David W. Embley

Dennis Ng

Li Xu

Brigham Young University

Document 2: Items for Sale or Rent

Car [->object];

Car [0:0.975:1] has Year [1:*];

Car [0:0.925:1] has Make [1:*];

Car [0:0.908:1] has Model [1:*];

Car [0:0.45:1] has Mileage [1:*];

Car [0:2.1:*] has Feature [1:*];

Car [0:0.8:1] has Price [1:*];

PhoneNr [1:*] is for Car [1:1.15:*];

Year matches [4]

constant {extract "\d{2}";

context "([^\\$\d]|^)[4-9]\d,[^\d]";

substitute "^" -> "19"; },

End;
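The Year rule above can be read as: find a two-digit number from 40 to 99 that is followed by a comma and is not preceded by a dollar sign or another digit, then prefix it with "19". The following is a minimal Python sketch of that reading, not the authors' extraction engine. The function name `extract_years` is hypothetical, and treating `\\$` in the slide as an escaped `$` is an assumption.

```python
import re

# Context pattern from the slide, with "\\$" read as an escaped dollar sign (assumption).
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
# Constant pattern from the slide: the two digits to extract.
EXTRACT = re.compile(r"\d{2}")


def extract_years(text: str) -> list[str]:
    """Return four-digit years for every context match in text (hypothetical helper)."""
    years = []
    for ctx in CONTEXT.finditer(text):
        digits = EXTRACT.search(ctx.group(0))
        if digits:
            # substitute "^" -> "19": prefix the extracted digits with the century.
            years.append("19" + digits.group(0))
    return years


if __name__ == "__main__":
    print(extract_years("Clean '97, runs great. Asking $95, firm"))
```

In this sketch "'97," is recognized as a year, while "$95," is rejected because the dollar sign marks it as a price.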

• H1: Density

• H2: Expected Values

• H3: Grouping

• Document 1: Car Ads

• Number of Matched Characters: 626

• Total Number of Characters: 2048

• Density: 0.306

• Items for Rent or Sale

• Number of Matched Characters: 196

• Total Number of Characters: 2671

• Density: 0.073
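The density figures above are the ratio of matched characters to total characters. A minimal sketch that reproduces the two reported values (the function name `density` is an illustrative choice, not from the slides):

```python
def density(matched_chars: int, total_chars: int) -> float:
    """Fraction of a document's characters matched by the ontology's recognizers."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return matched_chars / total_chars


if __name__ == "__main__":
    print(round(density(626, 2048), 3))  # first document on the slide
    print(round(density(196, 2671), 3))  # Items for Rent or Sale
```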

| Object set | Document 1: Car Ads | Document 2: Items for Sale or Rent |
|---|---|---|
| Year | 3 | 1 |
| Make | 2 | 0 |
| Model | 3 | 0 |
| Mileage | 1 | 1 |
| Price | 1 | 0 |
| Feature | 15 | 0 |
| PhoneNr | 3 | 4 |

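| Object set | OV | D1 | D2 |
|---|---|---|---|
| Year | 0.98 | 16 | 6 |
| Make | 0.93 | 10 | 0 |
| Model | 0.91 | 12 | 0 |
| Mileage | 0.45 | 6 | 2 |
| Price | 0.80 | 11 | 8 |
| Feature | 2.10 | 29 | 0 |
| PhoneNr | 1.15 | 15 | 11 |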

D1: 0.996

D2: 0.567
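The slides do not name the measure behind the D1 and D2 scores, but both values are reproduced by the cosine of the angle between each document's count vector and the expected-value vector OV. The sketch below assumes that is the intended measure; the vectors are the tabulated OV, D1, and D2 values for Year, Make, Model, Mileage, Price, Feature, and PhoneNr.

```python
import math

# Order: Year, Make, Model, Mileage, Price, Feature, PhoneNr
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
D1 = [16, 10, 12, 6, 11, 29, 15]
D2 = [6, 0, 0, 2, 8, 0, 11]


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a document with no matches has no direction to compare
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(round(cosine_similarity(OV, D1), 3))
    print(round(cosine_similarity(OV, D2), 3))
```

Because cosine similarity ignores vector length, a long document and a short one with the same proportions of matches score the same.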

[Figure: the vectors D1, OV, and D2]

[Figure: object-set matches (Year, Make, Model, Mileage, Price) in each document, bracketed into groups]

Car Ads: $\frac{3+3+4+4}{4 \times 4} = 0.875$

Sale Items: $\frac{2+3+2+1}{4 \times 4} = 0.500$

Car Ads

• Group 1: Year, Year, Make, Model (3 distinct)

• Group 2: Price, Year, Model, Year (3 distinct)

• Group 3: Make, Model, Mileage, Year (4 distinct)

• Group 4: Model, Mileage, Price, Year (4 distinct)

Grouping: 0.875

Sale Items

• Group 1: Year, Year, Year, Mileage (2 distinct)

• Group 2: Mileage, Year, Price, Price (3 distinct)

• Group 3: Year, Price, Price, Year (2 distinct)

• Group 4: Price, Price, Price, Price (1 distinct)

Grouping: 0.500

Expected Number in a Group = $\left\lfloor \sum_{\text{1-Max}} \text{Ave} \right\rfloor$ = 4 (for our example)

Grouping = $\dfrac{\text{Sum of Distinct 1-Max in each Group}}{\text{Number of Groups} \times \text{Expected Number in a Group}}$
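The grouping computation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the expected group size is taken as the floor of the summed averages of the 1-Max object sets (Year, Make, Model, Mileage, Price in the car-ads ontology), and a trailing partial group is ignored. The names `expected_group_size` and `grouping` are hypothetical.

```python
import math

# Average participation of the 1-Max object sets, from the car-ads ontology.
ONE_MAX_AVERAGES = {"Year": 0.975, "Make": 0.925, "Model": 0.908,
                    "Mileage": 0.45, "Price": 0.8}


def expected_group_size(averages: dict[str, float]) -> int:
    """Whole-number group size from the summed averages (floor is an assumption)."""
    return math.floor(sum(averages.values()))


def grouping(matches: list[str], group_size: int) -> float:
    """Sum of distinct object sets per group over (groups x group size)."""
    groups = [matches[i:i + group_size]
              for i in range(0, len(matches), group_size)]
    # Assumption: a trailing group shorter than group_size is dropped.
    groups = [g for g in groups if len(g) == group_size]
    if not groups:
        return 0.0
    distinct = sum(len(set(g)) for g in groups)
    return distinct / (len(groups) * group_size)


CAR_ADS = ["Year", "Year", "Make", "Model", "Price", "Year", "Model", "Year",
           "Make", "Model", "Mileage", "Year", "Model", "Mileage", "Price", "Year"]
SALE_ITEMS = ["Year", "Year", "Year", "Mileage", "Mileage", "Year", "Price", "Price",
              "Year", "Price", "Price", "Year", "Price", "Price", "Price", "Price"]

if __name__ == "__main__":
    n = expected_group_size(ONE_MAX_AVERAGES)
    print(n, grouping(CAR_ADS, n), grouping(SALE_ITEMS, n))
```

With the example sequences, the car-ads matches give 14 distinct object sets across four groups of four, and the sale items give 8.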

• Decision-Tree Learning Algorithm C4.5

• (H1, H2, H3, Positive)

• (H1, H2, H3, Negative)

• Training Set

• 20 positive examples

• 30 negative examples (some purposely similar, e.g. classified ads)

• Test Set

• 10 positive examples

• 20 negative examples

• Precision: 100%

• Recall: 91%

• Accuracy: 97%

• Harmonic Mean

• 2/(1/Precision + 1/Recall)

• Precision: 91%

• Recall: 100%

• Accuracy: 97%

• Precision: 84%

• Recall: 100%

• Accuracy: 93%
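The harmonic mean given above, 2/(1/Precision + 1/Recall), combines each precision and recall pair into a single score. A minimal sketch applied to the three reported pairs (the function name `harmonic_mean` is an illustrative choice, and the resulting values are computed here, not quoted from the slides):

```python
def harmonic_mean(precision: float, recall: float) -> float:
    """2 / (1/Precision + 1/Recall), with 0.0 when either input is zero."""
    if precision <= 0 or recall <= 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)


if __name__ == "__main__":
    # (precision, recall) pairs as reported on the results slides
    for p, r in [(1.00, 0.91), (0.91, 1.00), (0.84, 1.00)]:
        print(f"P={p:.2f} R={r:.2f} harmonic mean={harmonic_mean(p, r):.3f}")
```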

• Other Approaches

• Naïve Bayes [McCallum96] (accuracy near 90%)

• Logistic Regression [Wang01] (accuracy near 95%)

• Multivariate Analysis with Continuous Random Vectors [Tang01] (accuracy near 100%)

• More Extensive Testing

• Similar documents (motorcycles, wedding announcements, …)

• Accuracy drops to near 87%

• Naïve Bayes drops to near 77%

• Others … ?

• Other Types of Documents

• XML Documents

• Forms and the Hidden Web

• Tables

• Objective: Automatically Recognize Document Applicability

• Approach:

• Conceptual Modeling

• Recognition Heuristics

• Density

• Expected Values

• Grouping

• Result: Accuracy Near 95%

www.deg.byu.edu