Filtering Multiple-Record Web Documents Based on Application Ontologies



Presentation Transcript


  1. Filtering Multiple-Record Web Documents Based on Application Ontologies • Presenter: L. Xu • Advisor: D. W. Embley

  2. Examples • D1: Car • D2: Item for Sale or Rent

  3. Car Ontology
       Car[->object];
       Car[0..0.975..1] has Year;
       Car[0..0.925..1] has Make;
       Car[0..0.908..1] has Model;
       Car[0..0.45..1] has Mileage;
       Car[0..2.1..*] has Feature;
       Car[0..0.8..1] has Price;
       PhoneNr is for Car[1..1.15..*];
       Year matches [4] constant {
         extract "\d{2}";
         context "([^\$\d]|^)[4-9]\d,[^\d]";
         substitute "^" -> "19";
       },
       . .
       End;
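
The Year rule in the ontology can be read as three steps: find a context match, extract two digits from it, then prepend "19". The sketch below is one possible Python reading of that rule, not the extraction engine used in the talk. The function name `extract_years` and the sample strings are hypothetical; the two regular expressions are copied from the slide.

```python
import re

# Patterns copied from the Year rule on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years found by the two-digit Year rule.

    Assumption: the rule means "inside every context match, take the
    first two-digit run and apply the substitution ^ -> 19".
    """
    years = []
    for ctx in CONTEXT.finditer(text):
        value = EXTRACT.search(ctx.group(0))
        if value:
            # substitute "^" -> "19": insert "19" at the start of the value
            years.append(re.sub(r"^", "19", value.group(0)))
    return years


if __name__ == "__main__":
    print(extract_years("95, Honda Accord, $4500"))
    print(extract_years("Price $95, obo"))
```

The context pattern excludes a leading `$` or digit, so a price such as `$95,` is not mistaken for a model year.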

  4. Filtering Heuristics • H1: Density • H2: Expected-values • H3: Grouping

  5. H1: Density
     • Car
       • Total Number of Characters: 2048
       • Number of Matched Characters: 626
       • Density: 0.306
     • Item for Sale or Rent
       • Total Number of Characters: 2671
       • Number of Matched Characters: 196
       • Density: 0.073
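
The density figures are consistent with a simple ratio of matched characters to total characters. The sketch below checks that arithmetic. It reads the second document as 196 matched characters out of 2671 in total, since that is the only order that yields the reported 0.073. The function name `density` is hypothetical.

```python
def density(matched_chars, total_chars):
    """Fraction of a document's characters matched by the ontology.

    Assumption: H1 is matched characters divided by total characters,
    which reproduces both values reported on the slide.
    """
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return matched_chars / total_chars


if __name__ == "__main__":
    print(round(density(626, 2048), 3))   # Car document
    print(round(density(196, 2671), 3))   # Item for Sale or Rent document
```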

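  6. H2: Expected-values

     |         | OV   | D1 | D2 |
     |---------|------|----|----|
     | Year    | 0.98 | 16 | 6  |
     | Make    | 0.93 | 10 | 0  |
     | Model   | 0.91 | 12 | 0  |
     | Mileage | 0.45 | 6  | 2  |
     | Price   | 0.80 | 11 | 8  |
     | Feature | 2.10 | 29 | 0  |
     | PhoneNr | 1.15 | 15 | 11 |

     • D1: 0.996
     • D2: 0.567
     • [Figure labels: D1, OV, D2]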
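
The slide does not name the measure behind the two scores. Cosine similarity between the OV column and each document's count column reproduces both reported values, so the sketch below uses it as a working assumption. The function name `cosine_similarity` is hypothetical; the vectors are copied from the slide in the order Year, Make, Model, Mileage, Price, Feature, PhoneNr.

```python
import math

# Values copied from the slide, in row order.
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
D1 = [16, 10, 12, 6, 11, 29, 15]
D2 = [6, 0, 0, 2, 8, 0, 11]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a document with no matches scores zero
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(round(cosine_similarity(OV, D1), 3))
    print(round(cosine_similarity(OV, D2), 3))
```

Because cosine similarity ignores vector length, a long document and a short one score the same as long as their object sets appear in the expected proportions.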

  7. H3: Grouping
     • Year: 2000, Year: 1989, Make: Subaru, Model: SW
       Nr of Distinct "One Max" Object: 3
     • Price: 1900, Year: 1998, Model: Elantra, Year: 1994
       Nr of Distinct "One Max" Object: 3
     • . . .
     • Grouping Factor is: 0.865

     • Year: 1999, Year: 1998, Year: 1960, Mileage: 10000
       Nr of Distinct "One Max" Object: 2
     • Mileage: 401000, Year: 1940, Price: 17500, Year: 10971
       Nr of Distinct "One Max" Object: 3
     • . . .
     • Grouping Factor is: 0.5
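
The slide shows extracted values split into groups of four, with a count of distinct "one max" object sets per group, but it does not state how the final factor is computed. The sketch below assumes the factor is the total of those distinct counts divided by (number of groups × group size). The formula, the function name `grouping_factor`, and the decision to ignore a trailing partial group are all assumptions. The `ONE_MAX` set lists the object sets whose maximum cardinality is 1 in the Car ontology.

```python
# Object sets with maximum cardinality 1 in the Car ontology.
ONE_MAX = {"Year", "Make", "Model", "Mileage", "Price"}


def grouping_factor(labels, group_size=4, one_max=ONE_MAX):
    """Average share of distinct one-max object sets per group.

    labels: object-set names of extracted values, in document order.
    Assumption: only full groups are scored; a trailing partial group
    is ignored.
    """
    kept = [name for name in labels if name in one_max]
    groups = [kept[i:i + group_size]
              for i in range(0, len(kept) - group_size + 1, group_size)]
    if not groups:
        return 0.0
    distinct_total = sum(len(set(group)) for group in groups)
    return distinct_total / (len(groups) * group_size)


if __name__ == "__main__":
    # Only the groups visible on the slide; the elided groups are missing,
    # so these results are not expected to equal 0.865 or 0.5.
    first = ["Year", "Year", "Make", "Model", "Price", "Year", "Model", "Year"]
    second = ["Year", "Year", "Year", "Mileage",
              "Mileage", "Year", "Price", "Year"]
    print(grouping_factor(first))
    print(grouping_factor(second))
```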

  8. Combining Heuristics
     • Decision tree learning algorithm C4.5
     • Learning task: suitability
     • Performance measure: accuracy
     • Training experience: human-classified documents
     • Training set
       • 20 positive examples (from 10 geographical regions of the United States)
       • 30 negative examples
     • Test set
       • 10 positive examples
       • 20 negative examples

  9. Generated Rules
     • Car application
         H2 <= 0.8767: NO
         H2 > 0.8767: YES
     • Obituary application
         H2 <= 0.6793: NO
         H2 > 0.6793
         |   H1 <= 0.2171: NO
         |   H1 > 0.2171: YES
     • Universal rule
         H3 <= 0.625
         |   H1 <= 0.369: NO
         |   H1 > 0.369
         |   |   H2 <= 0.6263: NO
         |   |   H2 > 0.6263: YES
         H3 > 0.625: YES
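
The three generated rules translate directly into threshold checks on the heuristic scores. The sketch below transcribes them as Python functions. The function names are hypothetical and the thresholds are copied from the slide. The usage lines assume the H1, H2, and H3 values shown on the earlier slides belong to D1 (0.306, 0.996, 0.865) and D2 (0.073, 0.567, 0.5); the H3 pairing is an assumption because the grouping slide does not label its documents.

```python
def car_rule(h1, h2, h3):
    """Car application: only H2 (expected values) is tested."""
    return h2 > 0.8767


def obituary_rule(h1, h2, h3):
    """Obituary application: H2 first, then H1 (density)."""
    if h2 <= 0.6793:
        return False
    return h1 > 0.2171


def universal_rule(h1, h2, h3):
    """Universal rule: H3 (grouping) first, then H1, then H2."""
    if h3 > 0.625:
        return True
    if h1 <= 0.369:
        return False
    return h2 > 0.6263


if __name__ == "__main__":
    d1 = (0.306, 0.996, 0.865)  # assumed scores for D1
    d2 = (0.073, 0.567, 0.5)    # assumed scores for D2
    for name, scores in (("D1", d1), ("D2", d2)):
        verdict = "YES" if universal_rule(*scores) else "NO"
        print(name, verdict)
```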

  10. Experiment Results
     • Car application
       • accuracy 96.7%
       • precision 100%
       • recall 91%
     • Obituary application
       • accuracy 96.7%
       • precision 91%
       • recall 100%
     • Universal rule
       • accuracy 93.4%
       • precision 84%
       • recall 100%
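
Accuracy, precision, and recall all follow from the four confusion-matrix counts. The sketch below shows the arithmetic. The slides do not report the underlying counts; the counts in the usage example (10 true positives, 1 false positive, 0 false negatives, 19 true negatives) are an assumption chosen because they fit a test set of 10 positive and 20 negative documents and reproduce the obituary figures. The function name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("no documents")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


if __name__ == "__main__":
    # Assumed counts, consistent with the obituary results on the slide.
    accuracy, precision, recall = metrics(tp=10, fp=1, fn=0, tn=19)
    print(f"accuracy {accuracy:.1%}, precision {precision:.0%}, recall {recall:.0%}")
```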

  11. False Drop Example

  12. False Positive Example

  13. Summary
     • Objective: Automatically filter multiple-record web documents.
     • Approach: Filtering heuristics
       • Density
       • Expected-values
       • Grouping
     • Result: ~95% accuracy
