
Conceptual-Model-Based Web Data Extraction by Example

Yuanqiu (Joe) Zhou, Data Extraction Group, Brigham Young University. Sponsored by NSF. Motivation: data-rich websites are abundant; the conceptual-model-based methodology is resilient; the “by example” approach is user-friendly.

Presentation Transcript


  1. Conceptual-Model-Based Web Data Extraction by Example. Yuanqiu (Joe) Zhou, Data Extraction Group, Brigham Young University. Sponsored by NSF.

  2. Motivation • Data-rich Websites in abundance • Conceptual-Model-Based Methodology is resilient • “By Example” approach is user-friendly

  3. “By Example” Approach • Web users specify desired information by creating a form • Users collect sample pages on the Web • An ontology generator learns the task by analyzing the form and the sample pages • Interactions may be needed to improve or complete the ontology

  4. Architecture (diagram). Components shown: Sample Pages, Data Frame Libraries, Ontology Generator, User Created Form, GUI, Extraction Ontology, Extraction Engine, Target Pages, Populated Database.

  5. Sample Web Page and User Created Form. Sample web page: Canon PowerShot G2. Form “Digital Camera” with fields: Brand; Model; CCD Resolution (4.0); Image Resolution (2272 x 1074); Optical Zoom (3); Digital Zoom (2).

  6. Extraction Ontology • Relationship Set and Constraints • Extraction Patterns • Keywords • Context Expressions

  7. Relationship Set and Constraints. Callouts on the slide: Primary Object Name, Other Objects’ Names, Participation Constraints.
     DigitalCamera [-> object];
     DigitalCamera [0:1] has Brand [1:*];
     DigitalCamera [0:1] has Model [1:*];
     DigitalCamera [0:1] has CCDResolution [1:*];
     DigitalCamera [0:1] has ImageResolution [1:*];
     DigitalCamera [0:1] has OpticalZoom [1:*];
     DigitalCamera [0:1] has DigitalZoom [1:*];
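
The declarations on this slide have a regular shape, so they can be read into a small data structure. The following is a minimal sketch, not the Data Extraction Group's parser: the `Relationship` type, the `parse_relationships` function, and the regular expression are hypothetical names introduced here, and the `[min:max]` pairs are simply stored as numbers, with `*` mapped to `None`.

```python
import re
from typing import NamedTuple, Optional, Tuple

Cardinality = Tuple[int, Optional[int]]  # (min, max); max is None for "*"


class Relationship(NamedTuple):
    source: str
    source_card: Cardinality
    target: str
    target_card: Cardinality


# Hypothetical pattern for lines such as: DigitalCamera [0:1] has Brand [1:*];
DECL = re.compile(
    r"(\w+)\s*\[(\d+):(\d+|\*)\]\s*has\s*(\w+)\s*\[(\d+):(\d+|\*)\]\s*;"
)


def _card(lo: str, hi: str) -> Cardinality:
    return (int(lo), None if hi == "*" else int(hi))


def parse_relationships(text: str) -> list:
    """Collect every '<A> [m:n] has <B> [m:n];' declaration found in text."""
    return [
        Relationship(m[1], _card(m[2], m[3]), m[4], _card(m[5], m[6]))
        for m in DECL.finditer(text)
    ]


ONTOLOGY = """
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
DigitalCamera [0:1] has Model [1:*];
DigitalCamera [0:1] has CCDResolution [1:*];
DigitalCamera [0:1] has ImageResolution [1:*];
DigitalCamera [0:1] has OpticalZoom [1:*];
DigitalCamera [0:1] has DigitalZoom [1:*];
"""

relationships = parse_relationships(ONTOLOGY)
for rel in relationships:
    print(rel.source, rel.source_card, "has", rel.target, rel.target_card)
```

The primary-object line (`DigitalCamera [-> object];`) does not fit the pattern and is skipped, which leaves the six `has` relationship sets.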

  11. Extraction Patterns From Data Frame Libraries • Data Frame Libraries: Lexicons, Synonym Dictionary, Regular Expressions • Extraction Patterns: Lexicons for Brand and Model; Regular Expressions for numbers and Image Resolution

  12. Extraction Patterns (Data Frame Libraries) • Features a high-quality 4.0 Megapixel Resolution CCD • The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD • 3 effective megapixel
      CCDResolution matches [20]
        constant { extract "\b\d(\.\d{1,2})?\b"; };
        keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";

  15. Keywords • Features a high-quality 4.0 Megapixel Resolution CCD • The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD • 3 effective megapixel
      CCDResolution matches [20]
        constant { extract "\b\d(\.\d{1,2})?\b"; };
        keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
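
To see how the extract pattern and the keywords on this slide could work together, here is a minimal Python sketch run on the three sample phrases. It rests on two assumptions that the slides do not state: keywords are matched case-insensitively (the third phrase has a lowercase "megapixel"), and a candidate value is accepted only if a keyword is present, with the candidate nearest to a keyword winning. The function name `extract_ccd_resolution` is hypothetical, and the actual extraction engine may use a different heuristic.

```python
import re

# Patterns copied from the CCDResolution rule on the slide.
EXTRACT = re.compile(r"\b\d(\.\d{1,2})?\b")
KEYWORDS = [
    re.compile(p, re.IGNORECASE)  # case-insensitivity is an assumption
    for p in (r"\bMegapixel\b", r"\bCCD\b", r"\bResolution\b")
]


def extract_ccd_resolution(text):
    """Return the candidate number closest to a keyword, or None."""
    keyword_spans = [m.span() for kw in KEYWORDS for m in kw.finditer(text)]
    candidates = list(EXTRACT.finditer(text))
    if not keyword_spans or not candidates:
        return None

    def distance(candidate):
        start, end = candidate.span()
        # Gap in characters between the candidate and the nearest keyword.
        return min(max(ks - end, start - ke, 0) for ks, ke in keyword_spans)

    return min(candidates, key=distance).group(0)


samples = [
    "Features a high-quality 4.0 Megapixel Resolution CCD",
    "The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD",
    "3 effective megapixel",
]
results = [extract_ccd_resolution(s) for s in samples]
print(results)
```

Note that "995" in the second phrase is never a candidate: the word boundaries in the extract pattern only admit a single digit with an optional one- or two-digit fraction.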

  16. Context Expressions • 3.5x optical zoom (2.5x digital) • a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom • optical 3X/digital 6X zoom
      OpticalZoom matches [10]
        constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
        keyword "\boptical\b";
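
The role of the context expression can be illustrated with a short Python sketch: the context pattern first locates strings such as "3.5x", and the extract pattern then pulls the bare number out of that match. As assumptions not found in the slides, matching is case-insensitive (to cover "3X"), and when several context matches exist the one nearest to the keyword "optical" is chosen. The names `gap` and `extract_optical_zoom` are hypothetical.

```python
import re

# Patterns copied from the OpticalZoom rule on the slide.
CONTEXT = re.compile(r"\b\d(\.\d)?(x)\b", re.IGNORECASE)  # flag is an assumption
EXTRACT = re.compile(r"\b\d(\.\d)?")
KEYWORD = re.compile(r"\boptical\b", re.IGNORECASE)


def gap(a, b):
    """Characters between two (start, end) spans; 0 if they touch or overlap."""
    return max(b[0] - a[1], a[0] - b[1], 0)


def extract_optical_zoom(text):
    """Extract the number from the context match nearest to the keyword."""
    keywords = [m.span() for m in KEYWORD.finditer(text)]
    contexts = list(CONTEXT.finditer(text))
    if not keywords or not contexts:
        return None
    best = min(contexts, key=lambda c: min(gap(c.span(), k) for k in keywords))
    # Apply the extract pattern inside the chosen context string only.
    return EXTRACT.search(best.group(0)).group(0)


samples = [
    "3.5x optical zoom (2.5x digital)",
    "a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom",
    "optical 3X/digital 6X zoom",
]
zooms = [extract_optical_zoom(s) for s in samples]
print(zooms)
```

In the first and third phrases the digital zoom value also satisfies the context pattern, so the keyword is what separates the optical value from the digital one in this sketch.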

  17. Extraction Ontology
      DigitalCamera [-> object];
      DigitalCamera [0:1] has Brand [1:*];
      Brand matches [10]
        constant { extract "\bNikon\b"; },
                 { extract "\bCanon\b"; },
                 { extract "\bOlympus\b"; },
                 { extract "\bMinolta\b"; },
                 { extract "\bSony\b"; };
      end;
      DigitalCamera [0:1] has CCDResolution [1:*];
      CCDResolution matches [20]
        constant { extract "\b\d(\.\d{1,2})?\b"; };
        keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
      end;
      DigitalCamera [0:1] has ImageResolution [1:*];
      ImageResolution matches [20]
        constant { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
                 { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
        keyword "\bResolution\b", "\bImage\b";
      end;
      DigitalCamera [0:1] has OpticalZoom [1:*];
      OpticalZoom matches [10]
        constant { extract "\b\d"; context "\b\d(x)\b"; };
        keyword "\boptical\b";
      end;

  19. Extraction Ontology (OpticalZoom patterns extended to accept decimal values)
      DigitalCamera [-> object];
      DigitalCamera [0:1] has Brand [1:*];
      Brand matches [10]
        constant { extract "\bNikon\b"; },
                 { extract "\bCanon\b"; },
                 { extract "\bOlympus\b"; },
                 { extract "\bMinolta\b"; },
                 { extract "\bSony\b"; };
      end;
      DigitalCamera [0:1] has CCDResolution [1:*];
      CCDResolution matches [20]
        constant { extract "\b\d(\.\d{1,2})?\b"; };
        keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
      end;
      DigitalCamera [0:1] has ImageResolution [1:*];
      ImageResolution matches [20]
        constant { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
                 { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
        keyword "\bResolution\b", "\bImage\b";
      end;
      DigitalCamera [0:1] has OpticalZoom [1:*];
      OpticalZoom matches [10]
        constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
        keyword "\boptical\b";
      end;
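
As a rough illustration of how an ontology like this might fill the user's form, the sketch below applies the Brand lexicon and the ImageResolution pattern to a short text and keeps one value per field. Everything beyond the patterns themselves is an assumption: the `ONTOLOGY` dictionary layout, the `populate` function, the "earliest match wins" rule, and the sample string, which is assembled here from the values shown on the sample-page slide rather than taken from a real page.

```python
import re

# Extract patterns copied from the ontology; the dict layout is hypothetical.
ONTOLOGY = {
    "Brand": [
        r"\bNikon\b", r"\bCanon\b", r"\bOlympus\b", r"\bMinolta\b", r"\bSony\b",
    ],
    "ImageResolution": [r"\b\d{4}(\s)?(x)(\s)?\d{4}\b"],
}


def populate(text):
    """Return {object set: value}, keeping at most one value per object set."""
    record = {}
    for object_set, patterns in ONTOLOGY.items():
        matches = [m for p in patterns for m in re.finditer(p, text)]
        if matches:
            # Assumed tie-break: take the match that appears first in the text.
            first = min(matches, key=lambda m: m.start())
            record[object_set] = first.group(0)
    return record


# Assumed sample text built from the values on the sample-page slide.
sample = "Canon PowerShot G2, Image Resolution 2272 x 1074"
record = populate(sample)
print(record)
```

Keeping a single value per field mirrors the `[0:1]` constraint on the DigitalCamera side of each relationship set in this sketch; fields with no match are simply left out of the record.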

  21. Results (Same Site)

  22. Results (Different Site)

  23. Summary and Future Work • The example indicates that the approach is feasible • Some open questions need to be explored
