Cui Tao PhD Dissertation Defense

Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages. Cui Tao PhD Dissertation Defense. Motivation: birth date of my great grandpa; price and mileage of red Nissans, 1990 or newer; protein and amino acids information of gene cdk-4?

Presentation Transcript


  1. Ontology Generation, Information Harvesting and Semantic Annotation For Machine-Generated Web Pages Cui Tao PhD Dissertation Defense

  2. Motivation • Birth date of my great grandpa • Price and mileage of red Nissans, 1990 or newer • Protein and amino acids information of gene cdk-4? • US states with property crime rates above 1%

  3. Search by Search Engine

  4. Search the Hidden Web: “cdk-4” • The Hidden Web: • Hidden behind forms • Hard to query

  5. Query for Data • The Hidden Web: • Hidden behind forms • Hard to query • Find the protein and the amino-acids information for gene “cdk-4”

  6. A Web of Pages → A Web of Knowledge • Web of Knowledge • Machine-“understandable” • Publicly accessible • Queryable by standard query languages • Semantic annotation • Domain ontologies • Populated conceptual model • Problems to resolve • How do we create ontologies? • How do we annotate pages for ontologies?

  7. Contributions of Dissertation Work • Web of Pages → Web of Knowledge • Knowledge & meta-knowledge extraction • Reformulation as machine-“understandable” knowledge • Automatic & semi-automatic solutions via: • Sibling tables (TISP/TISP++) • User-created forms (FOCIH)

  8. Automatic Annotation with TISP (Table Interpretation with Sibling Pages) • Recognize tables (discard non-tables) • Locate table labels • Locate table values • Find label/value associations

  9. Recognize Tables Layout Tables (discard) Data Table Nested Data Tables

  10. Find Label/Value Associations Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918 1 2

  11. Interpretation Technique: Sibling Page Comparison

  12. Interpretation Technique: Sibling Page Comparison Same

  13. Interpretation Technique: Sibling Page Comparison Almost Same

  14. Interpretation Technique: Sibling Page Comparison Different Same

  15. Technique Details • Unnest tables • Match tables in sibling pages • “Perfect” match (table for layout → discard) • “Reasonable” match (sibling table) • Determine & use table-structure pattern • Discover pattern • Pattern usage • Dynamic pattern adjustment
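
The sibling-table matching step above can be illustrated with a minimal Python sketch. It is not the TISP implementation. It assumes each table is already reduced to a list of rows of cell strings, uses `difflib.SequenceMatcher` as a stand-in similarity measure, and picks a hypothetical threshold of 0.5 for a "reasonable" match. The function names and sample cell values are illustrative only.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff for a "reasonable" match; the slides do not give a value.
REASONABLE_THRESHOLD = 0.5


def cell_texts(table):
    """Flatten a table (list of rows, each a list of cell strings) into one list."""
    return [cell.strip() for row in table for cell in row]


def classify_match(table_a, table_b, threshold=REASONABLE_THRESHOLD):
    """Compare two tables taken from sibling pages.

    Returns "layout" for a perfect match (discard the table),
    "sibling" for a reasonable match (treat as a data table),
    or "no match" otherwise.
    """
    a, b = cell_texts(table_a), cell_texts(table_b)
    ratio = SequenceMatcher(None, a, b).ratio()
    if ratio == 1.0:
        return "layout"
    if ratio >= threshold:
        return "sibling"
    return "no match"


# Illustrative tables (made-up values, not taken from the evaluated sites).
nav = [["Home", "Search", "Help"]]
page1 = [["Gene", "cdk-4"], ["Species", "C. elegans"]]
page2 = [["Gene", "abc-1"], ["Species", "C. elegans"]]

print(classify_match(nav, nav))      # identical on both pages
print(classify_match(page1, page2))  # same labels, one differing value
```

In this sketch a table that is identical on both sibling pages is treated as layout, while a table that shares most cells but differs in a few is treated as a data table whose varying cells are candidate values.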

  16. Table Unnesting

  17. Table Structure Patterns • Regularity Expectations: • (<tr><(td|th)> {L} <(td|th)> {V})^n • <tr>(<(td|th)> {L})^n • (<tr>(<(td|th)> {V})^n)+ • … Pattern combinations are also possible.

  18. Table Structure Patterns <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
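
As a rough illustration of how such structure patterns might be checked and used, the Python sketch below encodes each table row as a string of marks ("L" for a label cell, "V" for a value cell) and tests two of the listed patterns. The mark encoding, the function names, and the sample car-ad values are assumptions for illustration. How cells get marked as labels or values is taken as given here.

```python
import re

# One "L" or "V" per cell. The marks are assumed to come from an earlier step.
LABEL_ROW = re.compile(r"L+")  # <tr>(<(td|th)> {L})^n
VALUE_ROW = re.compile(r"V+")  # one row of (<tr>(<(td|th)> {V})^n)+


def detect_pattern(marks):
    """Return "pairs", "header-rows", or None for a list of row mark strings."""
    # (<tr><(td|th)> {L} <(td|th)> {V})^n : every row is one label then one value.
    if marks and all(row == "LV" for row in marks):
        return "pairs"
    # A row of n labels followed by one or more rows of n values.
    if (len(marks) >= 2
            and LABEL_ROW.fullmatch(marks[0])
            and all(VALUE_ROW.fullmatch(row) and len(row) == len(marks[0])
                    for row in marks[1:])):
        return "header-rows"
    return None


def associate(table, marks):
    """Pair labels with values; returns a list of {label: value} records."""
    pattern = detect_pattern(marks)
    if pattern == "pairs":
        return [{row[0]: row[1] for row in table}]
    if pattern == "header-rows":
        labels = table[0]
        return [dict(zip(labels, row)) for row in table[1:]]
    return []


# Illustrative data only.
vertical = [["Make", "Nissan"], ["Year", "1995"]]
horizontal = [["Make", "Year"], ["Nissan", "1995"], ["Honda", "1998"]]

print(associate(vertical, ["LV", "LV"]))
print(associate(horizontal, ["LL", "VV", "VV"]))
```

A table that fits neither pattern yields an empty list in this sketch. The slides note that pattern combinations are also possible, which this example does not cover.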

  19. Pattern Usage

  20. Dynamic Pattern Adjustment

  21. TISP++ • Automatic ontology generation • Automatic information annotation

  22. Ontology Generation – OSM • Object set: table labels • Lexical: labels that associate with actual values • Non-lexical: labels that associate with other tables • Relationship set: table nesting • Constraints: updates based on observation

  23. Ontology Generation – OWL • Object set: OWL class • Relationship set: OWL object property • Lexical object set: • OWL data type property • Different annotation properties to keep track of the provenance
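
The OSM-to-OWL mapping on the two slides above can be sketched as a small Python function that emits Turtle text. This is one reading of the slides, not the TISP++ generator: non-lexical object sets become OWL classes, lexical object sets become datatype properties, and table nesting becomes object properties. The namespace `http://example.org/tisp#`, the `_has_` property naming, and the function names are hypothetical. Provenance annotation properties are left out.

```python
import re

BASE = "http://example.org/tisp#"  # hypothetical namespace


def local_name(label):
    """Turn a table label into a simple IRI-safe local name."""
    return re.sub(r"\W+", "_", label.strip()).strip("_")


def to_turtle(nonlexical, lexical, nesting):
    """Emit Turtle for the mapping described on the slides.

    nonlexical: labels that associate with other tables -> owl:Class
    lexical:    (owner_label, label) pairs with actual values -> owl:DatatypeProperty
    nesting:    (parent_label, child_label) pairs -> owl:ObjectProperty
    """
    lines = [
        f"@prefix : <{BASE}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for label in nonlexical:
        lines.append(f":{local_name(label)} a owl:Class .")
    for owner, label in lexical:
        lines.append(
            f":{local_name(label)} a owl:DatatypeProperty ; "
            f"rdfs:domain :{local_name(owner)} ."
        )
    for parent, child in nesting:
        prop = f"{local_name(parent)}_has_{local_name(child)}"
        lines.append(
            f":{prop} a owl:ObjectProperty ; "
            f"rdfs:domain :{local_name(parent)} ; "
            f"rdfs:range :{local_name(child)} ."
        )
    return "\n".join(lines)


# Labels borrowed from the slide 10 example; the grouping is illustrative.
ttl = to_turtle(
    nonlexical=["Identification", "Gene model(s)"],
    lexical=[("Gene model(s)", "Protein")],
    nesting=[("Identification", "Gene model(s)")],
)
print(ttl)
```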

  24. Generated Ontology

  25. Generated Ontology

  26. RDF Graph

  27. Query the Data • Find the protein and the amino-acids information for gene “cdk-4”

  28. TISP Evaluation • Applications • Commercial: car ads • Scientific: molecular biology • Geopolitical: US states and countries • Data: > 2,000 tables in 35 sites • Evaluation • Initial two sibling pages • Correct separation of data tables from layout tables? • Correct pattern recognition? • Remaining tables in site • Information properly extracted? • Able to detect and adjust for pattern variations?

  29. Experimental Results • Table recognition: correctly discarded 157 of 158 layout tables • Pattern recognition: correctly found 69 of 72 structure patterns • Extraction and adjustments: 5 path adjustments and 34 label adjustments → all correct

  30. TISP++ Performance • Performance depends on TISP • TISP test set • Generates all ontologies correctly • Annotates all information in tables correctly

  31. Form-based Ontology Creation and Information Harvesting (FOCIH) • Personalized ontology creation by form • General familiarity • Reasonable conceptual framework • Appropriate correspondence • Transformable to ontological descriptions • Capable of accepting source data • Automated ontology creation • Automated information harvesting

  32. Form Creation

  33. Created Sample Form

  34. Generated Ontology View

  35. Source-to-Form Mapping

  36. Source-to-Form Mapping

  37. Source-to-Form Mapping

  38. Source-to-Form Mapping

  39. Almost Ready to Harvest • Need reading path: DOM-tree structure • Need to resolve mapping problems • Pattern recognition • Instance recognition

  40. Reading Path

  41. Pattern & Instance Recognition

  42. Pattern & Instance Recognition

  43. Pattern & Instance Recognition: regular expression for decimal number, with left context and right context

  44. Pattern & Instance Recognition list pattern, delimiter is “,”

  45. Pattern & Instance Recognition list pattern, delimiter is regular expression for percentage numbers and a comma

  46. Pattern & Instance Recognition list pattern, delimiter is regular expression for percentage numbers and a comma
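
The pattern and instance recognition ideas on the preceding slides (an instance regular expression bounded by a left and right context, and list patterns split on a delimiter) might look roughly like the Python sketch below. It is not FOCIH's implementation. The function names, the exact regular expressions, and the sample strings are assumptions made for illustration.

```python
import re

DECIMAL = r"\d+(?:\.\d+)?"              # regular expression for a decimal number
PERCENT_DELIM = rf"\s*{DECIMAL}%,?\s*"  # a percentage number and an optional comma


def extract_with_context(text, left, instance, right):
    """Return the instance found between a left and a right context, or None.

    left and right are literal strings; instance is a regular expression.
    """
    pattern = re.escape(left) + r"\s*(" + instance + r")\s*" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None


def split_list(text, delimiter):
    """Split a list-valued cell on a delimiter regex and drop empty items."""
    return [item.strip() for item in re.split(delimiter, text) if item.strip()]


# Illustrative strings only.
print(extract_with_context("Area: 123.4 sq km", "Area:", DECIMAL, "sq km"))
print(split_list("Spanish, English, French", ","))
print(split_list("English 82.1%, Spanish 10.7%, other 7.2%", PERCENT_DELIM))
```

Treating the percentage figure as part of the delimiter leaves only the item names, which is the behavior the list-pattern slides describe for that case.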

  47. Can Now Harvest

  48. Can Now Harvest

  49. Can Now Harvest

  50. Semantic Annotation
