Semiautomatic Generation of Resilient Data Extraction Ontologies

Presentation Transcript

  1. Semiautomatic Generation of Resilient Data Extraction Ontologies Yihong Ding Data Extraction Group Brigham Young University Sponsored by NSF

  2. Data Extraction Ontology • Goal: extract data from web pages • Components • concepts • relations between the concepts • participation constraints • Resilient • Difficulty: manual ontology generation is costly

  3. Data-Extraction Ontology Generation Procedure [Diagram; labels in extraction order: Train, Test, Knowledge, Selection Processing, Extraction Processing, Database, Knowledge Sources]

  4. Knowledge Collection • Assumptions about knowledge base • general • contains meaningful relationships • pre-existing • XML or easy to transfer to XML • Current input • Mikrokosmos ontology [Mik] • auxiliary data frame library

  5. Selection of Concepts
       PROCEDURE ConceptSelection(Tdoc, Kbase)
         SourceDoc = Parse(Tdoc);
         PrimarySelectedConceptsList = MikroSelection(M-Ontology);
         SecondarySelectedConceptsList = DataFrameSelection(DF-Library);
         ConflictHandling();
         SelectedSubgraphGeneration();
     MANY ISSUES: selection strategies, conflict resolution, …

  6. Basic Selection Strategy • Select from Mikrokosmos Ontology • Afghanistan • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

  7. Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • Afghanistan • smaller than Texas. • Area<GeographicalArea>: 648,000 sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

  8. Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Afghanistan<Nation> • smaller than Texas<USState>. • Area<GeographicalArea>: 648,000 sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>: 17.7 million. • Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
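
The two matching passes on slides 7 and 8 (concept names, then concept values) can be sketched as a simple lexicon lookup. This is only an illustration: the two dictionaries below are tiny hypothetical stand-ins for the Mikrokosmos ontology, synonyms are listed by hand, and the function name `select_concepts` is an assumption, not something from the slides.

```python
import re

# Hypothetical miniature lexicons (assumption); a real knowledge base
# such as Mikrokosmos holds far more concepts and synonyms.
CONCEPT_NAMES = {
    "GeographicalArea": ["area"],
    "CapitalCity": ["capital"],
    "Population": ["population"],
}
CONCEPT_VALUES = {
    "Nation": ["Afghanistan"],
    "USState": ["Texas"],
    "CapitalCity": ["Kabul"],
}

def select_concepts(text, names=CONCEPT_NAMES, values=CONCEPT_VALUES):
    """Return the concepts whose names or values occur in the text."""
    selected = set()
    for lexicon in (names, values):
        for concept, terms in lexicon.items():
            for term in terms:
                # Whole-word, case-insensitive match of each term.
                pattern = r"\b" + re.escape(term) + r"\b"
                if re.search(pattern, text, re.IGNORECASE):
                    selected.add(concept)
    return selected

doc = ("Afghanistan smaller than Texas. Area: 648,000 sq. km. "
       "Capital--Kabul, Population: 17.7 million.")
print(sorted(select_concepts(doc)))
```

A concept can be selected through either lexicon, which is how `CapitalCity` is reached both from the label "Capital" and from the value "Kabul" in the sample text.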

  9. Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Select from Data Frame Libraries • Afghanistan • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

  10. Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Select from Data Frame Libraries • extract result based on the data frames • Afghanistan • smaller than Texas. • Area: 648,000<Area><Mileage> sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7<Time> million<Population><Price>. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
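
Selection from a data frame library can be pictured as running one value recognizer per concept over the text. The sketch below assumes each data frame reduces to a single regular expression; the patterns, the `DATA_FRAMES` table, and `apply_data_frames` are hypothetical and much stricter than the recognizers implied by the slide, where one number is tagged by several frames at once (`<Area><Mileage>`, `<Population><Price>`).

```python
import re

# Hypothetical data frames (assumption): one value pattern per concept.
DATA_FRAMES = {
    # A number immediately followed by a square-unit keyword.
    "Area": re.compile(r"\d[\d,]*(?:\.\d+)?(?=\s*sq\.?\s*(?:km|mi))"),
    # A number followed by a magnitude word.
    "Population": re.compile(r"\d+(?:\.\d+)?\s*(?:million|billion)"),
    # A dollar amount.
    "Price": re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?"),
}

def apply_data_frames(text, frames=DATA_FRAMES):
    """Map each concept to the strings its pattern recognizes in the text."""
    hits = {}
    for concept, pattern in frames.items():
        found = [m.group(0) for m in pattern.finditer(text)]
        if found:
            hits[concept] = found
    return hits

sample = "Area: 648,000 sq. km. Population: 17.7 million."
print(apply_data_frames(sample))
```

Looser patterns (for example, any bare number for `Price`) would fire on the same spans as `Area` or `Population`, which is the kind of overlap the conflict-handling step has to sort out.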

  11. Document-Level Conflict • Afghanistan • smaller than Texas. • Area: 648,000<Area><Mileage> sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7<Time> million<Population><Price>. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

  12. Concept-Level Conflict • Afghanistan • smaller than Texas. • Area<GeographicalArea>: 648,000<Area> sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>: 17.7 million<Population>. • Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

  13. Relation Retrieval • Theoretical solution • all paths in the subgraph • too expensive: NP-Complete • Heuristic solution • find the shortest path between any two nodes • set a threshold distance
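
The heuristic on slide 13 (shortest path between two selected concepts, cut off at a threshold distance) can be sketched with a breadth-first search. The graph below is a hypothetical fragment of a knowledge base, treated as unweighted and undirected for simplicity; `shortest_path` and `max_distance` are illustrative names, and the slides do not say which search algorithm was actually used.

```python
from collections import deque

def shortest_path(graph, start, goal, max_distance):
    """Breadth-first search for a shortest path of at most max_distance edges.

    Returns the path as a list of nodes, or None if the goal is not
    reachable within the threshold.
    """
    if start == goal:
        return [start]
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        # Do not extend paths that already use max_distance edges.
        if len(path) - 1 >= max_distance:
            continue
        for neighbor in graph.get(path[-1], ()):
            if neighbor in visited:
                continue
            if neighbor == goal:
                return path + [neighbor]
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return None

# Hypothetical knowledge-base fragment (assumption), as adjacency lists.
GRAPH = {
    "CapitalCity": ["City"],
    "City": ["CapitalCity", "Nation"],
    "Nation": ["City", "Population"],
    "Population": ["Nation"],
}

print(shortest_path(GRAPH, "CapitalCity", "Nation", 2))
print(shortest_path(GRAPH, "CapitalCity", "Population", 2))
```

With a threshold of 2, `CapitalCity` and `Nation` are related through `City`, while `CapitalCity` and `Population` are three edges apart and no relation is retrieved.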

  14. Participation Constraints • Afghanistan<Nation> • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton. CapitalCity [1:1] IsA.CITY.PartOf Nation [1:1]

  15. Participation Constraints (cont.) • Afghanistan<Nation> • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul<City>, • Other cities<City>--Kandahar<City> Mazar-e-Sharif<City> Konduz<City> • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton. City [1:1] PartOf Nation [1:*]
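
One way to read slides 14 and 15 is that the constraints follow from how many instances of each concept are linked in the training document: one capital per nation gives [1:1] on both sides, several cities per nation gives [1:*] on the Nation side. The sketch below rests on that reading, which is an assumption; the slides do not state the actual derivation rule, and `infer_constraints` is a hypothetical helper. The minimum is fixed at 1 because only instances that do participate are observed.

```python
from collections import defaultdict

def max_cardinality(counts):
    """Return '1' if every object links to exactly one partner, else '*'."""
    return "1" if all(c == 1 for c in counts) else "*"

def infer_constraints(pairs):
    """Infer [min:max] participation constraints from observed instance pairs.

    pairs is a list of (left_instance, right_instance) tuples,
    for example (city, nation).
    """
    left_links = defaultdict(set)
    right_links = defaultdict(set)
    for left, right in pairs:
        left_links[left].add(right)
        right_links[right].add(left)
    left_max = max_cardinality(len(v) for v in left_links.values())
    right_max = max_cardinality(len(v) for v in right_links.values())
    return f"[1:{left_max}]", f"[1:{right_max}]"

# City PartOf Nation, using the instances tagged on slide 15.
city_part_of_nation = [
    ("Kabul", "Afghanistan"),
    ("Kandahar", "Afghanistan"),
    ("Mazar-e-Sharif", "Afghanistan"),
    ("Konduz", "Afghanistan"),
]
print(infer_constraints(city_part_of_nation))
```

For the four city instances this yields `[1:1]` for City and `[1:*]` for Nation, matching the constraint shown on the slide; a single (capital, nation) pair yields `[1:1]` on both sides.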

  16. Performance Evaluation • Speed of generation • Precision and recall of the generation process • Precision and recall of the generated ontology
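
For reference, precision and recall of a generated ontology can be computed by comparing the generated concept set with a hand-built expected set. The standard definitions are used here; the two example sets are made up for illustration and are not results from the presentation.

```python
def precision_recall(generated, expected):
    """Return (precision, recall) of generated items against expected items."""
    generated, expected = set(generated), set(expected)
    correct = len(generated & expected)
    # Precision: share of generated items that are correct.
    precision = correct / len(generated) if generated else 0.0
    # Recall: share of expected items that were generated.
    recall = correct / len(expected) if expected else 0.0
    return precision, recall

# Illustrative sets only (assumption), not measured data.
generated = {"Nation", "CapitalCity", "Population", "Price"}
expected = {"Nation", "CapitalCity", "Population", "GeographicalArea", "City"}
print(precision_recall(generated, expected))
```

The same function applies to relations or constraints if each is encoded as a hashable item, for example a tuple of concept names.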

  17. Generation Time with Distance Threshold

  18. P&R of Generation Process

  19. Conclusion • Data extraction ontology generated • Knowledge sources exploited • Many issues addressed • Many more to explore