
Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach


Presentation Transcript


  1. Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach • AnHai Doan, Pedro Domingos, Alon Halevy

  2. Problem & Solution • Problem • Large-scale Data Integration Systems • Bottleneck: Semantic Mappings • Solution • Multi-strategy Learning • Integrity Constraints • XML Structure Learner • 1-1 Mappings

  3. Learning Source Descriptions (LSD) • Components • Base learners • Meta-learner • Prediction converter • Constraint handler • Operating Phases • Training phase • Matching phase

  4. Learners • Basic Learners • Name Matcher (Whirl) • Content Matcher (Whirl) • Naïve Bayes Learner • County-Name Recognizer • XML Learner • Meta-Learner (Stacking)
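The meta-learner combines the base learners' predictions by stacking: for each mediated-schema element it learns a weight per base learner and takes a weighted sum of their confidence scores. A minimal sketch (the function and weight layout are hypothetical, not the authors' implementation):

```python
def stack(base_predictions, weights):
    """Combine base-learner scores into one score per label.

    base_predictions: list of {label: confidence} dicts, one per base learner.
    weights: {(learner_index, label): weight} learned during training.
    """
    combined = {}
    for i, preds in enumerate(base_predictions):
        for label, conf in preds.items():
            combined[label] = combined.get(label, 0.0) + \
                weights.get((i, label), 1.0) * conf
    return combined

# Two base learners disagree on the label for one source tag:
preds = [{"ADDRESS": 0.9, "NAME": 0.1},   # e.g. a name matcher
         {"ADDRESS": 0.4, "NAME": 0.6}]   # e.g. a content matcher
weights = {(0, "ADDRESS"): 0.8, (0, "NAME"): 0.8,
           (1, "ADDRESS"): 0.5, (1, "NAME"): 0.5}
scores = stack(preds, weights)
# ADDRESS: 0.8*0.9 + 0.5*0.4 = 0.92;  NAME: 0.8*0.1 + 0.5*0.6 = 0.38
```

The point of learning per-label weights is that a base learner can be trusted for some mediated-schema elements (the name matcher for well-named tags) and distrusted for others.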

  5. Naïve Bayes Learner • Input instance = bag of tokens
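A bag-of-tokens naive Bayes learner scores each mediated-schema label by combining the label's prior with per-token conditional probabilities. A toy sketch with Laplace smoothing (the class and the training data are illustrative, not the paper's):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def train(self, examples):
        # examples: list of (bag_of_tokens, label) pairs
        self.labels = Counter(label for _, label in examples)
        self.total = sum(self.labels.values())
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in examples:
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens):
        # Score = log P(label) + sum of log P(token | label), Laplace-smoothed.
        scores = {}
        for label in self.labels:
            n = sum(self.token_counts[label].values())
            s = math.log(self.labels[label] / self.total)
            for t in tokens:
                s += math.log((self.token_counts[label][t] + 1) /
                              (n + len(self.vocab)))
            scores[label] = s
        return max(scores, key=scores.get)

nb = NaiveBayes()
nb.train([(["miami", "fl"], "ADDRESS"),
          (["seattle", "wa"], "ADDRESS"),
          (["james", "smith"], "NAME")])
nb.predict(["miami", "wa"])   # → "ADDRESS"
```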

  6. XML Learner • Input instance = bag of tokens, including both text tokens and structure tokens
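The XML learner differs from the plain naive Bayes learner only in what goes into the bag: besides text tokens it adds structure tokens derived from element nesting. A hypothetical tokenizer sketch (the `<parent/child>` token format is an assumption for illustration):

```python
import xml.etree.ElementTree as ET

def xml_token_bag(xml_string):
    """Bag of text tokens plus structure tokens like '<parent/child>'."""
    root = ET.fromstring(xml_string)
    tokens = []
    def walk(elem, path):
        tokens.append("<%s>" % "/".join(path + [elem.tag]))  # structure token
        if elem.text and elem.text.strip():
            tokens.extend(elem.text.split())                 # text tokens
        for child in elem:
            walk(child, path + [elem.tag])
    walk(root, [])
    return tokens

bag = xml_token_bag("<contact><name>James Smith</name>"
                    "<phone>235 7654</phone></contact>")
# → ['<contact>', '<contact/name>', 'James', 'Smith',
#    '<contact/phone>', '235', '7654']
```

Feeding such bags to the same Bayesian machinery lets the learner pick up hierarchical regularities (e.g. which parent elements a value tends to appear under) that text tokens alone miss.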

  7. Domain Constraint Handler • Domain Constraints • Impose semantic regularities on schemas and source data in the domain • Can be specified once, when creating the mediated schema • Independent of any actual source schema • Constraint Handler • Combines domain constraints, the Prediction Converter’s predictions, and users’ feedback to produce the output mappings
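One way to sketch the constraint handler is as a search over combinations of candidate mappings that discards any combination violating a domain constraint. The exhaustive search and the uniqueness constraint below ("at most one source tag maps to ADDRESS") are illustrative only:

```python
from itertools import product

def best_mapping(candidates, constraints):
    """candidates: {tag: [(label, confidence), ...]} from the meta-learner.
    constraints: predicates over a full {tag: label} mapping.
    Returns the highest-confidence mapping satisfying all constraints."""
    tags = list(candidates)
    best, best_score = None, float("-inf")
    for choice in product(*(candidates[t] for t in tags)):
        mapping = {t: lab for t, (lab, _) in zip(tags, choice)}
        if not all(c(mapping) for c in constraints):
            continue  # violates a domain constraint; discard
        score = sum(conf for _, conf in choice)
        if score > best_score:
            best, best_score = mapping, score
    return best

at_most_one_address = lambda m: sum(
    1 for lab in m.values() if lab == "ADDRESS") <= 1

cands = {"location": [("ADDRESS", 0.9), ("NAME", 0.2)],
         "contact-info": [("ADDRESS", 0.7), ("PHONE", 0.6)]}
result = best_mapping(cands, [at_most_one_address])
# → {'location': 'ADDRESS', 'contact-info': 'PHONE'}
```

Note how the constraint flips "contact-info" from its top-ranked label (ADDRESS) to its second choice, because "location" claims ADDRESS with higher confidence.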

  8. Training Phase • Manually Specify Mappings for Several Sources • Extract Source Data • Create Training Data for each Base Learner • Train the Base Learners • Train the Meta-Learner

  9. Example 1 (Training Phase)

  10. Example 1 (Cont.) • Training Data • Source Data

  11. Example 1 (Cont.) • Source data: (location: Miami, FL) • Training pairs: (“location”, ADDRESS), (“Miami, FL”, ADDRESS)
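The slide's two pairs illustrate that one mapped field yields a different training example for each base learner: the name matcher trains on the tag name, while the content learners train on the data value. A hypothetical helper showing that split (learner names and the tokenization are assumptions):

```python
def make_training_pairs(source_field, value, mediated_label):
    """Split one manually mapped field into per-learner training pairs."""
    return {
        "name_matcher":    (source_field, mediated_label),
        "content_matcher": (value, mediated_label),
        "naive_bayes":     (value.replace(",", " ").split(), mediated_label),
    }

pairs = make_training_pairs("location", "Miami, FL", "ADDRESS")
# pairs["name_matcher"]    → ('location', 'ADDRESS')
# pairs["content_matcher"] → ('Miami, FL', 'ADDRESS')
# pairs["naive_bayes"]     → (['Miami', 'FL'], 'ADDRESS')
```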

  12. Matching Phase • Extract and Collect Data • Match each Source-DTD Tag • Apply the Constraint Handler
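The three matching steps can be sketched as one loop: predict labels for each extracted source instance, combine the per-learner predictions, then hand the result to the constraint handler. All components below are trivial stand-ins for the real learners, meta-learner, and handler:

```python
def match_source(source_instances, base_learners, combine, constrain):
    """Matching phase: predict a label for each source tag, then
    apply the constraint handler to the combined predictions."""
    predictions = {}
    for tag, instance in source_instances.items():
        per_learner = [learner(instance) for learner in base_learners]
        predictions[tag] = combine(per_learner)
    return constrain(predictions)

# Toy run: one fake "content" learner, a pick-the-max combiner,
# and a no-op constraint handler.
learners = [lambda inst: {"ADDRESS": 0.9} if "," in inst else {"NAME": 0.8}]
combine = lambda preds: max(preds[0], key=preds[0].get)
constrain = lambda mapping: mapping
result = match_source({"location": "Miami, FL", "agent": "James Smith"},
                      learners, combine, constrain)
# → {'location': 'ADDRESS', 'agent': 'NAME'}
```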

  13. Example 2 (Matching Phase)

  14. Example 2 (Cont.)

  15. Example 2 (Cont.)

  16. Experimental Evaluation • Measures • Matching accuracy of a source • Average matching accuracy of a source • Average matching accuracy of a domain • Experiment Results • Average matching accuracy for different domains • Contributions of base learners and the domain constraint handler • Contributions of schema information and instance information • Performance sensitivity to the amount of data instances
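The measures can be made concrete: matching accuracy of a source is the fraction of its tags whose predicted mediated-schema element agrees with the manual mapping, and the domain average is the mean over sources. Function names and the toy data below are illustrative:

```python
def matching_accuracy(predicted, truth):
    """Fraction of source tags whose predicted mediated-schema
    element matches the manually specified one."""
    correct = sum(1 for tag, label in truth.items()
                  if predicted.get(tag) == label)
    return correct / len(truth)

def domain_accuracy(per_source_accuracies):
    # Average matching accuracy over all sources in a domain.
    return sum(per_source_accuracies) / len(per_source_accuracies)

acc = matching_accuracy(
    {"location": "ADDRESS", "contact": "PHONE", "agent": "NAME"},
    {"location": "ADDRESS", "contact": "PHONE", "agent": "OFFICE"})
# acc → 2/3 (two of three tags matched correctly)
```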

  17. Limitations • Requires Enough Training Data • Domain-Dependent Learners • Ambiguities in Sources • Efficiency • Overlap of Schemas

  18. Conclusion and Future Work • Improves over time • Extensible framework • Multiple types of knowledge • Non-1-1 mappings?
