Bootstrapping Pay-As-You-Go Data Integration Systems


Presentation Transcript


  1. Bootstrapping Pay-As-You-Go Data Integration Systems by Anish D. Sarma, Xin Dong, Alon Halevy, Proceedings of SIGMOD'08, Vancouver, British Columbia, Canada, June 2008 Presented by Andrew Zitzelberger

  2. Data Integration • Offer a single-point interface to a set of data sources • Mediated schema • Semantic mappings • Query through the mediated schema • Pay-as-you-go • Can be useful in many contexts without full integration • System starts with few (or inaccurate) semantic mappings • Mappings are improved over time • Problem • Full integration requires significant upfront and ongoing effort

  3. Contributions • Self-configuring data integration system • Provides an advanced starting point for pay-as-you-go systems • Initial configuration provides good precision and recall • Algorithms • Mediated schema generation • Semantic mapping generation • Concept • Probabilistic mediated schema

  4. Probabilistic Mediated Schema

  5. Mediated Schema Generation • 1) Remove infrequent attributes • Ensures the mediated schema contains the most relevant attributes • 2) Construct a weighted graph • Nodes are the remaining attributes • Edge weights are given by a similarity measure s(ai, aj) • Cull edges below threshold τ • 3) Cluster nodes • Each cluster is a connected component of the graph
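
A minimal Python sketch of these three steps, assuming one attribute list per source and a pluggable name-similarity function (the paper computes Jaro-Winkler similarity via SecondString). Treating θ from the setup slide as the attribute-frequency cutoff for step 1 is an assumption of this sketch:

from collections import Counter, defaultdict
from itertools import combinations

def mediated_schema(sources, similarity, tau=0.85, theta=0.10):
    """sources: one list of attribute names per data source."""
    # 1) Remove infrequent attributes: keep those appearing in at least
    #    a theta fraction of the sources (theta = 10% in the setup slide).
    freq = Counter(a for attrs in sources for a in set(attrs))
    n = len(sources)
    attrs = sorted(a for a, c in freq.items() if c / n >= theta)

    # 2) Weighted graph over the remaining attributes; edges whose
    #    similarity falls below the threshold tau are culled.
    parent = {a: a for a in attrs}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in combinations(attrs, 2):
        if similarity(a, b) >= tau:
            parent[find(a)] = find(b)   # union the two components

    # 3) Each connected component of the graph is one mediated-schema cluster.
    clusters = defaultdict(set)
    for a in attrs:
        clusters[find(a)].add(a)
    return [frozenset(c) for c in clusters.values()]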

  6. Probabilistic Mediated Schema Generation • Allow for error є in the weighted graph • Certain edges: weight ≥ τ + є • Uncertain edges: τ – є ≤ weight < τ + є • Cull edges with weight < τ – є • Remove unnecessary uncertain edges • Create a candidate schema from every subset of the uncertain edges
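
A sketch of this enumeration, reusing a union-find clustering. "Unnecessary" uncertain edges are interpreted here as those whose endpoints the certain edges already connect; that reading, and the input shapes, are assumptions of this sketch. Probability assignment (next slide) is omitted:

from itertools import combinations

def candidate_schemas(attrs, similarity, tau=0.85, eps=0.02):
    attrs = sorted(attrs)
    certain, uncertain = [], []
    for a, b in combinations(attrs, 2):
        s = similarity(a, b)
        if s >= tau + eps:
            certain.append((a, b))
        elif s >= tau - eps:
            uncertain.append((a, b))        # edges below tau - eps are culled

    def clustering(edges):
        # Connected components under the given edge set (union-find).
        parent = {a: a for a in attrs}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for a, b in edges:
            parent[find(a)] = find(b)
        comps = {}
        for a in attrs:
            comps.setdefault(find(a), set()).add(a)
        return frozenset(frozenset(c) for c in comps.values())

    # Drop uncertain edges made redundant by the certain edges: if both
    # endpoints already share a certain cluster, the edge cannot change
    # any candidate clustering.
    cert = clustering(certain)
    cluster_of = {a: c for c in cert for a in c}
    uncertain = [(a, b) for a, b in uncertain if cluster_of[a] != cluster_of[b]]

    # One candidate mediated schema per subset of the remaining uncertain
    # edges; identical clusterings collapse to a single candidate.
    candidates = set()
    for r in range(len(uncertain) + 1):
        for subset in combinations(uncertain, r):
            candidates.add(clustering(certain + list(subset)))
    return candidates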

  7. Probabilistic Mediated Schema Generation • Assign probability

  8. Probabilistic Mediated Schema

  9. Probabilistic Semantic Mappings

  10. Probabilistic Mapping Generation • Weighted correspondences • Choose the consistent p-mapping with maximum entropy

  11. Probabilistic Mapping Generation • 1) Enumerate one-to-one mappings • Each mapping contains a subset of the correspondences • 2) Assign probabilities that maximize entropy • Solve the corresponding constrained maximization problem
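
A sketch of the entropy-maximization step, assuming the constraints are: the probabilities of all mappings containing a given correspondence sum to that correspondence's weight, and all probabilities sum to one. SciPy's SLSQP solver stands in for Knitro, which the experiments actually used; `mappings` (frozensets of attribute-pair correspondences) and `weights` are assumed input shapes:

import numpy as np
from scipy.optimize import minimize

def max_entropy_probabilities(mappings, weights):
    """mappings: list of frozensets of correspondences; weights: {correspondence: w}."""
    n = len(mappings)

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)            # avoid log(0)
        return float(np.sum(p * np.log(p)))   # minimizing this maximizes entropy

    constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0}]
    for corr, w in weights.items():
        members = np.array([corr in m for m in mappings], dtype=float)
        constraints.append({
            "type": "eq",
            # probabilities of mappings containing this correspondence
            # must sum to its weight
            "fun": (lambda p, mem=members, w=w: float(mem @ p) - w),
        })

    x0 = np.full(n, 1.0 / n)                   # start from the uniform distribution
    res = minimize(neg_entropy, x0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n, constraints=constraints)
    return dict(zip(mappings, res.x))

When the constraints are feasible, the returned distribution is the maximum-entropy p-mapping consistent with the weighted correspondences.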

  12. Probabilistic Mediated Schema Consolidation • Why? • User expects a single deterministic schema • More efficient query answering • How?

  13. Schema Consolidation Example • M = {M1, M2} • M1 contains {a1, a2, a3}, {a4}, and {a5, a6} • M2 contains {a2, a3, a4} and {a1, a5, a6} • T contains {a1}, {a2, a3}, {a4}, and {a5, a6}
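
A small Python sketch consistent with this example: two attributes share a cluster in the consolidated schema T only if every mediated schema in M places them in the same cluster. The `consolidate` helper and the M1/M2 literals are illustrative, not the paper's code; the last line reproduces the T shown above:

def consolidate(schemas):
    """schemas: list of clusterings, each a list of frozensets of attributes."""
    def signature(attr):
        # For each Mi, record which of its clusters the attribute falls into.
        return tuple(next(i for i, c in enumerate(m) if attr in c) for m in schemas)

    attrs = {a for m in schemas for c in m for a in c}
    groups = {}
    for a in attrs:
        groups.setdefault(signature(a), set()).add(a)
    return [frozenset(g) for g in groups.values()]

M1 = [frozenset({"a1", "a2", "a3"}), frozenset({"a4"}), frozenset({"a5", "a6"})]
M2 = [frozenset({"a2", "a3", "a4"}), frozenset({"a1", "a5", "a6"})]
T = consolidate([M1, M2])   # -> {a1}, {a2, a3}, {a4}, {a5, a6}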

  14. Probabilistic Mapping Consolidation • Modify p-mappings • Update the mappings to match the new mediated schema • Modify probabilities • Scale each mapping's probability by Pr(Mi) • Consolidate • Add all new mappings to the new set • If a mapping is already in the new set, add the probabilities
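
Only the probability bookkeeping from this slide is sketched below; `remap`, which rewrites a mapping's target clusters against the consolidated schema T, is a hypothetical placeholder supplied by the caller, and the dictionary shapes are assumptions:

def consolidate_p_mappings(p_mappings, schema_probs, remap):
    """p_mappings: {schema_id: {mapping: probability}} per mediated schema Mi,
    schema_probs: {schema_id: Pr(Mi)},
    remap: function (mapping, schema_id) -> equivalent mapping over T."""
    consolidated = {}
    for mi, mapping_dist in p_mappings.items():
        for mapping, prob in mapping_dist.items():
            new_mapping = remap(mapping, mi)          # update to match T
            weighted = prob * schema_probs[mi]        # scale by Pr(Mi)
            # Duplicate mappings accumulate their probabilities.
            consolidated[new_mapping] = consolidated.get(new_mapping, 0.0) + weighted
    return consolidated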

  15. Experimental Setup • UDI – the data integration system • Accepts select-project queries (over a single table) • Source data – MySQL • Query processor – Java • Jaro-Winkler similarity computation – SecondString • Entropy-maximization problem – Knitro • Operating System – Windows Vista • CPU – Intel Core 2 GHz • Memory – 2 GB

  16. Experimental Setup • τ = 0.85 • є = 0.02 • θ = 10%

  17. Experiments • Domains: Movie, Car, People, Course, Bibliography • Gold standards • Manually created for People and Bibliography • Partially created for the others • 10 test queries • One to four attributes in the SELECT clause • Zero to three predicates in the WHERE clause

  18. Results • Estimated actual recall between 0.8 and 0.85

  19. Experiments • Compare to other methods: • MySQL keyword search engine • KEYWORDNAIVE • KEYWORDSTRUCT • KEYWORDSTRICT • SOURCE • Unions the results of each data source • TOPMAPPING • Considers only the p-mapping with the highest probability

  20. Results

  21. Experiments • Compare against other query-answering methods: • SINGLEMED – single deterministic mediated schema • UNIONALL – single deterministic mediated schema that contains a singleton cluster for each frequent source attribute

  22. Results

  23. Experiment and Results • Quality of mediated schema • Test against manually created schema

  24. Experiment and Results • Setup efficiency • 3.5 minutes for 817 data sources • Setup time increases roughly linearly with the number of data sources • The maximum-entropy problem is the most time-consuming step

  25. Future Work • Different schema matcher • Dealing with multiple-table sources • Including multi-table schemas • Normalizing mediated schemas

  26. Analysis • Positives • Lots of support (proofs and experiments) • Negatives • Detail • Pictures
