1 / 26

CSE 636 Data Integration

CSE 636 Data Integration. Overview. Fall 2006. What is Data Integration?. The problem of providing uniform (sources transparent to user) access to (query, and eventually updates too) multiple (even 2 is a problem!) autonomous (not affect the behavior of sources)

Download Presentation

CSE 636 Data Integration

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSE 636Data Integration Overview Fall 2006

  2. What is Data Integration? The problem of providing • uniform (sources transparent to user) • access to (query, and eventually updates too) • multiple (even 2 is a problem!) • autonomous (not affect the behavior of sources) • heterogeneous (different data models, schemas) • structured (at least semistructured) • data sources (not only databases)

  3. Motivation • Enterprise data integration; web-site construction. • World-wide web: • comparison shopping (Netbot, Junglee) • portals integrating data from multiple sources • XML integration • Science & culture • Medical genetics: integrating genomic data • Astrophysics: monitoring events in the sky • Environment: Puget Sound Regional Synthesis Model • Culture: uniform access to all the cultural databases produced by different countries.

  4. Principle Dimensions of Data Integration • Virtual vs. materialized architecture • Access: query only or query&update? • problem similar to updating through views • need distributed transactional services. • Mediated schema: yes or no? • Mediated schema requires schema integration and then query reformulation. • Without mediated schema, we lose some of the advantages of data integration.

  5. Data Warehouse Architecture  OLAP / Decision Support Data Cubes / Data Mining  Users Applications Relational Database (Warehouse) ETL Tools (Extract-Transform-Load) Data Cleaning Data Source Data Source Data Source

  6. Virtual Integration Architecture • Leave the data in the sources • When a query comes in: • Determine the relevant sources to the query • Break down the query into sub-queries for the sources • Get the answers from the sources, filter them if needed and combine them appropriately • Data is fresh • Otherwise known as On Demand Integration

  7. Virtual Integration Architecture Schema Mappings Schema Mappings Schema Mappings Design-Time   End Users Applications Sources can be: • Relational DBs • Excel Files • Web Sites • Web Services Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  8. Schema Mappings • Differences in: • Names in schema • Attribute grouping • Coverage of databases • Granularity and format of attributes Inventory Database B Books Title ISBN Price DiscountPrice Edition Authors ISBN FirstName LastName Inventory Database A BookCategories ISBN Category BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords CDCategories ASIN Category CDs Album ASIN Price DiscountPrice Studio Artists ASIN ArtistName GroupName

  9. Issues for Schema Mappings Schema Mappings Schema Mappings Schema Mappings Design-Time   End Users Applications • What formalisms to express them? • How to create them? • Can we discover them somehow? • How do we use them? Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  10. Virtual Integration Architecture Reformulation Optimization Execution Wrapper Wrapper Run-Time Query Result Mediator Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  11. Issues for Query Processing Reformulation Query • User queries refer to the global schema • Data is stored in the sources in a local schema • Rewriting algorithms Mediator Reformulation Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  12. Issues for Query Processing Reformulation Global Schema SELECT ISBN, Price FROM Books WHERE Title = ‘on the road’ Books Title ISBN Price DiscountPrice Edition Local Schema A SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = ‘on the road’ AND ItemType = ‘Books’ BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords

  13. Issues for Query Processing Wrapper Query Translation Query • Different query languages Mediator Reformulation Global Schema Optimization Execution Data Source Data Source Data Source Local Schema Local Schema Local Schema

  14. Issues for Query Processing Query Translation Global Schema SELECT ISBN, Price FROM Books WHERE Title = ‘on the road’ Books Title ISBN Price DiscountPrice Edition Local Source A http://www.amazon.com/homepage.html?ItemType=Books&Title=on+the+road

  15. Issues for Query Processing Reformulation Optimization Execution Wrapper Data Translation Query • Different data models Mediator Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  16. Issues for Query Processing Data Translation Global Schema Books Title ISBN Price DiscountPrice Edition <table> <tr> <td> <a href=/details?isbn=123> <b>On the Road</b> </a> -- by Jack Kerouac; Paperback <br> <a href=/details?isbn=123> Buy new </a> :<b class=price>$10.86</b> </td> </tr> </table> Local Result A

  17. Issues for Query Processing Reformulation Optimization Execution Wrapper Wrapper Query Execution Query • Access as many data sources as needed • Duplicate/redundant and irrelevant data • Limited query capabilities Mediator Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  18. Issues for Query Processing SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = ‘on the road’ A Limited Query Capabilities SELECT ISBN, Price, DiscountPrice FROM Books WHERE Title = ‘on the road’ Global Schema Books Title ISBN Price DiscountPrice Edition E Local Schema A Local Schema B BooksAndMusic Title Author ItemID ItemType SuggestedPrice DiscountBooks Title Edition ISBN GreatPrice B D SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = ? SELECT GreatPrice FROM DiscountBooks WHERE ISBN = 123 SELECT GreatPrice FROM DiscountBooks WHERE ISBN = ? C

  19. Issues for Query Processing Reformulation Optimization Execution Wrapper Wrapper Query Answering Query Result • Combine the results and further process them if needed • Mainly union and merge • Inconsistencies Mediator Global Schema Data Source Data Source Data Source Local Schema Local Schema Local Schema

  20. Issues for Query Processing Query Answering (Union)

  21. Issues for Query Processing Query Answering (Merge) Primary Key Primary Key Primary Key

  22. Issues for Query Processing Query Answering (Inconsistencies) Primary Key Primary Key Primary Key

  23. Peer-Based Integration Query Peer 4 Query Peer 5 Peer 2 Peer 1 Peer 3

  24. Peer-Based Integration • No need for a central mediated schema • Peers serve as mediators for other peers • A peer can be both a server and a client • Semantic relationships are specified locally(between small sets of peers) • Queries are posed using the peer’s schema • Answers come from anywhere in the system • This is not P2P file sharing. • Data has rich semantics

  25. References • Information integration • Maurizio Lenzerini • Eighteenth International Joint Conference on Artificial Intelligence, IJCAI 2003 • Invited Tutorial • Data Integration: a Status Report • Alon Halevy • German Database Conference (BTW), 2003 • Invited Talk

More Related