
CS 540 Database Management Systems




  1. CS 540 Database Management Systems Data Integration

  2. Data Integration Problem
• When was the last time you used only one data source to get your answers?
  • various websites through Google, local DBs, …
• How can we access a set of heterogeneous, distributed, autonomous databases as if accessing a single database?
• Arises in numerous contexts
  • on the Web, at enterprises, in scientific cooperation, bioinformatics, e-commerce, etc.
• A perennially hot research area
  • in both database research and industry

  3. Current State
• Mostly ad-hoc programming
  • create a special solution for every case; expensive
• Long-standing challenge in the DB community
  • one of the oldest problems in databases
• AI/WWW communities are on board
  • annual workshops, vision papers, ...
• Companies
  • IBM, Google, and many others

  4. A Brief Research History
• Many early ad-hoc solutions
• Converged into two approaches
  • data warehousing vs. virtual DI systems
• Semi-structured data, XML
• Wrappers, info extraction
• Other issues: query optimization, schema matching, ...
• Current directions
  • DI for specialized domains (e.g., bioinformatics)
  • on-the-fly DI, entity-centric DI
  • simplify integration tasks
• New types of data sharing systems
  • P2P systems, Semantic Web

  5. Data Warehouse Architecture
[Architecture diagram] Data flows from several data sources through data extraction programs and a data cleaning/scrubbing step into a relational database (the warehouse); user queries are answered on top of the warehouse by OLAP / decision support / data cubes / data mining.

  6. Virtual Integration Architecture
[Architecture diagram] User queries are posed against a mediated schema and handled by the mediator (reformulation engine, optimizer, data source catalog, execution engine), which contacts each data source through its own wrapper.
Sources can be: relational, hierarchical (IMS), structured files, web sites.

  7. A Brief Research History
• Many early ad-hoc solutions
• Converged into two approaches
  • data warehousing vs. virtual DI systems
• Wrappers, info extraction
• Other issues: query optimization, schema matching, ...
• Current directions
  • DI for specialized domains (e.g., bioinformatics)
  • on-the-fly DI, entity-centric DI
  • simplify integration tasks
• New types of data sharing systems
  • P2P systems, Semantic Web

  8. Wrapper Programs
• Task: communicate with the data sources and do format translations.
• They are built w.r.t. a specific source.
• They can sit either at the source or at the mediator.
• Often hard to build (very little science).
• Can be “intelligent”: perform source-specific optimizations.

  9. Example Transform:
<b> Introduction to DB </b>
<i> Phil Bernstein </i>
<i> Eric Newcomer </i>
Addison Wesley, 1999
into:
<book>
  <title> Introduction to DB </title>
  <author> Phil Bernstein </author>
  <author> Eric Newcomer </author>
  <publisher> Addison Wesley </publisher>
  <year> 1999 </year>
</book>
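As a sketch, the transformation on this slide can be written as a tiny Python wrapper. The regular expressions encode assumptions about this one source's markup (titles in `<b>`, authors in `<i>`, "publisher, year" as trailing text) and would break on any other layout, which is exactly the brittleness the slides describe:

```python
import re

def wrap_book(html: str) -> str:
    """Toy wrapper: translate one book record from the source's HTML
    markup into the mediator's XML format."""
    title = re.search(r"<b>\s*(.*?)\s*</b>", html).group(1)
    authors = re.findall(r"<i>\s*(.*?)\s*</i>", html)
    # Assumes the publisher and year trail the last </i> as "Publisher, YYYY".
    publisher, year = re.search(r"</i>\s*([^,<]+),\s*(\d{4})", html).group(1, 2)
    parts = [f"<title> {title} </title>"]
    parts += [f"<author> {a} </author>" for a in authors]
    parts += [f"<publisher> {publisher.strip()} </publisher>",
              f"<year> {year} </year>"]
    return "<book> " + " ".join(parts) + " </book>"

src = ("<b> Introduction to DB </b> <i> Phil Bernstein </i> "
       "<i> Eric Newcomer </i> Addison Wesley, 1999")
print(wrap_book(src))
```

A real wrapper would also handle missing fields, multiple records per page, and layout drift; the research on wrapper construction (next slide) is largely about learning such rules instead of hand-coding them.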

  10. Wrapper Construction
• Huge amount of research in the past decade
• Two major approaches
  • supervised learning: typically requires some hand-labeled data
  • unsupervised learning: data-intensive, completely automatic
• Different focuses
  • pull out each record (i.e., segment the page into records)
  • pull out fields in each record
  • remove junk portions (ads, etc.)
• Current solutions are still brittle

  11. Information Extraction
• If the source cannot be wrapped with a grammar or some easy-to-parse rules
  • must do information extraction
• A huge body of research in the AI and DB communities.

  12. A Brief Research History
• Many early ad-hoc solutions
• Converged into two approaches
  • data warehousing vs. virtual DI systems
• Wrappers, info extraction
• Other issues: query optimization, schema matching, ...
• Current directions
  • DI for specialized domains (e.g., bioinformatics)
  • on-the-fly DI, entity-centric DI
  • simplify integration tasks
• New types of data sharing systems
  • P2P systems, Semantic Web

  13. Virtual Integration Architecture
[Architecture diagram] User queries are posed against a mediated schema and handled by the mediator (reformulation engine, optimizer, data source catalog, execution engine), which contacts each data source through its own wrapper.

  14. Data Source Catalog
• Contains all meta-information about the sources:
  • logical source contents (books, new cars)
  • source capabilities (can answer SQL queries)
  • source completeness (has all books)
  • physical properties of source and network
  • statistics about the data (like in an RDBMS)
  • source reliability
  • mirror sources
  • update frequency
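As a sketch, one catalog entry per source might look like the following Python dataclass; every field name and value here is invented for illustration, not from any particular system:

```python
from dataclasses import dataclass, field

@dataclass
class SourceEntry:
    """One catalog entry holding the meta-information the slide lists."""
    name: str
    contents: str            # logical contents, e.g. "books"
    capabilities: set        # e.g. {"sql"} vs. {"keyword-only"}
    complete: bool           # does it have *all* books?
    avg_latency_ms: float    # physical property of source and network
    reliability: float       # fraction of requests answered successfully
    mirrors: list = field(default_factory=list)
    update_freq: str = "daily"

catalog = {
    "S1": SourceEntry("S1", "books", {"sql"}, True, 40.0, 0.99,
                      mirrors=["S1-mirror"], update_freq="hourly"),
    "S2": SourceEntry("S2", "books", {"keyword-only"}, False, 250.0, 0.90),
}

# The mediator can use these statistics much like an RDBMS uses table
# statistics, e.g. preferring low-latency, reliable sources:
best = min(catalog.values(), key=lambda s: s.avg_latency_ms / s.reliability)
print(best.name)
```

The point is that the catalog is ordinary structured metadata the reformulation engine and optimizer can consult, not anything exotic.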

  15. Content Descriptions
• User queries refer to the mediated schema.
• Data is stored in the sources in a local schema.
• Content descriptions provide the semantic mappings between the different schemas.
• The data integration system uses the descriptions to translate user queries into queries on the sources.

  16. Approaches to Specifying Source Descriptions
• Global-as-view: express the mediated schema relations as a set of views over the data source relations.
• Local-as-view: express the source relations as views over the mediated schema.
• Can be combined with no additional cost.

  17. Global-as-View
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Create View Movie AS
  select * from S1                               [S1(title, dir, year, genre)]
  union
  select * from S2                               [S2(title, dir, year, genre)]
  union
  select S3.title, S3.dir, S4.year, S4.genre     [S3(title, dir), S4(title, year, genre)]
  from S3, S4
  where S3.title = S4.title

  18. Global-as-View: Example 2
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Create View Movie AS
  select title, dir, year, NULL from S1          [S1(title, dir, year)]
  union
  select title, dir, NULL, genre from S2         [S2(title, dir, genre)]

  19. Global-as-View: Example 3
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create View Movie AS
  select NULL, NULL, NULL, genre from S4
Create View Schedule AS
  select cinema, NULL, NULL from S4
But what if we want to find which cinemas are playing comedies?

  20. Global-as-View Summary
• Query reformulation boils down to view unfolding.
  • Very easy conceptually.
• Can build hierarchies of mediated schemas.
• You sometimes lose information. Not always natural.
• Adding sources is hard: need to consider all other sources that are available.
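To make "view unfolding" concrete, here is a minimal Python sketch (the tables and tuples are invented for illustration). The GAV definition of Movie from slide 17 is simply substituted into a user query over the mediated schema, yielding a query that runs directly against the sources:

```python
# Sources as lists of (title, dir, year, genre) tuples -- invented data.
s1 = [("Heat", "Mann", 1995, "Crime")]
s2 = [("Airplane!", "Abrahams", 1980, "Comedy")]

def movie():
    """GAV definition: Movie = S1 UNION S2 (the simple case of slide 17)."""
    return s1 + s2

def comedies():
    """User query over the mediated schema: all movies with genre 'Comedy'."""
    return [t for t in movie() if t[3] == "Comedy"]

# View unfolding: substitute the definition of Movie into the query,
# pushing the selection down to each source.
unfolded = ([t for t in s1 if t[3] == "Comedy"] +
            [t for t in s2 if t[3] == "Comedy"])
assert unfolded == comedies()
print(unfolded)
```

This is why GAV reformulation is "very easy conceptually": it is pure textual substitution plus standard query simplification.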

  21. Local-as-View: Example 1
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Create Source S1 AS
  select * from Movie
Create Source S3 AS                              [S3(title, dir)]
  select title, dir from Movie
Create Source S5 AS
  select title, dir, year from Movie
  where year > 1960 AND genre = “Comedy”

  22. Local-as-View: Example 2
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create Source S4 AS
  select cinema, genre
  from Movie m, Schedule s
  where m.title = s.title
Now if we want to find which cinemas are playing comedies, there is hope…

  23. Local-as-View Summary
• Very flexible: you have the power of the entire query language to define the contents of the source.
• Hence, can easily distinguish between the contents of closely related sources.
• Adding sources is easy: they’re independent of each other.
• Query reformulation: answering queries using views.
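A minimal Python sketch of "answering queries using views" under LAV (the data is invented; S5 follows its definition from slide 21: comedies after 1960, with the genre column projected away). The mediator never sees the Movie relation, only the views' answers, and must rewrite the query to use them:

```python
# S5(title, dir, year) := select title, dir, year from Movie
#                         where year > 1960 AND genre = "Comedy"
# Only these view answers are available to the mediator:
s5 = [("Airplane!", "Abrahams", 1980), ("Annie Hall", "Allen", 1977)]

def q_comedies_after(y):
    """Q: titles of comedies with year > y, rewritten to use only S5.
    For y >= 1960 the rewriting is both sound (every tuple of S5 is a
    comedy after 1960) and as complete as the sources allow."""
    assert y >= 1960, "for y < 1960, S5 alone cannot provide all answers"
    return [title for (title, d, year) in s5 if year > y]

print(q_comedies_after(1978))
```

Finding such rewritings automatically, for arbitrary queries and view sets, is exactly the "answering queries using views" problem on the next slide; algorithms like MiniCon search for combinations of views that cover the query.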

  24. Answering Queries Using Views
• Given a set of views V1, …, Vn, and a query Q, can we answer Q using only the answers to V1, …, Vn?
• Many, many papers on this problem.
• The best performing algorithm: the MiniCon algorithm (Pottinger & Levy, 2000).
• Great survey on the topic: (Halevy, 2001).

  25. The General Reformulation Problem
• Given:
  • a query Q posed over the mediated schema
  • descriptions of the data sources
• Find:
  • a query Q’ over the data source relations, such that:
    • Q’ provides only correct answers to Q, and
    • Q’ provides all possible answers to Q given the sources.

  26. Query Reformulation
• Generate good logical query plans quickly
• Recent work
  • bucket algorithm [Levy et al.]
  • inverse-rules algorithm [Duschka et al.]
  • MiniCon algorithm [Pottinger et al.]
  • iDRIPS/Streamer [Doan et al.]

  27. Query Optimization
• Very related to query reformulation
• Goal of the optimizer: find a physical plan with minimal cost.
• Key components in optimization:
  • search space of plans
  • search strategy
  • cost model
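The three components can be illustrated with a toy Python optimizer (all cardinalities and the uniform join selectivity are made-up numbers): the search space is left-deep join orders over the example relations, the search strategy is exhaustive enumeration, and the cost model sums estimated intermediate result sizes.

```python
from itertools import permutations

# Invented cardinalities for the relations from the earlier examples.
card = {"Movie": 10_000, "Schedule": 50_000, "S4": 200}
SELECTIVITY = 0.001  # assumed uniform join selectivity

def plan_cost(order):
    """Cost model: sum of estimated intermediate result sizes for a
    left-deep plan joining the relations in the given order."""
    size, cost = card[order[0]], 0.0
    for rel in order[1:]:
        size = size * card[rel] * SELECTIVITY  # estimated join output
        cost += size                           # pay for each intermediate
    return cost

# Search space: all left-deep orders; search strategy: exhaustive.
best = min(permutations(card), key=plan_cost)
print(best, plan_cost(best))
```

In a data integration setting the catch, as the next slides argue, is that the cardinalities and selectivities this cost model needs are exactly what the mediator usually does not have for remote, autonomous sources.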

  28. Can We Get RDBMS-Level Optimization for Data Integration, without Statistics?
• Multiple remote sources
  • described and mapped “loosely”
  • data changes frequently
• Generally, we would like to support the same kinds of queries as in a local setting
[Architecture diagram] A query enters the data integration system (mediated schema, source catalog, schema mappings), which computes results over remote, autonomous data sources.

  29. Adaptive Query Processing
• Many domains where cost-based query optimization fails:
  • complex queries in traditional databases: estimation error grows exponentially with # joins [IC91]
  • querying over the Internet: unpredictable access rates, delays
  • querying external data sources: limited information available about the properties of the source
• Monitor real-world conditions, adapt the processing strategy in response

  30. What Are the Sources of Inefficiency?
• Delays – we “stall” in waiting for I/O
• Bad estimation of intermediate result sizes
• No info about source cardinalities
• Major challenges:
  • trading off information acquisition (exploration) vs. use (exploitation)
  • extrapolating performance based on what you’ve seen so far

  31. Types of Adaptive Query Processing
• Adaptive scheduling (query scrambling [UF98], dynamic pipeline scheduling [UF01], XJoin [UF00], PH Join [RS86][I+99], ripple join [HH99])
  • changes CPU scheduling to improve feedback or reduce delays
• Redundant computation (competitive execution [AZ96])
  • compare two+ ways of executing the query
• Plan partitioning ([S+76][KD98][I+99][M+04])
  • break the plan into stages; re-optimize future stages as necessary
• Adaptive info passing ([IT05 sub.])
  • pass data between parts of an executing plan
• Adaptive data partitioning
  • break the data into subsets; use a different plan for each subset
  • the only way to reduce overall computation with fine granularity
  • the first (and only) implementation has been eddies [AH00][R+03][DH04]

  32. Adaptive QP in Summary
• A variety of different techniques, focusing on:
  • scheduling
  • comparison & competition
  • information passing
  • data + plan partitioning
• A field that is still fairly open – missing:
  • effective exploration methods
  • a true theory!
    • What’s possible? What kinds of queries make sense to adapt?
    • Guarantees of optimality and convergence (perhaps under certain assumptions)

  33. A Brief Research History
• Many early ad-hoc solutions
• Converged into two approaches
  • data warehousing vs. virtual DI systems
• Semi-structured data, XML
• Wrappers, info extraction
• Other issues: query optimization, schema matching, ...
• Current directions
  • DI for specialized domains (e.g., bioinformatics)
  • on-the-fly DI, entity-centric DI
  • simplify integration tasks
• New types of data sharing systems
  • P2P systems, Semantic Web

  34. What You Should Know
• Data integration is a very active research area in DB
  • strong industry needs
  • many interesting research questions
• Touches many other fields: artificial intelligence (ontologies, machine learning, …), information retrieval, …
• Web data integration is a very important special case
