1 / 19

Data Integration Systems overview

Data Integration Systems overview. The architecture of a data integration system: Components and their interaction Tasks Concepts. Main components of a DI system. (I) Mediator מתווך Supports in its user interface :

stormy
Download Presentation

Data Integration Systems overview

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Integration Systemsoverview The architecture of a data integration system: Components and their interaction Tasks Concepts Integration-intro

  2. Main components of a DI system (I) Mediator מתווך Supports in its user interface : • The global data model • The integrated / global / mediated schema / world view • A query language Manages the interaction with sources • Posing queries • Receiving answers, transforming and showing them Is responsible for query execution strategies • planning • carrying out Integration-intro

  3. (II) Wrapper עוטף Serves as the interface to a source • Receive queries from a mediator • Plan and execute how to retrieve the data from its source • Transform data to global data model • Send to mediator For an SQL source, these are rather easy For a restricted capability source, may require • A series of queries on the source, or • A program to be executed (on a non-db source) • Filtering results obtained from the source Integration-intro

  4. mediator A simple architecture: Arrows represent query and data flow wrapper wrapper source source Integration-intro

  5. mediator A more complex architecture: Mediators can serves as wrapped sources for other mediators mediator wrapper wrapper wrapper source source source Integration-intro

  6. Important: • The global database is virtual – contain no data • The data reside in the sources • The users pose queries as if the data resides in the global db • Users may/may not be aware that the data actually comes from the sources Integration-intro

  7. Main tasks & activities: At mediator: • Query reformulation & decomposition • express queries in terms of the sources’ schemas • decompose into queries on sources • Planning query execution, including optimization • a declarative query may be executed in various ways (even in a single centralized db) • different sources may provide same data at different costs (money, communication time, response time, delays, …) • If data is associated with user priorities, we may want to retrieve some answers before others • When answers arrive – fuse them – a full answer is not a simple union of partial answers; data on an entity must be combined (fused) into a single record Integration-intro

  8. Requirements: (from mediator, wrapper, system) Ability to handle • Incomplete information(data may be missing from available sources) • Heterogeneity – in data model, schema, contents • Both data and meta-data ability to describe sources: • Capabilities • what queries can a source answer • What mechanisms does it offer for data retrieval • Coverage – allows to know • Where can data relevant to a query be found • Is there overlap between sources? Integration-intro

  9. The relationship between source and global data The global data is virtual  the mediated schema describes data that • resides in the sources • is described by source schemas The relationship between the mediated and source data determines how queries are answered Two main approaches (a combination of the two – later) Integration-intro

  10. Global as View – GAV The global db is defined as a view on the sources In relational model: Each global relation defined as a view, by a query on sources Obvious advantage: simplicity of query answering Given Q on global relations, • expand it: replace each atom R(x) by an expression on sources, using the definition of R • Then send appropriate sub-queries to sources Integration-intro

  11. Simple example: a university database Source A: • Dept(D, C) – departments and their courses • Teaches(C,T) – teachers of courses Source B: • Enroll(S, C) – student enrollment to courses Integrated schema & its definition: • Stud(S, D, T) :- Dept(D, C), Teach(C, T), Enroll(S, C) Query Q: • Stud(S, ‘CS’, ‘Beeri’) Expand body to Dept(‘CS’, C), Teach(C, ‘Beeri’), Enroll(S, C) Then use one of (at least) two execution strategies on sources A,B Integration-intro

  12. Local as View – LAV The global database is viewed as the “real world” Each source is defined as a view on it Example (revisited): Global schema: Univ(D, C, T, S) Source A: • Dept(D, C) :- Univ(D, C, T, S) • Teaches(C, T) :- Univ(D, C, T, S) Source B: • Enroll(S, C) :- Univ(D, C, T, S) Integration-intro

  13. Possible assumptions on sources: • A source contains all data in its defining view • A source contains some of the data in its view, usually not all 2nd is more realistic Example: Global database describes cars for sale A source may contain : • only some of the attributes of cars present in the global schema (e.g., it may not contain history, or owner-contact) • Only some of the cars for sale full view / contained view Obviously, the more sources we have, the more cars Integration-intro

  14. Query answering in LAV: • Expansion is not possible • An approach: answering queries using views Practically: rewriting queries using views (differences explored later) Only the views have data  rewrite query to an expression over the views expression must be (explained in more detail later) • Full views: equivalent to query • Contained views: contained in query • Solution may/may not exist (contrast to expansion) • Finding it is more difficult This problem was explored in many contexts, e.g.: Query optimization using views/previous answers Integration-intro

  15. Why prefer LAV to GAV? • Ease of expanding a system: • In GAV, adding a source may require re-definition of global schema – makes it difficult to add sources • In LAV, just define the new source as a view given an algorithm for using views to answer queries, it automatically uses the new source As for expanding queries vs. using views: Even in GAV, when sources have restricted capabilities, query answering requires using views Integration-intro

  16. Typically, a global schema reflects a real ‘world’, as we know it; each source materializes only a fragment • Horizontal – not all entity types or attributes are present • Vertical – not all entities of a type are present Thus, it is natural to define the sources as (contained) views Examples: • Cars for sale: • global db reflects our understanding and requirements • A source provides only some info, only on the cars it has • Looking for personal information using UNIX facilities • we know about: name, office, phone, email, … • Each facility may offer only some of the above Integration-intro

  17. LAV is a natural approach in the presence of • www and its diversity & dynamicity of source • Legacy systems Most research efforts & systems are LAV Integration-intro

  18. On rewriting queries using views: It is not clear (now) how to obtain a rewriting, given Q But, given v1(..), v2(..), …, vn(..) as a candidate, we may • expand each vi using its definition in terms of the global schema • Check whether the resulting expression is equivalent to or contained in Q (both Q and the expansion are in terms of global schema relations)  Equivalence and containment of queries are fundamental problems for data integration Integration-intro

  19. Example (our LAV example): Q: ans(S,’CS’,’Beeri’) :- Univ(’CS, C, ’Beeri’,S) Guess an answer in terms of views: ans`(S, ’CS’ , ‘Beeri’ ) :- Dept(‘CS’,C), Teach(C, ‘Beeri’), Enroll(S,C) (Note: must use distinct variables in different expansions for all non-join variables) Is the query equivalent to this expansion? Is the expansion contained in the query? Univ(‘CS’,C, T1, S1) Univ(D2,C, ‘Beeri’, S2) Univ(D3,C, T3, S) Integration-intro

More Related