1 / 85

Schema and Data Translation

Schema and Data Translation. Paolo Atzeni Università Roma Tre Based on Tutorial at ICDE 2006 Paper in EDBT 2006 (with P. Cappellari and P. Bernstein) Demo in Sigmod 2007 (with P. Cappellari and G. Gianforme) ADBIS --- Varna , October 2, 2007. Outline. Introduction Model management

verda
Download Presentation

Schema and Data Translation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Schema and Data Translation Paolo Atzeni Università Roma Tre Based on Tutorial at ICDE 2006 Paper in EDBT 2006 (with P. Cappellari and P. Bernstein) Demo in Sigmod 2007 (with P. Cappellari and G. Gianforme) ADBIS --- Varna, October 2, 2007

  2. Outline • Introduction • Model management • Schema and data translation: the problem • A metamodel based approach Varna - October 2, 2007

  3. A ten-year goal for database research • The “Asilomar report”(Bernstein et al. Sigmod Record 1999 www.acm.org/sigmod): • The information utility:make it easy for everyone to store, organize, access, and analyze the majority of human information online Varna - October 2, 2007

  4. A general framework: cooperation • "The capacity of a system to interact (effectively) with other systems, possibly managed by different organizations" • Forms of cooperation • Process-centered cooperation: • the systems offer services one another, by exchanging messages, or by triggering activities, without making remote data explicitly visible • Data-centered cooperation: • the systems offer data one another; data is distributed, heterogeneous and autonomous Varna - October 2, 2007

  5. Databases in the Internet era • Databases before the Internet • An internal infrastructure, a precious resource, but usually hidden, with some controlled cooperation • Internet changes the requirements • More users (not only humans), more diverse cooperating systems (distributed, heterogeneous, autonomous), more types of data • "Future" Internet changes more • New devices (embedded everywhere), even more users (many “per person”), real mobility, need for personalization and adaptation Varna - October 2, 2007

  6. The most studied form of data-centered cooperation: integration • We are interested in data-centered cooperation, often referred to as integration “The unification of related, heterogeneous data from disparate sources, for example, to enable collaboration” (Hammer & Stonebraker 2005) • Some "paradigms" … Varna - October 2, 2007

  7. Global Manager Mediator Mediator Mediator Local mgr Local mgr Local mgr DB DB DB Multidatabase Varna - October 2, 2007

  8. Mediator Mediator Mediator Local manager Local manager Local manager Data Warehousing System DW manager Data Warehouse “Integrator” DB DB DB Varna - October 2, 2007

  9. DB Mediator Local manager Local manager Local manager Local manager Application Intermediate solutions in practice Local Manager DB Integrator Mediator Mediator DB DB DB Varna - October 2, 2007

  10. Peer-based architecture Peer mgr Peer mgr Mediator Local mgr Local mgr Mediator Local mgr Local mgr DB DB DB DB Mediator Mediator Peer mgr Mediator Local mgr Local mgr DB DB Mediator Varna - October 2, 2007

  11. Data is not just in databases • Mail messages • Web pages • Spreadsheets • Textual documents • Palmtop devices, mobile phones • Multimedia annotations (e.g., in digital photos) • XML documents Varna - October 2, 2007

  12. The same data in the same form? • Adaptivity: • Personalization: content adapted to the user • upon system's decision • upon user's request • Customization: structure adapted to the user • according to the user's role • upon user's request • Context dependence • User, Device, Network, Place, Time, Rate Varna - October 2, 2007

  13. Summarizing: a general need • We have data at various places, and data has to be • exchanged • replicated • migrated • integrated • adapted Varna - October 2, 2007

  14. A major difficulty • Heterogeneity • System level • Model level • Structural (different structure for similar data) • Semantic (different meaning for the same structure) • Many efforts, but current techniques are mostly manual and ad hoc Varna - October 2, 2007

  15. A direction for the solutions • Be general! • ad hoc solution could work in-the-small, but they • are repetitive and time consuming • do not scale • are not maintainable • Historical notes: • W. C. McGee: Generalization: Key to Successful Electronic Data Processing. J. ACM 1959 • Indeed, databases are the result of generalization applied to secondary storage management! Varna - October 2, 2007

  16. Generality requires … • … high-level descriptions of problems within the family of interest: • Metadata: • “data about data” • (formal or informal) description of structures and meaning • General solutions leverage on metadata (and then operate on data as a consequence) Varna - October 2, 2007

  17. Outline • Introduction and motivation • Model management • Schema and data translation: the problem • A metamodel based approach Varna - October 2, 2007

  18. A wider perspective • (Generic) Model Management: • A proposal by Bernstein et al (2000 +) • Includes a set of operators on • schemas and • mappings between schemas Varna - October 2, 2007

  19. Model Mgmt people Traditional DB people Meta-metamodel Metamodel Metamodel Model Model Schema Terminology: a warning Varna - October 2, 2007

  20. Schemas and mappings • A simplified approach: • Schema: • a set of elements, related in some way to one another • Mapping: • a set of correspondences (pair of elements) or • its reification, a third schema related to the other two via two sets of correspondences Varna - October 2, 2007

  21. Model mgmt operators, a first set • map = Match (S1, S2) • S3 = Merge (S1, S2, map) • S2 = Diff (S1, map) • and more • map3 = Compose (map1, map2) • S2 = Select (S1, pred) • Apply (S, f) • list = Enumerate (S) • S2 = Copy (S1) • … Varna - October 2, 2007

  22. Match • map = Match (S1, S2) • given • two schemas S1, S2 • returns • a mapping between them • the “classical” initial step in data integration: • find the common elements of two schemas and the correspondences between them Varna - October 2, 2007

  23. Merge • S3 = Merge (S1, S2, map) • given • two schemas and a mapping between them • returns • a third schema (and two mappings) • the “classical” second step in data integration: • given the correspondences, find a way to obtain one schema out of two Varna - October 2, 2007

  24. Diff • S2 = Diff (S1, map) • given • a schema and a mapping from it (to some other schema, not relevant) • returns • a (sub-)schema, with the elements that do not participate in the mapping Varna - October 2, 2007

  25. Example (Bernstein and Rahm, ER 2000) • A database (a “source”), a data warehouse and a mapping between the two • we get a second source, with some similarity to the first one • and we want to update the DW DB1 DW2 DW1 DB2 Varna - October 2, 2007

  26. Example, the "solution" m2 = Match(DB1,DB2) m3= Compose(m2,m1) DB2’=Diff(DB2,m3) DW2’, m4 user defined m5 = Match(DW1,DW2’) DW2 = Merge(DW1,DW2’,m5) DW2 DB1 DW1 m1 m3 m2 m5 DB2 DW2’ m4 DB2’ Varna - October 2, 2007

  27. Magic does not exist • Operators might require human intervention: • Match is the main case • Scripts involving operators might require human intervention as well (or at least benefit from it): • a full implementation of each operator might not always available • a mapping might require manual specification • incomparable alternatives might exist Varna - October 2, 2007

  28. The “data level” • The major operators have also an extended version that operates on data, and not only on schemas • Especially apparent for • Merge Varna - October 2, 2007

  29. We also have heterogeneity • Round trip engineering (Bernstein, CIDR 2003) • A specification (for example ER or UML) and an implementation (for example, relational) • then a change to the implementation: want to revise the specification • We need a translation from the implementation model to the specification one S1 S2 I1 I2 Varna - October 2, 2007

  30. Model management with heterogeneity • The previous operators have to be “model generic” (capable of working on different models) • We need a “translation” operator • <S2, map12> = ModelGen (S1) Varna - October 2, 2007

  31. ModelGen, an additional operator • <S2, map12> = ModelGen (S1) • given • a schema (in a model) • returns • a schema (in a different data model) and a mapping between the two • A “translation” from a model to another • I should call it “SchemaGen” … • We should better write • <S2, map12> = ModelGen (S1,mod2) Varna - October 2, 2007

  32. Round trip engineering S1 S2 S2’ m3 m1 m4 m2 I1 I2 I2’ m2 = Match (I1,I2) m3 = Compose (m1,m2) I2’= Diff(I2,m3) <S2’,m4 > = Modelgen(I2’) … Match, Merge Varna - October 2, 2007

  33. Outline • Introduction • Model management • Schema and data translation: the problem • A metamodel based approach Varna - October 2, 2007

  34. Schema and data translation, a long standing issue • Schema translation: • given schema S1 in model M1 and model M2 • find a schema S2 in M2 that “corresponds” to S1 • Schema and data translation: • given also a database D1 for S1 • find also a database D2 for S2 that “contains the same data” to D1 Varna - October 2, 2007

  35. Schema and data translation, a long standing issue • Translations from a model to another have been studied since the 1970’s • Whenever a new model is defined, techniques and tools to generate translations are studied • However, proposals and solutions are usually model specific Varna - October 2, 2007

  36. Model specific solutions • Given an ER schema, find the suitable relational schema that “implements” it • the original paper (Chen 1976) contains the basics • further discussions by many (e.g. Markowitz and Shoshani 1989) • illustrated in every textbook • Similarly with • any other conceptual model and any other logical one • XML and relational (or object) Varna - October 2, 2007

  37. Another problem in the picture: data exchange • Given a source S1 and a target schema S2 (in different models or even in the same one), find a translation, that is, a function that given a database D1 for S1 produces a database D2 for S2 that “correspond” to D1 • Often emphasized with reference to materialized solutions Varna - October 2, 2007

  38. Integration • Given two or more sources, build an integrated schema or database Varna - October 2, 2007

  39. Schema translation • Given a schema find another one with respect to some specific goal (another model, better quality, …) Varna - October 2, 2007

  40. Data exchange • Given a source and a target schema, find a transformation from the former to the latter Varna - October 2, 2007

  41. Schema translation and data exchange • Can be seen a complementary: • Data translation = schema translation + data exchange • Given a source schema and database • Schema translation produces the target schema • Data exchange generates the target database • In model management terms we could write • Schema translation: • <S2, map12> = ModelGen (S1,mod2) • Data exchange: • i2 = DataGen (S1,i1,S2,map12) Varna - October 2, 2007

  42. Outline • Introduction • Model management • Schema and data translation: the problem • A metamodel based approach Varna - October 2, 2007

  43.  The problem • ModelGen • given two data models M1 and M2, and a schema S1 of M1 (the source schema and model), • generate a schema S2 of M2 (the target schema and model), corresponding to S1 • and, for each database D1 over S1, generate an equivalent database D2 over S2 Varna - October 2, 2007

  44. We have been doing this for a while • Initial work more than ten years ago (Atzeni & Torlone, 1996) • Major novelty recently (Atzeni, Cappellari & Bernstein, 2006) • translation of both schemas and data • data-level translations generated automatically, from schema-level ones Varna - October 2, 2007

  45. System managed ids used as references  A simple example • An object relational database, to be translated in a relational one • Source: the OR-model • Target: the relational model Varna - October 2, 2007

  46.  Example, 2 • Does the OR model allow for keys? • Assume EmpNo and Name are keys Varna - October 2, 2007

  47.  Example, 3 • Does the OR model allow for keys? • Assume no keys are specified Varna - October 2, 2007

  48. Many different models (and variations …) N-ary ER w/ gen OR XSD Binary ERw/ gen N-ary ER w/o gen … … Bin ER w/ gen w/o attr on rel Binary ER w/o gen Bin ER w/o gen w/o attr on rel Bin ER w/ gen w/o M:N rel OO w/ gen Bin ER w/o gen w/o M:N rel … Relational OO w/o gen Varna - October 2, 2007

  49. Heterogeneity • We need to handle artifacts and data in various models • Data are defined wrt to schemas • Schemas are defined wrt to models • How models can be defined? Models Schemas Data Varna - October 2, 2007

  50. A metamodel approach • The constructs in the various models are rather similar: • can be classified into a few categories (Hull & King 1986): • Lexical: set of printable values (domain) • Abstract (entity, class, …) • Aggregation: a construction based on (subsets of) cartesian products (relationship, table) • Function (attribute, property) • Hierarchies • … • We can fix a set of metaconstructs (each with variants): • lexical, abstract, aggregation, function, ... • the set can be extended if needed, but this will not be frequent • A model is defined in terms of the metaconstructs it uses Varna - October 2, 2007

More Related