Chapter 6: General Schema Manipulation Operators

Principles of Data Integration. AnHai Doan, Alon Halevy, Zachary Ives

Outline • Introduction to model management and motivation • The merge operator • The ModelGen operator • The Invert operator

Model Management Operators • We saw operators for creating mappings between pairs of schemas. • But you can imagine other operators on schemas and mappings: • Merge schemas, compose and invert mappings, translate schemas from one data model to another • In fact, imagine an entire algebra of operators that apply to schemas and to mappings: • Many common workflows can be formulated as a sequence of such operators [Bernstein, 2000] • Note: “model” = “schema”. More terminology coming soon.

Example of Model Management (1) • In a data integration scenario, you may proceed as follows, beginning with sources S1 and S2: • Use a match operator to create a mapping between S1 and S2 • Use merge to create a merged (mediated) schema of S1 and S2 with mappings. Merge will create the minimal schema that includes both S1 and S2.

Example of Model Management (2) • Suppose we have another source S3, which is very similar to S1. • We could first use match to create a mapping from S1 to S3. • Then use compose to create a mapping from S3 to the mediated schema G (the merged schema of S1 and S2).
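The match-then-compose workflow above can be sketched in a few lines. This is only an illustration under a strong simplifying assumption: mappings are reduced to one-to-one attribute correspondences stored in dictionaries, which is far weaker than the GLAV mappings used elsewhere in the book. All attribute names and the helper functions `invert_correspondences` and `compose` are hypothetical.

```python
def invert_correspondences(mapping):
    """Flip a one-to-one attribute correspondence (a -> b becomes b -> a)."""
    return {target: source for source, target in mapping.items()}


def compose(m_ab, m_bc):
    """Compose correspondences: a -> b and b -> c yield a -> c.

    Attributes of A whose image in B has no counterpart in C are dropped.
    """
    return {a: m_bc[b] for a, b in m_ab.items() if b in m_bc}


# Hypothetical output of match(S1, S3).
m_s1_s3 = {"S1.name": "S3.title", "S1.year": "S3.yr"}

# Hypothetical mapping from S1 to the mediated schema G produced by merge.
m_s1_g = {"S1.name": "G.title", "S1.year": "G.year", "S1.genre": "G.genre"}

# S3 -> S1 (flip the match result), then S1 -> G.
m_s3_g = compose(invert_correspondences(m_s1_s3), m_s1_g)
```

Note that flipping a dictionary is only a valid "inverse" because the correspondences here are one-to-one; the Invert slides later in the chapter explain why the general case is much harder.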

Operators • Match: see previous chapters • Merge: create a merged schema of S1 and S2 w.r.t. a mapping M12 • ModelGen: create an equivalent model but in a different data model (e.g., relational → XML) • Invert: given M12, create M21 • Diff: find the difference between two models (see bibliography)

Some Terminology • Model: a specific description of a set of data in a given data model. • Meta-model: a data model, such as relational schema, XML DTD, Java class definitions, … • Meta-meta-model: a generic language that is independent of a particular meta-model • Usually a graph-based formalism.

The Merge Operator • Given: • Two models, M1 and M2 • A mapping from M1 to M2 • Create: • A merged model M12 that contains only the information in M1 and M2, but does not repeat information that is in both • Mappings from M1 and M2 to M12 • Challenge to many model management operators: • Can you develop algorithms that are generic, i.e., not specific to particular data models?
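A minimal sketch of the input/output contract of Merge, assuming each model is a flat dictionary from attribute name to type and the input mapping is a one-to-one attribute correspondence. Real models are graphs with nesting and constraints, so this is an illustration of the contract rather than a generic algorithm. The function name `merge` and all attribute names are hypothetical.

```python
def merge(m1, m2, mapping):
    """Merge two flat models.

    m1, m2:  dict of attribute name -> type
    mapping: dict of m1 attribute -> corresponding m2 attribute
    Returns (merged model, mapping m1 -> merged, mapping m2 -> merged).
    """
    merged = dict(m1)                       # every M1 element appears once
    to_merged_1 = {attr: attr for attr in m1}

    matched = {b: a for a, b in mapping.items()}
    to_merged_2 = {}
    for attr, typ in m2.items():
        if attr in matched:
            # Information present in both models is not repeated.
            to_merged_2[attr] = matched[attr]
        else:
            # Unmatched M2 element: add it, renaming on a name clash
            # (a simplification; the renamed name is assumed to be free).
            name = attr if attr not in merged else f"m2_{attr}"
            merged[name] = typ
            to_merged_2[attr] = name
    return merged, to_merged_1, to_merged_2


m1 = {"name": "str", "zip": "int"}
m2 = {"fullname": "str", "phone": "str"}
m12, map1, map2 = merge(m1, m2, {"name": "fullname"})
```

Here `m12` keeps `name` once, because the mapping says `fullname` carries the same information, and `map2` records where each M2 attribute landed in the merged model.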

Merge Challenges: Example • Challenge 1: different attribute representations. Resolution should be part of the input mappings.

Merge Challenges: Example • Challenge 2: merging models of different data models. (What if one data model supports sub-attributes and another doesn’t?) • See ModelGen.

Merge Challenges: Example • Challenge 3: “fundamental conflicts”. Zipcode is an integer in one model and a string in another. The merged model cannot have both. • Solutions depend on the particular conflict and the data models involved.
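Detecting a fundamental conflict such as the Zipcode case is straightforward even though resolving it is not. The sketch below assumes the same flat attribute-to-type representation and one-to-one correspondence as a simplification; `find_type_conflicts` is a hypothetical helper, and it only reports conflicts without choosing a resolution.

```python
def find_type_conflicts(m1, m2, mapping):
    """Return (m1_attr, m1_type, m2_attr, m2_type) for every pair of
    corresponding attributes whose declared types disagree."""
    return [
        (a, m1[a], b, m2[b])
        for a, b in mapping.items()
        if m1[a] != m2[b]
    ]


m1 = {"zipcode": "int", "city": "str"}
m2 = {"zip": "str", "town": "str"}
conflicts = find_type_conflicts(m1, m2, {"zipcode": "zip", "city": "town"})
```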

The ModelGen Operator • Transform a schema from one meta-model (e.g., Java object model, relational, XML) to another meta-model. • Main challenge: features that exist in the source meta-model may not exist in the target (e.g., sub-classes and inheritance). • ModelGen is needed very often in practice and is used by several of the other operators.

ModelGen Example • Java classes → relational tables • No classes or inheritance in the relational model

ModelGen Strategy • Possible to design specific transformations from one meta-model to another, but we want a generic approach. • Design a super meta-model that has (almost) all features that exist in the meta-models. • The super meta-model knows which features are present in each meta-model. • The algorithm will translate a given model into the super meta-model and from there to the target meta-model.

ModelGen Algorithm • Input: model M1 in meta-model MM1 • Output: a model M2 in meta-model MM2 that is equivalent to M1. • Transform M1 to the super-model, yielding M’. • While M’ includes features that are not present in MM2, apply transformations to remove these features (e.g., remove class hierarchy by translating it to multiple vertically partitioned tables) • Transform M’ into M2
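One of the feature-removing transformations mentioned above, translating a class hierarchy into vertically partitioned tables, can be sketched as follows. The sketch assumes a deliberately tiny stand-in for the super meta-model (a list of class definitions with single inheritance) and a shared surrogate key column named `id`; `ClassDef` and `remove_inheritance` are hypothetical names, and the translation into and out of the super meta-model is not shown.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ClassDef:
    name: str
    attrs: Dict[str, str]          # attributes declared on this class only
    parent: Optional[str] = None   # single inheritance, for simplicity


def remove_inheritance(classes, key="id"):
    """Vertical partitioning: one table per class.

    Each table holds the shared key plus the attributes declared on that
    class; a subclass table's key references its parent's table, so an
    object is reassembled by joining along the hierarchy.
    """
    tables = {}
    for cls in classes:
        columns = {key: "int"}
        columns.update(cls.attrs)
        tables[cls.name] = {"columns": columns, "references": cls.parent}
    return tables


hierarchy = [
    ClassDef("Person", {"name": "str"}),
    ClassDef("Student", {"school": "str"}, parent="Person"),
]
tables = remove_inheritance(hierarchy)
```

After this step the model contains only tables, keys, and references, all of which exist in the relational meta-model, so the loop in the algorithm would terminate for this input.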

The Invert Operator • Schema mappings are often directional: • They map data in a source schema into a target schema. • Natural question: • Can we find an inverse mapping? • But what is the right definition of inverse? • We’ll see a couple of failed attempts before we see a good one. • Note: algorithms here are not generic. They are highly dependent on the meta-model.

Invert Definition: Attempt 1 • Given a mapping M between a source S and target T. • M defines a relation between pairs of instances (I,J) that are consistent with each other: • I is an instance of S, J is an instance of T. • Hence, a natural definition is: M⁻¹ should define the relation (J,I), where (I,J) in M. • However, inverses defined this way will not be expressible with tuple-generating dependencies/GLAV mappings. • Why? See next slide.

Attempt #1 Problem Explained • Any relation defined by TGDs is closed up on the right and closed down on the left. • Formally, assume • (I,J) is in M • I’ is a subset of I, J is a subset of J’, then • (I’, J’) is also in M. • However, by definition, the inverse mapping M’ would have to be closed up on the left and closed down on the right. • Hence, it cannot be defined with TGDs or GLAV.
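The closure argument can be written out symbolically. The block below only restates the slide's claim, using $M^{-1}$ for the inverse relation from Attempt 1.

```latex
% Closure of a TGD-defined mapping M: down on the left, up on the right.
\[
(I, J) \in M,\quad I' \subseteq I,\quad J \subseteq J'
\;\Longrightarrow\; (I', J') \in M .
\]

% Attempt 1 defines the inverse by swapping the components:
\[
M^{-1} = \{\, (J, I) \mid (I, J) \in M \,\} .
\]

% Swapping the components swaps the closure directions:
\[
(J, I) \in M^{-1},\quad J \subseteq J',\quad I' \subseteq I
\;\Longrightarrow\; (J', I') \in M^{-1} .
\]
```

In $M^{-1}$ the left component may grow and the right component may shrink, which is the opposite of what every TGD-defined relation allows.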

Invert Definition: Attempt 2 • Definition by composition: • M composed with M’ should be the identity mapping! • However, it can be shown that under that condition, a mapping has an inverse only if the following holds: • If I1 and I2 are two distinct instances of S, then their targets under M should be distinct instances of T. • The above result considerably limits the mappings that have inverses. m1 and m2 won’t have inverses:
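To see how restrictive the distinctness condition is, consider a hypothetical projection mapping P(x, y) → Q(x), which is not one of the slide's m1 or m2. The sketch below assumes instances are sets of tuples and represents the targets of an instance by its minimal target; for this mapping every target of I is a superset of that minimal one, so equal minimal targets mean equal sets of targets.

```python
def minimal_target(source_instance):
    """Hypothetical mapping M: P(x, y) -> Q(x).

    Returns the smallest target instance consistent with the source.
    """
    return frozenset(x for (x, _y) in source_instance)


# Two distinct source instances that differ only in the projected-away column.
i1 = {("a", 1)}
i2 = {("a", 2)}

sources_differ = i1 != i2
targets_coincide = minimal_target(i1) == minimal_target(i2)
```

Because `i1` and `i2` are distinct but indistinguishable after applying M, no mapping M’ can recover the original instance, so M has no inverse under Attempt 2.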

Third Time’s a Charm: Quasi-inverses • Define equivalence between two instances w.r.t. M as: I1 ~M I2 if I1 and I2 have the same set of target instances under M. • Define M’ to be the quasi-inverse of M if the composition of M and M’ always maps I to an instance I’ such that I’ ~M I. • Example (mappings m and m’ not shown): So m is a quasi-inverse of m’
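A small round-trip check illustrates the idea. It assumes a hypothetical mapping M: P(x, y) → Q(x) with candidate quasi-inverse M’: Q(x) → ∃y P(x, y), and it treats two source instances as equivalent w.r.t. M when M sends them to the same targets (for this mapping, when their projections on x coincide). The function names and the `N0`, `N1`, … placeholders for existential values are illustrative choices, not notation from the chapter.

```python
import itertools


def forward(source_instance):
    """M: P(x, y) -> Q(x). Returns the minimal target instance."""
    return {x for (x, _y) in source_instance}


def backward(target_instance):
    """M': Q(x) -> exists y. P(x, y), with a fresh placeholder for each y."""
    fresh = itertools.count()
    return {(x, f"N{next(fresh)}") for x in sorted(target_instance)}


def equivalent(i1, i2):
    """Equivalence w.r.t. M for this particular mapping: same projection."""
    return forward(i1) == forward(i2)


i = {("a", 1), ("b", 2), ("a", 3)}
i_prime = backward(forward(i))
```

The round trip does not return `i` itself (the y values are lost), but `i_prime` is equivalent to `i` w.r.t. M, which is all the quasi-inverse definition asks for.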

Summary of Chapter 6 • Generic model management operators save a lot of repetitive code and can result in several forms of efficiency gains • Employing such operators also ensures that applications think carefully about the meaning of what they are doing. • Two main open challenges: • Can the implementation of these operators be described in a meta-model independent fashion? • Is model management a system in itself that should be built or should operator implementations be individual services?
