Towards the Preservation of Keys in XML Data Transformation for Integration

Md. Sumon Shahriar and Jixue Liu, Data and Web Engineering Lab, Computer and Information Science, University of South Australia

Outline of the Presentation • Motivation for XML Data Transformation with XML keys • How to define XML keys • How to transform XML keys • Whether transformed XML keys are valid and preserved [Key Preservation] • If XML key is not preserved, how to capture XML key as XML functional dependency (XFD) [Key Transition]

Data Transformations for Integration • Relational → Relational • Relational → XML • XML → Relational • XML → XML

Data Transformations for Integration with Constraints • Constraint (keys, functional dependencies, etc.) preservation (a.k.a. propagation) is well studied for: • Relational → Relational • Relational → XML • XML → Relational • XML → XML: little investigated! • Mostly structural transformations of schema and data, ignoring constraints • Reason: the document-centric rather than data-centric view of XML

Motivating Example 1 • Nested source DTD Da: <!ELEMENT enroll (dept+)> <!ELEMENT dept (dname, (cid,sid+)+)> • Operation: Unnest(sid) • Flat-like target DTD Db: <!ELEMENT enroll (dept+)> <!ELEMENT dept (dname, (cid,sid)+)>

[Figure: XML tree Ta, rooted at enroll (vr) with dept nodes v1 and v2. One dept has dname Physics with cid Phys01 (sid 001, 002) and cid Phys02 (sid 003); the other has dname Chemistry with cid Chem02 (sid 004, 002). Unnest(sid) yields XML tree Tb, in which every sid is paired with its own copy of cid: (Phys01, 001), (Phys01, 002), (Phys02, 003), (Chem02, 004), (Chem02, 002).]

XML key consideration • Key: K(enroll/dept,{cid}) • Source Da: <!ELEMENT enroll (dept+)> <!ELEMENT dept (dname, (cid,sid+)+)> • K is valid on Da • K is satisfied by Ta • After Unnest(sid), target Db: <!ELEMENT enroll (dept+)> <!ELEMENT dept (dname, (cid,sid)+)> • Is K transformed? NO • Is K valid on Db? YES • Is K satisfied by Tb? NO

[Figure: XML trees Ta and Tb side by side. In Ta the cid values (Phys01, Phys02, Chem02) are distinct; in Tb the cid values Phys01 and Chem02 appear as duplicates.]

Observation • Observation 1: An XML key may not be preserved after transformation.
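Observation 1 can be reproduced with a minimal sketch. The dictionary encoding of Ta, the assignment of courses to departments, and the helper names `unnest_sid` and `duplicate_cids` are assumptions made for illustration; they are not part of the presentation.

```python
from collections import Counter

# Assumed encoding of Ta: one entry per dept, each a list of (cid, [sid, ...]) groups.
TA = {
    "Physics": [("Phys01", ["001", "002"]), ("Phys02", ["003"])],
    "Chemistry": [("Chem02", ["004", "002"])],
}


def unnest_sid(tree):
    """Apply Unnest(sid): every (cid, sid+) group becomes a list of (cid, sid) pairs."""
    return {
        dname: [(cid, sid) for cid, sids in groups for sid in sids]
        for dname, groups in tree.items()
    }


def duplicate_cids(tree):
    """Return the cid values that occur more than once in the whole document."""
    counts = Counter(entry[0] for groups in tree.values() for entry in groups)
    return sorted(cid for cid, n in counts.items() if n > 1)


TB = unnest_sid(TA)
print(duplicate_cids(TA))  # no duplicates: K(enroll/dept,{cid}) can hold on Ta
print(duplicate_cids(TB))  # duplicates appear: the same key fails on Tb
```

The check only looks at cid values, which is enough for K(enroll/dept,{cid}); a key with more fields would need whole P-tuples to be compared.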

Motivating Example 2 • Source DTD Da: <!ELEMENT enroll (dept+)> <!ELEMENT dept (dname, (cid,sid+)+)> • K(enroll/dept,{cid}) is valid and satisfied • Expand operation: replace (cid,sid+) with the new element course • Target DTD Db: <!ELEMENT enroll (dept+)> <!ELEMENT dept (dname, course+)> <!ELEMENT course (cid,sid+)> • Is K valid on Db? NO • Reason: the path is transformed, so the key becomes K(enroll/dept/course,{cid}) • Suggestion: the key needs to be transformed • Satisfaction? It may or may not hold and needs to be checked

Expanding (cid,sid+) with new element course

Observation • Observation 2: How XML keys should be transformed needs to be defined when the DTD is transformed.

Contributing on • Defining XML keys on DTD and their satisfactions • Rules for transforming XML keys using important operations • Key preservation [key to key] • Defining XML functional dependencies (XFDs) and their satisfactions • Key transition [key to XFD]

Contributing on • Defining XML keys on DTD and their satisfactions • Defined on the schema definition (DTD) • Uses a novel technique to produce semantically correct values for key satisfaction • Captures some properties of the relational key, in the sense of value completeness and disallowing redundant values • Captures the ID properties of a DTD definition • Improves on the key notion in XML Schema

XML Key • Given a DTD D = (EN, β, ρ), an XML key on D is defined as K(Q,{P1,…,Pl}), where l ≥ 0, Q is a complete path on D called the selector, and {P1,…,Pi,…,Pl} (often denoted by P) is a set of fields, where each Pi is defined as: • Pi = pi1 ∪ … ∪ pini, where "∪" means disjunction and pij (j ∈ [1,…,ni]) is a simple path on D, β(last(pij)) = Str, and pij has the following syntax: • pij = seq • seq = e | e/seq, where e ∈ EN; • Q/pij is a complete path.

Example of XML keys • Source DTD Da: <!ELEMENT enroll (dept+)> <!ELEMENT dept (dname, (cid,sid+)+)> <!ELEMENT dname (#PCDATA)> <!ELEMENT cid (#PCDATA)> <!ELEMENT sid (#PCDATA)> • K(enroll/dept,{cid}) • selector = enroll/dept • fields = {cid} • β(cid) = #PCDATA, which means Str • β(last(cid)) = Str • K(enroll/dept,{cid,sid}) • selector = enroll/dept • fields = {cid,sid} • β(last(cid)) = β(last(sid)) = Str
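The validity conditions on a key (Q is a complete path, Q/pij is a complete path, and the last element of every field is of type Str) can be sketched as a small check. The dictionary encoding of Da and the function names are hypothetical, and the sketch assumes that each field is a single simple path rather than a disjunction of paths.

```python
# Assumed encoding of Da: element name -> list of child element names, or "Str" for #PCDATA.
DA = {
    "enroll": ["dept"],
    "dept": ["dname", "cid", "sid"],
    "dname": "Str",
    "cid": "Str",
    "sid": "Str",
}
ROOT = "enroll"


def is_complete_path(dtd, root, path):
    """A complete path starts at the root and follows parent/child declarations."""
    steps = path.split("/")
    if steps[0] != root or root not in dtd:
        return False
    for parent, child in zip(steps, steps[1:]):
        children = dtd.get(parent)
        if not isinstance(children, list) or child not in children:
            return False
    return True


def is_valid_key(dtd, root, selector, fields):
    """Check K(selector, fields): every selector/field path is complete and ends in Str."""
    if not is_complete_path(dtd, root, selector):
        return False
    for field in fields:
        full_path = selector + "/" + field
        if not is_complete_path(dtd, root, full_path):
            return False
        if dtd[full_path.split("/")[-1]] != "Str":
            return False
    return True


print(is_valid_key(DA, ROOT, "enroll/dept", ["cid"]))
print(is_valid_key(DA, ROOT, "enroll/dept", ["cid", "sid"]))
print(is_valid_key(DA, ROOT, "enroll", ["cid"]))
```

The last call fails because enroll/cid is not a path in Da: cid is declared under dept, not directly under enroll.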

Some definitions for XML key satisfaction • [P-tuple] Given a key K(Q,{P1,...,Pl}) and a tree T, let TQ be a tree in T. A P-tuple in TQ is a tuple of pair-wise close sub-trees (TP1,…,TPl). • By pair-wise close, we mean sub-trees in the same minimal hedge. • A P-tuple is complete if […] • We call TP = Tlast(P) the prefixed format tree. For example, if P=enroll/dname, then TP = Tdname.

Proposed techniques • [Hedge] A hedge is a consecutive sequence of primary sub-trees of the same node. • [Minimal structure] Given a DTD definition β(e) and two elements e1 and e2 in β(e), the minimal structure g of e1 and e2 in β(e) is the pair of brackets that encloses e1 and e2 such that no other structure inside g encloses both. • [Minimal hedge] Given a hedge H of β(e), a minimal hedge of e1 and e2 is one of the hedges Hg in H.

Example of minimal structure, minimal hedge and P-tuple • Key: K(enroll/dept,{cid,sid}) • Da: <!ELEMENT enroll (dept+)> <!ELEMENT dept (dname, (cid,sid+)+)> • [Figure: XML tree Ta with minimal hedges H1g and H2g marked under dept node v1 and H3g under dept node v2] • P1=cid, P2=sid • Minimal structure is g=(cid,sid+) • Minimal hedges are: H1=v4v5v6 and H2=v7v8 under node v1, and H3=v10v11v12 under node v2 • P-tuples are: F1=v4v5 and F2=v4v6 for hedge H1 and F3=v7v8 for hedge H2 under node v1; F4=v10v11 and F5=v10v12 for hedge H3 under node v2

Produced P-tuples
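The P-tuples listed in the example can be derived mechanically. The sketch below assumes the minimal structure g=(cid,sid+), so a new minimal hedge starts at every cid node; the list encoding of the dept children and the function names are hypothetical, and the dname children are left out because they are not part of g.

```python
from itertools import product

# Children of the two dept nodes of Ta that belong to g=(cid,sid+), in document order.
V1_CHILDREN = [("v4", "cid"), ("v5", "sid"), ("v6", "sid"), ("v7", "cid"), ("v8", "sid")]
V2_CHILDREN = [("v10", "cid"), ("v11", "sid"), ("v12", "sid")]


def minimal_hedges(children, first="cid"):
    """Split the children into hedges for g=(cid,sid+): each cid opens a new hedge."""
    hedges = []
    for node, label in children:
        if label == first:
            hedges.append([])
        if hedges:
            hedges[-1].append((node, label))
    return hedges


def p_tuples(children, fields=("cid", "sid")):
    """Combine, inside each minimal hedge, one node per field (pair-wise close nodes)."""
    tuples = []
    for hedge in minimal_hedges(children):
        columns = [[node for node, label in hedge if label == f] for f in fields]
        tuples.extend(product(*columns))
    return tuples


print(p_tuples(V1_CHILDREN))  # F1, F2 from hedge H1 and F3 from hedge H2
print(p_tuples(V2_CHILDREN))  # F4, F5 from hedge H3
```

A hedge that lacks a node for some field yields no tuple here, so incomplete P-tuples are silently dropped; a fuller implementation would report them instead.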

XML Key Satisfaction • An XML tree T satisfies a key K(Q,{P1,…,Pl}) if the following hold: • If {P1,…,Pl} = ∅, then T satisfies K iff there exists one and only one TQ in T; • Else, for every TQ in T: • there exists at least one P-tuple in TQ; • every P-tuple in TQ is complete; • every P-tuple in TQ is value distinct; • no two P-tuples taken from different TQ are value equal, i.e. P-tuples in different TQ must be value distinct.

Checking satisfaction of key TQ=Tv1 TQ=Tv2
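A minimal sketch of the satisfaction check for a key with a non-empty field set follows. It assumes the P-tuples have already been produced and reduced to their values, with `None` standing for a missing value; the input layout and the name `satisfies_key` are hypothetical.

```python
def satisfies_key(p_tuples_by_tq):
    """Check key satisfaction given {selector node: [P-tuple of values, ...]}.

    Covers only keys with at least one field. None marks a missing value.
    """
    seen = set()  # value tuples met so far, across all selector trees
    for tuples in p_tuples_by_tq.values():
        if not tuples:  # at least one P-tuple must exist in every TQ
            return False
        for values in tuples:
            if any(v is None for v in values):  # every P-tuple must be complete
                return False
            if values in seen:  # value distinct within and across selector trees
                return False
            seen.add(values)
    return True


# K(enroll/dept,{cid}) on Ta, then on Tb, then K(enroll/dept,{cid,sid}) on Tb.
TA_CID = {"v1": [("Phys01",), ("Phys02",)], "v2": [("Chem02",)]}
TB_CID = {"v1": [("Phys01",), ("Phys01",), ("Phys02",)], "v2": [("Chem02",), ("Chem02",)]}
TB_CID_SID = {
    "v1": [("Phys01", "001"), ("Phys01", "002"), ("Phys02", "003")],
    "v2": [("Chem02", "004"), ("Chem02", "002")],
}

print(satisfies_key(TA_CID), satisfies_key(TB_CID), satisfies_key(TB_CID_SID))
```

Keeping a single `seen` set for the whole document enforces both conditions on value distinctness at once: inside one TQ and across different TQ.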

Contributing on • Rules for transformation on key definition • A key is transformed if any path in the key is transformed. • After the transformation, key needs to be checked whether it is valid on target schema. • If a key is not transformed, it is valid on target DTD

Transformation on key • Unnest operation: • g=(g1xg2+)+ → g=(g1xg2)+ • Example: (cid,sid+)+ → (cid,sid)+ • It turns the nested structure into a flat-like structure • No path transformation • No change in the key definition

Transformation on key • Nest operation: • g=(g1xg2)+ → g=(g1xg2+)+ • Example: (cid,sid)+ → (cid,sid+)+ • It turns the flat-like structure into a nested structure • No path transformation • No change in the key definition

Transformation on key • Expand operation: • g=(g1xg2+)+ → g=(gnew)+, gnew=g1xg2+ • Example: g=(cid,sid+)+ → g=(course+), gnew=course=(cid,sid+) • It pushes the structure one level down • The path is transformed in the DTD and therefore in the key • Rules are needed to transform the key correctly

Transformation on key • Transformation rules on key using expand: • The result depends on where the new element is added in the key paths (either selector or field) • Da: <!ELEMENT enroll (dept+)> <!ELEMENT dept (dname, (cid,sid+)+)> • expand((cid,sid+), course): K(enroll/dept,{cid,sid}) → K(enroll/dept/course,{cid,sid}) or K(enroll/dept,{course/cid,course/sid}) • expand(sid+, stIDs): K(enroll/dept,{cid,sid}) → K(enroll/dept,{cid,stIDs/sid})
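The expand rules can be sketched as a path rewrite. The function `expand_key` and its `push_to` parameter are hypothetical names, and the sketch assumes that the expanded elements appear as the first step of a field path.

```python
def expand_key(selector, fields, expanded, new_element, push_to="selector"):
    """Rewrite K(selector, fields) after expand(expanded, new_element).

    expanded: names of the elements wrapped by new_element.
    push_to:  "selector" appends new_element to the selector when every field
              starts inside the expanded structure; otherwise the affected
              fields are prefixed with new_element.
    """
    inside = [f.split("/")[0] in expanded for f in fields]
    if push_to == "selector" and fields and all(inside):
        return selector + "/" + new_element, list(fields)
    new_fields = [
        new_element + "/" + f if hit else f for f, hit in zip(fields, inside)
    ]
    return selector, new_fields


# expand((cid,sid+), course): both placements of the new element.
print(expand_key("enroll/dept", ["cid", "sid"], {"cid", "sid"}, "course"))
print(expand_key("enroll/dept", ["cid", "sid"], {"cid", "sid"}, "course", push_to="fields"))
# expand(sid+, stIDs): only the sid field is affected.
print(expand_key("enroll/dept", ["cid", "sid"], {"sid"}, "stIDs"))
```

When only some fields lie inside the expanded structure, as with expand(sid+, stIDs), the selector cannot absorb the new element, so the sketch falls back to prefixing the affected fields.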

Transformation on key • Collapse operation: • g=(gcoll)+, gcoll=g1xg2+ → g=(g1xg2+)+ • Example: g=(dept+), gdept=(cid,sid+) → g=(cid,sid+)+ • It moves the structure one level up • The path is transformed in the DTD and therefore in the key • Rules are needed to transform the key correctly

Transformation on key • Transformation rules on key using collapse: • The result depends on which element is deleted in the key paths (either selector or field) • Da: <!ELEMENT enroll (dept+)> <!ELEMENT dept (dname, (cid,sid+)+)> • collapse(dept): K(enroll/dept,{cid,sid}) → K(enroll,{cid,sid}) • collapse(dept): K(enroll,{dept/cid,dept/sid}) → K(enroll,{cid,sid})
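The collapse rules amount to deleting the collapsed element from every key path. A minimal sketch, with `collapse_key` as a hypothetical helper name:

```python
def collapse_key(selector, fields, removed):
    """Rewrite K(selector, fields) after collapse(removed) by dropping that step."""

    def drop(path):
        return "/".join(step for step in path.split("/") if step != removed)

    return drop(selector), [drop(f) for f in fields]


# The collapsed element sits in the selector, then in the fields.
print(collapse_key("enroll/dept", ["cid", "sid"], "dept"))
print(collapse_key("enroll", ["dept/cid", "dept/sid"], "dept"))
```

Both source keys map to K(enroll,{cid,sid}), matching the two cases on the slide.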

Contributing on [Key preservation] • Given a source DTD, a conforming document, and a valid key that is satisfied by the document: if the transformed key is valid on the target DTD and is satisfied by the target document, then the key is said to be preserved by the transformation.

Key preserving properties of operations • Preserving: • Nest and collapse • Preserving with necessary and sufficient conditions: • Unnest and Expand

Theorem: Unnest operator is key preserving if some key fields don’t cross g1.

Example to explain • Structure: (cid,sid+)+ with g1=cid and g2=sid • Under Unnest(sid), the key K(enroll/dept,{cid}) is not preserved • However, if the key is K(enroll/dept,{cid,sid}), then the key is preserved

Theorem: The expand operator is key preserving if, whenever the selector is transformed, every tree for the selector has a P-tuple.

Example to explain • K(enroll/dept,{cid}) is transformed to K(enroll/dept/course,{cid}) or to K(enroll/dept,{course/cid}) • The cid values stay distinct: no duplicate cid values are produced

Contributing on [Key transition] • Given a source DTD, a conforming document, and a valid key that is satisfied by the document: if the transformed key is valid on the target DTD but is not satisfied by the target document, and the key transformed to an XFD is satisfied by the target document, then we say the XML key is transited to an XFD.

XML functional dependency (XFD) • Given a DTD D = (EN, β, ρ), an XFD on D is defined as Φ(S, P → Q), where S is a complete path on D called the scope, P is a set of simple paths P={p1,…,pi,…,pl} called the determinant or LHS, Q is a simple path or the empty path called the dependent or RHS, and S/P and S/Q are complete paths. • If Q is the empty path, then the XFD Φ(S, P → ) implies P → last(S), meaning that P determines S.

Tuple for XFD • [Tuple] Given an XFD Φ(S, P → Q) and a tree T, let TS be a tree in T. A tuple in TS is a tuple of pair-wise close sub-trees. • By pair-wise close, we mean sub-trees in the same minimal hedge • By P-tuple, we mean the tuple for paths P • By Q-tuple, we mean the tuple for path Q • A P-tuple is complete if […] • A Q-tuple is complete if […]

XFD satisfaction • An XML tree T satisfies an XFD Φ(S, P → Q) if the following hold: • If Q is empty, then the P-tuple F[P] is complete; • Else • F[P] and F[Q] are complete. • For every pair of tuples F1 and F2 in TS, if F1[P] =v F2[P], then F1[Q] =v F2[Q].
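The XFD check can be sketched in the same style as the key check. The input layout and the name `satisfies_xfd` are hypothetical. For an empty Q, the sketch adopts one possible reading of "P determines S", namely that equal P-values never occur under two different scope nodes; that reading is an assumption, not something the slides spell out.

```python
def satisfies_xfd(tuples_by_scope, q_is_empty=False):
    """Check an XFD given {scope node: [(P-values, Q-value), ...]}.

    None marks a missing value. With q_is_empty=True the Q-value is ignored and
    equal P-values must stay inside one scope node (assumed reading).
    """
    owner = {}  # only used when Q is empty: P-values -> scope node
    for scope, rows in tuples_by_scope.items():
        determined = {}  # P-values -> Q-value seen inside this scope tree
        for p_values, q_value in rows:
            if any(v is None for v in p_values):  # F[P] must be complete
                return False
            if q_is_empty:
                if owner.setdefault(p_values, scope) != scope:
                    return False
                continue
            if q_value is None:  # F[Q] must be complete
                return False
            if determined.setdefault(p_values, q_value) != q_value:
                return False
    return True


# Tb with an XFD whose scope is enroll/dept, LHS {cid} and empty RHS.
TB_CID = {
    "v1": [(("Phys01",), None), (("Phys01",), None), (("Phys02",), None)],
    "v2": [(("Chem02",), None), (("Chem02",), None)],
}
print(satisfies_xfd(TB_CID, q_is_empty=True))
```

Unlike a key, the XFD tolerates the repeated cid values that Unnest(sid) introduces, because the repeats stay inside the same dept.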

Key transition algorithm
check := CheckKeyTransformation(k, UnNest);
if check = TRUE then
  Φ := TransformKeyToXFD(k);
end if
if target T satisfies the XFD Φ then
  return Φ and "KeyTransited";
end if

Function CheckKeyTransformation(k, UnNest)
if g1 crosses any Pi in [P1, …, Pn] at an element e, where e is in g1 and e is in Pi, then
  return TRUE;
else
  return FALSE;
end if

Function TransformKeyToXFD(k)
Φ[S] := k[Q];
for all i such that 1 ≤ i ≤ n do
  Φ[Pi] := k[Pi];
end for
Φ[Q] := empty path;
return Φ(S, {P} → Q);
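The three routines above can be put together in a short runnable sketch. The `Key` and `XFD` classes, the `g1_elements` set and the `target_satisfies` callback are hypothetical stand-ins; the satisfaction test on the target document is passed in rather than implemented, and the sketch only tests satisfaction when an XFD was actually produced.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Key:
    selector: str            # Q
    fields: Tuple[str, ...]  # P1..Pn


@dataclass(frozen=True)
class XFD:
    scope: str               # S
    lhs: Tuple[str, ...]     # P
    rhs: Optional[str]       # Q; None stands for the empty path


def check_key_transformation(key, g1_elements):
    """TRUE if g1 crosses a key field, i.e. they share an element."""
    return any(step in g1_elements for f in key.fields for step in f.split("/"))


def transform_key_to_xfd(key):
    """Selector becomes the scope, fields become the LHS, the RHS is empty."""
    return XFD(scope=key.selector, lhs=key.fields, rhs=None)


def key_transition(key, g1_elements, target_satisfies):
    """Return (XFD, "KeyTransited") if the key transits to an XFD, else None."""
    if not check_key_transformation(key, g1_elements):
        return None
    xfd = transform_key_to_xfd(key)
    if target_satisfies(xfd):
        return xfd, "KeyTransited"
    return None


# Unnest(sid) on (cid,sid+)+ has g1 = cid; the callback simply assumes Tb satisfies the XFD.
result = key_transition(Key("enroll/dept", ("cid",)), {"cid"}, lambda xfd: True)
print(result)
```

Passing the satisfaction test as a callback keeps the sketch independent of any particular tree representation.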

[Figure: XML tree Ta with key K(enroll/dept,{cid}), whose cid values are distinct, and XML tree Tb with XFD Φ(enroll/dept,{cid} → ), in which the cid values Phys01 and Chem02 appear as duplicates.]

Theorem: An XML key on source DTD can only be transited to an XFD on the target DTD if the key is satisfied by the conforming source document.

Talked on • XML data transformation with keys • A new definition for XML keys • Transformation rules for keys • Key preservations • Key transition • Also a new definition for XML functional dependency (XFD)

Our papers • “On Defining Keys for XML”, IEEE CIT 2008, Database and Data Mining Workshop, Sydney • “Key Preserving P2P Data Transformation for XML”, LNCS, DBISP2P 2008 (VLDB Workshop), Auckland, New Zealand • “Transition of Keys in XML Data Transformation”, IEEE CSA 2008, Hobart • “On Defining Functional Dependency for XML”, IEEE IWSCA 2008, Korea

Other research issues • Already done • “Preserving functional dependency in XML data transformation”, LNCS, ADBIS 2008, Finland. • Preserving Inclusion dependency in XML data transformation • Future work • Adaptation of constraints in XML data integration • Detecting conflicts between source constraints and target constraints in XML settings • Checking Validations and satisfactions of the constraints • XML keys, XFDs and XML inclusion dependencies (XID) • Performances in XML data transformation and Integrations with constraints

Thank You • Questions?
