Framework for Versioned Data Access Methods

A Framework for Access Methods for Versioned Data B. Salzberg, L. Jiang, D. Lomet, M. Barrena, J. Shan & E. Kanoulas

Outline • Motivation • Introducing versions through the examples • Versions and version ranges • Data pages • Page splitting and consolidation • Efficiency guarantee • Index pages • Conclusions

Motivation • Historical archives need to be retained • Medical, banking, … • Different historical versions created along different branches must be reconstructed • Software libraries, design, … • Access methods for versioned data have been proposed

Motivation • We present a framework for constructing and understanding versioned access methods • Central point: the study of version splitting of units of data storage (disk pages) • Main goal: to make the stabbing query efficient: “Find all data alive at this version”

The two-version example • Record format: <vi,kj,dl> • version v1: <v1,k1,d1> <v1,k2,d2> <v1,k3,d3> • version v2: <v2,k1,d’1> <v2,k2,d2> <v2,k3,d3> • Redundancy: k2 and k3 are stored twice although they do not change • Set of versions for which record k2 does not change: <v1,k1,d1> <v2,k1,d’1> <{v1,v2},k2,d2> <{v1,v2},k3,d3> • What if k2 does not change for a large number of versions? • Could we use start and end version labels?

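The three-version example • Version tree: v1 is the root, v2 follows v1 on branch b1, v3 follows v1 on branch b2 • version v1: <v1,k1,d1> <v1,k2,d2> <v1,k3,d3> • version v2: <v2,k1,d’1> <v2,k2,d2> <v2,k3,d3> • version v3: <v3,k1,d1> <v3,k2,d’2> <v3,k3,d3> • [Figure: key space (k1, k2, k3) against time for each branch. Branch b1 (v1, v2, now): k1 changes from d1 to d’1 at v2, k2 stays d2, k3 stays d3. Branch b2 (v1, v3, now): k1 stays d1, k2 changes from d2 to d’2 at v3, k3 stays d3]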

The three-version example • [Figure: key space (k1, k2, k3) against time for branch b1 (v1, v2, now) and branch b2 (v1, v3, now)] • <{v1, v3},k1,d1> <v2,k1,d’1> <{v1, v2},k2,d2> <v3,k2,d’2> <{v1, v2, v3},k3,d3> • When there is branching, we cannot express a unique end version for a set of versions • We might keep the start version and the end version on each branch • What if k3 is never updated in branch b1? Should we keep updating the end version for k3 as new versions appear on b1?

Versions • The initial version set: V = {v1} • New versions are obtained by updating, inserting or deleting records from old versions of V • V can be represented by a tree: the version tree • There is a partial order on the nodes of the version tree • Ancestors: anc(v) = {a ∈ V | a < v} • Descendants: desc(v) = {d ∈ V | d > v}
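The ancestor and descendant sets can be sketched in a few lines of Python. This is only an illustration: the `PARENT` map and the helper names `anc`, `desc` and `leq` are assumptions made here, and the tree shape is the one from the three-version example (v2 and v3 both derived from v1).

```python
# Hypothetical parent map for the three-version example:
# v1 is the root, v2 and v3 are both derived from v1.
PARENT = {"v1": None, "v2": "v1", "v3": "v1"}

def anc(v, parent=PARENT):
    """Strict ancestors of v: every a in V with a < v."""
    result = set()
    p = parent[v]
    while p is not None:
        result.add(p)
        p = parent[p]
    return result

def desc(v, parent=PARENT):
    """Strict descendants of v: every d in V with d > v."""
    return {d for d in parent if v in anc(d, parent)}

def leq(a, b, parent=PARENT):
    """Partial order a <= b: a equals b or a is an ancestor of b."""
    return a == b or a in anc(b, parent)

print(anc("v3"))         # ancestors of v3
print(sorted(desc("v1")))  # descendants of v1
print(leq("v2", "v3"))   # v2 and v3 sit on different branches
```

Versions on different branches are incomparable, which is why the order is only partial.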

Version Ranges • Records correspond to sets of versions over which they do not change • Such a set forms a subtree called the version range. We have: • One start version: the root of the subtree • A set of end versions: the leaves of the subtree (one on each branch) • The main objection: • End versions would have to be updated for every new version in which the record does not change • The solution: • Keep the end versions out of the version range

Version Ranges • Consider a record R inserted at v1. Suppose that R remains unchanged at v2 and v3. • Assume now that R is updated at version v4. We could say v3 is an end version for R • Assume that a new version v5 appears but R is not touched by v5 • [Figure: version tree with nodes v1, v2, v3, v4, v5]

Version Ranges • Now version v3 is no longer an end version for R. R remains unchanged at {v1, v2, v3, v5} • We choose the end version for R to be a “stop sign” along a branch. The end version, v4, does not belong to the version range for R • Later, any number of descendants in VR could be created. If these versions do not change R, the VR expands automatically • [Figure: version tree with nodes v1, v2, v3, v4, v5, v6]

Version Ranges • Later, any number of descendants in VR could be created. If these versions do not change R, the VR expands automatically • [Figure: version tree with nodes v1, v2, v3, v4, v5, v6]

Version Ranges • The version range vr = (start(vr), end(vr)), where: • start(vr) is an individual version • end(vr) is the minimal set of versions ev with the property: v ∈ vr iff • start(vr) ≤ v and • ∀ev ∈ end(vr) (ev ≰ v)
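The membership test in this definition can be written down directly. The following is a minimal sketch, not the authors' implementation: the class `VersionRange`, the helper `leq` and the `PARENT` map are names introduced here, and the tree shape (v1, v2, v3 in a chain, with v4 and v5 both children of v3) is an assumption, since the slides do not spell it out.

```python
from dataclasses import dataclass

# Assumed tree shape: v1 -> v2 -> v3, with v4 and v5 both children of v3.
PARENT = {"v1": None, "v2": "v1", "v3": "v2", "v4": "v3", "v5": "v3"}

def leq(a, b, parent):
    """a <= b in the version tree: walk up from b looking for a."""
    while b is not None:
        if a == b:
            return True
        b = parent[b]
    return False

@dataclass(frozen=True)
class VersionRange:
    start: str
    end: frozenset  # "stop sign" versions, themselves outside the range

    def contains(self, v, parent):
        # v is in the range iff start <= v and no end version is <= v.
        if not leq(self.start, v, parent):
            return False
        return not any(leq(ev, v, parent) for ev in self.end)

# Record R: inserted at v1, updated at v4.
vr_R = VersionRange("v1", frozenset({"v4"}))
members = sorted(v for v in PARENT if vr_R.contains(v, PARENT))
print(members)
```

Adding a new version below v5 (for example `PARENT["v6"] = "v5"`) makes it a member without touching `vr_R`, which is the automatic expansion described on the previous slides.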

The three-version example revisited • Explicit version sets: <{v1, v3},k1,d1> <v2,k1,d’1> <{v1, v2},k2,d2> <v3,k2,d’2> <{v1, v2, v3},k3,d3> • Version ranges (start, end set): <(v1, {v2}),k1,d1> <(v2, { }),k1,d’1> <(v1, {v3}),k2,d2> <(v3, { }),k2,d’2> <(v1, { }),k3,d3>

Data pages • Data pages (P) delimit one version range (vr) and one key range (kr) • We define KVR(P) = (kr(P), vr(P)) • A data page with KVR(P) = (kr, vr) stores all records <vr’,k,d> such that: • k ∈ kr and • vr ∩ vr’ ≠ ∅
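The storage condition can be illustrated with a small Python sketch over the records of the three-version example. Everything named here (`page_stores`, `ranges_intersect`, the half-open key interval, the sample page) is an assumption made for illustration. The intersection test relies on a property of trees: two version ranges overlap exactly when the start of one lies inside the other.

```python
# Version tree of the three-version example: v2 and v3 both derive from v1.
PARENT = {"v1": None, "v2": "v1", "v3": "v1"}

def leq(a, b):
    """a <= b in the version tree."""
    while b is not None:
        if a == b:
            return True
        b = PARENT[b]
    return False

def in_range(v, vr):
    """v is in vr = (start, end_set) iff start <= v and no end version <= v."""
    start, end = vr
    return leq(start, v) and not any(leq(ev, v) for ev in end)

def ranges_intersect(vr1, vr2):
    # In a tree, two ranges overlap exactly when one start lies in the other range.
    return in_range(vr1[0], vr2) or in_range(vr2[0], vr1)

def page_stores(kr, vr, record):
    """True if a page with KVR = (kr, vr) must hold the record <vr', k, d>."""
    rec_vr, key, _data = record
    low, high = kr  # assumed half-open key interval [low, high)
    return low <= key < high and ranges_intersect(vr, rec_vr)

RECORDS = [
    (("v1", {"v2"}), "k1", "d1"),
    (("v2", set()), "k1", "d'1"),
    (("v1", {"v3"}), "k2", "d2"),
    (("v3", set()), "k2", "d'2"),
    (("v1", set()), "k3", "d3"),
]

# Hypothetical page covering keys k1 and k2 on the branch that starts at v2.
page_kr, page_vr = ("k1", "k3"), ("v2", set())
stored = [(k, d) for (vr, k, d) in RECORDS if page_stores(page_kr, page_vr, (vr, k, d))]
print(stored)
```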

Compact record representation • To store records in data pages we use the compact record representation • Version-range form: <(v1, {v2}),k1,d1> <(v2, { }),k1,d’1> <(v1, {v3}),k2,d2> <(v3, { }),k2,d’2> <(v1, { }),k3,d3> • Compact form: <v1,k1,d1> <v1,k2,d2> <v1,k3,d3> <v2,k1,d’1> <v3,k2,d’2> • Deletion events do not cause loss of content; they are recorded by means of compact null records, e.g. <v4,k2,null>
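A compact record keeps only its start version, so the end versions have to be inferred from the other records with the same key. The sketch below shows one way this could work and how a stabbing query can be answered from compact records: for each key, walk up the version tree from the queried version to the nearest version that wrote that key. The function names and the placement of v4 as a child of v3 are assumptions made here, not details from the slides.

```python
# Three-version tree plus a deleting version v4 (assumed to be a child of v3).
PARENT = {"v1": None, "v2": "v1", "v3": "v1", "v4": "v3"}

# Compact records <v, k, d>; None plays the role of the null (deletion) record.
RECORDS = [
    ("v1", "k1", "d1"), ("v1", "k2", "d2"), ("v1", "k3", "d3"),
    ("v2", "k1", "d'1"), ("v3", "k2", "d'2"), ("v4", "k2", None),
]

def stabbing_query(version, records, parent):
    """Return {key: data} for all records alive at the given version."""
    by_key = {}
    for v, k, d in records:
        by_key.setdefault(k, {})[v] = d
    alive = {}
    for k, writers in by_key.items():
        cur = version
        # Nearest ancestor-or-self version that wrote this key.
        while cur is not None and cur not in writers:
            cur = parent[cur]
        if cur is not None and writers[cur] is not None:
            alive[k] = writers[cur]
    return alive

def implicit_end(start, key, records, parent):
    """End versions of the record <start, key, _>, recovered from the other writers."""
    writers = {v for v, k, _ in records if k == key}
    ends = set()
    for w in writers - {start}:
        cur = parent[w]
        while cur is not None and cur not in writers:
            cur = parent[cur]
        if cur == start:  # w is the first rewrite of the key below start
            ends.add(w)
    return ends

print(stabbing_query("v2", RECORDS, PARENT))
print(stabbing_query("v4", RECORDS, PARENT))
print(implicit_end("v1", "k2", RECORDS, PARENT))
```

Under these assumptions the recovered end sets agree with the version-range form on the slide, for example (v1, {v3}) for the first k2 record.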

Looking for efficiency • To make the stabbing query efficient, a substantial percentage of the records in an accessed page must be alive at the queried version v • The page splitting policy • When a page P gets full, a version split of P must be done (here the current version vn is used) • A new page P’ is allocated with VR(P’) = (vn, ∅) • Records from P can be moved or copied to P’

Page splitting policy • Non-null records created by vn are moved from P to P’ • Non-null records whose version range lies in VR(P) ∩ VR(P’) are copied to P’
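The move-or-copy rule can be sketched as follows, using compact records and a hypothetical linear history v1, v2, v3 with the split done at vn = v3. This is an illustration under stated assumptions rather than the authors' algorithm: the function `version_split` is a name introduced here, copied records are assumed to keep their original start version, and null records are assumed to stay in the old page.

```python
# Hypothetical linear history; the split happens at the current version v3.
PARENT = {"v1": None, "v2": "v1", "v3": "v2"}

def version_split(page_records, vn, parent):
    """Split a full page at version vn; returns (old_page, new_page)."""
    by_key = {}
    for rec in page_records:
        by_key.setdefault(rec[1], {})[rec[0]] = rec
    # For each key, the record visible at vn (nearest ancestor-or-self writer).
    visible = {}
    for k, writers in by_key.items():
        cur = vn
        while cur is not None and cur not in writers:
            cur = parent[cur]
        if cur is not None:
            visible[k] = writers[cur]
    old_page, new_page = [], []
    for rec in page_records:
        v, k, d = rec
        if v == vn and d is not None:
            new_page.append(rec)              # created by vn: moved
        elif visible.get(k) == rec and d is not None:
            old_page.append(rec)              # still alive at vn: copied
            new_page.append(rec)
        else:
            old_page.append(rec)              # dead at vn or null: stays in P
    return old_page, new_page

P = [
    ("v1", "k1", "d1"), ("v1", "k2", "d2"),
    ("v2", "k1", "d'1"), ("v2", "k2", None),
    ("v3", "k3", "d3"),
]
old_page, new_page = version_split(P, "v3", PARENT)
print(old_page)
print(new_page)
```

After the split, every record in the new page is alive at vn, which is what keeps a stabbing query at the current version efficient.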

Page splitting policy • Some kinds of key splits are allowed in our framework (similar to B-tree page splits) • After a version split, if the new page has more than a certain threshold value Tk (we call this a version-and-key split) • When a full page has version range (current_version, ∅) (we call this a restricted-key split) • Pure key splits cannot guarantee a minimum number of records alive for a given version

Consolidation • Delete operations may damage stabbing query efficiency • When the number of records alive in P at vn falls below a threshold Tc, a consolidation process is triggered • A sparse page and a suitable sibling are current-version split, and the results are combined into one page • Transactions with a large number of deletions may generate ghost pages

Efficiency guarantee • We start with a page D at version v1 having n alive records • Our framework guarantees a minimum number of records alive at v in a data page D when answering a stabbing query (v ∈ VR(D)) under different scenarios

Efficiency guarantee: Assertions • No deletes and only version splits: • at least n • No deletes and only current-version or version-and-key or restricted-key splits: • at least min(n, Tk/2) • Any kind of transactions and version splits, version-and-key splits, restricted-key splits and node consolidation: • at least min(Tc, n)

Index pages • Index pages + data pages form a DAG • Index pages also correspond to key-version ranges • Index page entries contain for every child C: <VR(C), KR(C), Disk_address(C)> • Index page splits and consolidations follow the same policy as for data pages • Additional details about the properties and treatment of index pages can be found in the paper
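A lookup through one index page can be sketched as a filter over its entries: follow every child whose key range contains the key and whose version range contains the queried version. The `IndexEntry` type, the half-open key intervals, the disk addresses 100 and 101, and the linear version history are all assumptions made for this illustration.

```python
from typing import NamedTuple

# Hypothetical linear history; page P was version split at v3 into P and P'.
PARENT = {"v1": None, "v2": "v1", "v3": "v2", "v4": "v3"}

class IndexEntry(NamedTuple):
    vr: tuple      # (start version, set of end versions)
    kr: tuple      # assumed half-open key interval [low, high)
    address: int   # disk address of the child page

def leq(a, b):
    """a <= b in the version tree."""
    while b is not None:
        if a == b:
            return True
        b = PARENT[b]
    return False

def in_range(v, vr):
    start, end = vr
    return leq(start, v) and not any(leq(ev, v) for ev in end)

def find_children(entries, key, version):
    """Addresses of the children that may hold `key` as of `version`."""
    return [e.address for e in entries
            if e.kr[0] <= key < e.kr[1] and in_range(version, e.vr)]

ENTRIES = [
    IndexEntry(vr=("v1", {"v3"}), kr=("k1", "k9"), address=100),  # old page P
    IndexEntry(vr=("v3", set()), kr=("k1", "k9"), address=101),   # new page P'
]

print(find_children(ENTRIES, "k2", "v2"))  # historical version: old page
print(find_children(ENTRIES, "k2", "v4"))  # descendant of v3: new page
```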

Conclusions • Versioned data are not trivial to deal with • Our framework • helps in understanding the implications of managing and retrieving versioned data • gives clear guidance on representing this kind of data in a compact and robust way • supports realistic assumptions on transactions
