1 / 45

A Generic Provenance Middleware for Database Queries , Updates, and Transactions

A Generic Provenance Middleware for Database Queries , Updates, and Transactions. Bahareh Sadat Arab 1 , Dieter Gawlick 2 , Venkatesh Radhakrishnan 2 , Hao Guo 1 , Boris Glavic 1. IIT DBGroup 1. Oracle 2. Outline. ❶ Motivation and Overview ❷ GProM Vision

Download Presentation

A Generic Provenance Middleware for Database Queries , Updates, and Transactions

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Generic Provenance Middleware for Database Queries, Updates, and Transactions Bahareh Sadat Arab1, Dieter Gawlick2, VenkateshRadhakrishnan2, HaoGuo1, Boris Glavic1 IIT DBGroup1 Oracle2

  2. Outline ❶ Motivation and Overview ❷ GProM Vision ❸ Provenance for Transactions GProM - Provenance for Queries, Updates, and Transactions

  3. Introduction • Data Provenance • Information about the origin and creation process data • Provenance tracking for database operations • Considerable interest from database community in last decade • The de-facto standard for database provenance [1,2,3,4,5] • model provenance as annotations on data (e.g., tuples) • compute the provenance by propagating annotations (query rewrite) SELECT DISTINCT Owner FROM CannAcc; • [1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, Springer, 2013. • [2] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS, 2013. • [3] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373–396, 2005. • [4] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, pages 1151–1154, 2006. • [5] G. Karvounarakisand T. Green. Semiring-annotated data: Queriesand provenance. SIGMOD Record, 41(3):5–14, 2012. GProM - Provenance for Queries, Updates, and Transactions

  4. Use Cases • Debugging data and transformations (queries)[1] • Probabilistic databases (queries)[5] • Auditing and compliance (transactions and update statements)[6] • Understanding data integration transformations (queries and transactions) • Assessing data quality and trust (queries and transactions)[7] • Computing provenance for updates and transactions is essential for many use cases. • [1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, pringer, 2013. • [5] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, 2006. • [6] D. Gawlick and V. Radhakrishnan. Fine grain provenance using temporal databases. In TaPP, 2011. • [7] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 2012. GProM - Provenance for Queries, Updates, and Transactions

  5. Shortcomings of State-of-the-Art • No practical implementation for updates • No system or model supports transactions • Inflexible provenance storage • Always on [2,3] • On-demand only [1] • Query rewrite use atypical access patterns and operator sequences • -> leads to poor execution plans • Most systems: only one type of provenance • [1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, pringer, 2013. • [2] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 2005. • [3] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS, 2013. GProM - Provenance for Queries, Updates, and Transactions

  6. Objectives • Vision: Generic Provenance Database Middleware (GProM). • Provenance for • Queries, updates, and transactions • User decides when to compute and store provenance • Supports multiple provenance models • Database-independent • Tracking provenance of concurrent transactions • Reenactment Queries GProM - Provenance for Queries, Updates, and Transactions

  7. Contributions • First solution for provenance of transactions • Retroactive on-demand provenance computation • Using read-only reenactment • Only requires audit log + time travel • Supported by most DBMS • No additional storage and runtime overhead • Non-invasive provenance computation • query rewrite + annotation propagation GProM - Provenance for Queries, Updates, and Transactions

  8. Outline ❶ Motivation and Overview ❷ GProM Vision ❸ Provenance for Transactions GProM - Provenance for Queries, Updates, and Transactions

  9. System Architecture • Database independent middleware • Plug-able parser and SQL code generator • Internal query representation • Relational Algebra Graph Model (AGM) • Core driver: Query rewrites • Provenance Computation • Flexible storagepolicies for provenance • Provenance import/export • AGM Optimizer (rewritten queries) • Extensibility: Rewrite Specification Language (RSL) • Initial prototype build on-top of Oracle GProM - Provenance for Queries, Updates, and Transactions

  10. GProM Overview GProM - Provenance for Queries, Updates, and Transactions

  11. Provenance Computation • Query rewrite • Take original query q and rewrite into q+ Computes original results + provenance • Propagate provenance through operations GProM - Provenance for Queries, Updates, and Transactions

  12. Example Rewrite • Input: SELECT DISTINCT u.OwnerFROM Usacc u, CanAcc c WHERE u.ID= c.ID; • Rewrite Parts: USaccSELECT ID, Owner, Balance, Type, ID AS P1, OwnerAS P2, BalanceAS P3, Type AS P4 FROM USacc CanAccSELECT ID, Owner, Balance, Type, ID AS P5, Owner AS P6, Balance AS P7, TypeAS P8 FROM CanAcc WHERE u.ID= c.IDWHERE u.ID= c.ID SELECT DISTINCT OwnerSELECT Owner, P1, P2, P3, P4, P5, P6, P7, P8 • Output: SELECT u.Owner, P1, P2, P3, P4, P5, P6, P7, P8 FROM (SELECT ID, Owner, Balance, Type, ID AS P1, OwnerAS P2, Balance AS P3, TypeAS P4 FROM USacc) u (SELECT ID, Owner, Balance, Type, ID AS P5, OwnerAS P6, Balance AS P7, TypeAS P8 FROM CanAcc) c WHERE u.ID= c.ID; GProM - Provenance for Queries, Updates, and Transactions

  13. Provenance Computation • Operates on relational algebra representation of queries • Fixed set of rewrite rules per provenance type: • One per type of algebra operator • Recursive top-down rewrite • For each relation access: duplicate attributes as provenance • For each operator: replace with algebra graph that propagates provenance annotations • Composable GProM - Provenance for Queries, Updates, and Transactions

  14. Supporting Past Queries, Updates,and Transactions • Only needs audit logand time travel • supported by most DBMS • Sufficient for provenance of past queries [4] • Our contribution • Sufficient for updates and transactions • [4] J. Zhang and H. Jagadish. Lost source provenance. In EDBT, 2010. GProM - Provenance for Queries, Updates, and Transactions

  15. Provenance Generation andStorage Policies • GProMdefault • Only compute provenance if explicitly requested • User can register storage policies • When to store which type of provenance POLICY storeOnR{ FIRE ON Query, Insert q WHEN Root(q) +=> Table(R) COMPUTE PI-CS STORE AS NEW TABLE NAMING SCHEME Hash } GProM - Provenance for Queries, Updates, and Transactions

  16. Optimizing Rewritten Queries • Query rewrite use atypical access patterns and operator sequences leads to poor execution plans • Optimization for rewritten queries • Heuristic • Cost-based SELECT ID, Owner, Balance, CASE WHEN Balance > 1000000 THEN 'Premium ' ELSE Type END AS Type, prov_CanAcc_ID, prov_CanAcc_Owner, prov_CanAcc_Balance, prov_CanAcc_Type, prov_USacc_ID, prov_USacc_Owner, prov_USacc_Balance, prov_USacc_Type FROM u1 ... SELECT ID, Owner, Balance, 'Premium ' AS Type, prov_CanAcc_ID, prov_CanAcc_Owner, prov_CanAcc_Balance, prov_CanAcc_Type, prov_USacc_ID, prov_USacc_Owner, prov_USacc_Balance, prov_USacc_Type FROM u1 WHERE Balance > 1000000 UNION ALL SELECT * FROM u1 WHERE (Balance > 1000000) IS NOT TRUE GProM - Provenance for Queries, Updates, and Transactions

  17. Rewrite Extensibility • Extensible using Rewrite Specification Language (RSL) • Concise specification of rewrite rules RULEmergeSelections { FORq => c => g WHEREq->type = selectionANDc->type = selection REWRITEINTO selection [pred= q->predANDc->pred] => g } GProM - Provenance for Queries, Updates, and Transactions

  18. Outline ❶ Motivation and Overview ❷ GProM Vision ❸ Provenance for Transactions GProM - Provenance for Queries, Updates, and Transactions

  19. Provenance of Transactions GProM - Provenance for Queries, Updates, and Transactions

  20. Provenance of Transactions INSERT INTO USacc (SELECT ID, Owner, Balance, ‘Standard’ AS Type FROMCanAcc WHERE Type = ‘US_dollar’); UPDATEUSacc SET Type = ’Premium’ WHERE Balance > 1000000; COMMIT; GProM - Provenance for Queries, Updates, and Transactions

  21. Provenance of Transactions INSERT INTO Usacc (SELECT ID, Owner, Balance, ‘Standard’ AS Type FROM CanAcc WHEREType = ‘US_dollar’); u1 u2 UPDATE Usacc SET Type = ’Premium’ WHERE Balance > 1000000; GProM - Provenance for Queries, Updates, and Transactions

  22. Provenance of Transactions • Our Approach: Reenactment + Provenance Propagation • Currently supports • Snapshot Isolation • Statement-level Snapshot Isolation 1 2 3 4 5 Gather Transaction Information Construct Update Reenactment Query Construct Transaction Reenactment Query Rewrite For Provenance Computation Execute Query GProM - Provenance for Queries, Updates, and Transactions

  23. 1.Gather Transaction Information • Retrieve SQL statementsof transaction from audit log • Update u1: INSERT INTO USacc (SELECTID, Owner, Balance, ‘Standard’ AS Type FROM CanAcc WHERE Type = ‘US_dollar’); • Update u2: UPDATEUsacc SET Type = ’Premium’ WHERE Balance > 1000000; GProM - Provenance for Queries, Updates, and Transactions

  24. 2. Translate Updates: Reenactment • Update reads table version and outputs updated table version • Multiple versions of the database • Each modification of a tuple t causes a new version to be created • Old tuple versions are kept (SI) • Add version annotation τ to provenance of each updated row • Use semi-ring model UPDATE Usacc SET Type=’Premium’ WHERE Balance>1000000; GProM - Provenance for Queries, Updates, and Transactions

  25. 2.Translate Updates • Construct update reenactment query • Simulates effect of update • Read DB version seen by update using time travel • Query result = updated table (Annotation-Equivalent) UPDATE Usacc SETType = ’Premium’ WHERE Balance > 1000000; SELECT ID, Owner, Balance, ’Standard’ AS Type FROMCanAccAS OF SCN 3652 WHERE Type=‘US_dollar’ UNION ALL SELECT * FROMUsaccAS OF SCN 3652; INSERT INTO Usacc (SELECT ID, Owner, Balance, ‘Standard’ AS Type FROM CanAcc WHERE Type = ‘US_dollar’); SELECTID, Owner, Balance, ’Premium’ AS Type FROMUsaccAS OF SCN 3652 WHERE Balance>1000000 UNION ALL SELECT* FROMUsaccAS OF SCN3652 WHERE (Balance>1000000) IS NOT TRUE; GProM - Provenance for Queries, Updates, and Transactions

  26. 3. Construct Reenactment Query • Simulates the whole transaction • Annotation-Equivalentto original transaction • Merge reenactment queries based on concurrency control protocol • Each concurrency control requires a different merge process • SERIALIZABLE (Snapshot isolation) -> modifications before the transaction started + previous updates of the transaction • READ COMMITTED (Snapshot isolation) -> sees committed changes by concurrent transaction WHITU1AS (SELECT ID, Owner, Balance, ’Standard’ AS Type FROMCanAccAS OF SCN 3652 WHERE Type=‘US_dollar’ UNION ALL SELECT * FROMUsaccAS OF SCN 3652); SELECT ID, Owner, Balance, ’Premium’ AS Type FROM U1 WHERE Balance>1000000 UNION ALL SELECT* FROMU1 WHERE (Balance>1000000) IS NOT TRUE; GProM - Provenance for Queries, Updates, and Transactions

  27. 4. Rewrite For Provenance Computation • Rewrite reenactment query to compute provenance using annotation propagation WITH u1 AS (SELECT ID, Owner, Balance, ’Standard ’ AS Type, ID AS prov_CanAcc_ID, . . . NULL AS prov_USacc_ID, . . . 1 AS updated, FROM CanAccAS OF SCN 3652 WHERE Type = ’US dollar ’ UNION ALL SELECT ID , Owner , Balance , Type , NULL AS prov_CanAcc_ID, . . . ID AS prov_USacc_ID, . . . 0 AS updated FROM USaccAS OF SCN 3652), . . . u1 AS (SELECT . . . GProM - Provenance for Queries, Updates, and Transactions

  28. 4. Execute Query • Execute query to retrieve provenance GProM - Provenance for Queries, Updates, and Transactions

  29. Conclusions • We present our vision for GProM • Database-independent middleware for computing provenance of queries, updates, and transactions. • First solution for provenance of transactions • Query rewrite techniques on steroids: • Provenance computation • Transaction reenactment • Provenance translation • Provenance storage • Optimization • Extensible through RSL language GProM - Provenance for Queries, Updates, and Transactions

  30. Future Works • Implementing additional provenance types • Comprehensive study of heuristic and cost-based optimizations • Design and implementation of RSL • Implementing additional provenance formats • Study reenactment for other concurrency control mechanisms • Locking protocols (2PL) • Investigate additional Use-cases for Reenactment • Transaction backout • Retroactive What-if analysis GProM - Provenance for Queries, Updates, and Transactions

  31. Questions? • Homepage: Bahareh:http://www.cs.iit.edu/~dbgroup/people/barab.php Boris:http://www.cs.iit.edu/~glavic/ • DBGroup: http://www.cs.iit.edu/~dbgroup/ • GProM Project (partially funded by Oracle) http://www.cs.iit.edu/~dbgroup/research/oracletprov.php • Perm http://www.cs.iit.edu/~dbgroup/research/perm.php GProM - Provenance for Queries, Updates, and Transactions

  32. References [1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, pages 291–320. Springer, 2013. [2] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373–396, 2005. [3] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS, 38(3): 19, 2013. [4] J. Zhang and H. Jagadish. Lost source provenance. In EDBT, pages 311–322, 2010. [5] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, pages 1151–1154, 2006. [6] D. Gawlick and V. Radhakrishnan. Fine grain provenance using temporal databases. In TaPP, 2011. [7] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 41(3):5–14, 2012.

  33. Q-Bomb • One pattern that arises from reenactment are long chains of SELECT clauses using CASE • Each level references attributes from next level multiple times • Subquery pull-up creates expressions of size exponential in the number of SELECT clauses • In praxis: optimization never finishes • Minimal example using one row table SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b FROM SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b … FROM SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b FROM R

  34. Example Provenance Computation

  35. Example – Update Reenactment

  36. Example – Trans. Reenactment

  37. Rewrite Reenactment Query

  38. Execute Rewritten Query

  39. Types of Update Operations - Insert • Insert executed at time t • Updated version of R contains • All tuples from previous version • All newly inserted tuples • Fixed tuple defined in VALUES clause • Results of query over database version at t Union these two sets (SELECT * FROM RAS OF t) UNION ALL (SELECT v1 AS a1, ... ,vnAS an); INSERT INTO RVALUES (v1, ... ,vn); (SELECT * FROM RAS OF t) UNION ALL (q(t)); INSERT INTO R (q);

  40. Types of Update Operations - Delete • Delete executed at time t • Tuples in updated version of R: • All tuples from for which Conditionis notfulfilled SELECT * FROM RAS OF t WHERE (C) IS NOT TRUE; DELETE FROM RWHERE C;

  41. Types of Update Operations - Update • Update executed at time t • Find tuples where Conditionholds and update the attribute values • Find tuples where NOT Conditionholds Union these two sets (SELECT A’FROM RAS OF tWHERE C) UNION ALL (SELECT * FROM RAS OF tWHERE (C) IS NOT TRUE) UPDATE RSET AWHERE C;

  42. READ COMMITTED • Statement of a transaction T sees committed changes by concurrent transaction • For a given update we need to combine • tuples produced by previous statements of same transaction • tuples produced by transactions that committed before update • Observations • Once a transaction T modifies a tuple t, no other transaction can access t until T commits • Let ui be the update executed at time x of T that first modifies t • uiwill read the latest version committed x • If we know uithen updates of T before x do not have to look at t • Consider the database version 1 time unit (C-1) before commit of T • This contains all the tuple versions seen by the first update of T updating each individual tuple • Let t be a tuple version in this version and it’s start time is y • We know that updates from T which executed before y cannot have updated t • We can use version C-1 as input for reenactment as long as we hide tuple version t at y from an reenactment of an updated executed at x with x < y

  43. READ COMMITTED u1 AS (SELECT CASE WHEN Balance <=1000000 ANDversion <= 0 THEN'Standard ' ELSETypeEND AS Type, ID , Owner , Balance , CASE WHEN Balance <=1000000 AND version <= 0 THEN−1 ELSE version END AS version FROMUSacc AS OF SCN 3652) , u2 AS (SELECT CASE WHEN Balance > 1000000 ANDversion <= 1 THEN'Premium' ELSETypeEND AS Type, ID , Owner , Balance , CASE WHEN Balance > 1000000AND version <= 1 THEN −1 ELSE version END AS version FROM u1 ) SELECTID , Owner , Balance , Type FROM u2 WHERE version = −1;

  44. Database Independence • Encapsulate database-specific functionality in pluggable modules. • What needs to be adapted are : • Parser • SQL code generator • Metadata access • Audit log access • Time travel activation.

  45. Accessing Several Tables • Transactions Accessing Several Tables • We require user to specify which table she is interested in • Replace access to table with query for last update that modified the table

More Related