1 / 63

Trio: A System for Data, Uncertainty, and Lineage

Trio: A System for Data, Uncertainty, and Lineage. Martin Theobald Jennifer Widom Stanford University. Uncertainty in Databases. Not a new idea — proposed 20+ years ago “Classic” probabilistic database systems Uncertainty about attribute values Modeling of missing values/incompleteness

Download Presentation

Trio: A System for Data, Uncertainty, and Lineage

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Trio: A System for Data, Uncertainty, and Lineage Martin Theobald Jennifer Widom Stanford University

  2. Uncertainty in Databases • Not a new idea — proposed 20+ years ago • “Classic” probabilistic database systems • Uncertainty about attribute values • Modeling of missing values/incompleteness [Cavallo ‘87, Abiteboul ‘91, Garcia-Molina ‘92, Fuhr/Rölleke ‘97] • Most initial (18+ years) work largely theoretical; not much systems-building until recently: URank – U Waterloo, Orion – Purdue U, MayBMS - Saarland U • But applications weren’t ready anyway Are they now?

  3. The “Trio” in Trio • Data Student #123 is majoring in Econ:(123,Econ) Major • Uncertainty Student #123 is majoring in Econ or CS: (123, Econ ∥ CS) Major With confidence 60% student #456 is a CS major: (456, CS0.6) Major • Lineage 456HardWorkerderived from: (456, CS) Major and “CS is hard” some web page

  4. Original Motivation for Trio New Application Domains • Many involve data that is uncertain • (probabilistic, imprecise, fuzzy, incomplete...) • Many of the same ones need to track the lineage (provenance) of their data Coincidence or Fate?

  5. Original Motivation for Trio New Application Domains • Many involve data that is uncertain • (probabilistic, imprecise, fuzzy, incomplete...) • Many of the same ones need to track the lineage (provenance) of their data  Trio combines uncertainty & lineage for a closed & complete representation model Neither uncertainty nor lineage is supported in current commercial database systems

  6. Sample Applications • Information extraction • Find & label entities in unstructured text • Often probabilistic • Information integration • Combine data from multiple sources • Inconsistencies, probabilistic schema mappings • “Human input” data / Scientific experiments • Inexact/incomplete data • Many levels of “derived data products”

  7. Sample Applications • Sensor data management • Approximate readings • Missing readings • Levels of data aggregation • Deduplication (“data cleaning”) • Object linkage, entity resolution • Often heuristic/probabilistic • But *NOT* approximate query processing • Fast but inexact answers

  8. Our Claim Coincidence or Fate? • The connection between uncertainty and lineage goes deeper than just a shared need by several applications

  9. Substantiation of Claim • [technical] Lineage: • Enables simple and consistent representation of uncertain data • Correlates uncertainty in query results with uncertainty in the input data • Can make computations over uncertain data more efficient • [fluffy] “Applications may use lineage to reduce or resolve uncertainty”

  10. Our Goal • Develop a new kind of database management system (DBMS) in which: • Data • Uncertainty • Lineage • are all first-class interrelated concepts • With all the “usual” DBMS features

  11. The “Usual” DBMS Features • Efficient, • Convenient, • Safe, • Multi-User storage of and access to • Massive amounts of • Persistent data

  12. Standard Relational DBMSs • Persistent; Convenient • All data stored in simple tables (relations) • Queries and updates via simple but powerful declarative language (SQL) • Multi-User; Safe • Transactions, Recovery • Massive; Efficient • Storage and indexing structures • Query optimization

  13. Trio: What Changes • Persistent; Convenient • All data stored in “uncertain” relations • Queries and updates via simple but powerful declarative language • Multi-User; Safe • Transactions, Recovery • Massive; Efficient • Storage and indexing structures • Query optimization

  14. Trio: What Changes • Persistent; Convenient • All data stored in “uncertain” relations • Queries and updates via simple but powerful declarative language • Multi-User; Safe – standard DBMS underneath • Transactions, Recovery • Massive; Efficient – standard DBMS underneath • Storage and indexing structures • Query optimization

  15. Another “Trio” in Trio • Data Model • Simplest extension to relational model that’s sufficiently expressive • Query Language • Simple extension to SQL with well-defined semantics and intuitive behavior • System • A complete open-source DBMS that people want to use

  16. Another “Trio” in Trio • Data Model • Uncertainty-Lineage Databases (ULDBs) • Query Language • TriQL • System • Trio-One— built on top of standard DBMS

  17. Remainder of Talk • Data Model • Uncertainty-Lineage Databases (ULDBs) • Query Language • TriQL • System Trio-One • Real-World Applications PharmGKB.org SERF

  18. Running Example: Crime-Solving • Saw (witness, color, car)// may be uncertain • Drives (person,color, car)// may be uncertain • Suspects (person)= πperson(Saw ⋈ Drives)

  19. In Standard Relational DBMS Create Table Suspects as Select person From Saw, Drives Where Saw.color = Drives.color And Saw.car = Drives.car

  20. Data Model: Uncertainty • An uncertain database enodes a set of possible instances based on different kinds of uncertainty: • Jimmy drives either a Toyota or a Mazda, but not both. • Amy saw either a Honda or a Toyota, or nothing? • Hank is a suspect with confidence 0.7 • Betty saw either an Acura with confidence 0.5 or a Toyota with confidence 0.3 or nothing with confidence 0.2 ….

  21. Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences

  22. Our Model for Uncertainty • 1. Alternatives:uncertainty about value • 2. ‘?’ (Maybe) Annotations • 3. Confidences Three possible instances

  23. Our Model for Uncertainty • 1. Alternatives • 2.‘?’ (Maybe): uncertainty about presence • 3. Confidences ? Six possible instances

  24. Our Model for Uncertainty • 1. Alternatives • 2.‘?’ (Maybe):uncertainty about presence • 3. Confidences absent  unknown ?

  25. Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences: weighted uncertainty ? Six possible instances, each with a probability

  26. Our Model is Not Closed Suspects= πperson(Saw ⋈ Drives) CANNOT Does not correctly capture possible instances in the result (not 12)! ? ? ?

  27. to the Rescue Lineage • Lineage (provenance): “where data came from” • Internal lineage – from another Trio relation • External lineage – from an external source • In Trio: A Boolean functionλ from data elements to other data elements (or external sources)

  28. Example with Lineage Suspects= πperson(Saw ⋈ Drives) ? ? ? λ(31) = (11,2)˄(21,2) λ(32,1) = (11,1)˄(22,1) ; λ(32,2) = (11,1)˄(22,2) λ(33) = (11,1) ˄ 23

  29. Example with Lineage Correctly captures possible instances in the result (7) Suspects= πperson(Saw ⋈ Drives) λ(31) = (11,2)˄(21,2) ? λ(32,1) = (11,1)˄(22,1); λ(32,2) = (11,1)˄(22,2) ? λ(33) = (11,1) ˄ 23 ?

  30. Trio Data Model Uncertainty-Lineage Databases (ULDBs) • Alternatives • ‘?’ (Maybe) Annotations • Confidences • Lineage ULDBs are closed and complete [Databases with Uncertainty and Lineage, VLDB J. Special Issue 2/08]

  31. ULDBs: Lineage • Conjunctive lineage sufficient for most operations (,,⋈) • Disjunctive lineage for duplicate-elimination () • Negative lineage for set difference () • General case after several queries: • Boolean formula • Can represent anyfinite (sub)set of possible instances • Can capture arbitraryBoolean constraints among tuples

  32. ULDBs: Minimality • A ULDB relation R represents a set of possible • instances • Does every tuple inRappear in some possible instance? (no extraneous tuples) • Does every maybe-tuple inRnotappear in some possible instance?(no extraneous ‘?’s) • Also Data-minimality Lineage-minimality

  33. Data Minimality Examples • Extraneous ‘?’ λ(40,1)=(10,1) λ(40,2)=(10,2) ? extraneous ? ?

  34. Data Minimality Examples • Extraneous tuple λ(40,1)=(10,1)˄(10,2) ? extraneous ? ?

  35. ULDBs: Complexity Questions • Does a given tuple t appear in some(all)possible instance(s)of an uncertain relationR? • Is a given tableTone of(all of)the possible instances ofR? Polynomial algorithms based on data-minimization • Confidence computation over arbitrary Boolean expressions: •  Efficient approximation algorithms • (E.g. Monte Carlo simulations) NP-hard #P-complete

  36. Querying ULDBs TriQL • Simple extension to SQL • Formal semantics, intuitive meaning • Query uncertainty, confidences, and lineage

  37. Formal Semantics • Relational (SQL) query Qon ULDB D implementation of Q D D + Result operational semantics possible instances representation of instances Qon each instance D1, D2, …, Dn Q(D1), Q(D2), …, Q(Dn)

  38. Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car

  39. Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car

  40. Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car

  41. Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car ?

  42. Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car ?

  43. Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car ?

  44. Operational Semantics: Example SELECT person FROM Saw, Drives WHERE Saw.car = Drives.car ? ?

  45. Operational Semantics: Example CREATE TABLE Suspects AS SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car ? λ( ) = ... λ( ) = ... ? • Efficiently implementable on top of standard SQL • Data computation overhead: O(n log(n))

  46. Confidences • Confidences supplied with base data • Trio computes confidences on query results • Default probabilistic interpretation (#P-complete) • Plug in different arithmetics, e.g., Min(linear) 0.3 0.4 Probabilistic Min 0.6

  47. TriQL: Querying Confidences Built-in function:Conf() • SELECT person • FROM Saw, Drives • WHERE Saw.car = Drives.car • AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8 Similarly:Conf(*) • SELECT person • FROM Saw, Drives • WHERE Saw.car = Drives.car • AND Conf(*) > 0.5 Naïve data-minimization: WHERE Conf(*) > 0

  48. TriQL: Querying Lineage Built-in join predicate:Lineage() • SELECT Suspects.person • FROM Suspects, Saw • WHERE Saw.witness = ‘Amy’ • AND Lineage(Suspects,Saw) X ==> Y shorthand for Lineage(X,Y)

  49. More TriQL Constructs • “Horizontal subqueries” • Refer to tuple alternatives as a relation • Aggregations: • (sum, count, avg, min, max)  (low, high, expected) • Fetch modifiers: • Merged (eliminate horizontal duplicates) • Flatten, GroupAlts, NoLineage, NoConf, NoMaybe • Set operators: Union [All], Intersect, Except • Conjunctive, Disjunctive & Negative Lineage

  50. Horizontal Subquery Example • List suspects with conf values based on accuser credibility

More Related