1 / 25

Trio: A System for Data, Uncertainty, and Lineage

UNCERTAINTY. LINEAGE. DATA. Trio: A System for Data, Uncertainty, and Lineage. Search “stanford trio” http://i.stanford.edu/trio. People. Current Jennifer Widom (faculty) Omar Benjelloun (post-doc) Parag Agrawal, Anish Das Sarma, Shubha Nabar (PhD) Michi Mutsuzaki (MS)

lindsey
Download Presentation

Trio: A System for Data, Uncertainty, and Lineage

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. UNCERTAINTY LINEAGE DATA Trio: A System for Data, Uncertainty, and Lineage Search “stanford trio” http://i.stanford.edu/trio

  2. People • Current • Jennifer Widom (faculty) • Omar Benjelloun (post-doc) • Parag Agrawal, Anish Das Sarma, Shubha Nabar (PhD) • Michi Mutsuzaki (MS) • Tomoe Sugihara (visitor) • Incoming • Martin Theobald (post-doc) • Raghu Murthy (MS) • Ander de Keijzer (visitor) • Alums • Alon Halevy, Ashok Chandra (visitors) • Chris Hayworth (MS)

  3. Why Uncertainty + Lineage? • Many applications seem to need both • From a technical standpoint, it turns out that • lineage... • Enables simple and consistent representation of uncertain data • Correlates uncertainty in query results with uncertainty in the input data • Can make computation over uncertain data more efficient

  4. Trio Components • Data Model • ULDBs (Uncertainty-Lineage Databases): • Simple extension to relational model • Query Language • TriQL: Simple extension to SQL, well-defined semantics and intuitive behavior • System • Version 1: Complete system and GUI built on top of conventional DBMS

  5. Running Example: Crime-Solving • Saw(witness,car) // may be uncertain • Drives(person,car) // may be uncertain • Suspects(person)= πperson(Saw ⋈ Drives)

  6. Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences

  7. Our Model for Uncertainty • 1. Alternatives:uncertainty about value • 2. ‘?’ (Maybe) Annotations • 3. Confidences Three possible instances =

  8. Our Model for Uncertainty • 1. Alternatives • 2.‘?’ (Maybe): uncertainty about presence • 3. Confidences ? Six possible instances

  9. Our Model for Uncertainty • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences: weighted uncertainty ? Six possible instances, each with a probability

  10. Models for Uncertainty • Our model (so far) is not especially new • We spent some time exploring the space of models for uncertainty [ICDE 06, journal] • Tension between understandability and expressiveness • Our model is understandable • But it is not complete, or even closed under common operations

  11. Our Model is Not Closed Suspects= πperson(Saw ⋈ Drives) CANNOT Does not correctly capture possible instances in the result ? ? ?

  12. Lineage to the Rescue • Lineage • Captures “where data came from” • In Trio:A functionλ from alternatives to other alternatives (or external sources)

  13. Example with Lineage Correctly captures possible instances in the result Suspects= πperson(Saw ⋈ Drives) λ(31) = (11,2),(21,2) ? λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2) ? ? λ(33) = (11,1), 23

  14. Uncertainty-Lineage Databases (ULDBs) • 1. Alternatives • 2. ‘?’ (Maybe) Annotations • 3. Confidences • 4. Lineage • ULDBs are closed and complete • [VLDB 06]

  15. ULDBs: Lineage • Conjunctive lineage sufficient for most operations • Duplicate-elimination: Disjunctive lineage • Difference: Negative lineage • General case after multiple operations/queries: Boolean formula

  16. ULDBs: Interesting Questions • Data-minimality: extraneous alternatives, extraneous “?” • Lineage-minimality: harder • Membership: tuple and table, some-instance and all-instances • Coexistence: multiple tuples • Extraction: remove tables, retain possible-instances

  17. Example: Extraneous Data ? extraneous ? ?

  18. Example: Coexistence ? Can’t coexist ? ? ?

  19. Querying ULDBs: Semantics • Query Qon ULDB D implementation of Q D D’ D + Result operational semantics possible instances representation of instances Qon each instance D1, D2, …, Dn Q(D1), Q(D2), …, Q(Dn)

  20. Querying ULDBs: TriQL • Basic TriQL: SQL with new semantics • Obeys commutative diagram for uncertain data • Tracks lineage • Query results: new table or on-the-fly • Implemented TriQL: also built-in predicates conf(),lineage(), lineage*()

  21. Additional TriQL Constructs • [Language manual on web site] • “Horizontal subqueries” Refer to tuple alternatives as a relation • Unmerged (horizontal duplicates) • Flatten, GroupAlts • NoLineage, NoConf, NoMaybe • Query-specified confidences [done] • Data modification statements

  22. Confidence Computation • Confidences computed on-demand based on lineage • Confidence of alternative A is function of confidences in λ*(A) • Permits any query plan for data computation • Default probabilistic interpretation, but queries can override SELECT person, min(conf(Saw),conf(Drives)) as conf FROM Saw, Drives WHERE Saw.car = Drives.car

  23. Trio System: Version 1 TrioExplorer (GUI client) • DDL commands • TriQL queries • Schema browsing • Table browsing • Explore lineage • On-demand • confidence • computation Command-line client Trio API and translator (Python) • “Verticalize” • Shared IDs for • alternatives • Columns for • confidence,“?” Standard SQL • Table types • Schema-level • lineage structure Standard relational DBMS • conf() • lineage() “==>” • lineage*() “==>>” Encoded Data Tables Trio Metadata • One per result • table • Uses unique IDs Lineage Tables Trio Stored Procedures

  24. Current & Future Topics • Algorithms: confidence computation, coexistence • extraneous data • Minimize lineage traversal • Memoization • Batch operations • System • Full query language • More internal processing ? • Storage and indexing • Statistics and query optimization

  25. Current & Future Topics • Top-K by confidence • Extend basic uncertainty model • Incomplete relations • Continuous uncertainty • Correlated uncertainty ? • External lineage, • update lineage, • versioning

More Related