Uncertainty Lineage Data Bases - PowerPoint PPT Presentation

Presentation Transcript

  1. Uncertainty Lineage Data Bases Very Large Data Bases 1975 2006

  2. UNCERTAINTY LINEAGE DATA ULDBs: Databases with Uncertainty and Lineage Omar Benjelloun, Anish Das Sarma, Alon Halevy, Jennifer Widom Stanford InfoLab

  3. Motivation • Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate, ...) • Many of the same applications need to track the lineage of their data • Neither uncertainty nor lineage is supported by conventional DBMSs • Coincidence or Fate?

  4. Sample Applications Needing Uncertainty and Lineage • Scientific databases • Sensor databases • Data cleaning • Data integration • Information extraction

  5. Trio Project Building a new kind of DBMS in which: • Data • Uncertainty • Lineage are all first-class interrelated concepts


  6. Coincidence or Fate? Lineage and Uncertainty • Lots of independent work in lineage and uncertainty (related work at end of talk) • Turns out: The connection between uncertainty and lineage goes deeper than just a shared need by several applications

  7. Lineage and Uncertainty • Lineage... • Enables simple and consistent representation of uncertain data • Correlates uncertainty in query results with uncertainty in the input data • Can make computation over uncertain data more efficient

  8. Outline of the Talk • The ULDB data model • Querying ULDBs • ULDB properties • Membership and extraction operations • Confidences • Current, related, and future work

  9. Running Example: Crime Solver Saw(witness,car) Drives(person,car) Suspects(person) = πperson(Saw ⋈ Drives)

  10. Uncertainty • An uncertain database represents a set of possible instances. Examples: • Amy saw either a Honda or a Toyota • Jimmy drives a Toyota, a Mazda, or both • Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3 • Hank is a suspect with confidence 0.7

  11. Uncertainty in a ULDB 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences

  12. Uncertainty in a ULDB 1. Alternatives: uncertainty about value 2. ‘?’ (Maybe) Annotations 3. Confidences [Figure: example relation with alternatives; three possible instances]

  13. Uncertainty in a ULDB 1. Alternatives 2. ‘?’ (Maybe): uncertainty about presence 3. Confidences [Figure: example relation with a ‘?’ annotation; six possible instances]

  14. Uncertainty in a ULDB 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences: weighted uncertainty [Figure: example relation with a ‘?’ annotation and confidences; six possible instances, each with a probability]
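
The three kinds of uncertainty above can be sketched in a few lines of Python. This is an illustration only, not Trio code: the relation contents, the confidence values, and the function name `possible_instances` are assumptions, and the sketch assumes that x-tuples are independent and that a confidence sum below 1 stands for a ‘?’ annotation.

```python
from itertools import product

# Hypothetical x-tuples: each is a list of (tuple value, confidence) alternatives.
# If the confidences sum to less than 1, the remainder is the probability that
# the x-tuple is absent, which plays the role of the '?' annotation here.
AMY = [(("Amy", "Honda"), 0.5), (("Amy", "Toyota"), 0.3), (("Amy", "Mazda"), 0.2)]
BETTY = [(("Betty", "Acura"), 0.6)]  # absent with probability 0.4 (a maybe-tuple)


def possible_instances(relation):
    """Enumerate (instance, probability) pairs, assuming independent x-tuples."""
    per_tuple = []
    for alternatives in relation:
        choices = list(alternatives)
        absent = 1.0 - sum(conf for _, conf in alternatives)
        if absent > 1e-9:
            # None marks the choice in which this x-tuple contributes nothing.
            choices.append((None, absent))
        per_tuple.append(choices)

    result = []
    for combo in product(*per_tuple):
        instance = frozenset(value for value, _ in combo if value is not None)
        probability = 1.0
        for _, conf in combo:
            probability *= conf
        result.append((instance, probability))
    return result


if __name__ == "__main__":
    for instance, probability in possible_instances([AMY, BETTY]):
        print(sorted(instance), round(probability, 3))
```

With only the three alternatives for Amy there are three possible instances; adding the maybe-tuple for Betty doubles that to six, and the probabilities of the six instances sum to 1.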

  15. Data Models for Uncertainty • Our model (so far) is not especially new • We spent some time exploring the space of models for uncertainty [ICDE 2006] • Tension between understandability and expressiveness • Our model is understandable • But it is not complete, or even closed under common operations

  16. Closure and Completeness • Completeness: can represent all sets of possible instances • Closure: can represent results of operations • Note: Completeness ⇒ Closure

  17. Model (so far) Not Closed • Suspects = πperson(Saw ⋈ Drives) • CANNOT: does not correctly capture possible instances in the result [Figure: Saw, Drives, and a candidate Suspects table with ‘?’ annotations]

  18. Lineage to the Rescue • Lineage: “where data came from” • Internal lineage • External lineage (not covered in this talk) • In ULDBs: A function λ from alternatives to sets of alternatives (or external sources)

  19. Example with Lineage • Suspects = πperson(Saw ⋈ Drives) • λ(31) = {(11,2), (21,2)} • λ(32,1) = {(11,1), (22,1)}; λ(32,2) = {(11,1), (22,2)} • λ(33) = {(11,1), 23} • Correctly captures possible instances in the result [Figure: Suspects table with ‘?’ annotations and lineage]
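
A minimal sketch of why lineage fixes the closure problem, assuming hypothetical base data chosen to agree with the lineage listed on the slide (the names and cars are illustrative, not read from the figure). A derived alternative is taken to be present exactly when every alternative in its lineage is the one chosen for its base tuple.

```python
from itertools import product

# Hypothetical base data. Keys are tuple IDs; each value lists that tuple's
# alternatives, numbered from 1. None of these base tuples carries a '?'.
SAW = {11: [("Cathy", "Honda"), ("Cathy", "Mazda")]}
DRIVES = {
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda"), ("Frank", "Honda")],
    23: [("Hank", "Honda")],
}


def suspects_with_lineage(saw, drives):
    """Join on car, project person, and record the lineage of each result."""
    derived = []
    for s_id, s_alts in saw.items():
        for d_id, d_alts in drives.items():
            for i, (_, seen_car) in enumerate(s_alts, start=1):
                for j, (person, driven_car) in enumerate(d_alts, start=1):
                    if seen_car == driven_car:
                        derived.append((person, {(s_id, i), (d_id, j)}))
    return derived


def suspects_instances(saw, drives):
    """Collect the distinct Suspects instances over all base-data choices."""
    base = {**saw, **drives}
    ids = sorted(base)
    derived = suspects_with_lineage(saw, drives)
    instances = set()
    for picks in product(*(range(1, len(base[t]) + 1) for t in ids)):
        chosen = set(zip(ids, picks))
        # Present only if all lineage alternatives were chosen together.
        instances.add(frozenset(p for p, lin in derived if lin <= chosen))
    return instances


if __name__ == "__main__":
    for instance in suspects_instances(SAW, DRIVES):
        print(sorted(instance))
```

In this sketch Jimmy and Hank never appear in the same instance, because their lineages point at different alternatives of tuple 11. Independent ‘?’ annotations on the result tuples alone could not express that dependency.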

  20. ULDBs • Alternatives • ‘?’ (Maybe) Annotations • Confidences • Lineage ULDBs are Closed and Complete

  21. Outline of the Talk • The ULDB data model • Querying ULDBs • ULDB properties • Membership and extraction operations • Confidences • Current, related, and future work

  22. Querying ULDBs • Query Q on ULDB D [Diagram: the implementation of Q takes D to D’ = D + Result. D has possible instances D1, D2, …, Dn; applying Q on each instance gives Q(D1), Q(D2), …, Q(Dn), and D’ is a representation of those instances.]

  23. Well-Behaved ULDBs • If we start with a well-behaved ULDB and perform standard queries, it remains well-behaved • Intuitively (details in paper): • Acyclic: No cycles in the lineage • Deterministic: Non-empty lineages of distinct alternatives are distinct • Uniform: Alternatives of same tuple are derived from the same set of tuples

  24. ULDB Minimality • Data-minimality • Does every alternative appear in some possible instance? (no extraneous alternatives) • Is every maybe-tuple in R absent from some possible instance? (no extraneous ‘?’s) • Lineage-minimality

  25. Data-Minimality Examples: Extraneous ‘?’ [Figure: tables with lineage λ(20,1) = (10,1); λ(20,2) = (10,2); one ‘?’ is labeled extraneous]

  26. Data-Minimality Examples: Extraneous alternative [Figure: tables with ‘?’ annotations; one alternative is labeled extraneous]

  27. Data-Minimization • Extraneous alternative theorem: • An alternative is extraneous iff it is (possibly transitively) derived from multiple alternatives of the same tuple. • Extraneous “?” theorem • A “?” on tuple t is extraneous iff • it is derived from base tuples without “?” • t has as many alternatives as the product of the number in its base tuples • Minimization algorithm based on the theorems (see paper)
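
The extraneous “?” theorem can be read as a simple check. The sketch below is only a direct transcription of the two conditions as stated on the slide, not the minimization algorithm from the paper; the function name and the input encoding are assumptions, and it presumes t is a derived tuple whose base tuples are already known.

```python
from math import prod


def maybe_is_extraneous(num_alternatives, base_tuples):
    """Check the two slide conditions for a '?' on a derived tuple t.

    num_alternatives: number of alternatives of t.
    base_tuples: list of (number_of_alternatives, has_maybe) pairs, one per
                 base tuple that t is derived from (hypothetical encoding).
    """
    # Condition 1: t is derived only from base tuples without '?'.
    derived_from_certain = all(not has_maybe for _, has_maybe in base_tuples)
    # Condition 2: t has as many alternatives as the product over its base tuples.
    full_product = num_alternatives == prod(n for n, _ in base_tuples)
    return derived_from_certain and full_product


if __name__ == "__main__":
    # A two-alternative tuple derived from a two-alternative tuple without '?'.
    print(maybe_is_extraneous(2, [(2, False)]))
```

For instance, a tuple with two alternatives, each derived from a distinct alternative of one two-alternative base tuple that has no ‘?’, always appears in every instance, so a ‘?’ on it adds nothing.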

  28. ULDB Properties and Operations [Diagram relating the properties Data-minimal and Lineage-minimal to the operations Queries, Data-minimize, Lineage-minimize, Extraction, and Membership]

  29. Membership Questions • R has possible instances I1, I2, …, In • Does a given tuple t appear in some (all) possible instance(s) of R? • Polynomial algorithms based on data-minimization • Is a given table T one of (all of) the possible instances of R? • NP-hard

  30. Extraction • Extraction algorithm in paper [Figure: relations Saw, Drives, Eats, and Suspects]

  31. Outline of the Talk • The ULDB data model • Querying ULDBs • ULDB properties • Membership and extraction operations • Confidences • Current, related, and future work

  32. Confidences • Confidences supplied with base data • Trio computes confidences on query results • Default probabilistic interpretation • Can choose to plug in different arithmetic [Figure: example tables with ‘?’ annotations and the confidence values 0.3, 0.4, and 0.6, labeled Probabilistic and Min]

  33. Query Processing with Confidences • Previous approach (probabilistic databases) • Each operator computes confidences during query execution • Only certain query plans allowed • In ULDBs • Confidence of alternative A is function of confidences in its transitive lineage • Our approach: Decouple data and confidence computation • Use any query plan for data computation • Compute confidences on-demand using lineage • Can give arbitrarily large improvements
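
The decoupling described above can be sketched as follows: data is computed first with lineage recorded, and a confidence is computed only when asked for, by walking the transitive lineage down to base alternatives. Everything here is a hypothetical illustration rather than Trio's algorithm: the relation names, the confidence values, and the function names are assumptions, and the sketch assumes purely conjunctive (join-style) lineage over base alternatives that come from distinct, independent tuples.

```python
from functools import lru_cache

# Hypothetical data: confidences on base alternatives, lineage on derived ones.
BASE_CONF = {("saw", 1): 0.5, ("drives", 1): 0.6, ("owns", 1): 0.8}
LINEAGE = {
    ("suspects", 1): [("saw", 1), ("drives", 1)],
    ("charged", 1): [("suspects", 1), ("owns", 1)],
}


@lru_cache(maxsize=None)
def base_lineage(alt):
    """Return the base alternatives in the transitive lineage of alt (memoized)."""
    if alt in BASE_CONF:
        return frozenset([alt])
    found = frozenset()
    for parent in LINEAGE[alt]:
        found |= base_lineage(parent)
    return found


def confidence(alt, combine="probabilistic"):
    """Compute a confidence on demand from base confidences.

    combine="probabilistic" multiplies the base confidences (valid under the
    independence assumption stated above); combine="min" takes their minimum.
    """
    confs = [BASE_CONF[b] for b in base_lineage(alt)]
    if combine == "min":
        return min(confs)
    result = 1.0
    for c in confs:
        result *= c
    return result


if __name__ == "__main__":
    print(confidence(("suspects", 1)))
    print(confidence(("suspects", 1), combine="min"))
```

Because confidences depend only on lineage, the data computation is free to use any query plan, and memoizing the lineage walk avoids repeated traversal when several results share ancestors.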

  34. Current Work: Algorithms • Algorithms: confidence computation, extraneous data, membership questions • Minimize lineage traversal • Memoization • Batch computations

  35. The Trio Trio • Data Model • ULDBs (Coming: incomplete relations; continuous uncertainty; correlation uncertainty) • Query Language: TriQL • Simple extension to SQL • Query uncertainty, confidences, and lineage • System • Did you see our demo? • Version 1: Entirely on top of conventional DBMS • Surprisingly easy and complete, reasonably efficient

  36. Brief Related Work • Uncertainty • Modeling • C-tables [IL84], Probabilistic Databases [CP87], using Nested Relations [F90] • Systems • ProbView [LLRS97], MYSTIQ [BDM+05], ORION [CSP05], Trio [BDHW05] • Lineage • DBNotes [CTV05], Data Warehouses [CW03]

  37. UNCERTAINTY LINEAGE DATA but don’t forget the lineage… Thank You Search “stanford trio” (or, http://i.stanford.edu/trio)