## Uncertainty Lineage Data Bases


**ULDBs: Databases with Uncertainty and Lineage**

Omar Benjelloun, Anish Das Sarma, Alon Halevy, Jennifer Widom

Stanford InfoLab

Very Large Data Bases 1975, Uncertainty Lineage Data Bases 2006

UNCERTAINTY · LINEAGE · DATA

**Motivation**

- Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate, ...)
- Many of the same applications need to track the lineage of their data
- Neither uncertainty nor lineage is supported by conventional DBMSs
- Coincidence or fate?

**Sample Applications Needing Uncertainty and Lineage**

- Scientific databases
- Sensor databases
- Data cleaning
- Data integration
- Information extraction

**Trio Project**

Building a new kind of DBMS in which:

- Data
- Uncertainty
- Lineage

are all first-class interrelated concepts

**Lineage and Uncertainty: Coincidence or Fate?**

- Lots of independent work in lineage and uncertainty (related work at end of talk)
- Turns out: the connection between uncertainty and lineage goes deeper than just a shared need by several applications

**Lineage and Uncertainty**

Lineage...

- Enables simple and consistent representation of uncertain data
- Correlates uncertainty in query results with uncertainty in the input data
- Can make computation over uncertain data more efficient

**Outline of the Talk**

- The ULDB data model
- Querying ULDBs
- ULDB properties
- Membership and extraction operations
- Confidences
- Current, related, and future work

**Running Example: Crime Solver**

- Saw(witness, car)
- Drives(person, car)
- Suspects(person) = πperson(Saw ⋈ Drives)

**Uncertainty**

An uncertain database represents a set of possible instances. Examples:

- Amy saw either a Honda or a Toyota
- Jimmy drives a Toyota, a Mazda, or both
- Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3
- Hank is a suspect with confidence 0.7

**Uncertainty in a ULDB**

1. Alternatives: uncertainty about value (example: three possible instances)
2. ‘?’ (Maybe) annotations: uncertainty about presence (example: six possible instances)
3. Confidences: weighted uncertainty (example: six possible instances, each with a probability)

**Data Models for Uncertainty**

- Our model (so far) is not especially new
- We spent some time exploring the space of models for uncertainty [ICDE 2006]
- Tension between understandability and expressiveness
- Our model is understandable
- But it is not complete, or even closed under common operations

**Closure and Completeness**

- Completeness: can represent all sets of possible instances
- Closure: can represent results of operations
- Note: Completeness ⇒ Closure

**Model (so far) Not Closed**

- Suspects = πperson(Saw ⋈ Drives)
- Cannot be represented: a result table with only ‘?’ annotations does not correctly capture the possible instances in the result

**Lineage to the Rescue**

- Lineage: “where data came from”
  - Internal lineage
  - External lineage (not covered in this talk)
- In ULDBs: a function λ from alternatives to sets of alternatives (or external sources)

**Example with Lineage**

- Suspects = πperson(Saw ⋈ Drives)
- Correctly captures possible instances in the result
- λ(31) = (11,2), (21,2); tuple 31 is marked ‘?’
- λ(32,1) = (11,1), (22,1); λ(32,2) = (11,1), (22,2); tuple 32 is marked ‘?’
- λ(33) = (11,1), 23; tuple 33 is marked ‘?’

**ULDBs**

- Alternatives
- ‘?’ (Maybe) annotations
- Confidences
- Lineage

ULDBs are closed and complete

**Outline of the Talk**

- The ULDB data model
- Querying ULDBs
- ULDB properties
- Membership and extraction operations
- Confidences
- Current, related, and future work

**Querying ULDBs**

Query Q on ULDB D:

- Implementation of Q: D ⟶ D′ = D + Result
- Possible instances of D: D1, D2, …, Dn
- Q on each instance: Q(D1), Q(D2), …, Q(Dn)
- D′ is a representation of the instances Q(D1), Q(D2), …, Q(Dn)

**Well-Behaved ULDBs**

- If we start with a well-behaved ULDB and perform standard queries, it remains well-behaved
- Intuitively (details in paper):
  - Acyclic: no cycles in the lineage
  - Deterministic: non-empty lineages of distinct alternatives are distinct
  - Uniform: alternatives of the same tuple are derived from the same set of tuples

**ULDB Minimality**

- Data-minimality
  - Does every alternative appear in some possible instance? (no extraneous alternatives)
  - Is every maybe-tuple in R absent from some possible instance? (no extraneous ‘?’s)
- Lineage-minimality

**Data-Minimality Examples**

- Extraneous ‘?’: with λ(20,1) = (10,1) and λ(20,2) = (10,2), the marked ‘?’ is extraneous
- Extraneous alternative: [figure: example tables with one alternative marked extraneous]

**Data-Minimization**

- Extraneous alternative theorem: an alternative is extraneous iff it is (possibly transitively) derived from multiple alternatives of the same tuple
- Extraneous “?” theorem: a “?” on tuple t is extraneous iff
  - it is derived from base tuples without “?”
  - t has as many alternatives as the product of the numbers of alternatives in its base tuples
- Minimization algorithm based on the theorems (see paper)

**ULDB Properties and Operations**

[Diagram relating the operations Queries, Extraction, Membership, Data-minimize, and Lineage-minimize to the properties Data-minimal and Lineage-minimal]

**Membership Questions**

R has possible instances I1, I2, …, In

- Does a given tuple t appear in some (all) possible instance(s) of R?
  - Polynomial algorithms based on data-minimization
- Is a given table T one of (all of) the possible instances of R?
  - NP-hard

**Extraction**

- [Figure: example tables Drives, Saw, Eats, Suspects]
- Extraction algorithm in paper

**Outline of the Talk**

- The ULDB data model
- Querying ULDBs
- ULDB properties
- Membership and extraction operations
- Confidences
- Current, related, and future work

**Confidences**

- Confidences supplied with base data
- Trio computes confidences on query results
  - Default probabilistic interpretation
  - Can choose to plug in different arithmetic
- [Figure: example with confidences 0.3, 0.4, and 0.6, comparing Probabilistic and Min arithmetic]

**Query Processing with Confidences**

- Previous approach (probabilistic databases)
  - Each operator computes confidences during query execution
  - Only certain query plans allowed
- In ULDBs
  - Confidence of alternative A is a function of confidences in its transitive lineage
- Our approach: decouple data and confidence computation
  - Use any query plan for data computation
  - Compute confidences on demand using lineage
  - Can give arbitrarily large improvements

**Current Work: Algorithms**

- Algorithms: confidence computation, extraneous data, membership questions
  - Minimize lineage traversal
  - Memoization
  - Batch computations

**The Trio Trio**

- Data Model
  - ULDBs (Coming: incomplete relations; continuous uncertainty; correlation uncertainty)
- Query Language: TriQL
  - Simple extension to SQL
  - Query uncertainty, confidences, and lineage
- System
  - Did you see our demo?
  - Version 1: entirely on top of conventional DBMS
  - Surprisingly easy and complete, reasonably efficient

**Brief Related Work**

- Uncertainty
  - Modeling: C-tables [IL84], Probabilistic Databases [CP87], using Nested Relations [F90]
  - Systems: ProbView [LLRS97], MYSTIQ [BDM+05], ORION [CSP05], Trio [BDHW05]
- Lineage
  - DBNotes [CTV05], Data Warehouses [CW03]

**Thank You**

UNCERTAINTY · LINEAGE · DATA, but don’t forget the lineage…

Search “stanford trio” (or http://i.stanford.edu/trio)
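The possible-instances semantics from the "Uncertainty in a ULDB" slides (alternatives, ‘?’, confidences) can be illustrated with a small Python sketch. This is not Trio code: the relation encoding, the helper names `choices` and `possible_instances`, and Amy's 0.5/0.5 confidences are assumptions made for illustration, and tuples are assumed to be independent. Betty's confidences (0.5 and 0.3) come from the "Uncertainty" slide; because they sum to less than 1, the remaining 0.2 is treated as the weight of the tuple being absent, which plays the role of ‘?’.

```python
from itertools import product

ABSENT = None  # marker meaning "this tuple is not present in the instance"

# Each x-tuple is a list of (alternative, confidence) pairs.
saw = [
    # Hypothetical confidences for Amy; they sum to 1, so there is no '?'.
    [(("Amy", "Honda"), 0.5), (("Amy", "Toyota"), 0.5)],
    # Confidences from the slide; they sum to 0.8, so the tuple is a maybe-tuple.
    [(("Betty", "Acura"), 0.5), (("Betty", "Toyota"), 0.3)],
]

def choices(x_tuple):
    """Return the (value, probability) choices for one x-tuple.

    If the confidences sum to less than 1, an ABSENT choice carries the rest.
    """
    total = sum(conf for _, conf in x_tuple)
    options = list(x_tuple)
    if total < 1.0 - 1e-9:
        options.append((ABSENT, 1.0 - total))
    return options

def possible_instances(relation):
    """Enumerate (instance, probability) pairs, assuming independent x-tuples."""
    result = []
    for combo in product(*(choices(x) for x in relation)):
        instance = frozenset(value for value, _ in combo if value is not ABSENT)
        probability = 1.0
        for _, p in combo:
            probability *= p
        result.append((instance, probability))
    return result

instances = possible_instances(saw)
for instance, probability in sorted(instances, key=lambda ip: -ip[1]):
    print(sorted(instance), round(probability, 3))
```

With two alternatives for Amy and three choices for Betty (two alternatives or absent), the sketch enumerates six possible instances whose probabilities sum to 1.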
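The idea from "Query Processing with Confidences", computing a confidence on demand from an alternative's transitive lineage, can be sketched as follows. This is a simplified illustration, not the algorithm from the paper: it assumes purely conjunctive lineage (as produced by joins) and independent base tuples. The first four lineage entries are taken from the "Example with Lineage" slide, with single-alternative tuples 31, 33, and 23 written as `(31, 1)`, `(33, 1)`, and `(23, 1)`. The base confidences and the second-level alternatives `(41, 1)` and `(42, 1)` are hypothetical. Memoization, mentioned under "Current Work: Algorithms", is done here with `functools.lru_cache`.

```python
from functools import lru_cache
from math import prod

# Lineage function λ: derived alternative -> set of alternatives it came from.
LINEAGE = {
    (31, 1): {(11, 2), (21, 2)},
    (32, 1): {(11, 1), (22, 1)},
    (32, 2): {(11, 1), (22, 2)},
    (33, 1): {(11, 1), (23, 1)},
    # Hypothetical second-level alternatives, to exercise transitive lineage.
    (41, 1): {(31, 1), (33, 1)},
    (42, 1): {(32, 1), (33, 1)},
}

# Hypothetical confidences on base alternatives (not from the slides).
BASE_CONF = {
    (11, 1): 0.6, (11, 2): 0.4,
    (21, 2): 0.5,
    (22, 1): 0.3, (22, 2): 0.7,
    (23, 1): 0.8,
}

@lru_cache(maxsize=None)
def base_alternatives(alt):
    """Expand the transitive lineage of `alt` down to base alternatives."""
    if alt not in LINEAGE:  # base data has empty lineage
        return frozenset([alt])
    return frozenset().union(*(base_alternatives(a) for a in LINEAGE[alt]))

def confidence(alt, combine=prod):
    """Confidence of `alt` from its base alternatives.

    `combine` is the pluggable arithmetic: `prod` for the probabilistic
    interpretation, `min` for Min arithmetic.
    """
    base = base_alternatives(alt)
    tuple_ids = [tuple_id for tuple_id, _ in base]
    if len(tuple_ids) != len(set(tuple_ids)):
        # Derived from two alternatives of the same tuple: they can never
        # co-occur, so the alternative is extraneous.
        return 0.0
    return combine(BASE_CONF[a] for a in base)

print(confidence((31, 1)))               # product of 0.4 and 0.5
print(confidence((32, 1), combine=min))  # min of 0.6 and 0.3
print(confidence((41, 1)))               # uses both (11,1) and (11,2)
```

Expanding to base alternatives before combining matters for `(42, 1)`: both of its parents depend on `(11, 1)`, so multiplying the parents' confidences would count 0.6 twice, whereas the product over the distinct base alternatives counts it once.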