A new class of lineage expressions over probabilistic databases computable in PTIME

SUM 2013. Batya Kenig, Avigdor Gal, Ofer Strichman

Probabilistic Databases for managing uncertain data
• A variety of data sources generate incomplete, noisy and uncertain data (sensor networks, information extraction, data integration, …).
• Probabilistic databases enable storing and querying such data.
• A lot of research in recent years: MayBMS [Cornell], Trio [Stanford], SPROUT [Oxford], PrDB [U. Md]

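Tuple Independent Probabilistic Databases
Each tuple $t$ of the database $D$ is present independently with probability $p_t$. Each possible world $W \subseteq D$ is a standard database instance with probability $P(W) = \prod_{t \in W} p_t \cdot \prod_{t \in D \setminus W} (1 - p_t)$.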
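The possible-worlds view can be made concrete with a small sketch. It assumes a hypothetical three-tuple database (the names `t1`, `t2`, `t3` and their probabilities are invented for illustration) and computes each world's probability as the product of the probabilities of the tuples present and the complements of those absent.

```python
from itertools import combinations

def world_probability(world, tuple_probs):
    """Probability of one possible world under tuple independence."""
    p = 1.0
    for t, pt in tuple_probs.items():
        # A present tuple contributes p_t, an absent one contributes 1 - p_t.
        p *= pt if t in world else (1.0 - pt)
    return p

def possible_worlds(tuple_probs):
    """Yield every subset of the tuples (2^n worlds for n tuples)."""
    tuples = list(tuple_probs)
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)

# Hypothetical database: tuple identifiers mapped to marginal probabilities.
tuple_probs = {"t1": 0.5, "t2": 0.8, "t3": 0.1}
worlds = list(possible_worlds(tuple_probs))
total = sum(world_probability(w, tuple_probs) for w in worlds)
```

The world probabilities form a distribution, so `total` should equal 1 up to floating-point error. The number of worlds doubles with every tuple, which is why direct enumeration does not scale.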

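Query Semantics
• Let $q$ be a query evaluated against a probabilistic DB $D$.
• Consider the possible worlds $W$ of $D$ in which $q$ returns the answer tuple $t$.
• Sum the probabilities of the instances that return $t$: $P(t \in q(D)) = \sum_{W : t \in q(W)} P(W)$.
• Goal: efficiently evaluate $P(t \in q(D))$
• In time polynomial in $|D|$
• Not always possible; in general the problem is #P-hard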
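A brute-force reading of this semantics is sketched below. The relation names `R` and `S`, the tuples, and their probabilities are assumptions made for the example; the query is the Boolean join "there exist x, y with R(x) and S(x, y)". The function enumerates every world, so it is exponential in the number of tuples and only illustrates the definition.

```python
from itertools import product

def query_probability(tuple_probs, query):
    """Sum the probabilities of all possible worlds in which `query` holds."""
    tuples = list(tuple_probs)
    total = 0.0
    for bits in product([False, True], repeat=len(tuples)):
        world = {t for t, present in zip(tuples, bits) if present}
        weight = 1.0
        for t, present in zip(tuples, bits):
            weight *= tuple_probs[t] if present else 1.0 - tuple_probs[t]
        if query(world):
            total += weight
    return total

def q(world):
    """Boolean query: exists x, y such that R(x) and S(x, y) are both present."""
    r_values = {t[1] for t in world if t[0] == "R"}
    return any(t[0] == "S" and t[1] in r_values for t in world)

# Hypothetical tuple-independent instance.
tuple_probs = {
    ("R", "a"): 0.5,
    ("S", "a", "1"): 0.8,
    ("S", "a", "2"): 0.5,
}
p = query_probability(tuple_probs, q)
```

For this instance the result can be checked by hand: R(a) must be present (0.5) and at least one of the two S tuples must be present (1 - 0.2 · 0.5 = 0.9), giving 0.45.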

Probabilistic Inference for queries
• DalviSuciu04: Conjunctive queries (without self joins) are either:
• Safe queries: have query plans that run in PTIME on all DB instances
• Unsafe queries: data complexity is #P-hard
• However, even for unsafe queries there are DB instances that enable efficient computation

Why lineage?
• Each tuple is associated with a binary random variable
• The lineage of a query answer is a Boolean formula over these variables
• Compute the probability of this formula
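As a rough illustration of how a lineage formula arises, the sketch below builds the DNF lineage of the Boolean join "exists x, y: R(x), S(x, y)" and evaluates its probability by enumerating assignments. The relation contents and the variable names (`r1`, `s1`, …) are hypothetical; the enumeration is exponential and serves only to define the target value.

```python
from itertools import product

def lineage_of_boolean_join(R, S):
    """Return the DNF lineage of 'exists x, y: R(x), S(x, y)'.

    R maps x to the variable of tuple R(x); S maps (x, y) to the variable
    of tuple S(x, y). Each clause is a set of variables that must all be true.
    """
    return [frozenset({R[x], s_var}) for (x, y), s_var in S.items() if x in R]

def dnf_probability(dnf, probs):
    """Probability that a DNF over independent variables is true (brute force)."""
    variables = sorted(set().union(*dnf))
    total = 0.0
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if any(all(assignment[v] for v in clause) for clause in dnf):
            weight = 1.0
            for v in variables:
                weight *= probs[v] if assignment[v] else 1.0 - probs[v]
            total += weight
    return total

# Hypothetical instance: every tuple has probability 0.5.
R = {"a": "r1", "b": "r2"}
S = {("a", 1): "s1", ("a", 2): "s2", ("b", 1): "s3"}
lineage = lineage_of_boolean_join(R, S)   # r1 s1 OR r1 s2 OR r2 s3
probs = {v: 0.5 for v in ["r1", "r2", "s1", "s2", "s3"]}
p = dnf_probability(lineage, probs)
```

Here the lineage factors as r1 (s1 ∨ s2) ∨ r2 s3, so the value is 1 - (1 - 0.375)(1 - 0.25) = 0.53125.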

Why Lineage?
• Efficient computation [Roy2011, Sen2010]
• Safe plans for safe queries produce formulas in read-once form
• For an expression in read-once form, the probability can be computed in linear time [Olteanu&Huang2008]
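The linear-time claim for read-once expressions follows from independence: since every variable occurs once, the children of each AND/OR node are independent. A minimal sketch, assuming a hypothetical tuple-based tree encoding (`("var", name)`, `("and", children)`, `("or", children)`) that is not taken from the cited papers:

```python
def read_once_probability(expr, probs):
    """Probability of a read-once formula over independent variables.

    Each variable occurs at most once, so sibling subformulas are
    independent and one bottom-up pass over the tree suffices.
    """
    kind = expr[0]
    if kind == "var":
        return probs[expr[1]]
    child_probs = [read_once_probability(child, probs) for child in expr[1]]
    if kind == "and":
        result = 1.0
        for p in child_probs:
            result *= p
        return result
    if kind == "or":
        none_true = 1.0
        for p in child_probs:
            none_true *= 1.0 - p
        return 1.0 - none_true
    raise ValueError(f"unknown node kind: {kind}")

# x1 AND (y1 OR y2), with hypothetical probabilities.
expr = ("and", [("var", "x1"), ("or", [("var", "y1"), ("var", "y2")])])
probs = {"x1": 0.5, "y1": 0.8, "y2": 0.5}
p = read_once_probability(expr, probs)
```

Each node is visited once, so the running time is linear in the size of the formula. The same recursion would give wrong answers on a formula where a variable repeats, because the independence assumption then fails.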

Unsafe query
• The lineage need not be read-once
• Solutions:
• Jha&Suciu11: Compile to decision diagram
• Jha&Suciu12: Exponential in pathwidth, double exponential in expression pathwidth
• We will show how to compute the probability of disjoint branch lineage expressions in PTIME

Lineage as a hypergraph
[Figure: the primal graph and the hyperedges of a lineage expression]
• In general, expanding a formula to its DNF can lead to an exponential blowup.
• For SPJ queries without self joins, the primal graph can be generated directly from the formula [Roy 2011].
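To make the hypergraph view concrete, the sketch below treats each DNF clause as a hyperedge and builds the primal graph by connecting every pair of variables that share a clause. This uses the standard definition of a primal graph and starts from an explicit DNF; it is not the direct construction of [Roy 2011], and the variable names are hypothetical.

```python
from itertools import combinations

def primal_graph(dnf):
    """Build the primal graph of a DNF whose clauses are sets of variables.

    Vertices are the variables; two variables are adjacent iff they occur
    together in at least one clause (hyperedge).
    """
    vertices = set().union(*dnf)
    edges = set()
    for clause in dnf:
        for u, v in combinations(sorted(clause), 2):
            edges.add((u, v))  # stored with u < v to avoid duplicates
    return vertices, edges

# Hypothetical lineage: r1 s1 OR r1 s2 OR r2 s3.
dnf = [{"r1", "s1"}, {"r1", "s2"}, {"r2", "s3"}]
vertices, edges = primal_graph(dnf)
```

For clauses of size two the primal graph has exactly one edge per clause; larger clauses become cliques.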

Junction trees for lineages
• Let $H = (V, E)$ be a hypergraph.
• $H$ is acyclic iff it has a junction tree $T = (E, A)$, whose nodes are the hyperedges of $H$ [Beeri et al 1981]
• The junction tree property: for every $v \in V$, the set of nodes in the tree that contain $v$ induces a (connected) subtree.
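The junction tree property can be checked mechanically. The sketch below assumes the tree is given as a list of variable sets (one per node) plus a list of index pairs for the tree edges; it does not verify that the edges actually form a tree. The function name and the example data are hypothetical.

```python
from collections import deque

def has_junction_tree_property(nodes, edges):
    """Check that, for every variable, the nodes containing it are connected.

    nodes: list of sets of variables; edges: list of (i, j) index pairs
    that are assumed to form a tree over the nodes.
    """
    adjacency = {i: set() for i in range(len(nodes))}
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    for v in set().union(*nodes):
        holders = {i for i, node in enumerate(nodes) if v in node}
        # Breadth-first search restricted to the nodes that contain v.
        start = next(iter(holders))
        seen = {start}
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in adjacency[i] & holders:
                if j not in seen:
                    seen.add(j)
                    queue.append(j)
        if seen != holders:
            return False
    return True

good = has_junction_tree_property(
    [{"a", "b"}, {"b", "c"}, {"c", "d"}], [(0, 1), (1, 2)])
bad = has_junction_tree_property(
    [{"a", "b"}, {"c", "d"}, {"b", "c"}], [(0, 1), (1, 2)])
```

In the second tree the variable `b` occurs in nodes 0 and 2 but not in node 1, which lies between them, so the property fails.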

Junction trees for lineages

Background: junction trees for probabilistic inference
[Figure: a junction tree with its nodes and separators labeled]
• Each node and separator stores a joint pdf (a factor) over its variables
• Send messages towards a given root node
• Messages are passed by multiplication of factor entries
• Once the root node has received messages from all of its neighbors, its factor holds the marginal of the joint probability distribution of the entire variable set.
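A minimal sketch of one message-passing step, assuming a hypothetical two-node junction tree with cliques (A, B) and (B, C), separator B, and invented conditional probabilities. Factors are represented as a pair of a variable tuple and a table over 0/1 assignments; this representation is an assumption of the example, not the one used in the talk.

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two factors over binary variables."""
    f_vars, f_table = f
    g_vars, g_table = g
    out_vars = f_vars + tuple(v for v in g_vars if v not in f_vars)
    table = {}
    for assignment in product([0, 1], repeat=len(out_vars)):
        env = dict(zip(out_vars, assignment))
        table[assignment] = (f_table[tuple(env[v] for v in f_vars)]
                             * g_table[tuple(env[v] for v in g_vars)])
    return out_vars, table

def sum_out(f, var):
    """Marginalize one variable out of a factor."""
    f_vars, f_table = f
    idx = f_vars.index(var)
    table = {}
    for assignment, value in f_table.items():
        key = assignment[:idx] + assignment[idx + 1:]
        table[key] = table.get(key, 0.0) + value
    return f_vars[:idx] + f_vars[idx + 1:], table

# Clique (A, B) holds P(A) * P(B | A); clique (B, C) holds P(C | B).
phi_ab = (("A", "B"), {(0, 0): 0.32, (0, 1): 0.08, (1, 0): 0.18, (1, 1): 0.42})
phi_bc = (("B", "C"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9})

message = sum_out(phi_ab, "A")        # message over the separator B
root = multiply(phi_bc, message)      # root factor now holds P(B, C)
p_c1 = sum(v for (b, c), v in root[1].items() if c == 1)
```

After the message arrives, the root factor is the marginal P(B, C) of the full joint distribution. Each table has one entry per assignment of its variables, which is the source of the exponential dependence on factor size.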

(Naïve) Junction Tree Algorithm for lineage computation
PROBLEM: The JT algorithm runs in time exponential in the number of variables in the largest factor.
→ Restricted to lineage expressions that can be efficiently represented using a junction tree (i.e., low treewidth)

Take advantage of Junction Tree structure
Rooted Directed Path Graphs [Gavril1975]: A graph $G$ is a rooted directed path graph (RDPG) iff there exists a rooted directed junction tree $T$ of $G$ such that for every vertex $v$ of $G$, the set of nodes of $T$ that contain $v$ forms a directed path of $T$.
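The directed-path condition can be tested on a given rooted tree. The sketch below assumes the tree is supplied as a list of variable sets and a parent array (with `None` for the root); both the encoding and the function name are hypothetical. It checks a candidate tree only and does not search for one.

```python
def is_directed_path_tree(nodes, parent):
    """Check that each variable's nodes form a single root-to-descendant chain.

    nodes: list of sets of variables; parent[i] is the index of the parent
    of node i, or None for the root.
    """
    for v in set().union(*nodes):
        holders = {i for i, node in enumerate(nodes) if v in node}
        # Exactly one node of the set may lack a parent inside the set;
        # this makes the set connected with a unique topmost node.
        tops = [i for i in holders if parent[i] not in holders]
        if len(tops) != 1:
            return False
        # No branching: each node has at most one child inside the set.
        for i in holders:
            children_inside = [j for j in holders if parent[j] == i]
            if len(children_inside) > 1:
                return False
    return True

# Root {a, b} with children {b, c} and {a, d}: every variable lies on a chain.
path_like = is_directed_path_tree(
    [{"a", "b"}, {"b", "c"}, {"a", "d"}], [None, 0, 0])
# Root {a, b} with children {b, c} and {b, d}: b branches below the root.
branching = is_directed_path_tree(
    [{"a", "b"}, {"b", "c"}, {"b", "d"}], [None, 0, 0])
```

The second tree still satisfies the ordinary junction tree property, since the nodes containing `b` are connected, but they form a branching subtree rather than a directed path.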

Disjoint Branch Junction Trees (DBJT)

Use compact factors
We would ultimately like to calculate the entry probabilities. Their sum is exactly the probability of the lineage expression.

The Algorithm
This can be done due to the disjoint branch property.

Projection/Marginalization
• Sending a message involves summing out variables in the factor
• After summing out, the factor entries are no longer mutually exclusive, which prevents subsequent projections.

Projection/Marginalization
• Solution: Perform marginalization by repeatedly projecting out only the last (rightmost) variable.
• Requires ordering the message variables before those to be summed out.
• Due to the junction tree property this is always possible.

Complexity Analysis
• Let … be the size of the largest factor
• Each node can have at most … children
• Therefore, each entry in the factor is updated at most … times.
• Overall: …

Conclusions
• Define disjoint branch lineage expressions
• Provide an algorithm for computing the probability of disjoint branch lineage expressions in PTIME

Future Work
• Are there other structural properties of junction trees that can facilitate efficient probabilistic inference?
• Real data is correlated
• Drop the tuple-independence assumption
• Characterize queries and DB instances which induce lineage with “efficient” junction trees.

Thank You
