
A new class of lineage expressions over probabilistic databases computable in PTIME

A new class of lineage expressions over probabilistic databases computable in PTIME. SUM 2013 Batya Kenig Avigdor Gal Ofer Strichman. Probabilistic Databases for managing uncertain data.


Presentation Transcript


  1. A new class of lineage expressions over probabilistic databases computable in PTIME. SUM 2013. Batya Kenig, Avigdor Gal, Ofer Strichman

  2. Probabilistic Databases for managing uncertain data • A variety of data sources generate incomplete, noisy and uncertain data (sensor networks, information extraction, data integration…). • Probabilistic databases enable storing and querying such data. • A lot of research in recent years: MayBMS [Cornell], Trio [Stanford], SPROUT [Oxford], PrDB [U.Md]

  3. Tuple Independent Probabilistic Databases Each possible world is a standard database instance with probability

  4. Query Semantics • Let a query be evaluated against a probabilistic DB. • Consider the possible worlds in which the query returns a given answer. • Sum the probabilities of the instances that return that answer. • Goal: efficiently evaluate this probability • In time polynomial in the size of the database • Not always possible; in general the problem is #P-hard
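
The possible-worlds semantics on this slide can be illustrated with a brute-force sketch. It assumes a tuple-independent database given as a map from tuple ids to marginal probabilities; the function name, the tuple ids and the example query are hypothetical and not taken from the talk. The loop enumerates every world, so it is exponential in the number of tuples and only shows what is being computed, not how to compute it efficiently.

```python
from itertools import product

def query_probability(tuple_probs, query):
    """Sum the probabilities of all possible worlds in which `query` holds.

    tuple_probs: dict mapping tuple id -> marginal probability (tuples independent).
    query: function taking the set of present tuple ids and returning a bool.
    """
    ids = list(tuple_probs)
    total = 0.0
    # Each world fixes, for every tuple, whether it is present or absent.
    for presence in product([False, True], repeat=len(ids)):
        world = {t for t, present in zip(ids, presence) if present}
        weight = 1.0
        for t, present in zip(ids, presence):
            weight *= tuple_probs[t] if present else 1.0 - tuple_probs[t]
        if query(world):
            total += weight
    return total

# Hypothetical instance: the answer exists if r1 joins with s1 or with s2.
probs = {"r1": 0.5, "s1": 0.4, "s2": 0.5}
answer = lambda world: "r1" in world and ("s1" in world or "s2" in world)
p = query_probability(probs, answer)
```

With three tuples the loop visits 8 worlds; each additional tuple doubles that count, which is why the slide asks for an evaluation that is polynomial in the database size.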

  5. Probabilistic Inference for queries • Dalvi&Suciu04: Conjunctive queries (without self joins) are either: • Safe queries: Have query plans that run in PTIME on all DB instances • Unsafe queries: Data complexity is #P-hard • However, even for unsafe queries there are DB instances that enable efficient computation

  6. Why lineage? • Each tuple is associated with a binary random variable • Compute the probability of this formula

  7. Why lineage? Efficient computation [Roy2011, Sen2010] • Safe plans for safe queries produce formulas in read-once form • Expression in read-once form • Linear-time probability computation [Olteanu&Huang2008]
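
Why read-once form gives linear-time computation can be seen in a short sketch. It assumes independent variables and a formula given as a nested tuple tree in which every variable occurs exactly once; this representation and the names are illustrative assumptions, not the data structure of the cited papers. Because no variable is shared, the children of every AND and OR node are independent, so one bottom-up pass is enough.

```python
from math import prod

def read_once_probability(expr, probs):
    """Probability of a read-once formula over independent variables.

    expr is either a variable name (str) or a pair (op, children)
    with op in {"and", "or"}; each variable must occur only once.
    """
    if isinstance(expr, str):
        return probs[expr]
    op, children = expr
    child_probs = [read_once_probability(c, probs) for c in children]
    if op == "and":
        # Independent conjuncts: multiply.
        return prod(child_probs)
    if op == "or":
        # Independent disjuncts: complement of "all are false".
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical lineage x1 AND (y1 OR y2).
formula = ("and", ["x1", ("or", ["y1", "y2"])])
var_probs = {"x1": 0.5, "y1": 0.4, "y2": 0.5}
p_read_once = read_once_probability(formula, var_probs)
```

Each node of the formula tree is visited once, so the running time is linear in the size of the expression.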

  8. Unsafe query: the lineage is not read-once • Solutions: • Jha&Suciu11: Compile to decision diagram • Jha&Suciu12: Exponential in pathwidth, double exponential in expression pathwidth • We will show how to compute the probability of disjoint branch lineage expressions in PTIME

  9. Lineage as a hypergraph Primal Graph Hyperedges • In general, expanding a formula to its DNF form can lead to an exponential blowup. • For SPJ queries without self joins, the primal graph can be generated directly from the formula [Roy 2011].
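
A minimal sketch of the hypergraph view follows, assuming the lineage is already available as a list of DNF clauses, each clause being one hyperedge over tuple variables. This is the direct construction from an expanded DNF, not the method of [Roy 2011] that avoids the expansion; the function and variable names are hypothetical.

```python
from itertools import combinations

def primal_graph(hyperedges):
    """Return the edge set of the primal graph of a hypergraph.

    Two variables are adjacent iff they occur together in some hyperedge.
    Edges are returned as sorted pairs so each edge appears once.
    """
    edges = set()
    for hyperedge in hyperedges:
        for u, v in combinations(sorted(hyperedge), 2):
            edges.add((u, v))
    return edges

# Hypothetical DNF lineage: x1 y1 OR x1 y2 OR x2 y2.
dnf_clauses = [{"x1", "y1"}, {"x1", "y2"}, {"x2", "y2"}]
primal_edges = primal_graph(dnf_clauses)
```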

  10. Junction trees for lineages • Consider a hypergraph. • A hypergraph is acyclic iff it has a junction tree [Beeri et al 1981] • The junction tree property: for every vertex, the set of nodes in the tree that contain it induces a connected subtree.
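
The junction tree property can be checked mechanically. The sketch below assumes the tree is given as a map from tree nodes to their variable sets plus a list of undirected tree edges; these names and the input format are illustrative assumptions. For each variable it runs a breadth-first search restricted to the nodes that contain the variable and verifies that all of them are reached.

```python
from collections import defaultdict, deque

def has_junction_tree_property(bags, tree_edges):
    """bags: dict node -> set of variables; tree_edges: list of (node, node) pairs."""
    adjacency = defaultdict(set)
    for a, b in tree_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    variables = set().union(*bags.values())
    for var in variables:
        holders = {n for n, bag in bags.items() if var in bag}
        start = next(iter(holders))
        seen = {start}
        queue = deque([start])
        # Walk only through nodes that also contain `var`.
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor in holders and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        if seen != holders:
            return False  # the nodes containing `var` are not connected
    return True

# Hypothetical hyperedges placed on a path-shaped tree.
bags = {1: {"x1", "y1"}, 2: {"x1", "y2"}, 3: {"x2", "y2"}}
ok = has_junction_tree_property(bags, [(1, 2), (2, 3)])
broken = has_junction_tree_property(bags, [(1, 3), (3, 2)])
```

In the second call the two nodes containing x1 are separated by node 3, which does not contain x1, so the property fails.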

  11. Junction trees for lineages

  12. Background: junction trees for probabilistic inference • Each node and each separator stores a joint pdf • Send messages towards a given root node • Messages are passed by multiplication of factor entries • Once the root node has received messages from all of its neighbors, its factor holds the marginal of the joint probability distribution of the entire variable set.

  13. (Naïve) Junction Tree Algorithm for lineage computation • PROBLEM: The JT algorithm runs in time that is exponential in the size of the largest factor. • It is therefore restricted to lineage expressions that can be efficiently represented using a junction tree (i.e., low treewidth)

  14. Take advantage of Junction Tree structure • Rooted Directed Path Graphs [Gavril1975]: A graph is a rooted directed path graph (RDPG) iff there exists a rooted directed junction tree such that, for every vertex, the set of nodes that contain it forms a directed path of the tree
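
The directed-path condition can be tested on a given rooted tree. The sketch below reads "directed path" as a chain running from an ancestor down to a descendant, and assumes the tree is supplied as a child-to-parent map with the root absent from the map; both the reading and the input format are assumptions made for illustration. For each variable, the nodes containing it must have exactly one topmost node, and no node may have two children that also contain the variable.

```python
def forms_directed_paths(bags, parent):
    """bags: dict node -> set of variables; parent: dict child -> parent (root omitted)."""
    variables = set().union(*bags.values())
    for var in variables:
        holders = {n for n, bag in bags.items() if var in bag}
        # A "top" is a holder whose parent does not contain the variable.
        tops = [n for n in holders if parent.get(n) not in holders]
        if len(tops) != 1:
            return False  # the holders are not connected
        holder_children = {}
        for n in holders:
            p = parent.get(n)
            if p in holders:
                holder_children[p] = holder_children.get(p, 0) + 1
        if any(count > 1 for count in holder_children.values()):
            return False  # the holders branch, so they form a subtree, not a path
    return True

# Hypothetical tree: root 1 with children 2 and 3.
tree_parent = {2: 1, 3: 1}
path_like = forms_directed_paths({1: {"a", "b"}, 2: {"a"}, 3: {"b"}}, tree_parent)
branching = forms_directed_paths({1: {"a"}, 2: {"a"}, 3: {"a"}}, tree_parent)
```

The second example still satisfies the ordinary junction tree property, since the nodes containing "a" are connected, but it violates the stricter directed-path condition because "a" continues into two different branches.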

  15. Disjoint Branch Junction Trees (DBJT)

  16. Use compact factors We would ultimately like to calculate the entry probabilities. Their sum is exactly

  17. The Algorithm This can be done due to the disjoint branch property. =

  18. Projection/Marginalization • Sending a message involves summing out variables in the factor • After summing out, the factor entries are no longer mutually exclusive, which disables subsequent projections.

  19. Projection/Marginalization • Solution: Perform marginalization by repeatedly projecting out only the last (rightmost) var. • Requires ordering message-vars before those to be summed out. • Due to the junction tree property this is always possible.

  20. Complexity Analysis • Let be the size of the largest factor • Each node can have at most children • Therefore, each entry in the factor is updated at most times. • Overall

  21. Conclusions • Define disjoint branch lineage expressions • Provide an algorithm for computing the probability of disjoint branch lineage expressions in PTIME

  22. Future Work • Are there other structural properties of junction trees that can facilitate efficient probabilistic inference? • Real data is correlated • Drop the tuple-independence assumption • Characterize queries and DB instances which induce lineage with “efficient” junction trees.

  23. Thank You
