Senlin Liang and Michael Kifer Stony Brook University

Deriving Predicate Statistics (SDP) in Datalog. Principles and Practice of Declarative Programming, 12th International ACM SIGPLAN Symposium, July 26, 2010, Hagenberg, Austria. Senlin Liang and Michael Kifer, Stony Brook University.

Presentation Transcript


  1. Deriving Predicate Statistics (SDP) in Datalog. Principles and Practice of Declarative Programming, 12th International ACM SIGPLAN Symposium, July 26, 2010, Hagenberg, Austria. Senlin Liang and Michael Kifer, Stony Brook University

  2. Summary of Our Approach
     • Motivation
       • Take advantage of cost-based optimizations in deductive database systems
       • Compute cost information (predicate statistics)
       • Store and retrieve cost information efficiently
       • Apply optimization techniques
     • Advantages of our approach
       • Keeps argument dependencies
       • Handles recursion
       • Handles negation
     “Deriving Predicate Statistics in Datalog” by Senlin Liang and Michael Kifer

  3. Outline
     • Introduction
       • Traditional approach: histograms + argument independence assumption
       • Error grows exponentially
     • SDP
       • Dependency matrix stores predicate statistics
       • Abstract interpretation of Datalog rules, which are evaluated over dependency matrices
     • Experimental studies
     • Future work

  4. Histograms
     • Data distribution: T = ((v1, f1), …, (vn, fn)), a list of (value, frequency) pairs
       • E.g. ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
     • Histograms
       • Partition the data distribution into groups
       • Summarize each group as a bucket: (floor, ceiling, size, count)
       • Compute the values and frequencies in each bucket efficiently
     • MaxDiff histograms with β buckets
       • Partition T at the β-1 largest frequency differences between adjacent values

  5. Example: MaxDiff Histograms (3 buckets)
     1. Partition T using the 2 largest frequency differences
        T = ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
        Frequency differences between adjacent values: 1, 1, 2, 2, 1, 0
     2. Summarize each group as (floor, ceiling, size, count)
        (2,4,3,4) (5,5,1,3) (6,8,3,5)
     3. Value-frequency approximation
        vals(bucket) = [floor, ceiling]; f(val) = count/size, e.g. f(7) = 5/3
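
The partitioning on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `maxdiff_histogram` is hypothetical, `size` is assumed to be the number of distinct values in a bucket, and ties between equal differences are assumed to be broken in favor of the earlier position.

```python
def maxdiff_histogram(dist, beta):
    """Build a MaxDiff histogram with `beta` buckets.

    dist: list of (value, frequency) pairs sorted by value.
    Returns buckets as (floor, ceiling, size, count) tuples.
    """
    # Frequency differences between adjacent values.
    diffs = [abs(dist[i + 1][1] - dist[i][1]) for i in range(len(dist) - 1)]
    # Positions of the beta-1 largest differences (assumption: ties go to
    # the earlier position), put back in left-to-right order.
    by_size = sorted(range(len(diffs)), key=lambda i: (-diffs[i], i))
    cuts = sorted(by_size[: beta - 1])

    buckets = []
    start = 0
    for cut in cuts + [len(dist) - 1]:
        group = dist[start : cut + 1]
        floor, ceiling = group[0][0], group[-1][0]
        size = len(group)                 # number of distinct values
        count = sum(f for _, f in group)  # total frequency
        buckets.append((floor, ceiling, size, count))
        start = cut + 1
    return buckets


T = [(2, 1), (3, 2), (4, 1), (5, 3), (6, 1), (7, 2), (8, 2)]
print(maxdiff_histogram(T, 3))
```

For the distribution T above and 3 buckets, this yields the buckets shown on the slide, and the approximation f(7) is the last bucket's count divided by its size.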

  6. Argument Independence Assumption
     • Common in database size estimates
     • Data distributions of different arguments are independent of each other
       • For example, in predicate p(X,Y), the data distributions of X and Y are independent
     • Joint data distribution can be easily computed from individual distributions
       • E.g., p(X=a, Y=b) = p(X=a) × p(Y=b)
     • Unfortunately, the independence assumption is almost always wrong in real datasets

  7. Example: Histogram + Independence = Poor Estimate
     • answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
     • Facts: e(2,2), … as in Example 1 of the paper.
     • Histogram buckets of e
       • X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
       • Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
     • Size estimate
       • Answer size estimate for each bucket: size(answer) = |[floor, ceiling] ∩ [5,7]| / |[floor, ceiling]| × count
       • Summed over the X buckets: size(answer) = 0 + 3 + 3.33 = 6.33
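
The per-bucket estimate can be checked with a short Python sketch. The helper names are hypothetical, and the interval width |[floor, ceiling]| is assumed to be `ceiling - floor + 1` over an integer domain, as the slide's arithmetic suggests.

```python
def overlap_fraction(floor, ceiling, lo, hi):
    """Fraction of the integer interval [floor, ceiling] that lies in [lo, hi]."""
    width = ceiling - floor + 1
    covered = max(0, min(ceiling, hi) - max(floor, lo) + 1)
    return covered / width


def estimate_selection_size(buckets, lo, hi):
    """Sum the per-bucket estimates for the selection lo <= X <= hi."""
    return sum(
        overlap_fraction(floor, ceiling, lo, hi) * count
        for floor, ceiling, _size, count in buckets
    )


x_buckets = [(2, 4, 3, 4), (5, 5, 1, 3), (6, 8, 3, 5)]
print(round(estimate_selection_size(x_buckets, 5, 7), 2))
```

The three buckets contribute 0, 3, and 2/3 × 5, which rounds to the 6.33 on the slide.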

  8. Example: Histogram + Independence = Poor Estimate
     • Histogram buckets of e
       • X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
       • Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
     • Histogram buckets of answer
       • X: (5,5,1,3) (6,7,2,3.33)
       • Y: (1,1,1,0.53) (2,4,3,1.58) (5,8,4,4.22)
       • answer.count = e.count × size(answer)/size(e)
     • Real results for answer.Y
       • (1,1,1,0) (2,4,3,0) (5,8,4,6)
     • Independence causes information loss
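
The Y buckets of answer follow from scaling every Y bucket of e by the same factor size(answer)/size(e). A minimal sketch, with the hypothetical helper `scale_buckets` and size(e) taken as the sum of the bucket counts (12 here):

```python
def scale_buckets(buckets, answer_size):
    """Scale each bucket count by answer_size / size(e), as independence implies."""
    total = sum(count for _floor, _ceiling, _size, count in buckets)
    return [
        (floor, ceiling, size, count * answer_size / total)
        for floor, ceiling, size, count in buckets
    ]


y_buckets_e = [(1, 1, 1, 1), (2, 4, 3, 3), (5, 8, 4, 8)]
answer_size = 3 + 5 * 2 / 3  # the unrounded selection estimate, about 6.33

for floor, ceiling, size, count in scale_buckets(y_buckets_e, answer_size):
    print((floor, ceiling, size, round(count, 2)))
```

Every Y bucket keeps a nonzero share, whereas the real answer has no facts in the first two Y buckets. That is the information loss the slide refers to.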

  9. Our Approach: Dependency Matrices
     • Only considers dependency matrices (DM) for binary predicates
     • Partitions facts into local groups
     • Sums up the groups into DM values
     • Sums up each row/column into (floor, ceiling, size)

  10. Example: DM
      • Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
      • Partition the fact matrix using MaxDiff
      • Sum up partitions into matrix values

  11. Example: DM
      • Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
      • Partition the fact matrix using MaxDiff
      • Sum up partitions into matrix values
      • Sum up each row/column into (floor, ceiling, size)
        • X: (2,4,3) (5,5,1) (6,8,3)
        • Y: (1,1,1) (2,4,3) (5,8,4)
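
A rough Python sketch of these steps follows. It assumes the partition boundaries are already known (for example from MaxDiff on each argument) and are passed in as value ranges. The function names and the small fact set are hypothetical, since the transcript does not reproduce the facts of the paper's Example 1.

```python
def group_index(value, groups):
    """Index of the (lo, hi) range that contains value."""
    for k, (lo, hi) in enumerate(groups):
        if lo <= value <= hi:
            return k
    raise ValueError(f"{value} is not covered by any group")


def summarize(values, groups):
    """(floor, ceiling, size) per group; size = number of distinct values seen."""
    summary = []
    for lo, hi in groups:
        present = sorted({v for v in values if lo <= v <= hi})
        if present:
            summary.append((present[0], present[-1], len(present)))
        else:
            summary.append((lo, hi, 0))
    return summary


def dependency_matrix(facts, x_groups, y_groups):
    """Count the facts p(x, y) that fall into each (X group, Y group) cell."""
    dm = [[0] * len(y_groups) for _ in x_groups]
    for x, y in facts:
        dm[group_index(x, x_groups)][group_index(y, y_groups)] += 1
    xs = summarize([x for x, _ in facts], x_groups)
    ys = summarize([y for _, y in facts], y_groups)
    return dm, xs, ys


# Hypothetical facts, for illustration only.
facts = [(2, 1), (3, 2), (5, 5), (5, 6), (7, 8)]
dm, xs, ys = dependency_matrix(facts, [(2, 4), (5, 5), (6, 8)], [(1, 1), (2, 4), (5, 8)])
print(dm)
print(xs, ys)
```

Each cell records how many facts pair an X group with a Y group, which is the dependency information that two separate one-dimensional histograms discard.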

  12. SDP for Selection by Example
      • answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
      • From the fact matrix, we know that size(answer) = Σ F(i,j) for 5 ≤ i ≤ 7 = 6

  13. SDP for Selection by Example
      • answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
      • Extract the portions covered by the selection
      • Recompute matrix values
      • Sum them up as size(answer) = 3 + .67 + .67 + 2 = 6.34
      • For each row, recompute (floor, ceiling, size)
      [Figure: DM of e with X groups (2,4,3) (5,5,1) (6,8,3) and Y groups (1,1,1) (2,4,3) (5,8,4); DM of answer with X groups (5,5,1) (6,7,2) and the same Y groups]
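
The selection step can be sketched as follows. The matrix values are not present in this transcript, so the `dm_e` below is an illustrative matrix chosen to agree with the bucket totals of e and with the sum 3 + .67 + .67 + 2 on the slide; treat it as an assumption. The function name `select_x_range` is hypothetical, and the scaling assumes values are spread evenly over each integer interval.

```python
def select_x_range(dm, x_groups, lo, hi):
    """Restrict a dependency matrix to lo <= X <= hi.

    Rows are X groups given as (floor, ceiling, size). Each row is scaled by
    the fraction of its interval covered by [lo, hi].
    """
    new_dm, new_groups = [], []
    for (floor, ceiling, size), row in zip(x_groups, dm):
        a, b = max(floor, lo), min(ceiling, hi)
        if a > b:
            continue  # row lies entirely outside the selection
        frac = (b - a + 1) / (ceiling - floor + 1)
        new_groups.append((a, b, size * frac))
        new_dm.append([value * frac for value in row])
    return new_dm, new_groups


x_groups = [(2, 4, 3), (5, 5, 1), (6, 8, 3)]
# Illustrative values only (assumption), columns are Y groups (1,1,1) (2,4,3) (5,8,4).
dm_e = [[0, 2, 2],
        [0, 0, 3],
        [1, 1, 3]]

dm_answer, groups_answer = select_x_range(dm_e, x_groups, 5, 7)
print(groups_answer)
print(sum(sum(row) for row in dm_answer))
```

Unrounded, the cells sum to about 6.33; the slide's 6.34 comes from adding the rounded cell values. Unlike the histogram estimate, the result keeps a per-cell breakdown over the Y groups.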

  14. Example: Sort-Merge-Join
      • answer(X,Z) :- a(X,Y), b(Y,Z).
      • middle(X,Y,Z) is introduced for ease of explanation
      • a: …, a(4,3), a(4,4), …
      • b: …, b(3,1), b(3,5), b(4,5), …
      • middle: (4,3,1), (4,3,5), (4,4,5), …
      • answer: (4,1), (4,5), (4,5), … (duplicates!)

  15. SDP for Join by Example
      • answer(X,Z) :- a(X,Y), b(Y,Z).
      • Simulate Sort-Merge-Join
      [Figure: dependency matrices of a and b, aligned]
      A.X, A.Y, A.Val
      (2,4,3), (1,1,1), 2
      (2,4,3), (2,4,2), 4
      B.Y, B.Z, B.Val
      (1,1,1), (6,8,2), 1
      (2,4,2), (6,8,2), 3

  16. SDP for Join by Example
      • answer(X,Z) :- a(X,Y), b(Y,Z).
      A.X, A.Y, A.Val
      (2,4,3), (1,1,1), 2
      (2,4,3), (2,4,2), 4
      B.Y, B.Z, B.Val
      (1,1,1), (6,8,2), 1
      (2,4,2), (6,8,2), 3
      • Result size of middle(X,Y,Z) can be estimated as min(A.Y.size, B.Y.size) × (A.Val/A.Y.size) × (B.Val/B.Y.size)
      • Examples:
        • size(middle((2,4,3),(1,1,1),(6,8,2))) ~ min(1,1) × (2/1) × (1/1) = 2
        • size(middle((2,4,3),(2,4,2),(6,8,2))) ~ min(2,2) × (4/2) × (3/2) = 6
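
The estimate for one aligned pair of DM entries translates directly into code. The function name and argument names below are hypothetical; the formula is the one on the slide.

```python
def estimate_middle_size(a_y_size, a_val, b_y_size, b_val):
    """Estimated number of middle(X,Y,Z) tuples for one aligned pair of entries.

    a_y_size, b_y_size: number of distinct Y values in the aligned groups.
    a_val, b_val: DM values (fact counts) of the two entries.
    """
    matching_values = min(a_y_size, b_y_size)
    a_per_value = a_val / a_y_size  # average a-facts per Y value
    b_per_value = b_val / b_y_size  # average b-facts per Y value
    return matching_values * a_per_value * b_per_value


# The two aligned pairs from the slide.
print(estimate_middle_size(1, 2, 1, 1))
print(estimate_middle_size(2, 4, 2, 3))
```

The first factor bounds how many Y values can match, and the other two give the average fan-out on each side of the join.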

  17. SDP for Join by Example
      • answer(X,Z) :- a(X,Y), b(Y,Z).
      • Examples:
        • middle((2,4,3),(1,1,1),(6,8,2)) → answer((2,4,3),(6,8,2))
        • middle((2,4,3),(2,4,2),(6,8,2)) → answer((2,4,3),(6,8,2))
        • Both middle entries map to the same answer entry: duplicates!
      • Three duplicate handling approaches
        • Sum: no duplicate removal
        • Max: most aggressive removal
        • Expected sum: remove “expected” number of duplicates
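
The first two approaches can be illustrated with the two middle estimates from the previous slide (2 and 6), which both land in the same answer entry. This is only a sketch with hypothetical function names. The expected-sum variant is left out because its formula is not given in this transcript.

```python
def combine_sum(contributions):
    """Sum: treat every middle tuple as a distinct answer (no duplicate removal)."""
    return sum(contributions)


def combine_max(contributions):
    """Max: keep only the largest contribution (most aggressive removal)."""
    return max(contributions)


# Estimated sizes of the two middle entries that map to answer((2,4,3),(6,8,2)).
contributions = [2.0, 6.0]
print(combine_sum(contributions), combine_max(contributions))
```

Sum gives an upper estimate and Max a lower one for this entry, so an expected-sum estimate would presumably fall between the two.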

  18. SDP for Recursive Predicates
      • Recursive predicates are computed incrementally until they reach approximate fixed points
      • Size reaches an α-approximate fixed point if Δ(size)/size ≤ α, where
        • Δ(…) is the difference between two consecutive iterations in the fixed point computation
        • 0 ≤ α ≤ 1

  19. Example: Recursive Predicates
      • Transitive closure
        path(X,Y) :- edge(X,Y). (base)
        path(X,Y) :- edge(X,Z), path(Z,Y). (rec)
      • Computation of the estimate:
        1. Compute size(path) and DM(path) using rule base
        2. Compute size(path) and DM(path) using rule rec, as in the case of a join
        3. If size(path) reaches an approximate fixed point, stop; otherwise, go to step 2
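
The control loop of this computation might look as follows. It is a sketch under stated assumptions: `step` stands in for one application of rule rec to the current estimate, the denominator of Δ(size)/size is taken to be the new size, and the toy `step` used in the demonstration is made up and is not a real join estimate.

```python
def iterate_to_approx_fixpoint(initial_size, step, alpha, max_iters=1000):
    """Apply `step` until the size estimate reaches an alpha-approximate fixed point.

    Returns (size, number of iterations performed).
    """
    size = initial_size
    for iteration in range(1, max_iters + 1):
        new_size = step(size)
        delta = abs(new_size - size)
        size = new_size
        # Assumption: the relative change is measured against the new size.
        if size == 0 or delta / size <= alpha:
            return size, iteration
    return size, max_iters


def toy_step(size):
    """Made-up stand-in for rule rec: move halfway towards a limit of 100."""
    return size + (100 - size) * 0.5


print(iterate_to_approx_fixpoint(10, toy_step, alpha=0.05))
```

A larger α stops earlier with a coarser estimate, while α = 0 requires an exact fixed point.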

  20. Experimental Studies
      • Test programs:
        • Transitive closure
        • General same generation
      • Datasets: generated with Thomas Process and Matern Cluster Process
      • Results
        • SDP estimates converge to real sizes for recursive predicates
        • Expected sum is good for duplicate removal
      • Details in the paper

  21. Experimental Studies
      • SDP estimates converge to real sizes for recursive predicates
      [Figure: Transitive Closure]

  22. Experimental Studies
      • Expected sum is good for duplicate removal
      [Figure: Transitive Closure]

  23. Conclusion
      • Dependency matrix for binary predicates
      • Overcomes problems with argument independence assumption
      • SDP for selection, join, and recursion
      • Experimental validations

  24. Future Work
      • More complex recursions
      • Negation
      • Extending SDP to n-ary predicates
      • Apply cost-based optimization in deductive systems, such as XSB
