
Machine and Statistical Learning for Database Querying




  1. Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science & Engineering The Ohio State University Advisor: Prof. Srinivasan Parthasarathy Supported by: NSF Career Award IIS-0347662

  2. Outline • Introduction • Selectivity estimation • Probabilistic graphical model • Querying transaction database • Probabilistic model-based itemset summarization • Querying XML database • Conclusion

  3. Introduction

  4. Introduction • Database querying • Selectivity estimation • Estimation of a query result size in database systems • Usage: for query optimizer to choose an efficient execution plan • Rely on probabilistic graphical models

  5. Probabilistic Graphical Models • Marriage of graph theory and probability theory • Special cases of the basic algorithms have been discovered in many guises: • Statistical physics • Hidden Markov models • Genetics • Statistics • … • Numerous applications: • Bioinformatics • Speech • Vision • Robotics • Optimization • …

  6. Directed Graphical Models (Bayesian Network) [Figure: a directed acyclic graph over x1, …, x6] p(x1,x2,x3,x4,x5,x6) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5)

  7. Undirected Graphical Models (Markov Random Field (MRF)) [Figure: the corresponding undirected graph over x1, …, x6] p(x1,x2,x3,x4,x5,x6) = (1/Z) Φ(x1,x2) Φ(x1,x3) Φ(x2,x4) Φ(x3,x5) Φ(x2,x5,x6)

  8. Inference – Computing Conditional Probabilities [Figure: the six-variable model from the previous slides] • Conditioning: fix the observed (evidence) variables to their values • Marginalization: sum out the remaining variables, e.g. p(x1, x6) = Σ over x2, …, x5 of p(x1, …, x6) • Conditional probabilities: p(x1 | x6) = p(x1, x6) / p(x6)
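To make the marginalization and conditioning operations concrete, here is a minimal brute-force sketch for a six-variable binary MRF with the factorization shown on the previous slide. The potential values are toy numbers chosen only for illustration; real models use junction trees or the approximate methods discussed later instead of full enumeration.

```python
import itertools

# Toy potentials for the MRF on slide 7 (values are arbitrary, for illustration only).
def phi12(x1, x2): return 2.0 if x1 == x2 else 1.0
def phi13(x1, x3): return 1.5 if x1 == x3 else 1.0
def phi24(x2, x4): return 2.0 if x2 == x4 else 0.5
def phi35(x3, x5): return 1.2 if x3 == x5 else 1.0
def phi256(x2, x5, x6): return 3.0 if x6 == (x2 and x5) else 1.0

def unnormalized(x):
    x1, x2, x3, x4, x5, x6 = x
    return (phi12(x1, x2) * phi13(x1, x3) * phi24(x2, x4) *
            phi35(x3, x5) * phi256(x2, x5, x6))

# Partition function Z: sum over all 2^6 assignments.
assignments = list(itertools.product([0, 1], repeat=6))
Z = sum(unnormalized(x) for x in assignments)

# Marginalization: p(x1 = 1) sums out x2..x6.
p_x1 = sum(unnormalized(x) for x in assignments if x[0] == 1) / Z

# Conditioning: p(x1 = 1 | x6 = 1) = p(x1=1, x6=1) / p(x6=1).
p_x1_and_x6 = sum(unnormalized(x) for x in assignments if x[0] == 1 and x[5] == 1)
p_x6 = sum(unnormalized(x) for x in assignments if x[5] == 1)
print(p_x1, p_x1_and_x6 / p_x6)
```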

  9. Querying Transaction Database

  10. Transaction Database • Consists of records of interactions among entities • Two examples: • Market-basket data: each basket is a transaction consisting of items • Co-authorship data: each paper is a transaction consisting of “author” items

  11. Querying Transaction Database • Rely on frequent itemsets to learn graphical models • Rely on the model to solve the selectivity estimation problem • Given a conjunctive query Q, estimate the size of the answer set, i.e., how many transactions satisfy Q

  12. Frequent Itemset Mining • Market-Basket Analysis [Figure: example transaction table over items A, B, C, D]

  13. Frequent Itemset Mining • Support(I): the number of transactions containing I [Figure: example transactions with those containing I marked]

  14. Frequent Itemset Mining Problem • Given a transaction database D and a support threshold minsup, find all itemsets I with support(I) ≥ minsup
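A minimal levelwise sketch of this mining problem, assuming transactions are given as Python sets of item identifiers. The experiments later use a standard Apriori implementation; this sketch is only meant to make the definitions of support and the minsup filter concrete.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

def frequent_itemsets(transactions, minsup):
    """Levelwise search: extend frequent k-itemsets into candidate (k+1)-itemsets."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items if support([i], transactions) >= minsup]
    k = 1
    while level:
        for I in level:
            frequent[I] = support(I, transactions)
        # Candidate generation: union pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        level = [c for c in candidates if support(c, transactions) >= minsup]
        k += 1
    return frequent

# Example: a tiny market-basket dataset with minsup = 2.
D = [{'A', 'B', 'D'}, {'A', 'C'}, {'A', 'B', 'C'}, {'B', 'C'}]
print(frequent_itemsets(D, minsup=2))
```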

  15. Using Frequent Itemsets to Learn an MRF • A k-itemset can be viewed as a constraint on the underlying distribution generating the data • Given a set of itemsets, we compute a distribution that satisfies them and has maximum entropy (ME) • This maximum entropy distribution is equivalent to an MRF

  16. An ME Distribution Example • The maximum entropy distribution has the following product form: p(X) = u0 · u1^I1(X) · u2^I2(X) · … · u11^I11(X), where Ij(·) is an indicator function for the corresponding itemset constraint and the constants u0, u1, …, u11 are estimated from the data.

  17. An MRF Example [Figure: an MRF over variables X1–X5 with cliques C1, C2, C3]

  18. Iterative Scaling Algorithm • Time complexity: runs for k iterations over m itemset constraints, with t the average inference time per constraint → O(k · m · t) • Efficient inference is crucial!
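The sketch below shows one standard iterative scaling (IPF-style) loop for fitting the product-form ME distribution to itemset support constraints. Inference here is brute-force enumeration over binary variables, which is exactly the per-constraint step t that junction trees or approximate inference must replace on realistic models; the constraint values are illustrative.

```python
import itertools

def iterative_scaling(n_vars, constraints, n_transactions, iters=100):
    """Fit the ME/MRF product form p(x) ∝ prod_I u_I^{I(x)} to itemset supports.
    constraints: dict {frozenset of item indices: observed support}."""
    u = {I: 1.0 for I in constraints}
    states = list(itertools.product([0, 1], repeat=n_vars))

    def weight(x):
        w = 1.0
        for I, u_I in u.items():
            if all(x[i] == 1 for i in I):          # indicator function for itemset I
                w *= u_I
        return w

    for _ in range(iters):                          # k iterations ...
        for I, supp in constraints.items():         # ... over m constraints, each needing
            Z = sum(weight(x) for x in states)      # one inference call (brute force here)
            p = sum(weight(x) for x in states if all(x[i] == 1 for i in I)) / Z
            q = supp / n_transactions               # empirical itemset probability
            if 0 < p < 1 and 0 < q < 1:
                u[I] *= (q * (1 - p)) / (p * (1 - q))   # IPF update for constraint I
    return u

# Toy example: 3 items, supports out of 100 transactions.
constraints = {frozenset([0]): 60, frozenset([1]): 50, frozenset([2]): 45,
               frozenset([0, 1]): 40, frozenset([0, 2]): 30}
print(iterative_scaling(3, constraints, n_transactions=100))
```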

  19. Junction Tree Algorithm • Exact inference algorithm • Time complexity is exponential in the treewidth (tw) of the model • Treewidth = (maximum clique size in the graph formed by triangulating the model) – 1 • For real-world models, tw is often well above 20, making exact inference intractable

  20. Approximate Inference Algorithm • Gibbs sampling • Simulate samples from the posterior distribution • Average over samples to estimate marginal probabilities • Mean field algorithm • Convert the inference problem to an optimization problem, and solve the relaxed optimization problem • Loopy belief propagation • Apply Pearl’s belief propagation directly to loopy graphs • Works quite well in practice Will the iterative scaling algorithm still converge (when subjected to approximate inference algorithms)?
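As a concrete illustration of the first option, here is a minimal Gibbs sampler for a small pairwise binary MRF. The edge potentials and the estimated marginals are purely illustrative and not taken from the thesis.

```python
import random

# Pairwise MRF: edges with a simple "agreement" potential (toy strengths).
edges = {(0, 1): 2.0, (0, 2): 1.5, (1, 3): 2.0, (2, 4): 1.2, (1, 4): 1.5}

def local_weight(x, i, value):
    """Unnormalized weight of setting variable i to `value`, given the rest of x."""
    w = 1.0
    for (a, b), strength in edges.items():
        if i in (a, b):
            other = x[b] if i == a else x[a]
            w *= strength if value == other else 1.0
    return w

def gibbs_marginals(n_vars=5, n_samples=20000, burn_in=2000):
    x = [random.randint(0, 1) for _ in range(n_vars)]
    counts = [0] * n_vars
    for t in range(n_samples + burn_in):
        for i in range(n_vars):                     # resample each variable in turn
            w1 = local_weight(x, i, 1)
            w0 = local_weight(x, i, 0)
            x[i] = 1 if random.random() < w1 / (w0 + w1) else 0
        if t >= burn_in:
            for i in range(n_vars):
                counts[i] += x[i]
    return [c / n_samples for c in counts]          # estimated marginals p(x_i = 1)

print(gibbs_marginals())
```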

  21. Graph Partitioning-Based Approximate MRF Learning Lemma: For all disjoint vertex subsets a, b and c in an MRF, whenever b and c are separated by a in the graph, the variables associated with b and c are conditionally independent given the variables associated with a alone.

  22. Graph Partitioning-Based Approximate MRF Learning • Cluster variables based on graph partitioning • Interaction importance and treewidth based variable-cluster augmentation • Learn an exact local MRF on a variable-cluster and combine all local models to derive an approximate global MRF

  23. Clustering Variables • k-MinCut • Partition the graph into k equal parts • Minimize the number of edges of E whose incident vertices belong to different partitions • Weighted graphs: Minimize the sum of weights of all edges across different partitions

  24. Accumulative Edge Weighting Scheme • Edge weight should reflect the correlation strength [Figure: example of accumulating per-itemset contributions into an edge weight, 3 + 2 = 5]
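A small sketch of building the weighted item graph, under the assumption that every frequent itemset containing both endpoints of an edge adds a contribution (here its size; its support or another importance measure would work the same way) to that edge's accumulated weight. The exact contribution used in the thesis may differ.

```python
from collections import defaultdict
from itertools import combinations

def build_weighted_item_graph(frequent_itemsets):
    """frequent_itemsets: dict {frozenset of items: support}.
    Accumulate a weight on every item-item edge covered by some frequent itemset.
    Assumption for illustration: each covering itemset contributes its size."""
    weights = defaultdict(float)
    for itemset in frequent_itemsets:
        for a, b in combinations(sorted(itemset), 2):
            weights[(a, b)] += len(itemset)      # accumulate contributions per edge
    return dict(weights)

itemsets = {frozenset('AB'): 40, frozenset('ABC'): 25, frozenset('BD'): 30}
print(build_weighted_item_graph(itemsets))
# Edge (A, B) accumulates 2 + 3 = 5; the weighted graph is then handed to a
# k-way min-cut partitioner such as METIS.
```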

  25. Clustering Variables • The k-MinCut partitioning scheme yields disjoint partitions. However, there exist edges across different partitions. In other words, different partitions are correlated to each other. So how do we account for the correlations across different partitions?

  26. Interaction Importance and Treewidth Based Variable-Cluster Augmentation • Augmenting variable-cluster • Add back most significant incident edges to a variable-cluster • Optimization • Take into consideration model complexity • Keep track of treewidth of the augmented variable-clusters • 1-hop neighboring nodes first, then 2-hop nodes, …, and so on
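A rough sketch of the augmentation loop described above, assuming the clusters come from partitioning the weighted item graph and that a heuristic treewidth estimate (networkx's min-fill approximation) enforces the threshold; the exact importance ordering used in the thesis may differ.

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_fill_in

def augment_cluster(graph, cluster, tw_limit, max_hops=2):
    """Greedily add neighboring nodes (1-hop candidates first, then 2-hop, ...) that
    carry the heaviest edges into the cluster, as long as the estimated treewidth of
    the induced subgraph stays within tw_limit."""
    cluster = set(cluster)
    for hop in range(1, max_hops + 1):
        # Candidate nodes within `hop` steps of the current cluster, not yet inside it.
        frontier = {v for u in cluster
                    for v in nx.single_source_shortest_path_length(graph, u, cutoff=hop)
                    if v not in cluster}

        def importance(v):
            # Total weight of v's edges into the cluster (interaction importance).
            return sum(graph[v][u].get('weight', 1.0) for u in graph[v] if u in cluster)

        for v in sorted(frontier, key=importance, reverse=True):
            candidate = graph.subgraph(cluster | {v})
            width, _ = treewidth_min_fill_in(candidate)   # heuristic treewidth estimate
            if width <= tw_limit:
                cluster.add(v)
    return cluster

# Usage (assuming G is the weighted item graph and `part` one k-MinCut partition):
# augmented = augment_cluster(G, part, tw_limit=5)
```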

  27. Treewidth Based Augmentation [Figure: a variable-cluster surrounded by its 1-hop and 2-hop neighboring nodes]

  28. Interaction Importance and Treewidth Based Variable-Cluster Augmentation

  29. Approximate Global MRFs • For each augmented variable-cluster, collect related itemsets and learn an exact local MRF • All local MRFs together offer an approximate global MRF

  30. Learning Algorithm

  31. A Greedy Inference Algorithm • Given the global model consisting of a set of local MRFs, how do we make inference? • Case 1: all query variables are covered by a single MRF, evaluate the marginal probability directly • Case 2: use a greedy decomposition scheme to compute • First, pick a local model that has the largest intersection with the current query (i.e., cover most variables) • Then pick the next local model covering most uncovered query variables, and so on • Overlapped decomposition
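A minimal sketch of the greedy decomposition, which is essentially greedy set cover over the query variables; the model names and variable sets follow the example on the next slide, and the label-to-set assignment is only illustrative.

```python
def greedy_decompose(query_vars, local_models):
    """local_models: dict {model_name: set of variables it covers}.
    Repeatedly pick the local model covering the most still-uncovered query variables.
    Returns, per chosen model, the query variables it handles (overlaps allowed)."""
    uncovered = set(query_vars)
    plan = []
    while uncovered:
        name, covered = max(local_models.items(), key=lambda kv: len(kv[1] & uncovered))
        if not covered & uncovered:
            break                                   # remaining variables covered by no model
        plan.append((name, set(query_vars) & covered))
        uncovered -= covered
    return plan

# Example from the next slide: three local models and query {X1..X5}.
models = {'M1': {'X1', 'X2', 'X3', 'X6', 'X7'},
          'M2': {'X3', 'X4', 'X6', 'X8'},
          'M3': {'X5', 'X9', 'X10'}}
print(greedy_decompose({'X1', 'X2', 'X3', 'X4', 'X5'}, models))
```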

  32. A Greedy Inference Algorithm • Example: three local models M1, M2, M3 covering {X1, X2, X3, X6, X7}, {X3, X4, X6, X8}, and {X5, X9, X10}; query Qx = {X1, X2, X3, X4, X5}

  33. Discussions • The greedy inference scheme is a heuristic • The global model is not globally consistent; however, we expect the global model to be nearly consistent (Heckerman et al. 2000) • A generalized belief propagation style approach is currently under investigation to force local consistency across the local models, thereby offering a globally consistent model

  34. Experimental Results • C++ implementation. The Junction tree algorithm is implemented based on Intel’s Open-Source Probabilistic Networks library (C++) • Use Apriori algorithm to collect frequent itemsets • Use Metis for graph partitioning

  35. Experimental Setup • Datasets • Microsoft Anonymous Web, |D|=32711, |I|=294 • BMS-Webview1, |D|=59602, |I|=497 • Query workloads • Conjunctive queries, e.g., X1 & ¬X2 & X4 • Performance metrics • Time: online estimating time and offline learning time • Error: average absolute relative error • Varying • k, the no. of clusters • g, the no. of vertices used during the augmentation • tw, the treewidth threshold when using treewidth based augmentation optimization
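For reference, the error metric is presumably the standard definition sketched below; the slides do not spell out the formula, so this is an assumption.

```python
def avg_abs_relative_error(estimates, actuals):
    """Average of |estimated selectivity - true selectivity| / true selectivity
    over a query workload (true counts assumed nonzero)."""
    return sum(abs(e - a) / a for e, a in zip(estimates, actuals)) / len(actuals)
```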

  36. Results on the Web Data • Support threshold = 20, resulting in 9901 frequent itemsets • Treewidth = 28 according to the Maximum Cardinality Search (MCS) ordering heuristic

  37. Varying k (g = 5) [Plots: estimation accuracy, online time, and offline time]

  38. Varying g (k = 20) [Plots: estimation accuracy, online time, and offline time]

  39. Varying tw (k = 25) [Plots: estimation accuracy, online time, and offline time]

  40. Using Non-Redundant Itemsets • There exist redundancies in a collection of frequent itemsets • Select non-redundant patterns to learn probabilistic models • Closely related to pattern summarization

  41. Probabilistic Model-Based Itemset Summarization

  42. Non-Derivable Itemsets • Based on redundancies • How do supports relate? • What information about unknown supports can we derive from known supports? • Concise representation: only store non-redundant information

  43. The Inclusion-Exclusion Principle

  44. Deduction Rules via Inclusion-Exclusion • Let A, B, C, … be items • Let A’ correspond to the set { transactions t | t contains A } • (AB)’ = (A)’ ∩ (B)’ • Then supp(AB) = | (AB)’|

  45. Deduction Rules via Inclusion-Exclusion • Inclusion-exclusion principle: |A' ∪ B' ∪ C'| = |A'| + |B'| + |C'| - |(AB)'| - |(AC)'| - |(BC)'| + |(ABC)'| • Thus, since |A' ∪ B' ∪ C'| ≤ n: supp(ABC) ≤ s(AB) + s(AC) + s(BC) - s(A) - s(B) - s(C) + n

  46. Complete Set for Supp(ABC)

  47. Derivable Itemsets • Given: Supp(I) for all I ⊂ J ⇒ lower bound on Supp(J) = L, upper bound on Supp(J) = U • Without counting: Supp(J) ∈ [L, U] • J is a derivable itemset (DI) iff L = U: we know Supp(J) exactly without counting!
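The bounds L and U can be computed from the supports of all proper subsets via the inclusion-exclusion deduction rules of the previous slides: a subset X with |J \ X| odd yields an upper bound, and one with |J \ X| even yields a lower bound. A small self-contained sketch:

```python
from itertools import combinations

def ndi_bounds(J, supp):
    """Lower/upper bounds on supp(J) from the supports of all proper subsets.
    supp: dict {frozenset: support}, containing every proper subset of J
    (frozenset() maps to the number of transactions n)."""
    J = frozenset(J)
    lower, upper = [], []
    for r in range(len(J)):
        for X in map(frozenset, combinations(J, r)):       # every proper subset X of J
            bound = 0
            for s in range(r, len(J)):                      # every Y with X ⊆ Y ⊊ J
                for Y in map(frozenset, combinations(J, s)):
                    if X <= Y:
                        bound += (-1) ** (len(J - Y) + 1) * supp[Y]
            (upper if len(J - X) % 2 == 1 else lower).append(bound)
    return max(lower, default=0), min(upper)

# Example in the spirit of slide 45: n = 10 transactions over items A, B, C.
supp = {frozenset(): 10,
        frozenset('A'): 6, frozenset('B'): 5, frozenset('C'): 6,
        frozenset('AB'): 4, frozenset('AC'): 4, frozenset('BC'): 3}
L, U = ndi_bounds('ABC', supp)
print(L, U)   # -> 2 3; supp(ABC) is derivable iff L == U
```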

  48. Derivable Itemsets • J is a derivable itemset: • No need to count Supp(J) • No need to store Supp(J) • We can use the deduction rules • Concise representation: C = { (J, Supp(J)) | J not derivable from Supp(I), I ⊂ J }

  49. Probabilistic Model Based Itemset Summarization • We can learn the MRF from non-derivable itemsets alone Lemma: Given a transaction dataset D, the MRF M constructed from all of its σ-frequent itemsets is equivalent to M’, the MRF constructed from only its σ-frequent non-derivable itemsets • Can we do better? • Further compress the patterns

  50. Probabilistic Model Based Itemset Summarization • Use smaller itemsets to learn an MRF • Use this model to infer the supports of larger itemsets • Use those itemsets whose supports cannot be explained by the model (within some error threshold) to augment the model
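A high-level sketch of this summarization loop. The stand-in model here is deliberately a trivial independence model (a support is estimated as a product of item frequencies) so that the skeleton runs end to end; in the actual method the estimates come from the MRF learned so far, and the model is refit whenever it is augmented with an unexplained itemset.

```python
def summarize(frequent, n_transactions, item_freq, eps=0.2):
    """Skeleton of model-based itemset summarization.
    frequent: dict {frozenset: support} of frequent itemsets of size >= 2.
    item_freq: dict {item: support} of the single items.
    Stand-in model: independence; the real method uses the MRF learned so far."""
    summary = dict(item_freq)                     # start from the smallest itemsets
    for itemset, observed in sorted(frequent.items(), key=lambda kv: len(kv[0])):
        est = n_transactions
        for item in itemset:                      # independence estimate of the support
            est *= item_freq[item] / n_transactions
        if abs(observed - est) / observed > eps:  # support not explained within eps:
            summary[itemset] = observed           # keep it (and, in the real method,
                                                  # augment and refit the MRF)
    return summary

items = {'A': 60, 'B': 50, 'C': 45}
freq = {frozenset('AB'): 40, frozenset('AC'): 27, frozenset('BC'): 22}
print(summarize(freq, 100, items, eps=0.2))   # only the unexplained itemset AB is kept
```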
