250 likes | 431 Views
In large enterprise systems, complex database schemas with thousands of tables and columns challenge user comprehension. This paper addresses the need for tools that simplify these schemas while preserving essential relationships. We propose a schema graph summary that balances informativeness and succinctness, utilizing defined edge weights based on mutual information. Our innovative approach enables users to visualize key tables and understand schema connections through optimized summary graphs, thereby enhancing database usability and query efficiency.
E N D
Summary Graphs for Relational Database Schemas Xiaoyan Yang (NUS) Cecilia M. Procopiuc, DiveshSrivastava (AT&T)
Motivation • Complex database schemas in large enterprise systems • 1000+ tables, 10000+ columns, 100000+ PK/FK edges • Prior work to help users understand complex schemas • Customized views (forms) to hide database schema • Present informative tables to simplify schema understanding • Goal: schema graph summary connecting user’s query tables • Needs to be succinct • Needs to preserve informative join paths
Complex Schema Graph Example • Complex database schema in a large real enterprise system • Too complex for illustrative purposes
Useless TPC-E Schema Summary Graph trade status_type customer security Graph weight = 4.5572034455 • Not very informative: all query tables have a status_type field • Succinct graph does not mean informative graph!
Informative TPC-E Schema Summary Graph trade customer_account holding_summary customer security Graph weight = 1.6917276155 • Very informative: securities held by, trades made by customer • Larger graph, smaller graph weight, union of shortest paths
Useless TPC-E Schema Summary Graph • Union of pairwise shortest paths is not the answer • Small graph weight, but verbosity hinders understandability
Succinct TPC-E Schema Summary Graph broker 0.7298749340 customer_account exchange 1.9574738210 0.2947410428 1.0944249463 customer_taxrate commission_rate industry 0.7470561327 1.4236398511 1.2674994678 • Informative & succinct: customer_account, exchange are hubs • Slightly larger graph weight, but informative and succinct Graph weight = 7.5147101957
Outline • Motivation • Problem statement • Our solution • Defining schema edge weights • Computing summary graphs • Experimental results
Desiderata • Schema graph summary must be informative and succinct • Need a formal definition of “informative” • Use Information Theory • Need a formal definition of “succinct” • Use Graph Summarization
Problem Statement 1: Informative Edges • Given schema graph G = (R, E) and database instance D • Problem 1: define schema edge weights, wt: E R+ • More informative join edges have smaller weights (≥ 0) • Extend wt(R1, R2) = weight of shortest path between R1 and R2
Problem Statement 2: Succinct Graph • Given schema G = (R, E), weight wt, user-specified tables Q • Problem 2: compute summary graph (Rs, Es) • Q Rs R, |Rs| ≤ |Q|+B, for a given small budget B • Meta-edges Es {(R1, R2) | exists path between R1 and R2 in G} • (Rs, Es) must preserve shortest paths between Q tables in G • Optimize: (Rs, Es) has the minimum sum of meta-edge weights
Outline • Motivation • Problem statement • Our solution • Defining schema edge weights • Computing summary graphs • Experimental results
Informative Edges: Column Graph D E • Build an edge weighted column graph GC = (NC, EC) where • NC consists of all primary and foreign key columns in all tables • Intra-table edges in EC = {(R.P, R.F) | R.P is a PK column of R} • Inter-table edges in EC = {(R.P, R1.F) | R1.F is a foreign key to R.P} • Edge weights based on mutual information between columns B 0.1 0.05 0.28 S 0.21 R A 0.5 C T 0.6 F
Informative Edges: Table Graph S D E • Induce an edge weighted table graph GT = (NT, ET) where • NT consists of all tables • ET = {(R, R1) | R1.F is a foreign key to R.P} • Edge weight = min sum of weights on path between PK columns B 0.1 0.05 0.38 0.28 0.26 S R 0.21 R A 0.5 1.1 C T 0.6 F T
Edge Weight: Using Mutual Information H(X,Y) • Mutual information I(X;Y) = SxSy p(x,y) log2(p(x,y)/p(x)p(y)) • Mutual information captures strength of linkage between X, Y • D(X,Y) = 1 – H(X,Y)/I(X;Y) is a distance function, H() is entropy • D(X,Y) = 0 iff X, Y are identical; D(X,Y) = 1 iff X, Y are independent H(X|Y) I(X;Y) H(X) H(Y) H(Y|X) I(X;Y) = 1.5 H(X,Y) = 2.0, D(X,Y) = 0.25 i(x;y)
Outline • Motivation • Problem statement • Our solution • Defining schema edge weights • Computing summary graphs • Experimental results
Summary Graph • Given schema graph G = (R, E), edge weight wt: E R+, and user-specified tables Q, compute summary graph (Rs, Es) • Q Rs R, |Rs| ≤ |Q|+B, for a given small budget B • Meta-edges Es {(R1, R2) | exists path between R1 and R2 in G} • (Rs, Es) must preserve shortest paths between Q tables in G • Optimize: (Rs, Es) has the minimum sum of meta-edge weights 0.5 R 0.1 0.2 0.3 R R 0.1 0.3 0.3 0.1 0.5 A B T 0.2 S 0.3 0.1 B T A T S S Total weight = 1.2 Total weight = 1.1 Total weight = 0.7
Properties of Summary Graphs • Theorem: Computing the optimal summary graph is NP-hard Proof uses reduction from Clique in (n – 4)-regular graphs • Proposition (towards an elegant solution formulation): • It is sufficient to compute an optimal summary graph for the smaller graph consisting of shortest paths between Q nodes • Endpoints of meta-edges in optimal summary graph have to appear together on at least one shortest path between Q nodes
Efficient Computation of Summary Graphs • It is sufficient to compute an optimal summary graph for the smaller graph consisting of shortest paths between Q nodes • Elegant solution: formulate an integer program; use CPLEX
Outline • Motivation • Problem statement • Our solution • Defining schema edge weights • Computing summary graphs • Experimental results
Experimental Setup • Data: use 2 instances of TPC-E benchmark database schema • Simulates an OLTP workload of a brokerage firm • Well-specified schema, including PK/FK constraints • Quality: use measures based on the TPC-E transaction logs • Table coverage: relative frequency of summary graph tables in log • Join coverage: relative frequency of summary graph joins in log • Summary graph density: reflects complexity of summary graph
Comparing Weight Functions • Compare MI-based and MAF-based [YPS09] edge weights • Fixed B, varying |Q|; fixed |Q|, varying B • MI-based weight is superior: higher coverage, lower density
Choosing Budget Tables • Effect of our strategy for choosing budget tables • Use coordinated summary graphs for fixed |Q|+B • Budget nodes reduce complexity, improve quality
Summary • Complex database schemas in large enterprise systems • 1000+ tables, 10000+ columns, 100000+ PK/FK edges • Novel schema graph summary is informative and succinct • Define schema graph edge weights using mutual information • Compute succinct summary graph that preserves query table shortest paths and minimizes graph weight, for a given budget • Experimental study validates weight definition, summary model • Future work: approximations for schema graph summaries