1 / 27

Summarizing Relational Databases

Summarizing Relational Databases. Xiaoyan Yang National University of Singapore Cecilia Magda Procopiuc Divesh Srivastava AT&T Labs-Research. Labs-Research. Motivation. Complex databases are challenging to explore and query Consisting of hundreds of inter-linked tables

Download Presentation

Summarizing Relational Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Summarizing Relational Databases Xiaoyan Yang National University of Singapore Cecilia Magda Procopiuc Divesh Srivastava AT&T Labs-Research Labs-Research

  2. Motivation • Complex databases are challenging to explore and query • Consisting of hundreds of inter-linked tables • Users unfamiliar with the schema • Insufficient or unavailable schema information • Propose a principled approach to summarize the contents of a relational database • Cluster similar tables • Label each cluster by its most important table Summarizing Relational Databases

  3. TPCE Summarizing Relational Databases

  4. Outline • Motivation • Our Approach • Define table importance • Define a metric space over the tables • Clustering: Weighted k-Center • Experimental Results • Conclusions Summarizing Relational Databases

  5. Table Importance • Depends on • Internal information content • External connectivity • Join behavior • Taxrate: 1 join • Customer: 5 joins 3 columns 2 columns 24 columns Summarizing Relational Databases

  6. Table Importance (cont’d) • Entropy of Attribute A in table R is defined as • R.A = {a1, …, ak} • pi is the fraction of tuples in R that have value ai on attribute A • The Information Content of a table R is defined as • Create a primary key R.Key to table R • Add a self-loop R.Key – R.Key Summarizing Relational Databases

  7. Table Importance (cont’d) • Entropy transfer matrix associated with schema graph G is defined as: • For a join edge e = R.A – S.B • For a pair of tables R and S, define • qA': number of join edges involving R.A' (including self-join) • VE – Variable entropy transfer Model , Summarizing Relational Databases

  8. Table Importance (cont’d) • [S,T] = Pr(S.S_SymbT.T_S_Symb) • = α/ (IC(S)+2α) • The importance of table R is defined as the stable-state value of a random walk on G, using probability matrix • Vector , s.t. • Importance , R G • Example Security (S) Trade (T) • H(S.S_Symb)=α; qS_Symb=3 G(TPCE) Trade_Request (TR) = • Pr(S.S_Symb T.T_S_Symb) • H(S.S_Symb) α • IC(S)+(qS_Symb-1) ·H(S.S_Symb) IC(S)+2α = Summarizing Relational Databases

  9. Outline • Motivation • Our Approach • Define table importance • Define a metric space over the tables • Clustering: Weighted k-Center • Experimental Results • Conclusions Summarizing Relational Databases

  10. Table Similarity • Distance = 1 - similarity • Goal: define metric distance • Enables meaningful clustering over relational databases • Table similarity depends on how join edgesand join paths are instantiated A A B B 1 1 1 1 2 2 2 2 3 3 4 3 R R S S A B 1 1 2 1 3 2 R 2 2 3 3 • R.A = S.B S Summarizing Relational Databases

  11. Table Similarity (cont’d) • Consider a join edge e = R.A – S.B • Tuplest1, t2 instantiate e • fanoute(ti) is the fanout of ti along e • fanoute(t1) = 3 • Let q be the number of tuples in Rs.t. fanoute(ti) > 0, define the matching fraction of Rw.r.t. e as fe(R) = q/n, |R|= n • fe(R) = 2/3 ≤ 1; fe(S) = 5/5 = 1 ≤ 1 • Define the matched average fanoutof Rw.r.t. e as • mafe(R) = (3+2)/2 = 2.5 ≥ 1; mafe(S) = 5/5 = 1 ≥ 1 A B t1 t2 1 1 2 1 3 1 R 3 3 • R.A = S.B S Summarizing Relational Databases

  12. Table Similarity (cont’d) • The similarity of tables R and S (w.r.t. e(R,S)) must satisfy: • Property 1: Proportional to the matching fractions fe(R) and fe(S) • Property 2: Inverse proportional to the matched average fanoutsmafe(R) and mafe(S) • Define the strength of tables R and S (w.r.t. e(R,S)) as • e: R.A = S.B A A B B 1 1 1 1 2 2 2 2 3 3 4 3 R R S S • fe(R) = fe(S) =1 • fe(R) = fe(S) =2/3 • e: R.A = S.B A A B B 1 1 1 1 2 2 2 1 3 3 3 2 R R S 2 • Property 2: Inverse proportional to the matched average • fanoutsmafe(R) and mafe(S) 2 • mafe(R) = mafe(S) =1 3 • mafe(R) = 7/3 • mafe(S) =1 3 S Summarizing Relational Databases

  13. Table Similarity (cont’d) • Let : R = R0 - R1 - … - Rα= S be a path in G, define • Table similarity (R, S): • Distance (R, S) • dists(R,S) = 1 – strength(R,S) • (R, dists) is a metric space R R1 R0 Rα • ei Ri-1 S Ri • : R = R0 - R1 - … - Rα= S Summarizing Relational Databases

  14. Outline • Motivation • Our Approach • Define table importance • Define a metric space over the tables • Clustering: Weighted k-Center • Experimental Results • Conclusions Summarizing Relational Databases

  15. Clustering: Weighted k-Center • Clustering Criteria: • Minimize the maximum distance between a cluster center and a table in that cluster • Take table importance into consideration, avoid grouping top important tables into one cluster • Weighted k-Center clustering • Weights: table importance • Given k clusters C = {C1, C2, …, Ck}, minimize • NP-Hard Summarizing Relational Databases

  16. Weighted k-Center: Greedy Approximation Algorithm • Start with one cluster, whose center is the top-1 important table. • Iteratively chooses the table Ri whose weighted distance • from its cluster center is largest, and creates a new cluster with Ri as its center. • All tables that are closer to Ri than to their current • cluster center are reassigned to cluster Ci. Summarizing Relational Databases

  17. Outline • Motivation • Our Approach • Define table importance • Define a metric space over the tables • Clustering: Weighted k-Center • Experimental Results • Conclusions Summarizing Relational Databases

  18. Experimental Results • Validate the proposed three components in our approach • Model for table importance IE • Distance function dists • Clustering: Weighted k-Center • Other methods Entropy-based Strength-based Ic: Cardinality-initialized distc = 1 – coverage [1] [1] [1] distp = 1 – proximity [2] [1] C.Yu and H.V.Jagadish. Schema summarization. VLDB 2006. [2] H.Tong, C.Faloutsos and Y.Koren. Fast direction-aware proximity for graph mining. KDD 2007. Summarizing Relational Databases

  19. Experimental Results (cont’d) • Data Sets: TPCE schema • Benchmark database simulating OLTP workload • 33 tables pre-classified into 4 categories • Two database instances: TPCE-1 / TPCE-2 • Affect the size of the majority of tables • Affect Pr(R.AS.B), strength(R,S) for most pairs and mafe for 1/3 of edges Summarizing Relational Databases

  20. TPCE Summarizing Relational Databases

  21. Table Importance • Comparison of IE and IC models • Top-5 Important Tables in IE and their ranks in IC IE more accurate than IC • Top-5 Important Tables in IC and their ranks in IE Summarizing Relational Databases

  22. Table Importance (cont’d) • Consistency of IE and IC models • Top-7 Important Tables in IE and IC for TPCE-1 and TPCE-2 IE more consistent than IC Summarizing Relational Databases

  23. Distance Between Tables • Accuracy of distance functions • Observation: for each table R, its distances to tables within the same category (pre-defined) should be smaller than its distances to tables in different categories • n(R): # top-qnbrs (NNR) of R under dist. d (distS, distC, distP) • m(R): # tables (NNR) in the same category as R under dist. d • Calculate: q=5 distS most accurate Summarizing Relational Databases

  24. Clustering Algorithms Based on IC and distC(coverage) • Accuracy of a summary • TPCE is pre-classified into 4 categories: Broker, Customer,Market and Dimension • m(Ci): # tables in Ci with the same category as center(Ci) • Given a summary C = {C1, C2, …, Ck}, calculate • Balanced-Summary (BS) [1] • Weighted k-Center (WKC) WKC is more accurate Summarizing Relational Databases

  25. Summarization Algorithms • Weighted k-Center over three distance functions distS: most balanced and accurate • Summary Accuracy Summarizing Relational Databases

  26. Related Work • C. Yu and H. V. Jagadish. Schema summarization. VLDB’06 • H.Tong, C.Faloutsos and Y.Koren. Fast direction-aware proximity for graph mining. KDD’07 • W. Wu, B. Reinwald, Y. Sismanis and R. Manjrekar. Discovering topical structures of databases. SIGMOD’08 Summarizing Relational Databases

  27. Conclusions • We proposed a novel approach for summarizing relational schemas • A new model for table importance • A metric distance over schema tables • A summarization algorithm Q&A Summarizing Relational Databases

More Related