Summarizing Relational Databases
Download
1 / 27

Summarizing Relational Databases - PowerPoint PPT Presentation


  • 105 Views
  • Uploaded on

Summarizing Relational Databases. Xiaoyan Yang National University of Singapore Cecilia Magda Procopiuc Divesh Srivastava AT&T Labs-Research. Labs-Research. Motivation. Complex databases are challenging to explore and query Consisting of hundreds of inter-linked tables

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Summarizing Relational Databases' - sybil-mcintyre


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Summarizing Relational Databases

Xiaoyan Yang

National University of Singapore

Cecilia Magda Procopiuc

Divesh Srivastava

AT&T Labs-Research

Labs-Research


Motivation
Motivation

  • Complex databases are challenging to explore and query

    • Consisting of hundreds of inter-linked tables

    • Users unfamiliar with the schema

    • Insufficient or unavailable schema information

  • Propose a principled approach to summarize the contents of a relational database

    • Cluster similar tables

    • Label each cluster by its most important table

Summarizing Relational Databases


TPCE

Summarizing Relational Databases


Outline
Outline

  • Motivation

  • Our Approach

    • Define table importance

    • Define a metric space over the tables

    • Clustering: Weighted k-Center

  • Experimental Results

  • Conclusions

Summarizing Relational Databases


Table importance
Table Importance

  • Depends on

    • Internal information content

    • External connectivity

      • Join behavior

      • Taxrate: 1 join

      • Customer: 5 joins

3 columns

2 columns

24 columns

Summarizing Relational Databases


Table importance cont d
Table Importance (cont’d)

  • Entropy of Attribute A in table R is defined as

    • R.A = {a1, …, ak}

    • pi is the fraction of tuples in R that have value ai on attribute A

  • The Information Content of a table R is defined as

    • Create a primary key R.Key to table R

    • Add a self-loop R.Key – R.Key

Summarizing Relational Databases


Table importance cont d1
Table Importance (cont’d)

  • Entropy transfer matrix associated with schema graph G is defined as:

    • For a join edge e = R.A – S.B

    • For a pair of tables R and S, define

  • qA': number of join edges involving R.A' (including self-join)

  • VE – Variable entropy transfer Model

,

Summarizing Relational Databases


Table importance cont d2
Table Importance (cont’d)

  • [S,T] = Pr(S.S_SymbT.T_S_Symb)

  • = α/ (IC(S)+2α)

  • The importance of table R is defined as the stable-state value of a random walk on G, using probability matrix

    • Vector , s.t.

    • Importance , R G

  • Example

Security (S)

Trade (T)

  • H(S.S_Symb)=α; qS_Symb=3

G(TPCE)

Trade_Request (TR)

=

  • Pr(S.S_Symb T.T_S_Symb)

  • H(S.S_Symb) α

  • IC(S)+(qS_Symb-1) ·H(S.S_Symb) IC(S)+2α

=

Summarizing Relational Databases


Outline1
Outline

  • Motivation

  • Our Approach

    • Define table importance

    • Define a metric space over the tables

    • Clustering: Weighted k-Center

  • Experimental Results

  • Conclusions

Summarizing Relational Databases


Table similarity
Table Similarity

  • Distance = 1 - similarity

  • Goal: define metric distance

    • Enables meaningful clustering over relational databases

  • Table similarity depends on how join edgesand join paths are instantiated

A

A

B

B

1

1

1

1

2

2

2

2

3

3

4

3

R

R

S

S

A

B

1

1

2

1

3

2

R

2

2

3

3

  • R.A = S.B

S

Summarizing Relational Databases


Table similarity cont d
Table Similarity (cont’d)

  • Consider a join edge e = R.A – S.B

    • Tuplest1, t2 instantiate e

    • fanoute(ti) is the fanout of ti along e

      • fanoute(t1) = 3

    • Let q be the number of tuples in Rs.t. fanoute(ti) > 0, define the matching fraction of Rw.r.t. e as fe(R) = q/n, |R|= n

      • fe(R) = 2/3 ≤ 1; fe(S) = 5/5 = 1 ≤ 1

    • Define the matched average fanoutof Rw.r.t. e as

      • mafe(R) = (3+2)/2 = 2.5 ≥ 1; mafe(S) = 5/5 = 1 ≥ 1

A

B

t1

t2

1

1

2

1

3

1

R

3

3

  • R.A = S.B

S

Summarizing Relational Databases


Table similarity cont d1
Table Similarity (cont’d)

  • The similarity of tables R and S (w.r.t. e(R,S)) must satisfy:

    • Property 1: Proportional to the matching fractions fe(R) and fe(S)

    • Property 2: Inverse proportional to the matched average fanoutsmafe(R) and mafe(S)

  • Define the strength of tables R and S (w.r.t. e(R,S)) as

  • e: R.A = S.B

A

A

B

B

1

1

1

1

2

2

2

2

3

3

4

3

R

R

S

S

  • fe(R) = fe(S) =1

  • fe(R) = fe(S) =2/3

  • e: R.A = S.B

A

A

B

B

1

1

1

1

2

2

2

1

3

3

3

2

R

R

S

2

  • Property 2: Inverse proportional to the matched average

  • fanoutsmafe(R) and mafe(S)

2

  • mafe(R) = mafe(S) =1

3

  • mafe(R) = 7/3

  • mafe(S) =1

3

S

Summarizing Relational Databases


Table similarity cont d2
Table Similarity (cont’d)

  • Let : R = R0 - R1 - … - Rα= S be a path in G, define

  • Table similarity (R, S):

  • Distance (R, S)

    • dists(R,S) = 1 – strength(R,S)

    • (R, dists) is a metric space

R

R1

R0

  • ei

Ri-1

S

Ri

  • : R = R0 - R1 - … - Rα= S

Summarizing Relational Databases


Outline2
Outline

  • Motivation

  • Our Approach

    • Define table importance

    • Define a metric space over the tables

    • Clustering: Weighted k-Center

  • Experimental Results

  • Conclusions

Summarizing Relational Databases


Clustering weighted k center
Clustering: Weighted k-Center

  • Clustering Criteria:

    • Minimize the maximum distance between a cluster center and a table in that cluster

    • Take table importance into consideration, avoid grouping top important tables into one cluster

  • Weighted k-Center clustering

    • Weights: table importance

    • Given k clusters C = {C1, C2, …, Ck}, minimize

    • NP-Hard

Summarizing Relational Databases


Weighted k center greedy approximation algorithm
Weighted k-Center: Greedy Approximation Algorithm

  • Start with one cluster, whose center is the top-1 important table.

  • Iteratively chooses the table Ri whose weighted distance

  • from its cluster center is largest, and creates a new cluster with Ri as its center.

  • All tables that are closer to Ri than to their current

  • cluster center are reassigned to cluster Ci.

Summarizing Relational Databases


Outline3
Outline

  • Motivation

  • Our Approach

    • Define table importance

    • Define a metric space over the tables

    • Clustering: Weighted k-Center

  • Experimental Results

  • Conclusions

Summarizing Relational Databases


Experimental results
Experimental Results

  • Validate the proposed three components in our approach

    • Model for table importance IE

    • Distance function dists

    • Clustering: Weighted k-Center

  • Other methods

Entropy-based

Strength-based

Ic: Cardinality-initialized

distc = 1 – coverage

[1]

[1]

[1]

distp = 1 – proximity

[2]

[1] C.Yu and H.V.Jagadish. Schema summarization. VLDB 2006.

[2] H.Tong, C.Faloutsos and Y.Koren. Fast direction-aware proximity for graph mining. KDD 2007.

Summarizing Relational Databases


Experimental results cont d
Experimental Results (cont’d)

  • Data Sets: TPCE schema

    • Benchmark database simulating OLTP workload

    • 33 tables pre-classified into 4 categories

    • Two database instances: TPCE-1 / TPCE-2

      • Affect the size of the majority of tables

      • Affect Pr(R.AS.B), strength(R,S) for most pairs and mafe for 1/3 of edges

Summarizing Relational Databases


TPCE

Summarizing Relational Databases


Table importance1
Table Importance

  • Comparison of IE and IC models

  • Top-5 Important Tables in IE and their ranks in IC

IE more accurate than IC

  • Top-5 Important Tables in IC and their ranks in IE

Summarizing Relational Databases


Table importance cont d3
Table Importance (cont’d)

  • Consistency of IE and IC models

  • Top-7 Important Tables in IE and IC for TPCE-1 and TPCE-2

IE more consistent than IC

Summarizing Relational Databases


Distance between tables
Distance Between Tables

  • Accuracy of distance functions

    • Observation: for each table R, its distances to tables within the same category (pre-defined) should be smaller than its distances to tables in different categories

    • n(R): # top-qnbrs (NNR) of R under dist. d (distS, distC, distP)

    • m(R): # tables (NNR) in the same category as R under dist. d

    • Calculate:

q=5

distS most accurate

Summarizing Relational Databases


Clustering algorithms
Clustering Algorithms

Based on IC and distC(coverage)

  • Accuracy of a summary

    • TPCE is pre-classified into 4 categories: Broker, Customer,Market and Dimension

    • m(Ci): # tables in Ci with the

      same category as center(Ci)

    • Given a summary C = {C1, C2, …, Ck}, calculate

    • Balanced-Summary (BS) [1]

    • Weighted k-Center (WKC)

WKC is more accurate

Summarizing Relational Databases


Summarization algorithms
Summarization Algorithms

  • Weighted k-Center over three distance functions

distS: most balanced and accurate

  • Summary Accuracy

Summarizing Relational Databases


Related work
Related Work

  • C. Yu and H. V. Jagadish. Schema summarization. VLDB’06

  • H.Tong, C.Faloutsos and Y.Koren. Fast direction-aware proximity for graph mining. KDD’07

  • W. Wu, B. Reinwald, Y. Sismanis and R. Manjrekar. Discovering topical structures of databases. SIGMOD’08

Summarizing Relational Databases


Conclusions
Conclusions

  • We proposed a novel approach for summarizing relational schemas

    • A new model for table importance

    • A metric distance over schema tables

    • A summarization algorithm

Q&A

Summarizing Relational Databases


ad