slide1 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Summarizing Relational Databases PowerPoint Presentation
Download Presentation
Summarizing Relational Databases

Loading in 2 Seconds...

play fullscreen
1 / 27

Summarizing Relational Databases - PowerPoint PPT Presentation


  • 107 Views
  • Uploaded on

Summarizing Relational Databases. Xiaoyan Yang National University of Singapore Cecilia Magda Procopiuc Divesh Srivastava AT&T Labs-Research. Labs-Research. Motivation. Complex databases are challenging to explore and query Consisting of hundreds of inter-linked tables

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Summarizing Relational Databases' - sybil-mcintyre


Download Now An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1

Summarizing Relational Databases

Xiaoyan Yang

National University of Singapore

Cecilia Magda Procopiuc

Divesh Srivastava

AT&T Labs-Research

Labs-Research

motivation
Motivation
  • Complex databases are challenging to explore and query
    • Consisting of hundreds of inter-linked tables
    • Users unfamiliar with the schema
    • Insufficient or unavailable schema information
  • Propose a principled approach to summarize the contents of a relational database
    • Cluster similar tables
    • Label each cluster by its most important table

Summarizing Relational Databases

slide3

TPCE

Summarizing Relational Databases

outline
Outline
  • Motivation
  • Our Approach
    • Define table importance
    • Define a metric space over the tables
    • Clustering: Weighted k-Center
  • Experimental Results
  • Conclusions

Summarizing Relational Databases

table importance
Table Importance
  • Depends on
    • Internal information content
    • External connectivity
      • Join behavior
      • Taxrate: 1 join
      • Customer: 5 joins

3 columns

2 columns

24 columns

Summarizing Relational Databases

table importance cont d
Table Importance (cont’d)
  • Entropy of Attribute A in table R is defined as
    • R.A = {a1, …, ak}
    • pi is the fraction of tuples in R that have value ai on attribute A
  • The Information Content of a table R is defined as
    • Create a primary key R.Key to table R
    • Add a self-loop R.Key – R.Key

Summarizing Relational Databases

table importance cont d1
Table Importance (cont’d)
  • Entropy transfer matrix associated with schema graph G is defined as:
    • For a join edge e = R.A – S.B
    • For a pair of tables R and S, define
  • qA': number of join edges involving R.A' (including self-join)
  • VE – Variable entropy transfer Model

,

Summarizing Relational Databases

table importance cont d2
Table Importance (cont’d)
  • [S,T] = Pr(S.S_SymbT.T_S_Symb)
  • = α/ (IC(S)+2α)
  • The importance of table R is defined as the stable-state value of a random walk on G, using probability matrix
    • Vector , s.t.
    • Importance , R G
  • Example

Security (S)

Trade (T)

  • H(S.S_Symb)=α; qS_Symb=3

G(TPCE)

Trade_Request (TR)

=

  • Pr(S.S_Symb T.T_S_Symb)
  • H(S.S_Symb) α
  • IC(S)+(qS_Symb-1) ·H(S.S_Symb) IC(S)+2α

=

Summarizing Relational Databases

outline1
Outline
  • Motivation
  • Our Approach
    • Define table importance
    • Define a metric space over the tables
    • Clustering: Weighted k-Center
  • Experimental Results
  • Conclusions

Summarizing Relational Databases

table similarity
Table Similarity
  • Distance = 1 - similarity
  • Goal: define metric distance
    • Enables meaningful clustering over relational databases
  • Table similarity depends on how join edgesand join paths are instantiated

A

A

B

B

1

1

1

1

2

2

2

2

3

3

4

3

R

R

S

S

A

B

1

1

2

1

3

2

R

2

2

3

3

  • R.A = S.B

S

Summarizing Relational Databases

table similarity cont d
Table Similarity (cont’d)
  • Consider a join edge e = R.A – S.B
    • Tuplest1, t2 instantiate e
    • fanoute(ti) is the fanout of ti along e
      • fanoute(t1) = 3
    • Let q be the number of tuples in Rs.t. fanoute(ti) > 0, define the matching fraction of Rw.r.t. e as fe(R) = q/n, |R|= n
      • fe(R) = 2/3 ≤ 1; fe(S) = 5/5 = 1 ≤ 1
    • Define the matched average fanoutof Rw.r.t. e as
      • mafe(R) = (3+2)/2 = 2.5 ≥ 1; mafe(S) = 5/5 = 1 ≥ 1

A

B

t1

t2

1

1

2

1

3

1

R

3

3

  • R.A = S.B

S

Summarizing Relational Databases

table similarity cont d1
Table Similarity (cont’d)
  • The similarity of tables R and S (w.r.t. e(R,S)) must satisfy:
    • Property 1: Proportional to the matching fractions fe(R) and fe(S)
    • Property 2: Inverse proportional to the matched average fanoutsmafe(R) and mafe(S)
  • Define the strength of tables R and S (w.r.t. e(R,S)) as
  • e: R.A = S.B

A

A

B

B

1

1

1

1

2

2

2

2

3

3

4

3

R

R

S

S

  • fe(R) = fe(S) =1
  • fe(R) = fe(S) =2/3
  • e: R.A = S.B

A

A

B

B

1

1

1

1

2

2

2

1

3

3

3

2

R

R

S

2

  • Property 2: Inverse proportional to the matched average
  • fanoutsmafe(R) and mafe(S)

2

  • mafe(R) = mafe(S) =1

3

  • mafe(R) = 7/3
  • mafe(S) =1

3

S

Summarizing Relational Databases

table similarity cont d2
Table Similarity (cont’d)
  • Let : R = R0 - R1 - … - Rα= S be a path in G, define
  • Table similarity (R, S):
  • Distance (R, S)
    • dists(R,S) = 1 – strength(R,S)
    • (R, dists) is a metric space

R

R1

R0

  • ei

Ri-1

S

Ri

  • : R = R0 - R1 - … - Rα= S

Summarizing Relational Databases

outline2
Outline
  • Motivation
  • Our Approach
    • Define table importance
    • Define a metric space over the tables
    • Clustering: Weighted k-Center
  • Experimental Results
  • Conclusions

Summarizing Relational Databases

clustering weighted k center
Clustering: Weighted k-Center
  • Clustering Criteria:
    • Minimize the maximum distance between a cluster center and a table in that cluster
    • Take table importance into consideration, avoid grouping top important tables into one cluster
  • Weighted k-Center clustering
    • Weights: table importance
    • Given k clusters C = {C1, C2, …, Ck}, minimize
    • NP-Hard

Summarizing Relational Databases

weighted k center greedy approximation algorithm
Weighted k-Center: Greedy Approximation Algorithm
  • Start with one cluster, whose center is the top-1 important table.
  • Iteratively chooses the table Ri whose weighted distance
  • from its cluster center is largest, and creates a new cluster with Ri as its center.
  • All tables that are closer to Ri than to their current
  • cluster center are reassigned to cluster Ci.

Summarizing Relational Databases

outline3
Outline
  • Motivation
  • Our Approach
    • Define table importance
    • Define a metric space over the tables
    • Clustering: Weighted k-Center
  • Experimental Results
  • Conclusions

Summarizing Relational Databases

experimental results
Experimental Results
  • Validate the proposed three components in our approach
    • Model for table importance IE
    • Distance function dists
    • Clustering: Weighted k-Center
  • Other methods

Entropy-based

Strength-based

Ic: Cardinality-initialized

distc = 1 – coverage

[1]

[1]

[1]

distp = 1 – proximity

[2]

[1] C.Yu and H.V.Jagadish. Schema summarization. VLDB 2006.

[2] H.Tong, C.Faloutsos and Y.Koren. Fast direction-aware proximity for graph mining. KDD 2007.

Summarizing Relational Databases

experimental results cont d
Experimental Results (cont’d)
  • Data Sets: TPCE schema
    • Benchmark database simulating OLTP workload
    • 33 tables pre-classified into 4 categories
    • Two database instances: TPCE-1 / TPCE-2
      • Affect the size of the majority of tables
      • Affect Pr(R.AS.B), strength(R,S) for most pairs and mafe for 1/3 of edges

Summarizing Relational Databases

slide20

TPCE

Summarizing Relational Databases

table importance1
Table Importance
  • Comparison of IE and IC models
  • Top-5 Important Tables in IE and their ranks in IC

IE more accurate than IC

  • Top-5 Important Tables in IC and their ranks in IE

Summarizing Relational Databases

table importance cont d3
Table Importance (cont’d)
  • Consistency of IE and IC models
  • Top-7 Important Tables in IE and IC for TPCE-1 and TPCE-2

IE more consistent than IC

Summarizing Relational Databases

distance between tables
Distance Between Tables
  • Accuracy of distance functions
    • Observation: for each table R, its distances to tables within the same category (pre-defined) should be smaller than its distances to tables in different categories
    • n(R): # top-qnbrs (NNR) of R under dist. d (distS, distC, distP)
    • m(R): # tables (NNR) in the same category as R under dist. d
    • Calculate:

q=5

distS most accurate

Summarizing Relational Databases

clustering algorithms
Clustering Algorithms

Based on IC and distC(coverage)

  • Accuracy of a summary
    • TPCE is pre-classified into 4 categories: Broker, Customer,Market and Dimension
    • m(Ci): # tables in Ci with the

same category as center(Ci)

    • Given a summary C = {C1, C2, …, Ck}, calculate
    • Balanced-Summary (BS) [1]
    • Weighted k-Center (WKC)

WKC is more accurate

Summarizing Relational Databases

summarization algorithms
Summarization Algorithms
  • Weighted k-Center over three distance functions

distS: most balanced and accurate

  • Summary Accuracy

Summarizing Relational Databases

related work
Related Work
  • C. Yu and H. V. Jagadish. Schema summarization. VLDB’06
  • H.Tong, C.Faloutsos and Y.Koren. Fast direction-aware proximity for graph mining. KDD’07
  • W. Wu, B. Reinwald, Y. Sismanis and R. Manjrekar. Discovering topical structures of databases. SIGMOD’08

Summarizing Relational Databases

conclusions
Conclusions
  • We proposed a novel approach for summarizing relational schemas
    • A new model for table importance
    • A metric distance over schema tables
    • A summarization algorithm

Q&A

Summarizing Relational Databases