Summarizing Relational Databases

Xiaoyan Yang

National University of Singapore

Cecilia Magda Procopiuc

Divesh Srivastava

AT&T Labs-Research

Motivation

- Complex databases are challenging to explore and query
  - Consisting of hundreds of inter-linked tables
  - Users unfamiliar with the schema
  - Insufficient or unavailable schema information

- Propose a principled approach to summarize the contents of a relational database
  - Cluster similar tables
  - Label each cluster by its most important table

Outline

- Motivation
- Our Approach
  - Define table importance
  - Define a metric space over the tables
  - Clustering: Weighted k-Center

- Experimental Results
- Conclusions

Table Importance

- Depends on
  - Internal information content
  - External connectivity
  - Join behavior
    - Taxrate: 1 join
    - Customer: 5 joins

[Figure: example tables annotated with 3, 2, and 24 columns]

Table Importance (cont’d)

- Entropy of attribute A in table R is defined as $H(R.A) = \sum_{i=1}^{k} p_i \log(1/p_i)$
  - $R.A = \{a_1, \dots, a_k\}$
  - $p_i$ is the fraction of tuples in R that have value $a_i$ on attribute A

- The information content of a table R is defined as
  - Create a primary key R.Key for table R
  - Add a self-loop R.Key – R.Key
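
The attribute entropy above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `attribute_entropy` and the sample columns are hypothetical, and base-2 logarithms are an assumption because the slides do not state the base.

```python
import math
from collections import Counter


def attribute_entropy(values):
    """Shannon entropy (in bits, base 2 assumed) of one attribute column."""
    n = len(values)
    counts = Counter(values)
    # p_i = c / n is the fraction of tuples holding the i-th distinct value,
    # so p_i * log2(1 / p_i) = (c / n) * log2(n / c).
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Hypothetical columns: a key-like column and a skewed column.
key_column = [101, 102, 103, 104]      # all values distinct
skewed_column = ["a", "a", "a", "b"]   # one dominant value

print(attribute_entropy(key_column))
print(attribute_entropy(skewed_column))
```

A column whose values are all distinct reaches the maximum entropy log2(n), which is why a key-like column carries the most information under this measure.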

Table Importance (cont’d)

- Entropy transfer matrix associated with schema graph G is defined as:
  - For a join edge e = R.A – S.B
  - For a pair of tables R and S, define

- $q_{A'}$: number of join edges involving R.A' (including self-join)

- VE – Variable Entropy transfer model

Table Importance (cont’d)

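- The importance of table R is defined as the stable-state value of a random walk on G, using probability matrix $\Pi$
  - Vector $\mathcal{I}$, s.t. $\mathcal{I} \cdot \Pi = \mathcal{I}$
  - Importance $\mathcal{I}(R)$, $R \in G$

- Example: G(TPCE), with tables Security (S), Trade (T) and Trade_Request (TR)
  - $H(S.S\_Symb) = \alpha$; $q_{S\_Symb} = 3$
  - $\Pr(S.S\_Symb \to T.T\_S\_Symb) = \dfrac{H(S.S\_Symb)}{IC(S) + (q_{S\_Symb} - 1) \cdot H(S.S\_Symb)} = \dfrac{\alpha}{IC(S) + 2\alpha}$
  - $\Pi[S,T] = \Pr(S.S\_Symb \to T.T\_S\_Symb) = \alpha / (IC(S) + 2\alpha)$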
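
The Security/Trade example and the stable-state definition can be illustrated with a short Python sketch. It rests on several assumptions: `transfer_probability` only mirrors the single-attribute form shown in the example, the values of α and IC(S) are made up, and the 2-table matrix `toy` is hypothetical. Power iteration is used here as one common way to approximate a stable-state vector; the slides do not say how the authors compute it.

```python
def transfer_probability(h_attr, ic_table, q_attr):
    """Pr(R.A -> S.B) in the form of the slide example:
    H(R.A) / (IC(R) + (q_A - 1) * H(R.A))."""
    return h_attr / (ic_table + (q_attr - 1) * h_attr)


def stationary_vector(matrix, iterations=1000, tol=1e-12):
    """Power iteration for a vector v with v * matrix = v.

    Assumes `matrix` is row-stochastic (every row sums to 1).
    """
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(v[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v


# Hypothetical numbers: alpha = H(S.S_Symb) and IC(S) are invented here.
alpha, ic_s = 2.0, 6.0
p = transfer_probability(alpha, ic_s, q_attr=3)  # alpha / (IC(S) + 2 * alpha)

# Hypothetical 2-table transfer matrix, used only to show the fixed point.
toy = [[0.9, 0.1],
       [0.5, 0.5]]
importance = stationary_vector(toy)
print(p, importance)
```

In the toy matrix the first table keeps most of its probability mass on its self-loop, so it ends up with the larger stable-state value.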

Table Similarity

- Distance = 1 - similarity
  - Goal: define a metric distance
  - Enables meaningful clustering over relational databases

- Table similarity depends on how join edges and join paths are instantiated

[Figure: three instantiations of the join edge R.A = S.B, each with R.A = (1, 2, 3): S.B = (1, 2, 3); S.B = (1, 2, 4); S.B = (1, 1, 2, 2, 2, 3, 3)]

Table Similarity (cont’d)

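- Consider a join edge e = R.A – S.B
  - Tuples t1, t2 instantiate e
  - $\mathrm{fanout}_e(t_i)$ is the fanout of $t_i$ along e
  - $\mathrm{fanout}_e(t_1) = 3$

- Let q be the number of tuples in R s.t. $\mathrm{fanout}_e(t_i) > 0$; define the matching fraction of R w.r.t. e as $f_e(R) = q/n$, where |R| = n
  - $f_e(R) = 2/3 \le 1$; $f_e(S) = 5/5 = 1 \le 1$

- Define the matched average fanout of R w.r.t. e as $\mathrm{maf}_e(R) = \frac{1}{q} \sum_{t_i \in R,\ \mathrm{fanout}_e(t_i) > 0} \mathrm{fanout}_e(t_i)$
  - $\mathrm{maf}_e(R) = (3+2)/2 = 2.5 \ge 1$; $\mathrm{maf}_e(S) = 5/5 = 1 \ge 1$

[Figure: join edge R.A = S.B with R.A = (1, 2, 3) and S.B = (1, 1, 1, 3, 3); t1 is the tuple of R with A = 1, which joins three tuples of S]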
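
The matching fraction and matched average fanout can be checked on the slide's own numbers (R.A = 1, 2, 3 and S.B = 1, 1, 1, 3, 3). The following Python sketch is illustrative only: the helper name `edge_statistics` is hypothetical, columns are passed as plain lists, and returning 0.0 for the fanout when nothing matches is a convention chosen here, not something the slides specify.

```python
from collections import Counter


def edge_statistics(left_values, right_values):
    """Return (matching fraction, matched average fanout) of the left table
    along the join edge left = right. Assumes the left column is non-empty."""
    right_counts = Counter(right_values)
    # Fanout of a left tuple = number of right tuples with the same join value.
    fanouts = [right_counts[v] for v in left_values]
    matched = [f for f in fanouts if f > 0]
    q, n = len(matched), len(left_values)
    f_e = q / n
    maf_e = sum(matched) / q if q else 0.0  # convention assumed for q = 0
    return f_e, maf_e


r_a = [1, 2, 3]
s_b = [1, 1, 1, 3, 3]

f_r, maf_r = edge_statistics(r_a, s_b)  # R side: 2 of 3 tuples match
f_s, maf_s = edge_statistics(s_b, r_a)  # S side: every tuple matches once
print(f_r, maf_r, f_s, maf_s)
```

The tuple of R with A = 2 has fanout 0, so it lowers the matching fraction of R but does not enter the matched average fanout.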

Table Similarity (cont’d)

- The similarity of tables R and S (w.r.t. e(R,S)) must satisfy:
  - Property 1: Proportional to the matching fractions $f_e(R)$ and $f_e(S)$
  - Property 2: Inversely proportional to the matched average fanouts $\mathrm{maf}_e(R)$ and $\mathrm{maf}_e(S)$

- Define the strength of tables R and S (w.r.t. e(R,S)) as $\mathrm{strength}_e(R,S) = \dfrac{f_e(R) \cdot f_e(S)}{\mathrm{maf}_e(R) \cdot \mathrm{maf}_e(S)}$

[Figure, Property 1: for e: R.A = S.B with R.A = (1, 2, 3), S.B = (1, 2, 3) gives $f_e(R) = f_e(S) = 1$, while S.B = (1, 2, 4) gives $f_e(R) = f_e(S) = 2/3$]

[Figure, Property 2: for e: R.A = S.B with R.A = (1, 2, 3), S.B = (1, 2, 3) gives $\mathrm{maf}_e(R) = \mathrm{maf}_e(S) = 1$, while S.B = (1, 1, 2, 2, 2, 3, 3) gives $\mathrm{maf}_e(R) = 7/3$ and $\mathrm{maf}_e(S) = 1$]
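
To see the two properties at work, the sketch below computes an edge strength for the instantiations shown on this slide. It assumes the strength is the product of the two matching fractions divided by the product of the two matched average fanouts, which is one form that satisfies Properties 1 and 2. The helper names `matching_stats` and `edge_strength` are hypothetical.

```python
from collections import Counter


def matching_stats(left, right):
    """(matching fraction, matched average fanout) of `left` w.r.t. `right`."""
    counts = Counter(right)
    matched = [counts[v] for v in left if counts[v] > 0]
    if not matched:
        return 0.0, 1.0  # no match: fraction 0, fanout set to 1 by convention
    return len(matched) / len(left), sum(matched) / len(matched)


def edge_strength(r_values, s_values):
    """Assumed form: f_e(R) * f_e(S) / (maf_e(R) * maf_e(S))."""
    f_r, maf_r = matching_stats(r_values, s_values)
    f_s, maf_s = matching_stats(s_values, r_values)
    return (f_r * f_s) / (maf_r * maf_s)


r_a = [1, 2, 3]
full = edge_strength(r_a, [1, 2, 3])                # fractions 1, fanouts 1
partial = edge_strength(r_a, [1, 2, 4])             # fractions 2/3, fanouts 1
fan = edge_strength(r_a, [1, 1, 2, 2, 2, 3, 3])     # fractions 1, maf_e(R) = 7/3
print(full, partial, fan)
```

Under this assumed form, lowering the matching fractions (Property 1) or raising the fanout (Property 2) both reduce the strength below that of the one-to-one instantiation.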

Table Similarity (cont’d)

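- Let $\pi$: R = R0 - R1 - … - Rα = S be a path in G, where edge $e_i$ joins $R_{i-1}$ and $R_i$; define $\mathrm{strength}(\pi) = \prod_{i=1}^{\alpha} \mathrm{strength}_{e_i}(R_{i-1}, R_i)$
- Table similarity (R, S): $\mathrm{strength}(R,S) = \max_{\pi} \mathrm{strength}(\pi)$
- Distance (R, S)
  - $dist_s(R,S) = 1 - \mathrm{strength}(R,S)$
  - $(\mathcal{R}, dist_s)$ is a metric space, where $\mathcal{R}$ is the set of tables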
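
A path-based distance of this kind can be sketched as follows. The code assumes that the strength of a path is the product of its edge strengths and that the similarity of two tables is the best such product over all paths; both are assumptions of this sketch. The edge strengths and the function name `table_distances` are hypothetical. A Floyd-Warshall style loop that maximizes products stands in for whatever path search the authors use.

```python
def table_distances(tables, edge_strengths):
    """dist(R, S) = 1 - max over paths of the product of edge strengths.

    `edge_strengths` maps unordered table pairs to a value in [0, 1].
    """
    s = {a: {b: (1.0 if a == b else 0.0) for b in tables} for a in tables}
    for (a, b), w in edge_strengths.items():
        s[a][b] = s[b][a] = max(s[a][b], w)
    # Floyd-Warshall variant: maximize products instead of minimizing sums.
    for k in tables:
        for a in tables:
            for b in tables:
                s[a][b] = max(s[a][b], s[a][k] * s[k][b])
    return {a: {b: 1.0 - s[a][b] for b in tables} for a in tables}


tables = ["R", "T", "S"]
# Hypothetical edge strengths; the direct edge R-S is weaker than the
# two-hop path R-T-S (0.8 * 0.5 = 0.4).
edges = {("R", "T"): 0.8, ("T", "S"): 0.5, ("R", "S"): 0.3}
dist = table_distances(tables, edges)
print(dist["R"]["S"])
```

Because strengths lie in [0, 1], taking the best product over paths guarantees strength(R, S) ≥ strength(R, T) · strength(T, S), which is what makes the triangle inequality hold for 1 − strength.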

Clustering: Weighted k-Center

- Clustering Criteria:
  - Minimize the maximum distance between a cluster center and a table in that cluster
  - Take table importance into consideration; avoid grouping the most important tables into one cluster

- Weighted k-Center clustering
  - Weights: table importance
  - Given k clusters $C = \{C_1, C_2, \dots, C_k\}$, minimize $\max_{i} \max_{R \in C_i} \mathcal{I}(R) \cdot dist(R, center(C_i))$
  - NP-Hard

Weighted k-Center: Greedy Approximation Algorithm

- Start with one cluster, whose center is the top-1 important table.

- Iteratively choose the table Ri whose weighted distance from its cluster center is largest, and create a new cluster Ci with Ri as its center.

- Reassign to cluster Ci all tables that are closer to Ri than to their current cluster center.
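
The three greedy steps above can be sketched in Python as follows. This is an illustrative reading of the slide, not the authors' implementation: the table names, importance values and distances are hypothetical, "weighted distance" is taken to mean importance times distance to the current center, and the early exit when every weighted distance is zero is an addition of this sketch.

```python
def weighted_k_center(tables, importance, dist, k):
    """Greedy sketch: returns (centers, assignment of table -> center)."""
    # Step 1: one cluster centered at the most important table.
    first = max(tables, key=lambda t: importance[t])
    centers = [first]
    assign = {t: first for t in tables}
    while len(centers) < k:
        # Step 2: table with the largest importance-weighted distance
        # to its current center becomes a new center.
        far = max(tables, key=lambda t: importance[t] * dist[t][assign[t]])
        if importance[far] * dist[far][assign[far]] == 0:
            break  # nothing left to improve (added safeguard)
        centers.append(far)
        # Step 3: move tables that are closer to the new center.
        for t in tables:
            if dist[t][far] < dist[t][assign[t]]:
                assign[t] = far
    return centers, assign


def objective(importance, dist, assign):
    """Largest importance-weighted distance from a table to its center."""
    return max(importance[t] * dist[t][c] for t, c in assign.items())


# Hypothetical data: two tight pairs (A, B) and (C, D), far from each other.
tables = ["A", "B", "C", "D"]
importance = {"A": 0.4, "B": 0.1, "C": 0.3, "D": 0.2}
close = {("A", "B"): 0.2, ("C", "D"): 0.1}
dist = {a: {b: 0.0 if a == b else close.get((a, b), close.get((b, a), 0.9))
            for b in tables} for a in tables}

centers, assign = weighted_k_center(tables, importance, dist, k=2)
print(centers, assign, objective(importance, dist, assign))
```

With these made-up numbers the second center is C rather than D: both are equally far from A, but C has the higher importance, which is how the weighting keeps important tables from being absorbed into another table's cluster.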

Experimental Results

- Validate the three proposed components of our approach
  - Model for table importance: $I_E$ (entropy-based)
  - Distance function: $dist_s$ (strength-based)
  - Clustering: Weighted k-Center

- Other methods
  - $I_C$: cardinality-initialized importance [1]
  - $dist_c$ = 1 – coverage [1]
  - $dist_p$ = 1 – proximity [2]

[1] C. Yu and H. V. Jagadish. Schema summarization. VLDB 2006.

[2] H. Tong, C. Faloutsos and Y. Koren. Fast direction-aware proximity for graph mining. KDD 2007.

Experimental Results (cont’d)

- Data Sets: TPCE schema
  - Benchmark database simulating OLTP workload
  - 33 tables pre-classified into 4 categories
  - Two database instances: TPCE-1 / TPCE-2
    - The instances differ in the size of the majority of tables
    - They also differ in $\Pr(R.A \to S.B)$ and strength(R,S) for most pairs, and in $\mathrm{maf}_e$ for 1/3 of the edges

Table Importance

- Comparison of $I_E$ and $I_C$ models
  - Top-5 important tables in $I_E$ and their ranks in $I_C$
  - Top-5 important tables in $I_C$ and their ranks in $I_E$

$I_E$ is more accurate than $I_C$

Table Importance (cont’d)

- Consistency of $I_E$ and $I_C$ models
  - Top-7 important tables in $I_E$ and $I_C$ for TPCE-1 and TPCE-2

$I_E$ is more consistent than $I_C$

Distance Between Tables

- Accuracy of distance functions
  - Observation: for each table R, its distances to tables within the same (pre-defined) category should be smaller than its distances to tables in different categories
  - n(R): number of top-q nearest neighbors ($NN_R$) of R under distance d ($dist_s$, $dist_c$, $dist_p$)
  - m(R): number of tables in $NN_R$ in the same category as R under distance d
  - Calculate:

[Figure: accuracy of $dist_s$, $dist_c$ and $dist_p$ for q = 5]

$dist_s$ is the most accurate
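
The counts n(R) and m(R) can be computed as in the sketch below. The aggregate returned at the end (the sum of m(R) divided by the sum of n(R)) is only an assumed way to combine them into a single score, chosen for illustration. The table names, categories and distances are hypothetical, and ties between equally distant neighbors are broken by input order.

```python
def neighbor_accuracy(dist, category, q):
    """Return ({table: (n, m)}, pooled ratio sum(m) / sum(n)).

    n = number of top-q nearest neighbors found for the table,
    m = how many of those share the table's category.
    """
    stats = {}
    for r in dist:
        others = sorted((t for t in dist if t != r), key=lambda t: dist[r][t])
        nearest = others[:q]
        m = sum(1 for t in nearest if category[t] == category[r])
        stats[r] = (len(nearest), m)
    total_n = sum(n for n, _ in stats.values())
    total_m = sum(m for _, m in stats.values())
    return stats, total_m / total_n


# Hypothetical data: two categories of two tables each.
tables = ["A", "B", "C", "D"]
category = {"A": "x", "B": "x", "C": "y", "D": "y"}
close = {("A", "B"): 0.2, ("C", "D"): 0.1}
dist = {a: {b: 0.0 if a == b else close.get((a, b), close.get((b, a), 0.9))
            for b in tables} for a in tables}

stats_q1, acc_q1 = neighbor_accuracy(dist, category, q=1)
stats_q2, acc_q2 = neighbor_accuracy(dist, category, q=2)
print(acc_q1, acc_q2)
```

With q = 1 every table's nearest neighbor shares its category, while q = 2 forces one out-of-category neighbor per table, so the pooled ratio drops.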

Clustering Algorithms

- Accuracy of a summary
  - TPCE is pre-classified into 4 categories: Broker, Customer, Market and Dimension
  - $m(C_i)$: number of tables in $C_i$ with the same category as $center(C_i)$
  - Given a summary $C = \{C_1, C_2, \dots, C_k\}$, calculate

- Clustering algorithms compared, based on $I_C$ and $dist_c$ (coverage)
  - Balanced-Summary (BS) [1]
  - Weighted k-Center (WKC)

WKC is more accurate

Summarization Algorithms

- Weighted k-Center over three distance functions
- Summary accuracy

$dist_s$: most balanced and accurate

Related Work

- C. Yu and H. V. Jagadish. Schema summarization. VLDB’06
- H. Tong, C. Faloutsos and Y. Koren. Fast direction-aware proximity for graph mining. KDD’07
- W. Wu, B. Reinwald, Y. Sismanis and R. Manjrekar. Discovering topical structures of databases. SIGMOD’08

Conclusions

- We proposed a novel approach for summarizing relational schemas
  - A new model for table importance
  - A metric distance over schema tables
  - A summarization algorithm

Q&A
