
Information Theory For Data Management

Divesh Srivastava

Suresh Venkatasubramanian


Motivation l.jpg
Motivation

-- Abstruse Goose (177)

Information Theory is relevant to all of humanity...


Background l.jpg
Background

  • Many problems in data management need precise reasoning about information content, transfer and loss

    • Structure Extraction

    • Privacy preservation

    • Schema design

    • Probabilistic data ?


Information theory l.jpg
Information Theory

  • First developed by Shannon as a way of quantifying capacity of signal channels.

  • Entropy, relative entropy and mutual information capture intrinsic informational aspects of a signal

  • Today:

    • Information theory provides a domain-independent way to reason about structure in data

    • More information = interesting structure

    • Less information linkage = decoupling of structures


Tutorial thesis l.jpg
Tutorial Thesis

Information theory provides a mathematical framework for the quantification of information content, linkage and loss.

This framework can be used in the design of data management strategies that rely on probing the structure of information in data.


Tutorial goals l.jpg
Tutorial Goals

  • Introduce information-theoretic concepts to VLDB audience

  • Give a ‘data-centric’ perspective on information theory

  • Connect these to applications in data management

  • Describe underlying computational primitives

    Illuminate when and how information theory might be of use in new areas of data management.


Outline l.jpg
Outline

Part 1

Introduction to Information Theory

Application: Data Anonymization

Application: Data Integration

Part 2

Review of Information Theory Basics

Application: Database Design

Computing Information Theoretic Primitives

Open Problems



Histograms And Discrete Distributions

Column of data → aggregate counts (Histogram) → normalize (Probability distribution)

  X    f(X)   p(X)
  x1   4      0.5
  x2   2      0.25
  x3   1      0.125
  x4   1      0.125
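To make this pipeline concrete, here is a small Python sketch (not part of the original tutorial): it aggregates a column into counts and normalizes them into an empirical distribution. The column below simply matches the slide's counts (x1 four times, x2 twice, x3 and x4 once); names are illustrative.

```python
from collections import Counter

def empirical_distribution(column):
    """Aggregate counts over a column and normalize to a probability distribution."""
    counts = Counter(column)                      # histogram f(X)
    n = sum(counts.values())
    return {x: c / n for x, c in counts.items()}  # p(X)

column = ["x1", "x1", "x3", "x2", "x4", "x1", "x2", "x1"]
print(empirical_distribution(column))
# {'x1': 0.5, 'x3': 0.125, 'x2': 0.25, 'x4': 0.125}
```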


Histograms And Discrete Distributions

Column of data → aggregate counts (Histogram) → reweight, normalize (Probability distribution)

  X    f(X)   p(X)
  x1   4      0.667
  x2   2      0.2
  x3   1      0.067
  x4   1      0.067


From columns to random variables l.jpg
From Columns To Random Variables

  • We can think of a column of data as “represented” by a random variable:

    • X is a random variable

    • p(X) is the column of probabilities p(X = x1), p(X = x2), and so on

    • Also known (in unweighted case) as the empirical distribution induced by the column X.

  • Notation:

    • X (upper case) denotes a random variable (column)

    • x (lower case) denotes a value taken by X (field in a tuple)

    • p(x) is the probability p(X = x)


Joint distributions l.jpg
Joint Distributions

Discrete distribution: probability p(X,Y,Z)

p(Y) = ∑x p(X=x,Y) = ∑x ∑z p(X=x,Y,Z=z)



Entropy Of A Column

Let h(x) = log2 1/p(x)

h(X) is the column of h(x) values.

H(X) = EX[h(X)] = ∑x p(x) log2 1/p(x)

Two views of entropy:

It captures uncertainty in data: higher entropy, more unpredictability.

It captures information content: higher entropy, more information.

For the example distribution above, H(X) = 1.75 < log2 |X| = 2
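A short Python sketch (not from the tutorial) that computes H(X) for the empirical distribution above and reproduces H(X) = 1.75:

```python
import math

def entropy(p):
    """H(X) = sum_x p(x) * log2(1/p(x)), skipping zero-probability values."""
    return sum(px * math.log2(1.0 / px) for px in p.values() if px > 0)

p_X = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}
print(entropy(p_X))            # 1.75
print(math.log2(len(p_X)))     # 2.0 = log2 |X|, the maximum over 4 values
```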


Examples l.jpg
Examples

  • X uniform over [1, ..., 4]. H(X) = 2

  • Y is 1 with probability 0.5, in [2,3,4] uniformly.

    • H(Y) = 0.5 log 2 + 0.5 log 6 ~= 1.8 < 2

    • Y is more sharply defined, and so has less uncertainty.

  • Z uniform over [1, ..., 8]. H(Z) = 3 > 2

    • Z spans a larger range, and captures more information



Comparing distributions l.jpg
Comparing Distributions

  • How do we measure difference between two distributions ?

  • Kullback-Leibler divergence:

    • dKL(p, q) = Ep[ h(q) – h(p) ] = ∑i pi log(pi/qi)

[Figure: a prior belief is transformed by an inference mechanism into a resulting belief.]


Comparing distributions15 l.jpg
Comparing Distributions

  • Kullback-Leibler divergence:

    • dKL(p, q) = Ep[ h(q) – h(p) ] = ∑i pi log(pi/qi)

    • dKL(p, q) >= 0

    • Captures extra information needed to capture p given q

    • Is asymmetric ! dKL(p, q) != dKL(q, p)

    • Is not a metric (does not satisfy triangle inequality)

  • There are other measures:

    • χ²-distance, variational distance, f-divergences, …
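A minimal Python sketch of dKL (illustrative, not from the tutorial); the two distributions p and q are made up and share the same support, and the convention 0·log(0/q) = 0 is used:

```python
import math

def kl_divergence(p, q):
    """d_KL(p, q) = sum_i p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / q[x]) for x, pi in p.items() if pi > 0)

p = {"a": 0.9, "b": 0.1}
q = {"a": 0.5, "b": 0.5}
print(kl_divergence(p, q))  # ~0.531
print(kl_divergence(q, p))  # ~0.737  (asymmetric: d_KL(p, q) != d_KL(q, p))
```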


Conditional probability l.jpg
Conditional Probability

  • Given a joint distribution on random variables X, Y, how much information about X can we glean from Y ?

  • Conditional probability: p(X|Y)

    • p(X = x1 | Y = y1) = p(X = x1, Y = y1)/p(Y = y1)


Conditional entropy l.jpg
Conditional Entropy

  • Let h(x|y) = log2 1/p(x|y)

  • H(X|Y) = Ex,y[h(x|y)] = ∑x∑y p(x,y) log2 1/p(x|y)

  • H(X|Y) = H(X,Y) – H(Y)

  • H(X|Y) = H(X,Y) – H(Y) = 2.25 – 1.5 = 0.75

  • If X, Y are independent, H(X|Y) = H(X)


Mutual information l.jpg
Mutual Information

  • Mutual information captures the difference between the joint distribution on X and Y, and the marginal distributions on X and Y.

  • Let i(x;y) = log p(x,y)/p(x)p(y)

  • I(X;Y) = Ex,y[i(x;y)] = ∑x∑y p(x,y) log p(x,y)/p(x)p(y)


Mutual information strength of linkage l.jpg
Mutual Information: Strength of linkage

  • I(X;Y) = H(X) + H(Y) – H(X,Y) = H(X) – H(X|Y) = H(Y) – H(Y|X)

  • If X, Y are independent, then I(X;Y) = 0:

    • H(X,Y) = H(X) + H(Y), so I(X;Y) = H(X) + H(Y) – H(X,Y) = 0

  • I(X;Y) <= min(H(X), H(Y))

    • Suppose Y = f(X) (deterministically)

    • Then H(Y|X) = 0, and so I(X;Y) = H(Y) – H(Y|X) = H(Y)

  • Mutual information captures higher-order interactions:

    • Covariance captures “linear” interactions only

    • Two variables can be uncorrelated (covariance = 0) and have nonzero mutual information:

    • X R [-1,1], Y = X2. Cov(X,Y) = 0, I(X;Y) = H(X) > 0


Information theoretic clustering l.jpg
Information-Theoretic Clustering

  • Clustering takes a collection of objects and groups them.

    • Given a distance function between objects

    • Choice of measure of complexity of clustering

    • Choice of measure of cost for a cluster

  • Usually,

    • Distance function is Euclidean distance

    • Number of clusters is measure of complexity

    • Cost measure for cluster is sum-of-squared-distance to center

  • Goal: minimize complexity and cost

    • Inherent tradeoff between two


Feature Representation

Let V = {v1, v2, v3, v4}

X is “explained” by a distribution over V.

Column of data → aggregate counts (Histogram) → normalize (Probability distribution)

  V    f(V)   p(V)
  v1   4      0.5
  v2   2      0.25
  v3   1      0.125
  v4   1      0.125

“Feature vector” of X is [0.5, 0.25, 0.125, 0.125]


Feature Representation

[Figure: feature vectors of several columns viewed as distributions over V; e.g., p(v2|X2) = 0.2]


Information theoretic clustering23 l.jpg
Information-Theoretic Clustering

  • Clustering takes a collection of objects and groups them.

    • Given a distance function between objects

    • Choice of measure of complexity of clustering

    • Choice of measure of cost for a cluster

  • In information-theoretic setting

    • What is the distance function ?

    • How do we measure complexity ?

    • What is a notion of cost/quality ?

  • Goal: minimize complexity and maximize quality

    • Inherent tradeoff between two


Measuring complexity of clustering l.jpg
Measuring complexity of clustering

  • Take 1: complexity of a clustering = #clusters

    • standard model of complexity.

  • Doesn’t capture the fact that clusters have different sizes.


Measuring complexity of clustering25 l.jpg
Measuring complexity of clustering

  • Take 2: Complexity of clustering = number of bits needed to describe it.

  • Writing down “k” needs log k bits.

  • In general, let cluster t ∈ T have |t| elements.

    • set p(t) = |t|/n

    • #bits to write down cluster sizes = H(T) = ∑t p(t) log 1/p(t)



Information theoretic clustering take i l.jpg
Information-theoretic Clustering (take I)

  • Given data X = x1, ..., xn explained by variable V, partition X into clusters (represented by T) such that

    H(T) is minimized and quality is maximized


Soft clusterings l.jpg
Soft clusterings

  • In a “hard” clustering, each point is assigned to exactly one cluster.

  • Characteristic function

    • p(t|x) = 1 if x ∈ t, 0 if not.

  • Suppose we allow points to partially belong to clusters:

    • p(T|x) is a distribution.

    • p(t|x) is the “probability” of assigning x to t

      How do we describe the complexity of a clustering ?


Measuring complexity of clustering28 l.jpg
Measuring complexity of clustering

  • Take 1:

    • p(t) = ∑x p(x) p(t|x)

    • Compute H(T) as before.

  • Problem:

    H(T1) = H(T2) !!


Measuring complexity of clustering29 l.jpg
Measuring complexity of clustering

  • By averaging the memberships, we’ve lost useful information.

  • Take II: Compute I(T;X) !

  • Even better: If T is a hard clustering of X, then I(T;X) = H(T)

I(T2;X) = 0.46

I(T1;X) = 0
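A small Python sketch of this measure (the memberships below are hypothetical, not the numbers behind the T1/T2 example): I(T;X) is computed from the joint p(x, t) = p(x) p(t|x), and for a hard clustering it collapses to H(T).

```python
import math

def mutual_information(p_x, p_t_given_x):
    """I(T;X) from p(x) and rows p(T|x); p_t_given_x[x][t] is the membership of x in t."""
    p_t = {}
    for x, row in p_t_given_x.items():          # marginal p(t) = sum_x p(x) p(t|x)
        for t, v in row.items():
            p_t[t] = p_t.get(t, 0.0) + p_x[x] * v
    mi = 0.0
    for x, row in p_t_given_x.items():
        for t, v in row.items():
            if v > 0:
                mi += p_x[x] * v * math.log2(v / p_t[t])
    return mi

p_x = {"x1": 0.5, "x2": 0.5}
hard = {"x1": {"t1": 1.0}, "x2": {"t2": 1.0}}                        # hard clustering: I(T;X) = H(T) = 1
soft = {"x1": {"t1": 0.5, "t2": 0.5}, "x2": {"t1": 0.5, "t2": 0.5}}  # uninformative soft clustering
print(mutual_information(p_x, hard))  # 1.0
print(mutual_information(p_x, soft))  # 0.0
```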


Information theoretic clustering take ii l.jpg
Information-theoretic Clustering (take II)

  • Given data X = x1, ..., xn explained by variable V, partition X into clusters (represented by T) such that

I(T;X) is minimized and quality is maximized


Measuring cost of a cluster l.jpg
Measuring cost of a cluster

Given objects Xt = {X1, X2, …, Xm} in cluster t,

Cost(t) = (1/m) ∑i d(Xi, C) = ∑i p(Xi) dKL(p(V|Xi), C)

where C = (1/m) ∑i p(V|Xi) = ∑i p(Xi) p(V|Xi) = p(V)


Mutual information cost of cluster l.jpg
Mutual Information = Cost of Cluster

Cost(t) = (1/m) ∑i d(Xi, C) = ∑i p(Xi) dKL(p(V|Xi), p(V))

∑i p(Xi) dKL(p(V|Xi), p(V)) = ∑i p(Xi) ∑j p(vj|Xi) log p(vj|Xi)/p(vj)

= ∑i,j p(Xi, vj) log p(Xi, vj)/p(Xi)p(vj)

= I(Xt; V) !!

Cost of a cluster = I(Xt; V)


Cost of a clustering l.jpg
Cost of a clustering

  • If we partition X into k clusters X1, ..., Xk

    Cost(clustering) = ∑i pi I(Xi; V)

    (pi = |Xi|/|X|)


Cost of a clustering34 l.jpg
Cost of a clustering

  • Each cluster center t can be “explained” in terms of V:

    • p(V|t) = ∑i p(Xi) p(V|Xi)

  • Suppose we treat each cluster center itself as a point:


Cost of a clustering35 l.jpg
Cost of a clustering

  • We can write down the “cost” of this “cluster”

    • Cost(T) = I(T;V)

  • Key result [BMDG05] :

    Cost(clustering) = I(X;V) – I(T;V)

    Minimizing cost(clustering) ⇒ maximizing I(T;V)


Information theoretic clustering take iii l.jpg
Information-theoretic Clustering (take III)

  • Given data X = x1, ..., xn explained by variable V, partition X into clusters (represented by T) such that

    I(T;X) - bI(T;V) is minimized

  • This is the Information Bottleneck Method [TPB98]

  • Agglomerative techniques exist for the case of ‘hard’ clusterings

  • b is the tradeoff parameter between complexity and cost

  • I(T;X) and I(T;V) are in the same units.


Information theory summary l.jpg
Information Theory: Summary

  • We can represent data as discrete distributions (normalized histograms)

  • Entropy captures uncertainty or information content in a distribution

  • The Kullback-Leibler distance captures the difference between distributions

  • Mutual information and conditional entropy capture linkage between variables in a joint distribution

  • We can formulate information-theoretic clustering problems


Outline38 l.jpg
Outline

Part 1

Introduction to Information Theory

Application: Data Anonymization

Application: Data Integration

Part 2

Review of Information Theory Basics

Application: Database Design

Computing Information Theoretic Primitives

Open Problems


Data anonymization using randomization l.jpg
Data Anonymization Using Randomization

Goal: publish anonymized microdata to enable accurate ad hoc analyses, but ensure privacy of individuals’ sensitive attributes

Key ideas:

Randomize numerical data: add noise from known distribution

Reconstruct original data distribution using published noisy data

Issues:

How can the original data distribution be reconstructed?

What kinds of randomization preserve privacy of individuals?

39

Information Theory for Data Management - Divesh & Suresh


Data anonymization using randomization40 l.jpg
Data Anonymization Using Randomization

Many randomization strategies proposed [AS00, AA01, EGS03]

Example randomization strategies: X in [0, 10]

R = X + μ (mod 11), μ is uniform in {-1, 0, 1}

R = X + μ (mod 11), μ is in {-1 (p = 0.25), 0 (p = 0.5), 1 (p = 0.25)}

R = X (p = 0.6), R = μ, μ is uniform in [0, 10] (p = 0.4)

Question:

Which randomization strategy has higher privacy preservation?

Quantify loss of privacy due to publication of randomized data

40

Information Theory for Data Management - Divesh & Suresh


Data anonymization using randomization41 l.jpg
Data Anonymization Using Randomization

X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

41

Information Theory for Data Management - Divesh & Suresh


Data anonymization using randomization42 l.jpg
Data Anonymization Using Randomization

X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

42

Information Theory for Data Management - Divesh & Suresh


Data anonymization using randomization43 l.jpg
Data Anonymization Using Randomization

X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

43

Information Theory for Data Management - Divesh & Suresh


Reconstruction of original data distribution l.jpg
Reconstruction of Original Data Distribution

X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

Reconstruct distribution of X using knowledge of R1 and μ

EM algorithm converges to MLE of original distribution [AA01]

44

Information Theory for Data Management - Divesh & Suresh


Analysis of privacy as00 l.jpg
Analysis of Privacy [AS00]

X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

If X is uniform in [0, 10], privacy determined by range of μ

45

Information Theory for Data Management - Divesh & Suresh


Analysis of privacy aa01 l.jpg
Analysis of Privacy [AA01]

X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

If X is uniform in [0, 1] ∪ [5, 6], privacy smaller than range of μ

46

Information Theory for Data Management - Divesh & Suresh


Analysis of privacy aa0147 l.jpg
Analysis of Privacy [AA01]

X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

If X is uniform in [0, 1] ∪ [5, 6], privacy smaller than range of μ

In some cases, sensitive value revealed

47

Information Theory for Data Management - Divesh & Suresh


Quantify loss of privacy aa01 l.jpg
Quantify Loss of Privacy [AA01]

Goal: quantify loss of privacy based on mutual information I(X;R)

Smaller H(X|R) ⇒ more loss of privacy in X by knowledge of R

Larger I(X;R) ⇒ more loss of privacy in X by knowledge of R

I(X;R) = H(X) – H(X|R)

I(X;R) used to capture correlation between X and R

p(X) is the prior knowledge of sensitive attribute X

p(X, R) is the joint distribution of X and R

48

Information Theory for Data Management - Divesh & Suresh


Quantify loss of privacy aa0149 l.jpg
Quantify Loss of Privacy [AA01]

Goal: quantify loss of privacy based on mutual information I(X;R)

X is uniform in [5, 6], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

49

Information Theory for Data Management - Divesh & Suresh


Quantify loss of privacy aa0150 l.jpg
Quantify Loss of Privacy [AA01]

Goal: quantify loss of privacy based on mutual information I(X;R)

X is uniform in [5, 6], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

50

Information Theory for Data Management - Divesh & Suresh


Quantify loss of privacy aa0151 l.jpg
Quantify Loss of Privacy [AA01]

Goal: quantify loss of privacy based on mutual information I(X;R)

X is uniform in [5, 6], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

51

Information Theory for Data Management - Divesh & Suresh


Quantify loss of privacy aa0152 l.jpg
Quantify Loss of Privacy [AA01]

Goal: quantify loss of privacy based on mutual information I(X;R)

X is uniform in [5, 6], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

I(X;R) = 0.33

52

Information Theory for Data Management - Divesh & Suresh


Quantify loss of privacy aa0153 l.jpg
Quantify Loss of Privacy [AA01]

Goal: quantify loss of privacy based on mutual information I(X;R)

X is uniform in [5, 6], R2 = X + μ (mod 11), μ is uniform in {0, 1}

I(X;R1) = 0.33, I(X;R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
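These I(X;R) values can be reproduced with a short computation. Below is a Python sketch (not part of the tutorial) of the slide's setup: X uniform on {5, 6} and R = X + μ (mod 11); the helper names are my own.

```python
import math
from collections import defaultdict

def mutual_information(p_xr):
    p_x, p_r = defaultdict(float), defaultdict(float)
    for (x, r), v in p_xr.items():
        p_x[x] += v
        p_r[r] += v
    return sum(v * math.log2(v / (p_x[x] * p_r[r])) for (x, r), v in p_xr.items() if v > 0)

def randomized_joint(x_values, noise_values):
    """Joint p(X, R) for R = X + mu (mod 11), with X and mu uniform on the given sets."""
    p_xr = defaultdict(float)
    for x in x_values:
        for mu in noise_values:
            p_xr[(x, (x + mu) % 11)] += 1.0 / (len(x_values) * len(noise_values))
    return p_xr

print(mutual_information(randomized_joint([5, 6], [-1, 0, 1])))  # R1: ~0.33
print(mutual_information(randomized_joint([5, 6], [0, 1])))      # R2: 0.5
```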

53

Information Theory for Data Management - Divesh & Suresh


Quantify loss of privacy aa0154 l.jpg
Quantify Loss of Privacy [AA01]

Equivalent goal: quantify loss of privacy based on H(X|R)

X is uniform in [5, 6], R2 = X + μ (mod 11), μ is uniform in {0, 1}

Intuition: we know more about X given R2, than about X given R1

H(X|R1) = 0.67, H(X|R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1

54

Information Theory for Data Management - Divesh & Suresh


Quantify loss of privacy l.jpg
Quantify Loss of Privacy

Example: X is uniform in [0, 1]

R3 = e (p = 0.9999), R3 = X (p = 0.0001)

R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)

Is R3 or R4 a bigger privacy risk?

55

Information Theory for Data Management - Divesh & Suresh


Worst case loss of privacy egs03 l.jpg
Worst Case Loss of Privacy [EGS03]

Example: X is uniform in [0, 1]

R3 = e (p = 0.9999), R3 = X (p = 0.0001)

R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)

I(X;R3) = 0.0001 << I(X;R4) = 0.028

56

Information Theory for Data Management - Divesh & Suresh


Worst case loss of privacy egs0357 l.jpg
Worst Case Loss of Privacy [EGS03]

Example: X is uniform in [0, 1]

R3 = e (p = 0.9999), R3 = X (p = 0.0001)

R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)

I(X;R3) = 0.0001 << I(X;R4) = 0.028

But R3 has a larger worst case risk

57

Information Theory for Data Management - Divesh & Suresh


Worst case loss of privacy egs0358 l.jpg
Worst Case Loss of Privacy [EGS03]

Goal: quantify worst case loss of privacy in X by knowledge of R

Use max KL divergence, instead of mutual information

Mutual information can be formulated as expected KL divergence

I(X;R) = ∑x ∑r p(x,r)*log2(p(x,r)/(p(x)*p(r))) = KL(p(X,R) || p(X)*p(R))

I(X;R) = ∑r p(r) ∑x p(x|r)*log2(p(x|r)/p(x)) = ER [KL(p(X|r) || p(X))]

[AA01] measure quantifies expected loss of privacy over R

[EGS03] propose a measure based on worst case loss of privacy

IW(X;R) = maxr [KL(p(X|r) || p(X))]
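A Python sketch of the worst-case measure (illustrative, not the [EGS03] implementation): compute KL(p(X|r) || p(X)) for each observed value r and take the maximum. The joint distributions below encode the R3 and R4 examples from the previous slides; the constant output of R3 is written as the string "e".

```python
import math
from collections import defaultdict

def worst_case_privacy_loss(p_xr):
    """IW(X;R) = max over r of KL(p(X|r) || p(X))."""
    p_x, p_r = defaultdict(float), defaultdict(float)
    for (x, r), v in p_xr.items():
        p_x[x] += v
        p_r[r] += v
    worst = 0.0
    for r in p_r:
        kl = sum((v / p_r[r]) * math.log2((v / p_r[r]) / p_x[x])
                 for (x, rr), v in p_xr.items() if rr == r and v > 0)
        worst = max(worst, kl)
    return worst

# X uniform on {0, 1}; R3 reveals X with probability 0.0001, otherwise outputs a constant "e".
p_xr3 = {(0, "e"): 0.49995, (1, "e"): 0.49995, (0, 0): 0.00005, (1, 1): 0.00005}
# R4 outputs X with probability 0.6 and 1 - X with probability 0.4.
p_xr4 = {(0, 0): 0.3, (0, 1): 0.2, (1, 1): 0.3, (1, 0): 0.2}
print(worst_case_privacy_loss(p_xr3))  # 1.0   (a revealed value pins down X exactly)
print(worst_case_privacy_loss(p_xr4))  # ~0.029
```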

58

Information Theory for Data Management - Divesh & Suresh


Worst case loss of privacy egs0359 l.jpg
Worst Case Loss of Privacy [EGS03]

Example: X is uniform in [0, 1]

R3 = e (p = 0.9999), R3 = X (p = 0.0001)

R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)

IW(X;R3) = max{0.0, 1.0, 1.0} > IW(X;R4) = max{0.028, 0.028}

59

Information Theory for Data Management - Divesh & Suresh


Worst case loss of privacy egs0360 l.jpg
Worst Case Loss of Privacy [EGS03]

Example: X is uniform in [5, 6]

R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

R2 = X + μ (mod 11), μ is uniform in {0, 1}

IW(X;R1) = max{1.0, 0.0, 0.0, 1.0} = IW(X;R2) = max{1.0, 0.0, 1.0} = 1.0

Unable to capture that R2 is a bigger privacy risk than R1

60

Information Theory for Data Management - Divesh & Suresh


Data anonymization summary l.jpg
Data Anonymization: Summary

Randomization techniques useful for microdata anonymization

Randomization techniques differ in their loss of privacy

Information theoretic measures useful to capture loss of privacy

Expected KL divergence captures expected loss of privacy [AA01]

Maximum KL divergence captures worst case loss of privacy [EGS03]

Both are useful in practice

61

Information Theory for Data Management - Divesh & Suresh


Outline62 l.jpg
Outline

Part 1

Introduction to Information Theory

Application: Data Anonymization

Application: Data Integration

Part 2

Review of Information Theory Basics

Application: Database Design

Computing Information Theoretic Primitives

Open Problems

Information Theory for Data Management - Divesh & Suresh


Schema matching l.jpg
Schema Matching

Goal: align columns across database tables to be integrated

Fundamental problem in database integration

Early useful approach: textual similarity of column names

False positives: Address ≠ IP_Address

False negatives: Customer_Id = Client_Number

Early useful approach: overlap of values in columns, e.g., Jaccard

False positives: Emp_Id ≠ Project_Id

False negatives: Emp_Id = Personnel_Number

63

Information Theory for Data Management - Divesh & Suresh


Opaque schema matching kn03 l.jpg
Opaque Schema Matching [KN03]

Goal: align columns when column names, data values are opaque

Databases belong to different government bureaucracies 

Treat column names and data values as uninterpreted (generic)

Example: EMP_PROJ(Emp_Id, Proj_Id, Task_Id, Status_Id)

Likely that all Id fields are from the same domain

Different databases may have different column names

64

Information Theory for Data Management - Divesh & Suresh


Opaque schema matching kn0365 l.jpg
Opaque Schema Matching [KN03]

Approach: build complete, labeled graph GD for each database D

Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)

Perform graph matching between GD1 and GD2, minimizing distance

Intuition:

Entropy H(X) captures distribution of values in database column X

Mutual information I(X;Y) captures correlations between X, Y

Efficiency: graph matching between schema-sized graphs
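A minimal Python sketch of the graph-construction step (my own illustration, not the [KN03] implementation): nodes carry H(column) and edges carry I(column_i; column_j), estimated from the table's empirical distributions. The tiny table is made up.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(values):
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

def mutual_information(col_x, col_y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from co-occurring values.
    return entropy(col_x) + entropy(col_y) - entropy(list(zip(col_x, col_y)))

def build_labeled_graph(table):
    """table: dict column_name -> list of values (all columns the same length)."""
    nodes = {name: entropy(col) for name, col in table.items()}
    edges = {(a, b): mutual_information(table[a], table[b])
             for a, b in combinations(table, 2)}
    return nodes, edges

table = {"A": [1, 1, 2, 2], "B": [10, 20, 30, 40], "C": ["x", "x", "y", "y"]}
nodes, edges = build_labeled_graph(table)
print(nodes)   # entropy per column, e.g. H(B) = 2.0
print(edges)   # mutual information per column pair, e.g. I(A;C) = 1.0
```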

65

Information Theory for Data Management - Divesh & Suresh


Opaque schema matching kn0366 l.jpg
Opaque Schema Matching [KN03]

Approach: build complete, labeled graph GD for each database D

Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)

66

Information Theory for Data Management - Divesh & Suresh


Opaque schema matching kn0367 l.jpg
Opaque Schema Matching [KN03]

Approach: build complete, labeled graph GD for each database D

Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)

H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5

67

Information Theory for Data Management - Divesh & Suresh


Opaque schema matching kn0368 l.jpg
Opaque Schema Matching [KN03]

Approach: build complete, labeled graph GD for each database D

Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)

H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5, I(A;B) = 1.5

68

Information Theory for Data Management - Divesh & Suresh


Opaque schema matching kn0369 l.jpg
Opaque Schema Matching [KN03]

Approach: build complete, labeled graph GD for each database D

Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)

[Figure: the complete labeled graph for this instance; nodes A, B, C, D are labeled with their entropies (H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5) and edges with pairwise mutual information values (e.g., I(A;B) = 1.5).]


Opaque schema matching kn0370 l.jpg
Opaque Schema Matching [KN03]

Approach: build complete, labeled graph GD for each database D

Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)

Perform graph matching between GD1 and GD2, minimizing distance

[KN03] uses euclidean and normal distance metrics

[Figure: graph matching between GD1 (nodes A, B, C, D) and GD2 (nodes W, X, Y, Z); nodes and edges carry entropy and mutual information labels.]




Heterogeneity identification dkosv06 l.jpg
Heterogeneity Identification [DKOSV06]

Goal: identify columns with semantically heterogeneous values

Can arise due to opaque schema matching [KN03]

Key ideas:

Heterogeneity based on distribution, distinguishability of values

Use Information Bottleneck to compute soft clustering of values

Issues:

Which information theoretic measure characterizes heterogeneity?

How to set parameters in the Information Bottleneck method?

73

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0674 l.jpg
Heterogeneity Identification [DKOSV06]

Example: semantically homogeneous, heterogeneous columns

74

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0675 l.jpg
Heterogeneity Identification [DKOSV06]

Example: semantically homogeneous, heterogeneous columns

75

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0676 l.jpg
Heterogeneity Identification [DKOSV06]

Example: semantically homogeneous, heterogeneous columns

More semantic types in column ⇒ greater heterogeneity

Only email versus email + phone

76

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0677 l.jpg
Heterogeneity Identification [DKOSV06]

Example: semantically homogeneous, heterogeneous columns

77

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0678 l.jpg
Heterogeneity Identification [DKOSV06]

Example: semantically homogeneous, heterogeneous columns

Relative distribution of semantic types impacts heterogeneity

Mainly email + few phone versus balanced email + phone

78

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0679 l.jpg
Heterogeneity Identification [DKOSV06]

Example: semantically homogeneous, heterogeneous columns

79

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0680 l.jpg
Heterogeneity Identification [DKOSV06]

Example: semantically homogeneous, heterogeneous columns

80

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0681 l.jpg
Heterogeneity Identification [DKOSV06]

Example: semantically homogeneous, heterogeneous columns

More easily distinguished types ⇒ greater heterogeneity

Phone + (possibly) SSN versus balanced email + phone

81

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0682 l.jpg
Heterogeneity Identification [DKOSV06]

Heterogeneity = space complexity of soft clustering of the data

More, balanced clusters ⇒ greater heterogeneity

More distinguishable clusters ⇒ greater heterogeneity

Soft clustering

Soft ⇒ assign probabilities to membership of values in clusters

How many clusters: tradeoff between space versus quality

Use Information Bottleneck to compute soft clustering of values

82

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0683 l.jpg
Heterogeneity Identification [DKOSV06]

Hard clustering

83

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0684 l.jpg
Heterogeneity Identification [DKOSV06]

Soft clustering: cluster membership probabilities

How to compute a good soft clustering?

84

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0685 l.jpg
Heterogeneity Identification [DKOSV06]

Represent strings as q-gram distributions
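A small Python sketch of the q-gram representation (my own minimal helper, with q = 2, and a made-up column): each string value is mapped to a distribution over its q-grams, which plays the role of its distribution over the feature variable V.

```python
from collections import Counter

def qgram_distribution(s, q=2):
    """Distribution over the q-grams of a single string value."""
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    counts = Counter(grams)
    total = max(len(grams), 1)
    return {g: c / total for g, c in counts.items()}

column = ["john@host.com", "mary@host.com", "973-360-0000"]
features = {s: qgram_distribution(s) for s in column}
print(features["973-360-0000"])   # phone-like values concentrate on digit q-grams
```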

85

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0686 l.jpg
Heterogeneity Identification [DKOSV06]

iIB: find soft clustering T of X that minimizes I(T;X) – β*I(T;V)

Allow iIB to use arbitrarily many clusters, use β* = H(X)/I(X;V)

Closest to point with minimum space and maximum quality

86

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0687 l.jpg
Heterogeneity Identification [DKOSV06]

[Figure: rate distortion curve, I(T;V)/I(X;V) vs I(T;X)/H(X), with the operating point at β* marked.]

87

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0688 l.jpg
Heterogeneity Identification [DKOSV06]

Heterogeneity = mutual information I(T;X) of iIB clustering T at β*

0 ≤ I(T;X) (= 0.126) ≤ H(X) (= 2.0), H(T) (= 1.0)

Ideally use iIB with an arbitrarily large number of clusters in T

88

Information Theory for Data Management - Divesh & Suresh


Heterogeneity identification dkosv0689 l.jpg
Heterogeneity Identification [DKOSV06]

Heterogeneity = mutual information I(T;X) of iIB clustering T at β*

89

Information Theory for Data Management - Divesh & Suresh


Data integration summary l.jpg
Data Integration: Summary

Analyzing database instance critical for effective data integration

Matching and quality assessments are key components

Information theoretic measures useful for schema matching

Align columns when column names, data values are opaque

Mutual information I(X;V) captures correlations between X, V

Information theoretic measures useful for heterogeneity testing

Identify columns with semantically heterogeneous values

I(T;X) of iIB clustering T at β* captures column heterogeneity

90

Information Theory for Data Management - Divesh & Suresh


Outline91 l.jpg
Outline

Part 1

Introduction to Information Theory

Application: Data Anonymization

Application: Data Integration

Part 2

Review of Information Theory Basics

Application: Database Design

Computing Information Theoretic Primitives

Open Problems

Information Theory for Data Management - Divesh & Suresh


Review of information theory basics l.jpg
Review of Information Theory Basics

Discrete distribution: probability p(X)

p(X,Y) = ∑z p(X,Y,Z=z)

92

Information Theory for Data Management - Divesh & Suresh


Review of information theory basics93 l.jpg
Review of Information Theory Basics

Discrete distribution: probability p(X)

p(Y) = ∑x p(X=x,Y) = ∑x ∑z p(X=x,Y,Z=z)

93

Information Theory for Data Management - Divesh & Suresh


Review of information theory basics94 l.jpg
Review of Information Theory Basics

Discrete distribution: conditional probability p(X|Y)

p(X,Y) = p(X|Y)*p(Y) = p(Y|X)*p(X)

94

Information Theory for Data Management - Divesh & Suresh


Review of information theory basics95 l.jpg
Review of Information Theory Basics

Discrete distribution: entropy H(X)

h(x) = log2(1/p(x))

H(X) = ∑X=x p(x)*h(x) = 1.75

H(Y) = ∑Y=y p(y)*h(y) = 1.5 (≤ log2(|Y|) = 1.58)

H(X,Y) = ∑X=x ∑Y=y p(x,y)*h(x,y) = 2.25 (≤ log2(|X,Y|) = 2.32)

95

Information Theory for Data Management - Divesh & Suresh


Review of information theory basics96 l.jpg
Review of Information Theory Basics

Discrete distribution: conditional entropy H(X|Y)

h(x|y) = log2(1/p(x|y))

H(X|Y) = ∑X=x ∑Y=y p(x,y)*h(x|y) = 0.75

H(X|Y) = H(X,Y) – H(Y) = 2.25 – 1.5

96

Information Theory for Data Management - Divesh & Suresh


Review of information theory basics97 l.jpg
Review of Information Theory Basics

Discrete distribution: mutual information I(X;Y)

i(x;y) = log2(p(x,y)/p(x)*p(y))

I(X;Y) = ∑X=x ∑Y=y p(x,y)*i(x;y) = 1.0

I(X;Y) = H(X) + H(Y) – H(X,Y) = 1.75 + 1.5 – 2.25

97

Information Theory for Data Management - Divesh & Suresh


Outline98 l.jpg
Outline

Part 1

Introduction to Information Theory

Application: Data Anonymization

Application: Data Integration

Part 2

Review of Information Theory Basics

Application: Database Design

Computing Information Theoretic Primitives

Open Problems

Information Theory for Data Management - Divesh & Suresh


Information dependencies dr00 l.jpg
Information Dependencies [DR00]

Goal: use information theory to examine and reason about information content of the attributes in a relation instance

Key ideas:

Novel InD measure between attribute sets X, Y based on H(Y|X)

Identify numeric inequalities between InD measures

Results:

InD measures are a broader class than FDs and MVDs

Armstrong axioms for FDs derivable from InD inequalities

MVD inference rules derivable from InD inequalities

99

Information Theory for Data Management - Divesh & Suresh


Information dependencies dr00100 l.jpg
Information Dependencies [DR00]

Functional dependency: X → Y

FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y]))

100

Information Theory for Data Management - Divesh & Suresh


Information dependencies dr00101 l.jpg
Information Dependencies [DR00]

Functional dependency: X → Y

FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y]))

101

Information Theory for Data Management - Divesh & Suresh


Information dependencies dr00102 l.jpg
Information Dependencies [DR00]

Result: FD X → Y holds iff H(Y|X) = 0

Intuition: once X is known, no remaining uncertainty in Y

H(Y|X) = 0.5
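This test is easy to run on a relation instance. Below is a Python sketch (illustrative, with made-up rows rather than the slide's table): it computes H(Y|X) = H(X,Y) − H(X) from the empirical distribution and checks whether it is (close to) zero.

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

def fd_holds(rows, x_attrs, y_attrs, tol=1e-9):
    """FD X -> Y holds on this instance iff H(Y|X) = H(X,Y) - H(X) = 0."""
    xs = [tuple(r[a] for a in x_attrs) for r in rows]
    xys = [tuple(r[a] for a in x_attrs + y_attrs) for r in rows]
    return entropy(xys) - entropy(xs) <= tol

rows = [{"X": 1, "Y": "a"}, {"X": 1, "Y": "a"}, {"X": 2, "Y": "b"}, {"X": 3, "Y": "b"}]
print(fd_holds(rows, ["X"], ["Y"]))   # True: each X value determines Y
rows[1]["Y"] = "c"
print(fd_holds(rows, ["X"], ["Y"]))   # False: H(Y|X) > 0
```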

102

Information Theory for Data Management - Divesh & Suresh


Information dependencies dr00103 l.jpg
Information Dependencies [DR00]

Multi-valued dependency: X →→ Y

MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)

103

Information Theory for Data Management - Divesh & Suresh


Information dependencies dr00104 l.jpg
Information Dependencies [DR00]

Multi-valued dependency: X →→ Y

MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)


104

Information Theory for Data Management - Divesh & Suresh


Information dependencies dr00105 l.jpg
Information Dependencies [DR00]

Multi-valued dependency: X →→ Y

MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)


105

Information Theory for Data Management - Divesh & Suresh


Information dependencies dr00106 l.jpg
Information Dependencies [DR00]

Result: MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)

Intuition: once X known, uncertainties in Y and Z are independent

H(Y|X) = 0.5, H(Z|X) = 0.75, H(Y,Z|X) = 1.25
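A companion Python sketch for the MVD test (again on a made-up instance; the entropy helper is repeated so the snippet is self-contained): check whether H(Y,Z|X) = H(Y|X) + H(Z|X).

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

def mvd_holds(rows, x_attrs, y_attrs, z_attrs, tol=1e-9):
    """MVD X ->> Y holds on this instance iff H(Y,Z|X) = H(Y|X) + H(Z|X)."""
    proj = lambda attrs: [tuple(r[a] for a in attrs) for r in rows]
    h = lambda attrs: entropy(proj(attrs))
    h_yz_x = h(x_attrs + y_attrs + z_attrs) - h(x_attrs)
    h_y_x = h(x_attrs + y_attrs) - h(x_attrs)
    h_z_x = h(x_attrs + z_attrs) - h(x_attrs)
    return abs(h_yz_x - (h_y_x + h_z_x)) <= tol

# X ->> Y holds here: for each X value, every Y value pairs with every Z value.
rows = [{"X": 1, "Y": "a", "Z": 10}, {"X": 1, "Y": "a", "Z": 20},
        {"X": 1, "Y": "b", "Z": 10}, {"X": 1, "Y": "b", "Z": 20}]
print(mvd_holds(rows, ["X"], ["Y"], ["Z"]))   # True
rows.pop()                                    # break the cross product
print(mvd_holds(rows, ["X"], ["Y"], ["Z"]))   # False
```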


106

Information Theory for Data Management - Divesh & Suresh


Information dependencies dr00107 l.jpg
Information Dependencies [DR00]

Result: Armstrong axioms for FDs derivable from InD inequalities

Reflexivity: If Y ⊆ X, then X → Y

H(Y|X) = 0 for Y ⊆ X

Augmentation: X → Y ⇒ X,Z → Y,Z

0 ≤ H(Y,Z|X,Z) = H(Y|X,Z) ≤ H(Y|X) = 0

Transitivity: X → Y & Y → Z ⇒ X → Z

0 ≥ H(Y|X) + H(Z|Y) ≥ H(Z|X) ≥ 0

107

Information Theory for Data Management - Divesh & Suresh


Database normal forms l.jpg
Database Normal Forms

Goal: eliminate update anomalies by good database design

Need to know the integrity constraints on all database instances

Boyce-Codd normal form:

Input: a set ∑ of functional dependencies

For every (non-trivial) FD R.X → R.Y ∈ ∑+, R.X is a key of R

4NF:

Input: a set ∑ of functional and multi-valued dependencies

For every (non-trivial) MVD R.X →→ R.Y ∈ ∑+, R.X is a key of R

108

Information Theory for Data Management - Divesh & Suresh


Database normal forms109 l.jpg
Database Normal Forms

Functional dependency: X → Y

Which design is better?


109

Information Theory for Data Management - Divesh & Suresh


Database normal forms110 l.jpg
Database Normal Forms

Functional dependency: X → Y

Which design is better?

Decomposition is in BCNF


110

Information Theory for Data Management - Divesh & Suresh


Database normal forms111 l.jpg
Database Normal Forms

Multi-valued dependency: X →→ Y

Which design is better?


111

Information Theory for Data Management - Divesh & Suresh


Database normal forms112 l.jpg
Database Normal Forms

Multi-valued dependency: X →→ Y

Which design is better?

Decomposition is in 4NF


112

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03 l.jpg
Well-Designed Databases [AL03]

Goal: use information theory to characterize “goodness” of a database design and reason about normalization algorithms

Key idea:

Information content measure of cell in a DB instance w.r.t. ICs

Redundancy reduces information content measure of cells

Results:

Well-designed DB ⇒ each cell has information content > 0

Normalization algorithms never decrease information content

113

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03114 l.jpg
Well-Designed Databases [AL03]

Information content of cell c in database D satisfying FD X → Y

Uniform distribution p(V) on values for c consistent with D\c and FD

Information content of cell c is entropy H(V)

H(V62) = 2.0

114

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03115 l.jpg
Well-Designed Databases [AL03]

Information content of cell c in database D satisfying FD X → Y

Uniform distribution p(V) on values for c consistent with D\c and FD

Information content of cell c is entropy H(V)

H(V22) = 0.0

115

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03116 l.jpg
Well-Designed Databases [AL03]

Information content of cell c in database D satisfying FD X → Y

Information content of cell c is entropy H(V)

Schema S is in BCNF iff for every instance D of S, H(V) > 0 for all cells c in D

Technicalities w.r.t. size of active domain

116

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03117 l.jpg
Well-Designed Databases [AL03]

Information content of cell c in database D satisfying FD X → Y

Information content of cell c is entropy H(V)

H(V12) = 2.0, H(V42) = 2.0

117

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03118 l.jpg
Well-Designed Databases [AL03]

Information content of cell c in database D satisfying FD X → Y

Information content of cell c is entropy H(V)

Schema S is in BCNF iff for every instance D of S, H(V) > 0 for all cells c in D

118

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03119 l.jpg
Well-Designed Databases [AL03]

Information content of cell c in DB D satisfying MVD X →→ Y

Information content of cell c is entropy H(V)

H(V52) = 0.0, H(V53) = 2.32

119

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03120 l.jpg
Well-Designed Databases [AL03]

Information content of cell c in DB D satisfying MVD X →→ Y

Information content of cell c is entropy H(V)

Schema S is in 4NF iff for every instance D of S, H(V) > 0 for all cells c in D

120

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03121 l.jpg
Well-Designed Databases [AL03]

Information content of cell c in DB D satisfying MVD X →→ Y

Information content of cell c is entropy H(V)

H(V32) = 1.58, H(V34) = 2.32

121

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03122 l.jpg
Well-Designed Databases [AL03]

Information content of cell c in DB D satisfying MVD X →→ Y

Information content of cell c is entropy H(V)

Schema S is in 4NF iff for every instance D of S, H(V) > 0 for all cells c in D

122

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03123 l.jpg
Well-Designed Databases [AL03]

Normalization algorithms never decrease information content

Information content of cell c is entropy H(V)

123

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03124 l.jpg
Well-Designed Databases [AL03]

Normalization algorithms never decrease information content

Information content of cell c is entropy H(V)


124

Information Theory for Data Management - Divesh & Suresh


Well designed databases al03125 l.jpg
Well-Designed Databases [AL03]

Normalization algorithms never decrease information content

Information content of cell c is entropy H(V)


125

Information Theory for Data Management - Divesh & Suresh


Database design summary l.jpg
Database Design: Summary

Good database design essential for preserving data integrity

Information theoretic measures useful for integrity constraints

FD X → Y holds iff InD measure H(Y|X) = 0

MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)

Information theory to model correlations in specific database

Information theoretic measures useful for normal forms

Schema S is in BCNF/4NF iff for every instance D of S, H(V) > 0 for all cells c in D

Information theory to model distributions over possible databases

126

Information Theory for Data Management - Divesh & Suresh


Outline127 l.jpg
Outline

Part 1

Introduction to Information Theory

Application: Data Anonymization

Application: Data Integration

Part 2

Review of Information Theory Basics

Application: Database Design

Computing Information Theoretic Primitives

Open Problems

Information Theory for Data Management - Divesh & Suresh



Domain size matters l.jpg
Domain size matters

  • For random variable X, the domain is supp(X) = {xi | p(X = xi) > 0}, and domain size = |supp(X)|

  • Different solutions exist depending on whether domain size is “small” or “large”

  • Probability vectors usually very sparse


Entropy case i small domain size l.jpg
Entropy: Case I - Small domain size

  • Suppose the number of unique values for a random variable X is small (i.e., fits in memory)

  • Maximum likelihood estimator:

    • p(x) = #times x is encountered/total number of items in set.

[Example: a column with values (1, 2, 1, 2, 1, 5, 4) over the domain {1, 2, 3, 4, 5}.]


Entropy case i small domain size130 l.jpg
Entropy: Case I - Small domain size

  • HMLE = ∑x p(x) log 1/p(x)

  • This is a biased estimate:

    • E[HMLE] < H

  • Miller-Madow correction:

    • H’ = HMLE + (m’ – 1)/2n

      • m’ is an estimate of number of non-empty bins

      • n = number of samples

  • Bad news: ALL estimators for H are biased.

  • Good news: we can quantify bias and variance of MLE:

    • Bias <= log(1 + m/N)

    • Var(HMLE) <= (log n)²/N
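A Python sketch of the plug-in estimator and the Miller-Madow correction as stated above (the sample is synthetic and the helper names are mine):

```python
import math
import random
from collections import Counter

def entropy_mle(sample):
    """Plug-in (maximum likelihood) entropy estimate from a sample."""
    n = len(sample)
    return sum((c / n) * math.log2(n / c) for c in Counter(sample).values())

def entropy_miller_madow(sample):
    """H' = H_MLE + (m' - 1) / (2n), with m' = number of non-empty bins."""
    n = len(sample)
    m_nonempty = len(set(sample))
    return entropy_mle(sample) + (m_nonempty - 1) / (2 * n)

random.seed(0)
sample = [random.randint(1, 8) for _ in range(50)]   # true H = 3 bits for uniform on 8 values
print(entropy_mle(sample))           # biased low
print(entropy_miller_madow(sample))  # bias partially corrected
```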


Entropy case ii large domain size l.jpg
Entropy: Case II - Large domain size

  • |X| is too large to fit in main memory, so we can’t maintain explicit counts.

  • Streaming algorithms for H(X):

    • Long history of work on this problem

    • Bottomline:

      (1+ε)-relative approximation for H(X) that allows for updates to frequencies, and requires “almost constant” and optimal space [HNO08].


Streaming entropy ccm07 l.jpg
Streaming Entropy [CCM07]

  • High level idea: sample randomly from the stream, and track counts of elements picked [AMS]

  • PROBLEM: skewed distribution prevents us from sampling lower-frequency elements (and entropy is small)

  • Idea: estimate largest frequency, and

    distribution of what’s left (higher entropy)


Streaming entropy ccm07133 l.jpg
Streaming Entropy [CCM07]

  • Maintain set of samples from original distribution and distribution without most frequent element.

  • In parallel, maintain estimator for frequency of most frequent element

    • normally this is hard

    • but if frequency is very large, then simple estimator exists [MG81] (Google interview puzzle!)

  • At the end, compute function of these two estimates

  • Memory usage: roughly (1/ε²) log(1/ε) (ε is the error)


Entropy and mi are related l.jpg
Entropy and MI are related

  • I(X;Y) = H(X) + H(Y) – H(X,Y)

  • Suppose we can c-approximate H(X) for any c > 0:

    Find H’(X) s.t. |H(X) – H’(X)| <= c

  • Then we can 3c-approximate I(X;Y):

    • I(X;Y) = H(X) + H(Y) – H(X,Y)

      <= (H’(X)+c) + (H’(Y)+c) – (H’(X,Y)–c)

      = H’(X) + H’(Y) – H’(X,Y) + 3c

      = I’(X;Y) + 3c

  • Similarly, we can 2c-approximate H(Y|X) = H(X,Y) – H(X)

  • Estimating entropy allows us to estimate I(X;Y) and H(Y|X)


Computing kl divergence small domains l.jpg
Computing KL-divergence: Small Domains

  • “easy algorithm”: maintain counts for each of p and q, normalize, and compute KL-divergence.

  • PROBLEM ! Suppose qi = 0:

    • pi log pi/qi is undefined !

  • General problem with ML estimators: all events not seen have probability zero !!

    • Laplace correction: add one to counts for each seen element

    • Slightly better: add 0.5 to counts for each seen element [KT81]

    • Even better, more involved: use Good-Turing estimator [GT53]

  • Yield non-zero probability for “things not seen”.
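A Python sketch of the small-domain estimate with additive smoothing (Laplace add-1, or the add-0.5 variant attributed to [KT81]); the domain and samples below are made up:

```python
import math
from collections import Counter

def smoothed_distribution(sample, domain, pseudocount=0.5):
    """Add `pseudocount` to every domain value's count, then normalize (0.5 = KT estimator)."""
    counts = Counter(sample)
    total = len(sample) + pseudocount * len(domain)
    return {x: (counts[x] + pseudocount) / total for x in domain}

def kl_divergence(p, q):
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

domain = ["a", "b", "c", "d"]
p = smoothed_distribution(["a", "a", "b", "c"], domain)
q = smoothed_distribution(["a", "b", "b", "b"], domain)   # "c", "d" unseen but still get mass > 0
print(kl_divergence(p, q))
```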


Computing kl divergence large domains l.jpg
Computing KL-divergence: Large Domains

  • Bad news: No good relative-approximations exist in small space.

  • (Partial) good news: additive approximations in small space under certain technical conditions (no pi is too small).

  • (Partial) good news: additive approximations for symmetric variant of KL-divergence, via sampling.

  • For details, see [GMV08,GIM08]


Information theoretic clustering137 l.jpg
Information-theoretic Clustering

  • Given a collection of random variables X, each “explained” by a random variable Y, we wish to find a (hard or soft) clustering T such that

    I(T;X) – βI(T;Y)

    is minimized.

  • Features of solutions thus far:

    • heuristic (general problem is NP-hard)

    • address both small-domain and large-domain scenarios.


Agglomerative clustering aib st00 l.jpg
Agglomerative Clustering (aIB) [ST00]

  • Fix number of clusters k

  • While number of clusters > k

    • Determine two clusters whose merge loses the least information

    • Combine these two clusters

  • Output clustering

  • Merge Criterion:

    • merge the two clusters so that change in I(T;V) is minimized

  • Note: no consideration of b (number of clusters is fixed)


Agglomerative clustering aib s l.jpg
Agglomerative Clustering (aIB) [S]

  • Elegant way of finding the two clusters to be merged:

  • Let dJS(p,q) = (1/2)(dKL(p,m) + dKL(q,m)), m = (p+q)/2

  • dJS(p,q) is a symmetric distance between p, q (Jensen-Shannon distance)

  • We merge clusters that have smallest dJS(p,q), (weighted by cluster mass)
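A Python sketch of the merge criterion only (not the full aIB algorithm): score every pair of current clusters by its mass-weighted Jensen-Shannon distance and pick the cheapest merge. The cluster masses and distributions below are made up.

```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / q[x]) for x, pi in p.items() if pi > 0)

def js(p, q):
    """d_JS(p, q) = (1/2) d_KL(p, m) + (1/2) d_KL(q, m), with m = (p + q)/2."""
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in set(p) | set(q)}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def best_merge(clusters):
    """clusters: list of (mass, distribution); return (score, i, j) for the merge
    that loses the least information, scoring a pair by combined mass times d_JS."""
    scored = [((clusters[i][0] + clusters[j][0]) * js(clusters[i][1], clusters[j][1]), i, j)
              for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    return min(scored)

clusters = [(0.4, {"a": 0.9, "b": 0.1}),
            (0.4, {"a": 0.8, "b": 0.2}),
            (0.2, {"b": 0.5, "c": 0.5})]
print(best_merge(clusters))   # the two similar clusters (indices 0 and 1) merge first
```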



Iterative information bottleneck iib s l.jpg
Iterative Information Bottleneck (iIB) [S]

  • aIB yields a hard clustering with k clusters.

  • If you want a soft clustering, use iIB (variant of EM)

    • Step 1: p(t|x) ← exp(–β dKL(p(V|x), p(V|t)))

      • assign elements to clusters in proportion (exponentially) to distance from cluster center !

    • Step 2: Compute new cluster centers by computing weighted centroids:

      • p(t) = ∑x p(t|x) p(x)

      • p(V|t) = ∑x p(V|x) p(t|x) p(x)/p(t)

    • Choose b according to [DKOSV06]


Dealing with massive data sets l.jpg
Dealing with massive data sets

  • Clustering on massive data sets is a problem

  • Two main heuristics:

    • Sampling [DKOSV06]:

      • pick a small sample of the data, cluster it, and (if necessary) assign remaining points to clusters using soft assignment.

      • How many points to sample to get good bounds ?

    • Streaming:

      • Scan the data in one pass, performing clustering on the fly

      • How much memory needed to get reasonable quality solution ?


Limbo for aib atms04 l.jpg
LIMBO (for aIB) [ATMS04]

  • BIRCH-like idea:

    • Maintain (sparse) summary for each cluster (p(t), p(V|t))

    • As data streams in, build clusters on groups of objects

    • Build next-level clusters on cluster summaries from lower level


Outline143 l.jpg
Outline

Part 1

Introduction to Information Theory

Application: Data Anonymization

Application: Data Integration

Part 2

Review of Information Theory Basics

Application: Database Design

Computing Information Theoretic Primitives

Open Problems

Information Theory for Data Management - Divesh & Suresh



Open problems l.jpg
Open Problems

  • Data exploration and mining – information theory as first-pass filter

  • Relation to nonparametric generative models in machine learning (LDA, PPCA, ...)

  • Engineering and stability: finding right knobs to make systems reliable and scalable

  • Other information-theoretic concepts ? (rate distortion, higher-order entropy, ...)

THANK YOU !


References information theory l.jpg
References: Information Theory

[CT] Tom Cover and Joy Thomas: Elements of Information Theory.

[BMDG05] Arindam Banerjee, Srujana Merugu, Inderjit Dhillon, Joydeep Ghosh. Clustering with Bregman Divergences. JMLR 2005.

[TPB98] Naftali Tishby, Fernando Pereira, William Bialek. The Information Bottleneck Method. Proc. 37th Annual Allerton Conference, 1998

145

Information Theory for Data Management - Divesh & Suresh


References data anonymization l.jpg
References: Data Anonymization

[AA01] Dakshi Agrawal, Charu C. Aggarwal: On the design and quantification of privacy preserving data mining algorithms. PODS 2001.

[AS00] Rakesh Agrawal, Ramakrishnan Srikant: Privacy preserving data mining. SIGMOD 2000.

[EGS03] Alexandre Evfimievski, Johannes Gehrke, Ramakrishnan Srikant: Limiting privacy breaches in privacy preserving data mining. PODS 2003.

Information Theory for Data Management - Divesh & Suresh



References data integration l.jpg
References: Data Integration

[AMT04] Periklis Andritsos, Renee J. Miller, Panayiotis Tsaparas: Information-theoretic tools for mining database structure from large data sets. SIGMOD 2004.

[DKOSV06] Bing Tian Dai, Nick Koudas, Beng Chin Ooi, Divesh Srivastava, Suresh Venkatasubramanian: Rapid identification of column heterogeneity. ICDM 2006.

[DKSTV08] Bing Tian Dai, Nick Koudas, Divesh Srivastava, Anthony K. H. Tung, Suresh Venkatasubramanian: Validating multi-column schema matchings by type. ICDE 2008.

[KN03] Jaewoo Kang, Jeffrey F. Naughton: On schema matching with opaque column names and data values. SIGMOD 2003.

[PPH05] Patrick Pantel, Andrew Philpot, Eduard Hovy: An information theoretic model for database alignment. SSDBM 2005.

147

Information Theory for Data Management - Divesh & Suresh


References database design l.jpg
References: Database Design

[AL03] Marcelo Arenas, Leonid Libkin: An information theoretic approach to normal forms for relational and XML data. PODS 2003.

[AL05] Marcelo Arenas, Leonid Libkin: An information theoretic approach to normal forms for relational and XML data. JACM 52(2), 246-283, 2005.

[DR00] Mehmet M. Dalkilic, Edward L. Robertson: Information dependencies. PODS 2000.

[KL06] Solmaz Kolahi, Leonid Libkin: On redundancy vs dependency preservation in normalization: an information-theoretic study of XML. PODS 2006.

148

Information Theory for Data Management - Divesh & Suresh


References computing it quantities l.jpg
References: Computing IT quantities

[P03] Liam Paninski. Estimation of entropy and mutual information. Neural Computation 15: 1191-1254, 2003.

[GT53] I. J. Good. Turing’s anticipation of Empirical Bayes in connection with the cryptanalysis of the Naval Enigma. Journal of Statistical Computation and Simulation, 66(2), 2000.

[KT81] R. E. Krichevsky and V. K. Trofimov. The performance of universal encoding. IEEE Trans. Inform. Th. 27 (1981), 199--207.

[CCM07] Amit Chakrabarti, Graham Cormode and Andrew McGregor. A near-optimal algorithm for computing the entropy of a stream. Proc. SODA 2007.

[HNO08] Nick Harvey, Jelani Nelson, Krzysztof Onak. Sketching and Streaming Entropy via Approximation Theory. FOCS 2008.

[ATMS04] Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller and Kenneth C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. EDBT 2004

Information Theory for Data Management - Divesh & Suresh



References computing it quantities150 l.jpg
References: Computing IT quantities

[S] Noam Slonim. The Information Bottleneck: theory and applications. Ph.D Thesis. Hebrew University, 2000.

[GMV08] Sudipto Guha, Andrew McGregor, Suresh Venkatasubramanian. Streaming and sublinear approximations for information distances. ACM Trans Alg. 2008

[GIM08] Sudipto Guha, Piotr Indyk, Andrew McGregor. Sketching Information Distances. JMLR, 2008.

Information Theory for Data Management - Divesh & Suresh


