
Information Theory For Data Management

  1. Information Theory For Data Management
  Divesh Srivastava, Suresh Venkatasubramanian

  2. Motivation
  "Information Theory is relevant to all of humanity..." -- Abstruse Goose (177)

  3. Background
  • Many problems in data management need precise reasoning about information content, transfer and loss:
  • Structure extraction
  • Privacy preservation
  • Schema design
  • Probabilistic data?

  4. Information Theory
  • First developed by Shannon as a way of quantifying the capacity of signal channels
  • Entropy, relative entropy and mutual information capture intrinsic informational aspects of a signal
  • Today: information theory provides a domain-independent way to reason about structure in data
  • More information = interesting structure
  • Less information linkage = decoupling of structures

  5. Tutorial Thesis
  Information theory provides a mathematical framework for the quantification of information content, linkage and loss. This framework can be used in the design of data management strategies that rely on probing the structure of information in data.

  6. Tutorial Goals
  • Introduce information-theoretic concepts to a DB audience
  • Give a ‘data-centric’ perspective on information theory
  • Connect these to applications in data management
  • Describe underlying computational primitives
  • Illuminate when and how information theory might be of use in new areas of data management

  7. Outline
  Part 1
  • Introduction to Information Theory
  • Application: Data Anonymization
  • Application: Database Design
  Part 2
  • Review of Information Theory Basics
  • Application: Data Integration
  • Computing Information Theoretic Primitives
  • Open Problems

  8. Histograms And Discrete Distributions
  A column of data is summarized by aggregating counts into a histogram, then normalizing the counts into a probability distribution. For the example column (x1, x3, x2, x4, x1, x1, x2, x1):

  X  | f(X) (histogram) | p(X) (probability distribution)
  ---|------------------|--------------------------------
  x1 | 4                | 0.5
  x2 | 2                | 0.25
  x3 | 1                | 0.125
  x4 | 1                | 0.125

  9. Histograms And Discrete Distributions
  The same counts can also be reweighted before normalizing, yielding a different distribution over the same column:

  X  | f(X) (histogram) | p(X) (reweighted, normalized)
  ---|------------------|------------------------------
  x1 | 4                | 0.667
  x2 | 2                | 0.2
  x3 | 1                | 0.067
  x4 | 1                | 0.067

  10. From Columns To Random Variables
  • We can think of a column of data as “represented” by a random variable:
  • X is a random variable
  • p(X) is the column of probabilities p(X = x1), p(X = x2), and so on
  • Also known (in the unweighted case) as the empirical distribution induced by the column X
  • Notation:
  • X (upper case) denotes a random variable (column)
  • x (lower case) denotes a value taken by X (a field in a tuple)
  • p(x) is the probability p(X = x)
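  A minimal Python sketch (mine, not from the tutorial) of turning a column into its empirical distribution, using the example column from slide 8:

```python
from collections import Counter

def empirical_distribution(column):
    """Map a column of values to its empirical (normalized) distribution."""
    counts = Counter(column)                      # histogram f(X)
    n = len(column)
    return {x: f / n for x, f in counts.items()}  # probabilities p(X)

column = ["x1", "x3", "x2", "x4", "x1", "x1", "x2", "x1"]
print(empirical_distribution(column))
# {'x1': 0.5, 'x3': 0.125, 'x2': 0.25, 'x4': 0.125}
```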

  11. Joint Distributions
  • Discrete joint distribution: probability p(X, Y, Z)
  • Marginals are obtained by summing out the other variables: p(Y) = ∑x p(X=x, Y) = ∑x ∑z p(X=x, Y, Z=z)
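  A small sketch of marginalizing a discrete joint distribution stored as a dict; the joint here is a hypothetical example, not data from the tutorial:

```python
# Marginalize a discrete joint distribution p(X, Y) down to p(Y).
joint = {("x1", "y1"): 0.25, ("x1", "y2"): 0.25,
         ("x2", "y1"): 0.25, ("x2", "y2"): 0.25}

p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p   # p(Y=y) = sum over x of p(X=x, Y=y)

print(p_y)  # {'y1': 0.5, 'y2': 0.5}
```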

  12. Entropy Of A Column
  • Let h(x) = log2 1/p(x); h(X) is the column of h(x) values
  • H(X) = EX[h(X)] = ∑x p(x) log2 1/p(x)
  • Two views of entropy:
  • It captures uncertainty in data: higher entropy, more unpredictability
  • It captures information content: higher entropy, more information
  • For the example column: H(X) = 1.75 < log2 |X| = 2
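  A minimal entropy function (my sketch), checked against the numbers on this slide and the next:

```python
import math

def entropy(p):
    """H(X) = sum_x p(x) * log2(1/p(x)) for a distribution given as a dict."""
    return sum(px * math.log2(1.0 / px) for px in p.values() if px > 0)

p_x = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}
print(entropy(p_x))                      # 1.75, matching H(X) on this slide

p_y = {1: 0.5, 2: 1/6, 3: 1/6, 4: 1/6}   # slide 13's Y
print(round(entropy(p_y), 2))            # ~1.79, i.e. the ~1.8 on slide 13
```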

  13. Examples
  • X uniform over [1, ..., 4]: H(X) = 2
  • Y is 1 with probability 0.5, uniform over [2, 3, 4] otherwise
  • H(Y) = 0.5 log2 2 + 0.5 log2 6 ≈ 1.8 < 2
  • Y is more sharply defined, and so has less uncertainty
  • Z uniform over [1, ..., 8]: H(Z) = 3 > 2
  • Z spans a larger range, and carries more information
  • (Figure: histograms of X, Y, Z)

  14. Comparing Distributions
  • How do we measure the difference between two distributions?
  • Kullback-Leibler divergence: dKL(p, q) = Ep[ h(q) – h(p) ] = ∑i pi log(pi/qi)
  • (Figure: an inference mechanism turns a prior belief q into a resulting belief p)

  15. Comparing Distributions
  • Kullback-Leibler divergence: dKL(p, q) = Ep[ h(q) – h(p) ] = ∑i pi log(pi/qi)
  • dKL(p, q) >= 0
  • Captures the extra information needed to describe p given q
  • Is asymmetric! dKL(p, q) != dKL(q, p)
  • Is not a metric (does not satisfy the triangle inequality)
  • There are other measures: ℓ2-distance, variational distance, f-divergences, ...
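  A sketch of dKL and its asymmetry, on two hypothetical distributions chosen for illustration:

```python
import math

def kl_divergence(p, q):
    """d_KL(p, q) = sum_i p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / q[i]) for i, pi in p.items() if pi > 0)

p = {0: 0.75, 1: 0.25}
q = {0: 0.5,  1: 0.5}
print(kl_divergence(p, q))  # 0.1887...
print(kl_divergence(q, p))  # 0.2075... -- asymmetric, as the slide notes
```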

  16. Conditional Probability
  • Given a joint distribution on random variables X, Y: how much information about X can we glean from Y?
  • Conditional probability: p(X|Y)
  • p(X = x1 | Y = y1) = p(X = x1, Y = y1) / p(Y = y1)

  17. Conditional Entropy
  • Let h(x|y) = log2 1/p(x|y)
  • H(X|Y) = Ex,y[h(x|y)] = ∑x ∑y p(x,y) log2 1/p(x|y)
  • H(X|Y) = H(X,Y) – H(Y)
  • In the running example: H(X|Y) = H(X,Y) – H(Y) = 2.25 – 1.5 = 0.75
  • If X, Y are independent, H(X|Y) = H(X)
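  A sketch computing H(X|Y) = H(X,Y) – H(Y) from a joint distribution dict (the joint below is a hypothetical example, not the slide's; entropy is the helper from the slide 12 example):

```python
import math

def entropy(p):
    return sum(pi * math.log2(1.0 / pi) for pi in p.values() if pi > 0)

def conditional_entropy(joint):
    """H(X|Y) = H(X,Y) - H(Y), with the joint given as {(x, y): p}."""
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return entropy(joint) - entropy(p_y)

joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}  # hypothetical joint
print(conditional_entropy(joint))  # H(X,Y)=1.5, H(Y)=1.0 -> H(X|Y)=0.5
```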

  18. Mutual Information
  • Mutual information captures the difference between the joint distribution on X and Y and the product of their marginal distributions
  • Let i(x;y) = log p(x,y)/(p(x)p(y))
  • I(X;Y) = Ex,y[i(x;y)] = ∑x ∑y p(x,y) log p(x,y)/(p(x)p(y))

  19. Mutual Information: Strength of Linkage
  • I(X;Y) = H(X) + H(Y) – H(X,Y) = H(X) – H(X|Y) = H(Y) – H(Y|X)
  • If X, Y are independent, then I(X;Y) = 0: H(X,Y) = H(X) + H(Y), so I(X;Y) = H(X) + H(Y) – H(X,Y) = 0
  • I(X;Y) <= min(H(X), H(Y))
  • Suppose Y = f(X) (deterministically): then H(Y|X) = 0, and so I(X;Y) = H(Y) – H(Y|X) = H(Y)
  • Mutual information captures higher-order interactions; covariance captures “linear” interactions only
  • Two variables can be uncorrelated (covariance = 0) and still have nonzero mutual information:
  • X uniform on [–1, 1], Y = X²: Cov(X,Y) = 0, but I(X;Y) = H(Y) > 0 (see the sketch below)
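  A sketch computing I(X;Y) from a joint distribution, illustrating the last bullet with X uniform on {-1, -0.5, 0.5, 1} and Y = X² (a discrete stand-in, my choice, for the slide's example):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum p(x,y) * log2( p(x,y) / (p(x) p(y)) )."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

# X uniform on {-1, -0.5, 0.5, 1}, Y = X^2: Cov(X,Y) = 0 but I(X;Y) > 0
xs = [-1.0, -0.5, 0.5, 1.0]
joint = {(x, x * x): 0.25 for x in xs}
print(mutual_information(joint))  # 1.0 bit: knowing Y halves the choices for X
```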

  20. Information Theory: Summary
  • We can represent data as discrete distributions (normalized histograms)
  • Entropy captures the uncertainty or information content of a distribution
  • Kullback-Leibler divergence captures the difference between distributions
  • Mutual information and conditional entropy capture linkage between variables in a joint distribution

  21. Outline
  Part 1
  • Introduction to Information Theory
  • Application: Data Anonymization
  • Application: Database Design
  Part 2
  • Review of Information Theory Basics
  • Application: Data Integration
  • Computing Information Theoretic Primitives
  • Open Problems

  22. Data Anonymization Using Randomization
  • Goal: publish anonymized microdata to enable accurate ad hoc analyses, but ensure privacy of individuals’ sensitive attributes
  • Key ideas:
  • Randomize numerical data: add noise from a known distribution
  • Reconstruct the original data distribution from the published noisy data
  • Issues:
  • How can the original data distribution be reconstructed?
  • What kinds of randomization preserve privacy of individuals?

  23. Data Anonymization Using Randomization
  • Many randomization strategies proposed [AS00, AA01, EGS03]
  • Example randomization strategies, for X in [0, 10] (sketched below):
  • R = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • R = X + μ (mod 11), μ in {-1 (p = 0.25), 0 (p = 0.5), 1 (p = 0.25)}
  • R = X (p = 0.6); R = μ, μ uniform in [0, 10] (p = 0.4)
  • Question: which randomization strategy has higher privacy preservation?
  • Need to quantify the loss of privacy due to publication of randomized data
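  For concreteness, a minimal simulation of the three strategies (the function names are mine, not from the papers):

```python
import random

def randomize_additive(x, noise_values):
    """R = X + mu (mod 11), mu drawn uniformly from noise_values."""
    return (x + random.choice(noise_values)) % 11

def randomize_weighted(x):
    """R = X + mu (mod 11), mu in {-1 (p=0.25), 0 (p=0.5), 1 (p=0.25)}."""
    mu = random.choices([-1, 0, 1], weights=[0.25, 0.5, 0.25])[0]
    return (x + mu) % 11

def randomize_replace(x):
    """R = X with p = 0.6; otherwise R is uniform in [0, 10]."""
    return x if random.random() < 0.6 else random.randint(0, 10)

x = 5
print(randomize_additive(x, [-1, 0, 1]), randomize_weighted(x), randomize_replace(x))
```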

  24.-26. Data Anonymization Using Randomization
  • X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • (Figure, built up over three slides: a table of original X values and their randomized R1 values)

  27. Reconstruction of Original Data Distribution
  • X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • Reconstruct the distribution of X using knowledge of R1 and μ
  • The EM algorithm converges to the MLE of the original distribution [AA01]
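  A minimal sketch of an EM-style reconstruction in the spirit of [AA01]; the helper name, the uniform initialization, and the test data are my choices, not the paper's:

```python
import numpy as np

def em_reconstruct(r_counts, noise, n_vals=11, iters=200):
    """Estimate the distribution of X from a histogram of randomized values R.

    r_counts : length-n_vals array, histogram of observed R values
    noise    : dict d -> p, probability that R = X + d (mod n_vals)
    """
    n = r_counts.sum()
    # Channel matrix: chan[x, r] = p(R = r | X = x)
    chan = np.zeros((n_vals, n_vals))
    for x in range(n_vals):
        for d, pd in noise.items():
            chan[x, (x + d) % n_vals] += pd
    q = np.full(n_vals, 1.0 / n_vals)  # initial guess: uniform
    for _ in range(iters):
        joint = q[:, None] * chan                            # q(x) * p(r|x)
        post = joint / np.maximum(joint.sum(axis=0), 1e-12)  # p(x | r)
        q = (post * r_counts[None, :]).sum(axis=1) / n       # re-estimate q(x)
    return q

# X uniform on {0, 1, 5, 6}, randomized with mu uniform in {-1, 0, 1}
rng = np.random.default_rng(0)
xs = rng.choice([0, 1, 5, 6], size=10000)
rs = (xs + rng.choice([-1, 0, 1], size=xs.size)) % 11
print(em_reconstruct(np.bincount(rs, minlength=11),
                     {-1: 1/3, 0: 1/3, 1: 1/3}).round(3))
```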

  28. Analysis of Privacy [AS00]
  • X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • If X is uniform in [0, 10], privacy is determined by the range of μ

  29.-30. Analysis of Privacy [AA01]
  • X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • If X is uniform in [0, 1] ∪ [5, 6], privacy is smaller than the range of μ
  • In some cases, the sensitive value is revealed

  31. Quantify Loss of Privacy [AA01]
  • Goal: quantify loss of privacy based on mutual information I(X;R)
  • I(X;R) = H(X) – H(X|R) is used to capture the correlation between X and R
  • Smaller H(X|R) ⇒ more loss of privacy in X from knowledge of R
  • Larger I(X;R) ⇒ more loss of privacy in X from knowledge of R
  • p(X) is the prior knowledge of the sensitive attribute X
  • p(X, R) is the joint distribution of X and R

  32.-35. Quantify Loss of Privacy [AA01]
  • Goal: quantify loss of privacy based on mutual information I(X;R)
  • X uniform in [5, 6], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • (Figure, built up over these slides: the joint distribution p(X, R1))
  • Result: I(X;R1) = 0.33

  36. Quantify Loss of Privacy [AA01]
  • Goal: quantify loss of privacy based on mutual information I(X;R)
  • X uniform in [5, 6], R2 = X + μ (mod 11), μ uniform in {0, 1}
  • I(X;R1) = 0.33, I(X;R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1

  37. Quantify Loss of Privacy [AA01]
  • Equivalent goal: quantify loss of privacy based on H(X|R)
  • X uniform in [5, 6], R2 = X + μ (mod 11), μ uniform in {0, 1}
  • Intuition: we know more about X given R2 than about X given R1
  • H(X|R1) = 0.67, H(X|R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
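  These numbers can be checked exactly by enumerating the joint distribution p(X, R); a sketch (reusing the mutual_information helper from the slide 19 example, with a helper name of my own):

```python
from fractions import Fraction

def privacy_joint(x_vals, mus):
    """Joint p(X, R) for R = X + mu (mod 11), with X and mu uniform."""
    joint = {}
    for x in x_vals:
        for mu in mus:
            key = (x, (x + mu) % 11)
            joint[key] = joint.get(key, 0) + Fraction(1, len(x_vals) * len(mus))
    return joint

for mus in ([-1, 0, 1], [0, 1]):
    j = {k: float(v) for k, v in privacy_joint([5, 6], mus).items()}
    print(mus, round(mutual_information(j), 2))
# [-1, 0, 1] -> 0.33 (R1);  [0, 1] -> 0.5 (R2)
```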

  38. Quantify Loss of Privacy
  • Example: X uniform in [0, 1]
  • R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  • R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)
  • Is R3 or R4 the bigger privacy risk?

  39.-40. Worst Case Loss of Privacy [EGS03]
  • Example: X uniform in [0, 1]
  • R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  • R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)
  • I(X;R3) = 0.0001 << I(X;R4) = 0.028, but R3 has the larger worst case risk

  41. Worst Case Loss of Privacy [EGS03]
  • Goal: quantify the worst case loss of privacy in X from knowledge of R
  • Use maximum KL divergence instead of mutual information
  • Mutual information can be formulated as an expected KL divergence:
  • I(X;R) = ∑x ∑r p(x,r) log2(p(x,r)/(p(x)p(r))) = KL(p(X,R) || p(X)p(R))
  • I(X;R) = ∑r p(r) ∑x p(x|r) log2(p(x|r)/p(x)) = ER[KL(p(X|r) || p(X))]
  • The [AA01] measure quantifies the expected loss of privacy over R
  • [EGS03] propose a measure based on the worst case loss of privacy:
  • IW(X;R) = maxr KL(p(X|r) || p(X))

  42. Worst Case Loss of Privacy [EGS03]
  • Example: X uniform in [0, 1]
  • R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  • R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)
  • IW(X;R3) = max{0.0, 1.0, 1.0} > IW(X;R4) = max{0.028, 0.028}

  43. Worst Case Loss of Privacy [EGS03]
  • Example: X uniform in [5, 6]
  • R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • R2 = X + μ (mod 11), μ uniform in {0, 1}
  • IW(X;R1) = max{1.0, 0.0, 0.0, 1.0} = IW(X;R2) = max{1.0, 0.0, 1.0} = 1.0
  • The worst case measure is unable to capture that R2 is a bigger privacy risk than R1
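  A sketch of IW(X;R) computed from a joint distribution, reusing kl_divergence from the slide 15 example and privacy_joint from the slide 37 example:

```python
def worst_case_loss(joint):
    """I_W(X;R) = max over r of KL( p(X|r) || p(X) )."""
    p_x, p_r = {}, {}
    for (x, r), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_r[r] = p_r.get(r, 0.0) + p
    worst = 0.0
    for r, pr in p_r.items():
        post = {x: joint.get((x, r), 0.0) / pr for x in p_x}  # p(X | R = r)
        worst = max(worst, kl_divergence(post, p_x))
    return worst

for mus in ([-1, 0, 1], [0, 1]):
    j = {k: float(v) for k, v in privacy_joint([5, 6], mus).items()}
    print(mus, worst_case_loss(j))   # both give 1.0, matching the slide
```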

  44. Data Anonymization: Summary
  • Randomization techniques are useful for microdata anonymization
  • Randomization techniques differ in their loss of privacy
  • Information theoretic measures are useful to capture loss of privacy:
  • Expected KL divergence captures expected privacy loss [AA01]
  • Maximum KL divergence captures worst case privacy loss [EGS03]
  • Both are useful in practice

  45. Outline
  Part 1
  • Introduction to Information Theory
  • Application: Data Anonymization
  • Application: Database Design
  Part 2
  • Review of Information Theory Basics
  • Application: Data Integration
  • Computing Information Theoretic Primitives
  • Open Problems

  46. Information Dependencies [DR00]
  • Goal: use information theory to examine and reason about the information content of the attributes in a relation instance
  • Key ideas:
  • A novel InD measure between attribute sets X, Y based on H(Y|X)
  • Identify numeric inequalities between InD measures
  • Results:
  • InD measures are a broader class than FDs and MVDs
  • Armstrong axioms for FDs are derivable from InD inequalities
  • MVD inference rules are derivable from InD inequalities

  47.-48. Information Dependencies [DR00]
  • Functional dependency: X → Y
  • FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y]))

  49. Information Dependencies [DR00]
  • Result: FD X → Y holds iff H(Y|X) = 0
  • Intuition: once X is known, there is no remaining uncertainty in Y
  • In the example instance (table not shown), H(Y|X) = 0.5, so the FD does not hold
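  A sketch of checking this criterion on a relation instance stored as a list of dicts; the table and column names below are hypothetical, chosen so that H(Y|X) = 0.5 as on the slide:

```python
import math
from collections import Counter

def conditional_entropy_fd(rows, x_cols, y_cols):
    """H(Y|X) over a relation instance; equals 0 iff the FD X -> Y holds."""
    n = len(rows)
    xy = Counter((tuple(r[c] for c in x_cols), tuple(r[c] for c in y_cols))
                 for r in rows)
    x = Counter(tuple(r[c] for c in x_cols) for r in rows)
    h_xy = sum((f / n) * math.log2(n / f) for f in xy.values())
    h_x = sum((f / n) * math.log2(n / f) for f in x.values())
    return h_xy - h_x   # H(Y|X) = H(X,Y) - H(X)

rows = [{"A": 1, "B": "u"}, {"A": 1, "B": "u"},
        {"A": 2, "B": "v"}, {"A": 2, "B": "w"}]
print(conditional_entropy_fd(rows, ["A"], ["B"]))  # 0.5: A -> B does not hold
```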

  50. Information Dependencies [DR00]
  • Multi-valued dependency: X →→ Y
  • MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
