
Privacy in Databases



  1. Privacy in Databases Umur Türkay 2006103319

  2. Outline • Defining Privacy • Optimization Problem • First-Cut Solution (k-anonymity) • Second-Cut Solution (l-diversity) • Decision Problem • First-Cut (Query-View Security) • Second-Cut (View Safety)

  3. Defining Privacy in DB Publishing If the attacker uses only legitimate methods: • Can she infer the data I want to keep private? (the Decision Problem) • How can I keep some data private while still publishing useful information? (the Optimization Problem) [Slide diagram: Alice holds a secret, modifies her data, and publishes views V1, V2; the attacker combines them with external knowledge.]

  4. Outline • Defining Privacy • Optimization Problem • First-Cut Solution (k-anonymity) • Second-Cut Solution (l-diversity) • Decision Problem • First-Cut (Query-View Security) • Second-Cut (View Safety)

  5. Need for Privacy in DB publishing • Alice is the owner of person-specific data • Public health agency, telecom provider, financial organization • The person-specific data contains • attribute values which can uniquely identify an individual • { zip-code, gender, date-of-birth } and/or { name } and/or { SSN } • sensitive information corresponding to individuals • medical condition, salary, location • There is great demand for sharing of person-specific data • Medical research, new telecom applications • Alice wants to publish this person-specific data s.t. • the information remains practically useful • the identity of the individuals cannot be determined

  6. The Optimization Problem Motivating Example Secret: Alice wants to publish hospital data, while the correspondence between name & disease stays private

  7. The Optimization Problem Motivating Example (continued) Published Data: Alice publishes the data without the Name column. Attacker’s Knowledge: Voter registration list:

     #  Name   Zip    Age  Nationality
     1  John   13067  45   US
     2  Paul   13067  22   US
     3  Bob    13067  29   US
     4  Chris  13067  23   US

  8. The Optimization Problem Motivating Example (continued) Published Data: Alice publishes the data without the Name column. Attacker’s Knowledge: Voter registration list. Joining the published data with the voter list on the shared attributes { Zip, Age, Nationality } re-identifies the individuals: Data Leak!
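
A minimal sketch of this linking attack in Python. The voter rows follow the slide; the published Condition values for John and Paul are hypothetical, since the transcript does not show the hospital table itself.

```python
published = [   # hospital data released without the Name column;
                # the Condition values here are hypothetical
    {"zip": "13067", "age": 45, "nat": "US", "condition": "Heart Disease"},
    {"zip": "13067", "age": 22, "nat": "US", "condition": "Cancer"},
]

voter_list = [  # attacker's external knowledge (from the slide)
    {"name": "John",  "zip": "13067", "age": 45, "nat": "US"},
    {"name": "Paul",  "zip": "13067", "age": 22, "nat": "US"},
    {"name": "Bob",   "zip": "13067", "age": 29, "nat": "US"},
    {"name": "Chris", "zip": "13067", "age": 23, "nat": "US"},
]

QUASI_ID = ("zip", "age", "nat")

def link(published, voter_list):
    """Join the two tables on the quasi-identifier."""
    for p in published:
        for v in voter_list:
            if all(p[a] == v[a] for a in QUASI_ID):
                yield v["name"], p["condition"]

for name, condition in link(published, voter_list):
    print(name, "has", condition)   # John has Heart Disease, Paul has Cancer
```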

  9. The Optimization Problem Source of the Problem Even if we do not publish the individuals’ names: • Some combination of fields may still uniquely identify an individual: a Quasi-Identifier • The attacker can use it to join with other sources and re-identify the individuals

  10. Outline • Defining Privacy • Optimization Problem • First-Cut Solution (k-anonymity) • Second-Cut Solution (l-diversity) • Decision Problem • First-Cut (Query-View Security) • Second-Cut (View Safety)

  11. The Optimization Problem First-Cut Solution: k-Anonymity Instead of returning the original data: • Change the data such that for each tuple in the result there are at least k-1 other tuples with the same value for the quasi-identifier e.g. the following table is 4-anonymous (and therefore also 2-anonymous):

     #  Zip    Age   Nationality  Condition
     1  130**  < 40  *            Heart Disease
     2  130**  < 40  *            Heart Disease
     3  130**  < 40  *            Cancer
     4  130**  < 40  *            Cancer
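
The k-anonymity condition translates directly into code. A minimal Python check, run against the 4-anonymous table above (attribute names are illustrative assumptions):

```python
from collections import Counter

def is_k_anonymous(table, quasi_id, k):
    """True iff every quasi-identifier combination occurs in at least k rows."""
    counts = Counter(tuple(row[a] for a in quasi_id) for row in table)
    return all(c >= k for c in counts.values())

table = [
    {"zip": "130**", "age": "< 40", "nat": "*", "condition": "Heart Disease"},
    {"zip": "130**", "age": "< 40", "nat": "*", "condition": "Heart Disease"},
    {"zip": "130**", "age": "< 40", "nat": "*", "condition": "Cancer"},
    {"zip": "130**", "age": "< 40", "nat": "*", "condition": "Cancer"},
]
print(is_k_anonymous(table, ("zip", "age", "nat"), 4))  # True
print(is_k_anonymous(table, ("zip", "age", "nat"), 5))  # False
```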

  12. The Optimization Problem > k-Anonymity Generalization & Suppression Different ways of modifying data: • Randomization • Data swapping • … • Generalization: replace the value with a less specific but semantically consistent value • Suppression: do not release a value at all
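
A hedged sketch of generalization and suppression operators in Python. The masking rules (zip digit masking, age buckets "< 30" / "3*" / "< 40") mirror the slide's examples, but the function names and level scheme are assumptions:

```python
def generalize_zip(zip_code, level):
    """Mask the last `level` digits with '*' (level 0 = unchanged)."""
    return zip_code if level == 0 else zip_code[:-level] + "*" * level

def generalize_age(age, level):
    """Level 0: exact age; level 1: '< 30' or decade bucket '3*'; level 2: '< 40'."""
    if level == 0:
        return str(age)
    if level == 1:
        return "< 30" if age < 30 else f"{age // 10}*"
    return "< 40" if age < 40 else ">= 40"

def suppress(_value):
    """Suppression: do not release the value at all."""
    return "*"

print(generalize_zip("13053", 2))   # 130**
print(generalize_age(36, 1))        # 3*
print(generalize_age(28, 1))        # < 30
print(suppress("Japanese"))         # *
```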

  13. The Optimization Problem > k-Anonymity Generalization Hierarchies • Generalization Hierarchies: the data owner defines how values can be generalized, level by level:

     Zip:         level 0: 13053 13058 13063 13067 → level 1: 1305* 1306* → level 2: 130** → level 3: *
     Age:         level 0: 28 29 36 37 → level 1: < 30, 3* → level 2: < 40 → level 3: *
     Nationality: level 0: Brazilian US Indian Japanese → level 1: American, Asian → level 2: *

  • Table Generalization: a table generalization is created by generalizing all values in a column to a specific level of its hierarchy, e.g. the 2-anonymization below:

     #  Zip    Age   Nationality  Condition          #  Zip    Age   Nationality  Condition
     1  13053  < 30  American     Heart Disease      1  130**  < 40  *            Heart Disease
     2  13067  < 30  American     Heart Disease      2  130**  < 40  *            Heart Disease
     3  13053  3*    Asian        Cancer             3  130**  < 40  *            Cancer
     4  13067  3*    Asian        Cancer             4  130**  < 40  *            Cancer

  14. The Optimization Problem > k-Anonymity k-minimal Generalizations • There are many k-anonymizations. Which to pick? The ones that do not generalize the data more than needed. k-minimal Generalization: a k-anonymization that is not a generalization of another k-anonymization [The slide contrasts two 2-minimal generalizations with a non-minimal 2-anonymization.]

  15. The Optimization Problem > k-Anonymity k-minimal Distortions • There are many k-minimal generalizations. Which to pick? The ones that create the minimum distortion to the data. k-minimal Distortion: a k-minimal generalization that has the least distortion.

     D = ( Σ over attributes i of (current generalization level of i / max generalization level of i) ) / (number of attributes)

  e.g. D = (0/3 + 2/3 + 2/2) / 3 ≈ 0.56 and D = (2/3 + 1/3 + 1/2) / 3 = 0.5
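
The distortion metric is a few lines of Python. This sketch reproduces the two example values, assuming the per-attribute maximum levels (3, 3, 2) read off the hierarchies on the previous slide:

```python
def distortion(levels, max_levels):
    """D = average over attributes of (current level / max level)."""
    return sum(cur / mx for cur, mx in zip(levels, max_levels)) / len(levels)

MAX_LEVELS = (3, 3, 2)   # zip, age, nationality

print(round(distortion((0, 2, 2), MAX_LEVELS), 2))   # 0.56
print(round(distortion((2, 1, 1), MAX_LEVELS), 2))   # 0.5
```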

  16. The Optimization Problem > k-Anonymity Complexity & Algorithms Search Space: • Number of generalizations = Π over attributes i of (max generalization level of attribute i + 1) • If we allow generalization to a different level for each value of an attribute: Number of generalizations = Π over attributes i of (max generalization level of attribute i + 1)^#tuples The problem is NP-hard! See the paper for: • a naïve brute-force algorithm • heuristics: Datafly, μ-Argus
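
A quick computation of both search-space sizes for the running example (max levels 3, 3, 2 and 4 tuples); the second number illustrates why cell-level generalization blows up:

```python
from math import prod

MAX_LEVELS = (3, 3, 2)   # max generalization level per attribute
N_TUPLES = 4

# One level chosen per column:
print(prod(m + 1 for m in MAX_LEVELS))                  # 48 generalizations

# One level chosen per cell (per value of each attribute, per tuple):
print(prod((m + 1) ** N_TUPLES for m in MAX_LEVELS))    # 5308416 = 48**4
```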

  17. The Optimization Problem > k-Anonymity k-Anonymity Drawbacks k-Anonymity alone does not provide privacy if: • Sensitive attributes lack diversity • The attacker has background knowledge

  18. The Optimization Problem > k-Anonymity k-Anonymity Attack Example [Original data table not captured in the transcript.] The attacker knows: • About quasi-identifiers: [table of known individuals not captured] • Other background knowledge: Japanese have a low incidence of heart disease

  19. The Optimization Problem > k-Anonymity k-Anonymity Attack Example On the 4-anonymization: Umeko has Viral Infection! Bob has Cancer! Data Leak!

  20. Outline • Defining Privacy • Optimization Problem • First-Cut Solution (k-anonymity) • Second-Cut Solution (l-diversity) • Decision Problem • First-Cut (Query-View Security) • Second-Cut (View Safety)

  21. The Optimization Problem Second-Cut Solution: l-Diversity Return a k-anonymization with the additional property that: • For each distinct value of the quasi-identifier there exist at least l different values for the sensitive attribute
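
A minimal l-diversity check in Python, in the same style as the earlier k-anonymity sketch; attribute names are again illustrative:

```python
from collections import defaultdict

def is_l_diverse(table, quasi_id, sensitive, l):
    """True iff every quasi-identifier group contains at least l
    distinct values of the sensitive attribute."""
    groups = defaultdict(set)
    for row in table:
        groups[tuple(row[a] for a in quasi_id)].add(row[sensitive])
    return all(len(vals) >= l for vals in groups.values())

table = [
    {"zip": "130**", "age": "< 40", "nat": "*", "condition": "Heart Disease"},
    {"zip": "130**", "age": "< 40", "nat": "*", "condition": "Heart Disease"},
    {"zip": "130**", "age": "< 40", "nat": "*", "condition": "Cancer"},
    {"zip": "130**", "age": "< 40", "nat": "*", "condition": "Cancer"},
]
print(is_l_diverse(table, ("zip", "age", "nat"), "condition", 2))  # True
print(is_l_diverse(table, ("zip", "age", "nat"), "condition", 3))  # False
```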

  22. The Optimization Problem > l-Diversity l-Diversity Example On the 3-diversified table the attack does not work! Umeko has Viral Infection or Cancer. Bob has Viral Infection or Cancer or Heart Disease.

  23. Outline • Defining Privacy • Optimization Problem • First-Cut Solution (k-anonymity) • Second-Cut Solution (l-diversity) • Decision Problem • First-Cut (Query-View Security) • Second-Cut (View Safety)

  24. The Decision Problem Moving from practice to theory… • k-anonymity & l-diversity make it harder for the attacker to figure out private associations… • …but they still give away some knowledge, and they do not give any guarantees on the amount of data being disclosed • Alice wants to publish some views of her data and wants to know: • Do her views disclose some sensitive data? • If she adds a new view, will there be an additional data disclosure?

  25. The Decision Problem Motivating Example Secret: Alice wants to keep the correlation between Name & Condition secret: S = (name, condition) Published Views: Alice publishes the views V1 = (zip, name) and V2 = (zip, condition)

  26. The Decision Problem Motivating Example Attacker’s Knowledge: Before seeing the views (assuming he knows the domain), every condition is possible for Ronaldo. After seeing the views: joining V1 and V2 on Zip reveals Ronaldo’s condition (Viral Infection): Data Leak!

  27. The Decision Problem > Model for attacker’s knowledge Probability of possible tuples • Domain of possible values for all attributes: D = {Bob, Mary} • Set of possible tuples of relation R (e.g. cooksFor): (Bob,Bob), (Bob,Mary), (Mary,Bob), (Mary,Mary) • The attacker assigns a probability to each possible tuple: x1 = x2 = x3 = x4 = 1/2

  28. The Decision Problem > Model for attacker’s knowledge Probability of possible Databases • This implies a probability for each of the 16 possible database instances: the product of xi for each tuple present and (1 - xi) for each tuple absent, e.g. x1 · (1-x2) · (1-x3) · (1-x4) = 1/16. With all xi = 1/2, every possible instance has probability 1/16.
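
For this toy domain the whole probability space can be enumerated. A short Python sketch, assuming the tuple order (Bob,Bob), (Bob,Mary), (Mary,Bob), (Mary,Mary) for x1..x4:

```python
from itertools import combinations

# the four possible tuples of R over D = {Bob, Mary}
tuples = [("Bob", "Bob"), ("Bob", "Mary"), ("Mary", "Bob"), ("Mary", "Mary")]
x = {t: 0.5 for t in tuples}    # attacker's tuple probabilities x1..x4

def instances():
    """Yield all 2^4 = 16 possible instances with their probabilities."""
    for r in range(len(tuples) + 1):
        for sub in combinations(tuples, r):
            inst = frozenset(sub)
            p = 1.0
            for t in tuples:
                p *= x[t] if t in inst else 1 - x[t]
            yield inst, p

worlds = list(instances())
print(len(worlds))                                      # 16
print(all(abs(p - 1 / 16) < 1e-12 for _, p in worlds))  # True: each is 1/16
```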

  29. The Decision Problem > Model for attacker’s knowledge Probability of possible Secrets • This implies a probability for each possible secret value: the probability that secret S(y) :- R(x,y) equals s = {(Bob)} is the sum of the probabilities of the instances that return this query result: P[S(I) = s] = 3/16. Similarly for the probability that view V equals v: P[V(I) = v]

  30. The Decision Problem > Model for attacker’s knowledge Prior & Posterior Probability • Prior Probability: probability before seeing the view instance. For secret S(y) :- R(x,y): P[S(I) = {(Bob)}] = 3/16 • Posterior Probability: probability after seeing the view instance. For view V(x) :- R(x,y), if V(I) = {(Mary)}: P[S(I) = {(Bob)} | V(I) = {(Mary)}] = P[S(I) = {(Bob)} AND V(I) = {(Mary)}] / P[V(I) = {(Mary)}] = (1/16) / (3/16) = 1/3
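
Both numbers can be verified by brute-force enumeration. A self-contained Python sketch, with set comprehensions standing in for the Datalog definitions of S and V:

```python
from itertools import combinations

tuples = [("Bob", "Bob"), ("Bob", "Mary"), ("Mary", "Bob"), ("Mary", "Mary")]

def instances():
    """All 16 possible instances; with every xi = 1/2 each has prob 1/16."""
    for r in range(len(tuples) + 1):
        for sub in combinations(tuples, r):
            yield frozenset(sub), 0.5 ** len(tuples)

S = lambda I: {y for (x, y) in I}   # S(y) :- R(x,y): second components
V = lambda I: {x for (x, y) in I}   # V(x) :- R(x,y): first components

prior  = sum(p for I, p in instances() if S(I) == {"Bob"})
p_view = sum(p for I, p in instances() if V(I) == {"Mary"})
joint  = sum(p for I, p in instances() if S(I) == {"Bob"} and V(I) == {"Mary"})

print(prior)           # 0.1875 = 3/16 (the prior)
print(joint / p_view)  # 0.333... = (1/16) / (3/16) = 1/3 (the posterior)
```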

  31. The Decision Problem Query-View Security • A query S is secure w.r.t. a set of views V if • for any possible answer s to S & for any possible answer v to V: • P[S(I) = s] = P[S(I) = s | V(I) = v] (prior probability = posterior probability) Intuitively, if some possible answer to S becomes more or less probable after publishing the views V, then S is not secure w.r.t. V

  32. The Decision Problem From Probabilities to Logic The probability distribution does not affect the security of a query. • A possible tuple t is a critical tuple if • for some possible instance I: • Q[I] ≠ Q[I − {t}] (the query result in the presence of t differs from the result in its absence) Intuitively, critical tuples are those of interest to the query • A query S is secure w.r.t. a set of views V iff: • crit(S) ∩ crit(V) = ∅
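
Over this small domain, crit(Q) is directly computable by checking every instance. A Python sketch that also previews the two examples on the next slides:

```python
from itertools import combinations

tuples = [("Bob", "Bob"), ("Bob", "Mary"), ("Mary", "Bob"), ("Mary", "Mary")]
all_instances = [frozenset(sub)
                 for r in range(len(tuples) + 1)
                 for sub in combinations(tuples, r)]

def crit(query):
    """Tuples t such that query(I) != query(I - {t}) for some instance I."""
    return {t for t in tuples
            for I in all_instances
            if query(I) != query(I - {t})}

S1 = lambda I: {y for (x, y) in I}                  # S(y) :- R(x,y)
V1 = lambda I: {x for (x, y) in I}                  # V(x) :- R(x,y)
S2 = lambda I: {x for (x, y) in I if y == "Mary"}   # S(x) :- R(x,'Mary')
V2 = lambda I: {x for (x, y) in I if y == "Bob"}    # V(x) :- R(x,'Bob')

print(crit(S1) & crit(V1))   # all four tuples: S1 is NOT secure w.r.t. V1
print(crit(S2) & crit(V2))   # set(): S2 IS secure w.r.t. V2
```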

  33. The Decision Problem Example of Non-Secure Query Previous example revisited: Secret S(y) :- R(x,y), View V(x) :- R(x,y). Every possible tuple is critical for both queries, e.g. S({(Mary,Mary)}) ≠ S({}), so crit(S) ∩ crit(V) ≠ ∅: S is a non-secure query.

  34. The Decision Problem Example of Secure Query Example 2: Secret S(x) :- R(x,'Mary'), View V(x) :- R(x,'Bob'). crit(S) = {(Bob,Mary), (Mary,Mary)} and crit(V) = {(Bob,Bob), (Mary,Bob)}, so crit(S) ∩ crit(V) = ∅: S is a secure query.

  35. The Decision Problem Example of Secure Query Example 2 revisited using the probabilistic definition of security: Secret S(x) :- R(x,'Mary'), View V(x) :- R(x,'Bob'). P[S(I) = {(Mary)}] = 4/16 = 1/4 and P[S(I) = {(Mary)} | V(I) = {(Bob)}] = 1/4: prior equals posterior, so S is secure.

  36. The Decision Problem Properties of Query-View Security • Reflexivity • If S is secure w.r.t. V, then V is secure w.r.t. S • No obscurity • view definitions, the secret query, and the schema are not concealed • Instance Independence • S remains secure w.r.t. V even if the underlying database instance changes • Probability Distribution Independence • holds if S and V are monotone queries • Domain Independence • If S is secure w.r.t. V for a domain D0 with |D0| = n(n+1), then S is secure w.r.t. V for all domains D with |D| >= |D0| • Complexity of query-view security • Π2^p-complete (complete for the second level of the polynomial hierarchy)

  37. The Decision Problem Prior Knowledge • Prior knowledge • other than the domain D and the probability distribution P • e.g. a key or foreign-key constraint • Represented as a Boolean query K over the instance • Query-view security with prior knowledge • P[S(I) = s | K(I)] = P[S(I) = s | V(I) = v ∧ K(I)]

  38. The Decision Problem Measuring Disclosure • Query-view security is very strong • it rules out most of the views in practical usage as insecure • Applications are ready to tolerate some disclosures • Disclosure examples: • Positive disclosure: “Bob” has “Cancer” • Negative disclosure: “Umeko” does not have “Heart Disease” • Measure of positive disclosure: • leak(S,V) = sup over s,v of ( P[s ∈ S(I) | v ∈ V(I)] − P[s ∈ S(I)] ) / P[s ∈ S(I)] • Disclosure is minute if: • leak(S,V) << 1
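
Over the same enumerable toy domain from the earlier slides, the leak measure can be computed by brute force. A hedged sketch, using the non-secure S/V pair from slide 33 and candidate answers {Bob, Mary}:

```python
from itertools import combinations

tuples = [("Bob", "Bob"), ("Bob", "Mary"), ("Mary", "Bob"), ("Mary", "Mary")]

def instances():
    """All 16 possible instances, each with probability 1/16."""
    for r in range(len(tuples) + 1):
        for sub in combinations(tuples, r):
            yield frozenset(sub), 0.5 ** len(tuples)

S = lambda I: {y for (x, y) in I}   # secret S(y) :- R(x,y)
V = lambda I: {x for (x, y) in I}   # view   V(x) :- R(x,y)

def leak(S, V, answers):
    """sup over (s, v) of (P[s in S(I) | v in V(I)] - P[s in S(I)]) / P[s in S(I)]."""
    worst = 0.0
    for s in answers:
        prior = sum(p for I, p in instances() if s in S(I))
        for v in answers:
            p_v = sum(p for I, p in instances() if v in V(I))
            joint = sum(p for I, p in instances() if s in S(I) and v in V(I))
            if prior > 0 and p_v > 0:
                worst = max(worst, (joint / p_v - prior) / prior)
    return worst

print(round(leak(S, V, ["Bob", "Mary"]), 3))   # 0.111 > 0: some positive disclosure
```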

  39. The Decision Problem Query-View Security Drawbacks • Tuples are modeled as mutually independent • This is not the case in the presence of constraints (e.g. foreign-key constraints) • Modeling prior or external knowledge • a Boolean predicate does not suffice • Supporting only conjunctive queries is restrictive • Guarantees are instance-independent • There may not be a privacy breach given the current instance

  40. Outline • Defining Privacy • Optimization Problem • First-Cut Solution (k-anonymity) • Second-Cut Solution (l-diversity) • Decision Problem • First-Cut (Query-View Security) • Second-Cut (View Safety)

  41. The Decision Problem More general setting • Alice has a database D which conforms to schema S. • D satisfies a set of integrity constraints. • V is a set of views over D. • The attacker’s belief is modeled as a probability distribution. • Views and queries are defined using unions of conjunctive queries (UCQs). • Alice wants to publish an additional view N. • Does view N provide any new information to the attacker about the answer to query Q?

  42. Motivating Example (w/o Constraints) Secret: Alice wants to hide the reviewer of paper P1: S(r) :- RP(r, ‘P1’) Published Views: V1(r) :- RC(r, c), V2(c) :- RC(r, c) New Additional Views: N1(r, c) :- RC(r, c), N2(c, p) :- CP(c, p) The new views reveal nothing about the secret

  43. Motivating Example (with Constraint 1) Published Views: V1(r) :- RC(r, c), V2(c) :- RC(r, c) New Additional Views: N1(r, c) :- RC(r, c), N2(c, p) :- CP(c, p) Constraint 1: Papers assigned to a committee can only be reviewed by committee members: ∀r ∀p ( RP(r,p) → ∃c RC(r,c) ∧ CP(c,p) ) Data disclosure depends on the constraints. [The slide lists the possible secrets given the new views.]

  44. Motivating Example (with Constraint 2) Published Views: V1(r) :- RC(r, c), V2(c) :- RC(r, c) New Additional Views: N1(r, c) :- RC(r, c), N2(c, p) :- CP(c, p) Constraint 1: Papers assigned to a committee can only be reviewed by committee members Constraint 2: Each paper has exactly 2 reviewers Data disclosure depends on the constraints. [The slide lists the possible secrets given the new views.]

  45. Motivating Example (different instance) Published Views: V1(r) :- RC(r, c), V2(c) :- RC(r, c) New Additional Views: N1(r, c) :- RC(r, c), N2(c, p) :- CP(c, p) Constraint 1: Papers assigned to a committee can only be reviewed by committee members Data disclosure depends on the instance. The new views reveal nothing about the secret, since any subset of the reviewers in V1 may review paper ‘P1’

  46. Probabilities Revisited: Plausible Secrets • In order to allow correlation of tuples, the attacker assigns probabilities to the plausible secrets (outcomes for query S that are possible given the published views) e.g. in the previous example with constraint 1 & secret S(r) :- RP(r, ‘P1’) Published Views: V1(r) :- RC(r, c), V2(c) :- RC(r, c) Plausible Secrets: any subset of V1, e.g. P1 = 3/8, P2 = 1/8, P3 = 2/8, P4 = 2/8, Pi = 0 for i > 4, …
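
A small Python sketch of the plausible-secrets idea. The set of three reviewer names in V1 is an assumption (only R1 is named in the transcript), and the uniform probability assignment is just one example of a distribution summing to 1:

```python
from itertools import chain, combinations

# Reviewer names in V1: three names assumed (only R1 appears in the transcript)
V1 = ["R1", "R2", "R3"]

def plausible_secrets(view):
    """Under constraint 1, any subset of the reviewers in V1 is a
    plausible answer to S(r) :- RP(r, 'P1')."""
    return list(chain.from_iterable(
        combinations(view, r) for r in range(len(view) + 1)))

secrets = plausible_secrets(V1)
print(len(secrets))   # 8 plausible secrets

# The attacker assigns each plausible secret a probability summing to 1;
# a uniform assignment is shown here purely as an example.
probs = {s: 1 / len(secrets) for s in secrets}
assert abs(sum(probs.values()) - 1.0) < 1e-9
```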

  47. The Decision Problem Possible Worlds • This induces a probability distribution on the set of possible worlds (possible instances that satisfy the constraints & the published views) Published Views: V1(r) :- RC(r, c), V2(c) :- RC(r, c) Plausible Secrets: any subset of V1 …

  48. The Decision Problem Possible Worlds • This induces a probability distribution on the set of possible worlds (possible instances that satisfy the constraints & the published views) Possible Worlds where S = {(R1)}: groups PG1 and PG2. Published Views: V1(r) :- RC(r, c), V2(c) :- RC(r, c) Plausible Secrets: any subset of V1, e.g. P1 = 3/8, P2 = 1/8, P3 = 2/8, P4 = 2/8, Pi = 0 for i > 4, …

  49. Probability Distribution on Possible Worlds • This induced probability distribution can be: General: the sum of the probabilities of the possible worlds for any secret value s is equal to the probability of S = s [The slide splits P1 = 3/8 across the possible-world groups PG1 and PG2 for S = {(R1)}.]

  50. Probability Distribution on Possible Worlds • This induced probability distribution can be: Equiprobable: each of the possible worlds for any secret value s is equally probable (i.e. equal to the probability of S = s divided by the number of possible worlds for s)
