CS573 Data Privacy and Security: Anonymization methods

Li Xiong

- Permutation based anonymization methods (cont.)
- Other privacy principles for microdata publishing
- Statistical databases

- Non-perturbative: don't distort the data
- Generalization
- Suppression

- Perturbative: distort the data
- Microaggregation/clustering
- Additive noise

- Anatomization and permutation
- De-associate relationship between QID and sensitive attribute

Concept of the Anatomy Algorithm

- Release 2 tables, quasi-identifier table (QIT) and sensitive table (ST)
- Use the same QI groups (satisfy l-diversity), replace the sensitive attribute values with a Group-ID column
- Then produce a sensitive table with Disease statistics

Specifications of Anatomy cont.

DEFINITION 3. (Anatomy)

Given an l-diverse partition, anatomy creates a QIT table and an ST table.

The QIT has the schema:

($A^{qi}_1, A^{qi}_2, \ldots, A^{qi}_d$, Group-ID)

The ST has the schema:

(Group-ID, $A^s$, Count)
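
The construction in Definition 3 can be sketched in a few lines of Python. This is only an illustration: the function names, the input layout (a list of dicts plus a partition given as lists of row indices), and the sample records are assumptions, not part of the original slides or of the published algorithm.

```python
from collections import Counter


def anatomize(rows, groups, qi_attrs, sensitive_attr):
    """Build the QIT and ST from a given partition.

    rows: list of dicts; groups: list of lists of row indices (the partition).
    """
    qit, st = [], []
    for gid, members in enumerate(groups, start=1):
        # QIT: exact QI values plus the Group-ID, without the sensitive value.
        for i in members:
            qit.append(tuple(rows[i][a] for a in qi_attrs) + (gid,))
        # ST: one (Group-ID, sensitive value, count) row per distinct value.
        counts = Counter(rows[i][sensitive_attr] for i in members)
        for value, count in sorted(counts.items()):
            st.append((gid, value, count))
    return qit, st


def max_inference_probability(st):
    """Largest count / group-size ratio over all ST rows."""
    group_sizes = Counter()
    for gid, _, count in st:
        group_sizes[gid] += count
    return max(count / group_sizes[gid] for gid, _, count in st)


# Hypothetical microdata, made up for this sketch.
rows = [
    {"Age": 23, "Sex": "M", "Disease": "pneumonia"},
    {"Age": 27, "Sex": "M", "Disease": "dyspepsia"},
    {"Age": 35, "Sex": "M", "Disease": "dyspepsia"},
    {"Age": 59, "Sex": "M", "Disease": "pneumonia"},
    {"Age": 61, "Sex": "F", "Disease": "flu"},
    {"Age": 65, "Sex": "F", "Disease": "gastritis"},
    {"Age": 65, "Sex": "F", "Disease": "flu"},
    {"Age": 70, "Sex": "F", "Disease": "bronchitis"},
]
groups = [[0, 1, 2, 3], [4, 5, 6, 7]]  # assumed to be a 2-diverse partition
qit, st = anatomize(rows, groups, ["Age", "Sex"], "Disease")
print(qit)
print(st)
print(max_inference_probability(st))
```

In this sample no sensitive value covers more than half of its group, so the ratio computed by `max_inference_probability` stays at 1/2, in line with the 1/l bound for l = 2.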

Privacy properties

THEOREM 1. Given a pair of QIT and ST, the probability of correctly inferring the sensitive value of any individual is at most 1/l

Comparison with generalization

- Compare with generalization under two assumptions:
- A1: the adversary has the QI-values of the target individual
- A2: the adversary also knows that the individual is definitely in the microdata
- If A1 and A2 are both true, anatomy is as good as generalization: the 1/l bound holds
- If A1 is true and A2 is false, generalization is stronger
- If A1 and A2 are both false, generalization is still stronger

Preserving Data Correlation

- Examine the correlation between Age and Disease in T using probability density function pdf
- Example: t1

Preserving Data Correlation cont.

- To re-construct an approximate pdf of t1 from the generalization table:

Preserving Data Correlation cont.

- To re-construct an approximate pdf of t1 from the QIT and ST tables:

Preserving Data Correlation cont.

- To figure out a more rigorous comparison, calculate the “L2 distance” with the following equation:
- The distance for anatomy is 0.5 while the distance for generalization is 22.5

Preserving Data Correlation cont.

Idea: Measure the error for each tuple by using the following formula:

Objective: sum the error over all tuples t in T and obtain a minimal re-construction error (RCE):

Algorithm: Nearly-Optimal Anatomizing Algorithm

Experiments

- Dataset: CENSUS, containing the personal information of 500k American adults, with 9 discrete attributes
- Created two sets of microdata tables
- Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute $A^s$
- Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute $A^s$

Experiments cont.

- Permutation based anonymization methods (cont.)
- Other privacy principles for microdata publishing
- Statistical databases
- Differential privacy

- k-Anonymity does not provide privacy if
- Sensitive values in an equivalence class lack diversity
- The attacker has background knowledge

A 3-anonymous patient table

Homogeneity attack

Background knowledge attack

[Machanavajjhala et al. ICDE ‘06]

Sensitive attributes must be “diverse” within each quasi-identifier equivalence class

- Each equivalence class has at least l well-represented sensitive values
- Doesn’t prevent probabilistic inference attacks

Example: in an equivalence class of 10 records, 8 records have HIV and 2 records have other values

- Probabilistic l-diversity
- The frequency of the most frequent value in an equivalence class is bounded by 1/l

- Entropy l-diversity
- The entropy of the distribution of sensitive values in each equivalence class is at least log(l)

- Recursive (c,l)-diversity
- $r_1 < c\,(r_l + r_{l+1} + \ldots + r_m)$, where $r_i$ is the frequency of the i-th most frequent value
- Intuition: the most frequent value does not appear too frequently
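
The variants above can be checked mechanically for a single equivalence class. The following is a minimal sketch; the function names and the sample class (8 HIV records plus two other values, mirroring the 10-record example) are illustrative assumptions.

```python
import math
from collections import Counter


def distinct_diversity(values):
    """Number of distinct sensitive values in the class."""
    return len(set(values))


def is_probabilistic_l_diverse(values, l):
    """Most frequent value has relative frequency at most 1/l."""
    top = Counter(values).most_common(1)[0][1]
    return top / len(values) <= 1 / l


def is_entropy_l_diverse(values, l):
    """Entropy of the sensitive-value distribution is at least log(l)."""
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
    return entropy >= math.log(l)


def is_recursive_cl_diverse(values, c, l):
    """r_1 < c * (r_l + r_{l+1} + ... + r_m), frequencies sorted descending."""
    r = sorted(Counter(values).values(), reverse=True)  # r[0] is r_1
    if len(r) < l:
        return False
    return r[0] < c * sum(r[l - 1:])


# Hypothetical class: 8 HIV records and 2 records with other values.
example_class = ["HIV"] * 8 + ["flu", "cancer"]
print(distinct_diversity(example_class))
print(is_probabilistic_l_diverse(example_class, 2))
print(is_entropy_l_diverse(example_class, 2))
print(is_recursive_cl_diverse(example_class, 3, 2))
```

The sample class has three distinct values, yet it fails the probabilistic, entropy, and recursive (3,2) checks because one value dominates.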

Original dataset: 99% have cancer

- Anonymization A: a quasi-identifier group with 50% cancer is “diverse”. This leaks a ton of information
- Anonymization B: a quasi-identifier group with 99% cancer is not “diverse”

- Example: sensitive attribute is HIV+ (1%) or HIV- (99%)
- Very different degrees of sensitivity!

- l-diversity is unnecessary
- 2-diversity is unnecessary for an equivalence class that contains only HIV- records

- l-diversity is difficult to achieve
- Suppose there are 10000 records in total
- To have distinct 2-diversity, there can be at most 10000*1%=100 equivalence classes

- Example: sensitive attribute is HIV+ (1%) or HIV- (99%)
- Consider an equivalence class that contains an equal number of HIV+ and HIV- records
- Diverse, but potentially violates privacy!

- l-diversity does not differentiate:
- Equivalence class 1: 49 HIV+ and 1 HIV-
- Equivalence class 2: 1 HIV+ and 49 HIV-

l-diversity does not consider overall distribution of sensitive values!
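
A small numeric check makes the point concrete. The sketch below uses the two equivalence classes and the 1% overall HIV+ rate from the example; the helper name `positive_rate` is made up for illustration.

```python
from collections import Counter


def positive_rate(records, positive="HIV+"):
    """Fraction of records in the class carrying the positive value."""
    return Counter(records)[positive] / len(records)


overall_rate = 0.01  # 1% HIV+ in the whole table, as in the example
class_1 = ["HIV+"] * 49 + ["HIV-"] * 1
class_2 = ["HIV+"] * 1 + ["HIV-"] * 49

for name, cls in (("class 1", class_1), ("class 2", class_2)):
    # Both classes contain two distinct values, so both are distinct 2-diverse.
    print(name, "distinct values:", len(set(cls)),
          "P(HIV+):", positive_rate(cls), "overall:", overall_rate)
```

Both classes pass distinct 2-diversity, but an individual placed in class 1 is inferred to be HIV+ with far higher probability than the overall rate suggests.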

A 3-diverse patient table

Similarity attack

Conclusion

Bob’s salary is in [20k,40k], which is relatively low

Bob has some stomach-related disease

l-diversity does not consider semantics of sensitive values!

- Rationale

- Observations
- Q is public or can be derived
- Potential knowledge gain from Q and Pi about Specific individuals

- Principle
- The distance between Q and Pi should be bounded by a threshold t.

External knowledge

Overall distribution Q of sensitive values

Distribution Pi of sensitive values in each equi-class

[Li et al. ICDE ‘07]

Distribution of sensitive attributes within each quasi-identifier group should be “close” to their distribution in the entire original database

- P=(p1,p2,…,pm), Q=(q1,q2,…,qm)

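- Trace-distance: $D[P,Q] = \frac{1}{2}\sum_{i=1}^{m} |p_i - q_i|$

- KL-divergence: $D[P,Q] = \sum_{i=1}^{m} p_i \log\frac{p_i}{q_i}$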

- None of these measures reflects the semantic distance among values.
- Q = {3K, 4K, 5K, 6K, 7K, 8K, 9K, 10K, 11K}
- P1 = {3K, 4K, 5K}
- P2 = {5K, 7K, 10K}
- Intuitively, D[P1,Q] > D[P2,Q]
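
To see why these two measures miss the intuition, the sketch below evaluates them on the salary example. It assumes the usual definitions (trace distance as half the L1 difference, KL-divergence as the sum of p_i log(p_i/q_i)) and treats each set as a uniform distribution over its listed values; the helper names are illustrative.

```python
import math

DOMAIN = [3, 4, 5, 6, 7, 8, 9, 10, 11]  # salaries in thousands


def uniform_over(values, domain=DOMAIN):
    """Uniform distribution on `values`, expressed over the full domain."""
    return [1 / len(values) if v in values else 0.0 for v in domain]


def trace_distance(p, q):
    """Half the sum of absolute differences (assumed definition)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """Sum of p_i * log(p_i / q_i) over entries with p_i > 0 (assumed definition)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


Q = uniform_over(DOMAIN)
P1 = uniform_over([3, 4, 5])
P2 = uniform_over([5, 7, 10])

print(trace_distance(P1, Q), trace_distance(P2, Q))
print(kl_divergence(P1, Q), kl_divergence(P2, Q))
```

Both measures assign P1 and P2 the same distance from Q, because they only compare probabilities value by value and ignore how far apart the salary values are.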

- If the distributions are interpreted as two different ways of piling up a certain amount of dirt over region D, EMD is the minimum cost of turning one pile into the other
- the cost is amount of dirt moved * the distance by which it is moved
- Assume two piles have the same amount of dirt

- Extensions for comparison of distributions with different total masses.
- allow for a partial match, discard leftover “dirt”, without cost
- allow for mass to be created or destroyed, but with a cost penalty

- Formulation
- P=(p1,p2,…,pm), Q=(q1,q2,…,qm)
- dij: the ground distance between element i of P and element j of Q.
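- Find a flow F=[fij], where fij is the flow of mass from element i of P to element j of Q, that minimizes the overall work $\mathrm{WORK}(P,Q,F) = \sum_{i=1}^{m}\sum_{j=1}^{m} d_{ij} f_{ij}$
- subject to the constraints: $f_{ij} \ge 0$; $p_i - \sum_{j} f_{ij} + \sum_{j} f_{ji} = q_i$ for every i; and $\sum_{i}\sum_{j} f_{ij} = \sum_{i} p_i = \sum_{i} q_i = 1$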

- EMD for categorical attributes
- Hierarchical distance
- Hierarchical distance is a metric

- Example
- {3k,4k,5k} and {3k,4k,5k,6k,7k,8k,9k,10k,11k}
- Move 1/9 probability for each of the following pairs
- 3k->6k,3k->7k cost: 1/9*(3+4)/8
- 4k->8k,4k->9k cost: 1/9*(4+5)/8
- 5k->10k,5k->11k cost: 1/9*(5+6)/8

- Total cost: 1/9*27/8=0.375
- With P2={6k,8k,11k}, the total cost is 1/9 * 12/8 = 0.167 < 0.375. This makes more sense than the other two distance measures.
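
The arithmetic in this example can be reproduced with a short script. The sketch assumes the ground distance between the i-th and j-th salary values is |i - j| / (m - 1), which is what the division by 8 in the example implies, and uses the closed form for ordered domains (the sum of absolute running surpluses) instead of solving the general flow problem. The function names are illustrative.

```python
from itertools import accumulate

DOMAIN = [3, 4, 5, 6, 7, 8, 9, 10, 11]  # salaries in thousands


def uniform_over(values, domain=DOMAIN):
    """Uniform distribution on `values`, expressed over the full domain."""
    return [1 / len(values) if v in values else 0.0 for v in domain]


def ordered_emd(p, q):
    """EMD for an ordered domain with ground distance |i - j| / (m - 1).

    The running sum of (p_i - q_i) is the mass that must cross from
    position i to position i + 1; each crossing costs 1 / (m - 1).
    """
    m = len(p)
    running_surplus = accumulate(pi - qi for pi, qi in zip(p, q))
    return sum(abs(s) for s in running_surplus) / (m - 1)


Q = uniform_over(DOMAIN)
P1 = uniform_over([3, 4, 5])
P2 = uniform_over([6, 8, 11])

print(round(ordered_emd(P1, Q), 3))
print(round(ordered_emd(P2, Q), 3))
```

Under these assumptions the script gives 0.375 for P1 and about 0.167 for P2, matching the totals in the example.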

- Goal
- To show l-diversity does not provide sufficient privacy protection (the similarity attack).
- To show the efficiency and data quality of using t-closeness are comparable with other privacy measures.

- Setup
- Adult dataset from UC Irvine ML repository
- 30162 tuples, 9 attributes (2 sensitive attributes)
- Algorithm: Incognito

- Comparisons of privacy measurements
- k-Anonymity
- Entropy l-diversity
- Recursive (c,l)-diversity
- k-Anonymity with t-closeness

- Efficiency
- The efficiency of using t-closeness is comparable with other privacy measurements

- Data utility
- Discernibility metric; Minimum average group size
- The data quality of using t-closeness is comparable with other privacy measurements

This is k-anonymous, l-diverse and t-close… so secure, right?

Bob is Caucasian and I heard he was admitted to hospital with flu…

Bob is Caucasian and I heard he was admitted to hospital… And I know three other Caucasians admitted to hospital with Acne or Shingles…

- Syntactic
- Focuses on data transformation, not on what can be learned from the anonymized dataset
- “k-anonymous” dataset can leak sensitive information

- “Quasi-identifier” fallacy
- Assumes a priori that attacker will not know certain information about his target

- Permutation based anonymization methods (cont.)
- Other privacy principles for microdata publishing
- Statistical databases
- Definitions and early methods
- Output perturbation and differential privacy

- Originated from the study on statistical database
- A statistical database is a database which provides statistics on subsets of records
- OLAP vs. OLTP
- Statistics such as SUM, MEAN, MEDIAN, COUNT, MAX and MIN may be computed over records

Static – a static database is made once and never changes

Example: U.S. Census

Dynamic – changes continuously to reflect real-time data

Example: most online research databases

Centralized – one database

Decentralized – multiple decentralized databases

- General purpose – like census

- Special purpose – like bank, hospital, academia, etc

- Exact compromise – a user is able to determine the exact value of a sensitive attribute of an individual
- Partial compromise – a user is able to obtain an estimator for a sensitive attribute with a bounded variance
- Positive compromise – determine an attribute has a particular value
- Negative compromise – determine an attribute does not have a particular value
- Relative compromise – determine the ranking of some confidential values

- Bias – difference between the unperturbed statistic and the expected value of its perturbed estimate
- Precision – variance of the estimators obtained by users
- Consistency – lack of contradictions and paradoxes
- Contradictions: different responses to same query; average differs from sum/count
- Paradox: negative count

- Query restriction
- Data perturbation/anonymization
- Output perturbation

- Data anonymization is one technique that can be used to build a statistical database
- Other techniques, such as query restriction and output perturbation, can be used to build a statistical database or release statistical data
- Different privacy principles can be used

- Query restriction (early methods)
- Query size control
- Query set overlap control
- Query auditing

- Data perturbation/anonymization
- Output perturbation

- Query-set size control restricts the number of records that may be in the query set
- The query result is released only if the size of the query set |C| satisfies
K ≤ |C| ≤ L − K

where L is the size of the database and K is a parameter that satisfies 0 ≤ K ≤ L/2

- Q1: Count ( Sex = Female ) = A
- Q2: Count ( Sex = Female OR
(Age = 42 & Sex = Male & Employer = ABC) ) = B

What if B = A+1? Then exactly one record matches (Age = 42 & Sex = Male & Employer = ABC)

- Q3: Count ( Sex = Female OR
((Age = 42 & Sex = Male & Employer = ABC) &
Diagnosis = Schizophrenia) )

If Q3 returns A+1, that individual has the diagnosis; if it returns A, he does not. Positively or negatively compromised!
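
The attack can be simulated against a toy database that enforces query-set size control. Everything in this sketch is hypothetical: the records, the threshold K = 3, and the helper `make_count_query` are invented for illustration.

```python
def make_count_query(db, k):
    """Return a COUNT function that enforces K <= |C| <= L - K."""
    size = len(db)

    def count(predicate):
        c = sum(1 for rec in db if predicate(rec))
        if not (k <= c <= size - k):
            return None  # query refused by the size control
        return c

    return count


# Hypothetical database: 10 women, 9 other men, and the single target record.
db = (
    [{"sex": "F", "age": 30 + i, "employer": "XYZ", "diagnosis": "none"} for i in range(10)]
    + [{"sex": "M", "age": 20 + i, "employer": "XYZ", "diagnosis": "none"} for i in range(9)]
    + [{"sex": "M", "age": 42, "employer": "ABC", "diagnosis": "schizophrenia"}]
)
count = make_count_query(db, k=3)


def is_female(rec):
    return rec["sex"] == "F"


def is_target(rec):
    return rec["age"] == 42 and rec["sex"] == "M" and rec["employer"] == "ABC"


direct = count(is_target)                                # refused: query set too small
a = count(is_female)                                     # Q1
b = count(lambda r: is_female(r) or is_target(r))        # Q2
q3 = count(lambda r: is_female(r)
           or (is_target(r) and r["diagnosis"] == "schizophrenia"))  # Q3

print(direct, a, b, q3)
print("target is unique:", b - a == 1)
print("target has the diagnosis:", q3 - a == 1)
```

The direct query about the target is refused, but padding it with the large Sex = Female set keeps every query inside the allowed size range while the differences still isolate the individual.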

- With query-set size control, the database can easily be compromised within 4-5 queries
- If the threshold K is large, too many queries are restricted
- And it still does not guarantee protection from compromise

- Basic idea: each new query is checked against earlier queries for the number of records they have in common.
- If the number of common records with any earlier query exceeds a given threshold, the requested statistic is not released.
- A query q(C) is only allowed if, for every previously answered query q(D):
|q(C) ∩ q(D)| ≤ r, r > 0

where r is set by the administrator
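
A minimal sketch of this check follows, assuming query sets are tracked as sets of record indices and compared against every previously answered query. The class name, the sample records, and r = 1 are illustrative assumptions.

```python
class OverlapController:
    """Answers SUM queries only if the overlap with every earlier query set is <= r."""

    def __init__(self, db, r):
        self.db = db
        self.r = r
        self.answered = []  # query sets of answered queries, as frozensets of indices

    def sum_query(self, predicate, field):
        query_set = frozenset(i for i, rec in enumerate(self.db) if predicate(rec))
        # Compare the new query set with every previously answered one.
        if any(len(query_set & earlier) > self.r for earlier in self.answered):
            return None  # statistic not released
        self.answered.append(query_set)
        return sum(self.db[i][field] for i in query_set)


# Hypothetical records, made up for this sketch.
db = [
    {"dept": "A", "salary": 50},
    {"dept": "A", "salary": 60},
    {"dept": "A", "salary": 70},
    {"dept": "B", "salary": 80},
    {"dept": "B", "salary": 90},
    {"dept": "B", "salary": 100},
]
ctl = OverlapController(db, r=1)
first = ctl.sum_query(lambda rec: rec["dept"] == "A", "salary")
second = ctl.sum_query(lambda rec: rec["dept"] == "A" or rec["salary"] == 80, "salary")
third = ctl.sum_query(lambda rec: rec["dept"] == "B", "salary")
print(first, second, third)
```

The second query overlaps the first in three records and is refused, which blocks the differencing trick but also shows why a set and its superset can never both be released.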

- Ineffective for cooperation of several users
- Statistics for a set and its subset cannot be released – limiting usefulness
- Need to keep user profile
- High processing overhead – every new query compared with all previous ones
- No formal privacy guarantee

- Keep up-to-date logs of all queries made by each user and check for possible compromise whenever a new query is issued
- Excessive computation and storage requirements
- “Efficient” methods for special types of queries

- Query auditing method for SUM queries
- A SUM query can be considered as a linear equation $\sum_{i=1}^{L} a_i x_i = q$,
where $a_i \in \{0,1\}$ indicates whether record i belongs to the query set, $x_i$ is the sensitive value, and q is the query result

- A set of SUM queries can be thought of as a system of linear equations
- Maintain a binary matrix representing the linearly independent queries and update it when a new query is issued
- A row with all 0s except for ith column indicates disclosure
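
One way to sketch this audit is to keep the answered query vectors in reduced row echelon form over the rationals and to deny any query that would produce a row with a single nonzero entry. This is an illustrative online variant, not the specific algorithm the slides refer to; the class and function names are assumptions.

```python
from fractions import Fraction


def rref(rows):
    """Reduced row echelon form over the rationals; zero rows are dropped."""
    rows = [[Fraction(x) for x in row] for row in rows]
    ncols = len(rows[0]) if rows else 0
    pivot_row = 0
    for col in range(ncols):
        pivot = next((r for r in range(pivot_row, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        lead = rows[pivot_row][col]
        rows[pivot_row] = [x / lead for x in rows[pivot_row]]
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
        if pivot_row == len(rows):
            break
    return rows[:pivot_row]


class SumAuditor:
    def __init__(self, num_records):
        self.num_records = num_records
        self.basis = []  # linearly independent answered queries, kept in RREF

    def audit(self, query_set):
        """Return True and record the query if answering it discloses no single x_i."""
        vector = [1 if i in query_set else 0 for i in range(self.num_records)]
        candidate = rref(self.basis + [vector])
        # A row with exactly one nonzero entry pins down one sensitive value.
        if any(sum(1 for x in row if x != 0) == 1 for row in candidate):
            return False
        self.basis = candidate  # a dependent query leaves the basis unchanged
        return True


auditor = SumAuditor(4)
results = [
    auditor.audit({0, 1, 2}),  # safe
    auditor.audit({0, 1}),     # would reveal x_2 by subtraction
    auditor.audit({2, 3}),     # safe
    auditor.audit({0, 1, 3}),  # together with the answered queries reveals x_3
]
print(results)
```

Because only independent vectors are stored, the basis never holds more rows than there are records, which is what keeps the bookkeeping bounded.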

- Only stores linearly independent queries
- Not all queries are linearly independent, e.g. Q1 = Q2 + Q3 for:
- Q1: Sum(Sex=M)
- Q2: Sum(Sex=M AND Age>20)
- Q3: Sum(Sex=M AND Age<=20)

- $O(L^2)$ time complexity
- Further work reduced to O(L) time and space when number of queries < L
- Only for SUM queries
- No restrictions on query set size
- Maximizing non-confidential information is NP-complete

- Online auditing
- “Detect and deny” queries that violate privacy requirement
- Denials themselves may implicitly disclose sensitive information

- Offline auditing
- Check if a privacy requirement has been violated after the queries have been executed
- Detects violations but does not prevent them

Query restriction

Data perturbation/anonymization

Output perturbation and differential privacy

Sampling

Output perturbation

- Partial slides:
http://www.cs.jmu.edu/users/aboutams

- Adam, Nabil R.; Wortmann, John C. Security-Control Methods for Statistical Databases: A Comparative Study. ACM Computing Surveys, Vol. 21, No. 4, December 1989
- Fung et al. Privacy-Preserving Data Publishing: A Survey of Recent Developments. ACM Computing Surveys, in press, 2009