
### CS573 Data Privacy and Security: Anonymization methods

Li Xiong

Today

- Permutation based anonymization methods (cont.)
- Other privacy principles for microdata publishing
- Statistical databases

Anonymization methods

- Non-perturbative: don't distort the data
- Generalization
- Suppression

- Perturbative: distort the data
- Microaggregation/clustering
- Additive noise

- Anatomization and permutation
- De-associate relationship between QID and sensitive attribute

Concept of the Anatomy Algorithm

- Release 2 tables, quasi-identifier table (QIT) and sensitive table (ST)
- Use the same QI groups (satisfy l-diversity), replace the sensitive attribute values with a Group-ID column
- Then produce a sensitive table with Disease statistics

Specifications of Anatomy cont.

DEFINITION 3. (Anatomy)

Given an l-diverse partition, anatomy creates a QIT table and an ST table.

QIT has the schema:

$(A^{qi}_1, A^{qi}_2, \ldots, A^{qi}_d, \text{Group-ID})$

ST has the schema:

$(\text{Group-ID}, A^s, \text{Count})$

THEOREM 1. Given a pair of QIT and ST, an adversary can infer the sensitive value of any individual with probability at most 1/l.
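A minimal Python sketch of how the two tables could be produced from a partition. The function names, the toy rows, and the partition itself are illustrative assumptions, not the algorithm from the slides; the partition is simply taken as given.

```python
from collections import Counter, defaultdict

def anatomize(rows, groups, qi_attrs, sensitive_attr):
    """Build (QIT, ST) from a partition given as lists of row indices."""
    qit, st = [], []
    for group_id, members in enumerate(groups, start=1):
        # QIT: quasi-identifier values plus Group-ID, sensitive value dropped
        for idx in members:
            qit.append(tuple(rows[idx][a] for a in qi_attrs) + (group_id,))
        # ST: one (Group-ID, sensitive value, count) row per distinct value
        counts = Counter(rows[idx][sensitive_attr] for idx in members)
        for value, count in sorted(counts.items()):
            st.append((group_id, value, count))
    return qit, st

def max_inference_probability(st):
    """Largest share of a single sensitive value within any group of ST."""
    sizes, largest = defaultdict(int), defaultdict(int)
    for group_id, _value, count in st:
        sizes[group_id] += count
        largest[group_id] = max(largest[group_id], count)
    return max(largest[g] / sizes[g] for g in sizes)

# Hypothetical microdata and a hypothetical 2-diverse partition
rows = [
    {"Age": 23, "Zipcode": "11000", "Disease": "pneumonia"},
    {"Age": 27, "Zipcode": "13000", "Disease": "dyspepsia"},
    {"Age": 35, "Zipcode": "59000", "Disease": "dyspepsia"},
    {"Age": 59, "Zipcode": "12000", "Disease": "pneumonia"},
    {"Age": 61, "Zipcode": "54000", "Disease": "flu"},
    {"Age": 65, "Zipcode": "25000", "Disease": "gastritis"},
]
groups = [[0, 1, 2, 3], [4, 5]]
qit, st = anatomize(rows, groups, ["Age", "Zipcode"], "Disease")
```

In this toy partition every group holds its most frequent disease in at most half of its tuples, so `max_inference_probability(st)` stays within the 1/l bound for l = 2.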

Comparison with generalization

- Compare with generalization on two assumptions:
- A1: the adversary has the QI-values of the target individual
- A2: the adversary also knows that the individual is definitely in the microdata
- If A1 and A2 are true, anatomy is as good as generalization: the 1/l bound on inference holds
- If A1 is true and A2 is false, generalization is stronger
- If A1 and A2 are false, generalization is still stronger

Preserving Data Correlation

- Examine the correlation between Age and Disease in T using the probability density function (pdf)
- Example: t1

Preserving Data Correlation cont.

- To re-construct an approximate pdf of t1 from the generalization table:

Preserving Data Correlation cont.

- To re-construct an approximate pdf of t1 from the QIT and ST tables:

Preserving Data Correlation cont.

- To figure out a more rigorous comparison, calculate the “L2 distance” with the following equation:
- The distance for anatomy is 0.5 while the distance for generalization is 22.5

Preserving Data Correlation cont.

Idea: Measure the error for each tuple by using the following formula:

Objective: sum this error over all tuples t in T and obtain a minimal re-construction error (RCE):
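A sketch of the two quantities, assuming the per-tuple error is the squared L2 distance between the true pdf of a tuple and the pdf re-constructed from the published tables. The symbols $G_t$ (true pdf of tuple t), $\tilde{G}_t$ (re-constructed pdf), and $\mathrm{Err}$ are illustrative notation:

$$
\mathrm{Err}(t) = \int_{x} \bigl(\tilde{G}_t(x) - G_t(x)\bigr)^2 \, dx,
\qquad
\mathrm{RCE} = \sum_{t \in T} \mathrm{Err}(t)
$$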

Algorithm: Nearly-Optimal Anatomizing Algorithm

Experiments

- Dataset: CENSUS, containing the personal information of 500k American adults, with 9 discrete attributes
- Created two sets of microdata tables
- Set 1: 5 tables denoted as OCC-3, ..., OCC-7 so that OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute As
- Set 2: 5 tables denoted as SAL-3, ..., SAL-7 so that SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute As

Experiments cont.

Today

- Permutation based anonymization methods (cont.)
- Other privacy principles for microdata publishing
- Statistical databases
- Differential privacy

Attacks on k-Anonymity

- k-Anonymity does not provide privacy if
- Sensitive values in an equivalence class lack diversity
- The attacker has background knowledge

A 3-anonymous patient table

Homogeneity attack

Background knowledge attack

l-Diversity

[Machanavajjhala et al. ICDE ‘06]

Sensitive attributes must be “diverse” within each quasi-identifier equivalence class

Distinct l-Diversity

- Each equivalence class has at least l well-represented sensitive values
- Doesn’t prevent probabilistic inference attacks

Example: in an equivalence class of 10 records, 8 records have HIV and 2 records have other values
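The 10-record example can be checked with a short Python sketch. The function names and the two "other" values (`flu`, `cancer`) are hypothetical; only the 8-versus-2 split comes from the slide.

```python
from collections import Counter

def is_distinct_l_diverse(sensitive_values, l):
    """Distinct l-diversity: at least l different sensitive values in the class."""
    return len(set(sensitive_values)) >= l

def most_frequent_share(sensitive_values):
    """Probability that a random record in the class has the most common value."""
    counts = Counter(sensitive_values)
    return max(counts.values()) / len(sensitive_values)

# 8 HIV records plus 2 records with other (hypothetical) values
eq_class = ["HIV"] * 8 + ["flu", "cancer"]

print(is_distinct_l_diverse(eq_class, 3))  # True
print(most_frequent_share(eq_class))       # 0.8
```

The class passes the distinct 3-diversity test, yet an attacker who knows the target is in it can still guess HIV with 80% confidence.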

Other Versions of l-Diversity

- Probabilistic l-diversity
- The frequency of the most frequent value in an equivalence class is bounded by 1/l

- Entropy l-diversity
- The entropy of the distribution of sensitive values in each equivalence class is at least log(l)

- Recursive (c,l)-diversity
- $r_1 < c\,(r_l + r_{l+1} + \dots + r_m)$, where $r_i$ is the frequency of the i-th most frequent value
- Intuition: the most frequent value does not appear too frequently
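The three variants above can be sketched as simple checks over the sensitive values of one equivalence class. This is a minimal illustration with hypothetical function names; the natural logarithm is used for entropy, and a small tolerance is an added assumption to absorb floating-point rounding.

```python
import math
from collections import Counter

def frequencies(values):
    """Relative frequencies of the sensitive values, most frequent first."""
    n = len(values)
    return sorted((c / n for c in Counter(values).values()), reverse=True)

def probabilistic_l_diverse(values, l):
    # The most frequent value may cover at most a 1/l share of the class
    return frequencies(values)[0] <= 1 / l

def entropy_l_diverse(values, l):
    entropy = -sum(p * math.log(p) for p in frequencies(values))
    # Tolerance (an assumption of this sketch) guards against rounding error
    return entropy >= math.log(l) - 1e-12

def recursive_cl_diverse(values, c, l):
    r = frequencies(values)
    if len(r) < l:
        return False  # fewer than l distinct values
    # r[0] is r_1; r[l-1:] is r_l, r_{l+1}, ..., r_m
    return r[0] < c * sum(r[l - 1:])

skewed = ["HIV"] * 8 + ["flu", "cancer"]       # hypothetical class
balanced = ["HIV", "flu", "cancer", "asthma"]  # hypothetical class
```

For the skewed class all three checks fail at l = 2 with c = 3, while the balanced class satisfies probabilistic and entropy 4-diversity.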

Neither Necessary, Nor Sufficient

- Original dataset: 99% have cancer
- Anonymization A: the 50% cancer quasi-identifier group is “diverse”, but this leaks a ton of information
- Anonymization B: the 99% cancer quasi-identifier group is not “diverse”

Limitations of l-Diversity

- Example: sensitive attribute is HIV+ (1%) or HIV- (99%)
- Very different degrees of sensitivity!

- l-diversity is unnecessary
- 2-diversity is unnecessary for an equivalence class that contains only HIV- records

- l-diversity is difficult to achieve
- Suppose there are 10000 records in total
- To have distinct 2-diversity, there can be at most 10000*1%=100 equivalence classes

Skewness Attack

- Example: sensitive attribute is HIV+ (1%) or HIV- (99%)
- Consider an equivalence class that contains an equal number of HIV+ and HIV- records
- Diverse, but potentially violates privacy!

- l-diversity does not differentiate:
- Equivalence class 1: 49 HIV+ and 1 HIV-
- Equivalence class 2: 1 HIV+ and 49 HIV-

l-diversity does not consider overall distribution of sensitive values!

Sensitive Attribute Disclosure

A 3-diverse patient table

Similarity attack

Conclusion

- Bob’s salary is in [20k,40k], which is relatively low
- Bob has some stomach-related disease

l-diversity does not consider semantics of sensitive values!

t-Closeness: A New Privacy Measure

- Rationale

- Observations
- Q is public or can be derived
- Potential knowledge gain from Q and Pi about specific individuals

- Principle
- The distance between Q and Pi should be bounded by a threshold t.

External knowledge

Overall distribution Q of sensitive values

Distribution Pi of sensitive values in each equi-class

t-Closeness

[Li et al. ICDE ‘07]

Distribution of sensitive attributes within each quasi-identifier group should be “close” to their distribution in the entire original database

Distance Measures

- P=(p1,p2,…,pm), Q=(q1,q2,…,qm)

- Trace-distance

- KL-divergence

- Neither of these measures reflects the semantic distance among values.
- Q: {3K,4K,5K,6K,7K,8K,9K,10K,11K}
- P1: {3K,4K,5K}
- P2: {6K,8K,11K}
- Intuitively, D[P1,Q] > D[P2,Q]
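A minimal Python sketch of the two measures, assuming the usual definitions: trace distance as half the sum of absolute differences, and KL-divergence as the sum of p_i log(p_i / q_i). Each salary set is treated as a uniform distribution over its values, and P2 is taken to be {6K,8K,11K} as in the later EMD example. Function names are illustrative.

```python
import math

def trace_distance(p, q):
    """Half the L1 distance between two distributions on the same domain."""
    return 0.5 * sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))

def kl_divergence(p, q):
    """Sum of p_i * log(p_i / q_i); terms with p_i == 0 contribute nothing."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

salaries = [3, 4, 5, 6, 7, 8, 9, 10, 11]  # in thousands

def uniform_over(subset):
    """Uniform distribution over `subset`, expressed on the full salary domain."""
    return [1 / len(subset) if s in subset else 0.0 for s in salaries]

q = uniform_over(salaries)
p1 = uniform_over([3, 4, 5])
p2 = uniform_over([6, 8, 11])

print(trace_distance(p1, q), trace_distance(p2, q))
print(kl_divergence(p1, q), kl_divergence(p2, q))
```

Both measures assign P1 and P2 the same distance from Q (about 0.667 and log 3 respectively), even though P1 is concentrated on the lowest salaries. They only compare probabilities value by value and ignore how far apart the values are.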

Earth Mover’s Distance

- If the distributions are interpreted as two different ways of piling up a certain amount of dirt over region D, EMD is the minimum cost of turning one pile into the other
- the cost is amount of dirt moved * the distance by which it is moved
- Assume two piles have the same amount of dirt

- Extensions for comparison of distributions with different total masses.
- allow for a partial match, discarding leftover “dirt” without cost
- allow for mass to be created or destroyed, but with a cost penalty

Earth Mover’s Distance

- Formulation
- P=(p1,p2,…,pm), Q=(q1,q2,…,qm)
- dij: the ground distance between element i of P and element j of Q.
- Find a flow F=[fij] where fij is the flow of mass from element i of P to element j of Q that minimizes the overall work:
subject to the constraints:
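A sketch of the optimization problem in LaTeX, assuming the standard transportation formulation for two probability distributions of total mass 1. The name $\mathrm{WORK}$ is illustrative:

$$
\mathrm{WORK}(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij} \, f_{ij}
$$

$$
f_{ij} \ge 0, \quad 1 \le i \le m,\ 1 \le j \le m
$$

$$
p_i - \sum_{j=1}^{m} f_{ij} + \sum_{j=1}^{m} f_{ji} = q_i, \quad 1 \le i \le m
$$

$$
\sum_{i=1}^{m} \sum_{j=1}^{m} f_{ij} = \sum_{i=1}^{m} p_i = \sum_{i=1}^{m} q_i = 1
$$

The second constraint says that the mass at element i, after outgoing and incoming flow, must equal $q_i$. The EMD is the value of WORK at the minimizing flow.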

How to calculate EMD (Cont’d)

- EMD for categorical attributes
- Hierarchical distance
- Hierarchical distance is a metric

Earth Mover’s Distance

- Example
- P1 = {3k,4k,5k} and Q = {3k,4k,5k,6k,7k,8k,9k,10k,11k}
- Move 1/9 probability for each of the following pairs
- 3k->6k, 3k->7k cost: 1/9*(3+4)/8
- 4k->8k, 4k->9k cost: 1/9*(4+5)/8
- 5k->10k, 5k->11k cost: 1/9*(5+6)/8

- Total cost: 1/9*27/8 = 0.375
- With P2 = {6k,8k,11k}, the total cost is 1/9 * 12/8 ≈ 0.167 < 0.375. This makes more sense than the other two distance measures.
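The two totals can be reproduced with a short Python sketch. It assumes the ordered ground distance |i - j| / (m - 1) between the i-th and j-th salary values, under which the minimum work reduces to a sum of absolute running surpluses. Function names are illustrative, and exact fractions avoid rounding.

```python
from fractions import Fraction

def emd_ordered(p, q):
    """EMD between two distributions over the same ordered domain of m values,
    with ground distance |i - j| / (m - 1)."""
    m = len(p)
    running, work = Fraction(0), Fraction(0)
    for p_i, q_i in zip(p, q):
        running += p_i - q_i   # surplus mass that still has to cross to the next value
        work += abs(running)
    return work / (m - 1)

def satisfies_t_closeness(p, q, t):
    """True if the class distribution p is within EMD t of the overall distribution q."""
    return emd_ordered(p, q) <= t

salaries = [3, 4, 5, 6, 7, 8, 9, 10, 11]  # in thousands

def uniform_over(subset):
    return [Fraction(1, len(subset)) if s in subset else Fraction(0) for s in salaries]

q = uniform_over(salaries)
p1 = uniform_over([3, 4, 5])
p2 = uniform_over([6, 8, 11])

print(emd_ordered(p1, q), emd_ordered(p2, q))  # 3/8 1/6
```

With a threshold such as t = 0.2 (a hypothetical choice), the class with distribution P2 would be accepted and the class with distribution P1 rejected.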

Experiments

- Goal
- To show l-diversity does not provide sufficient privacy protection (the similarity attack).
- To show the efficiency and data quality of using t-closeness are comparable with other privacy measures.

- Setup
- Adult dataset from UC Irvine ML repository
- 30162 tuples, 9 attributes (2 sensitive attributes)
- Algorithm: Incognito

Experiments

- Comparisons of privacy measurements
- k-Anonymity
- Entropy l-diversity
- Recursive (c,l)-diversity
- k-Anonymity with t-closeness

Experiments

- Efficiency
- The efficiency of using t-closeness is comparable with other privacy measurements

Experiments

- Data utility
- Discernibility metric; Minimum average group size
- The data quality of using t-closeness is comparable with other privacy measurements

What Does Attacker Know?

Bob is Caucasian and I heard he was admitted to hospital …

And I know three other Caucasians admitted to hospital with Acne or Shingles …

k-Anonymity and Partition-based notions

- Syntactic
- Focuses on data transformation, not on what can be learned from the anonymized dataset
- “k-anonymous” dataset can leak sensitive information

- “Quasi-identifier” fallacy
- Assumes a priori that the attacker will not know certain information about his target

Today

- Permutation based anonymization methods (cont.)
- Other privacy principles for microdata publishing
- Statistical databases
- Definitions and early methods
- Output perturbation and differential privacy

Statistical Data Release

- Originated from the study of statistical databases
- A statistical database is a database which provides statistics on subsets of records
- OLAP vs. OLTP
- Statistics that may be computed include SUM, MEAN, MEDIAN, COUNT, MAX and MIN of records

Types of Statistical Databases

- Static – a static database is made once and never changes
- Example: U.S. Census
- Dynamic – changes continuously to reflect real-time data
- Example: most online research databases

Types of Statistical Databases

- Centralized – one database
- Decentralized – multiple decentralized databases

Types of Statistical Databases

- General purpose – like census
- Special purpose – like bank, hospital, academia, etc.

Data Compromise

- Exact compromise – a user is able to determine the exact value of a sensitive attribute of an individual
- Partial compromise – a user is able to obtain an estimator for a sensitive attribute with a bounded variance
- Positive compromise – determine an attribute has a particular value
- Negative compromise – determine an attribute does not have a particular value
- Relative compromise – determine the ranking of some confidential values

Statistical Quality of Information

- Bias – difference between the unperturbed statistic and the expected value of its perturbed estimate
- Precision – variance of the estimators obtained by users
- Consistency – lack of contradictions and paradoxes
- Contradictions: different responses to same query; average differs from sum/count
- Paradox: negative count

Methods

- Query restriction
- Data perturbation/anonymization
- Output perturbation

Statistical data release vs. data anonymization

- Data anonymization is one technique that can be used to build a statistical database
- Other techniques such as query restriction and output perturbation can be used to build a statistical database or release statistical data
- Different privacy principles can be used

Security Methods

- Query restriction (early methods)
- Query size control
- Query set overlap control
- Query auditing

- Data perturbation/anonymization
- Output perturbation

Query Set Size Control

- Query-set size control limits the number of records that must be in the result set
- Allows the query results to be displayed only if the size of the query set |C| satisfies the condition
K <= |C| <= L – K

where L is the size of the database and K is a parameter that satisfies 0 <= K <= L/2
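The size condition can be sketched as a one-line check in Python. The function name and the example numbers are hypothetical.

```python
def query_allowed(query_set_size, db_size, k):
    """Release a statistic only if K <= |C| <= L - K."""
    if not 0 <= k <= db_size / 2:
        raise ValueError("K must satisfy 0 <= K <= L/2")
    return k <= query_set_size <= db_size - k

# Hypothetical database of L = 1000 records with K = 10
print(query_allowed(1, 1000, 10))    # False: query set too small
print(query_allowed(500, 1000, 10))  # True
print(query_allowed(995, 1000, 10))  # False: complement too small
```

The upper bound matters because a query set of size L - 1 reveals, by subtraction from the total, the value of the single record left out.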

Tracker

- Q1: Count ( Sex = Female ) = A
- Q2: Count ( Sex = Female OR
(Age = 42 & Sex = Male & Employer = ABC) ) = B

What if B = A+1?

Tracker

- Q1: Count ( Sex = Female ) = A
- Q2: Count ( Sex = Female OR
(Age = 42 & Sex = Male & Employer = ABC) ) = B

If B = A+1

- Q3: Count ( Sex = Female OR
((Age = 42 & Sex = Male & Employer = ABC) &
Diagnosis = Schizophrenia) )

Positively or negatively compromised! If Q3 returns A+1, the individual has the diagnosis; if Q3 returns A, he does not.
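A toy Python run of the tracker queries. The records and helper names are hypothetical; the point is that every query here has a large query set, yet the difference between counts isolates one individual.

```python
# Hypothetical table; the target is the only 42-year-old male at employer ABC
records = [
    {"sex": "F", "age": 30, "employer": "XYZ", "diagnosis": "Flu"},
    {"sex": "F", "age": 42, "employer": "ABC", "diagnosis": "Asthma"},
    {"sex": "F", "age": 55, "employer": "ABC", "diagnosis": "Schizophrenia"},
    {"sex": "M", "age": 42, "employer": "ABC", "diagnosis": "Schizophrenia"},
    {"sex": "M", "age": 42, "employer": "XYZ", "diagnosis": "Flu"},
    {"sex": "M", "age": 37, "employer": "ABC", "diagnosis": "Asthma"},
]

def count(predicate):
    """COUNT query over the table."""
    return sum(1 for r in records if predicate(r))

def is_female(r):
    return r["sex"] == "F"

def is_target(r):
    return r["age"] == 42 and r["sex"] == "M" and r["employer"] == "ABC"

a = count(is_female)                                 # Q1
b = count(lambda r: is_female(r) or is_target(r))    # Q2
q3 = count(lambda r: is_female(r)
           or (is_target(r) and r["diagnosis"] == "Schizophrenia"))  # Q3

target_is_unique = (b == a + 1)
target_has_diagnosis = target_is_unique and (q3 == a + 1)
print(a, b, q3, target_has_diagnosis)  # 3 4 4 True
```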

Query set size control

- With query set size control the database can be easily compromised within a frame of 4-5 queries
- For query set control, if the threshold value k is large, then it will restrict too many queries
- And still does not guarantee protection from compromise

Query Set Overlap Control

- Basic idea: successive queries must be checked against the number of common records.
- If the number of common records in any query exceeds a given threshold, the requested statistic is not released.
- A query q(C) is only allowed if, for every previously answered query q(D):
|q(C) ∩ q(D)| ≤ r, r > 0

where r is set by the administrator
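A minimal Python sketch of the overlap check, assuming each query is represented by the set of record identifiers it touches. The class and method names are hypothetical.

```python
class OverlapController:
    """Deny a query whose record set shares more than r records
    with any previously answered query set."""

    def __init__(self, r):
        self.r = r
        self.answered = []  # query sets already released

    def request(self, query_set):
        query_set = frozenset(query_set)
        if any(len(query_set & previous) > self.r for previous in self.answered):
            return False  # statistic is not released
        self.answered.append(query_set)
        return True

ctrl = OverlapController(r=1)
print(ctrl.request({1, 2, 3}))  # True: first query
print(ctrl.request({3, 4, 5}))  # True: overlap of 1 record
print(ctrl.request({1, 2, 6}))  # False: overlap of 2 records with the first query
```

Each new query is compared against the full history, which is where the processing overhead of this method comes from.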

Query-set-overlap control

- Ineffective for cooperation of several users
- Statistics for a set and its subset cannot be released – limiting usefulness
- Need to keep user profile
- High processing overhead – every new query compared with all previous ones
- No formal privacy guarantee

Auditing

- Keep up-to-date logs of all queries made by each user and check for possible compromise when a new query is issued
- Excessive computation and storage requirements
- “Efficient” methods for special types of queries

Audit Expert (Chin 1982)

- Query auditing method for SUM queries
- A SUM query can be considered as a linear equation $a_1 x_1 + a_2 x_2 + \dots + a_L x_L = q$,
where $a_i \in \{0, 1\}$ indicates whether record i belongs to the query set, $x_i$ is the sensitive value, and q is the query result

- A set of SUM queries can be thought of as a system of linear equations
- Maintains the binary matrix representing linearly independent queries and updates it when a new query is issued
- A row with all 0s except for ith column indicates disclosure
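The disclosure test can be illustrated in Python. This sketch is not the Audit Expert data structure itself: it swaps in plain Gaussian elimination over the rationals on the 0/1 query vectors, then looks for a reduced row with a single nonzero column. All names are hypothetical.

```python
from fractions import Fraction

def rref(rows):
    """Reduced row echelon form over the rationals; zero rows are dropped."""
    rows = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(rows[0]) if rows else 0
    pivot_row = 0
    for col in range(n_cols):
        pivot = next((i for i in range(pivot_row, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        lead = rows[pivot_row][col]
        rows[pivot_row] = [v / lead for v in rows[pivot_row]]
        for i in range(len(rows)):
            if i != pivot_row and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[pivot_row])]
        pivot_row += 1
        if pivot_row == len(rows):
            break
    return [row for row in rows if any(v != 0 for v in row)]

def disclosed_records(query_vectors):
    """Indices i for which some reduced row is zero everywhere except column i."""
    exposed = set()
    for row in rref(query_vectors):
        nonzero = [i for i, v in enumerate(row) if v != 0]
        if len(nonzero) == 1:
            exposed.add(nonzero[0])
    return exposed

class SumAuditor:
    """Answer a SUM query only if no individual value becomes derivable."""

    def __init__(self):
        self.answered = []

    def request(self, query_vector):
        if disclosed_records(self.answered + [list(query_vector)]):
            return False
        self.answered.append(list(query_vector))
        return True

auditor = SumAuditor()
print(auditor.request([1, 1, 1, 0]))  # True
print(auditor.request([1, 1, 0, 0]))  # False: the difference isolates the third record
```

Dropping zero rows in `rref` mirrors the idea of keeping only linearly independent queries: a query that is a combination of earlier ones adds no new row.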

Audit Expert

- Only stores linearly independent queries
- Not all queries are linearly independent; for example, Q1 = Q2 + Q3:
Q1: Sum(Sex=M)

Q2: Sum(Sex=M AND Age>20)

Q3: Sum(Sex=M AND Age<=20)

Audit Expert

- $O(L^2)$ time complexity
- Further work reduced to O(L) time and space when number of queries < L
- Only for SUM queries
- No restrictions on query set size
- Maximizing non-confidential information is NP-complete

Auditing – recent developments

- Online auditing
- “Detect and deny” queries that violate privacy requirement
- Denials themselves may implicitly disclose sensitive information

- Offline auditing
- Check if a privacy requirement has been violated after the queries have been executed
- Detects violations after the fact; does not prevent them

Security Methods

Query restriction

Data perturbation/anonymization

Output perturbation and differential privacy

Sampling

Output perturbation

Sources

- Partial slides: http://www.cs.jmu.edu/users/aboutams
- Adam, Nabil R.; Wortmann, John C. Security-Control Methods for Statistical Databases: A Comparative Study. ACM Computing Surveys, Vol. 21, No. 4, December 1989
- Fung et al. Privacy Preserving Data Publishing: A Survey of Recent Development. ACM Computing Surveys, in press, 2009
