
CS573 Data Privacy and Security: Anonymization methods

Li Xiong



Today

  • Permutation based anonymization methods (cont.)

  • Other privacy principles for microdata publishing

  • Statistical databases



Anonymization methods

  • Non-perturbative: don't distort the data

    • Generalization

    • Suppression

  • Perturbative: distort the data

    • Microaggregation/clustering

    • Additive noise

  • Anatomization and permutation

    • De-associate relationship between QID and sensitive attribute



Concept of the Anatomy Algorithm

  • Release 2 tables, quasi-identifier table (QIT) and sensitive table (ST)

  • Use the same QI groups (satisfying l-diversity), but replace the sensitive attribute values with a Group-ID column

  • Then publish a sensitive table (ST) with per-group Disease counts (a small sketch follows)
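
To make the two-table release concrete, below is a minimal sketch (not the paper's algorithm; the function name and toy data are illustrative) that splits a record list into QIT and ST, assuming an l-diverse grouping has already been computed:

```python
from collections import Counter

def anatomize(records, group_of, qi_attrs, sensitive_attr):
    """Split records into a quasi-identifier table (QIT) and a sensitive
    table (ST), given a precomputed l-diverse partition: group_of maps a
    record index to its Group-ID."""
    qit, st_counts = [], Counter()
    for idx, rec in enumerate(records):
        gid = group_of[idx]
        row = {a: rec[a] for a in qi_attrs}
        row["Group-ID"] = gid
        qit.append(row)                                # QI values + Group-ID
        st_counts[(gid, rec[sensitive_attr])] += 1     # per-group value counts
    st = [{"Group-ID": g, sensitive_attr: v, "Count": c}
          for (g, v), c in sorted(st_counts.items())]
    return qit, st

# Toy usage on hypothetical data with an assumed 2-diverse grouping.
records = [
    {"Age": 23, "Zip": "30301", "Disease": "flu"},
    {"Age": 27, "Zip": "30305", "Disease": "gastritis"},
    {"Age": 41, "Zip": "30309", "Disease": "flu"},
    {"Age": 45, "Zip": "30310", "Disease": "dyspepsia"},
]
group_of = {0: 1, 1: 1, 2: 2, 3: 2}
qit, st = anatomize(records, group_of, ["Age", "Zip"], "Disease")
```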



Specifications of Anatomy cont.

DEFINITION 3. (Anatomy)

With a given l-diverse partition, Anatomy creates a QIT table and an ST table.

QIT is constructed with the schema:

(Aqi1, Aqi2, ..., Aqid, Group-ID)

ST is constructed with the schema:

(Group-ID, As, Count)



Privacy properties

THEOREM 1. Given a pair of QIT and ST, the adversary's confidence in inferring the sensitive value of any individual is at most 1/l
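
A brief sketch of the reasoning, assuming the frequency-based l-diversity used by Anatomy (no sensitive value accounts for more than a 1/l fraction of any QI group): an adversary who locates the target in group j of size n_j can only use the published group counts, so

```latex
\Pr[\text{target has value } v] \;=\; \frac{c_j(v)}{n_j} \;\le\; \frac{1}{l}
```

where c_j(v) is the Count entry for value v and group j in ST.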



Comparison with generalization

  • Compare with generalization on two assumptions:

  • A1: the adversary knows the QI-values of the target individual

  • A2: the adversary also knows that the individual is definitely in the microdata

  • If both A1 and A2 hold, Anatomy is as good as generalization: the 1/l bound still holds

  • If A1 is true and A2 is false, generalization is stronger

  • If A1 and A2 are false, generalization is still stronger



Preserving Data Correlation

  • Examine the correlation between Age and Disease in T using the probability density function (pdf)

  • Example: t1



Preserving Data Correlation cont.

  • To reconstruct an approximate pdf of t1 from the generalization table:



Preserving Data Correlation cont.

  • To reconstruct an approximate pdf of t1 from the QIT and ST tables:



Preserving Data Correlation cont.

  • For a more rigorous comparison, calculate the “L2 distance” between the reconstructed and exact pdfs:
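
Presumably this is an L2-type distance between the exact pdf p_t of a tuple t and its reconstruction p̃_t; the form below is an assumption:

```latex
\mathrm{dist}(p_t, \tilde{p}_t) \;=\; \sum_{x} \bigl(\tilde{p}_t(x) - p_t(x)\bigr)^{2},
\qquad x \in \mathrm{dom}(\mathrm{Age}) \times \mathrm{dom}(\mathrm{Disease})
```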

  • The distance for anatomy is 0.5 while the distance for generalization is 22.5



Preserving Data Correlation cont.

Idea: Measure the reconstruction error of each tuple t (the distance between its exact and reconstructed pdf)

Objective: minimize the total reconstruction error (RCE) over all tuples t in T:
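
Presumably, with Err(t) denoting the per-tuple error just described (the symbol is illustrative), the objective can be written as:

```latex
\mathrm{RCE}(T) \;=\; \sum_{t \in T} \mathrm{Err}(t)
```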

Algorithm: Nearly-Optimal Anatomizing Algorithm



Experiments

  • Dataset: CENSUS, containing the personal information of 500k American adults with 9 discrete attributes

  • Created two sets of microdata tables

  • Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute As

  • Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute As



Experiments cont.



Today

  • Permutation based anonymization methods (cont.)

  • Other privacy principles for microdata publishing

  • Statistical databases

  • Differential privacy



Attacks on k-Anonymity

  • k-Anonymity does not provide privacy if

    • Sensitive values in an equivalence class lack diversity

    • The attacker has background knowledge

A 3-anonymous patient table

Homogeneity attack

Background knowledge attack



l-Diversity

[Machanavajjhala et al. ICDE ‘06]

Sensitive attributes must be “diverse” within each quasi-identifier equivalence class



Distinct l-Diversity

  • Each equivalence class has at least l distinct sensitive values

  • Doesn’t prevent probabilistic inference attacks

Example: in an equivalence class of 10 records where 8 have HIV and 2 have other values, an adversary infers HIV with 80% confidence



Other Versions of l-Diversity

  • Probabilistic l-diversity

    • The frequency of the most frequent value in an equivalence class is bounded by 1/l

  • Entropy l-diversity

    • The entropy of the distribution of sensitive values in each equivalence class is at least log(l)

  • Recursive (c,l)-diversity

    • r_1 < c (r_l + r_{l+1} + … + r_m), where r_i is the frequency of the i-th most frequent sensitive value

    • Intuition: the most frequent value does not appear too frequently (a checker sketch for these variants follows)
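
A minimal checker sketch for the variants above, applied to the multiset of sensitive values in a single equivalence class (the function names and toy class are illustrative, not from the slides):

```python
import math
from collections import Counter

def distinct_l(values, l):
    """Distinct l-diversity: at least l distinct sensitive values."""
    return len(set(values)) >= l

def probabilistic_l(values, l):
    """Probabilistic l-diversity: most frequent value has frequency <= 1/l."""
    return max(Counter(values).values()) / len(values) <= 1.0 / l

def entropy_l(values, l):
    """Entropy l-diversity: entropy of the value distribution >= log(l)."""
    n = len(values)
    probs = [c / n for c in Counter(values).values()]
    return -sum(p * math.log(p) for p in probs) >= math.log(l)

def recursive_cl(values, c, l):
    """Recursive (c,l)-diversity: r_1 < c * (r_l + r_{l+1} + ... + r_m),
    where r_i is the count of the i-th most frequent value."""
    r = sorted(Counter(values).values(), reverse=True)
    return r[0] < c * sum(r[l - 1:])

# Example: {HIV, HIV, flu, flu, flu} is distinct 2-diverse
# but not probabilistic 3-diverse.
cls = ["HIV", "HIV", "flu", "flu", "flu"]
assert distinct_l(cls, 2) and not probabilistic_l(cls, 3)
```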



Neither Necessary, Nor Sufficient

Original dataset: 99% of the records have cancer



Neither Necessary, Nor Sufficient

Original dataset: 99% of the records have cancer

Anonymization A: 50% cancer per quasi-identifier group → the groups are “diverse”



Neither Necessary, Nor Sufficient

Original dataset: 99% of the records have cancer

Anonymization A: 50% cancer per quasi-identifier group → “diverse”, but this leaks a ton of information

Anonymization B: 99% cancer per quasi-identifier group → not “diverse”, yet it reveals nothing beyond the overall distribution



Limitations of l-Diversity

  • Example: sensitive attribute is HIV+ (1%) or HIV- (99%)

    • Very different degrees of sensitivity!

  • l-diversity is unnecessary

    • 2-diversity is unnecessary for an equivalence class that contains only HIV- records

  • l-diversity is difficult to achieve

    • Suppose there are 10000 records in total

    • To have distinct 2-diversity, there can be at most 10000*1%=100 equivalence classes



Skewness Attack

  • Example: sensitive attribute is HIV+ (1%) or HIV- (99%)

  • Consider an equivalence class that contains an equal number of HIV+ and HIV- records

    • Diverse, but potentially violates privacy!

  • l-diversity does not differentiate:

    • Equivalence class 1: 49 HIV+ and 1 HIV-

    • Equivalence class 2: 1 HIV+ and 49 HIV-

l-diversity does not consider overall distribution of sensitive values!



Sensitive Attribute Disclosure

A 3-diverse patient table

Similarity attack

Conclusion

Bob’s salary is in [20k,40k], which is relatively low

Bob has some stomach-related disease

l-diversity does not consider semantics of sensitive values!



t-Closeness: A New Privacy Measure

  • Rationale

  • Observations

    • Q is public or can be derived

    • Potential knowledge gain from Q and Pi about specific individuals

  • Principle

    • The distance between Q and Pi should be bounded by a threshold t.

External knowledge

Overall distribution Q of sensitive values

Distribution Pi of sensitive values in each equi-class



t-Closeness

[Li et al. ICDE ‘07]

Distribution of sensitive attributes within each quasi-identifier group should be “close” to their distribution in the entire original database



Distance Measures

  • P=(p1,p2,…,pm), Q=(q1,q2,…,qm)

  • Trace-distance

  • KL-divergence

  • None of these measures reflects the semantic distance among values (their standard forms are given below).

    • Q: {3K, 4K, 5K, 6K, 7K, 8K, 9K, 10K, 11K}

      P1: {3K, 4K, 5K}

      P2: {6K, 8K, 11K}

    • Intuitively, D[P1,Q]>D[P2,Q]
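
For reference, assuming the standard definitions, the two measures listed above are:

```latex
D_{\mathrm{trace}}[\mathbf{P},\mathbf{Q}] = \tfrac{1}{2}\sum_{i=1}^{m} |p_i - q_i|
\qquad
D_{\mathrm{KL}}[\mathbf{P},\mathbf{Q}] = \sum_{i=1}^{m} p_i \log\frac{p_i}{q_i}
```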



Earth Mover’s Distance

  • If the distributions are interpreted as two different ways of piling up a certain amount of dirt over region D, EMD is the minimum cost of turning one pile into the other

    • the cost is the amount of dirt moved multiplied by the distance by which it is moved

    • Assume two piles have the same amount of dirt

  • Extensions for comparison of distributions with different total masses.

    • allow for a partial match: discard leftover “dirt” without cost

    • allow for mass to be created or destroyed, but with a cost penalty



Earth Mover’s Distance

  • Formulation

    • P=(p1,p2,…,pm), Q=(q1,q2,…,qm)

    • dij: the ground distance between element i of P and element j of Q.

    • Find a flow F=[fij] where fij is the flow of mass from element i of P to element j of Q that minimizes the overall work:

      subject to the constraints:
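
Written out (a standard transportation-problem formulation consistent with the notation above; treat the exact constraint form as an assumption):

```latex
\min_{F}\ \mathrm{WORK}(\mathbf{P},\mathbf{Q},F) = \sum_{i=1}^{m}\sum_{j=1}^{m} d_{ij}\, f_{ij}
\quad\text{s.t.}\quad
f_{ij} \ge 0, \qquad
p_i - \sum_{j=1}^{m} f_{ij} + \sum_{j=1}^{m} f_{ji} = q_i, \qquad
\sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij} = \sum_{i=1}^{m} p_i = \sum_{j=1}^{m} q_j = 1 .
```

EMD[P,Q] is then the minimal WORK; since the total flow is 1, no further normalisation is needed.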



How to Calculate EMD (Cont’d)

  • EMD for categorical attributes

    • Hierarchical distance

    • Hierarchical distance is a metric (one common definition is recalled below)
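
One common definition of the hierarchical distance (an assumption here, consistent with using a generalization hierarchy of height H):

```latex
d(v_1, v_2) \;=\; \frac{h(v_1, v_2)}{H}
```

where h(v1, v2) is the height of the lowest common ancestor of v1 and v2 in the hierarchy.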



Earth Mover’s Distance

  • Example

    • {3k,4k,5k} and {3k,4k,5k,6k,7k,8k,9k,10k,11k}

    • Move 1/9 probability for each of the following pairs

      • 3k->6k,3k->7k cost: 1/9*(3+4)/8

      • 4k->8k,4k->9k cost: 1/9*(4+5)/8

      • 5k->10k,5k->11k cost: 1/9*(5+6)/8

    • Total cost: 1/9*27/8=0.375

    • With P2 = {6k, 8k, 11k}, the total cost is 1/9 * 12/8 = 0.167 < 0.375, which matches the intuition better than the other two distance measures (a short computation sketch follows)
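
A short sketch that reproduces these numbers for an ordered numeric attribute, using the fact that with ground distance |i - j| / (m - 1) the EMD equals the normalised sum of absolute cumulative differences (`ordered_emd` is an illustrative name):

```python
def ordered_emd(p, q):
    """EMD between two distributions over the same ordered domain of m values,
    with ground distance |i - j| / (m - 1): sum of absolute cumulative
    differences, normalised by (m - 1)."""
    m = len(p)
    cum, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi
        total += abs(cum)
    return total / (m - 1)

# Domain: {3k, 4k, ..., 11k}; Q is uniform over all 9 values.
q  = [1 / 9] * 9
p1 = [1 / 3] * 3 + [0] * 6               # {3k, 4k, 5k}
p2 = [0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3]   # {6k, 8k, 11k}
print(round(ordered_emd(p1, q), 3))      # 0.375
print(round(ordered_emd(p2, q), 3))      # 0.167
```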



Experiments

  • Goal

    • To show l-diversity does not provide sufficient privacy protection (the similarity attack).

    • To show the efficiency and data quality of using t-closeness are comparable with other privacy measures.

  • Setup

    • Adult dataset from UC Irvine ML repository

    • 30162 tuples, 9 attributes (2 sensitive attributes)

    • Algorithm: Incognito



Experiments

  • Comparisons of privacy measurements

    • k-Anonymity

    • Entropy l-diversity

    • Recursive (c,l)-diversity

    • k-Anonymity with t-closeness



Experiments

  • Efficiency

    • The efficiency of using t-closeness is comparable with other privacy measurements



Experiments

  • Data utility

    • Discernibility metric; Minimum average group size

    • The data quality of using t-closeness is comparable with other privacy measurements



Anonymous, “t-Close” Dataset

This is k-anonymous, l-diverse and t-close… so secure, right?



What Does Attacker Know?

Bob is Caucasian and I heard he was admitted to hospital with flu…



What Does Attacker Know?

Bob is Caucasian and I heard he was admitted to hospital.

And I know three other Caucasians admitted to hospital with Acne or Shingles …



k-Anonymity and Partition-based notions

  • Syntactic

    • Focuses on data transformation, not on what can be learned from the anonymized dataset

    • “k-anonymous” dataset can leak sensitive information

  • “Quasi-identifier” fallacy

    • Assumes a priori that the attacker will not know certain information about his target



Today

  • Permutation based anonymization methods (cont.)

  • Other privacy principles for microdata publishing

  • Statistical databases

    • Definitions and early methods

    • Output perturbation and differential privacy



Statistical Data Release

  • Originated from the study of statistical databases

  • A statistical database is a database that provides statistics on subsets of records

  • OLAP vs. OLTP

  • Statistics may be computed over records: SUM, MEAN, MEDIAN, COUNT, MAX, and MIN


Types of Statistical Databases

  • Static – a static database is made once and never changes
    Example: U.S. Census

  • Dynamic – changes continuously to reflect real-time data
    Example: most online research databases


Types of Statistical Databases (cont.)

  • Centralized – one database

  • Decentralized – multiple decentralized databases

  • General purpose – like census

  • Special purpose – like bank, hospital, academia, etc.



Data Compromise

  • Exact compromise – a user is able to determine the exact value of a sensitive attribute of an individual

  • Partial compromise – a user is able to obtain an estimator for a sensitive attribute with a bounded variance

  • Positive compromise – determine an attribute has a particular value

  • Negative compromise – determine an attribute does not have a particular value

  • Relative compromise – determine the ranking of some confidential values



Statistical Quality of Information

  • Bias – difference between the unperturbed statistic and the expected value of its perturbed estimate

  • Precision – variance of the estimators obtained by users

  • Consistency – lack of contradictions and paradoxes

    • Contradictions: different responses to same query; average differs from sum/count

    • Paradox: negative count



Methods

  • Query restriction

  • Data perturbation/anonymization

  • Output perturbation



Data Perturbation



Output Perturbation

(Diagram: a query is evaluated on the database, and the results are perturbed before being returned to the user)



Statistical data release vs. data anonymization

  • Data anonymization is one technique that can be used to build a statistical database

  • Other techniques such as query restriction and output perturbation can be used to build a statistical database or release statistical data

  • Different privacy principles can be used



Security Methods

  • Query restriction (early methods)

    • Query size control

    • Query set overlap control

    • Query auditing

  • Data perturbation/anonymization

  • Output perturbation



Query Set Size Control

  • Query-set-size control restricts the number of records that may be in a query's result set

  • Allows the query results to be displayed only if the size of the query set |C| satisfies the condition

    K <= |C| <= L – K

    where L is the size of the database and K is a parameter that satisfies 0 <= K <= L/2
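
A minimal sketch of the rule above (names are illustrative):

```python
def allow_query(query_set_size, db_size, k):
    """Query-set-size control: answer only if k <= |C| <= L - k,
    where L is the database size and 0 <= k <= L/2."""
    assert 0 <= k <= db_size / 2
    return k <= query_set_size <= db_size - k

# Example: with L = 1000 and k = 10, a query matching 4 records is refused.
print(allow_query(4, 1000, 10))    # False
print(allow_query(250, 1000, 10))  # True
```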



Query Set Size Control



Tracker

  • Q1: Count ( Sex = Female ) = A

  • Q2: Count ( Sex = Female OR

    (Age = 42 & Sex = Male & Employer = ABC) ) = B

    What if B = A+1?



Tracker

  • Q1: Count ( Sex = Female ) = A

  • Q2: Count ( Sex = Female OR

    (Age = 42 & Sex = Male & Employer = ABC) ) = B

    If B = A+1

  • Q3: Count ( Sex = Female OR

    (Age = 42 & Sex = Male & Employer = ABC) &

    Diagnosis = Schizophrenia)

    Positively or negatively compromised! If Q3 returns A+1, the target has Schizophrenia (positive compromise); if it returns A, he does not (negative compromise). A toy demonstration follows.
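
A toy walk-through of this inference on hypothetical data (only COUNT answers are used; the table and predicates are illustrative):

```python
db = [
    {"Sex": "F", "Age": 30, "Employer": "XYZ", "Diagnosis": "Flu"},
    {"Sex": "F", "Age": 55, "Employer": "ABC", "Diagnosis": "Asthma"},
    {"Sex": "M", "Age": 42, "Employer": "ABC", "Diagnosis": "Schizophrenia"},
    {"Sex": "M", "Age": 38, "Employer": "XYZ", "Diagnosis": "Flu"},
]

def count(pred):
    """The statistical interface: only COUNTs are released."""
    return sum(1 for r in db if pred(r))

def target(r):
    """Predicate that, unknown to the database, matches exactly one person."""
    return r["Age"] == 42 and r["Sex"] == "M" and r["Employer"] == "ABC"

A = count(lambda r: r["Sex"] == "F")                                # Q1
B = count(lambda r: r["Sex"] == "F" or target(r))                   # Q2
Q3 = count(lambda r: r["Sex"] == "F"
           or (target(r) and r["Diagnosis"] == "Schizophrenia"))    # Q3

if B == A + 1:  # the extra predicate isolates exactly one individual
    print("Target has Schizophrenia" if Q3 == A + 1
          else "Target does not have Schizophrenia")
```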



Query set size control

  • With query-set-size control alone, the database can easily be compromised within 4-5 queries

  • If the threshold k is large, too many queries are restricted

  • Even then, it does not guarantee protection from compromise



Query Set Overlap Control

  • Basic idea: successive queries must be checked against the number of common records.

  • If the number of common records in any query exceeds a given threshold, the requested statistic is not released.

  • A query q(C) is only allowed if:

    |q(C) ∩ q(D)| ≤ r,  r > 0

    where r is set by the administrator and q(D) is the query set of any previously answered query (a minimal check is sketched below)
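
A minimal sketch of the overlap check (illustrative; assumes the auditor stores the record-ID sets of previously answered queries):

```python
def allow_query(new_query_set, answered_query_sets, r):
    """Query-set-overlap control: refuse the new query if it shares more
    than r records with any previously answered query set."""
    return all(len(new_query_set & prev) <= r for prev in answered_query_sets)

history = [{1, 2, 3, 4}]
print(allow_query({3, 4, 5, 6}, history, r=2))  # True  (overlap = 2)
print(allow_query({1, 2, 3, 5}, history, r=2))  # False (overlap = 3)
```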



Query-set-overlap control

  • Ineffective for cooperation of several users

  • Statistics for a set and its subset cannot be released – limiting usefulness

  • Need to keep user profile

  • High processing overhead – every new query compared with all previous ones

  • No formal privacy guarantee



Auditing

  • Keeping up-to-date logs of all queries made by each user and check for possible compromise when a new query is issued

  • Excessive computation and storage requirements

  • “Efficient” methods for special types of queries



Audit Expert (Chin 1982)

  • Query auditing method for SUM queries

  • A SUM query can be considered as a linear equation

    a1·x1 + a2·x2 + … + aL·xL = q

    where ai ∈ {0, 1} indicates whether record i belongs to the query set, xi is the sensitive value of record i, and q is the query result

  • A set of SUM queries can be thought of as a system of linear equations

  • Maintains a binary matrix representing the linearly independent queries answered so far and updates it when a new query is issued

  • A row with all 0s except for the i-th column indicates that the value of record i is disclosed (a simplified rank-based check is sketched below)
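
Audit Expert maintains the reduced matrix incrementally; as a simplified stand-in (illustrative, not Chin's algorithm), the disclosure condition can be expressed as a rank test: the SUM of record i is revealed exactly when the unit vector e_i lies in the row space of the 0/1 query matrix.

```python
import numpy as np

def discloses(query_matrix, record_index):
    """Return True if the answered SUM queries (rows of a 0/1 matrix)
    determine record_index exactly, i.e. the unit vector e_i lies in the
    row space (adding it does not increase the rank)."""
    A = np.asarray(query_matrix, dtype=float)
    e = np.zeros(A.shape[1])
    e[record_index] = 1.0
    return np.linalg.matrix_rank(A) == np.linalg.matrix_rank(np.vstack([A, e]))

# Q1 sums records {0,1,2}; Q2 sums records {1,2}: together they expose record 0.
queries = [[1, 1, 1, 0],
           [0, 1, 1, 0]]
print(discloses(queries, 0))   # True
print(discloses(queries, 3))   # False
```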



Audit Expert

  • Only stores linearly independent queries

  • Not all queries are linearly independent

    Q1: Sum(Sex=M)

    Q2: Sum(Sex=M AND Age>20)

    Q3: Sum(Sex=M AND Age<=20)

    (here Q1 = Q2 + Q3, so only two of the three are linearly independent)



Audit Expert

  • O(L²) time complexity

  • Later work reduced this to O(L) time and space when the number of queries is less than L

  • Only for SUM queries

  • No restrictions on query set size

  • Maximizing non-confidential information is NP-complete



Auditing – recent developments

  • Online auditing

    • “Detect and deny” queries that violate the privacy requirement

    • Denials themselves may implicitly disclose sensitive information

  • Offline auditing

    • Check whether a privacy requirement has been violated after the queries have been answered

    • Detects violations rather than preventing them



Security Methods

  • Query restriction

  • Data perturbation/anonymization

  • Output perturbation and differential privacy

    • Sampling

    • Output perturbation



Sources

  • Partial slides:

    http://www.cs.jmu.edu/users/aboutams

  • Adam, Nabil R.; Wortmann, John C. Security-Control Methods for Statistical Databases: A Comparative Study. ACM Computing Surveys, Vol. 21, No. 4, December 1989.

  • Fung et al. Privacy-Preserving Data Publishing: A Survey of Recent Developments. ACM Computing Surveys, in press, 2009.

