## Privacy-Preserving Data Publishing

Donghui Zhang

Northeastern University

Acknowledgement: some slides come from Yufei Tao and Dimitris Sacharidis.

motivation

- several agencies, institutions, bureaus, organizations make (sensitive) data involving people publicly available
- termed microdata (vs. aggregated macrodata) used for analysis
- often required and imposed by law
- to protect privacy, microdata are sanitized
- explicit identifiers (SSN, name, phone #) are removed
- is this sufficient for preserving privacy?
- no! susceptible to link attacks
- publicly available databases (voter lists, city directories) can reveal the “hidden” identity

link attack example

- [Sweeney01] managed to re-identify the medical record of the governor of Massachusetts
- MA collects and publishes sanitized medical data for state employees (microdata)
- the voter registration list of MA is publicly available data
- looking for the governor’s record, join the tables:
- 6 people had his birth date
- 3 were men
- 1 in his zipcode

- regarding the US 1990 census data
- 87% of the population are unique based on (zipcode, gender, dob)
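
The link attack described above is essentially a join on the quasi-identifier (zipcode, gender, dob). A minimal sketch, assuming toy tables: every record, name, and the `link_attack` helper below is hypothetical and only illustrates the mechanism.

```python
# Hypothetical toy data; none of these records come from the slides.
medical = [  # sanitized microdata: explicit identifiers already removed
    {"zipcode": "02138", "gender": "M", "dob": "1950-05-05", "diagnosis": "flu"},
    {"zipcode": "02139", "gender": "F", "dob": "1962-01-15", "diagnosis": "bronchitis"},
    {"zipcode": "02138", "gender": "M", "dob": "1971-03-02", "diagnosis": "gastritis"},
]
voters = [  # public list: names plus the same quasi-identifier attributes
    {"name": "Alice", "zipcode": "02139", "gender": "F", "dob": "1962-01-15"},
    {"name": "Bob", "zipcode": "02138", "gender": "M", "dob": "1950-05-05"},
]
QI = ("zipcode", "gender", "dob")


def link_attack(microdata, public, qi=QI):
    """Join the two tables on the quasi-identifier and keep unique matches."""
    index = {}
    for row in microdata:
        index.setdefault(tuple(row[a] for a in qi), []).append(row)
    reidentified = {}
    for person in public:
        matches = index.get(tuple(person[a] for a in qi), [])
        if len(matches) == 1:  # a unique match pins down the record
            reidentified[person["name"]] = matches[0]["diagnosis"]
    return reidentified


print(link_attack(medical, voters))
```

Removing names from `medical` does not help here, because each quasi-identifier combination is unique in the toy microdata.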

k-anonymity

How many people with age in [30, 50] contracted flu?

generalization with low utility:

answer less accurately: [0..3]

generalization with high utility:

answer queries more accurately: 2.

k-anonymity with utility

- Among all generalizations that enforce k-anonymity, we should maximize utility by minimizing the “rectangle” sizes!
- Several measures exist, e.g. minimizing the maximum perimeter of the rectangles.
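
To make the perimeter measure concrete, here is a small sketch that scores a partitioning of 2-D points by the largest perimeter among the groups' bounding rectangles. The function names and the sample points are assumptions for illustration, not part of the original slides.

```python
def mbr(points):
    """Minimum bounding rectangle (x_min, y_min, x_max, y_max) of 2-D points."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


def perimeter(points):
    x1, y1, x2, y2 = mbr(points)
    return 2 * ((x2 - x1) + (y2 - y1))


def max_perimeter(groups, k):
    """Largest rectangle perimeter of a partitioning; rejects groups smaller than k."""
    if any(len(g) < k for g in groups):
        raise ValueError("partitioning is not k-anonymous")
    return max(perimeter(g) for g in groups)


# Two 2-anonymous partitionings of the same four (hypothetical) points.
low_utility = [[(1, 1), (8, 8)], [(2, 2), (9, 9)]]
high_utility = [[(1, 1), (2, 2)], [(8, 8), (9, 9)]]
print(max_perimeter(low_utility, k=2))   # 28
print(max_perimeter(high_utility, k=2))  # 4
```

Both partitionings satisfy 2-anonymity, but the second one uses much tighter rectangles and therefore answers range queries more accurately.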

Our contributions [DXT+07]

- Proved that finding the optimal partitioning is NP-hard.
- Proved that finding a partitioning with approximation ratio less than 1.25 is also NP-hard.
- Provided three algorithms with tradeoffs between complexity and approximation ratio.

Divide-And-Group (DAG)

- Divide the space into square cells with proper size
- Find a set of non-overlapping tiles of 2 x 2 cells to cover the points, such that each tile covers at least k points
- Assign the rest of (uncovered) points to the nearest tile

Min-MBR-Group (MMG)

- For each point p, find the smallest MBR which covers at least k points including p
- Find a set of non-overlapping MBRs from the result of the previous step
- Assign the points to the nearest MBR

Nearest-Neighbor-Group (NNG)

- For each point p, find the MBR which covers p and its k-1 nearest neighbors
- Find a set of non-overlapping MBRs from the result of the previous step
- Assign the points to the nearest MBR
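
The first two NNG steps can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how the non-overlapping MBRs are chosen, so the greedy "smallest rectangle first" rule in `pick_disjoint` is an assumption, as are all function names and the sample points.

```python
import math


def knn_mbr(p, points, k):
    """MBR of p and its k-1 nearest neighbours (Euclidean distance)."""
    nearest = sorted(points, key=lambda q: math.dist(p, q))[:k]
    xs, ys = zip(*nearest)
    return (min(xs), min(ys), max(xs), max(ys))


def overlaps(a, b):
    """True if two rectangles (x_min, y_min, x_max, y_max) intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def pick_disjoint(mbrs):
    """Greedy choice (an assumption): keep the smallest rectangles that do not overlap."""
    chosen = []
    for r in sorted(set(mbrs), key=lambda r: (r[2] - r[0]) + (r[3] - r[1])):
        if not any(overlaps(r, c) for c in chosen):
            chosen.append(r)
    return chosen


points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
candidates = [knn_mbr(p, points, k=3) for p in points]
print(sorted(pick_disjoint(candidates)))
```

Points left outside every chosen rectangle would then be assigned to the nearest one, as in the last step of the list above.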

Drawback of k-anonymity

- In a QI group, if many records have the same sensitive attribute value...

Quasi-identifier (QI) attributes

Sensitive attribute

If Bob is in this group, he must have pneumonia.

l-diversity [ICDE06]

- A QI-group with m tuples is l-diverse, iff each sensitive value appears no more than m/l times in the QI-group.
- A table is l-diverse, iff all of its QI-groups are l-diverse.
- The above table is 2-diverse.
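
The definition translates directly into a check over the sensitive values of each QI-group. A minimal sketch, assuming each group is given as a list of sensitive values; the function names and sample groups are hypothetical.

```python
from collections import Counter


def is_l_diverse_group(sensitive_values, l):
    """Each value may appear at most m/l times, where m is the group size."""
    m = len(sensitive_values)
    # count * l <= m is the integer form of count <= m / l
    return all(count * l <= m for count in Counter(sensitive_values).values())


def is_l_diverse(groups, l):
    """A table is l-diverse iff all of its QI-groups are l-diverse."""
    return all(is_l_diverse_group(g, l) for g in groups)


# Hypothetical table with two QI-groups.
table = [
    ["flu", "bronchitis", "flu", "gastritis"],
    ["dyspepsia", "pneumonia"],
]
print(is_l_diverse(table, 2))  # True
print(is_l_diverse(table, 3))  # False
```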

Quasi-identifier (QI) attributes

Sensitive attribute

2 QI-groups

What l-diversity guarantees

- From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l

A 2-diverse generalized table

[ICDE06] A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006.

Problem with multi-publishing

- A hospital keeps track of the medical records collected in the last three months.
- The microdata table T(1), and its generalization T*(1), published in Apr. 2007.

2-diverse Generalization T*(1)

Microdata T(1)

Problem with multi-publishing

- One month later, in May 2007
- Some obsolete tuples are deleted from the microdata.

Microdata T(1)

Problem with multi-publishing

- The hospital published T*(2).

2-diverse Generalization T*(2)

Microdata T(2)

Problem with multi-publishing

- What the adversary learns from T*(1).
- What the adversary learns from T*(2).
- So Bob must have contracted dyspepsia!
- A new generalization principle is needed.
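
The inference in this example amounts to intersecting the candidate diseases seen in Bob's QI-group across releases, assuming his disease did not change in between. In the sketch below, the first set is Bob's signature in T*(1) as given in these slides; the second set is an assumed illustration, since the T*(2) table is not reproduced in this transcript.

```python
def candidates_after(releases):
    """Intersect the sensitive values seen in the target's QI-group in each release."""
    result = set(releases[0])
    for values in releases[1:]:
        result &= set(values)
    return result


bob_in_t1 = {"dyspepsia", "bronchitis"}  # from the slides
bob_in_t2 = {"dyspepsia", "gastritis"}   # assumed for illustration
print(candidates_after([bob_in_t1, bob_in_t2]))  # {'dyspepsia'}
```

Each release is 2-diverse on its own, yet the intersection leaves a single candidate.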

m-invariance [SIGMOD07]

- A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if
- T*(1), …, T*(n) are m-unique, and
- each individual has the same signature in every generalized table in which s/he is involved.
- Explanation
- m-unique: every QI group contains at least m tuples, all with different sensitive values
- signature: the set of sensitive values in the individual’s QI group.

m-unique

- A generalized table T*(j) is m-unique, if and only if
- each QI-group in T*(j) contains at least m tuples
- all tuples in the same QI-group have different sensitive values.
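
The two conditions can be checked per QI-group as sketched below, assuming each group is represented by the list of its tuples' sensitive values. The function name and sample tables are hypothetical.

```python
def is_m_unique(groups, m):
    """Every QI-group has at least m tuples and no repeated sensitive value."""
    return all(len(g) >= m and len(set(g)) == len(g) for g in groups)


# Hypothetical generalized tables, one list of sensitive values per QI-group.
ok = [["dyspepsia", "bronchitis"], ["dyspepsia", "flu", "gastritis"]]
repeated = [["flu", "flu", "gastritis"]]
print(is_m_unique(ok, 2))        # True
print(is_m_unique(repeated, 2))  # False: "flu" appears twice
```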

A 2-unique generalized table

Signature

- The signature of Bob in T*(1) is {dyspepsia, bronchitis}
- The signature of Jane in T*(1) is {dyspepsia, flu, gastritis}
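
A small sketch of how signatures could be computed and compared across two releases. The two signatures match the ones stated above; the assignment of values to individuals, the extra names (Ken, Linda, Helen), and the counterfeit tuple `c1` are assumptions made only for this illustration.

```python
def signature(release, person):
    """Set of sensitive values in the QI-group that contains `person`."""
    for group in release:
        if any(name == person for name, _ in group):
            return frozenset(value for _, value in group)
    return None


def same_signatures(release_a, release_b):
    """True if everyone present in both releases keeps the same signature."""
    people_a = {name for group in release_a for name, _ in group}
    people_b = {name for group in release_b for name, _ in group}
    return all(
        signature(release_a, p) == signature(release_b, p)
        for p in people_a & people_b
    )


# Each release is a list of QI-groups; each group lists (individual, sensitive value).
t1 = [
    [("Bob", "dyspepsia"), ("Alice", "bronchitis")],
    [("Jane", "dyspepsia"), ("Ken", "flu"), ("Linda", "gastritis")],
]
t2_bad = [[("Bob", "dyspepsia"), ("Helen", "gastritis")]]
t2_good = [[("Bob", "dyspepsia"), ("c1", "bronchitis")]]  # c1: counterfeit tuple

print(sorted(signature(t1, "Bob")))   # ['bronchitis', 'dyspepsia']
print(same_signatures(t1, t2_bad))    # False
print(same_signatures(t1, t2_good))   # True
```

In `t2_good`, the counterfeit restores Bob's original signature after Alice's tuple has been deleted.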

T*(1)

The m-invariance principle

- Lemma: if a sequence of generalized tables {T*(1), …, T*(n)} is m-invariant, then for any individual o involved in any of these tables, we have

risk(o) <= 1/m

The m-invariance principle

- Lemma: let {T*(1), …, T*(n-1)} be m-invariant. {T*(1), …, T*(n-1), T*(n)} is also m-invariant, if and only if {T*(n-1), T*(n)} is m-invariant.
- Only T*(n - 1) is needed for the generation of T*(n).

T*(1), T*(2), …, T*(n-2) can be discarded; only T*(n-1) and T*(n) are kept.

Solution idea

- Goal: Given T(n) and T*(n-1), create T*(n) such that {T*(n-1), T*(n)} is m-invariant.
- Idea: create counterfeits.
- Optimization goal: impose as little generalization as possible.

A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if

- T*(1), …, T*(n) are m-unique, and
- each individual has the same signature in every generalized table in which s/he is involved.

Generalization T*(1)

Generalization T*(2)

In case of corruption…

- If an adversary knows from Alice that she has bronchitis, he can conclude that Bob has dyspepsia.
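
The corruption inference can be sketched as removing the values the adversary already knows from the multiset of sensitive values in the QI-group. The group below follows the Alice and Bob example; the function name and data layout are assumptions for illustration.

```python
from collections import Counter


def remaining_values(group_members, group_values, known):
    """Sensitive values still unexplained after removing what the adversary knows.

    group_members: names the adversary can place in the QI-group
    group_values:  sensitive values published for that QI-group
    known:         {name: value} learned from corrupted individuals
    """
    left = Counter(group_values)
    for name, value in known.items():
        if name in group_members:
            left[value] -= 1
    return +left  # unary plus drops entries whose count is zero or negative


group_members = ["Alice", "Bob"]
group_values = ["bronchitis", "dyspepsia"]
print(remaining_values(group_members, group_values, {"Alice": "bronchitis"}))
```

With Alice's value removed, only dyspepsia remains for Bob, even though the group is 2-diverse.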

2-diverse Generalization

Microdata

Anti-corruption publishing [ICDE08]

- We formalized anti-corruption publishing, by modeling the degree of privacy preservation as a function of an adversary’s background knowledge.
- We proposed a solution, by integrating generalization with
- perturbation: switch selected records’ sensitive information.
- stratified sampling: sample some records from each QI group.

Summary

- Introduced the problem of privacy-preserving publishing.
- Two principles:
- k-anonymity
- l-diversity
- Two extensions:
- multi-publishing
- corruption
