Privacy-Preserving Data Publishing. Donghui Zhang Northeastern University. Acknowledgement: some slides come from Yufei Tao and Dimitris Sacharidis. motivation. several agencies, institutions, bureaus, organizations make (sensitive) data involving people publicly available


Privacy-Preserving Data Publishing

Donghui Zhang

Northeastern University

Acknowledgement: some slides come from Yufei Tao and Dimitris Sacharidis.

Motivation
  • several agencies, institutions, bureaus, and organizations make (sensitive) data involving people publicly available
    • termed microdata (vs. aggregated macrodata), used for analysis
    • publication is often required and imposed by law
  • to protect privacy, microdata are sanitized
    • explicit identifiers (SSN, name, phone #) are removed
  • is this sufficient for preserving privacy?
  • no! sanitized microdata are susceptible to link attacks
    • publicly available databases (voter lists, city directories) can reveal the “hidden” identity
Link attack example
  • [Sweeney01] managed to re-identify the medical record of the governor of Massachusetts
    • MA collects and publishes sanitized medical data for state employees (microdata; left circle in the figure)
    • the voter registration list of MA is publicly available data (right circle in the figure)
  • looking for the governor’s record, join the two tables:
    • 6 people had his birth date
    • 3 of them were men
    • 1 lived in his zipcode
  • according to the 1990 US census data
    • 87% of the population are unique based on (zipcode, gender, dob)
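
A minimal sketch of such a link attack, assuming two small in-memory tables. Every record, name, and value below is hypothetical and chosen only to illustrate the join on (zipcode, gender, dob); none of it comes from the slides or from real data.

```python
# Hypothetical sanitized microdata: explicit identifiers already removed.
medical = [
    {"zipcode": "02138", "gender": "M", "dob": "1950-01-01", "diagnosis": "flu"},
    {"zipcode": "02139", "gender": "F", "dob": "1962-01-15", "diagnosis": "gastritis"},
    {"zipcode": "02138", "gender": "M", "dob": "1970-03-02", "diagnosis": "bronchitis"},
]
# Hypothetical public voter list: names plus the same three attributes.
voters = [
    {"name": "Voter A", "zipcode": "02138", "gender": "M", "dob": "1950-01-01"},
    {"name": "Voter B", "zipcode": "02139", "gender": "F", "dob": "1962-01-15"},
    {"name": "Voter C", "zipcode": "02140", "gender": "F", "dob": "1980-11-09"},
]
QI = ("zipcode", "gender", "dob")


def link(medical, voters, qi=QI):
    """Join the two tables on the QI attributes and keep unique matches only."""
    by_key = {}
    for rec in medical:
        by_key.setdefault(tuple(rec[a] for a in qi), []).append(rec)
    reidentified = {}
    for voter in voters:
        matches = by_key.get(tuple(voter[a] for a in qi), [])
        if len(matches) == 1:  # exactly one medical record fits this voter
            reidentified[voter["name"]] = matches[0]["diagnosis"]
    return reidentified


linked = link(medical, voters)
print(linked)
```

In this toy data two of the three voters match exactly one medical record, so removing names alone did not hide their diagnoses.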
Inference Attack

[Figure: a published table next to the information held by an adversary; the quasi-identifier (QI) attributes are marked.]

k-anonymity [Samarati and Sweeney02]
  • Transform the QI values into less specific forms (generalize)
Generalization
  • Transform each QI value into a less specific form

[Figure: a generalized table next to the information held by an adversary.]
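
As a rough illustration of generalization, the sketch below coarsens ages into fixed-width intervals and zipcodes into prefixes, then checks the standard k-anonymity condition (every combination of QI values occurs in at least k records). The generalization scheme, the function names, and the sample records are assumptions made for this example, not taken from the slides.

```python
from collections import Counter


def generalize(record, age_width=10, zip_digits=2):
    """Coarsen one record: age -> interval, zipcode -> prefix (assumed scheme)."""
    lo = (record["age"] // age_width) * age_width
    zipcode = record["zipcode"]
    return {
        "age": (lo, lo + age_width - 1),
        "zipcode": zipcode[:zip_digits] + "*" * (len(zipcode) - zip_digits),
        "disease": record["disease"],  # the sensitive value is left unchanged
    }


def is_k_anonymous(table, k, qi=("age", "zipcode")):
    """True if every combination of QI values appears in at least k records."""
    groups = Counter(tuple(r[a] for a in qi) for r in table)
    return all(size >= k for size in groups.values())


# Hypothetical microdata.
microdata = [
    {"age": 21, "zipcode": "12000", "disease": "flu"},
    {"age": 23, "zipcode": "12400", "disease": "bronchitis"},
    {"age": 36, "zipcode": "25000", "disease": "gastritis"},
    {"age": 37, "zipcode": "25900", "disease": "flu"},
]
published = [generalize(r) for r in microdata]
print(is_k_anonymous(microdata, 2), is_k_anonymous(published, 2))
```

Here the raw table is not 2-anonymous, while the generalized one is: each record now shares its QI values with one other record.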

Graphically…

[Figure: the records plotted as points in a two-dimensional plane, with Alice and Bob marked; axis tick values omitted.]

Why not…

[Figure: the same points as on the previous slide; axis tick values omitted.]

How many people with age in [30, 50] contracted flu?

k-anonymity

How many people with age in [30, 50] contracted flu?
  • generalization with low utility: the query is answered less accurately, [0..3]
  • generalization with high utility: the query is answered more accurately, 2
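
One way to see where such answers come from is to bound the count from the published groups alone. The sketch below assumes each QI group is published as an age interval plus its number of flu cases; the two groupings are hypothetical and were chosen so that they reproduce the answers [0..3] and 2 quoted on the slide.

```python
def count_bounds(groups, lo, hi):
    """Bound the number of flu cases with age in [lo, hi].

    Each group is (age_min, age_max, flu_count).
    """
    lower = upper = 0
    for a_min, a_max, flu in groups:
        if lo <= a_min and a_max <= hi:
            # Group lies entirely inside the query range: all cases count.
            lower += flu
            upper += flu
        elif a_max >= lo and a_min <= hi:
            # Partial overlap: anywhere from 0 to all of the cases may qualify.
            upper += flu
    return lower, upper


coarse = [(20, 40, 1), (41, 60, 2)]            # wide intervals, low utility
fine = [(20, 29, 1), (30, 50, 2), (51, 60, 0)]  # tight intervals, high utility

print(count_bounds(coarse, 30, 50))
print(count_bounds(fine, 30, 50))
```

With the coarse grouping no interval fits inside [30, 50], so only a range can be reported; with the fine grouping one interval coincides with the query and the answer is exact.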

k-anonymity with utility
  • Among all generalizations that enforce k-anonymity, we should maximize utility by minimizing the “rectangle” sizes!
  • Several measures exist, e.g., minimizing the maximum perimeter of the rectangles.
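
The maximum-perimeter measure can be computed as follows, assuming each QI group is given as a list of two-dimensional points and its "rectangle" is the minimum bounding rectangle of those points. The helper names and the sample points are illustrative assumptions.

```python
def mbr(points):
    """Minimum bounding rectangle of 2-D points as (x_min, y_min, x_max, y_max)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def perimeter(points):
    x1, y1, x2, y2 = mbr(points)
    return 2 * ((x2 - x1) + (y2 - y1))


def max_perimeter(partition):
    """Utility measure: the largest rectangle perimeter over all groups."""
    return max(perimeter(group) for group in partition)


# Two hypothetical 2-anonymous partitionings of the same four points.
tight = [[(21, 12), (23, 14)], [(36, 25), (37, 26)]]
loose = [[(21, 12), (36, 25)], [(23, 14), (37, 26)]]

print(max_perimeter(tight), max_perimeter(loose))
```

Both partitionings satisfy k = 2, but the first one yields much smaller rectangles and therefore higher utility under this measure. In practice the two axes would likely be normalized before perimeters are compared.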
Mondrian [LDR06]

Recursive half-plane partitioning, alternating dimensions.

let k=2
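
A simplified sketch of this kind of recursive partitioning, assuming two-dimensional points and a median split by position in sorted order. It follows the slide's description (alternating dimensions) rather than the full published Mondrian algorithm, so treat it as an approximation; the sample points are hypothetical.

```python
def mondrian(points, k, dim=0):
    """Recursively halve the point set, alternating between the two dimensions.

    Splitting stops when a cut would leave a side with fewer than k points.
    Ties at the median are split by position, a simplification.
    """
    if len(points) < 2 * k:
        return [points]  # no allowable cut: this set becomes one QI group
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    return (mondrian(ordered[:mid], k, 1 - dim)
            + mondrian(ordered[mid:], k, 1 - dim))


points = [(21, 12), (22, 18), (23, 14), (24, 20),
          (36, 25), (37, 26), (40, 27), (41, 33)]
groups = mondrian(points, k=2)
for g in groups:
    print(g)
```

With k = 2 the eight points end up in four groups of two: the first cut is on the first coordinate, and each half is then cut on the second coordinate.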

Mondrian [LDR06]

Unbounded approximation ratio!

let k=4

Our contributions [DXT+07]
  • Proved that finding the optimal partitioning is NP-hard.
  • Proved that finding a partitioning with approximation ratio less than 1.25 is also NP-hard.
  • Provided three algorithms with tradeoffs between complexity and approximation ratio.
Divide-And-Group (DAG)
  • Divide the space into square cells of a proper size
  • Find a set of non-overlapping tiles of 2 x 2 cells to cover the points, such that each tile covers at least k points
  • Assign the remaining (uncovered) points to their nearest tile
Min-MBR-Group (MMG)
  • For each point p, find the smallest minimum bounding rectangle (MBR) that covers at least k points including p
  • From the MBRs found in the previous step, select a set of non-overlapping MBRs
  • Assign the remaining points to their nearest MBR
Nearest-Neighbor-Group (NNG)
  • For each point p, find the MBR that covers p and its k-1 nearest neighbors
  • From the MBRs found in the previous step, select a set of non-overlapping MBRs
  • Assign the remaining points to their nearest MBR
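
The three NNG steps can be sketched as below. The slides do not say how the non-overlapping MBRs are selected, so the greedy smallest-first rule used here is an assumption, as are the helper names and the sample points. The sketch also assumes distinct points and Python 3.8+ for `math.dist`.

```python
from math import dist


def mbr(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def overlaps(a, b):
    """True if rectangles a and b intersect (touching counts as overlap)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def dist_to_mbr(p, r):
    dx = max(r[0] - p[0], 0, p[0] - r[2])
    dy = max(r[1] - p[1], 0, p[1] - r[3])
    return (dx * dx + dy * dy) ** 0.5


def nng(points, k):
    # Step 1: one candidate MBR per point, covering it and its k-1 nearest neighbors.
    candidates = []
    for p in points:
        nearest = sorted(points, key=lambda q: dist(p, q))[:k]  # includes p itself
        candidates.append(mbr(nearest))
    # Step 2 (assumed rule): greedily keep the smallest MBRs that do not overlap.
    candidates.sort(key=lambda r: (r[2] - r[0]) + (r[3] - r[1]))
    chosen = []
    for r in candidates:
        if not any(overlaps(r, c) for c in chosen):
            chosen.append(r)
    # Step 3: assign every point to its nearest chosen MBR.
    groups = {r: [] for r in chosen}
    for p in points:
        groups[min(chosen, key=lambda r: dist_to_mbr(p, r))].append(p)
    return groups


points = [(21, 12), (23, 14), (22, 18), (36, 25), (37, 26), (41, 33)]
groups = nng(points, k=2)
for rect, members in groups.items():
    print(rect, members)
```

Because the chosen MBRs do not overlap, the k points that define each one stay with it, so every resulting group has at least k members. The final rectangle of a group would be recomputed after the leftover points are added.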
Drawback of k-anonymity
  • In a QI group, if many records have the same sensitive attribute value, an adversary can infer that value with high confidence.

[Figure: a generalized table with quasi-identifier (QI) attributes and a sensitive attribute.]

If Bob is in a group whose records all have pneumonia, he must have pneumonia.

l-diversity [ICDE06]
  • A QI-group with m tuples is l-diverse iff each sensitive value appears no more than m/l times in the QI-group.
  • A table is l-diverse iff all of its QI-groups are l-diverse.
  • The table on this slide is 2-diverse.

[Figure: a table with quasi-identifier (QI) attributes and a sensitive attribute, divided into 2 QI-groups.]
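
The definition translates directly into a check. In the sketch below a QI-group is represented only by its list of sensitive values, which is an assumption of this example; the groups themselves are hypothetical. The second helper reports the best guess an adversary without prior knowledge can make, which in an l-diverse group is at most 1/l.

```python
from collections import Counter


def is_l_diverse(group, l):
    """A group with m tuples is l-diverse iff no value appears more than m/l times."""
    m = len(group)
    return all(count <= m / l for count in Counter(group).values())


def table_is_l_diverse(table, l):
    return all(is_l_diverse(group, l) for group in table)


def max_confidence(group):
    """Highest probability with which one sensitive value can be guessed."""
    return max(Counter(group).values()) / len(group)


# Hypothetical QI-groups, each given by its sensitive values.
g1 = ["pneumonia", "dyspepsia", "pneumonia", "dyspepsia"]
g2 = ["flu", "gastritis", "bronchitis"]
bad = ["pneumonia"] * 4

print(table_is_l_diverse([g1, g2], 2), max_confidence(g1))
print(is_l_diverse(bad, 2), max_confidence(bad))
```

The group in which every record has pneumonia is 4-anonymous yet not 2-diverse, and the adversary's confidence there is 1.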

What l-diversity guarantees
  • From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l

[Figure: a 2-diverse generalized table.]

A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006.

Problem with multi-publishing
  • A hospital keeps track of the medical records collected in the last three months.
  • The microdata table T(1) and its 2-diverse generalization T*(1) are published in Apr. 2007.
  • Bob was hospitalized in Mar. 2007, so his record is in T(1).
  • One month later, in May 2007:
    • Some obsolete tuples are deleted from the microdata.
    • Bob’s tuple stays.
    • Some new records are inserted, giving the microdata T(2).
    • The hospital publishes the 2-diverse generalization T*(2).
  • Consider the previous adversary, who knows that Bob appears in both releases.
    • What the adversary learns from T*(1): the candidate diseases in Bob’s QI group.
    • What the adversary learns from T*(2): the candidate diseases in Bob’s new QI group.
    • The two candidate sets have only dyspepsia in common.
  • So Bob must have contracted dyspepsia!
  • A new generalization principle is needed.

[Figures omitted: microdata T(1) and T(2) and their 2-diverse generalizations T*(1) and T*(2).]
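
The adversary's reasoning amounts to intersecting the candidate sets obtained from each release. In the sketch below, Bob's candidates from T*(1) are the signature given later in the deck, {dyspepsia, bronchitis}; his candidates from T*(2) are not preserved in this transcript, so the set {dyspepsia, gastritis} is an assumption made purely for illustration.

```python
def candidates_after(releases):
    """Intersect the candidate sensitive values derived from each release."""
    remaining = set(releases[0])
    for candidates in releases[1:]:
        remaining &= set(candidates)
    return remaining


from_t1 = {"dyspepsia", "bronchitis"}   # Bob's QI group in T*(1)
from_t2 = {"dyspepsia", "gastritis"}    # assumed QI group of Bob in T*(2)

print(candidates_after([from_t1, from_t2]))
```

Each release on its own is 2-diverse, yet together they leave a single candidate.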
m-invariance [SIGMOD07]
  • A sequence of generalized tables T*(1), …, T*(n) is m-invariant if and only if
    • T*(1), …, T*(n) are m-unique, and
    • each individual has the same signature in every generalized table in which s/he is involved.
  • Explanation
    • m-unique: every QI group contains at least m tuples, all with different sensitive values
    • signature: the set of sensitive values in the individual’s QI group.
m-unique
  • A generalized table T*(j) is m-unique if and only if
    • each QI-group in T*(j) contains at least m tuples, and
    • all tuples in the same QI-group have different sensitive values.

[Figure: a 2-unique generalized table.]
Signature
  • The signature of Bob in T*(1) is {dyspepsia, bronchitis}
  • The signature of Jane in T*(1) is {dyspepsia, flu, gastritis}

[Figure: the generalized table T*(1).]
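
The definitions of m-unique, signature, and m-invariance can be checked mechanically from the publisher's side. The sketch below assumes a generalized table is a dict from group id to a list of (person, sensitive value) pairs. Bob's and Jane's signatures match the slide; every other name and the assignment of values to people are hypothetical.

```python
def is_m_unique(table, m):
    """Every QI-group has at least m tuples, all with different sensitive values."""
    for tuples in table.values():
        values = [v for _, v in tuples]
        if len(values) < m or len(set(values)) != len(values):
            return False
    return True


def signatures(table):
    """Map each person to the set of sensitive values in his or her QI-group."""
    sig = {}
    for tuples in table.values():
        s = frozenset(v for _, v in tuples)
        for person, _ in tuples:
            sig[person] = s
    return sig


def is_m_invariant(tables, m):
    if not all(is_m_unique(t, m) for t in tables):
        return False
    seen = {}
    for t in tables:
        for person, s in signatures(t).items():
            if seen.setdefault(person, s) != s:  # signature changed between releases
                return False
    return True


t1 = {1: [("Bob", "dyspepsia"), ("Alice", "bronchitis")],
      2: [("Jane", "dyspepsia"), ("Ken", "flu"), ("Linda", "gastritis")]}
t2_bad = {1: [("Bob", "dyspepsia"), ("Mark", "gastritis")],
          2: [("Jane", "dyspepsia"), ("Ken", "flu"), ("Linda", "gastritis")]}
t2_good = {1: [("Bob", "dyspepsia"), ("Mark", "bronchitis")],
           2: [("Jane", "dyspepsia"), ("Ken", "flu"), ("Linda", "gastritis")]}

print(is_m_invariant([t1, t2_bad], 2), is_m_invariant([t1, t2_good], 2))
```

In `t2_bad` both tables are 2-unique, but Bob's signature changes from {dyspepsia, bronchitis} to {dyspepsia, gastritis}, so the sequence is not 2-invariant.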

The m-invariance principle
  • Lemma: if a sequence of generalized tables {T*(1), …, T*(n)} is m-invariant, then for any individual o involved in any of these tables, we have risk(o) ≤ 1/m.
  • Lemma: let {T*(1), …, T*(n-1)} be m-invariant. Then {T*(1), …, T*(n-1), T*(n)} is also m-invariant if and only if {T*(n-1), T*(n)} is m-invariant.
  • Only T*(n-1) is needed for the generation of T*(n); T*(1), …, T*(n-2) can be discarded.

Solution idea
  • Goal: given T(n) and T*(n-1), create T*(n) such that {T*(n-1), T*(n)} is m-invariant.
  • Idea: create counterfeits.
  • Optimization goal: impose as little generalization as possible.
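
A very small sketch of the counterfeit idea, under stated assumptions: a person's QI group in T*(n) must show the same signature as in T*(n-1), and when no real tuple in T(n) supplies one of the required values, a counterfeit tuple is added. The function name, the tuple layout, and the choice to report only the number of counterfeits (as a stand-in for what an auxiliary relation might record) are assumptions of this example, not the authors' algorithm.

```python
def build_group(real_tuples, signature):
    """Pad a QI group with counterfeits so that it shows the required signature.

    real_tuples: list of (person, sensitive value) pairs from the microdata.
    signature:   set of sensitive values the group must contain.
    Returns the padded group and the number of counterfeits added.
    """
    have = {value for _, value in real_tuples}
    fakes = [("counterfeit", value) for value in sorted(set(signature) - have)]
    return real_tuples + fakes, len(fakes)


# Bob's signature in T*(1) is {dyspepsia, bronchitis}. Assume that T(2) offers
# no bronchitis patient who can be grouped with him.
group, n_fake = build_group([("Bob", "dyspepsia")], {"dyspepsia", "bronchitis"})
print(group, n_fake)
```

The padded group again shows {dyspepsia, bronchitis}, so Bob's signature is unchanged between the two releases.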

[Figure: microdata T(2), its counterfeited generalization T*(2), and the auxiliary relation R(2) for T*(2).]

[Figure: generalization T*(1), the counterfeited generalization T*(2), and the auxiliary relation R(2) for T*(2).]

[Figure: generalizations T*(1) and T*(2), shown alongside the two conditions of m-invariance: every table is m-unique, and each individual has the same signature in every generalized table in which s/he is involved.]

In case of corruption…
  • If an adversary learns from Alice that she has bronchitis, he can conclude that Bob has dyspepsia.

[Figure: a 2-diverse generalization and the corresponding microdata.]
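
The corruption scenario can be traced in a few lines, assuming the adversary knows which individuals share a QI group and has learned some of their sensitive values. The function name and data layout are illustrative.

```python
def infer_after_corruption(group_values, known):
    """Remove the sensitive values that the adversary knows belong to others.

    group_values: sensitive values published for one QI group.
    known:        dict person -> value learned through corruption.
    """
    remaining = list(group_values)
    for value in known.values():
        remaining.remove(value)  # this value is accounted for by a corrupted person
    return remaining


# Alice and Bob share a 2-diverse QI group; Alice reveals her own disease.
left_for_bob = infer_after_corruption(["bronchitis", "dyspepsia"],
                                      {"Alice": "bronchitis"})
print(left_for_bob)
```

Only dyspepsia remains, so the 1/2 bound promised by 2-diversity no longer protects Bob.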

Anti-corruption publishing [ICDE08]
  • We formalized anti-corruption publishing, by modeling the degree of privacy preservation as a function of an adversary’s background knowledge.
  • We proposed a solution, by integrating generalization with
    • perturbation: switch selected records’ sensitive information.
    • stratified sampling: sample some records from each QI group.
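
The two building blocks can be sketched with Python's `random` module. How records are selected for perturbation and how many records are sampled per group are not stated on the slide, so the retention probability, the per-group sample size, and the function names below are all assumptions.

```python
import random


def perturb(values, retain_prob, domain, rng):
    """Keep each sensitive value with probability retain_prob; otherwise
    switch it to a value drawn uniformly from the domain (assumed scheme)."""
    return [v if rng.random() < retain_prob else rng.choice(domain)
            for v in values]


def stratified_sample(groups, per_group, rng):
    """Sample up to per_group records from every QI group."""
    return {gid: rng.sample(records, min(per_group, len(records)))
            for gid, records in groups.items()}


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
domain = ["dyspepsia", "bronchitis", "flu", "gastritis"]
groups = {1: ["dyspepsia", "bronchitis", "flu"],
          2: ["flu", "gastritis"]}

sampled = stratified_sample(groups, per_group=2, rng=rng)
noisy = perturb(groups[1], retain_prob=0.7, domain=domain, rng=rng)
print(sampled, noisy)
```

After perturbation a published value is no longer guaranteed to be the true one, and after sampling a known individual is no longer guaranteed to be in the published group, which weakens the deduction shown in the corruption example.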
Summary
  • Introduced the problem of privacy-preserving publishing.
  • Two principles:
    • k-anonymity
    • l-diversity
  • Two extensions:
    • multi-publishing
    • corruption