
Global Disclosure Risk for Microdata with Continuous Attributes

Traian Marius Truta

Northern Kentucky University

HIPAA Privacy Rule
  • The Health Insurance Portability and Accountability Act (1996)
  • The Privacy Rule protects the privacy of individually identifiable health information by establishing conditions for its use and disclosure
  • Privacy Rule effective date: 14 April 2003
  • Defines 18 identifiers that must be removed in order to de-identify the data

Traian Truta - Northern Kentucky University

The Identifiers in the Privacy Rule
  • Names
  • Telephone #
  • Fax #
  • E-mail address
  • Social Security #
  • Medical record, prescription #
  • Health Plan beneficiary #
  • Account #
  • Certificates/license #
  • VIN and serial #, license plate #
  • Device identifiers, serial #
  • Web URLs
  • IP address
  • Biometric identifiers (fingerprints)
  • Full face photo images
  • Unique identifying #
  • Geographic info (including city, state, and zip)
  • Elements of dates

De-identification Process
  • Safe Harbor: remove all 18 defined identifiers, and have no knowledge that the remaining information can identify the individual
  • Statistical de-identification: a statistician certifies that there is a “very small” risk that the information could be used to identify the individual

Disclosure Control Problem
  • Individuals submit data; the Data Owner collects it
  • The Data Owner applies a masking process that must protect the confidentiality of individuals (measures of disclosure risk) and preserve data utility (measures of information loss)
  • The Data Owner releases the masked data; the Researcher and the Intruder receive it
  • The Researcher uses the masked data for statistical analysis
  • The Intruder uses the masked data and external data to disclose confidential information
  • [Diagram label: This Presentation]

General Framework for Microdata
  • I – Identifier Attributes (Name, SSN, etc.)
  • K – Key Attributes (Zip Code, Age, Race, etc.)
  • S – Confidential Attributes (Income, Diagnosis, etc.)

Disclosure Control Techniques
  • Different disclosure control techniques are applied to the following initial microdata:

Remove Identifiers
  • Identifiers such as names and SSNs are removed

Sampling
  • Sampling is the disclosure control method in which only a subset of records is released
  • If n is the number of records in the initial microdata and t is the number of records released, we call sf = t / n the sampling factor
  • Simple random sampling is commonly used: each individual is chosen entirely by chance, and each member of the population has an equal chance of being included in the sample
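
A minimal Python sketch of simple random sampling with a given sampling factor. The function name, the rounding of t, and the `seed` parameter are assumptions made for illustration; the slides do not specify an implementation.

```python
import random


def simple_random_sample(records, sf, seed=None):
    """Release a simple random sample of `records` with sampling factor sf = t / n."""
    if not 0 < sf <= 1:
        raise ValueError("sampling factor must be in (0, 1]")
    n = len(records)
    t = round(sf * n)          # assumed rounding rule for the released size t
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    # random.Random.sample draws t distinct records, each equally likely.
    return rng.sample(records, t)


initial = list(range(1, 11))  # hypothetical record ids 1..10
released = simple_random_sample(initial, sf=0.5, seed=0)
achieved_sf = len(released) / len(initial)
```

With ten records and sf = 0.5, five distinct records are released, so the achieved sampling factor equals the requested one.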

Microaggregation
  • Order the records of the initial microdata by an attribute, create groups of consecutive values, and replace those values with the group average
  • Example: microaggregation for attribute Income with minimum group size 3
  • The total sum for all Income values remains the same.
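
The steps above can be sketched in Python as follows. The income values are hypothetical (the slide's table is not part of this transcript), and folding a short tail into the last group is one assumed way to honor the minimum size, not necessarily the author's.

```python
def microaggregate(values, k):
    """Univariate microaggregation: sort, group consecutive values, replace by group mean."""
    if k < 1:
        raise ValueError("minimum group size must be at least 1")
    # Indices of the records ordered by the attribute value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start + k
        # Assumption: if fewer than k records would remain, merge them into this group.
        if len(order) - end < k:
            end = len(order)
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
        start = end
    return masked


income = [50, 30, 40, 90, 70, 80, 100]  # hypothetical Income values
masked_income = microaggregate(income, 3)
```

Here the sorted values split into the groups (30, 40, 50) and (70, 80, 90, 100), which are replaced by 40 and 85. The total stays at 460, matching the observation that the sum of all Income values is preserved.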

Global Disclosure Risk Measures

Assumptions

  • The intruder does not know any confidential information
  • The intruder knows all the key and identifier values for the population

Objectives

  • Disclosure risk (DR) measures for specific disclosure control (DC) methods (Remove Identifiers, Sampling, Microaggregation, etc.)
  • DR measures for any combination of DC methods

Proposed measures

DR_min, DR_W, DR_max

Notations for IM and IMM
  • n – the number of entities in the population.
  • F – the number of clusters with the same values for key attributes.
  • A_k – the set of elements in the k-th cluster, for all k, 1 ≤ k ≤ F.
  • F_i = |{A_k : |A_k| = i, k = 1, ..., F}| for all i, 1 ≤ i ≤ n. F_i is the number of clusters of size i.
  • n_i = |{x ∈ A_k : |A_k| = i, k = 1, ..., F}| for all i, 1 ≤ i ≤ n. n_i is the number of records in clusters of size i.

Disclosure Risk Measures for Remove Identifiers Method
  • {1, 2, 4}
  • {3, 5, 9}
  • {6, 10}
  • {7}
  • {8}
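
As a rough illustration of the notation, the Python sketch below computes n, F, F_i and n_i for the five sets listed above. It assumes each set holds the record numbers that share the same key attribute values; the table they were derived from is not part of this transcript.

```python
from collections import Counter

# Assumed reading: each set is one cluster of record numbers with identical key values.
clusters = [{1, 2, 4}, {3, 5, 9}, {6, 10}, {7}, {8}]

n = sum(len(a) for a in clusters)             # number of records
F = len(clusters)                             # number of clusters
size_counts = Counter(len(a) for a in clusters)
F_i = dict(sorted(size_counts.items()))       # cluster size i -> number of clusters
n_i = {i: i * count for i, count in F_i.items()}  # cluster size i -> number of records

print(n, F, F_i, n_i)
# prints: 10 5 {1: 2, 2: 1, 3: 2} {1: 2, 2: 2, 3: 6}
```

Since every cluster of size i contributes exactly i records, n_i = i · F_i, and the n_i values add up to n.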

Disclosure Risk Measures for Remove Identifiers Method

DR_max - considers probabilistic linkage

DR_min - percentage of unique records

DR_W - weights defined by the data owner

w = (w_1, w_2, …, w_n) is the disclosure risk weight vector.

Properties

a) w_i ∈ R+ for all i = 1, ..., n;

b) w_i ≥ w_j for all i ≤ j, i, j = 1, ..., n;

Disclosure Risk Measures for Remove Identifiers Method
  • w1= (5, 5, 0, 0, ..., 0)
  • w2= (4, 3, 3, 0, ..., 0)
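
The formulas for the three measures were images on the slides and are not part of this transcript. The Python sketch below therefore uses assumed forms that match the slide descriptions: DR_min = n_1 / n (share of unique records), DR_max = (1/n) Σ n_i / i (a record in a cluster of size i is linked with probability 1/i), and DR_W = (1 / (w_1 · n)) Σ (w_i / i) · n_i. The normalization by w_1 in particular is an assumption and should be checked against the papers listed under "Some Papers". The n_i values correspond to the clusters {1, 2, 4}, {3, 5, 9}, {6, 10}, {7}, {8}.

```python
def dr_min(n_i, n):
    """Assumed form: fraction of records that are unique on their key values."""
    return n_i.get(1, 0) / n


def dr_max(n_i, n):
    """Assumed form: each record in a cluster of size i is linked with probability 1/i."""
    return sum(count / i for i, count in n_i.items()) / n


def dr_w(n_i, n, w):
    """Assumed form: weighted linkage risk, normalized by w_1 (w[0]), which must be > 0.

    `w` lists the leading weights w_1, w_2, ...; omitted trailing weights count as 0.
    """
    total = sum(w[i - 1] * count / i for i, count in n_i.items() if i <= len(w))
    return total / (w[0] * n)


n = 10
n_i = {1: 2, 2: 2, 3: 6}  # records in clusters of size 1, 2 and 3

risk_min = dr_min(n_i, n)
risk_max = dr_max(n_i, n)
risk_w1 = dr_w(n_i, n, [5, 5])     # w1 = (5, 5, 0, ..., 0)
risk_w2 = dr_w(n_i, n, [4, 3, 3])  # w2 = (4, 3, 3, 0, ..., 0)
```

Under these assumed definitions the example gives DR_min = 0.2 and DR_max = 0.5, with the two weighted values (0.3 for w1 and 0.425 for w2) falling between them.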

Disclosure Risk Measures for RI Method with Continuous Attribute
  • What if the intruder has only approximations of income?
  • w1= (5, 5, 0, 0, ..., 0)
  • w2= (4, 3, 3, 0, ..., 0)

Disclosure Risk Measures for RI Method with Continuous Attribute
  • We consider vicinity sets!
  • w1= (5, 5, 0, 0, ..., 0)
  • w2= (4, 3, 3, 0, ..., 0)
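
The slides do not define vicinity sets in this transcript. One possible reading, sketched below in Python, is that an intruder who knows a continuous value only approximately cannot tell apart records whose values lie within some tolerance of each other. The tolerance `eps`, the record layout, and the sample values are all hypothetical.

```python
def vicinity_sizes(records, eps):
    """For each (key, value) record, count records with the same key and a value within eps.

    This is an illustrative interpretation of a vicinity set, not the authors' definition.
    """
    sizes = []
    for key, value in records:
        size = sum(1 for k, v in records if k == key and abs(v - value) <= eps)
        sizes.append(size)
    return sizes


# Hypothetical records: (Sex, Income)
records = [("M", 30000), ("M", 30500), ("M", 52000), ("F", 30200)]

exact = vicinity_sizes(records, eps=0)      # intruder knows incomes exactly
approx = vicinity_sizes(records, eps=1000)  # intruder knows incomes to within 1000
```

With exact knowledge every record is unique. With approximate knowledge the first two records fall into each other's vicinity, so they are no longer unique, while the other two remain so.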

Notations for Masked Microdata
  • f – the number of clusters with the same values for key attributes in the masked microdata MM.
  • We cluster all records from MM based on their key values. B_k – the set of elements in the k-th cluster, for all k, 1 ≤ k ≤ f.
  • f_i = |{B_k : |B_k| = i, k = 1, ..., f}| for all i, 1 ≤ i ≤ n. f_i is the number of clusters of size i.
  • t_i = |{x ∈ B_k : |B_k| = i, k = 1, ..., f}| for all i, 1 ≤ i ≤ n. t_i is the number of records in clusters of size i.
  • C – the classification matrix. For all i, j = 1, ..., n: c_ij = |{x ∈ B_k and x ∈ A_p : |B_k| = i, k = 1, ..., f, and |A_p| = j, p = 1, ..., F}|. Each element c_ij of C is the number of records that appear in clusters of size i in the masked microdata and appeared in clusters of size j in the initial microdata.

Algorithm for Creating Classification Matrix

Initialize each element of C with 0.

For each element s from masked microdata MM do

    Count the number of occurrences of the key values of s in masked microdata MM. Let i be this number.

    Count the number of occurrences of the key values of s in initial microdata IM. Let j be this number.

    Increment c_ij by 1.

End for.
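
A Python sketch of this algorithm follows. It stores C sparsely as a mapping from (i, j) to a count instead of an n × n array, and it assumes each released record carries both its initial and its masked key values, so that j can be looked up even when masking has changed the keys. The function name and the sample keys are hypothetical.

```python
from collections import Counter


def classification_matrix(initial_keys, released):
    """Build C as {(i, j): count}.

    initial_keys: key values of every record in the initial microdata IM.
    released: (initial_key, masked_key) pairs, one per record in the masked microdata MM.
    """
    im_counts = Counter(initial_keys)                   # cluster sizes in IM
    mm_counts = Counter(masked for _, masked in released)  # cluster sizes in MM
    c = Counter()                                       # every c_ij starts at 0
    for initial_key, masked_key in released:
        i = mm_counts[masked_key]   # size of the record's cluster in MM
        j = im_counts[initial_key]  # size of the record's cluster in IM
        c[(i, j)] += 1
    return c


# Hypothetical example: sampling only, so key values are unchanged by masking.
initial_keys = ["a", "a", "a", "b", "b", "c"]
released = [("a", "a"), ("a", "a"), ("b", "b"), ("c", "c")]
C = classification_matrix(initial_keys, released)
```

In this example two released records sit in a cluster of size 2 that came from a cluster of size 3, so c_23 = 2; the remaining entries are c_12 = 1 and c_11 = 1. Precomputing the two counters keeps the loop linear in the number of released records.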

Disclosure Risk Measures for Microaggregation Method
  • What if the data is continuous?

Disclosure Risk Measures for Microaggregation Method

Initial Microdata

Disclosure Risk Measures for Microaggregation Method

Univariate microaggregation for attribute Age with size = 2, 4, 8

[Tables: Masked Microdata 1, Masked Microdata 2, Masked Microdata 3]

Disclosure Risk Measures for Microaggregation Method

Example – Disclosure risk values NO VICINITY!

Disclosure Risk Measures for Microaggregation Method

Example – Disclosure risk values WITH VICINITY!

General Disclosure Risk Measures
  • icf_k – inversion-change factor for attribute k
  • p – number of key attributes
  • v – binary vector associated with the key attributes

Experimental Data
  • Simulated medical record billing data
  • Age, Sex, Zip and Amount_Billed
  • Three initial microdata:
    • n= 1,000 (called IM1000)
    • n= 5,000 (IM5000)
    • n= 25,000 (IM25000)
  • Key attributes:
    • KA1= {Age, Sex, Zip}
    • KA2= {Age, Sex}

Results for Sampling and Microaggregation

Sampling, followed by microaggregation for Age when IM5000 and KA1 are used.

Results for Sampling and Microaggregation

Sampling and microaggregation for Age when IM5000 and KA1 are used.

Conclusions
  • The data owner may customize its disclosure risk measure to better reflect the characteristics of the microdata. Privacy requirements may help the data owner define the disclosure risk weight matrix.
  • Importance of masking key attributes with small vicinity sets

Future Work
  • Our experiments focused on healthcare microdata; experiments on other types of data, such as financial data, are needed.
  • Study disclosure control for microdata under the assumption that the initial microdata is frequently updated (Dynamic Disclosure Control)

Some Papers
  • Details about DR Measures
    • “Disclosure Risk Measures for Sampling Disclosure Control Method,” to appear in the Proceedings of ACM Symposium on Applied Computing (SAC2004), special track on Computer Applications in Health Care (COMPAHEC2004), Nicosia, Cyprus
    • “Disclosure Risk Measures for Microdata,” Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM2003), Cambridge, Ma, pp. 15 – 22, 2003
  • Information Loss Measures
    • “Privacy and Confidentiality Management for the Microaggregation Disclosure Control Method,” Proceedings of the Workshop on Privacy and Electronic Society (WPES2003), In Conjunction with 10th ACM CCS, Washington DC, pp. 21 – 30, 2003
  • Automatic Masked Microdata Generator
    • “Automatic Generation of Masked Microdata,” to appear in the Acta Universitatis Apulensis, Alba Iulia, Romania

Acknowledgements
  • Dr. Farshad Fotouhi
  • Dr. Daniel Barth-Jones

Questions?
