
A Primer on Data Masking Techniques for Numerical Data


Krish Muralidhar

Gatton College of Business & Economics

- I would first like to acknowledge that most of my work in this area is with my co-author Dr. Rathindra Sarathy at Oklahoma State University

- Data masking deals with techniques that can be used in situations where data sets consisting of sensitive (confidential) information are “masked”. The masked data retains its usefulness without compromising privacy and/or confidentiality. The masked data can be analyzed, shared, or disseminated without risk of disclosure.

[Figure: Original Data and Masked Data]

- Minimize risk of disclosure resulting from providing access to the data
- Maximize the analytical usefulness of the data

- We are talking about protecting data that is made available to users, shared with others, or disseminated to the general public
- We are not dealing with unauthorized access to the data
- Encryption is not a solution
- We cannot perform analysis on encrypted data
- There are a few exceptions

- To perform analysis on the data, it must be decrypted
- Decrypted data offers no protection


- Since we have the data set, we know the characteristics of the data set. We are trying to create a new data set that essentially contains the same characteristics as the original data set. We are not trying to discern the characteristics of the original data set using the information in the masked data.
- In other words, I will not be talking about Agrawal and Srikant
- Or about the “distributed data” situation

- In addition, since most of you are probably familiar with the CS literature on this area, I will focus on the literature in the “statistical disclosure limitation” area

- It is assumed that the data will be used primarily for analysis at the aggregate level using statistical or other analytical techniques
- The data will not be accurate at the individual record level

- The organization that owns the data could potentially release aggregate information about the characteristics of the data set
- The users can still perform some types of analyses using the aggregate data, but this limits their ability to perform ad hoc analysis
- Releasing the microdata provides the users with the flexibility to perform any type of analysis
- In this talk, we assume that the intent is to release microdata

- Restricted access
- Query restrictions
- Other methods

- Typically, the data is historical and consists of
- Categorical variables (or attributes)
- Numerical variables
- Discrete variables
- Continuous variables

- In cases where identity is not to be revealed, key identification variables will be removed from the data set (de-identified)

- The common misconception is that, in order to prevent disclosure, all that is required is to remove “key identifiers”. However, even if the “key identifiers” are removed, in many cases it would be easy to identify an individual using external data sources
- Latanya Sweeney’s work on k-anonymity
- The availability of numerical data makes it easy to re-identify records through record linkage

- Since de-identification alone does not prevent disclosure, it is necessary to “mask” the original data so that an intruder, even using external sources of data, cannot
- Identify a particular released record as belonging to a particular individual
- Estimate the value of a confidential variable for a particular record accurately

- Every user is potentially an intruder
- Since microdata is released, we cannot prevent the user from performing any type of analysis on the released data
- Must account for disclosure risk from any and all types of analyses
- Worst case scenario

- Data masking techniques are used to mask all types of data (categorical, discrete numerical, and continuous numerical)
- The focus of our research, and of this talk, is data masking for continuous numerical data

- Identify the data set to be released and the sensitive variables in the data
- Release all aggregate information regarding the data
- Characteristics of individual variables
- Relationship measures
- Any other relevant information

- Release non-sensitive data
- Since my focus is on numerical microdata, I will assume that all categorical and discrete data are either released unmasked or are masked prior to release

- Release masked numerical microdata

- Minimize disclosure risk (or maximize security)
- Minimize information loss (or maximize data utility)
- Other characteristics
- Must be easy to use
- The user must be able to analyze the masked data exactly as he/she would the original data

- Must be easy to implement


- Dalenius defines disclosure as having occurred if, using the released data, an intruder is able to identify an individual or estimate the value of a confidential variable with a greater level of accuracy than was possible prior to such data release

- A data masking technique minimizes disclosure risk, IFF, the release of the masked microdata does not allow an intruder to gain additional information about an individual record over and above what was already available (from the release of aggregate information, the non-confidential variables, and the masked categorical variables)
- Does not mean that the disclosure risk from the entire data release process is minimum; only that the disclosure risk from releasing microdata is minimized

- Can be achieved in practice

- Identity Disclosure
- Re-identification rate

- Value disclosure
- Variability in the confidential attribute explained by the masked data
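The two measures above can be illustrated with a minimal sketch. It assumes a single numerical variable, nearest-value matching as the record linkage rule, and squared product moment correlation as the "variability explained"; the function names and the simulated data are hypothetical and are not taken from the talk.

```python
import random

def reidentification_rate(original, masked):
    """Share of masked records whose nearest original value is their own source record."""
    hits = 0
    for i, y in enumerate(masked):
        nearest = min(range(len(original)), key=lambda j: abs(original[j] - y))
        if nearest == i:
            hits += 1
    return hits / len(masked)

def r_squared(x, y):
    """Squared product moment correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

rng = random.Random(0)
x = [rng.uniform(0, 1000) for _ in range(300)]   # simulated confidential values
light = [v + rng.gauss(0, 1) for v in x]         # very little noise added
heavy = [v + rng.gauss(0, 300) for v in x]       # a lot of noise added

rate_light = reidentification_rate(x, light)     # identity disclosure, light masking
rate_heavy = reidentification_rate(x, heavy)     # identity disclosure, heavy masking
r2_light = r_squared(x, light)                   # value disclosure, light masking
r2_heavy = r_squared(x, heavy)                   # value disclosure, heavy masking
```

Both measures should come out higher for the lightly masked version, which is the sense in which light masking offers little protection.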

- Information loss is minimized IFF, for any arbitrary analysis (or query), the response from the masked data is exactly the same as that from the original data
- Impossible to achieve in practice
- Since an arbitrary analysis may involve a single record, the only way to achieve this objective is to release unmasked data

- In practice, we attempt to minimize information loss by maintaining the characteristics of the masked data to be the same as that of the original data
- From a statistical perspective, we attempt to maintain the masked data to be “similar to” the original data so that responses to analyses using the masked data will be approximately the same as those using the original data
- Maintain the distribution of the masked data to be the same as the original data

- Ability to maintain
- The marginal distribution
- Relationships between variables
- Linear
- Monotonic
- Non-monotonic

- Noise addition
- Micro-aggregation
- Data swapping
- Other similar approaches
- Any approach in which the masked value yij (the masked value for the jth variable of the ith record) is generated as a function of xi.

- A data set consisting of 2 categorical variables, 1 discrete, and 3 (confidential) numerical variables
- 50000 records

- Home value and Mortgage balance have heavily skewed distributions

- Relationships are not necessarily linear
- Measured by both product moment and rank order correlation

- The most rudimentary method of data masking. Add random noise to every confidential value of the form
- yi = xi + ei
- Typically e ~ Normal(0, d*Var(Xi))
- The selection of d specifies the level of noise. Large d indicates higher level of masking
- The variance is changed resulting in biased estimates
- Many variations exist
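A minimal sketch of simple noise addition, assuming a single simulated, skewed, strictly positive variable (the lognormal parameters and the name `add_noise` are illustrative only). Because the noise is independent of X, the variance of the masked values is about (1 + d) times the original variance.

```python
import random
from statistics import pvariance

def add_noise(x, d, rng):
    """Mask x as y_i = x_i + e_i with e ~ Normal(0, d * Var(X))."""
    sd = (d * pvariance(x)) ** 0.5
    return [v + rng.gauss(0, sd) for v in x]

rng = random.Random(1)
x = [rng.lognormvariate(10, 1) for _ in range(20000)]  # skewed, strictly positive
y = add_noise(x, 0.5, rng)                             # noise level d = 0.5

variance_ratio = pvariance(y) / pvariance(x)   # close to 1 + d in expectation
negatives = sum(1 for v in y if v < 0)         # values that cannot occur in x
```

With skewed positive data and a noise level this large, some masked values fall below zero even though no original value does.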

- The addition of noise results in an increase in variance
- This can be addressed easily, but there are other issues that cannot be, such as
- The marginal distribution is modified
- All Relationships are attenuated

Mortgage Balance versus Asset balance (Noise Added)

- Everything looks good
- Bias is small
- Relationships seem to be maintained

- So what is the problem?
- The problem is security
- Since very little noise is added, there is very little protection afforded to the records

The correlation between the original and masked values is of the order of 0.99. The masked values themselves are excellent predictors of the original value. Little or no “masking” is involved.

- Adding very little noise (10% of the variance of the individual variable) results in low information loss, but also results in high disclosure risk
- In order to decrease disclosure risk, it would be necessary to increase the noise (say 50% of the variance), but that would result in higher information loss

- At first glance, it does not seem too bad, but on closer observation, we notice that there are lots of negative values that did not exist in the original data
- Negative values can be addressed

Mortgage Balance versus Asset balance (Noise Added)

There is a considerable difference between the original and masked data. The correlations are considerably lower.

- The marginal distribution is completely modified
- This is an unavoidable consequence of any noise “addition” procedure

- In summary, noise addition is a rudimentary procedure that is easy to implement and easy to explain. There is always a trade-off between disclosure risk and information loss. If the disclosure risk is low (high) then the corresponding information loss is high (low).
- Unfortunately, this is an inherent characteristic of all noise based methods of the form Y = f(X,e) whether the noise is additive or multiplicative or some other form

- Recently, we have developed a new technique that is similar to noise addition, but maintains the mean vector and covariance matrix of the masked data to be the same as the original data
- Offers the same characteristics as noise addition, but assures that results for traditional statistical analyses using the masked data will be the same as the original data

- Model:
yi = γ + αxi + βsi + εi

- The only parameter that must be selected is the “proximity parameter” α. All other parameters are dictated by the selection of this parameter

- The parameter α (0 ≤ α ≤ 1) dictates the strength of the relationship between X and Y.
- When α = 1, Y = X.
- When α = 0, the perturbed variable is generated independent of X (the GADP model to be discussed later)
- We provide the ability to specify α to achieve any degree of proximity between these two extremes

- γ = (1 – α)μX – βμS (where μX and μS are the means of X and S)
- β = (1 – α)(σXS/σ²SS)
- ε ~ Normal(0, (1 – α²)(σ²XX – (σXS)²/σ²SS))
- Can be generated from other distributions

- ε orthogonal to X and S

- In order to maintain sufficient statistics, it is NECESSARY that the model for generating the perturbed values MUST be specified in this manner
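A minimal sketch of the model yi = γ + αxi + βsi + εi for one confidential variable X and one non-confidential variable S. The parameter expressions in the code are the ones obtained by requiring that Y have the same mean and variance as X and the same covariance with S; the projection step that makes the noise exactly orthogonal to X and S, and all function names, are assumptions of this sketch rather than the authors' implementation.

```python
import random

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def orthogonal_noise(x, s, target_var, rng):
    """Noise with mean 0, variance exactly target_var, uncorrelated with x and s."""
    n = len(x)
    sc, xc = center(s), center(x)
    # Gram-Schmidt: keep the part of x that is orthogonal to s.
    k = dot(xc, sc) / dot(sc, sc)
    xp = [a - k * b for a, b in zip(xc, sc)]
    e = center([rng.gauss(0, 1) for _ in range(n)])
    for basis in (sc, xp):
        k = dot(e, basis) / dot(basis, basis)
        e = [a - k * b for a, b in zip(e, basis)]
    scale = (target_var * n / dot(e, e)) ** 0.5
    return [a * scale for a in e]

def sufficiency_mask(x, s, alpha, rng):
    """Perturb x so that the mean, the variance, and the covariance with s are unchanged."""
    n = len(x)
    mx, ms = sum(x) / n, sum(s) / n
    xc, sc = center(x), center(s)
    var_x, var_s, cov_xs = dot(xc, xc) / n, dot(sc, sc) / n, dot(xc, sc) / n
    beta = (1 - alpha) * cov_xs / var_s
    gamma = (1 - alpha) * mx - beta * ms
    noise_var = (1 - alpha ** 2) * (var_x - cov_xs ** 2 / var_s)
    e = orthogonal_noise(x, s, noise_var, rng)
    return [gamma + alpha * a + beta * b + c for a, b, c in zip(x, s, e)]

rng = random.Random(2)
s = [rng.gauss(50, 10) for _ in range(1000)]   # simulated non-confidential variable
x = [2 * b + rng.gauss(0, 15) for b in s]      # simulated confidential variable
y = sufficiency_mask(x, s, alpha=0.6, rng=rng)
```

Setting `alpha=1` returns the original values and `alpha=0` generates the perturbed values without reference to the individual x values beyond their summary statistics.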

- There is a direct correspondence between the proximity parameter α and the level of noise added in the simple noise addition approach. This procedure will result in incremental disclosure risk except when α = 0
- The level of noise added is approximately equal to (1 – α²)

- The information loss characteristics of the sufficiency based approach are exactly the same as those of the simple noise addition approach, with one major difference. Results of statistical analyses for which the mean vector and covariance matrix are sufficient statistics will be exactly the same using the masked data as they are using the original data.

- If noise addition will be used to mask the data, we should always use sufficiency based noise addition (and never simple noise addition). It provides all the same characteristics of simple noise addition with one major advantage that, for many traditional statistical analyses, it provides the guarantee that the masked data will yield the same results as the original data.

- Replace the values of the variables for a set of k records in close proximity with the average value of k records
- Many different methods of determining close proximity
- Univariate microaggregation where each variable is aggregated individually
- Multivariate microaggregation where the values of all the confidential variables for a given set of records are aggregated

- Results in variance reduction and attenuation of covariance
- All relationships are modified … some correlations are higher, others are lower
- Poor security even for relatively large k
- Consistent with the idea of “k anonymity” since at least k records in the data set will have the same values

- Bill Winkler at the Census Bureau has shown that the risk of identity disclosure is very high even with large k
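A minimal sketch of univariate microaggregation, assuming that "close proximity" means adjacency in sorted order and that a short final group is merged into the preceding one; both choices and the name `microaggregate` are assumptions of this example, since many grouping rules exist.

```python
import random
from collections import Counter
from statistics import mean, pvariance

def microaggregate(x, k):
    """Replace each group of k values adjacent in sorted order with the group mean."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    masked = [0.0] * len(x)
    for start in range(0, len(order), k):
        group = order[start:start + k]
        # Merge a short final group into the previous one so every group has >= k records.
        if len(group) < k and start > 0:
            group = order[start - k:]
        avg = mean(x[i] for i in group)
        for i in group:
            masked[i] = avg
    return masked

rng = random.Random(3)
x = [rng.lognormvariate(10, 1) for _ in range(1003)]
masked = microaggregate(x, k=10)

smallest_group = min(Counter(masked).values())     # at least k records share each value
variance_ratio = pvariance(masked) / pvariance(x)  # at most 1: variance is reduced
```

The mean is preserved because each group is replaced by its own average, while the within-group variation is removed entirely.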

- Swap values of variables within a specified proximity
- When the swapped values are in close proximity, it results in low information loss but high disclosure risk and vice versa
- The proximity is usually specified by the rank of the record
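A minimal sketch of rank-based data swapping for one variable, assuming each record is swapped at most once with a randomly chosen partner no more than `window` rank positions above it. The pairing rule and the names `rank_swap` and `pearson` are hypothetical; published swapping procedures differ in how partners are chosen.

```python
import random

def rank_swap(x, window, rng):
    """Swap each value with a partner at most `window` rank positions away."""
    order = sorted(range(len(x)), key=lambda i: x[i])   # record indices by rank
    masked = list(x)
    swapped = set()
    for pos, i in enumerate(order):
        if i in swapped:
            continue
        candidates = [j for j in order[pos + 1:pos + 1 + window] if j not in swapped]
        if not candidates:
            continue
        j = rng.choice(candidates)
        masked[i], masked[j] = x[j], x[i]
        swapped.update((i, j))
    return masked

def pearson(a, b):
    """Product moment correlation."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5

rng = random.Random(4)
x = [rng.gauss(100, 20) for _ in range(2000)]
narrow = rank_swap(x, window=5, rng=rng)      # partners very close in rank
wide = rank_swap(x, window=1000, rng=rng)     # partners may be far apart in rank

corr_narrow = pearson(x, narrow)
corr_wide = pearson(x, wide)
```

Both versions contain exactly the original values, so the marginal distribution is unchanged; only the correlation between original and swapped values differs with the window.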

- The advantage of data swapping is that it does not change (or perturb) the values; the original values are used
- The marginal distribution of the masked data is exactly the same as the original

- Unfortunately, it results in high information loss and offers poor disclosure risk characteristics

- Information loss is low
- Unfortunately disclosure risk is very high
- The correlation between original and masked net asset value is 0.999

- Now information loss is very high, but disclosure risk is better

- There is an inherent problem with all approaches that generate the perturbed value as a function of the original value …. Y ~ f(X,e)
- These include all noise addition approaches, data swapping, microaggregation, and any variation of these approaches

- Using Dalenius’ definition of disclosure risk, all these techniques result in disclosure
- If we attempt to improve disclosure risk, it will adversely affect information loss (and vice versa)

- What is needed is a method that ensures that the release of the masked data does not result in any additional disclosure, while providing characteristics for the masked data that closely resemble the original data
- From a statistical perspective, at least theoretically, there is a relatively easy solution

- Data set consisting of a set of non-confidential variables S and confidential variables X
- Identify the joint distribution f(S,X)
- Compute the conditional distribution f(X|S)
- Generate the masked values yi using f(X|S = si)
- When S is null, simply generate a new data set with the same characteristics as f(X)

- Then the joint distribution of (S and Y) is the same as that of (S and X)
- f(S,Y) = f(S,X)
- Little or no information loss since the joint distribution of the original and masked data are the same

- When the masked data is generated using the conditional distribution approach (CDA), it can be verified that f(X|Y,S,A) = f(X|S,A)
- Releasing the masked microdata Y does not provide any new information to the intruder over and above the non-confidential variables S and A (the aggregate information regarding the joint distribution of S and X)
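The identity above can be verified in one line of algebra. The sketch assumes that each masked value is drawn from f(X|S = si) using only S and the aggregate information A, so that Y is conditionally independent of X given (S, A); that independence assumption is what allows the joint conditional density to factor.

```latex
\begin{aligned}
f(X \mid Y, S, A) &= \frac{f(X, Y \mid S, A)}{f(Y \mid S, A)} \\
                  &= \frac{f(X \mid S, A)\, f(Y \mid S, A)}{f(Y \mid S, A)} \\
                  &= f(X \mid S, A)
\end{aligned}
```

Any procedure in which the masked value depends directly on the original value, as in Y = f(X, e), breaks the factorization in the second line.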

- The CDA approach results in very low information loss and minimizes disclosure risk and represents a complete solution to the data masking problem
- Unfortunately, in practice
- Identifying f(S,X) may be very difficult
- Deriving f(X|S) may be very difficult
- Generating yi using f(X|S) may be very difficult

- In practice, it is unlikely that we can use the conditional distribution approach

- Model based approaches for data masking essentially attempt to model the data set by using an assumed f*(S,X) for the joint distribution of (S and X), derive f*(X|S), and generate the masked values from this distribution
- The masked data (S,Y) will have the joint distribution f*(S,X) rather than the true joint distribution f(S,X)
- If the data is generated using f*(X|S) then the masking procedure minimizes disclosure risk since f(X|Y,S,A) = f(X|S,A)

- Assume that we have one non-confidential variable S and one confidential variable X
- Y = (a × S) + e
- (where e is the noise term)

- We will always get better prediction if we attempt to predict X using S rather than Y (since Y is noisier than S)
- Since we have access to both S and Y, and since S would always provide more information about X than Y, an intelligent intruder will always prefer to use S to predict X than Y
- More importantly, since Y is a function of S and random noise, once S is used to predict X, including Y will not improve your predictive ability

- Methods that we have developed and I will be talking about
- General additive data perturbation
- Copula based perturbation
- Data shuffling

- Other Methods
- PRAM
- Multiple imputation
- Skew t perturbation

- A linear model based approach. Can maintain the mean vector and covariance matrix of the masked data to be exactly the same as the original data
- The same as sufficiency based noise addition with proximity parameter = 0

- Ensures that the results of all traditional, parametric statistical analyses using the masked data are exactly the same as that using the original data
- Ensures that the release of the masked microdata results in no incremental disclosure

- From original data estimate the linear regression model X = β0 + β1S + ε. Let b0 and b1 represent the estimates of β0 and β1 and let Σee represent estimate of the covariance of the noise term ε.
- Generate a set of noise terms e with mean vector 0 and covariance matrix (exactly equal to) Σee and also orthogonal to both X and S. Distribution of e is immaterial although typically MV normal.
- Generate yi = b0 + b1Si + ei (i = 1 , 2, …, N)
- The mean vector and covariance matrix of (S,Y) is exactly the same as (S,X)
- In the original GADP, these measures were maintained only asymptotically. Burridge (2003) suggested the methodology for maintaining these exactly. We modified this further to ensure minimum disclosure risk (Muralidhar and Sarathy 2005).
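The three steps above can be sketched for the simplest case of one non-confidential variable S and one confidential variable X. Making the noise exactly orthogonal by projection and rescaling is an assumption of this sketch about how "exactly equal" can be achieved; the function names are hypothetical and this is not the authors' code.

```python
import random

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def residuals(v, s):
    """Residuals from the least-squares regression of v on s (with intercept)."""
    vc, sc = center(v), center(s)
    b1 = dot(vc, sc) / dot(sc, sc)
    return [a - b1 * b for a, b in zip(vc, sc)]

def gadp(x, s, rng):
    n = len(x)
    sc, xc = center(s), center(x)
    # Step 1: estimate X = b0 + b1*S + error from the original data.
    b1 = dot(xc, sc) / dot(sc, sc)
    b0 = sum(x) / n - b1 * sum(s) / n
    res = residuals(x, s)
    res_var = dot(res, res) / n
    # Step 2: noise with mean 0, variance exactly res_var, orthogonal to S and X.
    e = center([rng.gauss(0, 1) for _ in range(n)])
    for basis in (sc, res):
        k = dot(e, basis) / dot(basis, basis)
        e = [a - k * b for a, b in zip(e, basis)]
    scale = (res_var * n / dot(e, e)) ** 0.5
    # Step 3: generate the masked values.
    return [b0 + b1 * b + scale * c for b, c in zip(s, e)]

rng = random.Random(5)
s = [rng.gauss(50, 10) for _ in range(1000)]   # simulated non-confidential variable
x = [3 * b + rng.gauss(0, 25) for b in s]      # simulated confidential variable
y = gadp(x, s, rng)

# Once S is accounted for, what remains of Y is uncorrelated with what remains of X.
partial_cov = dot(residuals(y, s), residuals(x, s)) / len(x)
```

A partial covariance of zero means that adding Y to a linear prediction of X from S does not improve it, which is the linear version of the no-incremental-disclosure property.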

- GADP results in minimizing disclosure risk. We can show that an intruder would get the “best estimate” of the confidential values using just the non-confidential variables. The masked variables provide no additional information.

- Even if you had, say, 90% of the entire data set, you would not be able to predict the value of the confidential variables for the remaining 10% with any greater accuracy than you would using only the non-confidential data
- Even if you had 100% of all confidential variables except one AND 90% of the values for the last confidential variable, you would not be able to predict the confidential value of the remaining records with any greater accuracy than you would using only the non-confidential variables.

- By maintaining the mean vector and covariance matrix of the two data sets to be exactly the same, for any statistical analysis for which the mean vector and covariance matrix are sufficient statistics, we ensure that the parameter estimates using the masked data will be exactly the same as the original data

- Unfortunately, the marginal distribution of the original data set is altered significantly. In most situations, the marginal distribution of the masked variable bears little or no relationship to the original variable
- The data also could have negative values when the original variable had only positive values

Negative values that did not exist in the original data

- The change in the marginal distribution means that other analyses pertaining to the distribution of the confidential variables are not maintained
- Residual analysis from regression would be very different

- Since a linear model is used, any non-linear relationships that may have been present in the data are modified (linearized)

- GADP is useful in a limited context. If the confidential variables do not exhibit significant deviations from normality, then GADP would represent a good solution to the problem
- In other cases, GADP represents a limited solution to the specific users who will use the data mainly for traditional statistical analysis

- We would like the masking procedure to provide some additional benefits (while still minimizing disclosure risk)
- Maintain the marginal distribution
- Maintain non-linear relationships

- To do this, we need to move beyond linear models
- Multiplicative models are not very useful since, in essence, they are just variations of the linear model

- In statistics, copulas have traditionally been used to model the joint distribution of a set of variables with arbitrary marginal distributions and specified dependence characteristics
- Copulas provide the ability to maintain the marginal, non-normal distribution of the original attributes to be the same after masking and to preserve certain types of dependence between the attributes

- C-GADP minimizes disclosure risk
- C-GADP provides the following information loss characteristics
- The marginal distribution of the confidential variables is maintained
- All monotonic relationships are preserved
- Rank order correlation
- Product moment correlation

- Non-monotonic relationships will be modified

- Consider a situation where we have a confidential variable X and a set of non-confidential variables S. If we assume that the MV Copula is appropriate for modeling the data, then the perturbed data Y can be viewed as an independent realization from f(X|S). The marginal of Y is simply a different realization from the same marginal as X. This being the case, reverse map the original values of X in place of the masked values Y. Now the “values” of Y are the same as those of X, but they have been “shuffled”.

- In the above, we use the multivariate normal copula to generate YP.
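A minimal sketch of the shuffling idea for one confidential variable X and one non-confidential variable S. It assumes a bivariate normal copula fitted through the correlation of rank-based normal scores, which simplifies the published procedure; the function names are hypothetical.

```python
import random
from statistics import NormalDist

def ranks(v):
    """0-based rank of each element (ties broken by position)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def normal_scores(v):
    """Map ranks to standard normal quantiles."""
    n = len(v)
    inv = NormalDist().inv_cdf
    return [inv((r + 1) / (n + 1)) for r in ranks(v)]

def rank_correlation(a, b):
    """Rank order correlation: product moment correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    m = (len(a) - 1) / 2
    sab = sum((p - m) * (q - m) for p, q in zip(ra, rb))
    saa = sum((p - m) ** 2 for p in ra)
    return sab / saa   # both rank vectors have the same sum of squares

def shuffle_mask(x, s, rng):
    zx, zs = normal_scores(x), normal_scores(s)
    rho = sum(a * b for a, b in zip(zx, zs)) / sum(b * b for b in zs)
    sd = max(0.0, 1 - rho * rho) ** 0.5
    # Perturbed normal scores drawn conditionally on S only, never on X.
    y_star = [rho * b + sd * rng.gauss(0, 1) for b in zs]
    # Reverse map: the record with the j-th smallest y* receives the j-th smallest x.
    sorted_x = sorted(x)
    return [sorted_x[r] for r in ranks(y_star)]

rng = random.Random(6)
s = [rng.gauss(0, 1) for _ in range(5000)]
x = [2.718281828 ** (b + 0.7 * rng.gauss(0, 1)) for b in s]   # skewed, monotone in s
masked = shuffle_mask(x, s, rng)
```

The masked column is a permutation of the original values, so the marginal distribution is identical, and the rank order correlation with S should stay close to its original value for a data set of this size.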

- Offers all the same benefits as C-GADP
- Minimum disclosure risk
- Information loss
- Maintains the marginal distribution
- Maintains all monotonic relationships

- Additional benefits
- There is no “modification” of the values. The original values are used
- The marginal distribution of the masked data is exactly the same as the original data
- Implementation can be performed using only the ranks

- Some shuffled values are far apart, others are closer
- Impossible to predict original position after the fact which assures low disclosure risk

- Rank order correlation pre and post masking are very close. Improves with the size of the data set
- X is less correlated with Y and more correlated with S

- Data shuffling is a hybrid (perturbation and swapping), non-parametric (can be implemented only with rank information) technique for data masking that minimizes disclosure risk and offers the lowest level of information loss among existing methods of data masking
- Will not maintain non-monotonic relationships
- Does not preserve tail dependence
- Can be overcome by using t-copula instead of normal copula

- Data shuffling can be implemented easily even for relatively large data sets. We are in the process of developing two versions of software based on Data shuffling
- Java based for large applications
- Excel based for smaller applications

- Investigate other methods for modeling the joint distribution of the variables to reduce information loss further.
- Other copula functions?
- Some other approach?

- Investigate non-statistical approaches for producing a masked data set that closely resembles the original data (while minimizing disclosure risk)
- Masking methods for discrete numerical data

- Dalenius, T., “Towards a methodology for statistical disclosure control,” Statistisk Tidskrift, 5, 429–444, 1977.
- Fuller, W. A., “Masking procedures for microdata disclosure limitation,” Journal of Official Statistics, 9, 383–406, 1993.
- Rubin, D. B., “Discussion of statistical disclosure limitation,” Journal of Official Statistics, 9, 461–468, 1993.
- Moore, R. A., “Controlled data swapping for masking public use microdata sets,” Research report series no. RR96/04, U.S. Census Bureau, Statistical Research Division, Washington, D.C., 1996.
- Burridge, J., “Information preserving statistical obfuscation,” Statistics and Computing, 13, 321–327, 2003.
- Domingo-Ferrer, J. and J.M. Mateo-Sanz, “Practical data-oriented microaggregation for statistical disclosure control,” IEEE Transactions on Knowledge and Data Engineering, 14, 189-201, 2002.

- Muralidhar, K. and R. Sarathy, "Generating Sufficiency-based Non-Synthetic Perturbed Data," Transactions on Data Privacy, 1(1), 17-33, 2008.
- Muralidhar, K. and R. Sarathy, "Data Shuffling- A New Masking Approach for Numerical Data," Management Science, 52(5), 658-670, 2006.
- Muralidhar, K. and R. Sarathy, “A Comparison of Multiple Imputation and Data Perturbation for Masking Numerical Variables,” Journal of Official Statistics, 22(3), 507-524, 2006.
- Muralidhar, K. and R. Sarathy, "A Theoretical Basis for Perturbation Methods," Statistics and Computing, 13(4), 329-335, 2003.
- Sarathy, R., K. Muralidhar, and R. Parsa, "Perturbing Non-Normal Confidential Attributes: The Copula Approach," Management Science, 48(12), 1613-1627, 2002.
- Muralidhar, K., R. Parsa, and R. Sarathy, "A General Additive Data Perturbation Method for Database Security," Management Science, 45(10), 1399-1415, 1999.
- Muralidhar, K., D. Batra, and P. Kirs, “Accessibility, Security, and Accuracy in Statistical Databases: The Case for the Multiplicative Fixed Data Perturbation Approach,” Management Science, 41(9), 1549-1564,1995.

- Assessing disclosure risk
- Muralidhar, K. and R. Sarathy, "Security of Random Data Perturbation Methods," ACM Transactions on Database Systems, 24(4), 487-493, 1999.
- Sarathy, R. and K. Muralidhar, "The Security of Confidential Numerical Data in Databases," Information Systems Research, 13(4), 389-403, 2002.
- Li, H., K. Muralidhar, and R. Sarathy, “Assessment of Disclosure Risk when using Confidentiality via Camouflage,” Operations Research, 55(6), 1178-1182, 2007.

- Framework for evaluating masking techniques
- Muralidhar, K. and R. Sarathy, “A Theoretical Comparison of Data Masking Techniques for Numerical Microdata,” to be presented at the 3rd IAB Workshop on Confidentiality and Disclosure - SDC for Microdata, Nuremberg, Germany, 2008

- You can find many of our papers and presentations at our web site:
http://gatton.uky.edu/faculty/muralidhar/maskingpapers/

- I will be happy to share any papers or presentations that are not available on the web site.

- There are a host of techniques that are available for masking numerical data. These techniques have a long history in the statistical disclosure limitation literature. There is considerable overlap between the data masking research in the statistical disclosure limitation research community and the privacy preserving data mining research in the CS community. Unfortunately, there seems to be only a limited cooperation between the researchers in the two fields. I believe that each field can make a significant contribution to the other. I hope that this presentation contributes to enhancing the discussion between CS and SDL researchers … at least at UK.

Questions, Suggestions or Comments?

Thank you