A Primer on Data Masking Techniques for Numerical Data

Krish Muralidhar, Gatton College of Business & Economics

My Co-author • I would first like to acknowledge that most of my work in this area is with my co-author Dr. Rathindra Sarathy at Oklahoma State University

Introduction • Data masking covers techniques for “masking” data sets that contain sensitive (confidential) information. The masked data retains its usefulness without compromising privacy and/or confidentiality, so it can be analyzed, shared, or disseminated without risk of disclosure.

A Simple Example • [Tables: Original Data and Masked Data]

Objectives of Data Masking • Minimize risk of disclosure resulting from providing access to the data • Maximize the analytical usefulness of the data

What this talk is not about … • We are talking about protecting data that is made available to users, shared with others, or disseminated to the general public • We are not dealing with unauthorized access to the data • Encryption is not a solution • We cannot perform analysis on encrypted data • There are a few exceptions • To perform analysis on the data, it must be decrypted • Decrypted data offers no protection

What this talk is not about … • Since we have the data set, we know the characteristics of the data set. We are trying to create a new data set that essentially contains the same characteristics as the original data set. We are not trying to discern the characteristics of the original data set using the information in the masked data. • In other words, I will not be talking about Agrawal and Srikant • Or about the “distributed data” situation • In addition, since most of you are probably familiar with the CS literature in this area, I will focus on the literature in the “statistical disclosure limitation” area

Purpose of Dissemination • It is assumed that the data will be used primarily for analysis at the aggregate level using statistical or other analytical techniques • The data will not be accurate at the individual record level

Aggregate versus Micro Data • The organization that owns the data could potentially release aggregate information about the characteristics of the data set • Users can still perform some types of analysis using the aggregate data, but this limits their ability to perform ad hoc analysis • Releasing the microdata provides the users with the flexibility to perform any type of analysis • In this talk, we assume that the intent is to release microdata

Other Protection Measures • Restricted access • Query restrictions • Other methods

The Data • Typically, the data is historical and consists of • Categorical variables (or attributes) • Numerical variables • Discrete variables • Continuous variables • In cases where identity is not to be revealed, key identification variables will be removed from the data set (de-identified)

De-identification does not necessarily prevent Re-identification • The common misconception is that, in order to prevent disclosure, all that is required is to remove “key identifiers”. However, even if the “key identifiers” are removed, in many cases it would be easy to identify an individual using external data sources • Latanya Sweeney’s work on k-anonymity • The availability of numerical data makes it easy to re-identify records through record linkage

Data Masking • Since de-identification alone does not prevent disclosure, it is necessary to “mask” the original data so that an intruder, even using external sources of data, cannot • Identify a particular released record as belonging to a particular individual • Estimate the value of a confidential variable for a particular record accurately

Who is an intruder? • Every user is potentially an intruder • Since microdata is released, we cannot prevent the user from performing any type of analysis on the released data • Must account for disclosure risk from any and all types of analyses • Worst case scenario

The focus of our research • Data masking techniques are used to mask all types of data (categorical, discrete numerical, and continuous numerical) • The focus of our research, and of this talk, is data masking for continuous numerical data

The Data Release Process • Identify the data set to be released and the sensitive variables in the data • Release all aggregate information regarding the data • Characteristics of individual variables • Relationship measures • Any other relevant information • Release non-sensitive data • Since my focus is on numerical microdata, I will assume that all categorical and discrete data are either released unmasked or are masked prior to release • Release masked numerical microdata

Characteristics of a good masking technique • Minimize disclosure risk (or maximize security) • Minimize information loss (or maximize data utility) • Other characteristics • Must be easy to use • The user must be able to analyze the masked data exactly as he/she would the original data • Must be easy to implement

Disclosure Risk • Dalenius defines disclosure as having occurred if, using the released data, an intruder is able to identify an individual or estimate the value of a confidential variable with a greater level of accuracy than was possible prior to such data release

Minimum Disclosure Risk • A data masking technique minimizes disclosure risk if and only if the release of the masked microdata does not allow an intruder to gain additional information about an individual record over and above what was already available (from the release of aggregate information, the non-confidential variables, and the masked categorical variables) • This does not mean that the disclosure risk from the entire data release process is minimum; only that the disclosure risk from releasing microdata is minimized • This can be achieved in practice

Practical Measure of Disclosure Risk • Identity Disclosure • Re-identification rate • Value disclosure • Variability in the confidential attribute explained by the masked data
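The two measures named on this slide can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: a single confidential variable, nearest-neighbour matching as a simple stand-in for record linkage, and the squared correlation as the "variability explained". The function names and the simulated data are hypothetical and do not come from the talk.

```python
import random

def reidentification_rate(original, masked):
    """Share of records whose nearest masked value is their own masked record."""
    hits = 0
    for i, x in enumerate(original):
        nearest = min(range(len(masked)), key=lambda j: abs(masked[j] - x))
        hits += (nearest == i)
    return hits / len(original)

def r_squared(original, masked):
    """Squared Pearson correlation: share of the variability in the
    original values explained by a linear fit on the masked values."""
    n = len(original)
    mx, my = sum(original) / n, sum(masked) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(original, masked))
    sxx = sum((x - mx) ** 2 for x in original)
    syy = sum((y - my) ** 2 for y in masked)
    return sxy * sxy / (sxx * syy)

rng = random.Random(2)
# Hypothetical confidential variable and two masked versions of it.
original = [rng.gauss(100.0, 15.0) for _ in range(200)]
light = [x + rng.gauss(0.0, 0.5) for x in original]    # very little noise
heavy = [x + rng.gauss(0.0, 10.0) for x in original]   # much more noise

for name, masked in (("light", light), ("heavy", heavy)):
    print(name,
          "re-identification rate:", round(reidentification_rate(original, masked), 3),
          "R^2:", round(r_squared(original, masked), 3))
```

Higher values of either measure indicate higher disclosure risk. A real assessment would link on several variables at once and would also use the external data an intruder might hold.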

Minimum Information Loss • Information loss is minimized if and only if, for any arbitrary analysis (or query), the response from the masked data is exactly the same as that from the original data • This is impossible to achieve in practice • Since an arbitrary analysis may involve a single record, the only way to achieve this objective is to release unmasked data

Information Loss … continued • In practice, we attempt to minimize information loss by keeping the characteristics of the masked data the same as those of the original data • From a statistical perspective, we attempt to keep the masked data “similar to” the original data so that responses to analyses using the masked data will be approximately the same as those using the original data • Maintain the distribution of the masked data to be the same as that of the original data

Some Practical Measures of Information Loss • Ability to maintain • The marginal distribution • Relationships between variables • Linear • Monotonic • Non-monotonic

Simple Masking Approaches • Noise addition • Micro-aggregation • Data swapping • Other similar approaches • Any approach in which the masked value yij (the masked value for the jth variable of the ith record) is generated as a function of xi.

An Illustrative Example • A data set consisting of 2 categorical variables, 1 discrete, and 3 (confidential) numerical variables • 50000 records

Marginal Distribution • Home value and Mortgage balance have heavily skewed distributions

Relationships • Relationships are not necessarily linear • Measured by both product moment and rank order correlation
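The difference between the two relationship measures can be illustrated with a short sketch. It assumes continuous data with no ties, and the helper names and the simulated variables are hypothetical. A monotonic but non-linear relationship yields a rank order correlation of 1 while the product moment correlation stays well below 1.

```python
import math
import random

def pearson(a, b):
    """Product moment correlation."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

def ranks(v):
    """Ranks 1..n by value; ties are not averaged (assumes continuous data)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(a, b):
    """Rank order correlation: product moment correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

rng = random.Random(1)
x = [rng.uniform(0.0, 5.0) for _ in range(2000)]
y = [math.exp(v) for v in x]   # monotonic, strongly non-linear in x

print("product moment:", round(pearson(x, y), 3))
print("rank order:    ", round(spearman(x, y), 3))
```

Comparing both measures before and after masking shows whether a technique preserves only linear relationships or monotonic ones as well.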

Relationship Measures

Simple Noise Addition • The most rudimentary method of data masking. Random noise is added to every confidential value: • yi = xi + ei • Typically e ~ Normal(0, d*Var(Xi)) • The selection of d specifies the level of noise. A larger d means a higher level of masking • The variance of the masked data is changed, resulting in biased estimates • Many variations exist
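A minimal sketch of the procedure on this slide, assuming a single confidential variable. The lognormal data, the function name `add_noise`, and the seed are illustrative assumptions, not part of the talk.

```python
import random
import statistics

def add_noise(values, d, rng):
    """Mask values as y_i = x_i + e_i with e ~ Normal(0, d * Var(X))."""
    noise_sd = (d * statistics.pvariance(values)) ** 0.5
    return [x + rng.gauss(0.0, noise_sd) for x in values]

rng = random.Random(0)
# Hypothetical skewed, strictly positive variable (lognormal), 10,000 records.
original = [rng.lognormvariate(11.0, 0.8) for _ in range(10000)]
masked = add_noise(original, d=0.10, rng=rng)

# The masked variance is inflated by a factor of roughly (1 + d).
var_ratio = statistics.pvariance(masked) / statistics.pvariance(original)
print(f"variance ratio (masked / original): {var_ratio:.3f}")
```

The variance ratio makes the bias visible: any statistic that depends on the variance of the masked variable is off by about the factor 1 + d unless it is corrected.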

Problems with Noise Addition • The addition of noise results in an increase in variance • This can be addressed easily, but there are other issues that cannot be, such as • The marginal distribution is modified • All Relationships are attenuated
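The variance increase and the attenuation of linear relationships can be worked out directly. The derivation below is a sketch that assumes the noise terms are independent of the data and of each other, and that the same noise level d is used for both variables.

```latex
\begin{align*}
\operatorname{Var}(Y_j) &= \operatorname{Var}(X_j) + \operatorname{Var}(e_j)
  = (1 + d)\,\operatorname{Var}(X_j) \\
\operatorname{Cov}(Y_1, Y_2) &= \operatorname{Cov}(X_1 + e_1,\; X_2 + e_2)
  = \operatorname{Cov}(X_1, X_2) \\
\rho_{Y_1 Y_2} &= \frac{\operatorname{Cov}(X_1, X_2)}
  {(1 + d)\sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}}
  = \frac{\rho_{X_1 X_2}}{1 + d}
\end{align*}
```

Under these assumptions, d = 0.10 shrinks every product moment correlation by a factor of 1/1.1 ≈ 0.91, and d = 0.50 shrinks it by 1/1.5 ≈ 0.67.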

Results for Noise Addition (Noise level = 10%) • [Figure: Mortgage Balance versus Asset Balance (Noise Added)]

Relationship – Product Moment

Looks good … • Everything looks good • Bias is small • Relationships seem to be maintained • So what is the problem? • The problem is security • Since very little noise is added, there is very little protection afforded to the records

High Disclosure Risk • The correlation between the original and masked values is of the order of 0.99 • The masked values themselves are excellent predictors of the original values • Little or no “masking” is involved

Improved Predictive Ability

Disclosure Risk versus Information Loss • Adding very little noise (10% of the variance of the individual variable) results in low information loss, but also results in high disclosure risk • In order to decrease disclosure risk, it would be necessary to increase the noise (say 50% of the variance), but that would result in higher information loss

Results for Noise Addition (Noise level = 50%) • At first glance, it does not seem too bad, but on closer observation, we notice that there are lots of negative values that did not exist in the original data • Negative values can be addressed • [Figure: Mortgage Balance versus Asset Balance (Noise Added)]

Correlation • There is a considerable difference between the original and masked data • The correlations are considerably lower

Marginal Distribution of Home Value • The marginal distribution is completely modified • This is an unavoidable consequence of any noise “addition” procedure

Summary • In summary, noise addition is a rudimentary procedure that is easy to implement and easy to explain. There is always a trade-off between disclosure risk and information loss. If the disclosure risk is low (high) then the corresponding information loss is high (low). • Unfortunately, this is an inherent characteristic of all noise based methods of the form Y = f(X,e) whether the noise is additive or multiplicative or some other form

Sufficiency based Noise Addition • Recently, we have developed a new technique that is similar to noise addition, but maintains the mean vector and covariance matrix of the masked data to be the same as the original data • Offers the same characteristics as noise addition, but assures that results for traditional statistical analyses using the masked data will be the same as the original data

Sufficiency Based Noise Addition • Model: yi = γ + αxi + βsi + εi • The only parameter that must be selected is the “proximity parameter” α. All other parameters are dictated by the selection of this parameter

The Proximity Parameter • The parameter α (0 ≤ α ≤ 1) dictates the strength of the relationship between X and Y. • When α = 1, Y = X. • When α = 0, the perturbed variable is generated independently of X (the GADP model to be discussed later) • We provide the ability to specify α to achieve any degree of proximity between these two extremes

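Other Model Parameters • γ = (1 – α)μX – βμS, where μX and μS are the means of X and S • β = (1 – α)(σXS/σ²SS), where σXS is the covariance of X and S and σ²SS is the variance of S • ε ~ Normal(0, (1 – α²)(σ²XX – (σXS)²/σ²SS)), where σ²XX is the variance of X • ε can also be generated from other distributions • ε is orthogonal to X and S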
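For a single confidential variable X and a single variable S, these parameter values follow from requiring Y to reproduce the mean and variance of X and its covariance with S. The derivation below is a sketch for that scalar case only. The symbols for the means and variances are notation assumed here.

```latex
\begin{align*}
Y &= \gamma + \alpha X + \beta S + \varepsilon,
  \qquad \operatorname{Cov}(\varepsilon, X) = \operatorname{Cov}(\varepsilon, S) = 0 \\[4pt]
\text{(mean)}\quad \mu_X &= \gamma + \alpha\mu_X + \beta\mu_S
  \;\Rightarrow\; \gamma = (1-\alpha)\mu_X - \beta\mu_S \\
\text{(covariance with } S\text{)}\quad \sigma_{XS} &= \alpha\sigma_{XS} + \beta\sigma_S^2
  \;\Rightarrow\; \beta = (1-\alpha)\,\frac{\sigma_{XS}}{\sigma_S^2} \\
\text{(variance)}\quad \sigma_X^2 &= \alpha^2\sigma_X^2 + 2\alpha\beta\sigma_{XS}
  + \beta^2\sigma_S^2 + \sigma_\varepsilon^2
  \;\Rightarrow\; \sigma_\varepsilon^2
  = (1-\alpha^2)\left(\sigma_X^2 - \frac{\sigma_{XS}^2}{\sigma_S^2}\right)
\end{align*}
```

The last step uses 2α(1 − α) + (1 − α)² = 1 − α², so the two terms involving β combine into (1 − α²)σ²XS/σ²S.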

Note that … • In order to maintain sufficient statistics, it is NECESSARY that the model for generating the perturbed values be specified in this manner

Disclosure Risk • There is a direct correspondence between the proximity parameter α and the level of noise added in the simple noise addition approach. This procedure will result in incremental disclosure risk except when α = 0 • The level of noise added is approximately equal to (1 – α²)

Information Loss • The information loss characteristics of the sufficiency based approach are exactly the same as those of the simple noise addition approach, with one major difference. Results of statistical analyses for which the mean vector and covariance matrix are sufficient statistics will be exactly the same using the masked data as they are using the original data.
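The exact preservation of the mean and covariances can be checked numerically. The sketch below is not the authors' implementation. It assumes one confidential variable x and one further variable s (taken here to be a non-confidential variable), chooses γ, β and the noise variance so that the moments match, and makes the noise exactly orthogonal to x and s in the sample by removing its projections. All function names and the simulated data are hypothetical.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Population covariance; cov(a, a) is the variance."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

def residualize(target, basis):
    """Centre target, then remove its projection on each basis vector.
    The basis vectors must be centred and mutually orthogonal."""
    m = mean(target)
    out = [t - m for t in target]
    for b in basis:
        coef = sum(o * q for o, q in zip(out, b)) / sum(q * q for q in b)
        out = [o - coef * q for o, q in zip(out, b)]
    return out

def sufficiency_mask(x, s, alpha, rng):
    """y = gamma + alpha*x + beta*s + e, with e orthogonal to x and s."""
    beta = (1 - alpha) * cov(x, s) / cov(s, s)
    gamma = (1 - alpha) * mean(x) - beta * mean(s)
    target_var = (1 - alpha ** 2) * (cov(x, x) - cov(x, s) ** 2 / cov(s, s))
    xc = residualize(x, [])      # centred x
    sc = residualize(s, [xc])    # part of s orthogonal to x
    e = residualize([rng.gauss(0.0, 1.0) for _ in x], [xc, sc])
    scale = (target_var / cov(e, e)) ** 0.5
    return [gamma + alpha * xi + beta * si + scale * ei
            for xi, si, ei in zip(x, s, e)]

rng = random.Random(3)
# Hypothetical data: s is released as is, x is confidential.
s = [rng.gauss(50.0, 10.0) for _ in range(2000)]
x = [100.0 + 2.0 * si + rng.gauss(0.0, 15.0) for si in s]
y = sufficiency_mask(x, s, alpha=0.5, rng=rng)

print("mean      x, y:", round(mean(x), 6), round(mean(y), 6))
print("variance  x, y:", round(cov(x, x), 6), round(cov(y, y), 6))
print("cov with s    :", round(cov(x, s), 6), round(cov(y, s), 6))
```

Because the mean, the variance and the covariance with s agree up to rounding, a regression involving these variables gives the same estimates on the masked data as on the original, even though the individual records differ.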

Results of Regression to predict Net Assets using all other variables

Simple versus Sufficiency Based Noise Addition • If noise addition will be used to mask the data, we should always use sufficiency based noise addition (and never simple noise addition). It provides all the same characteristics of simple noise addition with one major advantage that, for many traditional statistical analyses, it provides the guarantee that the masked data will yield the same results as the original data.

Microaggregation • Replace the values of the variables for a set of k records in close proximity with the average value of those k records • Many different methods of determining close proximity • Univariate microaggregation, where each variable is aggregated individually • Multivariate microaggregation, where the values of all the confidential variables for a given set of records are aggregated • Results in variance reduction and attenuation of covariance • All relationships are modified … some correlations are higher, others are lower • Poor security even for relatively large k • Consistent with the idea of “k anonymity” since at least k records in the data set will have the same values
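Univariate microaggregation can be sketched as follows. The sketch assumes that "close proximity" means adjacency in sorted order and that a short final group is folded into the preceding one. Both are illustrative choices among the many methods the slide mentions, and the function name and data are hypothetical.

```python
import random
import statistics
from collections import Counter

def microaggregate(values, k):
    """Replace each value by the mean of its group of at least k
    neighbours in sorted order (requires len(values) >= k)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start + k
        if len(order) - end < k:
            end = len(order)   # fold a short tail into the last group
        group = order[start:end]
        avg = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = avg
        start = end
    return masked

rng = random.Random(4)
# Hypothetical skewed confidential variable.
original = [rng.lognormvariate(11.0, 0.8) for _ in range(1003)]
masked = microaggregate(original, k=5)

print("mean     :", round(statistics.mean(original), 2), round(statistics.mean(masked), 2))
print("variance :", round(statistics.pvariance(original), 2), round(statistics.pvariance(masked), 2))
print("smallest group size:", min(Counter(masked).values()))
```

The mean is preserved, the variance of the masked values is lower than that of the original, and every masked value is shared by at least k records.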
