Presentation Transcript


  1. Personal identity protection solutions in the presence of low copy number fields EUCCONET Record Linkage Workshop 15th to 17th June 2011 Dr Kerina Jones HIRU, Swansea University

  2. Overview • Nature of the issue – when anonymisation (de-identification) may not be enough to protect privacy • Risks associated with low copy number fields • Origins and descriptions of the kinds of solutions in use • Utility versus privacy • Current viewpoints

  3. Anonymised (de-identified) data • Linked-anonymised: separation of the common identifiers from the clinical data, with capability to re-join via a key • Pseudonymised and anonymously-linked data: replacement of common identifiers with a unique anonymous identifier and measures to prevent capability to re-join • Unlinked-anonymised data: permanent removal of common identifiers and no capability to re-join [Slide graphic: a 'risk of re-identification' scale alongside these categories]

  4. Nature of the issue • Not focussed on - • Using the linkage key to the demographics • Cracking the anonymisation or encryption codes • How robust the anonymisation is • It’s about - • Risk of re-identification of individuals in an anonymised dataset due to their records being present as unique combinations of variables or in low copy number • Can be accidental or intentional

  5. Nature of the issue

  6. Risk issues • Presence of unusual variables • Rare conditions • Extremes of age • Multiple births – triplets, etc. • Large families • Minority groups • Unusual combinations • Risk increases with the number of variables in the dataset

  7. Risk areas • Authorised: • Data access and analysis • Data sharing between individuals/organisations • Release of results • Data publication • Unauthorised – for example: • Security breach by intruder – intentional • Loss of data, release of wrong data – accidental

  8. Origins • Privacy preservation in anonymised datasets • 3 main origins: • Database community – database management • Cryptographers – cryptographic protocols • Statistical disclosure control (SDC) – national statistics • Variety of techniques • Often parallel developments – similar in outcome • Fusion of ideas • ‘Re-identification science’

  9. Why is it needed? • Reasons for privacy preservation in anonymised datasets • It has been demonstrated to be relatively easy to re-identify individuals from some anonymous datasets • 87% of people in the US have a unique combination of ZIP code, birth date and gender • Netflix – re-identification of customers from anonymised movie-rating data • AOL – re-identification from released 'anonymous' internet search logs • Sweeney – re-identification of a US Governor after he had stated his confidence in a de-identified health dataset • Clearly, removal of commonly-recognised identifiers is not enough to prevent re-identification

  10. Why is it needed? • Linkage attack: using publicly-available information in combination with an anonymised dataset to attempt to re-identify individuals • Prosecutor risk – the re-identification of a given individual • Journalist risk – the re-identification of any individual • Marketer risk – the re-identification of as many individuals as possible • Purposeful attempts to prove it can be done • Risks to data linkage units: legislation, litigation, cost and reputation
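
These three risk notions are often made concrete in terms of equivalence-class sizes on the quasi-identifiers. The sketch below is one common formalisation, not taken from the slides: prosecutor-type risk as one over the smallest class size, and marketer-type risk as the average of one over each record's class size; the field names and toy data are hypothetical.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return [sizes[k] for k in keys]  # class size for every record

def reidentification_risks(records, quasi_identifiers):
    """Illustrative dataset-level risk measures based on equivalence-class sizes.

    Prosecutor-style risk: 1 / smallest class size (worst case for a targeted individual).
    Marketer-style risk:  average of 1 / class size (expected fraction re-identified).
    Journalist-style risk would additionally need population class sizes,
    which are assumed unavailable here.
    """
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return {
        "max_prosecutor_risk": 1 / min(sizes),
        "marketer_risk": sum(1 / s for s in sizes) / len(sizes),
    }

# Toy example with hypothetical quasi-identifiers
people = [
    {"age_band": "20-29", "sex": "F", "postcode_area": "SA1"},
    {"age_band": "20-29", "sex": "F", "postcode_area": "SA1"},
    {"age_band": "90+",   "sex": "M", "postcode_area": "SA1"},  # unique record: risk 1.0
]
print(reidentification_risks(people, ["age_band", "sex", "postcode_area"]))
```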

  11. Some definitions • Suppression: removal of certain variables or records from a dataset • Aggregation/Generalisation: grouping data items (such as ages) into bands • Encryption: transforming via an algorithm • Masking: obscuring values – functionally similar to original • Perturbation: introducing noise into a dataset • Data swapping: exchange of values between records • Synthetic data: generated to retain certain statistics
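
A minimal sketch of two of these operations, aggregation/generalisation and suppression, applied to a toy record; the field names and the ten-year band width are assumptions for illustration only.

```python
def generalise_age(age, band_width=10):
    """Aggregation/generalisation: replace an exact age with a band such as '20-29'."""
    lower = (age // band_width) * band_width
    return f"{lower}-{lower + band_width - 1}"

def suppress(record, fields_to_drop):
    """Suppression: remove selected variables from a record entirely."""
    return {k: v for k, v in record.items() if k not in fields_to_drop}

record = {"age": 27, "postcode": "SA2 8PP", "diagnosis": "asthma"}
record["age"] = generalise_age(record["age"])   # exact age replaced by '20-29'
record = suppress(record, {"postcode"})         # postcode removed altogether
print(record)  # {'age': '20-29', 'diagnosis': 'asthma'}
```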

  12. Nature of the issue

  13. Privacy preservation methods • Vary with access model: • Restricted data • Altered data – important for researchers to know • Data views – data cannot be taken away • Meta-data • Results/test statistics only

  14. Privacy preservation methods • Methods to quantify anonymity in a de-identified dataset: • k-anonymisation • Early work by Samarati and Sweeney (1998) • k-anonymisation – in a k-anonymised dataset a given record cannot be distinguished from at least (k - 1) other records • For example in a dataset where k = 3, a given record will be identical to at least 2 other records • Minimum value of k is 2 • Higher values of k are considered less risky
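
A minimal sketch of how the k actually achieved by a dataset can be checked over a chosen set of quasi-identifiers; the column names and toy rows are hypothetical.

```python
from collections import Counter

def k_anonymity_level(records, quasi_identifiers):
    """Return the achieved k: the size of the smallest equivalence class
    formed by the quasi-identifier columns."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

rows = [
    {"age_band": "20-29", "sex": "F"},
    {"age_band": "20-29", "sex": "F"},
    {"age_band": "20-29", "sex": "F"},
    {"age_band": "30-39", "sex": "M"},
    {"age_band": "30-39", "sex": "M"},
]
print(k_anonymity_level(rows, ["age_band", "sex"]))  # 2: the dataset is 2-anonymous
```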

  15. Privacy preservation algorithms • Algorithms to achieve the desired level of k-anonymisation in microdata • Set a threshold for k depending on perception of risk • Ubiquitous rule or a rule of thumb • Specific to a particular dataset • Vary according to levels of trust • Differ depending on dataset destination • Linked to risk appetite of organisation/unit

  16. Argus algorithm

  17. Argus algorithm • User specification on level of generalisation • Criticisms – • Only checks for low copy numbers at 2 and 3 • There may be sensitive combinations at copy level 4 • Checking all combinations would be computationally challenging (1996) • Unable to offer solution quality guarantees

  18. Datafly algorithm

  19. Datafly algorithm • Not limited to equivalence classes of 2 or 3 • Criticisms – • Distortions and generalisations not necessarily k-minimal • Makes crude decisions on generalisation and suppression • Unable to offer solution quality guarantees

  20. MinGen algorithm

  21. MinGen algorithm • Designed to provide minimal distortion • Deliver maximum quality • Criticisms – • Impractical for large datasets • Inefficient

  22. Other k-anonymisation algorithms • Numerous other k-anonymisation algorithms exist • General pattern – develop, criticise, improve, etc. • Problem of non-global changes, local recodes • Different levels of generalisation on different variables • Loss of data due to suppression or over-generalisation • Introduction of other measures alongside k • l-diversity – designed to avoid inference of sensitive values • t-closeness – enhances l-diversity
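
For illustration, the sketch below checks the simplest ('distinct') form of l-diversity: every equivalence class on the quasi-identifiers must contain at least l different values of the sensitive attribute. The choice of the distinct variant (rather than entropy or recursive l-diversity) and the column names are assumptions.

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive, l=2):
    """Distinct l-diversity: each equivalence class holds >= l distinct sensitive values."""
    classes = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        classes[key].add(r[sensitive])
    return all(len(values) >= l for values in classes.values())

rows = [
    {"age_band": "20-29", "sex": "F", "diagnosis": "asthma"},
    {"age_band": "20-29", "sex": "F", "diagnosis": "diabetes"},
    {"age_band": "30-39", "sex": "M", "diagnosis": "asthma"},
    {"age_band": "30-39", "sex": "M", "diagnosis": "asthma"},  # 2-anonymous class, but not 2-diverse
]
print(is_l_diverse(rows, ["age_band", "sex"], "diagnosis", l=2))  # False
```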

  23. Globally optimal method

  24. Globally optimal method • Globally optimal k-anonymity method • Optimal Lattice Anonymisation (OLA) • Produces a lattice of solutions using different generalisation strategies • Comparison with other known k-anonymisation algorithms • Uses a set of metrics to measure information loss and for evaluation

  25. Metrics • Precision (Prec): a measure of the loss of precision due to generalisation • Discernibility Metric (DM): assigns a penalty to each record relative to the number of records identical to it • Non-uniform entropy: a measure to quantify differences in loss of information when generalising a given variable whose distribution differs between datasets • E.g. gender is 60M:40F in dataset1, 1M:99F in dataset2
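
As an illustration of the Discernibility Metric, the sketch below uses the common formulation in which each released record is penalised by the size of its equivalence class and each fully suppressed record by the size of the whole dataset; this formulation and the column names are assumptions, not details given on the slide.

```python
from collections import Counter

def discernibility_metric(records, quasi_identifiers, suppressed=0):
    """Sum, over all released records, of the size of each record's equivalence class,
    plus a penalty of |dataset| for every suppressed record."""
    total = len(records) + suppressed
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    penalty = sum(size * size for size in classes.values())  # each of `size` records costs `size`
    return penalty + suppressed * total

rows = [{"age_band": "20-29"}] * 3 + [{"age_band": "30-39"}] * 2
print(discernibility_metric(rows, ["age_band"]))  # 3*3 + 2*2 = 13
```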

  26. OLA illustration [Slide diagram: lattice of generalisation and suppression levels, with the k-minimal node marked]

  27. OLA method • k-minimal node is the lowest node on a given generalisation strategy that satisfies k-anonymisation • Generalisation strategy – the systematic approach taken to k-anonymise the dataset, e.g. successive banding on age until k is reached • If a node in a strategy is k-anonymous, all nodes above it in the same strategy will be k-anonymous • Therefore the nodes above, in the same strategy, are discarded as they have greater information loss

  28. How OLA operates • A threshold for k is set • A maximum level for suppression (MaxSup) is chosen • Find: a binary search identifies the k-anonymous nodes in the lattice • Discard: on each generalisation strategy, only the lowest-height k-anonymous node is retained; the nodes above it are discarded • Compare: all the retained k-anonymous nodes are compared in terms of information loss (using the metrics) and the optimal node is chosen
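
A much-simplified sketch of the find/discard/compare idea: for each generalisation strategy, scan upwards to the lowest node that is k-anonymous within the suppression budget and keep it as a candidate. The node encoding (age band width, postcode characters kept), the toy data and the linear scan in place of OLA's binary search over the lattice are all assumptions for illustration.

```python
from collections import Counter

def k_anonymous_with_suppression(records, generalise, node, k, max_sup):
    """Check whether generalising with `node` makes the data k-anonymous,
    allowing up to max_sup records (those left in small classes) to be suppressed."""
    classes = Counter(generalise(r, node) for r in records)
    violating = sum(size for size in classes.values() if size < k)
    return violating <= max_sup

def lowest_k_anonymous_per_strategy(records, generalise, strategies, k, max_sup):
    """For each generalisation strategy (nodes ordered from least to most generalised),
    keep only the lowest node that satisfies k-anonymity and discard the rest."""
    candidates = []
    for strategy in strategies:
        for node in strategy:  # scan upwards; the first hit is the k-minimal node
            if k_anonymous_with_suppression(records, generalise, node, k, max_sup):
                candidates.append(node)
                break
    return candidates  # compare these with information-loss metrics and pick the optimum

# Hypothetical node encoding: (age band width, postcode characters kept)
def generalise(record, node):
    band, chars = node
    return ((record["age"] // band) * band, record["postcode"][:chars])

rows = [{"age": a, "postcode": p} for a, p in
        [(23, "SA1"), (27, "SA1"), (31, "SA2"), (35, "SA2"), (36, "SA2")]]
strategies = [[(5, 3), (10, 3), (20, 3)],   # generalise age first
              [(5, 2), (5, 1), (10, 1)]]    # generalise postcode first
print(lowest_k_anonymous_per_strategy(rows, generalise, strategies, k=2, max_sup=0))
# [(10, 3), (10, 1)] – one k-minimal candidate per strategy
```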

  29. How OLA operates • Evaluation found it to be better in terms of efficiency and information loss than some other k-anonymisation algorithms • Limitations – • It is possible that an optimal solution will not be found • Works on principle that suppression is better than generalisation • Based on metrics that are monotonic with respect to generalisation strategies • Does it effectively solve attack strategies?

  30. Angles of attack • Linkage attack: using publicly-available information in combination with an anonymised dataset to attempt to re‑identify individuals • Homogeneity attack: can occur where the value for a sensitive attribute in an anonymised dataset is the same for a number of records • Can occur in equivalence classes • Disclosure by being in the class • Additional metrics to assess vulnerability

  31. Other types of algorithm • Other types of algorithm – SUDA and SUDA2 • Special Uniques Detection Algorithm • Office for National Statistics (ONS) in London and the Australian Bureau of Statistics • Set of algorithms and software system • Special unique – a record that is unique on a set of variables and also unique on a sub-set of those variables • A record which is unique on coarser-grained variables is more risky than one unique only on fine-grained variables

  32. SUDA and SUDA2 • Takes into account Minimal Sample Uniques (MSUs) – the size and number of variable sub-sets on which a record is unique and which contain no smaller unique sub-set • Uses this information to estimate the underlying risk • Recognises that some of the metrics used with other algorithms, such as measures of distance in a dataset, may not work for categorical variables • SUDA2 improves on SUDA methodologically to be more computationally efficient
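
A brute-force illustration of the Minimal Sample Uniques idea: for each record, find the variable sub-sets on which it is unique in the sample and which contain no smaller unique sub-set. This is only a sketch of the concept on hypothetical data; SUDA and SUDA2 use far more efficient search strategies.

```python
from collections import Counter
from itertools import combinations

def minimal_sample_uniques(records, variables, max_size=3):
    """For each record, list its Minimal Sample Uniques (MSUs): variable sub-sets on
    which the record is unique in the sample, with no smaller unique sub-set."""
    msus = {i: [] for i in range(len(records))}
    for size in range(1, max_size + 1):          # grow sub-set size, so minimality holds
        for subset in combinations(variables, size):
            counts = Counter(tuple(r[v] for v in subset) for r in records)
            for i, r in enumerate(records):
                if counts[tuple(r[v] for v in subset)] == 1:
                    # keep only if no previously found MSU is contained in this sub-set
                    if not any(set(m) <= set(subset) for m in msus[i]):
                        msus[i].append(subset)
    return msus

rows = [
    {"age_band": "20-29", "sex": "F", "area": "SA1"},
    {"age_band": "20-29", "sex": "F", "area": "SA2"},
    {"age_band": "90+",   "sex": "M", "area": "SA1"},
]
print(minimal_sample_uniques(rows, ["age_band", "sex", "area"]))
```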

  33. PARAT • Privacy Analytics Risk Assessment Tool (PARAT) • Electronic Health Information Laboratory – Ontario • http://www.privacyanalytics.ca/products/products.html • A Windows-based application compatible with a number of databases • Risk-based approach to de-identification • NB – will only be applicable for certain data linkage unit models

  34. Using PARAT • Select variables at risk of re-identification • These can be ranked for importance • Ranking used in de-identification process • Balance risk and data utility

  35. Using PARAT • Set the acceptable re‑identification risk • User input • Level of trust • Accounting for nature of dataset

  36. Using PARAT • Carry out the risk assessment for prosecutor, journalist and marketer • Risk is high (>0.2 for all these) • Many potential uniques

  37. Using PARAT • Automatically de‑identifies the data • Suppression and generalisation to reduce risk to acceptable level • Before dataset is made available to researcher

  38. Evaluation of PARAT • Which models does it apply to? • Repositories/DL units that prepare data views? • Repositories/DL units that release linked datasets? • DL units that provide links, but data comes from providers? • Federated queries/distributed systems via a co-ordination centre? • Others?

  39. NEMO Numerical Evaluation of Multiple Outputs (NEMO) • SQL-based algorithm • Counts unique and low-copy number records • Allows the judicious application of suppression and/or aggregation • Project-by-project basis • Can apply at dataview and at results stages • Also – may use sequential analysis to limit views
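
Since NEMO is described as SQL-based, the sketch below shows the general idea as a query against an in-memory SQLite table: count the copies of each combination in an output and flag anything below a threshold for review. The table, columns and threshold are hypothetical and are not NEMO's actual schema or rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (age_band TEXT, sex TEXT, diagnosis TEXT)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", [
    ("20-29", "F", "asthma"),
    ("20-29", "F", "asthma"),
    ("90+",   "M", "rare condition"),   # low copy number: a candidate for suppression
])

# Flag combinations that occur fewer than `threshold` times in the output
threshold = 3
low_copy = conn.execute(
    """SELECT age_band, sex, diagnosis, COUNT(*) AS n
       FROM results
       GROUP BY age_band, sex, diagnosis
       HAVING COUNT(*) < ?""",
    (threshold,),
).fetchall()

for row in low_copy:
    print("review before release:", row)
```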

  40. Privacy in free text Free-text/Narrative text – • Review – Automatic de-identification of textual documents in the electronic health record: a review of recent research (Meystre et al., 2010, University of Utah) • Categorised two main approaches – • Pattern matching – rule-based, via constructed dictionaries • Machine learning – data-mining techniques using training of algorithms • Some use a combination to improve efficiency
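
A minimal sketch of the pattern-matching approach: regular expressions plus a small dictionary of names redact likely identifiers from narrative text. The patterns, the dictionary and the example note are illustrative assumptions; the systems reviewed by Meystre et al. are far more extensive.

```python
import re

# Hypothetical dictionary of known surnames; real systems use large curated lists
NAME_DICTIONARY = {"jones", "smith", "evans"}

PATTERNS = [
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DATE]"),                   # e.g. 17/06/2011
    (re.compile(r"\b[A-Z]{1,2}\d{1,2}\s*\d[A-Z]{2}\b"), "[POSTCODE]"),  # UK-style postcode
    (re.compile(r"\b\d{10}\b"), "[ID-NUMBER]"),                         # 10-digit identifiers
]

def deidentify(text):
    """Rule-based redaction of dates, postcodes, ID numbers and dictionary names."""
    for pattern, label in PATTERNS:
        text = pattern.sub(label, text)
    words = [("[NAME]" if w.strip(".,").lower() in NAME_DICTIONARY else w)
             for w in text.split()]
    return " ".join(words)

note = "Seen by Dr Jones on 17/06/2011 at SA2 8PP, NHS number 1234567890."
print(deidentify(note))
# Seen by Dr [NAME] on [DATE] at [POSTCODE], NHS number [ID-NUMBER].
```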

  41. Privacy in free text • Performance assessment: • Recall (Sensitivity) – proportion of the identifying health information present that is correctly detected • Precision (Positive predictivity) – proportion of true positives among all the terms identified • Fallout (False positive rate) – proportion of non-QI terms mistakenly identified as QIs
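
The three measures expressed as simple functions of confusion-matrix counts (true/false positives and negatives); the toy counts are hypothetical.

```python
def recall(tp, fn):
    """Sensitivity: identifying information found / all identifying information present."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Positive predictivity: true identifiers among everything the system flagged."""
    return tp / (tp + fp)

def fallout(fp, tn):
    """False positive rate: non-identifying terms wrongly flagged / all non-identifying terms."""
    return fp / (fp + tn)

# Toy counts from a hypothetical evaluation corpus
tp, fp, fn, tn = 90, 5, 10, 895
print(recall(tp, fn), precision(tp, fp), fallout(fp, tn))  # 0.9, ~0.947, ~0.0056
```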

  42. Challenges • Some challenges – • Words such as ‘brown’, ‘grey’ or ‘white’ may be names or adjectives • Drug names can be mistaken for person names and omitted • Time consuming to generate dictionaries • Domain-specific knowledge needed • Computationally challenging • Particular concerns and sensitivities around free-text data

  43. Privacy in geodata • Studying the relationships between health and environment • Residential Anonymous Linking Fields (RALFs) in SAIL • Common disclosure control practice is to aggregate and/or suppress values in population areas of a specified size • Some methods introduce distortions and loss of spatial relationships • Risk-based approaches use uniqueness thresholds to manage the risk of re-identification
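
A minimal sketch of the aggregate-and-suppress idea for residential geodata: snap point locations to a coarse grid and withhold any cell whose count falls below a disclosure threshold. The grid size, threshold and coordinates are illustrative assumptions and do not describe the SAIL RALF method.

```python
from collections import Counter

def to_grid_cell(easting, northing, cell_size=1000):
    """Aggregate a point location to a cell_size x cell_size metre grid square."""
    return (easting // cell_size * cell_size, northing // cell_size * cell_size)

def aggregate_and_suppress(points, cell_size=1000, min_count=5):
    """Return counts per grid cell, suppressing cells below the disclosure threshold."""
    counts = Counter(to_grid_cell(e, n, cell_size) for e, n in points)
    return {cell: n for cell, n in counts.items() if n >= min_count}

residences = [(265432, 192870), (265500, 192901), (265410, 192990),
              (265480, 192875), (265499, 192940), (270012, 195500)]
print(aggregate_and_suppress(residences, cell_size=1000, min_count=5))
# Only the cell containing five or more residences is released
```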

  44. Privacy in outputs and results Privacy preservation at output of results and/or publication • Differs with model – • Repositories/DL units that prepare data views? • Menu-driven query servers/meta-data views? • Federated queries/distributed systems via a co-ordination centre? • Repositories/DL units that release linked datasets? • DL units that provide links, but data comes from providers?

  45. Privacy in outputs and results Privacy preservation at output of results and/or publication • Types of outputs • Descriptive statistics, distributions, mean, median, SD, etc. • Tables of values • Contingency tables • Plots – scatter, box, bar charts • Single statistics, regression coefficients

  46. Privacy preservation models • SAIL model - repository providing views only • Scrutinise the results before release • Numerical Evaluation of Multiple Outputs • Always conduct manual review • No results can leave SAIL without assessment – release is dependent on authorisation • Risk remaining – alteration of results post-release

  47. Privacy preservation models • Australian Bureau of Statistics (ABS) • Confidentialised Unit Record Files (CURFs) • Removal of name and address, etc. • Control on level of detail • Changes to some values • Addition of noise to continuous variables • Some degree of suppression • Limited access to sensitive variables, company names, etc

  48. Privacy preservation models • Stringently confidentialised data can be released on CD-ROM • Data enclave – limited access to more detailed data – secure on‑site facility - trusted researcher status • Remote Analysis Server (RAS) • User does not access the data themselves • Queries submitted in SAS, SPSS or STATA • Results checked for confidentiality and sent to user

  49. Privacy preservation models Protection of privacy in ABS Remote Analysis System – • CSIRO (Commonwealth Scientific and Industrial Research Organisation) – Privacy Preserving Analytics • Uses risk mitigation in outputs of descriptive and inferential statistics, e.g. • Limiting extreme values in EDA output • Replacement of table cell counts with Correspondence Analysis (CA) of cell counts • Limited transformations allowed and covariance matrix not supplied in linear regression

  50. Stated advantages of RAS model Stated advantages of ABS Remote Analysis System – • No information loss due to data perturbation • No need for special statistics to deal with perturbed data • Can be easier to confidentialise the output than the data • Fitted models on RAS should be better than on confidentialised data • Less risky as researcher only receives confidentialised output, not records • Users can be given different levels of access
