
Privacy Protection & the SAIL Databank David Ford ECCONET Data Linkage Workshop

Privacy Protection & the SAIL Databank David Ford ECCONET Data Linkage Workshop Bergen 15 – 17 June 2011. Overview. The SAIL system and how it operates Privacy Protection Issues and Drivers Privacy Protection approach Current developments Examples of research studies Future work.


Presentation Transcript


  1. Privacy Protection & the SAIL Databank • David Ford • ECCONET Data Linkage Workshop • Bergen 15 – 17 June 2011

  2. Overview • The SAIL system and how it operates • Privacy Protection Issues and Drivers • Privacy Protection approach • Current developments • Examples of research studies • Future work

  3. What are HIRU and SAIL? • HIRU – the Health Information Research Unit • SAIL – Secure Anonymous Information Linkage • Main aim of HIRU is to realise the potential of electronically-held, routinely-collected, person-based data to conduct and support health-related studies • The SAIL databank already holds over 1 billion anonymised and encrypted individual-level records, from a range of sources relevant to health and well-being

  4. Is SAIL a Cohort? • Perhaps! • Total population databank for the 3 million people of Wales • Multi source data (administrative, clinical, research) • Many nested e-cohorts within SAIL (such as WECC )

  5. Data Linkage is the key! • Data linkage (at a person level) is essential to reap the benefits of routine data • Good quality data linkage needs some form of consistent personal identifiers on which to link • In the UK, multi-source data do not share a common ID number. Names, addresses and dates of birth are, however, normally collected.

  6. In the beginning . . . • There was a real opportunity to create this data resource • We had established how linked, routine data was useful for research • We knew there were numerous technical (computing) challenges to overcome • But the idea required data owners / guardians to feel able to provide their data to SAIL. • Constructing the circumstances that enabled data guardians to supply data to SAIL became the single biggest challenge!

  7. The issues facing SAIL • Data guardians across Wales: • Wanted to participate and saw the potential benefits • Were nervous of breaching the Data Protection Act • Did not have clear guidelines that helped • Needed a way of guaranteeing the privacy of their data • Were nervous about the uses to which the data might be put • Wanted access to be controlled (in some way)

  8. The issues facing SAIL • Researchers (across the UK) wanted: • As much data as they wanted, whenever they wanted it • To avoid detailing what they wanted to do • Data delivered to them • Data to arrive quickly • No admin, no approvals, no constraints • Clean, easy and consistent data • Simple, flat data structures

  9. Our response • Set a series of objectives • Undertook pilot work • Consulted very widely • Understood relevant legislation and good practice guidance (Information Governance) • Developed the approaches over time • Continued to consult and have external inspection • Continuous improvement process

  10. The initial IG challenge • Matching up the same people in different datasets (data linkage) is very inaccurate without access to identifiers • (Matching with imperfect identifiers is still a challenge!) • Sophisticated but standardised matching was therefore required. • Data owners felt able to part with “anonymised at source” data. However, including identifiers in the supply was seen to be illegal without consent.

  11. Setting out • Pilot to prove the concept • One health economy area – Swansea (pop. c. 250k) • Data: General Practices (36), Patient Episode Database Wales (PEDW) and social services data extracts • Purpose: to develop, review and refine technical and procedural methodologies.

  12. Setting out • Consultations with regulatory and professional bodies (local and national) • Suitability of system • In the public interest • Protection of patient privacy • Ethics and governance • Usefulness to enhance research and inform policy • Value for money • Exhaustive (and exhausting!) efforts • File of evidence of acceptability

  13. The base level • Response: development of “Split-file anonymisation” technique • Using the “separation principle” • No flow of identifiable information to SAIL • No flow of identifiable confidential information to ANYONE • Clear, written, formal data sharing agreements • Clarity about use cases, conditions and exception clauses

  14. Other design constraints • A pledge to data providers that no data will ever leave the databank • They can ask for it to be deleted • They know who has accessed it • They know what it has been used for

  15. HIRU methodology (illustration) • [Flow diagram. Parties shown: Data Provider, Health Solutions Wales, HIRU (Blue C). Labels: Operational system; Demographic data only; Clinical / activity data; Anonymisation process; Validate; Construct ALF; Validated, anonymised data; Recombine; Encrypt and load; Other recombined data]

  16. Available Computing infrastructure • Blue C supercomputer, one of the fastest computers in Europe dedicated to Life Science research • Strategic partnership with IBM (through School of Medicine’s Institute of Life Sciences initiative) • Advanced software toolset (database, data mining, GIS)

  17. Objectives • Secure data transportation • Reliable matching process • Anonymisation and encryption • Disclosure control • Data access controls • Scrutiny of data utilisation proposals • External verification of compliance with IG

  18. Objective 1 • 1) Secure data transportation • Data transported using HTTPS (Hyper-Text Transfer Protocol Secure) • DPOs split datasets at source • Clinical details to HIRU (none to HSW) • Demographics to HSW for matching and anonymisation • (Brown Envelope principle) • Linking key – re-join after anonymisation
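
The split at source described on this slide can be sketched as follows. This is only an illustration, not the SAIL implementation: the field names, the `split_record` helper and the use of a random hex token as the linking key are all assumptions, since the slides do not give those details.

```python
import secrets

# Hypothetical field names; the slides do not list the actual columns.
DEMOGRAPHIC_FIELDS = {"nhs_number", "first_name", "surname", "dob", "gender", "postcode"}


def split_record(record):
    """Split one source record into a demographic part and a clinical part.

    Both parts carry the same random linking key, so they can be re-joined
    after the demographic part has been replaced by an anonymous field.
    """
    link_key = secrets.token_hex(16)  # random, carries no personal information
    demographic = {"link_key": link_key}
    clinical = {"link_key": link_key}
    for field, value in record.items():
        target = demographic if field in DEMOGRAPHIC_FIELDS else clinical
        target[field] = value
    return demographic, clinical


# Dummy record for illustration only.
record = {
    "nhs_number": "DUMMY-0001",
    "first_name": "Ann",
    "surname": "Jones",
    "dob": "1970-01-01",
    "gender": "F",
    "postcode": "SA1 1AA",
    "diagnosis_code": "J45",
    "admission_date": "2010-05-03",
}

demographic, clinical = split_record(record)
```

In this sketch the demographic part would go to the trusted third party and the clinical part to the databank; the only field the two parts share is the linking key.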

  19. Objective 2 • 2) Reliable matching process • Partnership with Trusted Third Party – HSW • HSW = NHS Agency with right to hold identifiers for NHS admin purposes • Use the Welsh Demographic Service administrative register as gold standard • MACRAL (Matching Algorithm for Consistent Results in Anonymous Linkage) - SQL-based algorithm – sequential passes • Deterministic and probabilistic record linkage

  20. MACRAL • Exact match on valid NHS number • Exact match on firstname, surname, d.o.b, gender, postcode • Soundexing • Lexicon matching • Assigns match probability on Bayesian model • Informs analysts
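
A rough sketch of sequential matching passes of the kind listed above. MACRAL itself is described as SQL-based; Python is used here purely for illustration. The record layout, the `match_record` helper and the choice of fields in the Soundex pass are assumptions. NHS number validity checks, lexicon matching and the Bayesian match probability are not shown.

```python
# Standard Soundex letter codes.
_CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(name):
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def norm(value):
    """Normalise a field for comparison; missing values become ''."""
    return (value or "").strip().lower()


EXACT_FIELDS = ("first_name", "surname", "dob", "gender", "postcode")

# Each pass maps a record to a comparison key; passes run strictest first.
PASSES = [
    ("nhs_number", lambda r: (norm(r.get("nhs_number")),)),
    ("exact_demographics", lambda r: tuple(norm(r.get(f)) for f in EXACT_FIELDS)),
    ("soundex_surname", lambda r: (soundex(norm(r.get("surname"))),)
        + tuple(norm(r.get(f)) for f in ("dob", "gender", "postcode"))),
]


def match_record(record, register):
    """Return (person_id, pass_name) from the first pass with exactly one hit."""
    for pass_name, key in PASSES:
        wanted = key(record)
        if "" in wanted:
            continue  # a required field is missing, so skip this pass
        hits = [p for p in register if key(p) == wanted]
        if len(hits) == 1:
            return hits[0]["person_id"], pass_name
    return None, None


# Dummy register and incoming record for illustration only.
register = [
    {"person_id": 1, "nhs_number": "DUMMY-0001", "first_name": "Ann",
     "surname": "Jones", "dob": "1970-01-01", "gender": "F", "postcode": "SA1 1AA"},
    {"person_id": 2, "nhs_number": "DUMMY-0002", "first_name": "Bob",
     "surname": "Evans", "dob": "1980-02-02", "gender": "M", "postcode": "SA2 2BB"},
]
incoming = {"nhs_number": None, "first_name": "Ann", "surname": "Jonse",
            "dob": "1970-01-01", "gender": "F", "postcode": "SA1 1AA"}

result = match_record(incoming, register)
```

Here the incoming record has no NHS number and a misspelt surname, so the first two passes find nothing and the Soundex pass links it to person 1. Recording which pass produced the match is one simple way to inform analysts about match quality.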

  21. Validation and optimisation • Firstly, a validation exercise: obtained specificity > 99.8% and sensitivity > 94.6%, with error rates < 0.2% • Then, optimised techniques for matching a variety of datasets: primary care (GP), hospital/secondary care (PEDW), and social care (PARIS)
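
For reference, sensitivity and specificity are computed from linkage validation counts as below. The counts in the example are made up for illustration and are not the SAIL validation data; the slide does not say how its error rate was defined, so that figure is not reproduced here.

```python
def linkage_metrics(true_pos, false_pos, true_neg, false_neg):
    """Return (sensitivity, specificity) from counts of linkage outcomes.

    sensitivity: share of true matches that the linkage found
    specificity: share of true non-matches that the linkage correctly rejected
    """
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Illustrative counts only.
sensitivity, specificity = linkage_metrics(true_pos=90, false_pos=1,
                                           true_neg=99, false_neg=10)
```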

  22. Matching rates

  23. Objective 3 • 3) Anonymisation and encryption • Anonymous Linking Field (ALF) • One person – one ALF • Aggregation and categorisation • Further processing at HIRU • Into ALF_E • Recombination
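
One common way to obtain a "one person, one field" pseudonym that is then transformed a second time is a keyed hash applied in two stages. The sketch below assumes HMAC-SHA256, demo keys and a hypothetical register identifier; the slides do not state how ALF or ALF_E are actually constructed, so this shows the general idea only.

```python
import hashlib
import hmac


def keyed_pseudonym(identifier, key):
    """Derive a stable pseudonym from an identifier using HMAC-SHA256."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Demo keys only. In this sketch each party holds its own key, so neither
# can reproduce the other's transformation.
TTP_KEY = b"demo-key-held-by-trusted-third-party"
DATABANK_KEY = b"demo-key-held-by-databank"

# Stage 1 (assumed to happen at the trusted third party):
# register identifier -> ALF.
alf = keyed_pseudonym("person-0001", TTP_KEY)

# Stage 2 (assumed to happen at the databank): ALF -> ALF_E.
alf_e = keyed_pseudonym(alf, DATABANK_KEY)
```

Because the function is deterministic, the same person receives the same value in every dataset, which is what makes linkage on the anonymous field possible.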

  24. Objective 4 • 4) Disclosure controls • Assessment of uniques and low-copy numbers • Data reduced to minimum required for study • Operated at various stages: • When the data view is created • Before dissemination • Numerical Evaluation of Multiple Outputs • Combination of expert review and machine processes

  25. Numerical Evaluation of Multiple Outputs • NEMO • SQL-based algorithm • Counts unique and low-copy number records • Allows the judicious application of suppression and/or aggregation • Manual review • Linkage/Homogeneity attack
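
The counting step can be illustrated as follows. NEMO is described as SQL-based; this is a Python sketch of the same idea, and the column names, the threshold of 5 and the row-level suppression rule are assumptions rather than NEMO's actual behaviour.

```python
from collections import Counter


def low_copy_cells(rows, quasi_identifiers, threshold=5):
    """Return {combination: count} for combinations seen fewer than threshold times."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < threshold}


def suppress(rows, quasi_identifiers, threshold=5):
    """Drop rows whose quasi-identifier combination is unique or low-copy."""
    risky = low_copy_cells(rows, quasi_identifiers, threshold)
    return [row for row in rows
            if tuple(row[q] for q in quasi_identifiers) not in risky]


# Illustrative rows only: six records share one combination, one is unique.
rows = [{"age_band": "60-69", "area": "A"} for _ in range(6)]
rows.append({"age_band": "90+", "area": "B"})

flagged = low_copy_cells(rows, ["age_band", "area"])
released = suppress(rows, ["age_band", "area"])
```

Instead of suppressing rows, the flagged combinations could be aggregated into coarser categories (for example, wider age bands) and the count repeated. That corresponds to the choice between suppression and aggregation mentioned on the slide, with manual review deciding which is appropriate.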

  26. Objective 5 • 5) Data access controls • Technical and permission-based control • Policies and Standard Operating Procedures (SOPs) • User agreements – clarity + penalties • Project-based approvals and linked access • Physical restrictions - technology • Time-limited, specific data views per approved project • SAIL Gateway

  27. Improving access: SAIL Gateway

  28. SAIL Gateway: Critical features • Firewalled network • Windows XP desktops, one per user, running in a virtualised environment (VPN) • All desktops and servers are members of Active Directory, with specific group policies applied • Only remote desktop (RDP) allowed through the firewall to the Windows XP desktops • Localised file storage for the Windows XP desktops, both private and shared between desktops within the Gateway • Ability to host application servers within the environment • Automated one-way transfer of data into the environment • Authorised limited transfer of data out of the environment

  29. Objective 6 • 6) Scrutiny of data utilisation proposals • Collaboration Review System – applies to all uses • Information Governance Review Panel (IGRP) • British Medical Association • Public Health Wales • National Research Ethics Service • Informing Healthcare • Involving People

  30. Objective 7 • 7) External verification of compliance with IG (Audit) • Important to: • Reassure DPOs and other partners • Gain recommendations for improvement • Conduct: • Policies and SOPs • Interviews • System verification

  31. The SAIL system • [System diagram. Labels: Data Users; Project View; Project Request; IGRP; HIRU & IGRP SOPs and Policies; Disclosure control; HIRU Access controls; Views; SAIL databank; Masking and encryption; HSW Anonymisation service; Data Sources (National Datasets, NHS, Social care, Others)]

  32. Subsequent refinements • Role based access • Technical, Senior Analyst, Approved Analyst, User / statistician, HSW technical • SAIL Gateway • Uploading, tool selection, performance • Results out / approvals • Wiki, help, training materials, code of conduct, messaging • Tighter user agreements (line management sign off) & Clearer sanctions for misuse • Purpose-built virtual IGRP committee technology

  33. Data • Data on 3 million people, ≈ 2 billion records, and growing! • Historical data 5 – 20 years • Maintains address history for full period (exposures) • Most codified using ICD, Read codes, OPCS codes, SNOMED, etc. • Many hundreds of separate data suppliers • Free text a real (IG) challenge • Unknown use of identifiers • Potential for ‘risky’ comments • Hard to analyse in quantity • Now a major work stream

  34. National datasets - examples • PEDW - in-patients & day cases and out-patients • National Community Child Health Database • NHS Direct Wales • Cancer incidence registry for Wales • National screening programmes • Congenital abnormalities • Ambulance service data • National Pupil Database (performance and attendance of children at School) And much more…

  35. Local datasets - examples • General Practice • Pathology • A&E departments • Social services • Local authority housing data – RALFs • And more….

  36. Research datasets • Data collected as part of research studies where the aim is to use routine data as well • Permissions, consent and regulatory approvals • Do not release SAIL data to researchers to link to study datasets • Treat as dataset from DPO – study dataset anonymised and loaded into SAIL for linkage with SAIL data

  37. Clinical systems • Introduced new clinical systems to send data direct to SAIL (via standard mechanisms) • Working with NHS Wales to introduce new national systems • SAIL now central part of the NHS’s “secondary uses” approaches – new data from new national systems e.g. - radiology, pathology, emergency, etc.

  38. Other advancements • Data collected directly from the people of Wales (and beyond) via internet portals. Currently disease cohort specific, moving to all-Wales • SAIL data now linked to local histopathology sample archive (tissue bank), with potential to link to national cancer (tissue) bank • Flow of imaging data (MRI, ECG, etc.) from local NHS providers • Set up of a public advocates group • Linkage of national (cross-sectional) surveys – consent issues • Genomics data under consideration (special IG issues!) • Increasingly used by NHS to monitor and plan services – change of use • Residential Linking Fields (RALFs) . . .

  39. RALFs • Desire to know more about: • The properties people live in (characteristics, proximity to geographical features) • Who they live with (household relationships, familial relationships etc) • A real problem to do while maintaining anonymity • Our Solution: RALFs • An ALF has a RALF, all RALFs have 1+ ALFs (usually)
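
The relationship on this slide (each ALF has a RALF, each RALF has one or more ALFs) amounts to a many-to-one mapping that can be inverted to describe households without using names or addresses. A minimal sketch, with made-up ALF and RALF values and a hypothetical `households` helper:

```python
from collections import defaultdict


def households(alf_to_ralf):
    """Invert an ALF -> RALF mapping into RALF -> set of ALFs."""
    groups = defaultdict(set)
    for alf, ralf in alf_to_ralf.items():
        groups[ralf].add(alf)
    return dict(groups)


# Made-up anonymous values: two people share one residence, one lives alone.
residence = {"ALF_A": "RALF_1", "ALF_B": "RALF_1", "ALF_C": "RALF_2"}

by_residence = households(residence)
household_size = {ralf: len(alfs) for ralf, alfs in by_residence.items()}
```

This sketch holds one residence per person. Covering address history would need a dated ALF-to-RALF table rather than a single mapping.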

  40. Residential Anonymous Linking Fields - RALFs • a. Create environment metrics • b. KEY and addresses with environment metrics • c. Match incoming address data and attach RALFs • d. RALFs and environment metrics • e. Combination of RALFs with ALFs • [Flow diagram. Other labels: HIRU; HSW; OS Data; HIRU GIS; WDS; Encrypt; SAIL]

  41. Methodology references - Architecture

  42. Methodology references - Matching

  43. Methodology references - RALFs

  44. Summary • Privacy is not just about the individual – it sometimes relates to the organisation • Preserving privacy reduces research utility • Finding the balance between privacy protection and research utility is the key • There is no perfect balance

  45. Thanks • Data providers - NHS organisations, local authorities and government agencies, and more • Health Solutions Wales • NHS Wales Informatics Service • National Institute for Social Care and Health Research • Welsh Government • Information Governance Review Panel • Researchers of Wales and beyond • And to you for listening!

  46. Thanks
