Privacy Protection

Privacy Protection & the SAIL Databank David Ford ECCONET Data Linkage Workshop



Privacy Protection & the SAIL Databank David Ford ECCONET Data Linkage Workshop Bergen 15 – 17 June 2011. Overview. The SAIL system and how it operates Privacy Protection Issues and Drivers Privacy Protection approach Current developments Examples of research studies Future work.


Presentation Transcript


  • Privacy Protection

  • & the SAIL Databank

  • David Ford

  • ECCONET Data Linkage Workshop

  • Bergen 15 – 17 June 2011


Overview

  • The SAIL system and how it operates

  • Privacy Protection Issues and Drivers

  • Privacy Protection approach

  • Current developments

  • Examples of research studies

  • Future work


What are HIRU and SAIL?

  • HIRU – the Health Information Research Unit

  • SAIL – Secure Anonymous Information Linkage

  • Main aim of HIRU is to realise the potential of electronically-held, routinely-collected, person-based data to conduct and support health-related studies

  • The SAIL databank already holds over 1 billion anonymised and encrypted individual-level records, from a range of sources relevant to health and well-being


Is SAIL a Cohort?

  • Perhaps!

  • Total population databank for the 3 million people of Wales

  • Multi-source data (administrative, clinical, research)

  • Many nested e-cohorts within SAIL (such as WECC)


Data Linkage is the key!

  • Data linkage (at a person level) is essential to reap the benefits of routine data

  • Good quality data linkage needs some form of consistent personal identifiers on which to link

  • In the UK, multi-source data do not share a common ID number. Names, addresses and dates of birth ARE, however, normally collected.


In the beginning . . .

  • There was a real opportunity to create this data resource

  • We had established how linked, routine data was useful for research

  • We knew there were numerous technical (computing) challenges to overcome

  • But the idea required data owners / guardians to feel able to provide their data to SAIL.

  • Constructing the circumstances that enabled data guardians to supply data to SAIL became the single biggest challenge!


The issues facing SAIL

  • Data guardians across Wales:

    • Wanted to participate and saw the potential benefits

    • Were nervous of breaching the Data Protection Act

    • Did not have clear guidelines that helped

    • Needed a way of guaranteeing the privacy of their data

    • Were nervous about the uses to which the data might be put

    • Wanted access to be controlled (in some way)


The issues facing SAIL

  • Researchers (across the UK) wanted:

    • As much data as they wanted, whenever they wanted it

    • To avoid detailing what they wanted to do

    • Data delivered to them

    • Data to arrive quickly

    • No admin, no approvals, no constraints

    • Clean, easy and consistent data

    • Simple, flat data structures


Our response

  • Set a series of objectives

  • Undertook pilot work

  • Consulted very widely

  • Understood relevant legislation and good practice guidance (Information Governance)

  • Developed the approaches over time

  • Continued to consult and have external inspection

  • Continuous improvement process


The initial IG challenge

  • Matching up the same people in different datasets (data linkage) is very inaccurate without access to identifiers

  • (Matching with imperfect identifiers is still a challenge!)

  • Sophisticated but standardised matching was therefore required.

  • Data owners felt able to part with “anonymised at source” data. However, including identifiers in the supply was seen to be illegal without consent.


Setting out

  • Pilot to prove the concept

    • One health economy area – Swansea (pop. c. 250k)

    • Data: General Practices (36), Patient Episode Database Wales (PEDW) and social services data extracts

    • Purpose: to develop, review, refine technical and procedural methodologies.


Setting out

  • Consultations with regulatory and professional bodies (local and national)

    • Suitability of system

    • In the public interest

    • Protection of patient privacy

    • Ethics and governance

    • Usefulness to enhance research and inform policy

    • Value for money

  • Exhaustive (and exhausting!) efforts

  • File of evidence of acceptability


The base level

  • Response: development of “Split-file anonymisation” technique

    • Using the “separation principle”

    • No flow of identifiable information to SAIL

    • No flow of identifiable confidential information to ANYONE

  • Clear, written, formal data sharing agreements

    • Clarity about use cases, conditions and exception clauses


Other design constraints

  • A pledge to data providers that no data will ever leave the databank

    • They can ask for it to be deleted

    • They know who has accessed it

    • They know what it has been used for


HIRU methodology (illustration)

[Diagram, reconstructed from its labels]

  • Data Provider (operational system): dataset split into demographic data only and clinical / activity data

  • Health Solutions Wales: validate, construct ALF (anonymisation process); validated, anonymised data passed on

  • HIRU (Blue C): recombine with clinical / activity data, encrypt and load, alongside other recombined data


Available Computing infrastructure

  • Blue C supercomputer, one of the fastest computers in Europe dedicated to Life Science research

  • Strategic partnership with IBM (through School of Medicine’s Institute of Life Sciences initiative)

  • Advanced software toolset (database, data mining, GIS)


Objectives

  • Secure data transportation

  • Reliable matching process

  • Anonymisation and encryption

  • Disclosure control

  • Data access controls

  • Scrutiny of data utilisation proposals

  • External verification of compliance with IG


Objective 1

  • Secure data transportation

    • Data transported using HTTPS (Hypertext Transfer Protocol Secure)

    • DPOs split datasets at source

    • Clinical details to HIRU (none to HSW)

    • Demographics to HSW for matching and anonymisation

    • (Brown Envelope principle)

    • Linking key – re-join after anonymisation
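The split at source and the later re-join can be sketched in a few lines of Python. This is a minimal illustration, not the SAIL implementation: the field names, the `link_key` column and the helper names `split_at_source` and `recombine` are all assumptions made for the example, and the slides do not say how the linking key is generated.

```python
import secrets

# Hypothetical field names; the slides do not list the actual columns.
DEMOGRAPHIC_FIELDS = ("nhs_number", "first_name", "surname", "dob", "gender", "postcode")


def split_at_source(records):
    """Split each record into a demographic part and a clinical part.

    Both parts carry the same random linking key, so they can be re-joined
    after the demographic part has been replaced by an anonymous field.
    """
    demographic_file, clinical_file = [], []
    for record in records:
        key = secrets.token_hex(8)  # random, carries no personal information
        demographic_file.append(
            {"link_key": key, **{f: record[f] for f in DEMOGRAPHIC_FIELDS}}
        )
        clinical_file.append(
            {"link_key": key,
             **{f: v for f, v in record.items() if f not in DEMOGRAPHIC_FIELDS}}
        )
    return demographic_file, clinical_file


def recombine(anonymised_file, clinical_file):
    """Re-join on the linking key, then drop the key.

    `anonymised_file` stands for what the trusted third party returns:
    rows of {"link_key": ..., "alf": ...} with no identifiers.
    """
    alf_by_key = {row["link_key"]: row["alf"] for row in anonymised_file}
    combined = []
    for row in clinical_file:
        out = {k: v for k, v in row.items() if k != "link_key"}
        out["alf"] = alf_by_key[row["link_key"]]
        combined.append(out)
    return combined
```

In this sketch the party holding the demographic file never sees clinical fields, and the party holding the clinical file never sees identifiers, which is the point of the separation.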


Objective 2

  • 2) Reliable matching process

  • Partnership with Trusted Third Party – HSW

  • HSW = NHS Agency with right to hold identifiers for NHS admin purposes

  • Use the Welsh Demographic Service administrative register as gold standard

  • MACRAL (Matching Algorithm for Consistent Results in Anonymous Linkage) - SQL-based algorithm – sequential passes

  • Deterministic and probabilistic record linkage


MACRAL

  • Exact match on valid NHS number

  • Exact match on first name, surname, date of birth, gender, postcode

  • Soundexing

  • Lexicon matching

  • Assigns a match probability based on a Bayesian model

  • Informs analysts
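The idea of sequential passes, from strictest to loosest, can be sketched as follows. MACRAL itself is described as SQL-based; this Python version is only an illustration under assumptions: the record fields, the `person_id` column and the function `match_record` are hypothetical, NHS number validity checking is omitted, and the lexicon and Bayesian probability steps are not modelled. The `soundex` helper implements standard American Soundex.

```python
_CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for letter in letters:
        _CODES[letter] = str(digit)


def soundex(name):
    """American Soundex code (letter + three digits) for a surname."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "HW":  # H and W do not break a run of equal codes
            previous = digit
    return (code + "000")[:4]


def match_record(incoming, register):
    """Return (person_id, pass_name) from the first pass giving exactly one hit."""
    passes = (
        ("nhs_number", lambda r: r.get("nhs_number")),
        ("exact_demographics", lambda r: (r["first_name"].lower(), r["surname"].lower(),
                                          r["dob"], r["gender"], r["postcode"])),
        ("soundex_surname", lambda r: (soundex(r["surname"]), r["dob"],
                                       r["gender"], r["postcode"])),
    )
    for pass_name, key_of in passes:
        wanted = key_of(incoming)
        if not wanted:  # e.g. missing NHS number: fall through to the next pass
            continue
        hits = [person for person in register if key_of(person) == wanted]
        if len(hits) == 1:
            return hits[0]["person_id"], pass_name
    return None, "unmatched"
```

Returning the name of the pass that produced the match is one simple way to let analysts see how strong each link is, in the spirit of "informs analysts".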


Validation and optimisation

  • Firstly –

  • Validation exercise

  • Obtained specificity values > 99.8% and sensitivity > 94.6%, with error rates < 0.2%

  • Then –

  • Optimised techniques for matching a variety of datasets: primary care (GP), hospital/secondary care (PEDW), and social care (PARIS)
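For readers less familiar with the two measures quoted in the validation exercise, the sketch below shows how sensitivity and specificity are computed from a comparison against a gold standard. The function name and the counts in the usage note are arbitrary illustrations; the slides do not give the underlying counts or define the error rate.

```python
def linkage_accuracy(true_pos, false_pos, true_neg, false_neg):
    """Sensitivity and specificity of a linkage run against a gold standard.

    true_pos:  record pairs correctly linked
    false_pos: record pairs linked that should not have been
    true_neg:  record pairs correctly left unlinked
    false_neg: record pairs that should have been linked but were missed
    """
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity
```

With made-up counts of 90 true positives, 10 false negatives, 95 true negatives and 5 false positives, the function returns a sensitivity of 0.9 and a specificity of 0.95.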


Matching rates


Objective 3

  • 3) Anonymisation and encryption

  • Anonymous Linking Field (ALF)

  • One person – one ALF

  • Aggregation and categorisation

  • Further processing at HIRU

    • Into ALF_E

    • Recombination
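The slides do not say how an ALF is turned into ALF_E. As a sketch only, one way to get a second-stage field that is still "one person, one value" is a keyed hash whose key is held solely by the party doing the second stage. The use of HMAC-SHA256 and the name `encrypt_alf` are assumptions for illustration, not the SAIL method.

```python
import hashlib
import hmac


def encrypt_alf(alf, secret_key):
    """Derive a second-stage linking field from an ALF with a keyed hash.

    The mapping is deterministic, so the same person always gets the same
    output and records still link, but the output cannot be traced back to
    the ALF without the key.
    """
    message = str(alf).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()
```

Because the first-stage party never holds this key and the second-stage party never holds identifiers, neither can go from a named person to a stored record on its own in this sketch.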


Objective 4

  • 4) Disclosure controls

  • Assessment of Uniques and low-copy numbers

  • Data reduced to minimum required for study

  • Operated at various stages:

    • When the data view is created

    • Before dissemination

  • Numerical Evaluation of Multiple Outputs

  • Combination of expert review and machine processes


Numerical Evaluation of Multiple Outputs

  • NEMO

  • SQL-based algorithm

  • Counts unique and low-copy number records

  • Allows the judicious application of suppression and/or aggregation

  • Manual review

  • Linkage/Homogeneity attack
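Counting unique and low-copy-number records can be illustrated as below. NEMO is described as SQL-based; this Python sketch swaps in a `Counter` to show the same counting idea. The column names, the threshold of 5 and the function names are assumptions for the example, and real disclosure control would also include the manual review listed above.

```python
from collections import Counter


def low_copy_cells(rows, quasi_identifiers, threshold=5):
    """Return combinations of quasi-identifier values seen fewer than `threshold` times."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < threshold}


def suppress(rows, quasi_identifiers, threshold=5):
    """Drop every row that falls in a low-copy-number cell."""
    risky = low_copy_cells(rows, quasi_identifiers, threshold)
    return [
        row for row in rows
        if tuple(row[q] for q in quasi_identifiers) not in risky
    ]
```

Suppression is the bluntest option. Aggregation, for example widening age bands until no cell falls below the threshold, keeps more rows at the cost of detail.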


Objective 5

  • 5) Data access controls

  • Technical and permission-based control

  • Policies and Standard Operating Procedures (SOPs)

  • User agreements – clarity + penalties

  • Project-based approvals and linked access

  • Physical restrictions - technology

  • Time-limited, specific data views per approved project

  • SAIL Gateway
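A time-limited, project-specific data view amounts to a permission check on two things: who is asking and whether the approval is still current. The sketch below is a hypothetical model of that rule only; the `ProjectView` type and `may_query` function are invented for illustration and say nothing about how the SAIL Gateway enforces it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProjectView:
    """One data view, created for one approved project."""
    project_id: str
    view_name: str
    approved_users: frozenset
    expires_on: date


def may_query(user, view, today):
    """Allow access only to approved users and only until the view expires."""
    return user in view.approved_users and today <= view.expires_on
```

Tying the expiry to the view rather than to the user means a researcher on two projects can lose access to one without affecting the other.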


Improving access: SAIL Gateway


SAIL Gateway: Critical features

  • Firewalled network

  • Windows XP desktops, one per user, running in a virtualised environment (VPN)

  • All desktops and servers are members of Active Directory, with specific group policies applied

  • Only remote desktop (RDP) allowed through the firewall to the Windows XP desktops

  • Localised file storage for the Windows XP desktops, both private and shared between desktops within the Gateway

  • Ability to host application servers within environment

  • Automated one-way transfer of data into the environment

  • Authorised limited transfer of data out of the environment


Objective 6

  • 6) Scrutiny of data utilisation proposals

  • Collaboration Review System – applies to all uses

  • Information Governance Review Panel (IGRP)

    • British Medical Association

    • Public Health Wales

    • National Research Ethics Service

    • Informing Healthcare

    • Involving People


Objective 7

  • 7) External verification of compliance with IG (Audit)

  • Important to:

    • Reassure DPOs and other partners

    • Gain recommendations for improvement

  • Conduct:

    • Policies and SOPs

    • Interviews

    • System verification


The SAIL system

[Diagram, reconstructed from its labels, listed from data sources up to data users]

  • Data Sources: national datasets, NHS, social care, others

  • HSW: anonymisation service

  • HIRU: masking and encryption, SAIL databank, views, access controls

  • HIRU & IGRP: disclosure control, SOPs and policies

  • IGRP: project request

  • Data Users: project view


Subsequent refinements

  • Role based access

    • Technical, Senior Analyst, Approved Analyst, User / statistician, HSW technical

  • SAIL Gateway

    • Uploading, tool selection, performance

    • Results out / approvals

    • Wiki, help, training materials, code of conduct, messaging

  • Tighter user agreements (line management sign-off) and clearer sanctions for misuse

  • Purpose-built virtual IGRP committee technology


    Data

    • Data on 3 million people, ≈ 2 billion records, and growing!

    • Historical data 5 – 20 years

    • Maintains address history for full period (exposures)

    • Most codified using ICD, Read codes, OPCS codes, SNOMED, etc.

    • Many hundreds of separate data suppliers

    • Free text a real (IG) challenge

      • Unknown use of identifiers

      • Potential for ‘risky’ comments

      • Hard to analyse in quantity

      • Now a major work stream


    National datasets - examples

    • PEDW - in-patients & day cases and out-patients

    • National Community Child Health Database

    • NHS Direct Wales

    • Cancer incidence registry for Wales

    • National screening programmes

    • Congenital abnormalities

    • Ambulance service data

    • National Pupil Database (performance and attendance of children at school)

      And much more…


    Local datasets - examples

    • General Practice

    • Pathology

    • A&E departments

    • Social services

    • Local authority housing data – RALFs

    • And more….


    Research datasets

    • Data collected as part of research studies where the aim is to use routine data as well

    • Permissions, consent and regulatory approvals

    • Do not release SAIL data to researchers to link to study datasets

    • Treat as dataset from DPO – study dataset anonymised and loaded into SAIL for linkage with SAIL data


    Clinical systems

    • Introduced new clinical systems to send data direct to SAIL (via standard mechanisms)

    • Working with NHS Wales to introduce new national systems

    • SAIL now central part of the NHS’s “secondary uses” approaches – new data from new national systems e.g. - radiology, pathology, emergency, etc.


    Other advancements

    • Data collected directly from the people of Wales (and beyond) via internet portals. Currently disease cohort specific, moving to all-Wales

    • SAIL data now linked to local histopathology sample archive (tissue bank), with potential to link to national cancer (tissue) bank

    • Flow of imaging data (MRI, ECG, etc.) from local NHS providers

    • Set up of a public advocates group

    • Linkage of national (cross-sectional) surveys – consent issues

    • Genomics data under consideration (special IG issues!)

    • Increasingly used by NHS to monitor and plan services – change of use

    • Residential Linking Fields (RALFs) . . .


    RALFs

    • Desire to know more about:

      • The properties people live in (characteristics, proximity to geographical features)

      • Who they live with (household relationships, familial relationships etc)

    • A real problem to do while maintaining anonymity

    • Our Solution: RALFs

    • An ALF has a RALF, all RALFs have 1+ ALFs (usually)
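Once each anonymised person (ALF) carries an anonymised residence (RALF), household membership falls out of a simple grouping. The sketch below assumes a flat list of current `alf`/`ralf` pairs and ignores address history dates; the row layout and the function names are hypothetical.

```python
from collections import defaultdict


def households(residency_rows):
    """Group anonymised people (ALFs) by anonymised residence (RALF)."""
    by_ralf = defaultdict(set)
    for row in residency_rows:
        by_ralf[row["ralf"]].add(row["alf"])
    return dict(by_ralf)


def co_residents(alf, residency_rows):
    """Return the other ALFs sharing a RALF with the given ALF."""
    result = set()
    for members in households(residency_rows).values():
        if alf in members:
            result |= members
    result.discard(alf)
    return result
```

Nothing in this grouping needs an address or a name, which is why a residence-level anonymous field makes "who they live with" questions answerable without breaking anonymity.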


    Residential Anonymous Linking Fields - RALFs

    [Diagram, reconstructed from its labels. Parties and inputs shown: HIRU, HIRU GIS, OS Data, HSW, WDS, Encrypt (shown twice), SAIL]

    • a. Create environment metrics

    • b. KEY and addresses with environment metrics

    • c. Match incoming address data and attach RALFs

    • d. RALFs and environment metrics

    • e. Combination of RALFs with ALFs


    Methodology references - Architecture


    Methodology references - Matching


    Methodology references - RALFs


    Summary

    • Privacy is not just about the individual – it sometimes relates to the organisation

    • Preserving privacy reduces research utility

    • Finding the balance between privacy protection and research utility is the key

    • There is no perfect balance


    Thanks

    Data providers - NHS organisations, local authorities and government agencies, and more

    Health Solutions Wales

    NHS Wales Informatics Service

    National Institute for Social Care and Health Research

    Welsh Government

    Information Governance Review Panel

    Researchers of Wales and beyond

    And to you for listening!

