Privacy Protection

Privacy Protection & the SAIL Databank - David Ford, ECCONET Data Linkage Workshop


Privacy Protection & the SAIL Databank. David Ford, ECCONET Data Linkage Workshop, Bergen, 15 – 17 June 2011. Overview: the SAIL system and how it operates; privacy protection issues and drivers; the privacy protection approach; current developments; examples of research studies; future work.

Presentation Transcript

  • Privacy Protection

  • & the SAIL Databank

  • David Ford

  • ECCONET Data Linkage Workshop

  • Bergen 15 – 17 June 2011


Overview

  • The SAIL system and how it operates

  • Privacy Protection Issues and Drivers

  • Privacy Protection approach

  • Current developments

  • Examples of research studies

  • Future work

What are HIRU and SAIL?

  • HIRU – the Health Information Research Unit

  • SAIL – Secure Anonymous Information Linkage

  • Main aim of HIRU is to realise the potential of electronically-held, routinely-collected, person-based data to conduct and support health-related studies

  • The SAIL databank already holds over 1 billion anonymised and encrypted individual-level records, from a range of sources relevant to health and well-being

Is SAIL a Cohort?

  • Perhaps!

  • Total population databank for the 3 million people of Wales

  • Multi-source data (administrative, clinical, research)

  • Many nested e-cohorts within SAIL (such as WECC)

Data Linkage is the key!

  • Data linkage (at a person level) is essential to reap the benefits of routine data

  • Good quality data linkage needs some form of consistent personal identifiers on which to link

  • In the UK, multi-source data do not share a common ID number. Names, addresses and dates of birth ARE, however, normally collected.

In the beginning . . .

  • There was a real opportunity to create this data resource

  • We had established how linked, routine data was useful for research

  • We knew there were numerous technical (computing) challenges to overcome

  • But the idea required data owners / guardians to feel able to provide their data to SAIL.

  • Constructing the circumstances that enabled data guardians to supply data to SAIL became the single biggest challenge!

The issues facing SAIL

  • Data guardians across Wales:

    • Wanted to participate and saw the potential benefits

    • Were nervous of breaching the Data Protection Act

    • Did not have clear guidelines that helped

    • Needed a way of guaranteeing the privacy of their data

    • Were nervous about the uses to which the data might be put

    • Wanted access to be controlled (in some way)

The issues facing SAIL

  • Researchers (across the UK) wanted:

    • As much data as they wanted, whenever they wanted it

    • To avoid detailing what they wanted to do

    • Data delivered to them

    • Data to arrive quickly

    • No admin, no approvals, no constraints

    • Clean, easy and consistent data

    • Simple, flat data structures

Our response

  • Set a series of objectives

  • Undertook pilot work

  • Consulted very widely

  • Understood relevant legislation and good practice guidance (Information Governance)

  • Developed the approaches over time

  • Continued to consult and have external inspection

  • Continuous improvement process

The initial IG challenge

  • Matching up the same people in different datasets (data linkage) is very inaccurate without access to identifiers

  • (Matching with imperfect identifiers is still a challenge!)

  • Sophisticated but standardised matching was therefore required.

  • Data owners felt able to part with “anonymised at source” data. However, including identifiers in the supply was seen to be illegal without consent

Setting out

  • Pilot to prove the concept

    • One health economy area – Swansea (pop. c. 250k)

    • Data: General Practices (36), Patient Episode Database Wales (PEDW) and social services data extracts

    • Purpose: to develop, review, refine technical and procedural methodologies.

Setting out

  • Consultations with regulatory and professional bodies (local and national)

    • Suitability of system

    • In the public interest

    • Protection of patient privacy

    • Ethics and governance

    • Usefulness to enhance research and inform policy

    • Value for money

  • Exhaustive (and exhausting!) efforts

  • File of evidence of acceptability

The base level

  • Response: development of “Split-file anonymisation” technique

    • Using the “separation principle”

    • No flow of identifiable information to SAIL

    • No flow of identifiable confidential information to ANYONE

  • Clear, written, formal data sharing agreements

    • Clarity about use cases, conditions and exception clauses

Other design constraints

  • A pledge to data providers that no data will ever leave the databank

    • They can ask for it to be deleted

    • They know who has accessed it

    • They know what it has been used for

HIRU methodology (illustration)

[Diagram; labels recovered from the slide: Data Provider; Operational system; Demographic data only; Clinical / activity data; Health Solutions Wales; Anonymisation process; Construct ALF; Validated, anonymised data; HIRU (Blue C); Encrypt and load; Other recombined data.]

Available Computing infrastructure

  • Blue C supercomputer, one of the fastest computers in Europe dedicated to Life Science research

  • Strategic partnership with IBM (through School of Medicine’s Institute of Life Sciences initiative)

  • Advanced software toolset (database, data mining, GIS)


  • Secure data transportation

  • Reliable matching process

  • Anonymisation and encryption

  • Disclosure control

  • Data access controls

  • Scrutiny of data utilisation proposals

  • External verification of compliance with IG

Objective 1

  • Secure data transportation

    • Data transported using HTTPS (Hyper-Text Transfer Protocol Secure)

    • DPOs split datasets at source

    • Clinical details to HIRU (none to HSW)

    • Demographics to HSW for matching and anonymisation

    • Brown Envelope principle

    • Linking key – re-join after anonymisation
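The split at source can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SAIL implementation: the field names, the `split_record` helper and the use of a random UUID as the linking key are all hypothetical.

```python
import uuid

# Hypothetical field names; real extracts will differ.
DEMOGRAPHIC_FIELDS = {"nhs_number", "first_name", "surname", "dob", "gender", "postcode"}


def split_record(record):
    """Split one source record into a demographic row and a clinical row.

    Both rows carry the same random linking key, so they can be re-joined
    once the demographic row has been replaced by an anonymous field.
    """
    link_key = uuid.uuid4().hex  # assumption: any unguessable key would serve
    demographic = {k: v for k, v in record.items() if k in DEMOGRAPHIC_FIELDS}
    clinical = {k: v for k, v in record.items() if k not in DEMOGRAPHIC_FIELDS}
    demographic["link_key"] = link_key
    clinical["link_key"] = link_key
    return demographic, clinical


# Made-up example record.
source_row = {
    "nhs_number": "9990000018",
    "first_name": "Ann",
    "surname": "Example",
    "dob": "1970-01-01",
    "gender": "F",
    "postcode": "SA0 0AA",
    "diagnosis_code": "J45",
    "admission_date": "2010-05-04",
}
to_hsw, to_hiru = split_record(source_row)
```

Here `to_hsw` holds only identifiers plus the key, and `to_hiru` holds only clinical content plus the key, so neither file on its own is both identifiable and confidential.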

Objective 2

  • 2) Reliable matching process

  • Partnership with Trusted Third Party – HSW

  • HSW = NHS Agency with right to hold identifiers for NHS admin purposes

  • Use the Welsh Demographic Service administrative register as gold standard

  • MACRAL (Matching Algorithm for Consistent Results in Anonymous Linkage) - SQL-based algorithm – sequential passes

  • Deterministic and probabilistic record linkage


  • Exact match on valid NHS number

  • Exact match on first name, surname, d.o.b., gender, postcode

  • Soundexing

  • Lexicon matching

  • Assigns match probability using a Bayesian model

  • Informs analysts
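The sequential passes listed above can be illustrated with a short Python sketch. It is not MACRAL (which the slides describe as SQL-based): the record layout, the `match` function and the pass order are assumptions, and the lexicon and Bayesian probability steps are left out. The NHS number check is the standard Modulus 11 check digit test, and `soundex` is a plain American Soundex.

```python
_SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                           "4": "l", "5": "mn", "6": "r"}.items()
    for letter in letters
}


def soundex(name):
    """American Soundex: first letter plus three digits, e.g. Robert -> R163."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code is kept
    return (result + "000")[:4]


def valid_nhs_number(value):
    """Modulus 11 check digit test for a 10-digit NHS number."""
    if len(value) != 10 or not (value.isascii() and value.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(value[:9], range(10, 1, -1)))
    check = 11 - total % 11
    if check == 11:
        check = 0
    return check != 10 and check == int(value[9])


def _nhs_key(row):
    number = row.get("nhs_number") or ""
    return number if valid_nhs_number(number) else None


def _exact_key(row):
    return (row["first_name"].lower(), row["surname"].lower(),
            row["dob"], row["gender"], row["postcode"])


def _soundex_key(row):
    return (soundex(row["first_name"]), soundex(row["surname"]),
            row["dob"], row["gender"], row["postcode"])


# Hypothetical pass order: strictest rule first.
PASSES = [("nhs_number", _nhs_key), ("exact", _exact_key), ("soundex", _soundex_key)]


def match(record, register):
    """Return (register id, pass name) from the first pass with exactly one hit."""
    for name, key in PASSES:
        wanted = key(record)
        if wanted is None:
            continue
        hits = [row["id"] for row in register if key(row) == wanted]
        if len(hits) == 1:
            return hits[0], name
    return None, "unmatched"


# Made-up register and incoming record.
register = [
    {"id": 1, "nhs_number": "9990000018", "first_name": "Catherine", "surname": "Jones",
     "dob": "1980-02-03", "gender": "F", "postcode": "SA0 0AA"},
    {"id": 2, "nhs_number": None, "first_name": "Robert", "surname": "Evans",
     "dob": "1975-06-07", "gender": "M", "postcode": "SA0 0AB"},
]
incoming = {"nhs_number": None, "first_name": "Rupert", "surname": "Evans",
            "dob": "1975-06-07", "gender": "M", "postcode": "SA0 0AB"}
result = match(incoming, register)
```

Returning the name of the pass that produced the link is one way to let an analyst restrict a study to the stricter matches.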

Validation and optimisation

  • Firstly –

  • Validation exercise

  • Obtained specificity values > 99.8% and sensitivity > 94.6%, with error rates < 0.2%

  • Then –

  • Optimised techniques for matching a variety of datasets: primary care (GP), hospital/secondary care (PEDW), and social care (PARIS)
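For readers less familiar with the validation measures quoted above, the sketch below shows how sensitivity and specificity are usually computed from linkage decisions checked against a gold standard. The counts are made up for illustration and are not the SAIL validation data; the slides do not define "error rate", so it is not computed here.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity) from counts of linkage decisions.

    tp: true links found      fn: true links missed
    tn: non-links rejected    fp: non-links wrongly linked
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Made-up counts for illustration only.
sens, spec = sensitivity_specificity(tp=90, fp=5, tn=995, fn=10)
```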

Objective 3

  • 3) Anonymisation and encryption

  • Anonymous Linking Field (ALF)

  • One person – one ALF

  • Aggregation and categorisation

  • Further processing at HIRU

    • Into ALF_E

    • Recombination
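The slides do not say how an ALF is derived or which encryption produces ALF_E. As a stand-in, the sketch below uses a keyed hash (HMAC-SHA256) to show the two properties that matter: the same person always gets the same ALF, and the databank re-encodes it under a separate key. The key values and the `keyed_pseudonym` helper are hypothetical.

```python
import hashlib
import hmac


def keyed_pseudonym(value, key):
    """Return a hex HMAC-SHA256 of value under key (same input, same output)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical keys: one held by the trusted third party, one by the databank.
TTP_KEY = b"example-key-held-by-trusted-third-party"
DATABANK_KEY = b"example-key-held-by-databank"

alf = keyed_pseudonym("9990000018", TTP_KEY)  # assigned during anonymisation
alf_e = keyed_pseudonym(alf, DATABANK_KEY)    # re-encoded before loading
```

Because each party holds only its own key in this sketch, neither can reproduce the other's field without cooperation.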

Objective 4

  • 4) Disclosure controls

  • Assessment of Uniques and low-copy numbers

  • Data reduced to minimum required for study

  • Operated at various stages:

    • When the data view is created

    • Before dissemination

  • Numerical Evaluation of Multiple Outputs

  • Combination of expert review and machine processes

Numerical Evaluation of Multiple Outputs

  • NEMO

  • SQL-based algorithm

  • Counts unique and low-copy number records

  • Allows the judicious application of suppression and/or aggregation

  • Manual review

  • Linkage/Homogeneity attack
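Counting unique and low-copy-number records comes down to a grouped count over the potentially identifying columns. The sketch below runs such a query through Python's built-in `sqlite3` module. It is not NEMO itself: the table, the column names and the threshold of 5 are assumptions chosen for illustration.

```python
import sqlite3


def low_copy_groups(rows, threshold=5):
    """Return (age_band, sex, area, n) for every combination seen fewer than threshold times."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE view_rows (age_band TEXT, sex TEXT, area TEXT)")
    con.executemany("INSERT INTO view_rows VALUES (?, ?, ?)", rows)
    result = con.execute(
        "SELECT age_band, sex, area, COUNT(*) AS n "
        "FROM view_rows "
        "GROUP BY age_band, sex, area "
        "HAVING COUNT(*) < ? "
        "ORDER BY n, age_band, sex, area",
        (threshold,),
    ).fetchall()
    con.close()
    return result


# Made-up data view: one common combination, one pair, one unique record.
rows = [("40-49", "F", "A")] * 6 + [("80-89", "M", "B")] * 2 + [("90+", "F", "C")]
flagged = low_copy_groups(rows)
```

The flagged combinations are the candidates for suppression or aggregation, for example by merging the two oldest age bands, before manual review.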

Objective 5

  • 5) Data access controls

  • Technical and permission-based control

  • Policies and Standard Operating Procedures (SOPs)

  • User agreements – clarity + penalties

  • Project-based approvals and linked access

  • Physical restrictions - technology

  • Time-limited, specific data views per approved project

  • SAIL Gateway
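A time-limited, project-specific data view can be modelled as a simple permission check. The sketch below is an illustration only; the `ProjectApproval` record, its fields and the view names are hypothetical and do not describe the SAIL Gateway's actual mechanism.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProjectApproval:
    """One approved project: who may see which data views, and until when."""
    user: str
    project: str
    views: frozenset
    expires: date


def may_query(approval, user, view, today):
    """Allow access only to the named user, for an approved view, before expiry."""
    return (approval.user == user
            and view in approval.views
            and today <= approval.expires)


# Made-up approval for illustration.
approval = ProjectApproval(
    user="analyst_1",
    project="asthma_admissions",
    views=frozenset({"asthma_admissions.gp_events"}),
    expires=date(2012, 6, 30),
)
```

A request from another user, for another view, or after the expiry date is refused by the same check.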

SAIL Gateway: Critical features

  • Firewalled network

  • Windows XP desktops, one per user, running in a virtualised environment (VPN)

  • All desktops and servers are members of Active Directory, with specific group policies applied

  • Only remote desktop (RDP) is allowed through the firewall to the Windows XP desktops

  • Localised file storage for the Windows XP desktops, both private and shared between desktops within the Gateway

  • Ability to host application servers within environment

  • Automated one-way transfer of data into the environment

  • Authorised limited transfer of data out of the environment

Objective 6

  • 6) Scrutiny of data utilisation proposals

  • Collaboration Review System – applies to all uses

  • Information Governance Review Panel (IGRP)

    • British Medical Association

    • Public Health Wales

    • National Research Ethics Service

    • Informing Healthcare

    • Involving People

Objective 7

  • 7) External verification of compliance with IG (Audit)

  • Important to:

    • Reassure DPOs and other partners

    • Gain recommendations for improvement

  • Conduct:

    • Policies and SOPs

    • Interviews

    • System verification

The SAIL system

[Diagram; labels recovered from the slide: Data Users; Project Request; SOPs and Policies; Disclosure control; Access controls; SAIL databank; Masking and encryption; Anonymisation service; Data Sources; National Datasets; Social care.]

Subsequent refinements

  • Role based access

    • Technical, Senior Analyst, Approved Analyst, User / statistician, HSW technical

  • SAIL Gateway

    • Uploading, tool selection, performance

    • Results out / approvals

    • Wiki, help, training materials, code of conduct, messaging

  • Tighter user agreements (line management sign-off) and clearer sanctions for misuse

  • Purpose-built virtual IGRP committee technology

  • Data

    • Data on 3 million people, ≈ 2 billion records, and growing!

    • Historical data 5 – 20 years

    • Maintains address history for full period (exposures)

    • Most codified using ICD, Read codes, OPCS codes, SNOMED, etc.

    • Many hundreds of separate data suppliers

    • Free text a real (IG) challenge

      • Unknown use of identifiers

      • Potential for ‘risky’ comments

      • Hard to analyse in quantity

      • Now a major work stream

National datasets - examples

  • PEDW - in-patients & day cases and out-patients

  • National Community Child Health Database

  • NHS Direct Wales

  • Cancer incidence registry for Wales

  • National screening programmes

  • Congenital abnormalities

  • Ambulance service data

  • National Pupil Database (performance and attendance of children at school)

  • And much more…

Local datasets - examples

  • General Practice

  • Pathology

  • A&E departments

  • Social services

  • Local authority housing data – RALFs

  • And more…

Research datasets

  • Data collected as part of research studies where the aim is to use routine data as well

  • Permissions, consent and regulatory approvals

  • Do not release SAIL data to researchers to link to study datasets

  • Treat as dataset from DPO – study dataset anonymised and loaded into SAIL for linkage with SAIL data

Clinical systems

  • Introduced new clinical systems to send data direct to SAIL (via standard mechanisms)

  • Working with NHS Wales to introduce new national systems

  • SAIL now central part of the NHS’s “secondary uses” approaches – new data from new national systems, e.g. radiology, pathology, emergency, etc.

Other advancements

  • Data collected directly from the people of Wales (and beyond) via internet portals. Currently disease cohort specific, moving to all-Wales

  • SAIL data now linked to local histopathology sample archive (tissue bank), with potential to link to national cancer (tissue) bank

  • Flow of imaging data (MRI, ECG, etc.) from local NHS providers

  • Set up of a public advocates group

  • Linkage of national (cross-sectional) surveys – consent issues

  • Genomics data under consideration (special IG issues!)

  • Increasingly used by NHS to monitor and plan services – change of use

  • Residential Anonymous Linking Fields (RALFs) . . .


  • Desire to know more about:

    • The properties people live in (characteristics, proximity to geographical features)

    • Who they live with (household relationships, familial relationships, etc.)

  • A real problem to do while maintaining anonymity

  • Our Solution: RALFs

  • An ALF has a RALF, all RALFs have 1+ ALFs (usually)

Residential Anonymous Linking Fields - RALFs

[Diagram; input labelled "OS Data". Steps:]

  • a. Create environment metrics

  • b. KEY and addresses with environment metrics

  • c. Match incoming address data and attach RALFs

  • d. RALFs and environment metrics

  • e. Combination of RALFs with ALFs
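Step e, the combination of RALFs with ALFs, amounts to a join between anonymous person keys and anonymous residence keys. The sketch below shows the two uses mentioned earlier: grouping people into households and attaching environment metrics to each person. All keys, metric names and values are made up, and the `households` helper is hypothetical.

```python
from collections import defaultdict


def households(alf_to_ralf):
    """Group anonymous person keys (ALFs) by anonymous residence key (RALF)."""
    grouped = defaultdict(set)
    for alf, ralf in alf_to_ralf.items():
        grouped[ralf].add(alf)
    return dict(grouped)


# Made-up keys and metrics for illustration.
alf_to_ralf = {"alf_001": "ralf_A", "alf_002": "ralf_A", "alf_003": "ralf_B"}
ralf_metrics = {
    "ralf_A": {"metres_to_green_space": 120},
    "ralf_B": {"metres_to_green_space": 950},
}

by_household = households(alf_to_ralf)

# Attach each residence's environment metrics to the people linked to it.
person_metrics = {alf: ralf_metrics[ralf] for alf, ralf in alf_to_ralf.items()}
```

No address appears anywhere in this join, which is how household and environment questions can be asked without breaking anonymity.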



  • Privacy is not just about the individual – it sometimes relates to the organisation

  • Preserving privacy reduces research utility

  • Finding the balance between privacy protection and research utility is the key

  • There is no perfect balance


  • Data providers - NHS organisations, local authorities and government agencies, and more

  • Health Solutions Wales

  • NHS Wales Informatics Service

  • National Institute for Social Care and Health Research

  • Welsh Government

  • Information Governance Review Panel

  • Researchers of Wales and beyond

  • And to you for listening!