
Supporting precise data analysis without releasing patient records: the Simulacrum in action

This presentation discusses the use of synthetic data in healthcare research to support data analysis while maintaining privacy. It describes the Simulacrum project, which aims to give researchers a public resource they can access and analyse without compromising sensitive information. The talk also covers applications of synthetic data and how it has enabled faster, more private analysis.



Presentation Transcript


  1. Supporting precise data analysis without releasing patient records: the Simulacrum in action
     Cong Chen, Paul Clarke, Lora Frayling, Sally Vernon, Brian Shand, Pesh Doubleday, Jem Rashbass

  2. Overview
     • Context and goals of this talk
     • Background: our motivating problem
     • What is synthetic data and how does it help?
     • What is the goal of the data exercise?
     • Building a synthetic data model in the Simulacrum
     • Results and applications
     • Conclusion
     Presentation title - edit in Header and Footer

  3. Talk Aims
     • Introduce and motivate concepts
       • Synthetic data
       • The information governance environment
       • Externally guided analysis
     • Describe and explain
       • The Simulacrum as synthetic data – what is it and how was it created?
       • Synthetic data-guided queries
       • How this has led to faster, more private answers

  4. Problems with sharing cancer data
     • Lots of data is available
       • This would enable researchers and industry to provide valuable insight into disease epidemiology, survival, clinical practice, resource utilisation and outcomes
     • Highly sensitive
       • Sharing data is an exercise in risk-reward balancing
     • Complex and intricate
       • Data dictionaries do not provide a perfect view of what to expect, and analysis can be slow to converge

  5. Synthetic data
     • Data items which are not created by observations
       • This includes simulations (e.g. Synthea), partially synthetic data (generalised perturbation) and fully synthetic data
     • Does not represent individuals
       • Removes re-identification risk, but attribution risks remain

  6. Simulacrum project aims
     • Users should have direct access to a public resource
       • Showing data as it looks to internal analysts
       • Be able to identify their cohort and the cohort size, data completeness and quality, and the codes/ranges used
       • Be able to prepare and code algorithms against the synthetic data
     • With a prepared analytical plan
       • Engage PHE with the proposed study
       • Share code which runs on the real data
       • Be able to complete analysis without releasing row-level or other sensitive data
     • Take a data-driven approach where possible
     • Use parameters
       • To adjust for differently sized or shaped datasets
       • To adjust to different privacy constraints/requirements

  7. Linked datasets
     • Data represents the course of patient treatment – we are interested in a coherent story and sensible timeline.
     • Patients can have multiple tumours, with very many treatment events – we need to capture this.

  8. How did we do it?
     • Key idea: sample from empirical conditional distributions.
     • Question: how do we keep from running out of data?
       • Use low-dimensional distributions.
     • Question: which variables do we condition on?
       • Use independence tests to find strongly associated variables.
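
The two ideas on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Simulacrum's actual code: the column names, the toy rows and the helper names (`chi_squared`, `fit_conditional`, `sample_conditional`) are all made up here, and Pearson's chi-squared statistic stands in for whichever independence test the project used. Ranking candidates by the raw statistic is also a simplification, since it ignores degrees of freedom.

```python
import random
from collections import Counter, defaultdict


def chi_squared(rows, a, b):
    """Pearson chi-squared statistic for the contingency table of columns a and b."""
    joint = Counter((r[a], r[b]) for r in rows)
    a_totals = Counter(r[a] for r in rows)
    b_totals = Counter(r[b] for r in rows)
    n = len(rows)
    stat = 0.0
    for a_val, a_count in a_totals.items():
        for b_val, b_count in b_totals.items():
            expected = a_count * b_count / n
            stat += (joint[(a_val, b_val)] - expected) ** 2 / expected
    return stat


def fit_conditional(rows, target, parents):
    """Count target values within each observed combination of parent values."""
    table = defaultdict(Counter)
    for row in rows:
        table[tuple(row[p] for p in parents)][row[target]] += 1
    return dict(table)


def sample_conditional(table, parent_values, rng):
    """Draw one target value from the empirical distribution for these parents."""
    counts = table[tuple(parent_values)]  # KeyError if never observed
    values = list(counts)
    return rng.choices(values, weights=[counts[v] for v in values], k=1)[0]


# Toy, made-up records (hypothetical columns, not real patient data).
toy_rows = [
    {"site": "A", "flag": "x", "stage": "1"},
    {"site": "A", "flag": "y", "stage": "1"},
    {"site": "B", "flag": "x", "stage": "2"},
    {"site": "B", "flag": "y", "stage": "2"},
]

# Pick the single most strongly associated parent: a low-dimensional model.
scores = {c: chi_squared(toy_rows, c, "stage") for c in ["site", "flag"]}
best_parent = max(scores, key=scores.get)

stage_table = fit_conditional(toy_rows, "stage", [best_parent])
synthetic_stage = sample_conditional(stage_table, ["A"], random.Random(0))
```

Conditioning on one or two parents keeps each cell of the table well populated, which is one way to read "use low-dimensional distributions" as an answer to running out of data.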

  9. More details
     • Question: what do we do for linked tables?
       • Use all previous data (but in read-only mode).
     • Question: what about sequences of events?
       • Use information from the previous event (if it exists) and data in upstream tables – so a Markov model.
     • Question: what about sampling from small conditional distributions, which risk reflecting real individuals?
       • Cluster these distributions to meet accepted healthcare data standards.
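
A rough sketch of the event-sequence idea, assuming a first-order Markov chain in which each event depends only on the previous one. The function names, the event labels, the `"END"` marker and the `MIN_COUNT` threshold are hypothetical, and the upstream-table conditioning mentioned on the slide is left out. Pooling all under-populated states into one distribution is a crude stand-in for the clustering the slide describes, whose details are not given in the talk.

```python
import random
from collections import Counter, defaultdict

MIN_COUNT = 5  # hypothetical threshold; the talk does not state the real one


def fit_transitions(sequences):
    """Count next-event frequencies keyed by the previous event (None = start)."""
    transitions = defaultdict(Counter)
    for events in sequences:
        previous = None
        for event in events + ["END"]:
            transitions[previous][event] += 1
            previous = event
    return dict(transitions)


def pool_small_states(transitions, min_count=MIN_COUNT):
    """Merge distributions built from too few records into one pooled distribution."""
    kept, pooled = {}, Counter()
    for state, counts in transitions.items():
        if sum(counts.values()) < min_count:
            pooled.update(counts)
        else:
            kept[state] = counts
    if sum(pooled.values()) < min_count:
        pooled = Counter()  # still too small: never sample from it
    return kept, pooled


def sample_sequence(kept, pooled, rng, max_len=20):
    """Generate one synthetic event sequence from the fitted chain."""
    events, previous = [], None
    for _ in range(max_len):
        counts = kept.get(previous, pooled)
        if not counts:
            break
        values = list(counts)
        nxt = rng.choices(values, weights=[counts[v] for v in values], k=1)[0]
        if nxt == "END":
            break
        events.append(nxt)
        previous = nxt
    return events


# Toy, made-up treatment histories.
toy_sequences = [["surgery", "chemo"] for _ in range(6)] + [["radio"]]
kept, pooled = pool_small_states(fit_transitions(toy_sequences))
synthetic_events = sample_sequence(kept, pooled, random.Random(1))
```

In this toy run the `"radio"` state is backed by a single record, so it is dropped from `kept` and no transition learned from that one record is ever sampled.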

  10. What models look like (without the data)

  11. The Simulacrum as a dataset
     • Version 1 – released 2018. 1.5 million tumours (corresponding to incidence in England, 2013-2015) with tumour/demographic/mortality data and chemotherapy treatment.
     • Representative at low dimensions (of variable combinations), less so for complex detail.
     • Non-disclosive for public release.
     • Ongoing development.

  12. How does it look?
     [Figure: cumulative age distributions for breast and prostate cancer. Blue: synthetic, red: real.]
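
A comparison like the one in this figure can be summarised numerically. The sketch below computes the largest vertical gap between two empirical cumulative distributions (the two-sample Kolmogorov-Smirnov statistic). The function names and the ages are invented for illustration; the talk does not say which fidelity measures were used.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function giving the fraction of the sample that is <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


def max_cdf_gap(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    cdf_a, cdf_b = ecdf(sample_a), ecdf(sample_b)
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(cdf_a(x) - cdf_b(x)) for x in points)


# Made-up ages at diagnosis, purely for illustration.
real_ages = [48, 55, 61, 61, 67, 72, 80]
synthetic_ages = [47, 55, 60, 62, 67, 73, 79]

gap = max_cdf_gap(real_ages, synthetic_ages)
```

A gap near 0 means the two curves nearly coincide; a gap of 1 means the samples do not overlap at all.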

  13. Applications
     • Synthetic data is used to back a statistical query gateway (currently manual).
     • We've shared our synthetic data with partners to write queries against. Those queries have proved robust, reflect the formats and categories in our data, and run against the real data.
     • Publications accepted for conferences and journal articles.
     • We then try to release non-disclosive aggregates and model parameters/diagnostics, without the personal data used to build those models.
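
One common way to make released aggregates less disclosive is small-count suppression. The sketch below is an assumption about what such a step might look like, not the gateway's actual rule: the function name, the column name and the threshold of 5 are hypothetical, and real disclosure control usually involves more than this (for example secondary suppression or rounding).

```python
from collections import Counter


def suppressed_counts(rows, by, threshold=5):
    """Group counts over the `by` columns, hiding any cell below `threshold`."""
    counts = Counter(tuple(row[col] for col in by) for row in rows)
    return {key: (n if n >= threshold else None) for key, n in counts.items()}


# Toy result of a partner's query run internally on (made-up) row-level data.
query_rows = [{"site": "A"}] * 6 + [{"site": "B"}] * 2

released = suppressed_counts(query_rows, by=["site"])
```

Only the `released` table would leave the secure environment; the row-level `query_rows` stay inside.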

  14. Current work
     • Better documentation of research and access process for less technical researchers
     • Model improvement, application in context of other datasets
     • More test-driven quality measures, automatic simulation with specific goals
     • Use other synthetic methodology within the data architecture
     • Fidelity isn’t objective – need to think about suitability for specific purpose

  15. Conclusions
     • Synthetic data is a game changer for supporting research and reducing risks
     • This opens understanding of the data and analysis to a wider audience while reducing workload and misunderstandings
     • Realistic understanding of aims and expectations helps a synthetic data project improve mutual understanding

  16. Acknowledgements
     • Analyses were based on anonymous aggregate patient data from the National Cancer Registration and Analysis Service.
     • Thank you to NCRAS and HDI, as well as everyone working on or who has worked on the Simulacrum.
     • Pick up the data at https://simulacrum.healthdatainsight.org.uk
     • https://github.com/UCL-simulacrum/EDA is an amazing piece of work carried out by UCL students over 3 months with no reference to the real data.
     • cong.chen@phe.gov.uk
     • ncrasenquiries@phe.gov.uk
