
Anonymity through Data cubes




  1. Anonymity through Data cubes Athos Antoniades

  2. Introduction • Why Share Data? • What are the current legal and ethical limitations? • How have scientists shared medical data so far? • Key Problems • Perturbation • Cell Suppression

  3. The Problem • Why share data: • Replication Testing • Statistical Power • Multiple Testing Problem • Legal and Ethical Issues • Anonymization vs. Pseudonymization • Limitations derived from the consent form signed by subjects • Other regional, study-specific, or subject-specific issues.

  4. How have scientists shared medical data Contingency Table and Data Cube example
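
As a minimal sketch of the idea, a data cube can be seen as a table of counts indexed by every combination of a few categorical variables, and a contingency table as a roll-up of that cube onto two of them. The records, variable names, and the helpers `build_cube` and `roll_up` below are invented for illustration and are not taken from the presentation.

```python
from collections import Counter

# Hypothetical patient-level records; all names and values are invented.
records = [
    {"sex": "F", "age_group": "30-39", "smoker": "yes"},
    {"sex": "F", "age_group": "30-39", "smoker": "yes"},
    {"sex": "F", "age_group": "40-49", "smoker": "no"},
    {"sex": "M", "age_group": "30-39", "smoker": "no"},
    {"sex": "M", "age_group": "40-49", "smoker": "yes"},
    {"sex": "M", "age_group": "40-49", "smoker": "yes"},
]

def build_cube(rows, dimensions):
    """Count how many rows fall into each combination of dimension values."""
    return Counter(tuple(row[d] for d in dimensions) for row in rows)

def roll_up(cube, dimensions, keep):
    """Aggregate a cube down to a subset of its dimensions (a marginal table)."""
    idx = [dimensions.index(d) for d in keep]
    out = Counter()
    for cell, count in cube.items():
        out[tuple(cell[i] for i in idx)] += count
    return out

dims = ["sex", "age_group", "smoker"]
cube = build_cube(records, dims)                 # 3-dimensional cube of counts
sex_by_smoker = roll_up(cube, dims, ["sex", "smoker"])  # 2-way contingency table

for cell, count in sorted(sex_by_smoker.items()):
    print(cell, count)
```

Only the counts leave the data provider in this scheme; the individual rows do not.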

  5-7. 16-Year-Old Widow Problem A paper that analyzes data from a specific study reports:
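
The risk can be illustrated with a short sketch: when a published table contains a cell with very few subjects, anyone who knows a person with that rare combination of attributes can single them out. The counts below and the helper `risky_cells` are hypothetical and do not reproduce the tables shown on the slides.

```python
from collections import Counter

# Invented counts by (age_group, marital_status); not taken from the slides.
published = Counter({
    ("16-19", "single"): 412,
    ("16-19", "married"): 9,
    ("16-19", "widowed"): 1,
    ("20-29", "single"): 655,
    ("20-29", "married"): 388,
    ("20-29", "widowed"): 7,
})

def risky_cells(cube, k):
    """Return the non-empty cells that describe fewer than k subjects."""
    return sorted(cell for cell, n in cube.items() if 0 < n < k)

print(risky_cells(published, k=5))
```

With `k=5` the only flagged cell is the single widowed subject in the youngest age group.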

  8. Categorization Differences • Paper 1 that analyzes data from a specific study reports: • Paper 2 that analyzes data from the same study reports:

  9. Perturbation and Cell Suppression • Original Data • Perturbation (±1) and Cell Suppression (<5)
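
A minimal sketch of the two operations on a table of counts follows. Several details are assumptions, since the slide does not specify them: noise is drawn uniformly from the integer range, negative results are clamped to zero, perturbation is applied before suppression, and suppressed cells are marked with `None`. The example table is invented.

```python
import random

def perturb(table, max_noise, rng):
    """Add integer noise from [-max_noise, max_noise] to each cell (assumed uniform),
    clamping at zero so that no count becomes negative."""
    return [[max(0, c + rng.randint(-max_noise, max_noise)) for c in row]
            for row in table]

def suppress(table, threshold):
    """Hide every cell whose count is below the threshold by replacing it with None."""
    return [[None if c < threshold else c for c in row] for row in table]

# Hypothetical original counts.
original = [[120, 3, 45],
            [80, 12, 4]]

rng = random.Random(0)           # fixed seed so the sketch is repeatable
noisy = perturb(original, 1, rng)   # perturbation of +-1
released = suppress(noisy, 5)       # cell suppression of counts < 5

print(released)
```

The cell that started at 3 can only become 2, 3, or 4 after noise of ±1, so it is always suppressed; a cell that started at 4 may or may not be, depending on the noise drawn.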

  10. Evaluation • Most common parameters tested • Perturbation: [0], [-1,1], [-3,3], [-5,5], [-10,10] • Cell Suppression: <0, <=1, <=3, <=5, <=10 • Standard main effect test using Chi Square • Pearson’s Correlation Coefficient used to evaluate the deviation of each parameter combination from the original results. • A priori defined threshold for Pearson’s correlation coefficient: <=0.95.
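
The evaluation loop can be sketched roughly as follows: compute a chi-square statistic per variable on the original tables, repeat on anonymized tables for each parameter combination, and correlate the two sets of statistics. This is an illustration under stated assumptions, not the authors' implementation: the 2x2 tables are invented, noise is uniform, suppressed cells are treated as zero, and a cutoff of -1 stands in for the "no suppression" setting.

```python
import math
import random

def chi_square(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            if exp > 0:
                stat += (obs - exp) ** 2 / exp
    return stat

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def anonymize(table, noise, cutoff, rng):
    """Perturb each cell by integer noise in [-noise, noise], then zero out
    cells with a count <= cutoff (assumption: suppressed cells count as 0)."""
    out = []
    for row in table:
        new_row = []
        for c in row:
            c = max(0, c + rng.randint(-noise, noise))
            new_row.append(0 if c <= cutoff else c)
        out.append(new_row)
    return out

# Hypothetical 2x2 tables (exposure x outcome), one per variable.
tables = [
    [[50, 30], [20, 60]],
    [[40, 40], [45, 35]],
    [[90, 10], [60, 40]],
    [[25, 75], [30, 70]],
    [[55, 45], [35, 65]],
]

baseline = [chi_square(t) for t in tables]
rng = random.Random(42)
results = {}
for noise in (0, 1, 3, 5, 10):
    for cutoff in (-1, 1, 3, 5, 10):   # -1 means no suppression
        stats = [chi_square(anonymize(t, noise, cutoff, rng)) for t in tables]
        results[(noise, cutoff)] = pearson_r(baseline, stats)

for (noise, cutoff), r in sorted(results.items()):
    print(noise, cutoff, round(r, 4))
```

Each printed correlation can then be compared against the a priori threshold to decide which parameter combinations leave the test results close enough to the originals.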

  11. Evaluating Parameters with a matrix of graphs

  12. Linked2Safety’s Data Analysis Space Objectives: • Design and develop the data mining techniques and the scalable infrastructure for the identification of phenotypic and genetic associations related to adverse events. • Develop new and implement existing state-of-the-art analytical approaches for genetic data. • Define and implement the knowledge extraction and filtering mechanisms and the knowledge base. • Integrate the knowledge base into a lightweight decision support system (adverse events early detection mechanism).

  13. Data Analysis Steps

  14. Quality Control Subspace Provides the tools for identifying and removing erroneous data or data that do not conform to the quality standards that a user might define. • Tools: • Hardy-Weinberg Equilibrium Test • Allele Frequency Test • Missing Data Test
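
For orientation, the three quality control checks can be sketched for a single biallelic marker as below. This is the textbook chi-square form of the Hardy-Weinberg test, not the Linked2Safety implementation; the function names and example counts are hypothetical.

```python
def allele_frequency(n_aa, n_ab, n_bb):
    """Frequency of allele A computed from genotype counts AA, AB, BB."""
    return (2 * n_aa + n_ab) / (2 * (n_aa + n_ab + n_bb))

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts with the counts
    expected under Hardy-Weinberg equilibrium (p^2, 2pq, q^2)."""
    n = n_aa + n_ab + n_bb
    p = allele_frequency(n_aa, n_ab, n_bb)
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

def missing_rate(calls):
    """Fraction of genotype calls that are missing (represented here as None)."""
    return sum(c is None for c in calls) / len(calls)

# Critical value of the chi-square distribution with 1 degree of freedom at alpha = 0.05.
CHI2_CRIT_1DF_05 = 3.841

stat = hwe_chi_square(40, 20, 40)          # far too few heterozygotes
print(stat, stat > CHI2_CRIT_1DF_05)
print(min(allele_frequency(40, 20, 40), 1 - allele_frequency(40, 20, 40)))
print(missing_rate(["AA", None, "AB", None]))
```

A marker would be flagged when the statistic exceeds the critical value, when the minor allele frequency falls below a user-defined cutoff, or when the missing rate is too high; the cutoffs themselves are left to the user, as the slide indicates.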

  15. Feature Selection Subspace Provides the tools for removing redundant or irrelevant features from a dataset. Tools: • Rough Set Feature Selection • Information Gain Feature Selection • Chi Squared Feature Selection
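
As one example of these tools, information gain scores a feature by how much knowing its value reduces the entropy of the outcome. The sketch below uses invented toy data and hypothetical function names; it illustrates the general technique rather than the platform's code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus their expected entropy after splitting by feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Toy data: outcome plus two candidate features (all values invented).
outcome = [1, 1, 0, 0]
features = {
    "informative": ["a", "a", "b", "b"],   # separates the outcome perfectly
    "irrelevant": ["a", "b", "a", "b"],    # carries no information about it
}

ranking = sorted(features, key=lambda f: information_gain(features[f], outcome), reverse=True)
for name in ranking:
    print(name, information_gain(features[name], outcome))
```

Features whose gain falls below a chosen cutoff would be dropped before the later analysis steps.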

  16. Data Analysis Steps

  17. Single Hypothesis Testing Subspace Provides the tools for performing single hypothesis testing on a dataset and test for associations. • Tools: • Pearson’s Chi Square Test • Fisher’s Exact Test • Odds Ratio • Binomial Logistic Regression • Linkage Disequilibrium • Genetic Region Based Association Testing
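
Two of these tools, the odds ratio and Fisher's exact test, can be sketched for a 2x2 table with cells a, b (first row) and c, d (second row). The two-sided p-value below sums the probabilities of all tables with the same margins that are no more probable than the observed one, which is one common convention; the function names and example counts are hypothetical.

```python
import math

def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table [[a, b], [c, d]] (undefined when b or c is 0)."""
    return (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for a 2x2 table, using the
    hypergeometric distribution with fixed row and column totals."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Probability of seeing x in the top-left cell given the margins.
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Invented example: 3 of 4 exposed subjects had the event, 1 of 4 unexposed did.
print(odds_ratio(3, 1, 1, 3))
print(fisher_exact_two_sided(3, 1, 1, 3))
```

Fisher's exact test is usually preferred over the chi-square test when expected cell counts are small, which is also the situation where cell suppression is most likely to apply.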

  18. Data Mining Subspace Provides the tools for performing data mining analyses on a dataset and extract association rules. • Tools: • Association Rules (apriori) • Decision Trees with Percentage Split (C4.5) • Decision Trees with Cross Validation (C4.5) • Random Forest with Percentage Split • Random Forest with Cross Validation
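
Association rules are scored by support and confidence, which the sketch below computes directly. It does not implement the Apriori candidate generation itself, and the transactions, item names, and helper functions are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical subject profiles expressed as sets of items.
transactions = [
    {"drug_A", "rash"},
    {"drug_A", "rash"},
    {"drug_A"},
    {"drug_B"},
]

# Rule under consideration: drug_A -> rash
print(support(transactions, {"drug_A", "rash"}))
print(confidence(transactions, {"drug_A"}, {"rash"}))
```

A rule is typically kept only when both values exceed user-chosen minimums.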

  19. Data Analysis Space Interactions

  20. Data Analysis Steps

  21. Knowledge Extraction and Filtering Mechanism • Knowledge Extraction Mechanism • This mechanism is responsible for storing statistically significant associations and important association rules in the Linked2Safety knowledge database • Has two steps: • Logging system • Storing important knowledge • Filtering mechanism • This mechanism allows users to insert or delete associations and association rules

  22. Adverse Event Early Detection Mechanism • Uses the knowledge in the L2S knowledge base • Runs in the background to identify new associations and association rules • Reruns analyses when updated datasets are available • Creates alerts for patient profiles associated with adverse events

  23. Linked2Safety’s Data Analysis Platform

  24. Linked2Safety’s Data Analysis Platform Workflow Screenshot

  25. Patterns Discovery Common Variable Selection • Overlapping non-genetic data from at least 2 data providers:

  26. Conclusion and future work on utilizing data cubes • For a given dataset, we were able to identify the maximum noise that can be added to the data without significantly affecting the outcomes. • The results presented apply only to MASTOS; for any other dataset, the analytical approach described must be repeated to determine the maximum noise that can be added. • Further investigation is necessary to identify the minimum parameter settings that satisfy legal and ethical requirements.

  27. Who to Contact • Athos Antoniades • University of Cyprus • email: athos@cs.ucy.ac.cy
