
On the Importance of Data Cleansing and Pre-processing for Genomic Studies

Raymond Ng, Computer Science, UBC (ICapture and BC Cancer Research)


Presentation Transcript


  1. On the Importance of Data Cleansing and Pre-processing for Genomic Studies • Raymond Ng, Computer Science, UBC (ICapture and BC Cancer Research)

  2. My Key Genomic Projects • Better biomarkers in transplantation (Genome Canada) • Rational chemotherapy selection for Non-Small Cell lung cancer (Genome Canada) • Nanosilver effects on amphibian wildlife using novel molecular assays (NSERC) • Frog Sentinel species comparative “omics” for the environment (Genome BC)

  3. Overview: Better Biomarkers in Transplantation • Vital organ failure is a leading cause of premature death worldwide • Organ transplantation restores life and health to over 40,000 patients per year • The post-transplant course is not clear sailing: • Immunosuppressive drugs cause infection, cancer, diabetes, heart disease, and kidney failure • Transplant failure and treatment complications consume enormous health care resources

  4. BiT Overall Objectives • Identify effective and widely applicable markers that… • predict rejection or immune accommodation of solid organ transplants • diagnose acute and chronic rejection • forecast the response to therapies that individual transplant recipients receive

  5. BiT Analysis Pipeline • PAXgene whole blood • Affymetrix HG U133 Plus 2 microarrays (54,675 probe sets ~ 38,500 genes) • Normalization and pre-filtering (~15,000 probe sets) • Univariate feature selection (~200 probe sets) • Classifier building + pathway analysis • Biomarker panel
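The middle stages of this pipeline (pre-filtering, then univariate feature selection) can be sketched as follows. This is a minimal illustration, not the BiT implementation: the slide does not say which filter or test is used, so the variance filter, the Welch t-statistic, the function names, and the toy data are all assumptions. The expression matrix is assumed to be already normalized.

```python
import numpy as np

def prefilter_by_variance(X, keep):
    """Return sorted column indices of the `keep` most variable probe sets.

    Assumption: variance is the pre-filtering criterion (not stated in the slides).
    """
    order = np.argsort(X.var(axis=0))[::-1]
    return np.sort(order[:keep])

def welch_t(X, y):
    """Per-column Welch t-statistic between class 0 and class 1 samples."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se

def select_features(X, y, n_prefilter, n_select):
    """Pre-filter, then rank the surviving probe sets by |t| and keep the top ones."""
    kept = prefilter_by_variance(X, n_prefilter)
    t = welch_t(X[:, kept], y)
    top = np.argsort(-np.abs(t))[:n_select]
    return kept[top]

# Toy data (hypothetical): 20 samples x 50 probe sets, one truly differential column.
rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 50))
X[y == 1, 7] += 5.0  # probe set 7 is shifted in class 1

selected = select_features(X, y, n_prefilter=20, n_select=3)
```

In the toy data the shifted probe set ranks first; on real arrays the two cut-offs would play the role of the ~15,000 and ~200 figures above.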

  6. Complex Data Types • Common theme: data cleansing and pre-processing

  7. My Studies on Cleansing and Pre-processing • Clinical: “Detecting potential labeling errors in microarrays by data perturbation,” Bioinformatics 2006 (Malossini, Blanzieri) • mRNA: “MDQC: a new quality assessment method for microarrays based on quality control reports,” Bioinformatics 2007 (Cohen-Freue, Hollander et al.) • DNA: “Modelling Recurrent DNA Copy Number Alterations in array CGH Data,” Bioinformatics 2007 (Shah, Murphy, Lam) • Proteomics: “Linking Protein Groups Across Multiple Experiments,” in preparation (Cohen-Freue et al.)

  8. A. Detecting Potential Labeling Errors [MBN06] • Biomedical data can be very noisy: • Laboratory environment could change • Diagnostic decisions are not completely objective • Different “gold-standards” are used for grading • Essential to check for label (e.g., grade of rejection) consistency

  9. A. Our Approach • Propose a leave-one-out perturbed classification matrix: • Flip the label of each training sample in turn and compare the resulting classifier with the classifier trained on the original training set • Look for differences in the sets obtained from Support Vector Machines • E.g., sample A is suspected of mislabeling if flipping A’s label increases accuracy • Effectiveness shown on 3 real microarray data sets with ground truths
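The label-flipping idea can be sketched as below. This is a simplified illustration under stated assumptions, not the published method: a nearest-centroid classifier stands in for the Support Vector Machine, the criterion is only "leave-one-out accuracy increases after the flip", and the function names and toy data are hypothetical.

```python
import numpy as np

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (labels 0/1).

    Assumes each class keeps at least one sample when any single sample is held out.
    """
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False
        X_train, y_train = X[mask], y[mask]
        c0 = X_train[y_train == 0].mean(axis=0)
        c1 = X_train[y_train == 1].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        correct += int(pred == y[i])
    return correct / n

def suspect_labels(X, y):
    """Flag samples whose label flip raises leave-one-out accuracy."""
    base = loo_accuracy(X, y)
    suspects = []
    for i in range(len(y)):
        flipped = y.copy()
        flipped[i] = 1 - flipped[i]  # perturb exactly one label
        if loo_accuracy(X, flipped) > base:
            suspects.append(i)
    return suspects

# Toy data (hypothetical): two well-separated clusters, sample 2 deliberately mislabeled.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5],
              [10, 10], [11, 10], [10, 11], [11, 11], [10.5, 10.5]], dtype=float)
y = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1, 1])  # sample 2 belongs to cluster 0

suspects = suspect_labels(X, y)
```

Only the deliberately mislabeled sample is flagged here, because flipping any correctly labeled sample adds an error instead of removing one.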

  10. B. MDQC: Our Approach to Microarray Quality Control [FZN+07] • collapses all values in QC reports into measures to assess the quality of the array, the sample, and the RNA • measures the distance of each array to an “average” array in the study, adjusting for covariances • accounts for interrelation among measures to identify outlying arrays that are not evident from inspection of each measure in isolation
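A distance to an "average" array that adjusts for covariances can be illustrated with a Mahalanobis-type distance. The sketch below is an assumption-laden toy, not the MDQC code: it uses the classical mean and covariance, whereas a robust estimator may be preferable in practice, and the two QC measures and their values are hypothetical.

```python
import numpy as np

def squared_mahalanobis(Q):
    """Squared Mahalanobis distance of each row of Q to the column means.

    Q has one row per array and one column per QC measure.
    Assumption: classical (non-robust) mean and covariance estimates.
    """
    centered = Q - Q.mean(axis=0)
    cov = np.cov(Q, rowvar=False)
    solved = np.linalg.solve(cov, centered.T).T  # cov^{-1} (q - mean) for each row
    return np.einsum("ij,ij->i", centered, solved)

# Hypothetical QC report: two strongly correlated measures for 8 arrays.
# The last array is within the range of each measure but breaks their correlation.
Q = np.array([[-3.0, -3.1], [-2.0, -1.9], [-1.0, -1.1], [0.0, 0.1],
              [1.0, 0.9], [2.0, 2.1], [3.0, 2.9], [2.0, -2.0]])

d2 = squared_mahalanobis(Q)
most_outlying = int(np.argmax(d2))
```

The last array would pass a per-measure range check, yet it has the largest distance, which is the kind of outlier that is not evident from inspecting each measure in isolation.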

  11. [Figure: MDQC measures plotted by sample (x-axis: Sample) in three panels: Sample Quality, Chip Quality, and RNA Quality; labelled arrays include 21-4 and 13-3]

  12. B. MDQC Advantages • Performs a multidimensional analysis and does not require absolute thresholds (which are often arbitrary) • Easy to implement and visualize, and computationally inexpensive (as compared with Affy PLM) • Can suggest potential sources of problems and possible batch effects

  13. Possible Batch Effects [Figure annotations: “Some of the samples from batch 9”; “Batches 1, 2, 3 and 4”]

  14. C. DNA Copy Number Analysis with CGH arrays [SNLM06,07] • Segments of DNA that get duplicated (gains) or deleted (losses) • Chromosomal aberrations are being used to form signatures • Chemotherapy selection for NSCLC • Staging in cancer (e.g., lung and oral)

  15. Computational challenges • Noisy signals • Spatial dependence between adjacent clones • Outliers • Systematic errors • Copy number polymorphisms

  16. C. Two State-of-the-Art Methods • MERGELEVELS • Base-HMM

  17. C. Our Approach • Use a Hidden Markov Model (HMM) to capture spatial dependency between clones • Use a Gaussian mixture model to model the outliers separately from the inliers • Outliers have no spatial dependence • Use prior knowledge about locations of CNPs to ‘inform’ the model about possible locations of outliers
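The combination of an HMM over adjacent clones with a separate outlier component can be sketched as follows. This is a toy illustration, not the LSP-HMM itself: it uses fixed rather than learned parameters, three states (loss, neutral, gain), a broad zero-centred Gaussian for outliers, and a per-clone outlier probability that could be raised at known CNP locations. All names and numeric values are assumptions.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def viterbi_robust(x, outlier_prior, means=(-0.5, 0.0, 0.5), sigma=0.1,
                   outlier_sigma=1.0, stay=0.99):
    """Most likely state path (0=loss, 1=neutral, 2=gain) for log-ratios x.

    outlier_prior[t] in (0, 1) is the assumed chance that clone t is an outlier;
    outliers are emitted independently of the state, so they carry no spatial signal.
    """
    x = np.asarray(x, dtype=float)
    eps = np.asarray(outlier_prior, dtype=float)
    mu = np.asarray(means, dtype=float)
    T, K = len(x), len(mu)

    # Emission: mixture of a state-specific inlier and a shared broad outlier Gaussian.
    inlier = np.log1p(-eps)[:, None] + log_gauss(x[:, None], mu[None, :], sigma)
    outlier = (np.log(eps) + log_gauss(x, 0.0, outlier_sigma))[:, None]
    log_emit = np.logaddexp(inlier, outlier)

    # Sticky transitions encode spatial dependence between adjacent clones.
    log_trans = np.full((K, K), np.log((1 - stay) / (K - 1)))
    np.fill_diagonal(log_trans, np.log(stay))

    score = np.log(1.0 / K) + log_emit[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans  # cand[i, j]: best path ending in i, moving to j
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(K)] + log_emit[t]

    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Toy profile (hypothetical): neutral region with one outlying clone, then a gain.
x = [0.0] * 5 + [0.6] + [0.0] * 5 + [0.5] * 10
path = viterbi_robust(x, outlier_prior=[0.05] * len(x))
```

The single high clone in the neutral region is absorbed by the outlier component instead of being called a one-clone gain, while the sustained shift is still segmented as a gain.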

  18. Example results: LSP-HMM

  19. D. Protein Quantitation: iTRAQ

  20. D. From Peptides to Proteins [Figure: peptides p1 to p12 mapped to proteins A to H; legend: Distinct, Indistinguishable, Subset, Shared Peptide]

  21. D. Protein Groups • Which protein is present in the sample? • Form protein groups where proteins in a group are not distinguishable [Figure: peptides p5 to p9 mapped to proteins C to F]
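One simple way to form such groups is to put proteins with identical sets of identified peptides in the same group. The sketch below covers only that indistinguishable case; subset and shared-peptide relationships would need extra rules. The function name and the peptide-to-protein mapping are hypothetical and are not read off the slide's figure.

```python
from collections import defaultdict

def protein_groups(protein_to_peptides):
    """Group proteins whose identified peptide sets are identical.

    Returns a sorted list of groups, each a sorted list of protein names.
    """
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical mapping: C and D are supported by exactly the same peptides.
example = {
    "C": {"p5", "p6"},
    "D": {"p5", "p6"},
    "E": {"p6", "p7", "p8"},
    "F": {"p9"},
}
groups = protein_groups(example)
```

With this mapping, C and D cannot be told apart from the peptide evidence and end up in one group, while E and F each form their own.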

  22. D. Linking Protein Groups Across Different Sets: a Challenge • Proteins G and H may not be identified in Set 2 [Figure: Set 1 (proteins G, H) and Set 2 (proteins G, H, I) with peptides p4, p10, p11, p12; annotations: “divide”, “merge”, “Two protein groups”, “One protein group”]

  23. Conclusions • We propose an algorithm to solve the protein group linking problem (submitted) • We discussed some approaches to cleansing and pre-processing for clinical, mRNA, DNA and proteomics data • In our biomarker studies, dealing with these issues has proven to be significant to the analysis

  24. Thank You!
