
Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data

Kwan R. Lee, Ph.D., Lei A. Zhu, Amit Bhattacharyya, and J. Alan Menius. Biomedical Data Sciences, GlaxoSmithKline. kwan.lee@gsk.com


Presentation Transcript


  1. Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data. Kwan R. Lee, Ph.D., Lei A. Zhu, Amit Bhattacharyya, and J. Alan Menius. Biomedical Data Sciences, GlaxoSmithKline. kwan.lee@gsk.com. NISS Metabolomics Workshop, 2005

  2. Overview • Systems Biology • Challenges for Statisticians • Possible solutions • Example of integrative data analysis • Summary and discussion

  3. Of mice and men?

  4. Integrate knowledge and technologies • Reduce attrition by running coordinated studies in animal and man

  5. Focusing on one platform may miss an obvious signal!

  6. How can efficacy failures be attacked? • Classic Phenotypic Approach (Animal Phenotype, Human Phenotype): few data to support the analogy • Integrative Biology (Animal Phenotype, Human Phenotype, Animal Biomarker Fingerprint, Human Biomarker Fingerprint): many data to support the analogy

  7. ‘Systems Biology’ approach to drug discovery

  8. Experimental Platforms: Non-omics and Omics, what are they? [Figure: panels for LC-MS lipids, 1H NMR metabolites, “non-omic” markers, Affy transcriptome, and LC-MS metabolites; groups Normal and Disease, treatments Veh, A, B, C, D]

  9. Experimental Platforms: Non-omics and Omics, what are they? (cont.) • Traditional Blood Chemistry (non-omics) • Gene Expression (transcriptomics) • Metabolite (metabonomics) • Lipid (lipomics) • Protein (proteomics)

  10. Five Challenges • Data Pre-processing • High Dimensionality • Multiple Testing for Marker Selection • Data Integration • Validation of the Prediction Model

  11. Challenge #1: Data Pre-processing • Peak Alignment (NMR, LC/MS) • Normalization (Gene Chip, NMR, LC/MS data) • Why? To remove systematic bias in the data • Normalization within a platform makes data comparable across samples

  12. Challenge #2: High Dimensionality (# of subjects << # of variables) • Blood Chemistry: 9 markers • Gene Expression: 22,000 probe sets • Lipid LC/MS: 2,000 peaks • Metabolite LC/MS: 3,000 peaks • NMR: 500 buckets [Figure: data matrix with rows Animal 1 to Animal 100 and columns for blood chemistry (Choles, Trig, ...), probe sets 1 to 22,000, lipids 1 to 2,000, metabolites 1 to 3,000, and NMR buckets 1 to 500]

  13. Challenge #3: Multiple Testing in Variable Selection [Figure: Noise + Signal = Signal+Noise; panels compare No Adjustment for Multiple Testing, FWER Adjustment, and FDR]
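As a minimal sketch of the contrast on this slide, the Python snippet below applies a Bonferroni cut-off (one way to control the FWER) and the Benjamini-Hochberg step-up rule (one way to control the FDR) to the same p-values. The function names, the example p-values, and alpha = 0.05 are illustrative assumptions; the slides do not say which adjustment procedures were used.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """FWER control: reject only p-values at or below alpha / m."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals <= alpha / pvals.size

def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control: reject the k smallest p-values, where k is the largest
    rank i with p_(i) <= alpha * i / m."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values for ten markers (not from the presentation).
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
n_unadjusted = int(np.sum(np.asarray(pvals) <= 0.05))
n_fwer = int(bonferroni(pvals).sum())
n_fdr = int(benjamini_hochberg(pvals).sum())
print(n_unadjusted, n_fwer, n_fdr)
```

On these made-up values the unadjusted rule keeps the most markers, Bonferroni the fewest, and the FDR rule falls in between, which is the usual ordering.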

  14. Challenge #4: Data integration [Figure: panels for LC-MS lipids, 1H NMR metabolites, “non-omic” markers, Affy transcriptome, and LC-MS metabolites; groups Normal and Disease, treatments Veh, A, B, C, D]

  15. Challenge #4: Data integration (cont.) • Integration Approach 1: Platform A (20,000s of variables) and Platform B (1,000s of variables) are combined directly into one data set • Integration Approach 2: dimension reduction (e.g., variable selection) first reduces Platform A to 1,000s of variables and Platform B to 100s of variables, and the reduced platforms are then combined

  16. Challenge #4: Data integration, Example 1 • Integration approach 1: Simple data integration • When the platform data are simply combined, the platform with the largest amount of data and variability will dominate the other platforms

  17. PCA on Non-omics, Transcriptomics, and Combined [Figure: score plots for Non-omics (20 variables), Transcriptomics (12,488 variables), and Combined (12,508 variables)] • Mirror image! • Transcriptomics data dominate Non-omics data!

  18. PCA on Non-omics, Transcriptomics, and Combined [Figure: score plots for Non-omics (20 variables), Transcriptomics (20 PCs), and Combined (40 variables)] • Like a mirror image!
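The two PCA slides above can be illustrated with a small sketch: when a 20-variable block is concatenated with a much larger block, the large block carries almost all of the total variance, and compressing it to 20 principal-component scores first gives the two blocks equal weight. Everything here is an assumption made for illustration: the simulated data, the stand-in block size of 2,000 (instead of 12,488), the unit-variance scaling, and the helper names `autoscale`, `pc_scores`, and `block_variance_share`.

```python
import numpy as np

def autoscale(X):
    """Center each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pc_scores(X, n_components):
    """Scores on the first principal components, computed by SVD."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def block_variance_share(blocks):
    """Fraction of the total variance contributed by each block."""
    totals = np.array([b.var(axis=0, ddof=1).sum() for b in blocks])
    return totals / totals.sum()

rng = np.random.default_rng(0)
n_animals = 40
non_omics = autoscale(rng.normal(size=(n_animals, 20)))
transcripts = autoscale(rng.normal(size=(n_animals, 2000)))  # stand-in block

# Approach on slide 17: concatenate everything as-is.
share_naive = block_variance_share([non_omics, transcripts])

# Approach on slide 18: replace the large block by its first 20 PC scores.
compressed = autoscale(pc_scores(transcripts, 20))
share_compressed = block_variance_share([non_omics, compressed])

print(share_naive, share_compressed)
```

Rescaling the PC scores to unit variance is a design choice of this sketch, not something the slide states; without it the leading components would still weigh more than the non-omics variables.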

  19. Challenge #4: Data integration, Example 2 • Integration approach 2: Integrate on selected markers • 9 blood chemistry + 2,000 probe sets + 150 metabolites • Some platforms still contribute more selected markers than others • How to weight different platforms appropriately? E.g., the 9 blood chemistry markers are known to relate to disease or drug • Identify relationships among the probe sets, metabolites, and blood chemistry markers in terms of biological pathways

  20. Principal Component Analysis (PCA) • Projection of 67 animals: 28 normal (black), 39 disease (red) • All markers used for projection (9 NO, 1991 TA, 115 MT)

  21. Loading Plot

  22. Partial Least Squares Discriminant Analysis (PLS-DA) • Disease group only • Groups: Vehicle, Drug

  23. PLS-DA: Corresponding projection of all markers (9 NO, 1991 TA, 115 MT) • Which are important drug markers? • Groups: Veh, Drug

  24. Drug markers ranked by importance or by coefficients • Marker importance by variable importance on projection (VIP) • Up or down regulation by coefficients

  25. Validation of the model: R2, Q2, and permutation tests repeated 100 times (P < 0.01)

  26. Variation explained by each platform: PLS-DA for prediction of 2 experimental groups • Two groups: HFD, vehicle and HFD, drug treated • The above table is based on a 2-component model. If the 4th model uses more components, 91% of the variation in the data can be explained by 4 components. • Q2(Y) = amount of variation among the 2 groups explained by the model (cross-validated)
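For reference, the usual definitions behind R2 and Q2 can be written as follows. The Q2 expression matches the formula 1 - PRESS/TSS used later in the deck; the symbols y_i, the fitted value, and the cross-validated prediction are notation assumed here, not taken from the slides.

```latex
\[
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad
Q^2 = 1 - \frac{\mathrm{PRESS}}{\mathrm{TSS}},
\]
\[
\mathrm{RSS} = \sum_i \bigl(y_i - \hat{y}_i\bigr)^2, \qquad
\mathrm{PRESS} = \sum_i \bigl(y_i - \hat{y}_{(-i)}\bigr)^2, \qquad
\mathrm{TSS} = \sum_i \bigl(y_i - \bar{y}\bigr)^2 ,
\]
```

where the fitted value $\hat{y}_i$ comes from the model trained on all samples and $\hat{y}_{(-i)}$ is the prediction for sample $i$ from a model fitted without it.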

  27. Challenge #5: Validation of the Prediction Model • Correct way of doing cross-validation • Especially when the variables are selected • Is your prediction accuracy significant?

  28. Random Noise Data • Simulate 20,000 marker columns of random noise for 100 patients and one additional column containing arbitrary class labels. • Select the 5 marker columns most correlated with the class label. • Build a prediction model for the class label based on these 5 selected markers.
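The simulation described on this slide can be sketched in a few lines of Python. The seed, the normal distribution for the noise, the balanced 50/50 labels, and the use of Pearson correlation for ranking are assumptions of this sketch; the slide only says the markers are random noise and the labels arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2005)
n_patients, n_markers = 100, 20_000

# Pure noise markers and arbitrary (here: balanced, shuffled) class labels.
X = rng.normal(size=(n_patients, n_markers))
labels = rng.permutation(np.repeat([0, 1], n_patients // 2))

# Pearson correlation of every marker column with the class label.
Xc = X - X.mean(axis=0)
yc = labels - labels.mean()
corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Keep the 5 markers with the largest absolute correlation.
top5 = np.argsort(-np.abs(corr))[:5]
top5_abs_corr = np.abs(corr[top5])

print("typical |r| over all markers:", np.abs(corr).mean())
print("|r| of the 5 selected markers:", top5_abs_corr)
```

Even though no marker carries any signal, the selected five show correlations several times larger than the typical value, which is why a model built on them can look predictive on the same data.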

  29. PCA of Full Markers

  30. PLS-DA on Random Noise Data • Running a full model in SIMCA does not yield a model: there is no significant Q2. • The multivariate approach is conservative. • Q2 measures prediction performance. • But the software was forced to fit a 6-component model by PLS-DA (R2 = 1.0, Q2 = 0.225)

  31. Full marker model: PLS-DA

  32. Was it real or by chance?

  33. Select 5 Markers • Selected the top 5 markers by VIP from the over-fitted model and fitted PLS-DA again on the same data. • Now we have R2 = 0.459, Q2 = 0.348

  34. Good prediction from PLS-DA? Q2 = 0.35

  35. Validated by permutation test? Significance of Q2

  36. Selection Bias • When a prediction model is tested on the same data that were used in the first instance to select the markers, selection bias makes the test error overly optimistic. • Many publications have claimed that a small set of selected “genes” is highly predictive. • IBI practice is to use a data set to select markers and use the same data set to fit a prediction model based on the selected markers.

  37. How to correct for selection bias? • External validation should be undertaken subsequent to the feature selection process. • Use an independent test data set (hold-out data set) that was never used for feature selection. • External cross-validation (ECV): cross-validation of the prediction model is external to the selection process. • In other words, make a new selection for each cross-validation round.

  38. Externally Validated PLS: Model and Variable Selection
      • Divide the data set randomly into d parts.
      • Set ecv = 1 (this means hold out one part and use the other d-1 parts for modeling).
      • Set a = 1 (the number of components; continue up to 10).
      • Set k = total number of variables.
      • Loop:
        • Fit the PLS model with the given a and k, PLS(a, k).
        • Predict the hold-out set, compute PRESS(ecv, a, k), and save it.
        • Keep the top half of the variables by an appropriate statistic (coefficient, VIP, t-ratio, etc.).
        • Set k = k/2.
        • Go back to Loop until k = 2.
        • Set a = a + 1, reset k to the total number of variables, and go back to Loop until a = 10.
        • Set ecv = ecv + 1 and go back to Loop until ecv = d.
      • Compute PRESS(a, k) = sum over ecv of PRESS(ecv, a, k).
      • Compute Q2(a, k) = 1 - PRESS(a, k)/TSS.
      • Plot Q2(a, k) vs. log2(k).
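The procedure on this slide can be sketched as a nested loop in Python. This is an illustration under stated assumptions, not the authors' implementation: PLS is a plain single-response NIPALS written in NumPy, variables are ranked by absolute regression coefficient (the slide also allows VIP or t-ratio), at most 3 components are tried instead of 10 to keep the run short, and the toy data with 4 informative variables out of 32 are invented. The names `pls1_fit` and `external_cv_q2` are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """Single-response PLS by NIPALS; returns (x_mean, y_mean, beta)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    n_comp = min(n_comp, X.shape[1], X.shape[0] - 1)
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xr.T @ yr
        norm = np.linalg.norm(w)
        if norm < 1e-10:              # nothing left to explain
            break
        w = w / norm
        t = Xr @ w                    # scores
        tt = t @ t
        p = Xr.T @ t / tt             # X loadings
        c = (yr @ t) / tt             # y loading
        Xr = Xr - np.outer(t, p)      # deflate X and y
        yr = yr - c * t
        W.append(w); P.append(p); q.append(c)
    if not W:
        return x_mean, y_mean, np.zeros(X.shape[1])
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, beta

def external_cv_q2(X, y, d=5, max_comp=3, seed=0):
    """Q2(a, k) with variable selection redone inside every CV round."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), d)
    press = {}
    for test_idx in folds:                        # ecv = 1..d
        train = np.setdiff1d(np.arange(len(y)), test_idx)
        for a in range(1, max_comp + 1):          # number of components
            keep = np.arange(X.shape[1])          # restart from all variables
            while keep.size >= 2:
                xm, ym, beta = pls1_fit(X[np.ix_(train, keep)], y[train], a)
                pred = ym + (X[np.ix_(test_idx, keep)] - xm) @ beta
                key = (a, keep.size)
                press[key] = press.get(key, 0.0) + np.sum((y[test_idx] - pred) ** 2)
                order = np.argsort(-np.abs(beta))   # rank on training data only
                keep = keep[order[: keep.size // 2]]
    tss = np.sum((y - y.mean()) ** 2)
    return {key: 1.0 - value / tss for key, value in press.items()}

# Toy data: 4 informative variables out of 32 (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 32))
y = X[:, :4].sum(axis=1) + 0.1 * rng.normal(size=120)
q2 = external_cv_q2(X, y)
for (a, k), value in sorted(q2.items()):
    print(f"a={a} k={k:2d} log2(k)={np.log2(k):.0f} Q2={value:.3f}")
```

The key point is that the ranking and halving of variables happens inside each cross-validation round, using only the training parts, so the held-out part never influences which variables are kept.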

  39. Simulation of 2000 Random Data (R. Simon, 2003) • 20 x 6000 data matrix with 10/10 class labels • Repeat 2000 times • Compute 3 different error rates • Re-substitution (wrong) • Cross-validation after selection (wrong) • Cross-validation before selection (correct)
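A scaled-down sketch of this kind of simulation is shown below. It is not a reproduction of the cited study: the nearest-centroid classifier, the choice of 10 genes ranked by a two-sample t-statistic, leave-one-out cross-validation, and 20 repeats instead of 2000 are all assumptions made to keep the example short and fast.

```python
import numpy as np

def top_genes(X, y, n_top):
    """Indices of the n_top columns with the largest absolute t-statistic."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / se
    return np.argsort(-np.abs(t))[:n_top]

def centroid_predict(X_train, y_train, X_new):
    """Assign each new sample to the class with the nearer centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_new - c0) ** 2).sum(axis=1)
    d1 = ((X_new - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def three_error_rates(X, y, n_top=10):
    n = len(y)
    genes_all = top_genes(X, y, n_top)            # selected on ALL samples
    # 1) Re-substitution: train and test on the same samples.
    resub = np.mean(centroid_predict(X[:, genes_all], y, X[:, genes_all]) != y)
    wrong = correct = 0
    for i in range(n):
        train = np.arange(n) != i
        # 2) CV after selection: genes were already chosen using sample i.
        pred = centroid_predict(X[train][:, genes_all], y[train], X[[i]][:, genes_all])
        wrong += pred[0] != y[i]
        # 3) Selection inside the CV loop: genes re-chosen without sample i.
        genes = top_genes(X[train], y[train], n_top)
        pred = centroid_predict(X[train][:, genes], y[train], X[[i]][:, genes])
        correct += pred[0] != y[i]
    return resub, wrong / n, correct / n

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 10)                          # 10/10 class labels
results = np.array([three_error_rates(rng.normal(size=(20, 6000)), y)
                    for _ in range(20)])
mean_resub, mean_cv_after, mean_cv_inside = results.mean(axis=0)
print(mean_resub, mean_cv_after, mean_cv_inside)
```

Because the data are pure noise, an honest error estimate should be near chance; only the version that repeats the gene selection inside every cross-validation round behaves that way.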

  40. Results of 2000 Random Data

  41. Permutation testing • Because of the high dimensionality of gene expression data, it may be possible to achieve relatively small error rates even for random data. • To assess the significance of the classification results, a permutation test can be used.
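A permutation test of classification performance can be sketched as follows: compute the cross-validated accuracy on the real labels, recompute it many times on shuffled labels, and report how often the shuffled result is at least as good. The nearest-centroid classifier, leave-one-out cross-validation, 200 permutations, and the simulated data are assumptions of this sketch; `loocv_accuracy` and `permutation_p_value` are hypothetical helper names.

```python
import numpy as np

def loocv_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    n = len(y)
    hits = 0
    for i in range(n):
        train = np.arange(n) != i
        c0 = X[train & (y == 0)].mean(axis=0)
        c1 = X[train & (y == 1)].mean(axis=0)
        pred = int(np.sum((X[i] - c1) ** 2) < np.sum((X[i] - c0) ** 2))
        hits += pred == y[i]
    return hits / n

def permutation_p_value(X, y, n_perm=200, seed=0):
    """P-value: share of label shufflings that do at least as well."""
    rng = np.random.default_rng(seed)
    observed = loocv_accuracy(X, y)
    null = np.array([loocv_accuracy(X, rng.permutation(y)) for _ in range(n_perm)])
    # The +1 terms count the observed labeling itself as one permutation.
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p_value

rng = np.random.default_rng(42)
y = np.repeat([0, 1], 15)
X_signal = rng.normal(size=(30, 5)) + 2.0 * y[:, None]   # real group shift
X_noise = rng.normal(size=(30, 5))                       # no group effect

acc_signal, p_signal = permutation_p_value(X_signal, y)
acc_noise, p_noise = permutation_p_value(X_noise, y)
print(acc_signal, p_signal, acc_noise, p_noise)
```

If variables are selected before modeling, the selection step would have to be repeated inside `loocv_accuracy` for every permutation as well; this sketch leaves selection out to stay short.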

  42. Challenge #5: Validation of the Prediction Model (summary) • Correct way of doing cross-validation • All the steps of the prediction modeling should be cross-validated. • Each cross-validation step should start from scratch. • Is your prediction accuracy significant? • Random data can give you a low prediction error. • Permutation tests, bootstrap aggregation

  43. Summary and Discussion • Recent technological advances present challenging and interesting biological data at the molecular level. • Statistics and multivariate analysis play an important role in understanding and extracting knowledge from these types of data. • Integrative analysis is even more challenging, and we presented some solutions to these challenges. There is plenty of room for improvement.

  44. Acknowledgement • GlaxoSmithKline: High Throughput Biology • Biomedical Data Sciences • Genomics and Proteomics Science • Pathology, Cellular & Biochemical Toxicology • Discovery IT

  45. Data exploration: Present Challenges • “Data is an extremely valuable asset, but like a cash crop, unless harvested, it is wasted.” (Sid Adelman)
