Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery

Jake Y. Chen, Ph.D. IUPUI Indiana Center for Systems Biology & Personalized Medicine http://bio.informatics.iupui.edu

Presentation Transcript


  1. Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery Jake Y. Chen, Ph.D. IUPUI Indiana Center for Systems Biology & Personalized Medicine http://bio.informatics.iupui.edu

  2. Polyp and Colorectal Cancer • Polyp vs. Colorectal Cancer • Polyps are benign tumors of the large intestine. • They do not invade nearby tissue or spread to other parts of the body. • If not removed from the large intestine, they may become malignant (cancerous) over time. • Most cancers of the large intestine are believed to have developed from polyps. Photo courtesy of National Cancer Institute • Colon Cancer vs. Rectal Cancer • Share many commonalities, including molecular mechanisms. • Tend to be treated differently.

  3. Colorectal Cancer Molecular Pathways A. Walther, et al. (2009) Nature Reviews Cancer, 9(7) pp. 489-99

  4. Omics/Clinical Data Source: Proteomics/Metabolomics/Lipidomics/Clinical Data

  5. Scientific Questions to Answer • Data Analysis • Which Omics data set has the best predictive power? • Which features in the Omics data are important? • Data Mining • Does integration of Omics data improve the prediction? • Which combination of Omics data has the best predictive power? • Knowledge Discovery • Why do those features in the Omics data have the best predictive power?

  6. Roadmap • Knowledge Discovery of Proteomics Data • Knowledge Discovery of Metabolomics Data • Integrative Data Mining

  7. Proteomics Data Description • Group: Bindley Biosciences Center at Purdue University • Instruments: Agilent's chip cube coupled to the XCT PLUS ESI ion trap • Data format at CCE webportal: mzXML • Number of Samples: Normal: 80; Polyp: 72; Colorectal: 40

  8. LC-MS Proteomics Data Processing Methods • Adapted from N. Jeffries (2005) Bioinformatics, vol. 21, (no. 14), pp. 3066, and S.A. Kazmi, et al. (2006) Metabolomics, vol. 2, (no. 2), pp. 75-83 • Figure panels: LC/MS data “heat map”; image-enhanced LC/MS data “heat map”; Total Ion Chromatogram (TIC) summarized from the enhanced heat map

  9. LC-MS Major Protein Identification: ~25-28 characteristic proteins identified per sample • Identify the most informative TIC R.T. “Grid” • Use Mascot to search for protein ID at R.T. Grid regions • Apply the R.T. Grid to the original spectra

  10. Proteomics Result Interpretation • Figure panels: Proteins Interacted with High-Frequency Proteins from Colon Cancer Group; Proteins Identified from Colon Cancer and Healthy Groups

  11. Proteomics Result Interpretation: A Network Biology Context • Protein network constructed from the top 3 differential proteins • Green-circled proteins are frequently (frequency >= 0.3) detected in the colon cancer patient blood samples by LC/MS. • Node: Protein with evidence from PubMed by searching ("GENE_SYMBOL" AND ("colon" OR "colorectal") AND ("cancer" OR "carcinoma")) • Edge: Protein interaction with confidence score from HAPPI 1.31 (4- and 5-star)
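The node and frequency rules on slide 11 can be illustrated with a minimal Python sketch. It assumes that "frequently detected" means the fraction of patient samples in which a protein was identified; the function names and the detection counts are hypothetical, and only the query template and the 0.3 cutoff come from the slide.

```python
from __future__ import annotations


def pubmed_query(gene_symbol: str) -> str:
    """Fill the slide's PubMed search template with one gene symbol."""
    return (f'("{gene_symbol}" AND ("colon" OR "colorectal") '
            f'AND ("cancer" OR "carcinoma"))')


def frequent_proteins(detections: dict[str, int], n_samples: int,
                      threshold: float = 0.3) -> list[str]:
    """Return proteins detected in at least `threshold` of the samples."""
    return sorted(protein for protein, count in detections.items()
                  if count / n_samples >= threshold)


# Hypothetical detection counts (assumption, not values from the study).
counts = {"BRAF": 18, "NNMT": 14, "XYZ1": 3}
print(pubmed_query("BRAF"))
print(frequent_proteins(counts, n_samples=40))
```

The query string is only built here, not submitted; how the search results were scored as evidence is not stated on the slide.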

  12. Proteomics Result Interpretation: A Biological Pathway Context • BRAF (Serine/threonine-protein kinase B-raf) plays major roles in the Colorectal Cancer Pathway (KEGG data)

  13. Proteomics Result Interpretation: A Biological Pathway Context for NNMT • NNMT (Nicotinamide N-methyltransferase) is involved in Biological Oxidations/Phase II Conjugation/Methylation (from Reactome)

  14. Roadmap • Knowledge Discovery of Proteomics Data • Knowledge Discovery of Metabolomics Data • NMR Data • GCxGC MS Data • Integrative Data Mining

  15. Metabolomics Data Description • Group: Daniel Raftery Laboratory at Purdue University • NMR Data • Instruments: Bruker Avance 500 MHz NMR • Data format at CCE webportal: Excel spreadsheet • Number of Samples: Normal: 53; Polyp: 35; Colorectal: 15 • GCxGC MS Data • Instruments: LECO Pegasus 4D GCxGC-TOF • Data format at CCE webportal: Excel spreadsheet • Number of Samples: Normal: 83; Polyp: 84; Colorectal: 30

  16. NMR Data Analysis Workflow • Signal processing • Extract peaks’ ppm • Search against Human Metabolome Database (2.5) to identify metabolites • Report only significant metabolites

  17. NMR Peak Metabolite Identification Using the Human Metabolome Database 1) Input the peak lists 2) Get the metabolites; leave out those with fewer than 2 matches
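The two steps on slide 17 can be sketched locally as follows. This is a stand-in for illustration, not the Human Metabolome Database search itself: the reference peak table, the metabolite names, and the 0.02 ppm tolerance are all assumptions. Only the rule "leave out those with fewer than 2 matches" is taken from the slide.

```python
from __future__ import annotations


def match_metabolites(sample_peaks: list[float],
                      reference: dict[str, list[float]],
                      tol: float = 0.02,
                      min_matches: int = 2) -> dict[str, int]:
    """Count reference peaks found in the sample (within `tol` ppm) and
    keep only metabolites with at least `min_matches` matched peaks."""
    hits = {}
    for name, ref_peaks in reference.items():
        matched = sum(
            any(abs(ref - obs) <= tol for obs in sample_peaks)
            for ref in ref_peaks
        )
        if matched >= min_matches:
            hits[name] = matched
    return hits


# Hypothetical reference shifts and sample peaks (ppm), for illustration only.
reference = {
    "metabolite_A": [1.33, 4.11],
    "metabolite_B": [3.05, 3.93],
    "metabolite_C": [2.40],
}
sample_peaks = [1.32, 3.04, 4.12, 7.50]
print(match_metabolites(sample_peaks, reference))
```

In this toy input, metabolite_B has a single matching peak and metabolite_C has none, so both are left out.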

  18. Significant Metabolites Identified from NMR Metabolomics Data • Marker metabolites? • Shared metabolites • Population Frequency =

  19. D-Arabitol Identified from NMR Results: Involved in Pentose and Glucuronate Interconversions Pathways

  20. Roadmap • Knowledge Discovery of Proteomics Data • Knowledge Discovery of Metabolomics Data • NMR Data • GCxGC MS Data • Integrative Data Mining

  21. Results from GCxGC MS Data I: Metabolite identification is more straightforward

  22. Results from GCxGC MS Data II • A. Polyp vs. Healthy • B. Polyp vs. Colorectal • C. Colorectal vs. Healthy

  23. Comparative Results (Intensity vs. Population): Marker Metabolite Panel • Clustering of three groups • Intensity-based heat map • Population-frequency-based heat map

  24. Metabolites Identified from GCxGC MS Results: Involved in Fatty Acid Biosynthesis Pathways

  25. Roadmap • Knowledge Discovery of Proteomics Data • Knowledge Discovery of Metabolomics Data • Integrative Data Mining

  26. Data Set Description • Diet, Lipidomics, Oxidative and VD • # of features and the total # of subjects vary • Three classes are balanced to the least common denominator • Healthy vs. Polyp • Healthy vs. Colorectal • Polyp vs. Colorectal
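Slide 26 does not say how the classes were balanced. One plausible reading of "balanced to the least common denominator" is random downsampling of every class to the size of the smallest one, which the sketch below assumes. The function name, subject IDs, and class sizes are hypothetical.

```python
from __future__ import annotations

import random


def balance_classes(samples_by_class: dict[str, list[str]],
                    seed: int = 0) -> dict[str, list[str]]:
    """Randomly downsample each class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n = min(len(ids) for ids in samples_by_class.values())
    return {label: sorted(rng.sample(ids, n))
            for label, ids in samples_by_class.items()}


# Hypothetical subject IDs and class sizes (assumption, not study data).
cohort = {
    "healthy": [f"H{i}" for i in range(12)],
    "polyp": [f"P{i}" for i in range(9)],
    "colorectal": [f"C{i}" for i in range(5)],
}
balanced = balance_classes(cohort)
print({label: len(ids) for label, ids in balanced.items()})
```

Downsampling discards data from the larger classes; whether the authors did this or used another balancing scheme is not stated in the deck.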

  27. Predictive Modeling Methods (workflow figure: Raw Dataset, Clean Dataset, Classification Model, Hypothesis) • Data Preprocessing • Filtering outliers (more than three standard deviations away from the mean) • Data normalization (transforming to the 0-1 range) • Binning categorical data using the quantile binning method • Missing Value Treatment • Replaced with the mean value of the attribute in the group • Support Vector Machine (SVM) Classifier Kernel • Radial Basis Function (RBF) kernel is used • Feature Selection Methods • Approach #1: Two-sample unpaired t-tests at the 5% significance level. • Approach #2: SVM Attribute Evaluator with Ranker algorithm. • Features from t-tests are filtered using p-values • K-fold cross-validation
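The preprocessing bullets on slide 27 can be sketched with the Python standard library. The function names are hypothetical, and several details are assumptions because the slide does not give them: the population standard deviation for the outlier rule, four quantile bins, and the Welch (unequal-variance) form of the unpaired t statistic. The SVM itself and the p-value computation are not shown.

```python
from __future__ import annotations

import math
import statistics


def impute_group_mean(values: list) -> list[float]:
    """Replace missing entries (None) with the mean of the observed values."""
    mean = statistics.fmean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]


def drop_outliers(values: list[float], n_sd: float = 3.0) -> list[float]:
    """Keep values within `n_sd` standard deviations of the mean."""
    mean, sd = statistics.fmean(values), statistics.pstdev(values)
    return [v for v in values if sd == 0 or abs(v - mean) <= n_sd * sd]


def min_max(values: list[float]) -> list[float]:
    """Rescale values linearly to the 0-1 range."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def quantile_bin(values: list[float], n_bins: int = 4) -> list[int]:
    """Assign each value a bin index from quantile cut points."""
    cuts = statistics.quantiles(values, n=n_bins)
    return [sum(v > c for c in cuts) for v in values]


def t_statistic(a: list[float], b: list[float]) -> float:
    """Unpaired two-sample t statistic, Welch form (assumption)."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


# Hypothetical feature values for one attribute in one group.
raw = [10, 11, 9, None, 12, 8, 10, 11, 9, 10, 11, 9, 10, 12, 8, 10, 500]
clean = drop_outliers(impute_group_mean(raw))
print(min_max(clean))
print(quantile_bin(clean))
```

A feature would then be kept under Approach #1 when the p-value derived from `t_statistic` falls below 0.05; that step needs a t-distribution CDF, which the standard library does not provide.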

  28. Dietary Attributes as Predictors • Colorectal vs. Healthy (p-values): Salad 2.53E-02; Tomato 9.57E-01; Egg 3.71E-02; Milk 5.60E-02 • SVM predictor accuracy = 65% • Polyp vs. Healthy (p-values): Ice cream 2.38E-02; Rice 4.21E-01; Tea 4.11E-02; Shellfish 1.21E-01 • SVM predictor accuracy = 64%

  29. Lipidomics T-Test Results • Significant features selected by t-test, with their corresponding p-values

  30. Integrating Lipidomics with Clinical Features • Performance comparisons: without clinical features vs. with clinical features • * Since the number of subjects was fewer than 15, 3-fold cross-validation accuracy was reported.
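The footnote on slide 30 reports 3-fold cross-validation for a small cohort. A minimal sketch of how such folds might be formed is given below; it assumes stratified splitting so that each fold keeps the class proportions, which the slide does not state. The function names and labels are hypothetical.

```python
from __future__ import annotations

import random


def stratified_folds(labels: list[str], k: int = 3,
                     seed: int = 0) -> list[list[int]]:
    """Split sample indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class: dict[str, list[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds: list[list[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal indices round-robin
    return [sorted(fold) for fold in folds]


def train_test_splits(labels: list[str], k: int = 3, seed: int = 0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds)
                       if j != i for idx in fold)
        yield train, test


# Hypothetical labels for a 12-subject comparison.
labels = ["healthy"] * 6 + ["colorectal"] * 6
for train, test in train_test_splits(labels):
    print(len(train), len(test), [labels[i] for i in test])
```

With fewer than 15 subjects, three folds leave each test fold with only a handful of samples, so the reported accuracy would be expected to vary noticeably between splits.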

  31. Messages • Individual Omics data sets have variable predictive performance • Thorough statistical filtering plus biological knowledge integration is needed to counter the inherently high level of data noise • Integration of different Omics data with clinical data can improve predictive performance

  32. Acknowledgment We thank all the members in our team.