Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery

Jake Y. Chen, Ph.D., IUPUI, Indiana Center for Systems Biology & Personalized Medicine, http://bio.informatics.iupui.edu

Polyp and Colorectal Cancer • Polyp vs. Colorectal Cancer • Polyps are benign tumors of the large intestine. • They do not invade nearby tissue or spread to other parts of the body. • If not removed from the large intestine, they may become malignant (cancerous) over time. • Most cancers of the large intestine are believed to have developed from polyps. • Colon Cancer vs. Rectal Cancer • They share many commonalities, including molecular mechanisms. • They tend to be treated differently. (Photo courtesy of the National Cancer Institute)

Colorectal Cancer Molecular Pathways A. Walther, et al. (2009) Nature Reviews Cancer, 9(7) pp. 489-99

Omics/Clinical Data Sources • Proteomics • Metabolomics • Lipidomics • Clinical Data

Scientific Questions to Answer • Data Analysis • Which Omics data has the best prediction power? • Which features in Omics data are important? • Data Mining • Does integration of Omics data improve the prediction? • Which combination of Omics data has the best prediction power? • Knowledge Discovery • Why do those features in Omics data have the best prediction power?

Roadmap • Knowledge Discovery of Proteomics Data • Knowledge Discovery of Metabolomics Data • Integrative Data Mining

Proteomics Data Description • Group: Bindley Biosciences Center at Purdue University • Instruments: Agilent's chip cube coupled to the XCT PLUS ESI ion trap • Data format at CCE webportal: mzXML • Number of Samples: Normal: 80; Polyp: 72; Colorectal: 40

LC-MS Proteomics Data Processing Methods • LC/MS data “heat map” • Image-enhanced LC/MS data “heat map” • Total Ion Chromatogram (TIC) summarized from the enhanced heat map • Adapted from N. Jeffries (2005) Bioinformatics, vol. 21, no. 14, pp. 3066, and S.A. Kazmi, et al. (2006) Metabolomics, vol. 2, no. 2, pp. 75-83
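
A minimal sketch of how a TIC can be summarized from an LC/MS intensity map, assuming the map is stored as rows of retention-time scans and columns of m/z bins. The toy values and the function name `total_ion_chromatogram` are hypothetical, and the image-enhancement step from the slide is not reproduced here.

```python
def total_ion_chromatogram(intensity_map):
    """Sum the intensities over all m/z bins for each retention-time scan."""
    return [sum(scan) for scan in intensity_map]


# Hypothetical toy heat map: each row is one retention-time scan,
# each column is one m/z bin. These are not values from the study.
heat_map = [
    [0.0, 1.0, 2.0],
    [5.0, 0.0, 3.0],
    [1.0, 1.0, 1.0],
]

tic = total_ion_chromatogram(heat_map)

# Index of the scan with the highest total ion current.
peak_scan = max(range(len(tic)), key=tic.__getitem__)
print(tic, peak_scan)
```

Collapsing the m/z axis this way gives one value per retention time, which is what makes it possible to pick informative retention-time regions before going back to the full spectra.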

LC-MS Major Protein Identification • Identify the most informative TIC retention time (R.T.) “grid” • Apply the R.T. grid to the original spectra • Use Mascot to search for protein IDs at the R.T. grid regions • ~25-28 characteristic proteins identified per sample

Proteomics Result Interpretation • Proteins identified from the colon cancer and healthy groups • Proteins interacting with high-frequency proteins from the colon cancer group

Proteomics Result Interpretation: A Network Biology Context • Protein network constructed from the top 3 differential proteins • Green-circled proteins are frequently (>=0.3) detected in the colon cancer patient blood samples using LC/MS. • Node: protein with evidence from PubMed, found by searching ("GENE_SYMBOL" AND ("colon" OR "colorectal") AND ("cancer" OR "carcinoma")) • Edge: protein interaction with a confidence score from HAPPI 1.31 (4- and 5-star)
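
The node and edge rules above can be sketched as follows. The query template is the one quoted on the slide; everything else is an assumption for illustration: detection frequency is taken to be the fraction of samples in which a protein was detected, and the protein names `PROT_A` to `PROT_E`, the detection calls, and the star ratings are hypothetical placeholders, not study data or actual HAPPI records.

```python
def pubmed_query(gene_symbol):
    """Fill a gene symbol into the PubMed search template quoted on the slide."""
    return (f'("{gene_symbol}" AND ("colon" OR "colorectal") '
            f'AND ("cancer" OR "carcinoma"))')


def detection_frequency(detected):
    """Assumed definition: fraction of samples in which the protein was detected."""
    return sum(detected) / len(detected)


def high_confidence_edges(edges, min_stars=4):
    """Keep interactions rated at least min_stars (4- and 5-star by default)."""
    return [(a, b) for a, b, stars in edges if stars >= min_stars]


# Hypothetical per-sample detection calls (True = detected).
detections = {
    "PROT_A": [True, True, False, True, False],    # 0.6
    "PROT_B": [True, False, False, True, False],   # 0.4
    "PROT_C": [False, False, False, True, False],  # 0.2
}
frequent = sorted(
    name for name, calls in detections.items()
    if detection_frequency(calls) >= 0.3
)

# Hypothetical interactions as (protein_a, protein_b, star_rating).
edges = [("PROT_A", "PROT_D", 5), ("PROT_B", "PROT_E", 4), ("PROT_A", "PROT_C", 3)]
kept = high_confidence_edges(edges)

print(pubmed_query("BRAF"))
print(frequent, kept)
```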

Proteomics Result Interpretation: A Biological Pathway Context • BRAF (Serine/threonine-protein kinase B-raf) plays a major role in the Colorectal Cancer Pathway (KEGG data)

Proteomics Result Interpretation: A Biological Pathway Context for NNMT • NNMT (Nicotinamide N-methyltransferase) is involved in Biological Oxidations/Phase II Conjugation/Methylation (from Reactome)

Roadmap • Knowledge Discovery of Proteomics Data • Knowledge Discovery of Metabolomics Data • NMR Data • GCxGC MS Data • Integrative Data Mining

Metabolomics Data Description • Group: Daniel Raftery Laboratory at Purdue University • NMR Data • Instruments: Bruker Avance 500 MHz NMR • Data format at CCE webportal: Excel spreadsheet • Number of Samples: Normal: 53; Polyp: 35; Colorectal: 15 • GCxGC MS Data • Instruments: LECO Pegasus 4D GCxGC-TOF • Data format at CCE webportal: Excel spreadsheet • Number of Samples: Normal: 83; Polyp: 84; Colorectal: 30

NMR Data Analysis Workflow • Signal processing • Extract the peaks' ppm values • Search against the Human Metabolome Database (2.5) to identify metabolites • Report only significant metabolites

NMR Peak Metabolite Identification Using the Human Metabolome Database • 1) Input the peak lists • 2) Get the metabolites; leave out those with fewer than 2 matches
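
A minimal sketch of the two steps above, assuming a local reference table of chemical shifts rather than the database's own search interface. The tolerance of 0.02 ppm, the reference entries `metabolite_A` to `metabolite_C`, and the function name `match_metabolites` are all hypothetical; only the rule of leaving out metabolites with fewer than 2 matches comes from the slide.

```python
def match_metabolites(peaks_ppm, reference, tolerance=0.02, min_matches=2):
    """Return {metabolite: matched peak count}, dropping weakly matched entries.

    A reference peak counts as matched when any observed peak lies within
    `tolerance` ppm of it (assumed matching rule).
    """
    hits = {}
    for name, ref_peaks in reference.items():
        matched = sum(
            1 for ref in ref_peaks
            if any(abs(ref - peak) <= tolerance for peak in peaks_ppm)
        )
        # Leave out metabolites with fewer than `min_matches` matched peaks.
        if matched >= min_matches:
            hits[name] = matched
    return hits


# Hypothetical observed peak list (ppm) and reference shifts; not real data.
observed = [1.33, 3.66, 4.11]
reference = {
    "metabolite_A": [1.32, 4.10],  # both peaks matched, kept
    "metabolite_B": [3.65, 7.40],  # only one peak matched, left out
    "metabolite_C": [2.05],        # no peak matched, left out
}

identified = match_metabolites(observed, reference)
print(identified)
```

Requiring at least two matched peaks reduces the chance that a single coincidental shift assigns the wrong metabolite.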

Significant Metabolites Identified from NMR Metabolomics Data • Marker metabolites? • Shared metabolites • Population Frequency =

D-Arabitol Identified from NMR Results • Involved in Pentose and Glucuronate Interconversions Pathways

Roadmap • Knowledge Discovery of Proteomics Data • Knowledge Discovery of Metabolomics Data • NMR Data • GCxGC MS Data • Integrative Data Mining

Results from GCxGC MS Data I • Metabolite identification is more straightforward

Results from GCxGC MS Data II • A. Polyp vs Healthy • B. Polyp vs Colorectal • C. Colorectal vs Healthy

Comparative Results (Intensity vs. Population) • Marker metabolite panel • Clustering of three groups • Intensity-based heat map • Population frequency-based heat map

Metabolites Identified from GCxGC MS Results • Involved in Fatty Acid Biosynthesis Pathways

Roadmap • Knowledge Discovery of Proteomics Data • Knowledge Discovery of Metabolomics Data • Integrative Data Mining

Data Set Description • Diet, Lipidomics, Oxidative and VD • The # of features and the total # of subjects vary • Three classes are balanced to the least common denominator • Healthy vs. Polyp • Healthy vs. Colorectal • Polyp vs. Colorectal
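
One way to read "balanced to the least common denominator" is random downsampling of every class to the size of the smallest class. The sketch below shows that reading only; it is an assumption, not a confirmed description of the procedure used, and the class sizes and sample IDs are hypothetical.

```python
import random


def balance_classes(samples_by_class, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    smallest = min(len(samples) for samples in samples_by_class.values())
    return {
        label: rng.sample(samples, smallest)
        for label, samples in samples_by_class.items()
    }


# Hypothetical sample IDs per class; the sizes are illustrative only.
cohort = {
    "healthy": [f"H{i}" for i in range(6)],
    "polyp": [f"P{i}" for i in range(4)],
    "colorectal": [f"C{i}" for i in range(3)],
}

balanced = balance_classes(cohort)
print({label: len(samples) for label, samples in balanced.items()})
```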

Predictive Modeling Methods • (Workflow diagram: Raw Dataset, Clean Dataset, Classification Model, Hypotheses) • Data Preprocessing: filtering outliers (more than three standard deviations away from the mean); data normalization (transforming to the 0-1 range); categorical data binned using the quantile binning method • Missing Value Treatment: replaced with the mean value of the attribute within the group • Classifier: support vector machine (SVM) with a radial basis function (RBF) kernel • Feature Selection Methods: Approach #1, two-sample unpaired t-tests at the 5% significance level, with features filtered by p-value; Approach #2, SVM Attribute Evaluator with the Ranker algorithm • K-fold cross-validation
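
The preprocessing steps above can be sketched for a single numeric attribute within one group. This is an illustration under assumptions: the order of the steps (impute, filter, scale, bin), the rank-based form of quantile binning, and all function names and values are hypothetical, and the classifier itself is not shown.

```python
from statistics import mean, stdev


def impute_group_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed_mean = mean(v for v in values if v is not None)
    return [observed_mean if v is None else v for v in values]


def filter_outliers(values, n_sd=3.0):
    """Drop values more than n_sd sample standard deviations from the mean."""
    mu, sd = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) <= n_sd * sd]


def min_max_scale(values):
    """Linearly rescale values to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def quantile_bin(values, n_bins):
    """Equal-frequency binning: assign bin labels 0..n_bins-1 by rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


# Hypothetical attribute values for one group, with one missing entry
# and one extreme value; not data from the study.
raw = [1.0] * 9 + [None] + [2.0] * 10 + [100.0]

clean = filter_outliers(impute_group_mean(raw))
scaled = min_max_scale(clean)
binned = quantile_bin(scaled, n_bins=4)
print(len(clean), min(scaled), max(scaled), binned)
```

Note that a three-standard-deviation filter can only remove a value when the group is reasonably large, since in very small samples no single point can lie that far from the mean.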

Dietary Attributes as Predictors • Colorectal vs. Healthy (SVM predictor accuracy = 65%), p-values: Salad 2.53E-02; Tomato 9.57E-01; Egg 3.71E-02; Milk 5.60E-02 • Polyp vs. Healthy (SVM predictor accuracy = 64%), p-values: Ice cream 2.38E-02; Rice 4.21E-01; Tea 4.11E-02; Shellfish 1.21E-01

Lipidomics T-Test Results • Significant features selected by t-test, with their corresponding p-values
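
A minimal sketch of selecting features by a two-sample unpaired t-test at the 5% level, using only the Python standard library. The slides do not say whether equal variances were assumed, so the Welch (unequal variance) form here is an assumption; the p-value is obtained by numerically integrating the Student t density, and the feature names and values are hypothetical.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def welch_t_test(a, b, steps=2000):
    """Return (t statistic, two-sided p-value) for two unpaired samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    se2 = va + vb
    if se2 == 0:  # both groups constant: no sampling variability
        return (0.0, 1.0) if mean(a) == mean(b) else (math.inf, 0.0)
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    # Simpson's rule for the integral of the density from 0 to |t|
    # (steps must be even).
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    p = max(0.0, 1.0 - 2.0 * total * h / 3.0)
    return t, p


def select_features(groups, alpha=0.05):
    """Keep features whose p-value is below alpha, with their p-values."""
    selected = {}
    for name, (a, b) in groups.items():
        _, p = welch_t_test(a, b)
        if p < alpha:
            selected[name] = p
    return selected


# Hypothetical features: (values in group 1, values in group 2).
features = {
    "lipid_1": ([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]),  # clearly separated
    "lipid_2": ([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]),       # overlapping
}
significant = select_features(features)
print(sorted(significant))
```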

Integrating Lipidomics with Clinical Features • Performance comparisons: without clinical features vs. with clinical features • * Because the number of subjects was fewer than 15, 3-fold cross-validation accuracy was reported.
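
The fold-count rule in the footnote can be sketched as a small k-fold splitter. Only the rule "fewer than 15 subjects, use 3 folds" comes from the slide; the default of 10 folds for larger sets, the interleaved fold assignment, and the function names are assumptions made for illustration.

```python
def choose_k(n_subjects, default_k=10, small_n=15, small_k=3):
    """Use 3 folds when there are fewer than 15 subjects (rule from the slide).

    default_k=10 is an assumed value; the slides say only "K-fold".
    """
    return small_k if n_subjects < small_n else default_k


def k_fold_indices(n_subjects, k):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    # Interleaved assignment: subject i goes to fold i % k.
    folds = [list(range(start, n_subjects, k)) for start in range(k)]
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


n = 12  # hypothetical subject count below the threshold of 15
k = choose_k(n)
splits = k_fold_indices(n, k)
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

In practice the subject order would be shuffled (and ideally stratified by class) before splitting, so that each fold sees a similar class mix.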

Messages • Individual Omics data sets have variable predictive performance • Thorough statistical filtering plus biological knowledge integration is needed to combat the inherently high level of data noise • Integration of different Omics data with clinical data can improve predictive performance

Acknowledgment We thank all the members in our team.
