
Case Studies



  1. Case Studies

  2. Case Study: Diagnostic Model From Array Gene Expression Data Computational Models of Lung Cancer: Connecting Classification, Gene Selection, and Molecular Sub-typing C. Aliferis M.D., Ph.D., Pierre Massion M.D., I. Tsamardinos Ph.D., D. Hardin Ph.D.

  3. Case Study: Diagnostic Model From Array Gene Expression Data • Specific Aim 1: “Construct computational models that distinguish between important cellular states related to lung cancer, e.g., (i) Cancerous vs Normal Cells; (ii) Metastatic vs Non-Metastatic cells; (iii) Adenocarcinomas vs Squamous carcinomas”. • Specific Aim 2: “Reduce the number of gene markers by application of biomarker (gene) selection algorithms such that small sets of genes can distinguish among the different states (and ideally reveal important genes in the pathophysiology of lung cancer).”

  4. Case Study: Diagnostic Model From Array Gene Expression Data • Bhattacharjee et al., PNAS, 2001 • 12,600 gene expression measurements obtained using Affymetrix oligonucleotide arrays • 203 patients and normal subjects, 5 disease types (plus staging and survival information)

  5. Case Study: Diagnostic Model From Array Gene Expression Data • Linear and polynomial-kernel Support Vector Machines (LSVM and PSVM, respectively); C optimized via C.V. from {10^-8, 10^-7, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 0.1, 1, 10, 100, 1000} and degree from the set {1, 2, 3, 4}. • K-Nearest Neighbors (KNN), with k optimized via C.V. • Feed-forward Neural Networks (NNs): 1 hidden layer; number of units chosen (heuristically) from the set {2, 3, 5, 8, 10, 30, 50}; variable-learning-rate backpropagation; custom-coded early stopping with a (limiting) performance goal of 10^-8 (i.e., an arbitrary value very close to zero); number of epochs in the range [100, …, 10000]; fixed momentum of 0.001. • Stratified nested n-fold cross-validation (n = 5 or 7, depending on the task)
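The stratified nested cross-validation design mentioned above can be sketched as follows. This is a minimal illustration, not the study's actual code; the function names `stratified_folds` and `nested_cv_splits` are our own.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, seed=0):
    """Assign sample indices to n_folds folds so that each fold
    preserves (approximately) the class proportions of the data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for y, idxs in by_class.items():
        rng.shuffle(idxs)
        # Deal this class's indices round-robin across the folds
        for i, idx in enumerate(idxs):
            folds[i % n_folds].append(idx)
    return folds

def nested_cv_splits(labels, n_outer, n_inner):
    """Yield (train, inner_folds, test): each outer test fold is held out,
    and the outer training portion is re-split into inner folds for
    hyperparameter tuning (e.g., C, degree, k)."""
    outer = stratified_folds(labels, n_outer)
    for k in range(n_outer):
        test = outer[k]
        train = [i for j, f in enumerate(outer) if j != k for i in f]
        inner = stratified_folds([labels[i] for i in train], n_inner)
        # Map inner fold positions back to original sample indices
        inner = [[train[i] for i in f] for f in inner]
        yield train, inner, test
```

Only the inner folds are used to choose hyperparameters; the outer test fold is touched once per split, which keeps the reported performance estimate unbiased.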

  6. Case Study: Diagnostic Model From Array Gene Expression Data • Area under the Receiver Operating Characteristic (ROC) curve (AUC), computed with the trapezoidal rule (DeLong et al. 1998). • Statistical comparisons among AUCs were performed using a paired Wilcoxon rank sum test (Pagano et al. 2000). • Gene values scaled linearly to [0, 1]. • Feature selection: • RFE (parameters as in Guyon et al. 2002) • UAF (Fisher criterion scoring; k optimized via C.V.)
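For concreteness, the trapezoidal-rule AUC can be computed as below. This is a self-contained sketch of the standard calculation; `roc_auc` is our own illustrative function, not code from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the trapezoidal rule.
    labels: 1 for positive, 0 for negative; scores: classifier outputs."""
    pairs = sorted(zip(scores, labels), reverse=True)  # descending score
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    auc = 0.0
    prev_fpr = prev_tpr = 0.0
    prev_score = None
    for s, y in pairs:
        if s != prev_score:
            # Close the trapezoid up to the previous distinct threshold
            auc += (fp / N - prev_fpr) * (tp / P + prev_tpr) / 2
            prev_fpr, prev_tpr, prev_score = fp / N, tp / P, s
        if y:
            tp += 1
        else:
            fp += 1
    auc += (fp / N - prev_fpr) * (tp / P + prev_tpr) / 2
    return auc
```

Grouping tied scores before closing each trapezoid gives ties the conventional half-credit, so a random scorer comes out near 0.5.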

  7. Case Study: Diagnostic Model From Array Gene Expression Data • Classification Performance

  8. Case Study: Diagnostic Model From Array Gene Expression Data • Gene selection

  9. Case Study: Diagnostic Model From Array Gene Expression Data • Novelty

  10. Case Study: Diagnostic Model From Array Gene Expression Data • A more detailed look: • Specific Aim 3: “Study how aspects of experimental design (including data set, measured genes, sample size, cross-validation methodology) determine the performance and stability of several machine learning (classifier and feature selection) methods used in the experiments”.

  11. Case Study: Diagnostic Model From Array Gene Expression Data • Overfitting: we replace actual gene measurements with random values in the same range (while retaining the outcome variable values). • Target class rarity: we contrast performance in tasks with rare vs non-rare categories. • Sample size: we use samples of size from the set {40, 80, 120, 160, 203} (as applicable in each task). • Predictor info redundancy: we replace the full set of predictors with random subsets of sizes in the set {500, 1000, 5000, 12600}.

  12. Case Study: Diagnostic Model From Array Gene Expression Data • Train-test split ratio: we use train-test ratios from the set {80/20, 60/40, 40/60} (for tasks II and III; for task I, modified ratios were used due to the small number of positives, see Figure 1). • Cross-validated fold construction: we construct n-fold cross-validation samples retaining the proportion of the rarer target category to the more frequent one in folds with smaller samples, or, alternatively, we ensure that all rare instances are included in the union of test sets (to maximize use of rare-case instances). • Classifier type: kernel vs non-kernel and linear vs non-linear classifiers are contrasted. Specifically, we compare linear and non-linear SVMs (a prototypical kernel method) to each other and to KNN (a robust and well-studied non-kernel classifier and density estimator).

  13. Case Study: Diagnostic Model From Array Gene Expression Data: Random gene values

  14. Case Study: Diagnostic Model From Array Gene Expression Data: Varying sample size

  15. Case Study: Diagnostic Model From Array Gene Expression Data: Random gene selection

  16. Case Study: Diagnostic Model From Array Gene Expression Data: Split ratio

  17. Case Study: Diagnostic Model From Array Gene Expression Data: Use of rare categories

  18. Case Study: Diagnostic Model From Array Gene Expression Data • Questions: • What would you do differently? • How should the biological significance of the selected genes be interpreted? • What is wrong with having so many robust, good classification models? • Why do we have so many good models?

  19. Case Study: Diagnostic Model From Array Gene Expression Data • We have recently completed an extensive analysis of all multi-category gene expression-based cancer datasets in the public domain. The analysis spans >75 cancer types and >1,000 patients in 12 datasets. • On the basis of this study we have created a tool (GEMS) that automatically analyzes data to create diagnostic systems and identify biomarker candidates using a variety of techniques. • The present incarnation of the tool is oriented toward the computer-savvy researcher; a more biologist-friendly web-accessible version is under development.

  20. Case Study: Diagnostic Model From Array Gene Expression Data GEMS System “Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development” A. Statnikov, C.F. Aliferis, I. Tsamardinos AMIA/MEDINFO 2004

  21. Case Study: Diagnostic Modeling From Mass Spectrometry Data

  22. Creating a Tool (FAST-AIMS) for Cancer Diagnostic Decision Support Using Mass Spectrometry Data Nafeh Fananapazir Department of Biomedical Informatics Vanderbilt University Academic Committee: Constantin Aliferis (Primary Advisor), Dean Billheimer, Douglas Hardin, Shawn Levy, Daniel Liebler, Ioannis Tsamardinos

  23. Introduction Problem: • In the last two years, we have seen the emergence of mass spectrometry in the domain of cancer diagnosis. • Mass spectrometry on biological samples produces data with a size and complexity that defies simple analysis. • Clinicians without expertise in machine learning need access to intelligent software that permits at least a first-pass analysis of the diagnostic capabilities of data obtained from mass spectrometry.

  24. MS Studies in Cancer Research: Types • Cancer Types • Specimen Types

  25. MS Studies in Cancer Research: Problems • Lack of disclosure of key methods components • Overfitting • One-time partitioning of data • Lack of randomization when allocating to test/train sets • Lack of an appropriate performance metric

  26. Data Source: Blood Serum • Advantages • Relatively non-invasive • Easily obtained • Access to most tissues in the body • Screening possibilities • Composition/Derivation • Blood Plasma • Protein Constituents • Albumins • Globulins • Fibrinogen • Low Molecular Weight (LMW) Proteins

  27. Data Representation: Mass Spectrometry • MALDI-TOF/SELDI-TOF [1] • Relatively little sample purification is required • Direct measurement of proteins from serum, tissue, and other biological samples • Relatively rapid analysis time • Production of intact molecular ions with little fragmentation • Detection of proteins with m/z ranging from 2,000 to 100,000 daltons • Collection of useful spectra from complex mixtures • Accuracies approaching 1 part in 10,000 • Data Characteristics • Parameters • Mass/Charge (M/Z) • Intensity • Format • Continuous • Peak Detection [1] Billheimer D., A Functional Data Approach to MALDI-TOF MS Protein Analysis

  28. Data Analysis: Paradigm

  29. Data Analysis: Preparations • Get Mass Spectra • Data Pre-Processing • Baseline subtraction • Peak detection [Coombes 2003] • Feature Selection • Normalization of intensities • Peak alignment
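The baseline-subtraction and peak-detection steps above can be sketched in a toy form: a sliding-window-minimum baseline and strict local maxima. This is a deliberately simplified stand-in for illustration, not the method of Coombes 2003; the function names and the window/threshold parameters are our own.

```python
def subtract_baseline(intensities, window=25):
    """Crude baseline estimate: the local minimum in a sliding window
    around each point, subtracted from that point."""
    n = len(intensities)
    corrected = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        corrected.append(intensities[i] - min(intensities[lo:hi]))
    return corrected

def detect_peaks(intensities, min_height=1.0):
    """Indices of local maxima above a height threshold, a naive
    stand-in for a real peak-detection algorithm."""
    return [i for i in range(1, len(intensities) - 1)
            if intensities[i] > min_height
            and intensities[i] > intensities[i - 1]
            and intensities[i] >= intensities[i + 1]]
```

Real pre-processing would also normalize intensities across spectra and align peaks between samples, as the slide notes.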

  30. Data Analysis: Experimental Design c. Classification: Parameter Optimization

  31. Data Analysis: Classifiers c. Classification: Classifiers • KNN: Optimize K • SVM: Optimize cost, kernel, gamma (LSVM, PSVM, RBF-SVM)
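The per-classifier parameter optimization can be expressed as a generic cross-validated grid search, sketched below. `train_eval` is a placeholder for whatever classifier is being tuned (e.g., KNN over k, or an SVM over cost/kernel/gamma); the function itself is our own illustration, not the system's code.

```python
from itertools import product

def grid_search(train_eval, grid, folds):
    """Pick the parameter combination with the best mean score across folds.
    train_eval(params, train_idx, test_idx) trains on train_idx, scores on
    test_idx, and returns a score (e.g., AUC); grid maps parameter names
    to candidate value lists."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for k, test in enumerate(folds):
            # All other folds form the training set for this split
            train = [i for j, f in enumerate(folds) if j != k for i in f]
            scores.append(train_eval(params, train, test))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score
```

Inside a nested cross-validation design, this search runs only on the inner folds of each outer split.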

  32. Preliminary Studies • Datasets • Petricoin Ovarian • Petricoin Prostate • Adam Prostate • Feature Selection • RFE • Experimental Design • 10-fold nested cross-validation • Performance Metric • ROC (rationale for selecting)

  33. Preliminary Studies: Results

  34. Case Study: Categorizing Text Into Content Categories Automatic Identification of Purpose and Quality of Articles in Journals of Internal Medicine Yin Aphinyanaphongs M.S., Constantin Aliferis M.D., Ph.D. (presented at AMIA 2003)

  35. Case Study: Categorizing Text Into Content Categories • The problem: classify Pubmed articles as [high quality & treatment specific] or not • Same function as the current Clinical Quality Filters of Pubmed (in the treatment category)

  36. Case Study: Categorizing Text Into Content Categories • Overview: • Select Gold Standard • Corpus Construction • Document representation • Cross-validation Design • Train classifiers • Evaluate the classifiers

  37. Case Study: Categorizing Text Into Content Categories • Select Gold Standard: • ACP Journal Club. Expert reviewers strictly evaluate and categorize, in each medical area, articles from the top journals in internal medicine. • Their mission is “to select from the biomedical literature those articles reporting original studies and systematic reviews that warrant immediate attention by physicians.” • The treatment criteria (ACP Journal Club): • “Random allocation of participants to comparison groups.” • “80% follow up of those entering study.” • “Outcome of known or probable clinical importance.” • If an article is cited by the ACP, it is a high-quality article.

  38. Case Study: Categorizing Text Into Content Categories • Corpus construction: review ACP Journal Club from 8/1998 to 12/2000 for articles that are cited by the ACP, and get all articles from the 49 journals in the study period.  15,803 total articles, 396 positives (high-quality, treatment-related)

  39. Case Study: Categorizing Text Into Content Categories • Document representation: • “Bag of words” • Title, abstract, Mesh terms, publication type • Term extraction and processing: e.g. “The clinical significance of cerebrospinal.” • Term extraction • “The”, “clinical”, “significance”, “of”, “cerebrospinal” • Stop word removal • “Clinical”, “Significance”, “Cerebrospinal” • Porter Stemming (i.e. getting the roots of words) • “Clinic*”, “Signific*”, “Cerebrospin*” • Term weighting • log frequency with redundancy.
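The document-representation pipeline above (tokenize, remove stop words, stem, weight by log frequency) can be sketched as follows. The stemmer here is a crude suffix stripper standing in for the real Porter algorithm, and the stop-word list is a tiny illustrative subset; both are our own simplifications.

```python
import math
import re

STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}

def crude_stem(word):
    """Very rough suffix stripping; the real Porter stemmer applies
    many more rules in several passes."""
    for suffix in ("ational", "ance", "al", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Tokenize, drop stop words, stem, and weight each term by
    1 + log(term frequency)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {}
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        stem = crude_stem(tok)
        counts[stem] = counts.get(stem, 0) + 1
    return {term: 1.0 + math.log(tf) for term, tf in counts.items()}
```

On the slide's example sentence this yields the stems clinic*, signific*, and cerebrospin*, each with unit weight since every term occurs once.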

  40. Case Study: Categorizing Text Into Content Categories • Cross-validation design: 10-fold cross-validation to measure error; within each split, 80% of the data used for training/validation and 20% reserved for testing; 15,803 articles in total.

  41. Case Study: Categorizing Text Into Content Categories • Classifier families • Naïve Bayes (no parameter optimization) • Decision Trees with Boosting (# of iterations = # of simple rules) • Linear & Polynomial Support Vector Machines (cost from {0.1, 0.2, 0.4, 0.7, 0.9, 1, 5, 10, 20, 100, 1000}, degree from {1, 2, 3, 5, 8})

  42. Case Study: Categorizing Text Into Content Categories • Evaluation metrics (averaged over 10 cross-validation folds): • Sensitivity for fixed specificity • Specificity for fixed sensitivity • Area under ROC curve • Area under 11-point precision-recall curve • “Ranked retrieval”

  43. Case Study: Categorizing Text Into Content Categories

  44. Case Study: Categorizing Text Into Content Categories

  45. Case Study: Categorizing Text Into Content Categories Clinical Query Filter Performance

  46. Case Study: Categorizing Text Into Content Categories Clinical Query Filter

  47. Case Study: Categorizing Text Into Content Categories • Alternative/additional approaches? • Negation detection • Citation analysis • Sequence of words • Variable selection to produce user-understandable models • Analysis of ACPJ potential bias • Others???

  48. Supplementary: Case Study: Imputation for Machine Learning Models For Lung Cancer Classification Using Array Comparative Genomic Hybridization C.F. Aliferis M.D., Ph.D., D. Hardin Ph.D., P. P. Massion M.D. AMIA 2002

  49. Case Study: A Protocol to Address the Missing Values Problem • Context: • Array comparative genomic hybridization (array CGH): a recently introduced technology that measures gene copy number changes of hundreds of genes in a single experiment. Gene copy number changes (deletion, amplification) are often characteristic of disease, and of cancer in particular. • Studies published during the last few years have shown that aCGH enables the development of powerful classification models, facilitates selection of genes for array design, and supports identification of likely oncogenes in a variety of cancers (e.g., esophageal, renal, head/neck, lymphomas, breast, and glioblastomas). • Interestingly, a recent study (Fritz et al. June 2002) has shown that aCGH enables better classification of liposarcoma differentiation than gene expression information.
