
Robust diagnosis of DLBCL from gene expression data from different laboratories


Presentation Transcript


  1. Robust diagnosis of DLBCL from gene expression data from different laboratories. Gyan Bhanot, IBM Research. DIMACS Workshop, June 22, 2005

  2. Collaborators: Gabriela Alexe (IBM), Arnold Levine (IAS, UMDNJ), Gustavo Stolovitzky (IBM)

  3. Overview • Motivation • Pattern-based meta-classifiers • Case study – compare data from two labs for DLBCL vs FL diagnosis

  4. Motivation Cancer is a genetic/proteomic disease • Genetic mutations, viruses, and radiation modify pathways to create a survival advantage for the damaged cell • Gene arrays are a way to study the variation of mRNA levels between diseased and healthy cells • This allows diagnosis and inference of the pathways that cause disease

  5. Cancer diagnosis: Input Training (biomedical / proteomic / microarray) data: 2 classes (m samples) described by N >> m features Output A collection of robust biomarkers and models; a robust, accurate classifier tested on out-of-sample data

  6. Strategy of present paper 1. Transform the original data to “pattern” space 2. Find robust sets of biomarkers with significant collective discriminatory power 3. Apply many machine learning tools to the original and pattern data: ANN, SVM, kNN, weighted voting, classification trees 4. Validate the results on data from a different lab (see the sketch below)
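A minimal sketch of steps 3-4 of this strategy, assuming scikit-learn and hypothetical arrays X_wi, y_wi (WI training data) and X_cu, y_cu (CU test data) that have already been mapped to a common gene/pattern feature space; the classifier settings are illustrative, not the ones used in the paper.

```python
# Hypothetical sketch of steps 3-4: train several base learners on one lab's
# data and validate on the other lab's data. Array names and hyperparameters
# are assumptions, not the paper's actual settings.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def cross_lab_scores(X_wi, y_wi, X_cu, y_cu):
    models = {
        "SVM": SVC(kernel="linear"),
        "kNN": KNeighborsClassifier(n_neighbors=3),
        "ANN": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000),
        "CART": DecisionTreeClassifier(max_depth=4),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_wi, y_wi)                   # train on WI (Whitehead) data
        scores[name] = model.score(X_cu, y_cu)  # validate on CU (Columbia) data
    return scores
```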

  7. Patterns [diagram: observed dataset → system response]

  8. Pattern basics [examples of positive patterns and negative patterns shown on slide]
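To make the pattern idea concrete, here is a small sketch assuming the LAD-style convention in which a pattern is a conjunction of threshold conditions on a few genes; the gene indices and cutoffs below are invented for illustration.

```python
# Illustrative LAD-style pattern: a conjunction of threshold conditions on a
# small set of genes. A positive pattern is satisfied by many DLBCL cases and
# (almost) no FL cases; a negative pattern is the reverse.
def make_pattern(conditions):
    """conditions: list of (gene_index, 'gt' or 'lt', threshold) triples."""
    def pattern(sample):
        return all(
            sample[g] > t if op == "gt" else sample[g] < t
            for g, op, t in conditions
        )
    return pattern

# Hypothetical positive (DLBCL) and negative (FL) patterns.
P1 = make_pattern([(12, "gt", 0.8), (47, "lt", -0.3)])
N1 = make_pattern([(12, "lt", 0.8)])
```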

  9. Individual classifiers used: SVM, ANN, WV, kNN, CART, LR. Trained / calibrated with leave-one-out cross-validation on both the raw data and the pattern data (see the sketch below)
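A minimal sketch of the leave-one-out calibration mentioned on this slide, assuming scikit-learn; the SVM shown is just a placeholder for any of the listed classifiers.

```python
# Leave-one-out accuracy for a generic classifier, applied to either the raw
# data or the pattern data (X, y are assumed numpy arrays).
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

def loo_accuracy(X, y, model=None):
    model = model if model is not None else SVC(kernel="linear")
    return cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
```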

  10. Application: Progression of Follicular Lymphoma (FL) to Diffuse Large B Cell Lymphoma (DLBCL). Gene array data from different laboratories: Shipp et al. (2002) Nature Med. 8(1), 68-74 (Whitehead Lab); Stolovitzky G. (2005) in Deisboeck et al., Complex Systems Science in BioMedicine (in press; preprint: http://www.wkap.nl/prod/a/Stolovitzky.pdf) (Dalla-Favera Lab); Alexe et al. (2005) Artificial Intelligence in Medicine (in press)

  11. Non-Hodgkin lymphomas FL: low-grade non-Hodgkin lymphoma; t(14;18) translocation causing over-expression of anti-apoptotic bcl2; 25-60% of FL cases evolve to DLBCL. DLBCL: high-grade non-Hodgkin lymphoma; < 2 years survival if untreated. Biomarkers of FL transformation to DLBCL: • p53/MDM2 (Moller et al., 1999) • p16 (Pinyol et al., 1998) • p38MAPK (Elenitoba-Johnson et al., 2003) • c-myc (Lossos et al., 2002)

  12. Lymphoma datasets • WI (Shipp et al., 2002; Whitehead): Affy HuGeneFL chip, 58 DLBCL & 19 FL samples, 6817 genes • CU (Dalla-Favera Lab, Stolovitzky, 2005; Columbia): Affy Hu95Av2 chip, 14 DLBCL & 7 FL samples, 12581 genes

  13. Data Preprocessing • Keep genes with ≥ 50% Present calls; ceiling UL = 16000, floor LL = 20 • Stratify the WI data 2:1 into train/test; use the CU data as an independent test set • Compute the SD of each gene across samples • Normalize each gene to mean 0, SD 1 • Generate 500 data sets using added noise + k-fold stratified sampling + jackknife • Find genes highly correlated with the phenotype using a t-test or SNR; keep genes selected in > 450 of the 501 data sets (a sketch of this filtering step follows)
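A sketch of the filtering step, assuming numpy/scipy and an expression matrix X (samples × genes) with binary labels y; the resampling here is simplified to bootstrap sampling plus Gaussian noise, whereas the slide combines noise, k-fold stratified sampling, and the jackknife. All names and the noise level are assumptions.

```python
import numpy as np
from scipy import stats

def preprocess(X):
    """Clip to the slide's floor/ceiling and z-score each gene."""
    X = np.clip(X, 20, 16000)
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

def robust_gene_filter(X, y, n_resamples=500, min_votes=450, top_k=100):
    """Keep genes that rank in the top_k (by |t|) in more than min_votes runs."""
    rng = np.random.default_rng(0)
    votes = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_resamples):
        idx = rng.choice(len(y), size=len(y), replace=True)    # resample cases
        Xb = X[idx] + rng.normal(0.0, 0.1, size=X[idx].shape)  # perturb with noise
        t, _ = stats.ttest_ind(Xb[y[idx] == 1], Xb[y[idx] == 0])
        votes[np.argsort(-np.abs(t))[:top_k]] += 1
    return np.where(votes > min_votes)[0]
```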

  14. Choosing Support Sets • Create good patterns from small subsets of genes; validate using weighted voting with 10-fold cross-validation • Sort genes by how often they appear in good patterns • Select top genes so that each sample is covered by at least 10 patterns (see the sketch below)
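A sketch of the support-set idea, assuming each discovered pattern is recorded as a (frozenset of gene indices, set of covered sample indices) pair; the greedy loop and all names below are illustrative, not the paper's exact procedure.

```python
from collections import Counter

def choose_support_set(patterns, n_samples, min_cover=10):
    """patterns: list of (frozenset_of_gene_indices, set_of_covered_samples)."""
    gene_counts = Counter(g for genes, _ in patterns for g in genes)
    support = set()
    for gene, _ in gene_counts.most_common():      # add genes in order of appearance
        support.add(gene)
        coverage = [0] * n_samples
        for genes, covered in patterns:
            if genes <= support:                   # pattern only uses kept genes
                for s in covered:
                    coverage[s] += 1
        if all(c >= min_cover for c in coverage):  # every sample covered >= 10 times
            break
    return support
```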

  15. The 30 genes that best distinguish FL from DLBCL

  16. Examples of FL and DLBCL patterns (WI training data): each DLBCL case satisfies at least one of the patterns P1 and P2; each FL case satisfies the pattern N1 (and none of the patterns P1 and P2)
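Reusing the illustrative pattern functions from the sketch after slide 8, the decision rule described here could be written roughly as follows (P2 is another hypothetical positive pattern):

```python
# Hypothetical rendering of the rule on this slide: DLBCL if P1 or P2 fires,
# FL if N1 fires, otherwise leave the case unclassified.
def classify_wi_case(sample, P1, P2, N1):
    if P1(sample) or P2(sample):
        return "DLBCL"
    if N1(sample):
        return "FL"
    return "unclassified"
```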

  17. Pattern data

  18. Meta-classifier performance

  19. Error distribution: raw and pattern data

  20. Biology-Based Methods

  21. FL → DLBCL progression: p53-related genes identified by the filtering procedure

  22. p53 pattern data

  23. Examples of p53-responsive gene patterns (WI data): each DLBCL case satisfies one of the patterns P1, P2, P3; each FL case satisfies one of the patterns N1, N2, N3

  24. p53 combinatorial biomarker • 77% of FL & 21% of DLBCL cases (3.7-fold) have at most one gene over-expressed • 79% of DLBCL & 23% of FL cases (3.4-fold) have at least two genes over-expressed • Each individual gene is over-expressed in about 40-70% of DLBCL & 20-40% of FL cases (specificity 50-60%, sensitivity 60-70%); a sketch of the combined rule follows
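A sketch of the combined rule described above; the gene indices and the over-expression cutoff are placeholders, not the published values.

```python
# Count how many of the selected p53-related genes are over-expressed in a
# sample (cutoff in z-score units here) and call DLBCL-like when at least two are.
def p53_combinatorial_call(sample, p53_gene_indices, cutoff=1.0):
    n_over = sum(sample[g] > cutoff for g in p53_gene_indices)
    return "DLBCL-like" if n_over >= 2 else "FL-like"
```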

  25. What are these genes? • Plk1 (stpk13): polo-like kinase / serine-threonine protein kinase 13, M-phase specific; involved in neoplastic cell transformation, drives quiescent cells into mitosis; over-expressed in various human tumors • Takai et al., Oncogene, 2005: Plk1 is a potential target for cancer therapy and a new prognostic marker for cancer • Mito et al., Leuk Lymphoma, 2005: Plk1 is a biomarker for DLBCL • Cdk2 (p33): cyclin-dependent kinase; G2/M transition of the mitotic cell cycle, interacts with cyclins A, B3, D, E • p53 tumor suppressor gene (Levine, 1982)

  26. Conclusions • The pattern-based meta-classifier is robust against noise • Good prediction of FL → DLBCL transformation • Biology-based analysis is also possible and yields a useful biomarker • Biologically motivated sets of genes should be studied to build pathways

  27. Thank you for your attention!

  28. Artificial neural networks

  29. Support vector machines: find a maximum-margin hyperplane in pattern space (Vapnik). [The slide shows the primal (P) and dual (D) formulations; a textbook reconstruction follows.]
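The slide's (P) and (D) did not survive transcription; the standard soft-margin formulations they refer to are, as a textbook reconstruction:

```latex
% Textbook soft-margin SVM; (P) is the primal, (D) the dual. This is a
% reconstruction of the standard formulations, not the slide's exact content.
\begin{align*}
\text{(P)}\quad & \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{m}\xi_i
  \quad\text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1-\xi_i,\ \ \xi_i \ge 0,\\
\text{(D)}\quad & \max_{\alpha}\ \sum_{i=1}^{m}\alpha_i
  - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\, x_i^\top x_j
  \quad\text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_{i}\alpha_i y_i = 0.
\end{align*}
```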

  30. k-Nearest neighbors • Training data: samples in the normalized feature space • Prediction for a test case: the dominant class of its k nearest neighbors in the Euclidean metric [diagram of a new case and its positive/negative neighbors]
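A minimal sketch of this prediction rule, assuming numpy arrays X_train, y_train and a new sample x_new (all names illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)        # Euclidean distances
    nearest = np.argsort(dists)[:k]                        # k closest training cases
    return Counter(y_train[nearest]).most_common(1)[0][0]  # dominant class
```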

  31. Weighted voting (pattern data) • Each pattern P is a voter • Its weight is the fraction of training cases it classifies correctly • For each test case, sum the weights of the triggered positive patterns and of the triggered negative patterns • Classify by the class with the highest total weight (see the sketch below)
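A sketch of this voting rule, assuming each pattern is stored as a (pattern_function, class_label, weight) triple built from the earlier pattern sketch; names are illustrative.

```python
def weighted_vote(sample, patterns):
    """patterns: list of (pattern_fn, class_label, weight) triples."""
    totals = {"DLBCL": 0.0, "FL": 0.0}
    for pattern_fn, label, weight in patterns:
        if pattern_fn(sample):             # the pattern is triggered by this case
            totals[label] += weight        # weight = its training accuracy
    return max(totals, key=totals.get)     # class with the highest total weight
```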

  32. Logistic regression • Dataset with two phenotypes (e.g., cancer vs. non-cancer) • Transform into logit space: y → ln(p/(1-p)) • Find a phenotype predictor as a linear combination of the data values in logit space (implemented in Insightful Miner)
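Written out, the logit model referred to on this slide is (with x_1, ..., x_N the normalized expression values and p the probability of the positive phenotype):

```latex
% The logit transform and linear predictor mentioned on the slide.
\[
  \ln\frac{p}{1-p} \;=\; \beta_0 + \beta_1 x_1 + \dots + \beta_N x_N,
  \qquad p = \Pr(\text{DLBCL}\mid x).
\]
```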

  33. Decision trees / forests • Find rules in the training data: find the root feature that best classifies samples by phenotype; iterate on each branch to find the new feature that best splits that branch by phenotype; prune weakly supported nodes if necessary • CART = Classification and Regression Trees (Breiman) • Many trees = a forest (see the sketch below)
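A minimal sketch using scikit-learn rather than re-implementing CART; the hyperparameters and names are illustrative.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def fit_tree_and_forest(X_train, y_train):
    tree = DecisionTreeClassifier(criterion="gini", max_depth=4)  # single CART-style tree
    forest = RandomForestClassifier(n_estimators=100)             # many trees = a forest
    return tree.fit(X_train, y_train), forest.fit(X_train, y_train)
```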
