
Data Mining in Genomics: the dawn of personalized medicine

Gregory Piatetsky-Shapiro, KDnuggets, www.KDnuggets.com/gps.html. Connecticut College, October 15, 2003.


Presentation Transcript


  1. Data Mining in Genomics: the dawn of personalized medicine Gregory Piatetsky-Shapiro KDnuggets www.KDnuggets.com/gps.html Connecticut College, October 15, 2003

  2. Overview • Data Mining and Knowledge Discovery • Genomics and Microarrays • Microarray Data Mining

  3. Trends leading to Data Flood • More data is generated: • Bank, telecom, other business transactions ... • Scientific Data: astronomy, biology, etc • Web, text, and e-commerce • More data is captured: • Storage technology faster and cheaper • DBMS capable of handling bigger DB

  4. Knowledge Discovery Process [diagram; stage labels: Raw Data, Integration, Data Warehouse, Selection & Cleaning, Target Data, Transformation, Transformed Data, Data Mining, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]

  5. Major Data Mining Tasks • Classification: predicting an item class • Clustering: finding clusters in data • Associations: e.g. A & B & C occur frequently • Visualization: to facilitate human discovery • Summarization: describing a group • Estimation: predicting a continuous value • Deviation Detection: finding changes • Link Analysis: finding relationships

  6. Major Application Areas for Data Mining Solutions • Advertising • Bioinformatics • Customer Relationship Management (CRM) • Database Marketing • Fraud Detection • eCommerce • Health Care • Investment/Securities • Manufacturing, Process Control • Sports and Entertainment • Telecommunications • Web

  7. Genome, DNA & Gene Expression • An organism’s genome is the “program” for making the organism, encoded in DNA • Human DNA has about 30,000-35,000 genes • A gene is a segment of DNA that specifies how to make a protein • Cells are different because of differential gene expression • About 40% of human genes are expressed at one time • Microarray devices measure gene expression

  8. Molecular Biology Overview [diagram labels: Cell, Nucleus, Chromosome, Gene (DNA), Gene (mRNA), single strand, Gene expression, Protein] Graphics courtesy of the National Human Genome Research Institute

  9. Affymetrix Microarrays [figure labels: 50 µm, 1.28 cm] • ~10^7 oligonucleotides, half Perfectly Match mRNA (PM), half have one Mismatch (MM) • Gene expression computed from PM and MM

  10. Affymetrix Microarray Raw Image [figure: Scanner; enlarged section of raw image; raw data]
      Gene              Value
      D26528_at           193
      D26561_cds1_at      -70
      D26561_cds2_at      144
      D26561_cds3_at       33
      D26579_at           318
      D26598_at          1764
      D26599_at          1537
      D26600_at          1204
      D28114_at           707

  11. Microarray Potential Applications • New and better molecular diagnostics • New molecular targets for therapy • few new drugs, large pipeline, … • Outcome depends on genetic signature • best treatment? • Fundamental Biological Discovery • finding and refining biological pathways • Personalized medicine ?!

  12. Microarray Data Mining Challenges • Avoiding false positives, due to • too few records (samples), usually < 100 • too many columns (genes), usually > 1,000 • Model needs to be robust in presence of noise • For reliability need large gene sets; for diagnostics or drug targets, need small gene sets • Estimate class probability • Model needs to be explainable to biologists

  13. False Positives in Astronomy [cartoon, used with permission]

  14. CATs: Clementine Application Templates • CATs - examples of complete data mining processes • Microarray CAT [diagram labels: Preparation, Multi-Class, Clustering, 2-Class]

  15. Key Ideas • Capture the complete process • X-validation loop with feature selection inside • Randomization to select significant genes • Internal iterative feature selection loop • For each class, separate selection of optimal gene sets • Neural nets – robust in presence of noise • Bagging of neural nets

  16. Microarray Classification [diagram labels: Data, Train data, Feature and Parameter Selection, Model Building, Evaluation, Test data]

  17. Classification: External X-val [diagram labels: Gene Data, Train data, Feature and Parameter Selection, Train, Data, Model Building, Evaluation, Test data, Final Test, Final Model, Final Results]
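The external cross-validation idea (feature selection repeated inside every fold, so held-out samples never influence which genes are chosen) can be sketched as follows. This is a minimal illustration, not the Clementine stream from the talk: the function names are hypothetical, it handles two classes only, and a nearest-centroid classifier stands in for the neural networks used in the slides.

```python
import random
from statistics import mean, variance

def t_value(a, b):
    """Unequal-variance t statistic for the difference in means of two samples."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def top_genes(X, y, n_per_class):
    """Column indices of the n most negative and n most positive t-value genes."""
    c1, c2 = sorted(set(y))  # assumption: exactly two class labels
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, c in zip(X, y) if c == c1]
        b = [row[g] for row, c in zip(X, y) if c == c2]
        scores.append((t_value(a, b), g))
    scores.sort()
    picked = scores[:n_per_class] + scores[-n_per_class:]
    return sorted({g for _, g in picked})

def fit_centroids(X, y, genes):
    """Per-class mean expression over the selected genes."""
    cents = {}
    for c in set(y):
        rows = [row for row, lab in zip(X, y) if lab == c]
        cents[c] = [mean(r[g] for r in rows) for g in genes]
    return cents

def predict(cents, row, genes):
    """Class whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((row[g] - m) ** 2 for g, m in zip(genes, cents[c]))
    return min(cents, key=dist)

def external_cv_error(X, y, n_per_class=1, k=5, seed=0):
    """k-fold error rate with gene selection redone on each training split."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = 0
    for fold in folds:
        held = set(fold)
        Xtr = [X[i] for i in idx if i not in held]
        ytr = [y[i] for i in idx if i not in held]
        genes = top_genes(Xtr, ytr, n_per_class)  # sees training rows only
        cents = fit_centroids(Xtr, ytr, genes)
        errors += sum(predict(cents, X[i], genes) != y[i] for i in fold)
    return errors / len(X)
```

Selecting genes once on the full data set and cross-validating only the classifier would let the test folds leak into the gene choice, which tends to give optimistic error estimates when samples are few and genes are many.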

  18. Measuring false positives with randomization [figure: a Gene column (178, 105, 4174, 7133) shown with Class and Rand Class label columns (1, 1, 2, 2 and 2, 1, 1, 2)] • Randomize 500 times • Bottom 1% T-value = -2.08 • Select potentially interesting genes at 1%
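The randomization step can be sketched as a permutation test: shuffle the class labels many times, recompute the T-value each time, and read off the 1% tails of the resulting null distribution. The sketch below is an assumption about how such a test might be written, not the author's implementation; the function names are hypothetical, the labels are assumed to be 1 and 2 as in the slide, and the unequal-variance form of the t statistic is one possible choice.

```python
import random
from statistics import mean, variance

def t_value(values, labels):
    """t statistic for class 1 versus class 2 on one gene's expression values."""
    a = [v for v, c in zip(values, labels) if c == 1]
    b = [v for v, c in zip(values, labels) if c == 2]
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def null_t_values(values, labels, n_perm=500, seed=0):
    """Sorted t-values obtained after randomly permuting the class labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    out = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # class sizes stay fixed, assignment is random
        out.append(t_value(values, shuffled))
    return sorted(out)

def tail_cutoffs(sorted_null, alpha=0.01):
    """Lower and upper cutoffs leaving a fraction alpha in each tail."""
    k = max(1, int(alpha * len(sorted_null)))
    return sorted_null[k - 1], sorted_null[-k]
```

A gene whose observed T-value falls below the lower cutoff or above the upper cutoff is then a candidate "potentially interesting" gene at the 1% level; with thousands of genes, some will still pass by chance.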

  19. Gene Reduction improves Classification • most learning algorithms look for non-linear combinations of features -- can easily find many spurious combinations given small # of records and large # of genes • Classification accuracy improves if we first reduce # of genes by a linear method, e.g. T-values of mean difference • Heuristic: select equal # genes from each class • Then apply a favorite machine learning algorithm
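The "equal number of genes from each class" heuristic can be illustrated for the two-class case, where genes with the most positive T-values are elevated in one class and those with the most negative T-values are elevated in the other. A minimal sketch, assuming the T-values have already been computed and stored in a dictionary; the function name and the gene identifiers in the usage note are hypothetical.

```python
def select_genes_per_class(t_by_gene, n_per_class):
    """Return (high, low): the n genes with the most positive T-values and
    the n genes with the most negative T-values, strongest first."""
    ranked = sorted(t_by_gene, key=t_by_gene.get)  # ascending by T-value
    low = ranked[:n_per_class]                     # most negative first
    high = ranked[-n_per_class:][::-1]             # most positive first
    return high, low
```

For example, with made-up scores `{"g1": 5.0, "g2": -4.0, "g3": 0.1, "g4": 3.0, "g5": -6.0}` and `n_per_class=2`, the function returns `(["g1", "g4"], ["g5", "g2"])`. The combined list is what would be passed on to the learning algorithm.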

  20. Iterative Wrapper approach to selecting the best gene set • Test models using 1, 2, 3, …, 10, 20, 30, 40, ..., 100 top genes with x-validation. • Heuristic 1: evaluate errors from each class; select the number of genes from each class that minimizes error for that class • For randomized algorithms, average 10+ cross-validation runs! • Select gene set with lowest average error
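The wrapper loop can be sketched as a search over candidate gene-set sizes, averaging several cross-validation runs per size. In this sketch `cv_error` is a hypothetical callable supplied by the caller that runs one full cross-validation for a given number of genes and random seed and returns its error rate. Only the overall average error is minimized here; the per-class selection of Heuristic 1 is not shown.

```python
from statistics import mean, stdev

def pick_gene_count(cv_error, sizes, n_runs=10):
    """Return the gene count with the lowest mean cross-validation error,
    plus a {size: (mean_error, st_dev)} summary for plotting error bars."""
    summary = {}
    for n in sizes:
        errs = [cv_error(n, seed) for seed in range(n_runs)]
        summary[n] = (mean(errs), stdev(errs))
    best = min(summary, key=lambda n: summary[n][0])
    return best, summary

# Candidate sizes from the slide: 1..10, then 20, 30, ..., 100.
SIZES = list(range(1, 11)) + list(range(20, 101, 10))
```

Averaging over repeated runs matters when the learner is randomized (as neural networks are), because a single run can favor a gene count by chance.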

  21. Clementine stream for subset selection by x-validation

  22. Microarrays: ALL/AML Example • Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999 • 72 examples (38 train, 34 test), about 7,000 genes • well-studied (CAMDA-2000), good test example • [images: ALL, AML] Visually similar, but genetically very different

  23. Gene subset selection: one X-validation [chart: Single Cross-Validation run]

  24. Gene subset selection: multiple cross-validation runs • For ALL/AML data, 10 genes per class had the lowest error (<1%) • Point in the center is the average error from 10 cross-validation runs • Bars indicate 1 st. dev. above and below

  25. ALL/AML: Results on the test data • Genes selected and model trained on Train set ONLY! • Best Net with 10 top genes per class (20 overall) was applied to the test data (34 samples): • 33 correct predictions (97% accuracy), • 1 error on sample 66 • Actual Class AML, Net prediction: ALL • other methods consistently misclassify sample 66 -- misclassified by a pathologist?

  26. Pediatric Brain Tumour Data • 92 samples, 5 classes (MED, EPD, JPA, MGL, RHB) from U. of Chicago Children’s Hospital • Outer cross-validation with gene selection inside the loop • Ranking by absolute T-test value (selects top positive and negative genes) • Select best genes by adjusted error for each class • Bagging of 100 neural nets
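Bagging can be sketched as training many models on bootstrap resamples of the training set and taking a majority vote. The code below is an illustration under stated assumptions: the function names are hypothetical, and a nearest-centroid classifier stands in for the neural networks, since any `fit(X, y)` function that returns a prediction function can be plugged in.

```python
import random
from collections import Counter
from statistics import mean

def fit_nearest_centroid(X, y):
    """Stand-in learner: returns a function mapping a row to the nearest class."""
    cents = {}
    for c in set(y):
        rows = [r for r, lab in zip(X, y) if lab == c]
        cents[c] = [mean(col) for col in zip(*rows)]
    def predict(row):
        return min(cents, key=lambda c: sum((v - m) ** 2
                                            for v, m in zip(row, cents[c])))
    return predict

def bagged_predict(X, y, row, fit=fit_nearest_centroid, n_models=100, seed=0):
    """Majority vote of n_models learners, each fit on a bootstrap resample."""
    rng = random.Random(seed)
    n = len(X)
    votes = Counter()
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        votes[model(row)] += 1
    return votes.most_common(1)[0][0], votes
```

The vote counts also give a rough confidence measure: a sample on which the models disagree, or on which they consistently contradict the recorded label, is worth a second look.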

  27. Selecting Best Gene Set • Minimizing Combined Error for all classes is not optimal [chart: Average, high and low error rate for all classes]

  28. Error rates for each class [chart: Error rate vs. Genes per Class]

  29. Evaluating One Network Averaged over 100 Networks:

  30. Bagging 100 Networks • Note: suspected error on one sample (labeled as MED but consistently classified as RHB)

  31. AF1q: New Marker for Medulloblastoma? • AF1Q ALL1-fused gene from chromosome 1q • transmembrane protein • Related to leukemia (3 PUBMED entries) but not to Medulloblastoma

  32. Future directions for Microarray Analysis • Algorithms optimized for small samples • Integration with other data • biological networks • medical text • protein data • Cost-sensitive classification algorithms • error cost depends on outcome (don’t want to miss treatable cancer), treatment side effects, etc.
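Cost-sensitive classification can be illustrated by choosing the action with the lowest expected cost given estimated class probabilities, instead of simply picking the most probable class. This is a generic decision-theory sketch, not a method from the talk; the function names, actions, and cost values are hypothetical.

```python
def expected_cost(action, class_probs, cost):
    """Expected cost of an action: sum over classes of P(class) * cost."""
    return sum(p * cost[action][c] for c, p in class_probs.items())

def cheapest_action(class_probs, cost):
    """Action with the lowest expected cost under the given probabilities."""
    return min(cost, key=lambda a: expected_cost(a, class_probs, cost))

# Made-up costs: missing a treatable cancer is ten times worse than
# treating a healthy patient unnecessarily.
COST = {
    "treat":        {"cancer": 0.0,  "healthy": 1.0},
    "no_treatment": {"cancer": 10.0, "healthy": 0.0},
}
```

With these costs, a predicted cancer probability of only 0.2 already favors treatment (expected cost 0.8 versus 2.0), which is why the classifier needs to output class probabilities and not just labels.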

  33. Acknowledgements • Eric Bremer, Children’s Hospital (Chicago) & Northwestern U. • Greg Cooper, U. Pittsburgh • Tom Khabaza, SPSS • Sridhar Ramaswamy, MIT/Whitehead Institute • Pablo Tamayo, MIT/Whitehead Institute

  34. Thank you • Further resources on Data Mining: www.KDnuggets.com • Microarrays: www.KDnuggets.com/websites/microarray.html • Contact: Gregory Piatetsky-Shapiro: www.kdnuggets.com/gps.html
