Learning the Genetics of Common Human Diseases
Presentation Transcript
    1. Learning the Genetics of Common Human Diseases -- Applications of Machine Learning Methods in Genomic Epidemiological Studies. Yan Sun, Wayne State, Detroit, MI

    2. Epidemiology Epidemiology is the study of factors affecting the health and illness of populations, and serves as the foundation and logic of interventions made in the interest of public health and preventive medicine. It is considered a cornerstone methodology of public health research, and is highly regarded in evidence-based medicine for identifying risk factors for disease and determining optimal treatment approaches to clinical practice.

    3. Genetic/Genomic Epidemiology Genetic epidemiology is the epidemiological evaluation of the role of inherited causes of disease in families and in populations; it aims to detect the inheritance pattern of a particular disease, localize the gene and find a marker associated with disease susceptibility. Gene-gene and gene-environment interactions are also studied in genetic epidemiology of a disease. In its broad context, genetic epidemiology includes family studies, molecular epidemiologic studies with genetic components, and more traditional cohort and case-control studies with family history components.

    4. Genetics of Common Human Diseases
    Heritability of common human diseases:
    • Asthma: ~60%
    • T2D: ~70%
    • Obesity: ~50%
    CVD risk factors also carry a significant genetic component (heritability).
    Genetics of atherosclerosis: Lusis, A.J., Mar, R., Pajukanta, P., 2004

    5. Dimensionality of Data
    • Phenotypic data
    • Genotypic data
      • Microsatellites
      • Single nucleotide polymorphisms (SNPs)
      • Copy number variations (CNVs)


    7. Genome-Wide Association Data

    8. Xavier and Armengol, 2007, PLoS Genetics

    9. Data Mining
    The data are not mine, they are public:
    • NIH dbGaP (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap)
    • The Wellcome Trust Case Control Consortium (http://www.wtccc.org.uk/)
    • Genetic Analysis Workshop

    10. Goals
    • Variable selection: noise removal, dimension reduction, feature selection
    • Prediction model: predictive ability of genetic factors alone; improvement of predictive ability
    • Underlying mechanism*: interactions, pathways and biological networks
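The variable-selection goal above can start from something as simple as a univariate filter. A minimal, stdlib-only sketch (the `rank_snps` helper and the SNP names are illustrative, not from the talk): SNPs coded 0/1/2 are ranked by the absolute case/control difference in allele frequency, and the top of the ranking is kept for downstream modeling.

```python
def allele_freq(genotypes):
    """Minor-allele frequency from 0/1/2-coded genotypes."""
    n = len(genotypes)
    return sum(genotypes) / (2 * n) if n else 0.0

def rank_snps(cases, controls):
    """Rank SNPs by absolute case/control allele-frequency difference.

    cases, controls: dict of snp_id -> list of 0/1/2 genotypes.
    Returns SNP ids sorted from largest to smallest difference.
    """
    scores = {
        snp: abs(allele_freq(cases[snp]) - allele_freq(controls[snp]))
        for snp in cases
    }
    return sorted(scores, key=scores.get, reverse=True)

cases = {"rs1": [2, 2, 1, 2], "rs2": [0, 1, 1, 0]}
controls = {"rs1": [0, 1, 0, 0], "rs2": [0, 1, 0, 1]}
print(rank_snps(cases, controls))  # ['rs1', 'rs2'] -- rs1 has the larger gap
```

Real studies would use a proper association test rather than a raw frequency difference; the point is only the filter-then-model structure.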

    11. Machine Learning Methods
    • Neural networks
    • Support vector machines
    • Ensemble learning methods: bagging, boosting, random forests & RuleFit

    12. Ensemble Learning Methods Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions.
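The "combining their predictions" step is often plain majority voting. A minimal sketch with toy threshold "learners" (the learners and names here are illustrative only):

```python
from collections import Counter

def ensemble_predict(learners, x):
    """Combine individual learners' predictions by majority vote."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]

# Three toy "learners" that disagree on some inputs
learners = [lambda x: x > 3, lambda x: x > 5, lambda x: x > 4]
print(ensemble_predict(learners, 6))  # True: all three vote True
print(ensemble_predict(learners, 4))  # False: 2 of 3 vote False
```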

    13. Ensemble Learning Methods
    • Accuracy: a more reliable mapping can be obtained by combining the output of multiple "experts".
    • Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer). Mixture of experts, ensemble feature selection.
    • There is no single model that works for all pattern recognition problems! (no free lunch theorem)
    "To solve really hard problems, we'll have to use several different representations... It is time to stop arguing over which type of pattern-classification technique is best... Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." -- Minsky, 1991

    14. Why Do Ensembles Work?
    • Because uncorrelated errors of individual classifiers can be eliminated by averaging.
    • Example: assume 40 base classifiers with majority voting, each with an error rate of 0.3. The probability that an instance is misclassified by a majority (r >= 21) of the 40 classifiers is about 0.002 (Dietterich, 1997).
    • Theoretical results by Hansen & Salamon (1990).
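The 0.002 figure above is just a binomial tail, and can be checked directly (assuming independent errors, as the slide does):

```python
from math import comb

def p_majority_error(n=40, p=0.3):
    """Probability that a majority (r >= n//2 + 1) of n independent
    classifiers, each with error rate p, misclassify the same instance."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r)
               for r in range(n // 2 + 1, n + 1))

print(round(p_majority_error(), 4))  # on the order of 0.002
```

The independence assumption is the crux: correlated base classifiers gain far less from voting.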

    15. Ensemble Learning Methods
    • How to generate base classifiers? (generation strategy)
      • Decision tree learning: ID3, C4.5 & CART
      • Instance-based learning: k-nearest neighbors
      • Bayesian classification: Naïve Bayes
      • Neural networks
      • Regression analysis
      • Clustering, etc.
    • How to integrate them? (integration strategy)
      • BAGGing = Bootstrap AGGregation (Breiman, 1996)
      • Boosting (Schapire and Singer, 1998)
      • Random Forests (Breiman, 2001)
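Of the integration strategies listed, bagging is the simplest to sketch: fit each base learner on a bootstrap resample and average the predictions. The toy `train_mean` base learner below is purely illustrative.

```python
import random

def bagging_fit(train_fn, data, n_learners=10, seed=0):
    """BAGGing sketch: fit each base learner on a bootstrap resample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_learners):
        sample = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(train_fn(sample))
    return models

def bagging_predict(models, x):
    """Average the base learners' numeric predictions."""
    return sum(m(x) for m in models) / len(models)

# Toy base learner: predicts the mean label of its bootstrap sample
def train_mean(sample):
    mean = sum(y for _, y in sample) / len(sample)
    return lambda x: mean

data = [(i, float(i % 2)) for i in range(20)]
models = bagging_fit(train_mean, data, n_learners=25)
print(bagging_predict(models, 0))  # close to 0.5, the overall mean label
```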

    16. Random Forests
    [Figure: an ensemble of N decision trees (Tree 1 ... Tree N), each casting a Yes/No vote; the final classification is based on votes from all N trees.]

    17. Random Forests
    • Its accuracy is as good as AdaBoost, and sometimes better
    • It is relatively robust to outliers and noise
    • It is faster than bagging or boosting
    • It gives useful internal estimates of error, strength, correlation and variable importance
    • It is simple and can be easily parallelized
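A stdlib-only caricature of the random-forest recipe, using depth-1 trees (stumps) as the base learners: bootstrap the rows and restrict each tree to a random feature subset, then take a majority vote. Real implementations grow full trees and compute the internal error and importance estimates mentioned above; everything here (names, toy data) is illustrative.

```python
import random
from collections import Counter

def fit_stump(rows, feature_ids):
    """Best single-feature threshold split, scored by majority-label accuracy.
    rows: list of (feature_tuple, label)."""
    best = None
    for f in feature_ids:
        for t in sorted({r[0][f] for r in rows}):
            left = [r[1] for r in rows if r[0][f] <= t]
            right = [r[1] for r in rows if r[0][f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            acc = (left.count(l_lab) + right.count(r_lab)) / len(rows)
            if best is None or acc > best[0]:
                best = (acc, f, t, l_lab, r_lab)
    if best is None:  # degenerate bootstrap sample: fall back to majority label
        lab = Counter(r[1] for r in rows).most_common(1)[0][0]
        return lambda x: lab
    _, f, t, l_lab, r_lab = best
    return lambda x: l_lab if x[f] <= t else r_lab

def fit_forest(rows, n_trees=15, seed=1):
    """Random-forest flavour: bootstrap the rows and restrict each tree
    to a random subset of ~sqrt(n_features) features."""
    rng = random.Random(seed)
    n_feat = len(rows[0][0])
    k = max(1, int(n_feat ** 0.5))
    trees = []
    for _ in range(n_trees):
        boot = [rng.choice(rows) for _ in rows]
        feats = rng.sample(range(n_feat), k)
        trees.append(fit_stump(boot, feats))
    return trees

def forest_predict(trees, x):
    """Majority vote over all trees."""
    return Counter(t(x) for t in trees).most_common(1)[0][0]

# Toy 0/1/2-coded genotype data: the label depends on feature 0 only;
# feature 1 is deterministic noise.
rows = [((g, n), int(g >= 1))
        for g, n in zip([0, 0, 0, 1, 1, 2, 2, 2], [5, 3, 8, 1, 9, 2, 7, 4])]
forest = fit_forest(rows)
print(forest_predict(forest, (2, 0.5)))
```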

    18. Random Forests Heidema 2006, BMC Genetics

    19. Application of Random Forests
    • Candidate genes
    • Genome-wide markers

    20. Application of Random Forests
    [Study design: two datasets, ds1 (n=360) and ds2 (n=360), each with 16 covariates and 471 SNPs]
    • Missing SNP genotype imputation
    • All 471 SNPs vs. tagSNPs (LD Rsq < 0.5)
    • Random Forests and RuleFit
    • Prediction models evaluated with ROC curves and 5-fold CV
    • Identify replicable covariates and SNPs
    • KGraph summary and biological relevance
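The evaluation step of the pipeline (ROC curves with 5-fold cross-validation) can be sketched with two small stdlib helpers; both function names are illustrative. The AUC here uses the rank-sum (Mann-Whitney) identity: the fraction of (positive, negative) pairs that the score ranks correctly.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```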

    21. Predicting CAC: Random Forests vs. RuleFit

    22. Using All SNPs vs. Tag SNPs A tagSNP is a representative SNP in a genomic region of high linkage disequilibrium, chosen to be maximally informative about the remaining SNPs in the region. Because of this redundancy, genetic variation can be captured without genotyping every SNP in a chromosomal region.
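A common approximation to the LD statistic r² between two SNPs is the squared Pearson correlation of their 0/1/2 genotype vectors, and a greedy pass over SNPs gives a crude tagging scheme (real tagSNP selection uses haplotype-aware algorithms; the helpers and SNP names below are illustrative, and both SNPs are assumed polymorphic so the variances are nonzero).

```python
def r_squared(g1, g2):
    """Squared Pearson correlation of two 0/1/2 genotype vectors,
    a common approximation to the LD statistic r^2.
    Assumes both SNPs are polymorphic (nonzero variance)."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2)) / n
    v1 = sum((a - m1) ** 2 for a in g1) / n
    v2 = sum((b - m2) ** 2 for b in g2) / n
    return cov * cov / (v1 * v2)

def pick_tag_snps(snps, threshold=0.5):
    """Greedy tagging: keep a SNP only if its r^2 with every
    already-kept tag stays below the threshold."""
    tags = []
    for name, g in snps:
        if all(r_squared(g, tg) < threshold for _, tg in tags):
            tags.append((name, g))
    return [name for name, _ in tags]

snps = [("rs1", [0, 1, 2, 2, 0, 1]),
        ("rs2", [0, 1, 2, 2, 0, 1]),   # perfect LD with rs1
        ("rs3", [1, 0, 2, 0, 1, 2])]
print(pick_tag_snps(snps))  # ['rs1', 'rs3'] -- rs2 is tagged by rs1
```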

    23. Predicting CAC

    24. GPR35 Protein

    25. KGraph Presentation

    26. Predicting RA Status
    • Sample: one subject was randomly selected from each family to create dataset 1, and a second subject was then randomly selected from the remaining family members for dataset 2. Singletons were randomly divided between the two samples. Each of the two replicate samples has 740 unrelated subjects.
    • Genetic markers: 5,742 genome-wide informative SNP markers.
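The sampling scheme above (one subject per family into each replicate, singletons split randomly) can be sketched as follows; the `split_families` helper and the toy ids are illustrative only.

```python
import random

def split_families(families, singletons, seed=0):
    """Pick one subject per family for dataset 1, a second (if present)
    for dataset 2, and split singletons randomly between the two."""
    rng = random.Random(seed)
    ds1, ds2 = [], []
    for members in families.values():
        order = rng.sample(members, len(members))  # random family order
        ds1.append(order[0])
        if len(order) > 1:
            ds2.append(order[1])
    singles = rng.sample(singletons, len(singletons))
    half = len(singles) // 2
    ds1 += singles[:half]
    ds2 += singles[half:]
    return ds1, ds2

families = {"f1": ["a", "b", "c"], "f2": ["d", "e"]}
ds1, ds2 = split_families(families, ["s1", "s2"])
print(len(ds1), len(ds2))  # 3 3
```

Keeping the two replicates free of shared family members is what makes each one a sample of unrelated subjects.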

    27. Predicting RA Status Sun YV et al., 2007, in press

    28. Predicting RA Status Sun YV et al., 2007, in press

    29. Challenges
    • Validation, validation and validation!
    "So far, comprehensive reviews of the published literature, most of which reports work based on the candidate-gene approach, have demonstrated a plethora of questionable genotype–phenotype associations, replication of which has often failed in independent studies."

    30. Computational Challenge
    • 100K SNP data
    • 500K SNP data
    • 1M SNP data
    • 3.1M HapMap SNPs (Nature, Oct. 2007)
    • And more: different types of genetic variation
    • And more rare genetic variants
    • And larger samples

    31. If all a man has is a hammer, then every problem looks like a nail.

    32. Acknowledgements
    University of Michigan: Sharon Kardia, Lawrence Bielak, Patricia Peyser, Ji Zhu
    Mayo Clinic: Stephen Turner, Patrick Sheedy, II
    University of Texas: Eric Boerwinkle
