Learning the Genetics of Common Human Diseases

-- Applications of Machine Learning Methods in genomic epidemiological studies

Yan Sun

Wayne State, Detroit, MI

Epidemiology

Epidemiology is the study of factors affecting the health and illness of populations, and serves as the foundation and logic of interventions made in the interest of public health and preventive medicine. It is considered a cornerstone methodology of public health research, and is highly regarded in evidence-based medicine for identifying risk factors for disease and determining optimal treatment approaches to clinical practice.

Genetic/Genomic Epidemiology

Genetic epidemiology is the epidemiological evaluation of the role of inherited causes of disease in families and in populations; it aims to detect the inheritance pattern of a particular disease, localize the gene and find a marker associated with disease susceptibility. Gene-gene and gene-environment interactions are also studied in genetic epidemiology of a disease. In its broad context, genetic epidemiology includes family studies, molecular epidemiologic studies with genetic components, and more traditional cohort and case-control studies with family history components.

Genetics of Common Human Diseases

Heritability of Common Human Diseases

  • Disease: heritability

    • Asthma: ~60%

    • T2D (type 2 diabetes): ~70%

    • Obesity: ~50%

CVD risk factors with a significant genetic component (heritability)

Genetics of atherosclerosis. Lusis, A.J., Mar, R., Pajukanta, P., 2004.

Dimensionality of Data

  • Phenotypic Data

  • Genotypic Data

    • Microsatellites

    • Single Nucleotide Polymorphism (SNP)

    • Copy Number Variation (CNV)

Data Mining

The data are not mine, they are public.

  • NIH dbGaP


  • The Wellcome Trust Case Control Consortium


  • Genetic Analysis Workshop

Goals

  • Variable Selection

    • Noise removal

    • Dimension reduction

    • Feature selection

  • Prediction Model

    • Predictive ability of genetic factors alone

    • Improvement of predictive ability

  • Underlying Mechanism*

    • Interactions, pathways and biological networks

Machine Learning Methods

  • Neural Networks

  • Support Vector Machine

  • Ensemble Learning Methods

    • Bagging, Boosting, Random Forests & RuleFit

Ensemble Learning Methods

Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions.

Ensemble Learning Methods

  • Accuracy: a more reliable mapping can be obtained by combining the output of multiple "experts"

  • Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach). Mixture of experts, ensemble feature selection.

  • There is no single model that works for all pattern recognition problems! (no free lunch theorem)

    "To solve really hard problems, we’ll have to use several different representations... It is time to stop arguing over which type of pattern-classification technique is best... Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." --Minsky, 1991.

Why do ensembles work?

  • Because uncorrelated errors of individual classifiers can be eliminated by averaging.

  • Assume 40 base classifiers combined by majority voting, each with an error rate of 0.3.

  • Probability that an instance will be misclassified by r out of 40 classifiers (Dietterich, 1997): for r = 21, about 0.002.

  • Theoretical results by Hansen & Salamon (1990)
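
The 0.002 figure can be checked with a binomial calculation. The sketch below assumes the 40 classifiers err fully independently, which real classifiers only approximate; the function names are illustrative and do not come from the slides.

```python
from math import comb

def prob_r_wrong(r, n=40, p=0.3):
    """Binomial probability that exactly r of n independent classifiers err."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

def majority_vote_error(n=40, p=0.3):
    """Probability that more than half of the n classifiers err at once."""
    return sum(prob_r_wrong(r, n, p) for r in range(n // 2 + 1, n + 1))

p21 = prob_r_wrong(21)
p_majority = majority_vote_error()
print(f"P(exactly 21 of 40 wrong) = {p21:.4f}")
print(f"P(21 or more of 40 wrong) = {p_majority:.4f}")
```

Both quantities round to 0.002, far below the individual error rate of 0.3. If the errors are correlated, the gain shrinks accordingly.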

Ensemble Learning Methods

  • How to generate base classifiers? Generation strategy:

    • Decision tree learning: ID3, C4.5 & CART

    • Instance-based learning: k-nearest neighbor

    • Bayesian classification: Naïve Bayes

    • Neural networks

    • Regression analysis

    • Clustering, etc.

  • How to integrate them? Integration strategy:

    • BAGGing = Bootstrap AGGregation (Breiman, 1996)

    • Boosting (Schapire and Singer, 1998)

    • Random Forests (Breiman, 2001)
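
As an illustration of the bagging idea, here is a minimal sketch using only the Python standard library. The base learner (a one-feature threshold rule, or decision stump), the toy data, and all function names are assumptions made for this example; they are not taken from the presentation.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Choose the threshold t with the fewest training errors for: predict 1 if x > t."""
    def errors(t):
        return sum(int(x > t) != y for x, y in sample)
    best_t = min(sorted({x for x, _ in sample}), key=errors)
    return lambda x: int(x > best_t)

def bagging(data, n_estimators=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement from data)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_estimators)]

def vote(models, x):
    """Aggregate the base classifiers by majority vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Made-up toy data: the label is 1 when x > 5, with two labels flipped as noise.
data = [(x, int(x > 5)) for x in range(11)]
data[2] = (2, 1)
data[8] = (8, 0)

models = bagging(data)
print([vote(models, x) for x in range(11)])
```

Random Forests follow the same bootstrap-and-vote pattern but grow full decision trees and additionally sample a random subset of features at each split.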

Random Forests

[Diagram: Tree 1, Tree 2, Tree 3, ..., Tree i, Tree i+1, Tree i+2, Tree i+3, ..., Tree N. The final classification is based on votes from all N trees.]

Random Forests

  • Its accuracy is as good as Adaboost and sometimes better

  • It is relatively robust to outliers and noise

  • It is faster than bagging or boosting

  • It gives useful internal estimates of error, strength, correlation and variable importance

  • It is simple and can be easily parallelized
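
The internal error estimate rests on the fact that each bootstrap sample leaves some training instances out (the out-of-bag instances), and these can act as a test set for the tree grown on that sample. The sketch below, with illustrative function names, compares the exact expected out-of-bag fraction, (1 - 1/n)^n, with a simulation; the value n = 360 simply mirrors the dataset size quoted elsewhere in the slides.

```python
import random

def oob_fraction_exact(n):
    """Probability that a given instance is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

def oob_fraction_simulated(n, n_trees=200, seed=0):
    """Average share of instances left out across n_trees simulated bootstrap samples."""
    rng = random.Random(seed)
    left_out = 0
    for _ in range(n_trees):
        in_bag = {rng.randrange(n) for _ in range(n)}
        left_out += n - len(in_bag)
    return left_out / (n_trees * n)

n = 360
exact = oob_fraction_exact(n)
simulated = oob_fraction_simulated(n)
print(f"exact: {exact:.3f}, simulated: {simulated:.3f}")
```

Roughly a third of the instances are out-of-bag for any given tree, so every instance can be scored by the trees that never saw it.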

Random Forests

Heidema 2006 BMC Genetics

Application of Random Forests

  • Candidate Genes

  • Genome-wide Markers

Application of Random Forests

Workflow diagram, in transcribed order:

  • Two replicate datasets: ds1 (n=360) and ds2 (n=360), each with 16 covariates and 471 SNPs

  • Missing SNP genotype imputation

  • Two SNP sets: all 471 SNPs, and tagSNPs (LD Rsq < 0.5)

  • Random Forests

  • Prediction models, with ROC curves from 5-fold CV

  • Identify replicable covariates and SNPs

  • KGraph summary and biological relevance
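
The evaluation step (ROC curve with 5-fold CV) can be sketched in plain Python. This is an illustrative sketch, not the authors' pipeline: the function names are hypothetical, and the area under the ROC curve is computed through its rank interpretation (the probability that a randomly chosen case scores higher than a randomly chosen control).

```python
import random

def roc_auc(labels, scores):
    """AUC as P(case score > control score), counting ties as one half."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(360, k=5)
print([len(f) for f in folds])                        # five test folds of 72 subjects
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # 0.75
```

In each round, one fold is held out for scoring and the model is trained on the other four, so every subject receives exactly one out-of-sample prediction.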

Predicting CAC




Using All SNPs vs. Tag SNPs

A tagSNP is a representative SNP in a region of the genome that is in high linkage disequilibrium with the other SNPs in that region. Genotyping tagSNPs makes it possible to capture genetic variation without genotyping every SNP in a chromosomal region; the tagSNP is the maximally informative SNP for its region.
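
One simple way to pick tagSNPs is greedy pruning on pairwise r². The sketch below is an assumption-laden illustration, not the procedure used in the study: it approximates LD by the squared Pearson correlation between genotype vectors coded 0/1/2, uses the 0.5 threshold mentioned in the slides, and runs on made-up genotypes. The function names are hypothetical.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_a * var_b)

def select_tag_snps(genotypes, threshold=0.5):
    """Greedily keep a SNP only if its r^2 with every kept SNP is below threshold."""
    tags = []
    for name, values in genotypes.items():
        if all(r_squared(values, genotypes[t]) < threshold for t in tags):
            tags.append(name)
    return tags

# Made-up genotypes for six subjects; snp2 nearly duplicates snp1.
genotypes = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 1],
    "snp3": [2, 0, 1, 1, 2, 0],
}
print(select_tag_snps(genotypes))  # ['snp1', 'snp3']
```

Here snp2 is dropped because its r² with snp1 is about 0.79, while snp3 is kept because its r² with snp1 is 0.25. Dedicated LD tools use phased haplotypes and more careful selection rules.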

GPR35 Protein

Predicting RA Status

  • Sample: One subject was randomly selected from each family to create dataset 1, and a second subject was then randomly selected from the remaining samples for dataset 2. The singletons were randomly divided between the two datasets. Each of the two replicate samples has 740 unrelated subjects.

  • Genetic Markers: 5,742 genome-wide informative SNP markers.
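
The sampling scheme described above can be sketched as follows. This is a hypothetical reconstruction from the slide text only: the input format (pairs of subject ID and family ID), the function name, and the alternating assignment of singletons are assumptions.

```python
import random
from collections import defaultdict

def split_replicates(subjects, seed=0):
    """subjects: list of (subject_id, family_id). Returns (dataset1, dataset2)."""
    rng = random.Random(seed)
    families = defaultdict(list)
    for subject_id, family_id in subjects:
        families[family_id].append(subject_id)

    dataset1, dataset2, singletons = [], [], []
    for members in families.values():
        if len(members) == 1:
            singletons.append(members[0])
            continue
        first, second = rng.sample(members, 2)  # two distinct relatives
        dataset1.append(first)
        dataset2.append(second)

    # Singletons are shuffled, then dealt alternately to the two datasets.
    rng.shuffle(singletons)
    dataset1.extend(singletons[0::2])
    dataset2.extend(singletons[1::2])
    return dataset1, dataset2

# Made-up example: two multi-member families and two singletons.
subjects = [("a1", "A"), ("a2", "A"), ("a3", "A"),
            ("b1", "B"), ("b2", "B"), ("c1", "C"), ("d1", "D")]
ds1, ds2 = split_replicates(subjects)
print(ds1, ds2)
```

Because each family contributes at most one subject to each dataset, the subjects within a dataset are unrelated, although the two datasets share families with each other.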

Predicting RA Status

Sun YV et al., 2007, in press

Predicting RA Status

Sun YV et al., 2007, in press

Challenges

  • Validation, validation and validation!

“So far, comprehensive reviews of the published literature, most of which reports work based on the candidate-gene approach, have demonstrated a plethora of questionable genotype–phenotype associations, replication of which has often failed in independent studies.”

Computational Challenge

  • 100K SNP data

  • 500K SNP data

  • 1M SNP data

  • 3.1M HapMap SNPs (Nature Oct. 2007)

  • And more: different types of genetic variation

  • And more rare genetic variations

  • And larger sample

Acknowledgements

University of Michigan

Sharon Kardia

Lawrence Bielak

Patricia Peyser

Ji Zhu

Mayo Clinic

Stephen Turner

Patrick Sheedy, II

University of Texas

Eric Boerwinkle
