Learning the genetics of common human diseases
Learning the Genetics of Common Human Diseases

-- Applications of Machine Learning Methods in genomic epidemiological studies

Yan Sun

Wayne State, Detroit, MI


Epidemiology

Epidemiology is the study of factors affecting the health and illness of populations, and serves as the foundation and logic of interventions made in the interest of public health and preventive medicine. It is considered a cornerstone methodology of public health research, and is highly regarded in evidence-based medicine for identifying risk factors for disease and determining optimal treatment approaches to clinical practice.


Genetic/Genomic Epidemiology

Genetic epidemiology is the epidemiological evaluation of the role of inherited causes of disease in families and in populations; it aims to detect the inheritance pattern of a particular disease, localize the gene and find a marker associated with disease susceptibility. Gene-gene and gene-environment interactions are also studied in genetic epidemiology of a disease. In its broad context, genetic epidemiology includes family studies, molecular epidemiologic studies with genetic components, and more traditional cohort and case-control studies with family history components.


Genetics of Common Human Diseases

Heritability of Common Human Diseases

Disease    Heritability
Asthma     ~60%
T2D        ~70%
Obesity    ~50%

CVD risk factors with a significant genetic component (heritability)

Source: Lusis, A.J., Mar, R., Pajukanta, P., "Genetics of atherosclerosis," 2004


Dimensionality of Data

  • Phenotypic Data

  • Genotypic Data

    • Microsatellites

    • Single Nucleotide Polymorphism (SNP)

    • Copy Number Variation (CNV)


SNP

AAGGCGTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC

AAGGCCTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTGGGCGCC

AAGGCCTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC

AAGGCCTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTTGGCGCC

AAGGCGTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTTGGCGCC

AAGGCGTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTTGGCGCC

AAGGCCTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC

AAGGCGTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC

AAGGCGTTTGTCCC . . . CCCTAGGATGCAA . . . AAGCTTGGCGCC

AAGGCGTTTGTCCC . . . CCCTTGGATGCAA . . . AAGCTTGGCGCC

SNP1 C/G

SNP2 A/T

SNP3 T/G
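
As a rough illustration of how the three SNPs can be read off the alignment, the sketch below scans the ten sequences column by column and reports every position where more than one allele occurs. The sequences are copied from the slide with the ". . ." gaps dropped; the function name `find_snps` and the 0-based positions in the joined string are assumptions made only for this example.

```python
from collections import Counter

# The ten aligned haplotypes from the slide, written as three segments each.
# The ". . ." gaps between segments are omitted, so positions refer to the
# joined string, not to real genomic coordinates.
HAPLOTYPES = [
    ("AAGGCGTTTGTCCC", "CCCTAGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCCTTTGTCCC", "CCCTTGGATGCAA", "AAGCTGGGCGCC"),
    ("AAGGCCTTTGTCCC", "CCCTAGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCCTTTGTCCC", "CCCTTGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCGTTTGTCCC", "CCCTTGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCGTTTGTCCC", "CCCTTGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCCTTTGTCCC", "CCCTAGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCGTTTGTCCC", "CCCTAGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCGTTTGTCCC", "CCCTAGGATGCAA", "AAGCTTGGCGCC"),
    ("AAGGCGTTTGTCCC", "CCCTTGGATGCAA", "AAGCTTGGCGCC"),
]


def find_snps(sequences):
    """Return {position: Counter of alleles} for every polymorphic column."""
    snps = {}
    for position, column in enumerate(zip(*sequences)):
        allele_counts = Counter(column)
        if len(allele_counts) > 1:  # more than one allele seen at this site
            snps[position] = allele_counts
    return snps


joined = ["".join(segments) for segments in HAPLOTYPES]
snps = find_snps(joined)
for position, allele_counts in snps.items():
    print(position, dict(allele_counts))
```

Each reported column corresponds to one of SNP1, SNP2 and SNP3, and the allele counts give the sample allele frequencies directly.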


Genome-Wide Association Data


Estivill and Armengol, 2007, PLoS Genetics


Data Mining

The data are not mine, they are public.

  • NIH dbGaP

    (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap)

  • The Wellcome Trust Case Control Consortium

    (http://www.wtccc.org.uk/)

  • Genetic Analysis Workshop


Goals

  • Variable Selection

    • Noise removal

    • Dimension reduction

    • Feature selection

  • Prediction Model

    • Predictive ability of genetic factors alone

    • Improvement of predictive ability

  • Underlying Mechanism*

    • Interactions, pathways and biological networks


Machine Learning Methods

  • Neural Networks

  • Support Vector Machine

  • Ensemble Learning Methods

    • Bagging, Boosting, Random Forests & RuleFit


Ensemble Learning Methods

Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions.


Ensemble Learning Methods

  • Accuracy: a more reliable mapping can be obtained by combining the output of multiple "experts"

  • Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach). Examples: mixture of experts, ensemble feature selection.

  • There is no single model that works for all pattern recognition problems! (no free lunch theorem)

    "To solve really hard problems, we'll have to use several different representations... It is time to stop arguing over which type of pattern-classification technique is best... Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." -- Minsky, 1991


Why do ensembles work?

  • Because uncorrelated errors of individual classifiers largely cancel out when their votes are averaged.

  • Assume: 40 base classifiers with independent errors, majority voting, each with error rate 0.3

  • Probability that an instance is misclassified by r of the 40 classifiers (Dietterich, 1997): for r = 21, the smallest wrong majority, about 0.002

  • Theoretical results by Hansen & Salamon (1990)
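
The 0.002 figure can be checked with the binomial distribution. The sketch below assumes that the 40 classifiers err independently, which is the idealization behind the uncorrelated-errors argument above; real classifiers trained on the same data are usually correlated, so this is a best case. The function name `majority_vote_error` is introduced only for this example.

```python
from math import comb


def majority_vote_error(n_classifiers, error_rate):
    """P(a strict majority of n independent classifiers is wrong)."""
    smallest_wrong_majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * error_rate ** k
        * (1 - error_rate) ** (n_classifiers - k)
        for k in range(smallest_wrong_majority, n_classifiers + 1)
    )


# Probability that exactly 21 of the 40 classifiers are wrong.
p_exact_21 = comb(40, 21) * 0.3 ** 21 * 0.7 ** 19

# Probability that 21 or more are wrong, i.e. the majority vote is wrong.
p_majority_wrong = majority_vote_error(40, 0.3)

print(f"P(exactly 21 wrong)  = {p_exact_21:.4f}")
print(f"P(21 or more wrong)  = {p_majority_wrong:.4f}")
print("single classifier    = 0.3000")
```

Both the exact-21 term and the full tail come out near 0.002, two orders of magnitude below the 0.3 error rate of a single classifier.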


Ensemble Learning Methods

  • How to generate base classifiers? Generation strategy:

    • Decision tree learning: ID3, C4.5 & CART

    • Instance-based learning: k-nearest neighbor

    • Bayesian classification: Naïve Bayes

    • Neural networks

    • Regression analysis

    • Clustering, etc.

  • How to integrate them? Integration strategy:

    • BAGGing = Bootstrap AGGregation (Breiman, 1996)

    • Boosting (Schapire and Singer, 1998)

    • Random Forests (Breiman, 2001)
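
To make the generation and integration steps concrete, here is a minimal bagging sketch in plain Python. It is not the implementation used in the studies presented here: the base learner is a one-split decision tree (a "stump") chosen for brevity, the data are synthetic genotypes coded 0/1/2 at three hypothetical SNPs, and all function names are made up for the example.

```python
import itertools
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a one-split decision tree by minimizing training errors."""
    best = None
    for j in range(len(X[0])):                   # candidate feature
        for t in sorted({row[j] for row in X}):  # candidate threshold
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    _, j, t, left_label, right_label = best
    return lambda row: left_label if row[j] <= t else right_label


def bagging_fit(X, y, n_learners=25, seed=0):
    """Generation: train each learner on a bootstrap sample of the rows."""
    rng = random.Random(seed)
    n = len(X)
    learners = []
    for _ in range(n_learners):
        idx = rng.choices(range(n), k=n)  # n draws with replacement
        learners.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return learners


def bagging_predict(learners, row):
    """Integration: majority vote over all base learners."""
    votes = Counter(learner(row) for learner in learners)
    return votes.most_common(1)[0][0]


# Synthetic data: every genotype combination at three hypothetical SNPs.
# The label depends only on the first SNP (carrier of at least one copy).
X = [list(g) for g in itertools.product((0, 1, 2), repeat=3)]
y = [1 if g[0] >= 1 else 0 for g in X]

learners = bagging_fit(X, y)
predictions = [bagging_predict(learners, row) for row in X]
print(sum(p == lab for p, lab in zip(predictions, y)), "of", len(y), "correct")
```

Boosting differs in that learners are trained sequentially on reweighted data, and Random Forests add a random subset of candidate features at each split on top of the bootstrap step shown here.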


Random Forests

[Figure: Tree 1, Tree 2, Tree 3, ..., Tree i, Tree i+1, Tree i+2, Tree i+3, ..., Tree N, each casting a Yes or No vote. Final classification is based on votes from all N trees.]


Random Forests

  • Its accuracy is as good as Adaboost and sometimes better

  • It is relatively robust to outliers and noise

  • It is faster than bagging or boosting

  • It gives useful internal estimates of error, strength, correlation and variable importance

  • It is simple and can be easily parallelized
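
The internal error estimate comes from the out-of-bag (OOB) rows: each tree is grown on a bootstrap sample, and the rows that were not drawn act as a test set for that tree (Breiman, 2001). The short simulation below, a sketch with made-up names, shows that roughly 37% of the rows are out of bag for any one tree, which is why no separate hold-out set is needed for this estimate. The sample size of 360 is borrowed from the ds1/ds2 datasets in the application slides.

```python
import math
import random


def oob_fraction(n_rows, rng):
    """Fraction of rows left out of one bootstrap sample of size n_rows."""
    in_bag = set(rng.choices(range(n_rows), k=n_rows))
    return 1 - len(in_bag) / n_rows


rng = random.Random(0)
n_rows = 360
fractions = [oob_fraction(n_rows, rng) for _ in range(200)]  # 200 "trees"
mean_oob = sum(fractions) / len(fractions)

# Theory: a given row is missed by all n draws with probability (1 - 1/n)^n,
# which approaches exp(-1) as n grows.
expected = (1 - 1 / n_rows) ** n_rows

print(f"simulated OOB fraction: {mean_oob:.3f}")
print(f"(1 - 1/n)^n:            {expected:.3f}")
print(f"exp(-1):                {math.exp(-1):.3f}")
```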


Random Forests

Heidema 2006 BMC Genetics


Application of Random Forests

  • Candidate Genes

  • Genome-wide Markers


Application of Random Forests

  • Data: ds1 (n=360) with 16 covariates and 471 SNPs; ds2 (n=360) with 16 covariates and 471 SNPs

  • Missing SNP genotype imputation

  • SNP sets: all 471 SNPs, or tagSNPs (LD Rsq < 0.5)

  • Methods: Random Forests and RuleFit

  • Prediction models, evaluated with ROC curves and 5-fold CV

  • Identify replicable covariates and SNPs

  • KGraph summary and biological relevance
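
The evaluation step (ROC curve with 5-fold CV) can be sketched with two small helpers: one that deals the sample indices into five disjoint folds, and one that computes the area under the ROC curve as the probability that a randomly chosen case scores higher than a randomly chosen control. This is a generic illustration, not the code used for the analysis; the function names and the toy scores are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle 0..n_samples-1 and deal the indices into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def roc_auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count 1/2)."""
    cases = [s for s, lab in zip(scores, labels) if lab == 1]
    controls = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


# Five folds over a dataset of 360 subjects, the size of ds1 and ds2.
folds = k_fold_indices(360, k=5)
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(360) if i not in held_out]
    # A real analysis would fit the model on train_idx here, score the
    # subjects in test_idx, and pass those scores to roc_auc.

# Toy check of the AUC helper on four subjects.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print("fold sizes:", [len(f) for f in folds])
print("toy AUC:", toy_auc)
```

Keeping every subject in exactly one test fold means each prediction used for the ROC curve comes from a model that never saw that subject during training.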


Predicting CAC

[Figure panels: Random Forests; RuleFit]


Using All SNPs vs. Tag SNPs

A tagSNP is a representative SNP in a region of the genome that is in high linkage disequilibrium (LD) with the other SNPs in that region. Genotyping tagSNPs makes it possible to capture genetic variation without genotyping every SNP in a chromosomal region; the tagSNP is the maximally informative SNP for its region.
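
One simple way to obtain a tagSNP set is greedy pruning on pairwise LD: walk through the SNPs in order and keep a SNP only if its r² with every SNP already kept is below a threshold. The sketch below is an assumption-laden illustration rather than the procedure used in this study: r² is approximated by the squared Pearson correlation of genotypes coded 0/1/2 (a common shortcut when phase is unknown), the 0.5 threshold mirrors the Rsq < 0.5 cutoff mentioned in the analysis pipeline, and the SNP names and genotypes are invented.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors (0/1/2)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_a * var_b)


def prune_by_ld(genotypes, threshold=0.5):
    """Keep a SNP only if its r^2 with every kept SNP is below threshold."""
    kept = []
    for name, calls in genotypes.items():
        if all(r_squared(calls, genotypes[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical genotypes for six subjects at three SNPs.
genotypes = {
    "snp_a": [0, 0, 1, 1, 2, 2],
    "snp_b": [0, 0, 1, 1, 2, 1],  # nearly identical to snp_a (high LD)
    "snp_c": [0, 2, 1, 1, 0, 2],  # uncorrelated with snp_a
}

tag_snps = prune_by_ld(genotypes)
print(tag_snps)
```

Here snp_b is dropped because snp_a already captures most of its variation, while snp_c is kept as an independent signal. Dedicated LD-based tagging tools use more careful selection criteria than this first-come greedy pass.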


Predicting CAC


GPR35 Protein


KGraph Presentation


Predicting RA Status

  • Sample: One subject was randomly selected from each family to create dataset 1, and a second subject was then randomly selected from the remaining samples for dataset 2. The singletons were randomly divided between the two datasets. Each of the two replicate samples has 740 unrelated subjects.

  • Genetic Markers: 5,742 genome-wide informative SNP markers.
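
The sampling scheme can be sketched as follows. The code assumes that the second subject is drawn from the same family as the first, which is one reading of the description above, and it uses an invented family table; `split_families` and the subject IDs are hypothetical.

```python
import random


def split_families(families, seed=0):
    """Build two datasets of unrelated subjects from family data.

    families maps a family ID to the list of its genotyped members.
    """
    rng = random.Random(seed)
    dataset1, dataset2, singletons = [], [], []
    for members in families.values():
        if len(members) == 1:
            singletons.append(members[0])
            continue
        # Two distinct relatives: the first goes to dataset 1, the second,
        # drawn from the remaining members, goes to dataset 2.
        first, second = rng.sample(members, 2)
        dataset1.append(first)
        dataset2.append(second)
    # Singletons are divided at random between the two datasets.
    rng.shuffle(singletons)
    half = len(singletons) // 2
    dataset1.extend(singletons[:half])
    dataset2.extend(singletons[half:])
    return dataset1, dataset2


# Hypothetical pedigree table: two multi-member families, two singletons.
families = {
    "fam1": ["fam1_a", "fam1_b", "fam1_c"],
    "fam2": ["fam2_a", "fam2_b"],
    "fam3": ["fam3_a"],
    "fam4": ["fam4_a"],
}

dataset1, dataset2 = split_families(families)
print(dataset1)
print(dataset2)
```

Because each family contributes at most one subject to each dataset, subjects within a dataset are unrelated, while the two datasets are not independent of each other.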


Predicting RA Status

Sun YV et al., 2007, in press



Challenges

  • Validation, validation and validation!

“So far, comprehensive reviews of the published literature, most of which reports work based on the candidate-gene approach, have demonstrated a plethora of questionable genotype-phenotype associations, replication of which has often failed in independent studies.”


Computational Challenge

  • 100K SNP data

  • 500K SNP data

  • 1M SNP data

  • 3.1M HapMap SNPs (Nature Oct. 2007)

  • And more: different types of genetic variation

  • And more rare genetic variants

  • And larger samples


If all a man has is a hammer, then every problem looks like a nail.


Acknowledgements

University of Michigan

Sharon Kardia

Lawrence Bielak

Patricia Peyser

Ji Zhu

Mayo Clinic

Stephen Turner

Patrick Sheedy, II

University of Texas

Eric Boerwinkle
