
Bioinformatics: Multifactor Dimensionality Reduction
Kristel Van Steen, PhD, ScD (Université de Liège - Institut Montefiore), 2008-2009

Outline


  • Setting the scene
  • Analysis methods for gene-gene interactions
    • Traditional vs non-Traditional
    • MDR, MB-MDR, FAM-MDR
  • The future: work in progress
Genetic Architecture of Disease
  • The number of genes that impact disease susceptibility
  • The distribution of alleles and genotypes at those genes
  • The manner in which the alleles and genotypes impact disease susceptibility

(Weiss 1993)

Complications in disentangling?

There are likely to be many susceptibility genes, each with combinations of rare and common alleles and genotypes, that impact disease susceptibility primarily through non-linear interactions with genetic and environmental factors.

Analysis Methods

Traditional vs Non-Traditional

Traditional methods involving single markers have limited use and more advanced and efficient methods are needed to identify gene interactions and epistatic patterns of susceptibility
Alternative Methods
  • Tree-based methods:
    • Recursive Partitioning (Helix Tree)
    • Random Forests (R, CART)
  • Pattern recognition methods:
    • Symbolic Discriminant Analysis (SDA)
    • Mining association rules
    • Neural networks (NN)
    • Support vector machines (SVM)
  • Data reduction methods:
    • DICE (Detection of Informative Combined Effects)
    • MDR (Multifactor Dimensionality Reduction)
    • Logic regression …

(e.g., Onkamo and Toivonen 2006)

Gene Interaction Models
  • Non-parametric:
    • Appealing because no distributional assumptions on genotype-phenotype effect
  • Parametric:
    • Appealing because easy adjustment for confounding variables and main effects
    • Severe limitations in presence of too many independent variables in relation to number of observed outcome events
Out-of-control curse?

~500,000 SNPs span 80% of common variation in genome (HapMap)

[Figure: counts shown on the slide: 5 × 10^5, 1 × 10^11, 2 × 10^16, 3 × 10^21, 2 × 10^26]
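
The counts on this slide grow by roughly five orders of magnitude at each step, which is consistent with counting the ways of choosing 1 to 5 SNPs out of about 500,000. A minimal sketch of that arithmetic, assuming this is what the figure showed (the function name `n_combinations` is illustrative, not from the slides):

```python
import math

N_SNPS = 500_000  # approximate SNP count quoted on the slide


def n_combinations(k: int, n: int = N_SNPS) -> int:
    """Number of distinct unordered sets of k SNPs drawn from n SNPs."""
    return math.comb(n, k)


# Print the number of candidate k-SNP combinations for k = 1..5.
for k in range(1, 6):
    print(f"{k}-SNP combinations: {n_combinations(k):.1e}")
```

The exact values differ slightly from the rounded figures on the slide, but the orders of magnitude agree.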

Curse of Dimensionality
  • Bellman R (1961) Adaptive control processes: A guided tour. Princeton University Press:

“... Multidimensional variational problems cannot be solved routinely ... . This does not mean that we cannot attack them. It merely means that we must employ some more sophisticated techniques.”

Limitation of Regression
  • Having too many independent variables in relation to the number of observed outcome events
  • Assuming 10 bi-allelic loci:

# of Parameters =
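
The formula itself did not survive in the transcript. As a rough illustration only, one common way to count parameters assumes that each bi-allelic locus (three genotypes) is coded with two indicator variables and that the model contains all main effects and all interactions up to order 10. Under that assumption, which is not confirmed by the slide, the count can be sketched as:

```python
import math


def full_model_parameters(n_loci: int, dummies_per_locus: int = 2) -> int:
    """Main effects plus all interaction terms up to order n_loci.

    Assumption: each locus is coded with `dummies_per_locus` indicator
    variables, so an interaction among k loci adds dummies_per_locus**k terms.
    """
    return sum(
        math.comb(n_loci, k) * dummies_per_locus**k
        for k in range(1, n_loci + 1)
    )


n_params = full_model_parameters(10)
print(n_params)  # equals 3**10 - 1 by the binomial theorem
```

Even for 10 loci, the count under this coding runs into the tens of thousands, far beyond what a few hundred subjects can support.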

Limitation of Regression
  • Fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients and to an increase in Type 1 and Type 2 errors.
  • For 200 cases and 200 controls, this formula suggests that no more than 19 (= 200/10 - 1) parameters should be estimated in a logistic regression model.

# of parameters: $P \le \min(n_{\text{case}}, n_{\text{control}})/10 - 1$
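
The rule of thumb above can be checked with a short calculation. This is a sketch; the function name and the `events_per_parameter` argument are illustrative, and the integer division is an assumption for sample sizes that are not multiples of 10:

```python
def max_parameters(n_case: int, n_control: int, events_per_parameter: int = 10) -> int:
    """Upper bound on the number of parameters, following
    P <= min(n_case, n_control) / events_per_parameter - 1."""
    return min(n_case, n_control) // events_per_parameter - 1


# The slide's example: 200 cases and 200 controls.
limit = max_parameters(200, 200)
print(limit)
```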

Multifactor Dimensionality Reduction


to tackle the dimensionality problem of interaction detection

MDR for Interaction Detection
  • MDR creates a one-dimensional multi-locus genotype variable (high and low risk), which is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing.

(Ritchie et al 2001; Hahn et al 2003)
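
To make the high/low-risk pooling concrete, here is a minimal sketch. It assumes the usual MDR convention of labelling a multi-locus genotype cell "high" when its case:control ratio reaches a threshold (by default the overall case:control ratio); the slide does not state the threshold, and the function name and toy data are hypothetical:

```python
from collections import Counter


def mdr_risk_labels(genotypes, status, threshold=None):
    """Map each multi-locus genotype cell to 'high' or 'low' risk.

    genotypes: one tuple of genotype codes per subject.
    status:    1 for case, 0 for control (needs at least one control).
    """
    cases = Counter(g for g, y in zip(genotypes, status) if y == 1)
    controls = Counter(g for g, y in zip(genotypes, status) if y == 0)
    if threshold is None:
        # Assumed default: overall ratio of cases to controls.
        threshold = sum(cases.values()) / sum(controls.values())
    labels = {}
    for cell in set(cases) | set(controls):
        # cases >= threshold * controls avoids dividing by zero in empty cells.
        is_high = cases[cell] >= threshold * controls[cell]
        labels[cell] = "high" if is_high else "low"
    return labels


# Toy two-locus data: each tuple is (genotype at locus 1, genotype at locus 2).
genotypes = [(0, 0), (0, 0), (0, 1), (1, 1), (0, 0), (0, 1), (1, 1), (1, 1)]
status = [1, 1, 0, 0, 0, 1, 1, 0]
labels = mdr_risk_labels(genotypes, status)
```

The resulting labels form the one-dimensional variable that is then evaluated by cross-validation and permutation testing.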

MDR Steps

Data split: 9/10 training data, 1/10 test data.

10 cross-validations → 10 best models.

The model with minimum PE is the best n-locus model.

(Ritchie et al 2003)

Two Measures for Selection of Best n-locus model
  • Misclassification error: the proportion of incorrect classifications in the training set.
  • Prediction error (PE): the proportion of incorrect predictions in the test set.
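
The difference between the two measures is only which data the error is computed on. A small sketch, assuming a threshold of 1 (cases at least as frequent as controls) for high-risk cells and treating cells unseen in training as low risk; both choices and all names and data are illustrative:

```python
from collections import Counter


def fit_high_risk_cells(train):
    """Return the set of genotype cells labelled high risk in the training data.

    train: list of (genotype_cell, status) pairs, status 1 = case, 0 = control.
    """
    cases = Counter(g for g, y in train if y == 1)
    controls = Counter(g for g, y in train if y == 0)
    return {g for g in set(cases) | set(controls) if cases[g] >= controls[g]}


def error_rate(high_risk, data):
    """Proportion of subjects whose status disagrees with the risk label."""
    wrong = sum((g in high_risk) != (y == 1) for g, y in data)
    return wrong / len(data)


train = [("AA", 1), ("AA", 1), ("AA", 0), ("Aa", 0),
         ("Aa", 0), ("Aa", 1), ("aa", 1), ("aa", 0)]
test = [("AA", 1), ("Aa", 1), ("aa", 0), ("Aa", 0)]

high_risk = fit_high_risk_cells(train)
misclassification_error = error_rate(high_risk, train)  # training set
prediction_error = error_rate(high_risk, test)          # held-out test set
```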

Best Multi-factor Models

  • Best 2-factor model
  • Best 3-factor model
  • Best 4-factor model
  • Best 5-factor model
  • Best 6-factor model
  • …
  • Best n-factor model

Model Selection and Evaluation
  • Among the best n-factor models, the best model is:
    • The model with the minimum average prediction error (PE).
    • The model with the maximum average cross-validation consistency (CVC).
    • Rule of parsimony: If there is a tie, select the smaller model.
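
One possible reading of these rules is sketched below: pick the candidate with the lowest average PE and the candidate with the highest average CVC, and if the two criteria point to different models, fall back on parsimony. How the slide intends the two criteria to be combined is an assumption here, and the candidate values are made up for illustration:

```python
def select_final_model(candidates):
    """Choose among the best n-factor models.

    candidates: list of dicts with keys 'loci' (tuple of marker names),
    'avg_pe' (average prediction error) and 'avg_cvc' (average CVC).
    """
    best_pe = min(candidates, key=lambda m: m["avg_pe"])
    best_cvc = max(candidates, key=lambda m: m["avg_cvc"])
    if best_pe is best_cvc:
        return best_pe
    # The criteria disagree: apply the rule of parsimony (fewer loci wins).
    return min((best_pe, best_cvc), key=lambda m: len(m["loci"]))


# Hypothetical results for the best 1-, 2- and 3-factor models.
candidates = [
    {"loci": ("SNP1",), "avg_pe": 0.45, "avg_cvc": 6},
    {"loci": ("SNP1", "SNP6"), "avg_pe": 0.30, "avg_cvc": 10},
    {"loci": ("SNP1", "SNP6", "SNP9"), "avg_pe": 0.29, "avg_cvc": 7},
]
final_model = select_final_model(candidates)
```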
Significance of the Final Model
  • Via permutation tests:
    • Randomize the case and control labels in the original dataset multiple times to create a set of permuted datasets.
    • Run MDR on each permuted dataset.
    • The maximum CVC and minimum PE identified for each permuted dataset are saved and used to create an empirical distribution for estimation of a P-value.
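
The permutation procedure can be sketched generically. In the sketch below, `statistic` stands in for a full MDR run that returns a "larger is better" summary (for example maximum CVC, or one minus minimum PE); `toy_accuracy` is a hypothetical placeholder and not MDR itself, and the add-one correction in the P-value is a common convention assumed here:

```python
import random
from collections import Counter


def toy_accuracy(genotypes, status):
    """Placeholder statistic: accuracy of a majority vote within each cell."""
    cases = Counter(g for g, y in zip(genotypes, status) if y == 1)
    controls = Counter(g for g, y in zip(genotypes, status) if y == 0)
    cells = set(cases) | set(controls)
    return sum(max(cases[g], controls[g]) for g in cells) / len(status)


def permutation_p_value(statistic, genotypes, status, n_permutations=200, seed=0):
    """Empirical P-value from shuffling case/control labels."""
    rng = random.Random(seed)
    observed = statistic(genotypes, status)
    shuffled = list(status)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any genotype-phenotype association
        if statistic(genotypes, shuffled) >= observed:
            as_extreme += 1
    return (as_extreme + 1) / (n_permutations + 1)


# Toy data with a perfect genotype-status association.
genotypes = ["AA"] * 10 + ["aa"] * 10
status = [1] * 10 + [0] * 10
p_value = permutation_p_value(toy_accuracy, genotypes, status)
```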

Example: through simulation

  • 200 cases and 200 controls
  • 10 SNPs: 1, 2, 3, …, 10
  • Disease etiology due to interaction between SNP 1 and SNP 6
  • Over 10 CVs and 10 runs

Advantages of MDR
  • Simultaneous detection of multiple genetic loci associated with a discrete clinical endpoint in absence of main effect.
  • Non-parametric: overcomes the “curse of dimensionality” that limits logistic regression models.
  • Three genotype groups are considered separately
  • Non-linear interactions between multiple polymorphisms in the absence of independent effects
  • Low false positive rates
Disadvantages of MDR
  • Need to introduce parametrics?
    • MDR in its initial layout cannot deal with main effects / confounding factors / non-dichotomous outcomes:
      • GMDR / OR-MDR
    • Low power in the presence of genetic heterogeneity

Power Simulation Set-Up

  • No noise
  • 5% genotyping error (GE)
  • 5% missing data (MS)
  • 50% phenocopy (PC)
  • 50% genetic heterogeneity (GH)

6 models

4 models

Total 16 models

Disadvantages of MDR
  • Noteworthy:
    • Model selection on the basis of prediction accuracy
    • One single higher-order interaction model is proposed
    • Some important interactions could be missed due to pooling too many cells together
Model Based MDR (MB-MDR)
MB-MDR in its simplest form
  • Step 1: New risk cell identification via association test on each genotype cell c_j
    • Parametric or non-parametric test of association
  • Step 2: Test X on Y
    • Parametric or non-parametric
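
As an illustration of Step 1, the sketch below tests each genotype cell against all remaining cells with a Pearson chi-square test and labels it high risk (H), low risk (L), or no evidence (O). The choice of test, the 0.1 cut-off, the label names, and the toy data are all assumptions for illustration; the slides only say that a parametric or non-parametric association test is applied to each cell:

```python
import math
from collections import Counter


def chi2_p_value(a, b, c, d):
    """P-value of the Pearson chi-square test (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df


def label_cells(genotypes, status, alpha=0.1):
    """Label each genotype cell 'H', 'L' or 'O' by testing it against the rest."""
    cases = Counter(g for g, y in zip(genotypes, status) if y == 1)
    controls = Counter(g for g, y in zip(genotypes, status) if y == 0)
    n_cases, n_controls = sum(cases.values()), sum(controls.values())
    labels = {}
    for cell in set(cases) | set(controls):
        a, b = cases[cell], controls[cell]   # this cell
        c, d = n_cases - a, n_controls - b   # all other cells pooled
        if chi2_p_value(a, b, c, d) >= alpha:
            labels[cell] = "O"
        else:
            labels[cell] = "H" if a * d > b * c else "L"
    return labels


# Toy data: cell A is enriched for cases, B for controls, C is balanced.
genotypes = ["A"] * 40 + ["B"] * 40 + ["C"] * 20
status = [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30 + [1] * 10 + [0] * 10
cell_labels = label_cells(genotypes, status)
```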
MB-MDR in its simplest form
  • Step 3: assess significance
    • $W = [b/\mathrm{se}(b)]^2$, $b = \ln(\mathrm{OR})$
    • Adjust for number of combined cells in high and low risk category
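
A worked sketch of the Wald statistic for a single 2x2 table (high-risk versus low-risk category by case/control status). The standard error uses the usual large-sample formula for a log odds ratio, which is an assumption here since the slides do not give it; the table layout and counts are hypothetical, and the P-value shown is the unadjusted chi-square (1 df) value, before any adjustment for the number of combined cells:

```python
import math


def wald_statistic(a, b, c, d):
    """Wald statistic W = [b/se(b)]^2 with b = ln(OR).

    a: cases in the high-risk category,  b: controls in the high-risk category,
    c: cases in the low-risk category,   d: controls in the low-risk category.
    All four counts must be positive.
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # assumed large-sample SE
    w = (log_or / se) ** 2
    p_unadjusted = math.erfc(math.sqrt(w / 2))  # chi-square, 1 df, upper tail
    return w, p_unadjusted


w, p_unadjusted = wald_statistic(60, 40, 40, 60)
print(round(w, 2), round(p_unadjusted, 4))
```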
Improve power in the presence of heterogeneity

Power of MDR compared to MB-MDR under aforementioned scenarios

(Calle, Urrea, Malats, Van Steen 2008, submitted)

MB-MDR in its simplest form
  • Step 4: Adjusted p-values need to be corrected for multiple testing

From MB-MDR to FAM-MDR
  • Extension to families
    • Perform polygenic analysis using the complete pedigree structure but ignore marker data.
    • Derive residuals from this model (gives rise to independent quantitative “new” traits)
    • Submit to MB-MDR
    • Effect sizes can be derived using measured (multi-locus) genotype models on the selected combinations of markers.

Adjusted p-values need to be corrected for multiple testing

Motivation for FAM-MDR
  • The idea of removing “family trend due to genetic inheritance” was also adopted in the GRAMMAR approach of Aulchenko and colleagues.

“For each particular method there are situations for which it is particularly well suited, and others where it performs badly compared to the best that can be done with that data…

However, it is seldom known in advance which procedure will perform best or even well for any given problem.”

(Hastie et al 2001)


Helpful discussions:

Marylyn Ritchie and co-workers (USA), Malu Calle and Victor Urrea (Spain)

PhD students on the project:

Jestinah Mahachie (e.g., MDR and longitudinal measurements), Vaness De Wit (e.g., MDR and multi-allelic markers; sparse cell management), Lizzy De Lobel (e.g., pre-screening algorithms)

Post-doc on the project:

Tom Cattaert (e.g., FAM-MDR simulations)