Bayesian Modelling for Differential
Download
1 / 56

- PowerPoint PPT Presentation


  • 138 Views
  • Uploaded on

Bayesian Modelling for Differential Gene Expression. Alex Lewin (Imperial College) Sylvia Richardson ( IC Epidemiology) Tim Aitman (IC Microarray Centre) In collaboration with Anne-Mette Hein, Natalia Bochkina ( IC Epidemiology) Helen Causton (IC Microarray Centre)

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about '' - truman


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Slide1 l.jpg

Bayesian Modelling for Differential

Gene Expression

Alex Lewin

(Imperial College)

Sylvia Richardson (IC Epidemiology)

Tim Aitman (IC Microarray Centre)

In collaboration with

Anne-Mette Hein, Natalia Bochkina (IC Epidemiology)

Helen Causton (IC Microarray Centre)

Peter Green (Bristol)


Slide2 l.jpg

Insulin-resistance gene Cd36

cDNA microarray: hybridisation signal for SHR much lower than for Brown Norway and SHR.4 control strains

Aitman et al 1999, Nature Genet 21:76-83


Slide3 l.jpg

Larger microarray experiment: look for other genes associated with Cd36

Microarray Data

3 SHR compared with 3 transgenic rats (with Cd36)

3 wildtype (normal) mice compared with 3 mice with Cd36 knocked out

 12000 genes on each array

Biological Question

Find genes which are expressed differently between animals with and without Cd36.


Slide4 l.jpg


Slide5 l.jpg

Low-level Model

(how gene expression is estimated from signal)

Normalisation

(to make arrays comparable)

Clustering,

Partition Model

Differential

Expression

Microarray analysis is amulti-step process

We aim to integrate all

the steps in a common

statistical framework


Slide6 l.jpg

Bayesian Modelling Framework

  • Model different sources of variability simultaneously,

    within array, between array …

  • Uncertainty propagated from data to parameter estimates (so not over-optimistic in conclusions).

  • Share information in appropriate ways to get robust estimates.


Slide7 l.jpg

sd

mean

Gene Expression Data

3 wildtype mice,

Fat tissue hybridised to Affymetrix chips

Newton et al. 2001

Showed data fit well by Gamma or Log Normal distributions

Kerr et al. 2000

Linear model on log scale


Slide8 l.jpg

Bayesian hierarchical model for differential expression

Data: ygsr = log expression for gene g, condition s, replicate r

g = gene effect

δg = differential effect for gene g between 2 conditions

r(g)s = array effect (expression-level dependent)

gs2 = gene variance

  • 1st level

    yg1r | g, δg, g1  N(g – ½ δg + r(g)1 , g12),

    yg2r | g, δg, g2  N(g + ½ δg + r(g)2 , g22),

    Σrr(g)s = 0

    r(g)s = function of g , parameters {a} and {b}


Slide9 l.jpg

Explicit modelling

of the alternative

H0

Priors for gene effects

Mean effect g

g ~ Unif (much wider than data range)

Differential effect δg

δg ~ N(0,104) – “fixed” effects (no structure in prior)

OR mixture:

δg ~ p0δ0 + p1G_(1.5, 1) + p2G+(1.5, 2)


Slide10 l.jpg

References

Fixed Effects

Kerr et al. 2000

Mixture Models

Newton et al. 2004 (non-parametric mixture)

Löenstedt and Speed 2003, Smyth 2004

(conjugate mixture prior)

Broet et al. 2002 (several levels of DE)


Slide11 l.jpg

Prior for gene variances

Two extreme cases:

(1) Constant variance gsr  N(0, 2)

Too stringent Poor fit

(2) Independent variances gsr  N(0, g2)

! Variance estimates based on few replications are highly variable

Need to share information between genes to

better estimate their variance, while allowing

some variability

Hierarchical model


Slide12 l.jpg

Prior for gene variances

  • 2nd level

    gs2 | μs, τs logNormal (μs, τs)

    Hyper-parameters μs and τs can be influential.

    Empirical Bayes

    Eg. Löenstedt and Speed 2003, Smyth 2004

    Fixes μs , τs

    Fully Bayesian

  • 3rd level

    μs  N( c, d) τs  Gamma (e, f)


Slide13 l.jpg

Gene specific variances are stabilised

  • Variances estimated using information from all G x R measurements (~12000 x 3) rather than just 3

  • Variances stabilised and shrunk towards average variance


Slide14 l.jpg

a0

a1

a3

a2

Prior for array effects (Normalization)

Spline Curve

r(g)s = quadratic in g for ars(k-1)≤ g ≤ ars(k)

with coeff (brsk(1),brsk(2) ), k =1, … #breakpoints

Locations of break points not fixed

Must do sensitivity checks on # break points


Slide15 l.jpg

Bayesian posterior mean

loess

Array effect as a function of gene effect


Slide16 l.jpg

Wildtype

Knockout

Effect of normalisation on density

Before (ygsr)

^

After (ygsr- r(g)s )


Slide17 l.jpg

Bayesian hierarchical model for differential expression

  • 1st level

    • ygsr | g, δg, gs  N(g – ½ δg + r(g)s , gs2),

  • 2nd level

    • Fixed effect priors for g, δg

    • Array effect coefficients, Normal and Uniform

    • gs2 | μs, τs logNormal (μs, τs)

  • 3rd level

    • μs  N( c, d)

    • τs  Gamma (e, f)


Slide18 l.jpg

WinBUGS software for fitting Bayesian models

WinBUGS does the calculations

Declare the model

for( i in 1 : ngenes ) {

for( j in 1 : nreps) {

y1[i, j] ~ dnorm(x1[i, j], tau1[i])

x1[i, j] <- alpha[i] - 0.5*delta[i]

+ beta1[i, j]

}

}

for( i in 1 : ngenes ) {

tau1[i] <- 1.0/sig21[i]

sig21[i] <- exp(lsig21[i])

lsig21[i] ~ dnorm(mm1,tt1)

}

mm1 ~ dnorm( 0.0,1.0E-3)

tt1 ~ dgamma(0.01,0.01)


Slide19 l.jpg

WinBUGS software for fitting Bayesian models

Whole posterior distribution

Posterior means, medians, quantiles


Slide20 l.jpg


Slide21 l.jpg

Decision Rules for Inference

So far, discussed fitting the model.

How do we decide which genes are differentially expressed?

Parameters of interest: g , δg , g

  • What quantity do we consider, δg , (δg /g) , … ?

  • How do we summarize the posterior distribution?


Slide22 l.jpg

biological

interest

biological

interest

statistical

confidence

Fixed Effects Model

Inference on δ

(1) dg= E(δg | data) posterior mean

Like point estimate of log fold change.

Decision Rule: gene g is DE if |dg| > δcut

(2) pg = P( |δg|> δcut | data)

posterior probability (incorporates uncertainty)

Decision Rule: gene g is DE if pg > pcut

This allows biologist to specify what size of effect is interesting (not just statistical significance)


Slide23 l.jpg

statistical

confidence

statistical

confidence

Fixed Effects Model

Inference on δ, 

(1) tg= E(δg | data) / E(g | data)

Like t-statistic.

Decision Rule: gene g is DE if |tg| > tcut

(2) pg = P( |δg /g|> tcut | data)

Decision Rule: gene g is DE if pg > pcut

Bochkina and Richardson (in preparation)


Slide24 l.jpg

Explicit modelling

of the alternative

H0

Mixture Model

δg ~ p0δ0 + p1G_(1.5, 1) + p2G+(1.5, 2)

(1) dg= E(δg | data) posterior mean

Shrunk estimate of log fold change.

Decision Rule: gene g is DE if |dg| > δcut

(2) Classify genes into the mixture components.

pg = P(gene g not in H0 | data)

Decision Rule: gene g is DE if pg > pcut


Slide25 l.jpg

Illustration of decision rule

pg = P( |δg|> log(2)

and g > 4| data)

x pg > 0.8

Δ t-statistic > 2.78

(95% CI)


Slide26 l.jpg


Bayesian p values l.jpg
Bayesian P-values

  • Compare observed data to a “null” distribution

  • P-value: probability of an observation from the null distribution being more extreme than the actual observation

  • If all observations come from the null distribution, the distribution of p-values is Uniform


Cross validation p values l.jpg
Cross-validation p-values

Idea of cross validation is to split the data: one part for fitting the model, the rest for validation

n units of observation

For each observation yi, run model on rest of data y-i, predict new data yinew from posterior distribution.

Bayesian p-value pi = Prob(yinew> yi | data y-i)

Distribution of p-values {pi,i=1,…,n} is approximately Uniform if model adequately describes the data.


Posterior predictive p values l.jpg
Posterior Predictive p-values

For large n, notpossible to run model n times.

Run model on all data.For each observation yi, predict new data yinew from posterior distribution.

Bayesian p-value pi = Prob(yinew> yi | all data)

“all data” includes yi p-values are less extreme

than they should be

p-values are conservative (not quite Uniform).


Slide30 l.jpg

Example: Check priors on gene variances

  • Compare equal and exchangeable variance models

  • Compare different exchangeable priors

Want to compare data for each gene, not gene and replicate, so use sample variance Sg2 (suppress index s here)

Bayesian p-value Prob( Sg2 new > Sg2 obs | data)


Slide31 l.jpg

WinBUGS code for posterior predictive checks

replicate relevant sampling distribution

calculate sample variances

count no. times predicted sample variance is bigger than observed sample variance

for( i in 1 : ngenes ) {

for( j in 1 : nreps) {

y1[i, j] ~ dnorm(x1[i, j], tau1[i])

ynew1[i, j] ~ dnorm(x1[i, j], tau1[i])

x1[i, j] <- alpha[i] - 0.5*delta[i]

+ beta1[i, j]

}

s21[i] <- pow(sd(y1[i, ]), 2)

s2new1[i] <- pow(sd(ynew1[i, ]), 2)

pval1[i] <- step(s2new1[i] - s21[i])

}


Slide32 l.jpg

Prior

parameters

Mean

parameters

g2

ygr

r = 1:R

new

new

Sg2

ygr

Sg2

g = 1:G

Posterior predictive

Graph shows structure of model


Slide33 l.jpg

Prior

parameters

Mean

parameters

g2

ygr

r = 1:R

new

new

new

Sg2

ygr

g2

Sg2

g = 1:G

Less conservative than posterior predictive

(Marshall and Spiegelhalter, 2003)

Mixed predictive


Slide34 l.jpg

Four models for gene variances

Equal variance model:

Model 1: 2  log Normal (0, 10000)

Exchangeable variance models:

Model 2: g-2  Gamma (2, β)

Model 3: g-2  Gamma (α, β)

Model 4: g2  log Normal (μ, τ)

(α, β, μ, τ all parameters)



Slide36 l.jpg


Slide37 l.jpg

Expression level dependent normalization

Many gene expression data sets need normalization which depends on expression level.

Usually normalization is performed in a pre-processing step before the model for differential expression is used.

These analyses ignore the fact that the expression level is measured with variability.

Ignoring this variability leads to bias in the function used for normalization.


Slide38 l.jpg

Simulated Data

Gene variances

similar range and distribution to mouse data

Array effects

cubic functions of expression level

Differential effects

900 genes: δg= 0

50 genes: δg N( log(3), 0.12)

50 genes: δg N( -log(3), 0.12)


Slide39 l.jpg

_

Data points:ygsr – yg (r = 1…3)

Curves: r(g)s (r = 1…3)

Array Effects and Variability for Simulated Data


Slide40 l.jpg

Two-step method (using loess)

  • Use loess smoothing to obtain array effects loessr(g)s

  • Subtract loess array effects from data: yloessgsr = ygsr - loessr(g)s

  • Run our model on yloessgsrwith no array effects


Slide41 l.jpg

Decision rules for selecting differentially expressed genes

If P( |δg| > δcut | data) > pcut then gene g is called differentially expressed.

δcut chosen according to biological hypothesis of interest (here we use log(3)).

pcut corresponds to the error rate (e.g. False Discovery Rate or Mis-classification Penalty) considered acceptable.


Slide42 l.jpg

Full model v. two-step method

Plot observed False Discovery Rate against pcut (averaged over 5 simulations)

Solid line for full model

Dashed line for pre-normalized method


Slide43 l.jpg

Different two-step methods

  • yloessgsr = ygsr - loessr(g)s

  • ymodelgsr = ygsr - E(r(g)s | data)

  • Results from 2 different two-step methods are much closer to each other than to full model results.


Slide44 l.jpg


Slide45 l.jpg

Gene Ontology (GO)

Database of biological terms

Arranged in graph connecting related terms

Directed Acyclic Graph: links indicate more specific terms

~16,000 terms

from QuickGO website (EBI)


Slide46 l.jpg

Gene Ontology (GO)

from QuickGO website (EBI)


Slide47 l.jpg

Gene Annotations

  • Genes/proteins annotated to relevant GO terms

  • Gene may be annotated to several GO terms

  • GO term may have 1000s of genes annotated to it (or none)

  • Gene annotated to term A  annotated to all ancestors of A (terms that are related and more general)


Slide48 l.jpg

GO annotations of genes associated with the insulin-resistance gene Cd36

Compare GO annotations of genes most and least differentially expressed

Most differentially expressed ↔ pg > 0.5 (280 genes)

Least differentially expressed ↔ pg < 0.2 (11171 genes)


Slide49 l.jpg

genes not annot. insulin-resistance gene Cd36

to GO term

genes annot.

to GO term

A

B

genes most

diff. exp.

C

D

genes least

diff. exp.

GO annotations of genes associated with the insulin-resistance gene Cd36

For each GO term, Fisher’s exact test on

proportion of differentially expressed genes with annotations

v.

proportion of non-differentially expressed genes with annotations

observed O = A

expected E = C*(A+B)/(C+D)

if no association of GO

annotation with DE

FatiGO website

http://fatigo.bioinfo.cnio.es/


Slide50 l.jpg

GO annotations of genes associated with the insulin-resistance gene Cd36

O = observed no. differentially expressed genes

E = expected no. differentially expressed genes


Slide51 l.jpg

Biological process insulin-resistance gene Cd36

Physiological process

Response to stimulus

Organismal movement

Response to external stimulus

(O=12, E=4.7)

Response to biotic stimulus

(O=14, E=6.9)

Response to stress

(O=12, E=5.9)

Response to wounding

(O=6, E=1.8)

Response to external biotic stimulus *

Defense response

(O=11, E=5.8)

Response to pest, pathogen or parasite

(O=8, E=2.6)

Immune response

(O=9, E=4.5)

Inflammatory response

(O=4, E=1.2)

All GO ancestors of

Inflammatory response

* This term was not accessed by FatiGO

Relations between GO terms were found using QuickGO:

http://www.ebi.ac.uk/ego/


Slide52 l.jpg

Further Work to do on GO insulin-resistance gene Cd36

  • Account for dependencies between GO terms

  • Multiple testing corrections

  • Uncertainty in annotation

    ( work in preparation )


Slide53 l.jpg

Summary insulin-resistance gene Cd36

  • Bayesian hierarchical model flexible, estimates variances robustly

  • Predictive model checks show exchangeable prior good for gene variances

  • Useful to find GO terms over-represented in the most differentially-expressed genes

    Paper available (Lewin et al. 2005, Biometrics, in press)

    http ://www.bgx.org.uk/


Slide55 l.jpg

Decision Rules insulin-resistance gene Cd36

  • In full Bayesian framework, introduce latent allocation variable zg= 0,1 for gene g in null, alternative

  • For each gene, calculate posterior probability of belonging to unmodified component: pg = Pr( zg = 0 | data )

  • Classify using cut-off on pg (Bayes rule corresponds to 0.5)

  • For any given pg , can estimate FDR, FNR.

For gene-list S, est. (FDR | data) = Σg  S pg / |S|


Slide56 l.jpg

The Null Hypothesis insulin-resistance gene Cd36

Composite Null

Point Null, alternative not modelled

Point Null, alternative modelled


ad