Bayesian hierarchical modelling
1 / 69

Sylvia Richardson Centre for Biostatistics Imperial College, London - PowerPoint PPT Presentation

  • Uploaded on

Bayesian hierarchical modelling of genomic data. Sylvia Richardson Centre for Biostatistics Imperial College, London. In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s) Helen Causton and Tim Aitman (Hammersmith) Peter Green (Bristol)

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about ' Sylvia Richardson Centre for Biostatistics Imperial College, London' - kelly-meyer

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Bayesian hierarchical modelling

of genomic data

Sylvia Richardson

Centre for Biostatistics

Imperial College, London

In collaboration with

Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s)

Helen Causton and Tim Aitman (Hammersmith)

Peter Green (Bristol)

Philippe Broët (INSERM, Paris)

BBSRC Exploiting Genomics grant


  • Introduction

  • A fully Bayesian gene expression index (BGX)

  • Differential expression and array effects

  • Mixture models

  • Discussion

Part 1 introduction
Part 1 Introduction

  • Recent developments in genomics have led to techniques

    • Capable of interrogating the genome at different levels

    • Aiming to capture one or several stages of the biological process

DNA mRNA protein phenotype

Fundamental process

Protein-encoding genes are transcribed into mRNA (messenger),

and the mRNA is translated to make proteins

DNA -> mRNA -> protein

Pictures from

What are gene expression data ?

  • DNA Microarrays are used to measure the relative abundance of mRNA, providing information on gene expression in a particular cell, under particular conditions

  • The fundamental principle used to measure the expression is that of hybridisation between a sample and probes:

    • Known sequences of single-stranded DNA representing genes are immobilised on microarray

    • Tissue sample (with unknown concentration of RNA) fluorescently labelled

    • Sample hybridised to array

    • Array scanned to measure amount of RNA present for each sequence

  • The expression level of ten of thousands of probes are measured on a single microarray !

gene expression measure

gene expression profile

Variation and uncertainty

Gene expression data (e.g. Affymetrix) is the result of multiple sources of variability

  • condition/treatment

  • biological

  • array manufacture

  • imaging

  • technical

  • gene specific variability of the probes for a gene

  • within/between array variation

Structured statistical modelling allows considering all uncertainty at once

Example of within vs between strains gene variability
Example of within vs between strains gene variability

  • 7 cross-bred strains of mice that differ only by a small portion of chromosome 1

  • Strains have different phenotypes related to immunological disorders

  • For each line, 9 animals used to obtain 3 pooled RNA extracts from spleen 7 x 3 samples

    Excellent experimental design to minimise “biological variability between replicate animals”

    Aim: to tease out differences between expression profiles of the 7 lines of mice and relate these to locations on chromosome 1

Average (over the 7 groups) of

within strain variance

calculated from the 3 pooled samples

Total variance calculated

over the 21 samples

Biological variability is large !

Ratio within/total

1000 genes most variable

between strains: hierarchical

clustering recovers the

cross-bred lines structure

Random set of 1000 genes

Common characteristics of genomics data sets
Common characteristics of genomics data sets

  • High dimensional data (ten of thousands of genes) and few samples

  • Many sources of variability (low signal/noise ratio)

Common issues

  • Pre-processing and data reduction

  • Multiple testing

  • Need to borrow information

  • Importance to include prior biological knowledge

Part 2

  • Introduction

  • A fully Bayesian gene expression index (BGX)

    • Single array model

    • Multiple array model

  • Differential expression and array effects

  • Mixture models

  • Discussion









A fully Bayesian Gene eXpression index for Affymetrix GeneChip arraysAnne Mette HeinSR, Helen Causton, Graeme Ambler, Peter Green

Raw intensities

Background correction

Gene specific variability


Gene index BGX






Zoom Image of Hybridised Array

Hybridised Spot

Each gene g represented by probe set: (J:11-20)

Perfect match: PMg1,…, PMgJ

Mis-match: MMg1,…, MMgJ

Expressed PM

Non-expressed PM

Image of Hybridised Array

expression measure for gene g

Slide courtesy of Affymetrix

Affymetrix GeneChips:

Commonly used methods for estimation expression levels from GeneChips

  • MAS5:

  • uses PM and MMs. Imputes IMs from MMs to obtain all PM-MMs positive

  • gene expression measure : estimate obtained by applying Tukey Biweight to the set of log(PM-MM) values in the probe set

  • RMA:

  • uses PMs only.

  • Fits an model with additive gene and probe effects to log-scale background corrected PMs using median polish

Characteristics: positive, robust, noisy at low levels

Characteristics: positive, robust, attenuated signal detection

Variability across conditions is conditioned by the choice of summary measure beware of filtering

Mean (left) and Empirical standard deviation (right) GeneChips

over 7 conditions (arrays) for 45000 genes estimated by

2 different methods for quantifying gene expression

Variability across conditions is conditioned by the choice of summary measure ! Beware of filtering


Model assumptions and key biological parameters GeneChips

  • The intensity for the PM measurement for probe (reporter) j and gene g is due to binding

    of labelled fragments that perfectly match the oligos in the spot

    The true Signal Sgj

    of labelled fragments that do not

    perfectly match these oligos

    The non-specific hybridisation Hgj

  • The intensity of the corresponding MM measurement is caused

    by a binding fraction Φ of the true signal Sgj

    by non-specific hybridisation Hgj

PM GeneChipsgj N( Sgj + Hgj , τ2)

MMgj N(Φ Sgj + Hgj,τ2)


noise, additive


Non-specific hybridisation


log(Sgj+1)  TN (μg ,ξg2)

log(Hgj+1)  TN(λ, η2)


Gene specific

error terms:


Array-wide distribution

Gene expression index (BGX):

qg=median(TN (μg ,ξg2))

“Pools” information over probes j=1,…,J

log(ξg2)N(a, b2)

Priors: “vague”t2 ~ G(10-3, 10-3) F~ B(1,1),

mg ~ U(0,15) h2 ~ G(10-3, 10-3),l ~ N(0,103)

“Empirical Bayes”

BGX single array model:g=1,…,G (thousands), j=1,…,J (11-20)

Inference: GeneChips


2.5-97.5% credibility interval

For each gene g:

Log-scale true signals:

Log-scale non-spec. hybr:

BGX: gene expr:

log(Sgj+1): j=1,…,J

log(Hgj+1): j=1,…,J


1.75 2 2.25

1 2 3

2 3 4

NB! A distribution

  • Implemented in WinBugs and C

  • allows: - Joint estimation of parameters in full

  • Bayesian framework

  • obtain: - posterior distributions of parameters

  • (and functions of these) in model:

Computational issues
Computational issues GeneChips

  • We found mixing slow for gene specific parameters (μg ,ξg2) and large autocorrelation

  • For low signal (bottom 25%) more variability of Sgj and Hgj , and less separation

    So less information on (μg ,ξg2) and longer runs are needed

  • For the full hierarchical model, the convergence of the hyperparameters for the distribution of ξg2 was problematic

  • We studied sensitivity to a range of plausible values for those and implemented an “empirical Bayes” version of the model which was reproducible with sensible run length

Posterior mean of GeneChipsqgusing a run

of 30 000 versus those obtained

from runs of 5 000, 10 000 and

20 000 sweeps


is obtained with

short runs for

large expression


Longer runs

are necessary for low expression values

Single array model performance data set varying concentrations genelogic
Single array model performance: GeneChipsData set : varying concentrations (geneLogic):

  • 14 samples of cRNA from acute myeloid leukemia (AML) tumor cell line

  • In sample k: each of 11 genes spiked in at concentration ck:

    sample k: 1 2 3 4 5 6 7 8 9 10 11 12 13 14

    conc. ck(pM): 0.0 0.5 0.75 1.0 1.5 2.0 3.0 5.0 12.5 25 50 75 100 150

  • Each sample hybridised to an array

Consider subset consisting of 500 normal genes

+ 11 spike-ins

Single array model: GeneChipsexamples of posterior distributions of BGX expression indices

Each curve

(truncated normal with median param.) represents a gene

Examples with data:

o: log(PMgj-MMgj)


(at 0 if not defined)

Mean +- 1SD

Single array model performance 11 genes spiked in at 13 increasing concentrations

Comparison with other expression measures GeneChips

Single array model performance:11 genes spiked in at 13 (increasing) concentrations

BGX index qgincreases with concentration …..

… except for gene 7 (spiked-in??)

Indication of smooth

& sustained increase

over a wider range of


2.5 – 97.5 % credibility intervals for the Bayesian expression index

11 spike-in genes at 13 different concentration (data set A)

Note how the variability

is substantially larger

for low expression level

Each colour corresponds to a different spike-in gene

Gene 7 : broken red line

Integrated modelling of Affymetrix data expression index

Condition 2

Condition 1

















Gene specific variability (probe)

Gene index BGX

Gene specific variability (probe)

Gene index BGX

Hierarchical model of replicate

(biological) variability and array effect

Hierarchical model of replicate

(biological) variability and array effect

Distribution of expression

index for gene g , condition 1

Distribution of expression

index for gene g , condition 2

Distribution of differential expression parameter

Background expression index

noise, additive

Array specific

log(Sgjcr+1)  TN (μgc,ξgc2)

log(Hgjcr+1)  TN(λcr,ηcr2)

Array-specific distribution of non-specific hybridisation

BGX Multiplearray model: conditions: c=1,…,C, replicates: r = 1,…,Rc

PMgjcr N( Sgjcr+ Hgjcr, τcr2)

MMgjcr N(ΦSgjcr+ Hgjcr, τcr2)

Gene and condition specific BGX

qgc=median(TN(μgc, ξgc2))

“Pools” information over replicate probe sets

j = 1,…J, r = 1,…,Rc

Subset of AffyU133A spike-in data set expression index(AffyComp)

  • Consider:

  • Six arrays, 1154 genes (every 20th and 42 spike-ins)

  • Same cRNA hybridised to all arrays EXCEPT for spike-ins:

`1` `2` `3` … `12` `13` `14`

Spike-in genes: 1-3 4-6 7-9 … 34-36 37-39 40-42

Spike-in conc (pM):

Condition 1 (array 1-3): 0.0 0.25 0.50 … 128 256 512

Condition 2 (array 4-6): 0.25 0.50 1.00 … 256 512 0.00

Fold change: - 2 2 … 2 2 -

BGX: measure of uncertainty provided expression indexPosterior mean +- 1SD credibility intervals

diffg=bgxg,1- bgxg,2


Spike in 1113 -1154

above the blue line

Blue stars show RMA measure

Part 3 expression index

  • Introduction

  • A fully Bayesian gene expression index (BGX)

  • Differential expression and array effects

    • Non linear array effects

    • Model checking

  • Mixture models

  • Discussion

Differential expression and array effects alex lewin sr natalia bochkina tim aitman

Differential expression and array effects expression indexAlex LewinSR, Natalia Bochkina, Tim Aitman

Data Set and Biological question expression index

Biological Question

Understand the mechanisms of insulin resistance

Using animal models where key genes are knockout and comparison made between gene expression of wildtype (normal) and knockout mice

Data set A (MAS 5) ( 12000 genes on each array)

3 wildtype mice compared with 3 mice with Cd36 knocked out

Data set B (RMA) ( 22700 genes on each array)

8 wildtype mice compared with 8 knocked out mice

Condition 2 expression index

Condition 1

Start with given point

estimates of expression

Hierarchical model of replicate

Variability and array effect

Hierarchical model of replicate

Variability and array effect

Posterior distribution

(flat prior)

Differential expression


Mixture modelling for classification

Condition 1 (3 replicates) expression index

Condition 2 (3 replicates)

Exploratory analysis of array effect

Mouse data

set A

Needs ‘normalisation’

Spline curves shown

Model for Differential Expression expression index

  • Expression-level-dependent normalisation

  • Only few replicates per gene, so share information between genes to estimate variability of gene expression between the replicates

  • To select interesting genes:

    • Use posterior distribution of quantities of interest, function of, ranks ….

    • Use mixture prior on the differential expression parameter

Bayesian hierarchical model for differential expression expression index

Data: ygsr = log gene expression for gene g, replicate r

g = gene effect

δg= differential effect for gene g between 2 conditions

r(g)s = array effect (expression-level dependent)

gs2 = gene variance

  • 1st level

    yg1r  N(g – ½ δg + r(g)1 , g12),

    yg2r  N(g + ½ δg + r(g)2 , g22),

    Σrr(g)s = 0, r(g)s = function of g , parameters {a} and {b}

  • 2nd level

    Priors for g , δg, coefficients {a} and {b}

    gs2  lognormal (μs, τs)

Details of array effects (Normalization) expression index

  • Piecewise polynomial with unknown break points:

    r(g)s = quadratic in g for ars(k-1)≤ g ≤ ars(k)

    with coeff (brsk(1),brsk(2) ), k =1, … # breakpoints

    • Locations of break points not fixed

    • Must do sensitivity checks on # break points

  • Joint estimation of array effects and differential expression: In comparison to 2 step method

    • More accurate estimates of array effects

    • Lower percentage of false positive (simulation study)

Mouse Data set A expression index

For this data set, cubic fits well

3 replicate arrays (wildtype mouse data)

Model: posterior means

E(r(g)s | data) v. E(g | data)

Data: ygsr - E(g | data)

Bayesian Model Checking expression index

  • Check assumptions on gene variances, e.g. exchangeable variances, what distribution ?

  • Predict sample variance Sg2 new(a chosen checking function)from the model specification (not using the data for this)

  • Compare predicted Sg2 newwith observed Sg2 obs

    ‘Bayesian p-value’: Prob( Sg2 new > Sg2 obs )

  • Distribution of p-values approx Uniform if model is ‘true’ (Marshall and Spiegelhalter, 2003)

  • Easily implemented in MCMC algorithm

Data set A expression index

Possible Statistics for Differential Expression expression index

δg ≈ log fold change

δg* =δg / (σ2g1 /R1 + σ2g2 /R2 )½ (standardised difference)

  • We obtain the posterior distribution of all {δg} and/or {δg* }

  • Can compute directly posterior probability of genes satisfying criterion X of interest:

  • pg,X= Prob( g of “interest” | Criterion X, data)

  • Can compute the distributions of ranks

Data set A expression index3 wildtype mice compared to 3 knockout mice (U74A chip) Mas5

Criterion X

Gene is of interest if |log fold change| > log(2) and log (overall expression) > 4

Plot of log fold change versus overall expression level

Genes with

pg,X > 0.5 (green)

# 280

pg,X > 0.8 (red)

# 46

The majority of the genes

have very small pg,X :

90% of genes

have pg,X < 0.2

pg,X = 0.49

Genes with low overall expression

have a greater range of fold change

than those with higher expression

Experiment: expression index8 wildtype mice compared to 8 knockout mice RMA

Criterion X:

Gene is of interest if |log fold change| > log (1.5)

Plot of log fold change versus overall expression level

Genes with

pg,X > 0.5 (green)

# 292

pg,X > 0.8 (red)

# 139

The majority of the genes

have very small pg,X :

97% of genes

have pg,X < 0.2

Posterior probabilities and log fold change
Posterior probabilities and log fold change expression index

Data set A : 3 replicates MAS5

Data set B : 8 replicates RMA

Credibility intervals for ranks
Credibility intervals for ranks expression index

Data set B

100 genes with lowest

rank (most under/

over expressed)

Low rank,

high uncertainty

Low rank,

low uncertainty

Using the posterior distribution of expression indexδg*

(standardised difference)

  • Compute

    Probability ( | δg* | > 2 | data)

    Bayesian analogue of a t test !

  • Order genes

  • Select genes such that

Probability ( | δg* | > 2 | data) > cut-off ( in blue)

By comparison, additional genes selected by a standard

T test with p value < 5% are in red)

Part 4 expression index

  • Introduction

  • A fully Bayesian gene expression index

  • Differential expression and array effects

  • Mixture models

    • Classification for differential expression

    • Bayesian estimate of False Discovery Rates

    • CGH arrays: models including information on clones spatial location on chromosome

  • Discussion

Mixture and Bayesian estimation of false discovery rates expression indexNatalia Bochkina, Philippe Broët Alex Lewin, SR

Multiple Testing Problem expression index

  • Gene lists can be built by computing separately a criteria for each gene and ranking

  • Thousands of genes are considered simultaneously

  • How to assess the performance of such lists ?

Statistical Challenge

Select interesting genes without including too many false positives in a gene list

A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set up

Want an evaluation of the expected false discovery rate (FDR)

Bayesian estimate of fdr
Bayesian Estimate of FDR expression index

  • Step 1: Choose a gene specific parameter (e.g. δg ) or a gene statistic

  • Step 2:Model its prior (resp marginal) distribution using a mixture model

    -- with one component to model the unaffected genes (null hypothesis) e.g. point mass at 0 for δg

    -- other components to model (flexibly) the alternative

  • Step 3:Calculate the posterior probability for any gene of belonging to the unmodified component : pg0 | data

  • Step 4: Evaluate FDR (and FNR) for any list

    assuming that all the gene classification are independent

    (Broët et al 2004) :

Bayes FDR (list) | data = 1/card(list) Σg  list pg0

Mixture framework for differential expression expression index

  • To obtain a gene list, a commonly used method

    (cf Lönnstedt & Speed 2002, Newton 2003, Smyth 2003, …) is to define a mixture prior for δg :

  • H0δg= 0 point mass at 0 with probability p0

  • H1δg~ flexible 2-sided distribution to model pattern of differential expression

    Classify each gene following its posterior probabilities of not being in the null: 1- pg0

    Use Bayes rule or fix the FDR to get a cutoff

Mixture prior for differential expression
Mixture prior for differential expression expression index

  • In full Bayesian framework, introduce latent allocation variable zgto help computations

  • Joint estimation of all the mixture parameters (including p0) avoids plugging-in of values (e.g. p0) that are influential on the classification

  • Sensitivity to prior settings of the alternative distribution

  • Performance has been tested on simulated data sets

Poster by Natalia Bochkina

Performance of the mixture prior
Performance of the mixture prior expression index

yg1r = g - ½ δg+ g1r , r = 1, … R1

yg2r = g + ½ δg + g2r , r = 1, … R2

(For simplification, we assume that the data has been pre normalised)

Var(gsr ) = σ2gs ~ IG(as, bs)

δg~ p0δ0 + p1G(1.5, 1) + p2G(1.5, 2)


Dirichlet distribution for (p0, p1, p2)

Exponential hyper prior for 1 and 2

Estimation expression index

  • Estimation of all parameters combines information from biological replicates and between condition contrasts

  • s2gs = 1/RsΣr (ygsr - ygs. )2 , s = 1,2

    Within condition biological variability

  • 1/RsΣr ygsr = ygs. ,

    Average expression over replicates

  • ½(yg1.+ yg2.) Average expression over conditions

  • ½(yg1.- yg2.) Between conditions contrast

p expression index


a1, b1

1 ,2

½(yg1.- yg2.)






½(yg1.+ yg2.)


a2, b2

DAG for the mixture model

g = 1:G

Simulated data expression index

Plot of the true differences

ygr ~ N(δg , σ2g) (8 replicates)

σ2gs ~ IG(1.5, 0.05)

δg ~ (-1)Bern(0.5) G(2,2), g=1:200

δg = 0, g=201:1000

Choice of simulation parameters

inspired by estimates found in

analyses of biological data sets

Bayes expression index


FDR (black)

FNR (blue)

as a function of

1- pg0


and estimated


correspond well

Post Prob (g H1) = 1- pg0

Important feature

Comparison of mixture classification and expression index

posterior probabilities for δg*(standardised differences)

  • Post Prob (g H1)

10 = 6%



In red, 200

genes with

δg≠ 0

31 = 4%



Probability ( | δg* | > 2 | data)

Wrongly classified by mixture: expression index

truly dif. expressed,

truly not dif. expressed

Classification errors

are on the borderline:

Confusion between

size of fold change

and biological


Another simulation expression index

2628 data points

Many points added

on borderline:


errors in red

Can we improve estimation

of within condition

biological variability ?

p expression index


a1, b1

1 ,2

½(yg1.- yg2.)






½(yg1.+ yg2.)


a2, b2

DAG for the mixture model

The variance

estimates are

influenced by

the mixture


Use only partial

information from

the replicates

to estimate

2gs and feed


in the mixture ?

g = 1:G

In expression index

46 data points

with improved

classification when

‘feed back from

mixture is cut’


11 data points

with changed

but new incorrect


Mixture, full vs partial


altered for 57 points:

Work in progress

Mixture models in cgh arrays experiments
Mixture models in CGH arrays experiments expression index

  • Philippe Broët, SR

  • Curie Institute oncology department

CGH = CompetitiveGenomic Hybridization

between fluorescein- labelled normal and pathologic

samples to an array containing clones designed

to cover certain areas of the genome

Aim: study genomic alterations expression index



Tumor supressor gene


In oncology, where carcinogenesis is associated with complex chromosomic alterations, CGH array can be used for detailed analysis of genomic changes in copy number (gains or loss of genetic information) in the tumor sample.

Amplification of an oncogene or deletion of a tumor suppressor gene are considered as important mechanisms for tumorigenesis

Specificity of CGH array experiment expression index

  • A priori biological knowledge from conventional CGH :

  • Limited number of states for a sequence :

  • - presence, - deletion, - gain(s)

  • corresponding to different intensity ratios on the slide

  • Mixture model to capture the underlying discrete states

  • Clones located contiguously on chromosomes are likely to carry alterations of the same type

  • Use clone spatial location in the allocation model

  • Some CGH custom array experiments target

  • restricted areas of the genome

  • Large proportion of genomic alterations are expected

3 component mixture model with spatial allocation

presence expression index



x x x

g -1g g+1

Spatial neighbours of g

3 component mixture model with spatial allocation

ygr  N(θg , g2) , normal versus tumoral change, clone g

replicate measure r

θg  wg0N(μ0,02) + wg1N(μ1 ,12) + wg2N(μ2 ,22)

μ0: known central estimate obtained from reference clones

Introduce centred spatial autoregressive Markov random fields,

{ug0}, {ug1}, {ug2} with nearest neighbours along the chromosomes

Define mixture proportions to depend on the chromosomic location

via a logistic model: wgk = exp(ugk) / Σmexp(ugm)

favours allocation of nearby clones to same component

Work in progress

Presence ? expression index

Ref value

μ0 = - 0.11


Deletion ?

Curie Institute CGH platform

Focus on Investigating deletion areas on chromosome 1

(tumour suppressor locus)

Data on 190 clones

Classification with expression index

cut-off at p ≥ 0.8

Mixture model posterior

probability p of clone being deleted

Short arm

Summary expression index

Bayesian gene expression measure (BGX)

Good range of resolution, provides credibility intervals

Differential Expression

Expression-level-dependent normalisation

Borrow information across genes for variance estimation

Gene lists based on posterior probabilities or mixture classification

False Discovery Rate

Mixture gives good estimate of FDR and classifies

Flexibility to incorporate a priori biological features, e.g. dependence on chromosomic location

Future work

Mixture prior on BGX index, with uncertainty propagated to mixture parameters, comparison of marginal and prior mixture approaches, clustering of profiles for more general experimental set-ups

Papers and technical reports: expression index

Hein AM., Richardson S., Causton H., Ambler G. and Green P. (2004)

BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data

(to appear in Biostatistics)

Lewin A., Richardson S., Marshall C., Glazier A. and Aitman T. (2003)

Bayesian Modelling of Differential Gene Expression

(under revision for Biometrics)

Broët P., Lewin A., Richardson S., Dalmasso C. and Magdelenat H. (2004)

A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics 22, 2562-2571.

Broët, P., Richardson, S. and Radvanyi, F. (2002)

Bayesian Hierarchical Model for Identifying Changes in Gene Expression from Microarray Experiments. Journal of Computational Biology 9, 671-683.

Available at

http ://