Sylvia Richardson Centre for Biostatistics Imperial College, London

Bayesian hierarchical modelling of genomic data Sylvia Richardson Centre for Biostatistics Imperial College, London In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s) Helen Causton and Tim Aitman (Hammersmith) Peter Green (Bristol) Philippe Broët (INSERM, Paris) BBSRC Exploiting Genomics grant

Outline • Introduction • A fully Bayesian gene expression index (BGX) • Differential expression and array effects • Mixture models • Discussion

Part 1 Introduction • Recent developments in genomics have led to techniques • Capable of interrogating the genome at different levels • Aiming to capture one or several stages of the biological process DNA mRNA protein phenotype

Fundamental process Protein-encoding genes are transcribed into mRNA (messenger), and the mRNA is translated to make proteins DNA -> mRNA -> protein Pictures from http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html

What are gene expression data ? • DNA Microarrays are used to measure the relative abundance of mRNA, providing information on gene expression in a particular cell, under particular conditions • The fundamental principle used to measure the expression is that of hybridisation between a sample and probes: • Known sequences of single-stranded DNA representing genes are immobilised on microarray • Tissue sample (with unknown concentration of RNA) fluorescently labelled • Sample hybridised to array • Array scanned to measure amount of RNA present for each sequence • The expression level of ten of thousands of probes are measured on a single microarray ! gene expression measure gene expression profile

Variation and uncertainty Gene expression data (e.g. Affymetrix) is the result of multiple sources of variability • condition/treatment • biological • array manufacture • imaging • technical • gene specific variability of the probes for a gene • within/between array variation Structured statistical modelling allows considering all uncertainty at once

Example of within vs between strains gene variability • 7 cross-bred strains of mice that differ only by a small portion of chromosome 1 • Strains have different phenotypes related to immunological disorders • For each line, 9 animals used to obtain 3 pooled RNA extracts from spleen 7 x 3 samples Excellent experimental design to minimise “biological variability between replicate animals” Aim: to tease out differences between expression profiles of the 7 lines of mice and relate these to locations on chromosome 1

Average (over the 7 groups) of within strain variance calculated from the 3 pooled samples Total variance calculated over the 21 samples Biological variability is large ! Ratio within/total

1000 genes most variable between strains: hierarchical clustering recovers the cross-bred lines structure Random set of 1000 genes

Common characteristics of genomics data sets • High dimensional data (ten of thousands of genes) and few samples • Many sources of variability (low signal/noise ratio) Common issues • Pre-processing and data reduction • Multiple testing • Need to borrow information • Importance to include prior biological knowledge

Part 2 • Introduction • A fully Bayesian gene expression index (BGX) • Single array model • Multiple array model • Differential expression and array effects • Mixture models • Discussion

PM MM PM MM PM MM PM MM A fully Bayesian Gene eXpression index for Affymetrix GeneChip arraysAnne Mette HeinSR, Helen Causton, Graeme Ambler, Peter Green Raw intensities Background correction Gene specific variability (probe) Gene index BGX

* * * * * Zoom Image of Hybridised Array Hybridised Spot Each gene g represented by probe set: (J:11-20) Perfect match: PMg1,…, PMgJ Mis-match: MMg1,…, MMgJ Expressed PM Non-expressed PM Image of Hybridised Array expression measure for gene g Slide courtesy of Affymetrix Affymetrix GeneChips:

Commonly used methods for estimation expression levels from GeneChips • MAS5: • uses PM and MMs. Imputes IMs from MMs to obtain all PM-MMs positive • gene expression measure : estimate obtained by applying Tukey Biweight to the set of log(PM-MM) values in the probe set • RMA: • uses PMs only. • Fits an model with additive gene and probe effects to log-scale background corrected PMs using median polish Characteristics: positive, robust, noisy at low levels Characteristics: positive, robust, attenuated signal detection

Mean (left) and Empirical standard deviation (right) over 7 conditions (arrays) for 45000 genes estimated by 2 different methods for quantifying gene expression Variability across conditions is conditioned by the choice of summary measure ! Beware of filtering mean

Model assumptions and key biological parameters • The intensity for the PM measurement for probe (reporter) j and gene g is due to binding of labelled fragments that perfectly match the oligos in the spot The true Signal Sgj of labelled fragments that do not perfectly match these oligos The non-specific hybridisation Hgj • The intensity of the corresponding MM measurement is caused by a binding fraction Φ of the true signal Sgj by non-specific hybridisation Hgj

PMgj N( Sgj + Hgj , τ2) MMgj N(Φ Sgj + Hgj,τ2) Background noise, additive fraction Non-specific hybridisation signal log(Sgj+1)  TN (μg ,ξg2) log(Hgj+1)  TN(λ, η2) j=1,…,J Gene specific error terms: exchangeable Array-wide distribution Gene expression index (BGX): qg=median(TN (μg ,ξg2)) “Pools” information over probes j=1,…,J log(ξg2)N(a, b2) Priors: “vague”t2 ~ G(10-3, 10-3) F~ B(1,1), mg ~ U(0,15) h2 ~ G(10-3, 10-3),l ~ N(0,103) “Empirical Bayes” BGX single array model:g=1,…,G (thousands), j=1,…,J (11-20)

Inference: mean 2.5-97.5% credibility interval For each gene g: Log-scale true signals: Log-scale non-spec. hybr: BGX: gene expr: log(Sgj+1): j=1,…,J log(Hgj+1): j=1,…,J qg: 1.75 2 2.25 1 2 3 2 3 4 NB! A distribution • Implemented in WinBugs and C • allows: - Joint estimation of parameters in full • Bayesian framework • obtain: - posterior distributions of parameters • (and functions of these) in model:

Computational issues • We found mixing slow for gene specific parameters (μg ,ξg2) and large autocorrelation • For low signal (bottom 25%) more variability of Sgj and Hgj , and less separation So less information on (μg ,ξg2) and longer runs are needed • For the full hierarchical model, the convergence of the hyperparameters for the distribution of ξg2 was problematic • We studied sensitivity to a range of plausible values for those and implemented an “empirical Bayes” version of the model which was reproducible with sensible run length

Posterior mean of qgusing a run of 30 000 versus those obtained from runs of 5 000, 10 000 and 20 000 sweeps Reproducibility is obtained with short runs for large expression values Longer runs are necessary for low expression values

Single array model performance:Data set : varying concentrations (geneLogic): • 14 samples of cRNA from acute myeloid leukemia (AML) tumor cell line • In sample k: each of 11 genes spiked in at concentration ck: sample k: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 conc. ck(pM): 0.0 0.5 0.75 1.0 1.5 2.0 3.0 5.0 12.5 25 50 75 100 150 • Each sample hybridised to an array Consider subset consisting of 500 normal genes + 11 spike-ins

Single array model:examples of posterior distributions of BGX expression indices Each curve (truncated normal with median param.) represents a gene Examples with data: o: log(PMgj-MMgj) j=1,…,Jg (at 0 if not defined) Mean +- 1SD

Comparison with other expression measures Single array model performance:11 genes spiked in at 13 (increasing) concentrations BGX index qgincreases with concentration ….. … except for gene 7 (spiked-in??) Indication of smooth & sustained increase over a wider range of concentrations

2.5 – 97.5 % credibility intervals for the Bayesian expression index 11 spike-in genes at 13 different concentration (data set A) Note how the variability is substantially larger for low expression level Each colour corresponds to a different spike-in gene Gene 7 : broken red line

Integrated modelling of Affymetrix data Condition 2 Condition 1 PM MM PM MM PM MM PM MM PM MM PM MM PM MM PM MM Gene specific variability (probe) Gene index BGX Gene specific variability (probe) Gene index BGX Hierarchical model of replicate (biological) variability and array effect Hierarchical model of replicate (biological) variability and array effect Distribution of expression index for gene g , condition 1 Distribution of expression index for gene g , condition 2 Distribution of differential expression parameter

Background noise, additive Array specific log(Sgjcr+1)  TN (μgc,ξgc2) log(Hgjcr+1)  TN(λcr,ηcr2) Array-specific distribution of non-specific hybridisation BGX Multiplearray model: conditions: c=1,…,C, replicates: r = 1,…,Rc PMgjcr N( Sgjcr+ Hgjcr, τcr2) MMgjcr N(ΦSgjcr+ Hgjcr, τcr2) Gene and condition specific BGX qgc=median(TN(μgc, ξgc2)) “Pools” information over replicate probe sets j = 1,…J, r = 1,…,Rc

Subset of AffyU133A spike-in data set(AffyComp) • Consider: • Six arrays, 1154 genes (every 20th and 42 spike-ins) • Same cRNA hybridised to all arrays EXCEPT for spike-ins: `1` `2` `3` … `12` `13` `14` Spike-in genes: 1-3 4-6 7-9 … 34-36 37-39 40-42 Spike-in conc (pM): Condition 1 (array 1-3): 0.0 0.25 0.50 … 128 256 512 Condition 2 (array 4-6): 0.25 0.50 1.00 … 256 512 0.00 Fold change: - 2 2 … 2 2 -

BGX: measure of uncertainty providedPosterior mean +- 1SD credibility intervals diffg=bgxg,1- bgxg,2 } Spike in 1113 -1154 above the blue line Blue stars show RMA measure

Part 3 • Introduction • A fully Bayesian gene expression index (BGX) • Differential expression and array effects • Non linear array effects • Model checking • Mixture models • Discussion

Differential expression and array effectsAlex LewinSR, Natalia Bochkina, Tim Aitman

Data Set and Biological question Biological Question Understand the mechanisms of insulin resistance Using animal models where key genes are knockout and comparison made between gene expression of wildtype (normal) and knockout mice Data set A (MAS 5) ( 12000 genes on each array) 3 wildtype mice compared with 3 mice with Cd36 knocked out Data set B (RMA) ( 22700 genes on each array) 8 wildtype mice compared with 8 knocked out mice

Condition 2 Condition 1 Start with given point estimates of expression Hierarchical model of replicate Variability and array effect Hierarchical model of replicate Variability and array effect Posterior distribution (flat prior) Differential expression parameter Mixture modelling for classification

Condition 1 (3 replicates) Condition 2 (3 replicates) Exploratory analysis of array effect Mouse data set A Needs ‘normalisation’ Spline curves shown

Model for Differential Expression • Expression-level-dependent normalisation • Only few replicates per gene, so share information between genes to estimate variability of gene expression between the replicates • To select interesting genes: • Use posterior distribution of quantities of interest, function of, ranks …. • Use mixture prior on the differential expression parameter

Bayesian hierarchical model for differential expression Data: ygsr = log gene expression for gene g, replicate r g = gene effect δg= differential effect for gene g between 2 conditions r(g)s = array effect (expression-level dependent) gs2 = gene variance • 1st level yg1r  N(g – ½ δg + r(g)1 , g12), yg2r  N(g + ½ δg + r(g)2 , g22), Σrr(g)s = 0, r(g)s = function of g , parameters {a} and {b} • 2nd level Priors for g , δg, coefficients {a} and {b} gs2  lognormal (μs, τs)

Details of array effects (Normalization) • Piecewise polynomial with unknown break points: r(g)s = quadratic in g for ars(k-1)≤ g ≤ ars(k) with coeff (brsk(1),brsk(2) ), k =1, … # breakpoints • Locations of break points not fixed • Must do sensitivity checks on # break points • Joint estimation of array effects and differential expression: In comparison to 2 step method • More accurate estimates of array effects • Lower percentage of false positive (simulation study)

Mouse Data set A For this data set, cubic fits well 3 replicate arrays (wildtype mouse data) Model: posterior means E(r(g)s | data) v. E(g | data) Data: ygsr - E(g | data)

Bayesian Model Checking • Check assumptions on gene variances, e.g. exchangeable variances, what distribution ? • Predict sample variance Sg2 new(a chosen checking function)from the model specification (not using the data for this) • Compare predicted Sg2 newwith observed Sg2 obs ‘Bayesian p-value’: Prob( Sg2 new > Sg2 obs ) • Distribution of p-values approx Uniform if model is ‘true’ (Marshall and Spiegelhalter, 2003) • Easily implemented in MCMC algorithm

Data set A

Possible Statistics for Differential Expression δg ≈ log fold change δg* =δg / (σ2g1 /R1 + σ2g2 /R2 )½ (standardised difference) • We obtain the posterior distribution of all {δg} and/or {δg* } • Can compute directly posterior probability of genes satisfying criterion X of interest: • pg,X= Prob( g of “interest” | Criterion X, data) • Can compute the distributions of ranks

Data set A 3 wildtype mice compared to 3 knockout mice (U74A chip) Mas5 Criterion X Gene is of interest if |log fold change| > log(2) and log (overall expression) > 4 Plot of log fold change versus overall expression level Genes with pg,X > 0.5 (green) # 280 pg,X > 0.8 (red) # 46 The majority of the genes have very small pg,X : 90% of genes have pg,X < 0.2 pg,X = 0.49 Genes with low overall expression have a greater range of fold change than those with higher expression

Experiment: 8 wildtype mice compared to 8 knockout mice RMA Criterion X: Gene is of interest if |log fold change| > log (1.5) Plot of log fold change versus overall expression level Genes with pg,X > 0.5 (green) # 292 pg,X > 0.8 (red) # 139 The majority of the genes have very small pg,X : 97% of genes have pg,X < 0.2

Posterior probabilities and log fold change Data set A : 3 replicates MAS5 Data set B : 8 replicates RMA

Credibility intervals for ranks Data set B 100 genes with lowest rank (most under/ over expressed) Low rank, high uncertainty Low rank, low uncertainty

Using the posterior distribution of δg* (standardised difference) • Compute Probability ( | δg* | > 2 | data) Bayesian analogue of a t test ! • Order genes • Select genes such that Probability ( | δg* | > 2 | data) > cut-off ( in blue) By comparison, additional genes selected by a standard T test with p value < 5% are in red)

Part 4 • Introduction • A fully Bayesian gene expression index • Differential expression and array effects • Mixture models • Classification for differential expression • Bayesian estimate of False Discovery Rates • CGH arrays: models including information on clones spatial location on chromosome • Discussion

Mixture and Bayesian estimation of false discovery ratesNatalia Bochkina, Philippe Broët Alex Lewin, SR

Multiple Testing Problem • Gene lists can be built by computing separately a criteria for each gene and ranking • Thousands of genes are considered simultaneously • How to assess the performance of such lists ? Statistical Challenge Select interesting genes without including too many false positives in a gene list A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set up Want an evaluation of the expected false discovery rate (FDR)

Bayesian Estimate of FDR • Step 1: Choose a gene specific parameter (e.g. δg ) or a gene statistic • Step 2:Model its prior (resp marginal) distribution using a mixture model -- with one component to model the unaffected genes (null hypothesis) e.g. point mass at 0 for δg -- other components to model (flexibly) the alternative • Step 3:Calculate the posterior probability for any gene of belonging to the unmodified component : pg0 | data • Step 4: Evaluate FDR (and FNR) for any list assuming that all the gene classification are independent (Broët et al 2004) : Bayes FDR (list) | data = 1/card(list) Σg  list pg0

Mixture framework for differential expression • To obtain a gene list, a commonly used method (cf Lönnstedt & Speed 2002, Newton 2003, Smyth 2003, …) is to define a mixture prior for δg : • H0δg= 0 point mass at 0 with probability p0 • H1δg~ flexible 2-sided distribution to model pattern of differential expression Classify each gene following its posterior probabilities of not being in the null: 1- pg0 Use Bayes rule or fix the FDR to get a cutoff

Sylvia Richardson Centre for Biostatistics Imperial College, London

Sylvia Richardson Centre for Biostatistics Imperial College, London

Presentation Transcript

Nicky Best, Chris Jackson, Sylvia Richardson Department of Epidemiology and Public Health Imperial College, London

Imperial College London

Imperial College London

Welcome to Imperial College London

Imperial College London

Imperial College London

Ajit Kurup, Imperial College London.

Ajit Kurup, Imperial College London.

Jon Murray Imperial College London

Keith Smith, Imperial College, London.

Imperial College London

Alex Lewin (Imperial College Centre for Biostatistics) Ian Grieve ( IC Microarray Centre)

Sylvia Richardson Centre for Biostatistics Imperial College, London

Ajit Kurup, Imperial College London / Fermilab Ken Long, Imperial College London

Imperial College London

Sylvia Richardson, with Alex Lewin Department of Epidemiology and Public Health, Imperial College

Heather Fry Centre for Educational Development Imperial College London h.fry@imperial.ac.uk

Alex Lewin (Imperial College, Dept of Epidemiology) Sylvia Richardson ( IC Epidemiology)

Alex Lewin (Imperial College) Sylvia Richardson ( IC Epidemiology)

Imperial College London