1 / 40

Sylvia Richardson Centre for Biostatistics Imperial College, London

Statistical Analysis of Gene Expression Data. Sylvia Richardson Centre for Biostatistics Imperial College, London. In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s) Tim Aitman (Hammersmith) Peter Green (Bristol). Biological Atlas of Insulin Resistance.

zariel
Download Presentation

Sylvia Richardson Centre for Biostatistics Imperial College, London

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Statistical Analysis of Gene Expression Data Sylvia Richardson Centre for Biostatistics Imperial College, London In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s) Tim Aitman (Hammersmith) Peter Green (Bristol) Biological Atlas of Insulin Resistance www.bgx.org.uk BBSRC

  2. Statistical modelling and biology • Extracting the ‘message’ from microarray data needs statistical as well as biologicalunderstanding • Statistical modelling – in contrast to data analysis – gives a framework for formally organising assumptions about signal and noise • Our models are structured, reflecting data generation process: • Bayesian hierarchical modelling approach • Inference based on posterior distribution of quantities of interest

  3. * * * * * What are gene expression data ? • DNA Microarrays are used to measure the relative abundance of mRNA, providing information on gene expressionin a particular cell type, under specific conditions • Gene expression data (e.g. Affymetrix) results from the scanning of arrays where hybridisation between a sample and a large number of probes has taken place: • gene expression measure for each gene • The expression level of ten of thousands of probes are measured on a single microarray: • gene expression profile • Typically, gene expression profiles are obtained for several samples, in a single or related experiments: • gene expression data matrix

  4. Common characteristics of data sets in transcriptomic • High dimensional data (ten of thousands of genes) and few samples • Many sources of variability (low signal/noise ratio) • within/between array variation • gene specific variability of the probes for a gene(e.g. for Affymetrix) • condition/treatment • biological • array manufacture • imaging • technical

  5. Analysing gene expression data Gene expression data matrix • Gene expression data can be used in several types of analysis: -- Comparison of gene expression under different experimental conditions, or in different tissues -- Building a predictive model for classification or prognosis based on gene expression measurements -- Exploration of patterns in gene expression matrices Samples Genes (20000) Gene expression level

  6. Common statistical issues • Pre-processing and data reduction • account for the uncertainty of the signal? • making arrays comparable: “normalisation” • Realistic assessment of uncertainty • Multiplicity: control of “error rates” • Need to borrow information • Importance to include prior biological knowledge Illustrate how structured statistical modelling can help to tease out signal from noise and strengthen inference in the context of differential expression studies

  7. Outline • Background • Modelling uncertainty in the signal • Bayesian hierarchical models for differential expression experiments • posterior predictive checks • use of posterior distribution of parameters of interest to select genes of interest • Further structure: mixture models

  8. I – Modelling uncertainty in the signal:A fully Bayesian Gene expression index for Affymetrix Gene Chip arrays (Anne Mette Hein) Data: Affymetrix chip: - Each gene g is represented by a probe set, consisting ofa number ofprobe pairs (reporters) j Perfect match (PM) and Mismatch (MM) Aim: Formulate a model to combine PM and MM values into a new expression value for the gene -- BGX - Base the model on biological assumptions - Combine good features of Li and Wong (dChip) and RMA (Robust Multichip Analysis, Irrizarry et al) • Use a flexible Bayesian framework that will allow • to get a measure of uncertainty of the expression • to integrate further components of the experimental design

  9. Single array model: Motivation Key observations: Conclusions: • PMs and MMs both increase with spike-in concentration (MMs slower than PMs) MMs bind fraction of signal Multiplicative (and additive) error; transformation needed • Spread of PMs increase with level • Considerable variability in PM (and MM) response within a probe set Varying reliability in gene expression estimation for different genes Estimate gene expression measure from PMs and MMs on log scale • Probe effects approximately additive on log-scale

  10. fraction BGX single array model PMgj N( Sgj + Hgj ,τ2) MMgj N(ΦSgj + Hgj,τ2) Background noise, additive Gene and probe specific S and H (g:1,…,”1000s”, j=1,…,”tens”) Non-specific hybridisation: array wide distribution: j=1,…,J (20), g=1,…,G Expression measure for gene g is built from: j=1,…,J (20) log(Sgj+1)  TN(μg,σg2) Shrinkage: exchangeability log(Hgj+1)  TN(λ, η2) log(σg2)N(a, b2) “BGX” expression measure Remaining priors: “vague” “Emp. Bayes”

  11. BGX model: inferenceHein et al, Biostatistics, 2005 BGX: gene expression mg: For each gene g: obtain a distribution for signal (log scale) PM MM • Implemented in WinBugs and C++ (MCMC) • All parameters estimated jointly in full Bayesian framework • Posterior distributions of parameters (and functions) obtained The single array model can be extended to estimate signal from several biological replicates, as well as differential signal between conditions

  12. Single array model:examples of posterior distributions of BGX indices Each curve represents a gene Examples with data: o: log(PMgj-MMgj) j=1,…,J (at 0 if not defined) Mean  1SD

  13. Comparison with other expression measures 11 genes spiked in at 13 (increasing) concentrations BGX index μg increases with concentration ….. … except for gene 7 (incorrectly spiked-in??) Indication of smooth & sustained increase over a wider range of concentrations

  14. 95% credibility intervals for Bayesian gene expression index 11 spike-in genes at 13 different concentrations Each colour corresponds to a different spike-in gene Gene 7 : broken red line Note how the variability is substantially larger for low expression level

  15. II – Modelling differential expression Condition 2 Condition 1 Start with given point estimates of expression Hierarchical model of replicate variability and array effect Hierarchical model of replicate variability and array effect Posterior distribution (flat prior) Differential expression parameter Mixture modelling for classification

  16. Data Sets and Biological question Biological Question • Understand the mechanisms of insulin resistance • Using animal models where key genes are knockout A) Cd36 Knock out Data set (MAS 5) 3 wildtype (“normal”) mice compared with 3 mice with Cd36 knocked out ( 12000 genes on each array ) B) IRS2 Knock out Data set (RMA) 8 wildtype (“normal”) mice compared with 8 mice with IRS2 gene knocked out ( 22700 genes on each array)

  17. Exploratory analysis showing array effect Mouse data set A Condition 1 (3 replicates) Needs ‘normalisation’ Spline curves shown Condition 2 (3 replicates)

  18. Bayesian hierarchical model for differential expression (Lewin et al, Biometrics, 2005) Data: ygcr = log gene expression gene g, replicate r, condition c g = gene effect dg= differential effect for gene g between 2 conditions r(g)c = array effect – modelled as a smooth (spline) function of g gc2 = gene specific variance • 1st level yg1r  N(g – ½ dg + r(g)1 , g12) yg2r  N(g + ½ dg + r(g)2 , g22) Σrr(g)c = 0, r(g)c = function of g, parameters {c,d} • 2nd level “Flat” priors for g , dg, {c,d} gc2  lognormal (ac, bc) Exchangeable variances

  19. Directed Acyclic Graph for the differential expression model (no array effect represented) a1, b1 ½(yg1.- yg2.) dg 2g1 s2g1 2g2 s2g2 ½(yg1.+ yg2.) g a2, b2

  20. Differential expression model Joint modelling of array effects and differential expression: • Performs normalisation simultaneously with estimation • Gives fewer false positives How to check some of the modelling assumptions? Posterior predictive checks How to use the posterior distribution of dgto select genes of interest ? Decision rules

  21. Bayesian Model Checking • Check assumptions on gene variances, e.g. exchangeable variances, what distribution ? • Predict sample variance sg2 new(a chosen checking function)from the model specification (not using the data for this) • Compare predicted sg2 newwith observed sg2 obs ‘Bayesian p-value’: Prob( sg2 new > sg2 obs ) • Distribution of p-values approx Uniform if model is ‘true’ (Marshall and Spiegelhalter, 2003) • Easily implemented in MCMC algorithm

  22. 2g1 new Bayesian model checking a1, b1 s2g1 new ½(yg1.- yg2.) dg obs 2g1 s2g1 2g2 s2g2 ½(yg1.+ yg2.) g a2, b2

  23. Mouse Data set A

  24. Use of tail probabilities for selecting gene lists dg : log fold change tg=dg / (σ2g1 /n1 + σ2g2 /n2 )½standardised difference (n1 and n2 # replicates in each condition) -- Obtain the posterior distribution of dg and/or tg -- Compute directly posterior probability of genes satisfying criterion X of interest, e.g. dg > threshold or tg> percentile pg,X= Prob( g of “interest” | Criterion X, data) -- Compute the distributions of ranks, …. Interesting statistical issues on relative merits and properties of different selection rules based on tail probabilities

  25. Using the posterior distribution of tg (standardised difference) (Natalia Bochkina) • Compute Probability ( | tg| > 2 | data) Bayesian T test • Order genes • Select genes such that Data set B Probability ( | tg| > 2 | data) > cut-off ( in blue) By comparison, additional genes selected by a standard T test with p value < 5% are in red)

  26. Credibility intervals for ranks 100 genes with lowest rank (most under/ over expressed) Low rank, high uncertainty Low rank, low uncertainty

  27. III – Mixture and Bayesian estimation of False Discovery Rates (FDR) • Mixture models can be used to perform a model based classification • Mixture models can be considered at the level of the data (e.g. clustering time profiles) or for the underlying parameters • Mixture models can be used to detect differentially expressed genes if a model of the alternative is specified • One benefit is that an estimate of the uncertainty of the classification: the False Discovery Rate is simultaneously obtained

  28. Mixture framework for differential expression yg1r = g - ½ dg+ g1r , r = 1, … R1 yg2r = g + ½ dg + g2r , r = 1, … R2 (We assume that the data has been pre normalised) Var(gcr ) = σ2gc ~ IG(ac, bc) dg~ p0δ0 + p1G(-x|1.5, 1) + p2G(x|1.5, 2) H0H1 Dirichlet distribution for (p0, p1, p2) Exp(1) hyper prior for 1 and 2 Explicit modelling of the alternative

  29. Mixture for classification of DE genes • Calculate the posterior probability for any gene of belonging to the unmodified component : pg0 | data • Classify using a cut-off on pg0 : i.e. declare gene is DE if 1- pg0 > pcut Bayes rule corresponds to pcut = 0.5 • Bayesian estimate of FDR (and FNR) for any list (Newton et al 2003, Broët et al 2004) : Bayes FDR (list) | data = 1/card(list) Σg  list pg0

  30. Performance of the mixture prior • Joint estimation of all the mixture parameters (including p0) using MCMC algorithms avoids plugging-in of values that are influential on the classification • Estimation of all parameters combines information from biological replicates and between condition contrasts • Performance has been tested on simulated data sets

  31. π0 = 0.9, 250 DE π0 = 0.8, 500 DE Plot of true difference in each case π0 = 0.95, 125 DE π0 = 0.99, 25 DE π0 = 0.80, 500 DE

  32. Examples of simulated data for each case

  33. ^ ^ Av. π0 = 0.80 Av. π0 = 0.90 ^ ^ Av. π0 = 0.78 Av. π0 = 0.95 Results averaged over 50 replications Good estimates of p0 = Prob(null) for each case ^ Av. π0 = 0.99

  34. Comparison of estimated (dotted lines) and observed (full) FDR (black) and FNR (red) rates as cut-off for declaring DEis varied • Bayesian mixture: • good estimates of • FDR and FNR • easy way to • choose efficient • classification rule

  35. In summary Integrated gene expression analysis • Uses the natural hierarchical structure of the data: e.g. probes within genes within replicate arrays within condition to synthesize, borrow information and provide realistic quantification of uncertainty • Posterior distributions can be exploited for inference with few replicates: choice of decision rules • Framework where biological prior information, e.g. on the structure of the probes or on chromosomic location, can be incorporated • Model based classification, e.g. through mixtures, provides interpretable output and a structure to deal with multiplicity General framework for investigating other questions

  36. Many interesting questions in the analysis of gene expression data -- Comparison of gene expression under different experimental conditions, or in different tissues -- Integrated gene expression analysis -- Building a predictive model for classification or prognosis based on gene expression measurements,finding “signatures” -- Investigate high dimensional classification rules (prediction with large number of variables) and “large p small n” regression problems (shrinkage or variable selection)

  37. Association of gene expression with prognosis Expression plot of 115 prognostic genes comprising The Ovarian Cancer Prognostic Profile Investigate properties of high dimensional classification rules (prediction with large number of variables) and “large p small n” regression problems (shrinkage or variable selection)

  38. Other questions …. -- Comparison of gene expression under different experimental conditions, or in different tissues -- Building a predictive model for classification or prognosis based on gene expression measurements, finding “signatures -- Integrated gene expression analysis -- Investigate high dimensional classification rules (prediction with large number of variables) and “large p small n” regression problems (shrinkage or variable selection) -- Perform unsupervised model based clustering -- Estimate graphical models -- Exploration of patterns and association networks in gene expression matrices

  39. Exploration of patterns in gene expression matrices samples genes -- Comparison of gene expression under different experimental conditions, or in different tissues -- Classification of gene expression profiles and association of gene expression with other factors, e.g. prognosis (prediction problem) Development of central nervous systems in rats (9 time points) Perform unsupervised model based clustering (e.g. semi-parametric using basis functions, mixtures or DP processes)

  40. Thanks BBSRC Exploiting Genomics grant Colleagues Natalia Bochkina, Anne Mette Hein, Alex Lewin (Imperial College) Peter Green (Bristol University) Philippe Broët (INSERM, Paris) Papers and technical reports: www.bgx.org.uk/

More Related