Presentation By Lara DePadilla

A Presentation of ‘Bayesian Models for Gene Expression With DNA Microarray Data’ by Ibrahim, Chen, and Gray

Goal • To “develop a novel class of parametric statistical models for analyzing DNA microarray data”. • Parametric statistical models require making assumptions about the data, such as assuming that it follows a known probability law described by a small number of parameters.

The Goal Applied • The researchers are trying to discover which genes play a major role in endometrial cancer. This knowledge can help determine whether the disease is inherited and help target applicable therapies.

Motivation • Determine which genes best discriminate between different types of tissue. Why? Because of the sheer number of genes in the human genome, we must identify which ones are relevant to our purpose. • Characterize gene expression patterns in tumor tissues. Why? We must develop models that explain the patterns in order to recognize them.

About Bayesian Models (Liu, p. 306) • The full process has three main steps. 1. Set up a probability model to describe the data. This is a joint distribution that makes use of our prior knowledge of the subject: • Joint = Prior * Likelihood • f(y, θ) = f(θ) * f(y|θ) • It must capture the elements of the scientific problem.

About Bayesian Models (Cont.) 2. The next step invokes Bayes' rule to obtain the posterior distribution: f(θ|y) = f(θ) * f(y|θ) / f(y). Now we know what we are looking for. 3. The final step is to evaluate and improve upon what we have done.
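
Step 2 can be illustrated with a small grid approximation. This is only a sketch: the grid of θ values, the flat prior, and the counts (3 "not expressed" readings out of 10) are hypothetical and do not come from the paper.

```python
def grid_posterior(prior, likelihood):
    """Return f(theta | y) on a grid of theta values via Bayes' rule."""
    joint = [pr * lk for pr, lk in zip(prior, likelihood)]  # f(theta) * f(y | theta)
    evidence = sum(joint)                                    # f(y), summed over the grid
    return [j / evidence for j in joint]

# Hypothetical setup: theta is the probability that a gene is not expressed,
# and the data are 3 "not expressed" readings out of 10.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.2] * len(thetas)                          # flat prior over the grid
likelihood = [t**3 * (1 - t)**7 for t in thetas]     # binomial kernel
posterior = grid_posterior(prior, likelihood)
best = thetas[posterior.index(max(posterior))]
print(best)  # 0.3
```

Dividing by the sum of the joint values plays the role of f(y), so the posterior values sum to 1.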

Back to Our Goal Applied: Data Structure of Observations • The array contains more than 7,000 probe sets, which are thought to represent 5,600 genes. • Each probe set consists of 16 to 20 perfect match and mismatch pairs. • A perfect match is a strand of DNA that complements a specific DNA sequence. • A mismatch has a single base mismatch position (one piece out of approx. 25 doesn’t match). • Using pairs from the same gene from different probes will be more specific than is possible with a single probe.

Back to Our Goal Applied More Data Preparation • The probes are compared and normalized, resulting in a dataset of expression levels that have atypical results filtered out. • After the filtering process, the data set was 14 x 3214, with 14 samples (10 cancerous, 4 normal) and 3214 genes.

The Model Setup: Data • j = 1, 2 indexes the tissue type • i = 1, …, n_j indexes the individuals, with n_j individuals available for tissue type j • g = 1, …, G indexes the genes; G genes are measured for each individual • x_jig denotes the random variable for the measurement of gene g in individual i of tissue type j ⇒ c_0 is the threshold value at which a gene is considered not expressed (and therefore not what we are seeking), so if x_jig = c_0 the gene is not expressed: x_jig = c_0 with probability p_jg, and x_jig = c_0 + y_jig with probability 1 – p_jg • y_jig denotes the expression level
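
The two-part model for x can be simulated for a single gene in a single tissue type, as a rough sketch. The slide does not state the distribution of y, so the lognormal draw here is an assumption, and the values of p, c0, mu, and sigma are hypothetical.

```python
import random

def simulate_expression(p, c0, mu, sigma, n, rng):
    """Draw n values of x: c0 with probability p, otherwise c0 + y."""
    values = []
    for _ in range(n):
        if rng.random() < p:
            values.append(c0)                    # gene not expressed
        else:
            y = rng.lognormvariate(mu, sigma)    # assumed distribution for y (y > 0)
            values.append(c0 + y)                # gene expressed above the threshold
    return values

rng = random.Random(0)                           # fixed seed for reproducibility
x = simulate_expression(p=0.3, c0=20.0, mu=4.0, sigma=0.5, n=1000, rng=rng)
share_at_threshold = sum(v == 20.0 for v in x) / len(x)
print(share_at_threshold)                        # should be close to p = 0.3
```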

The Model Setup: Likelihood • Let δ_jig = 1 if x_jig = c_0 (not expressed) and 0 otherwise (expressed). • Remember the expressed/not expressed probability from before; there is one probability for each gene within each tissue type: p_jg = P(x_jig = c_0) = P(δ_jig = 1) and 1 – p_jg = P(x_jig = c_0 + y_jig) = P(δ_jig = 0). • Based on whether the gene had the qualified expression level, we have δ = (δ_111, …, δ_{2,n_2,G}), meaning one indicator for each gene, for each individual, for each tissue type.

The Model Setup: Likelihood • The mean expression level of each gene for both tissue types: μ = (μ_11, …, μ_2G) • The variance of the expression level of each gene for both tissue types: σ² = (σ²_11, …, σ²_2G) • The probability of each gene not being expressed for both tissue types: p = (p_11, …, p_2G)

The Model Setup: Likelihood • θ = (μ, σ², p) collects the parameters, and L(θ|D) is the likelihood function based on the data D = (x_111, …, x_{2,n_2,G}) • L(θ|D) = ∏_{j=1}^{2} ∏_{i=1}^{n_j} ∏_{g=1}^{G} p_jg^{δ_jig} (1 – p_jg)^{1 – δ_jig} p(y_jig | μ_jg, σ²_jg)^{1 – δ_jig} Interpreted: each data point contributes p_jg if the gene is not expressed, or (1 – p_jg) times the density of its expression level if it is expressed, and the product over all data points gives the overall likelihood.
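
The triple product is easier to evaluate on the log scale. The sketch below assumes a lognormal density for p(y | μ, σ²), which the slide does not specify, and uses a hypothetical nested-list layout x[j][i][g] for the data.

```python
import math

def lognormal_logpdf(y, mu, sigma2):
    """Log density of a lognormal(mu, sigma2) at y > 0 (assumed form of p(y | mu, sigma2))."""
    return (-math.log(y) - 0.5 * math.log(2 * math.pi * sigma2)
            - (math.log(y) - mu) ** 2 / (2 * sigma2))

def log_likelihood(x, c0, p, mu, sigma2):
    """Log of L(theta | D) for data x[j][i][g]; p, mu, sigma2 are indexed [j][g]."""
    total = 0.0
    for j, tissue in enumerate(x):
        for individual in tissue:
            for g, value in enumerate(individual):
                if value == c0:                      # delta = 1: not expressed
                    total += math.log(p[j][g])
                else:                                # delta = 0: expressed, y = x - c0
                    y = value - c0
                    total += math.log(1 - p[j][g])
                    total += lognormal_logpdf(y, mu[j][g], sigma2[j][g])
    return total

# Hypothetical toy data: 2 tissue types, 1 individual each, 1 gene.
x = [[[5.0]], [[6.0]]]
ll = log_likelihood(x, c0=5.0, p=[[0.5], [0.5]], mu=[[0.0], [0.0]], sigma2=[[1.0], [1.0]])
```

Each branch of the `if` corresponds to one of the two exponents δ and 1 – δ in the product.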

The Posterior: Which Genes Discriminate? • The posterior quantity of interest is a ratio between the average expression level for a particular gene across subjects in cancerous tissue and the same gene across subjects in non-cancerous tissue. • The value of each element comprising the mean is based on whether or not the gene for that individual and that tissue type meets the necessary expression level to count.

The Posterior: The Function • Ψ_jg is the expected value under the joint distribution of (δ, y) for gene g in tissue type j, with the individual subjects in the data supplying the elements that make up the expected value. The distribution answers two questions: is the expression level high enough to count, and if so, what is the level?
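
If Ψ_jg is read as the mean of x_jig under the two-part model from the data setup (x = c_0 with probability p, and c_0 + y with probability 1 – p), the expectation can be worked out directly. This reading is an interpretation of the slide, not a formula quoted from the paper.

```latex
\begin{align*}
E[x_{jig}] &= p_{jg}\, c_0 + (1 - p_{jg})\,\bigl(c_0 + E[y_{jig}]\bigr) \\
           &= c_0 + (1 - p_{jg})\, E[y_{jig}]
\end{align*}
```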

The Posterior: The Function • ε_g = Ψ_2g / Ψ_1g • This is the ratio of the expression means between normal and cancer tissues for gene g, so there will be one distribution for each of the G genes. • A key summary to compute is P(ε_g > 1 | D), which is the probability, given the data (the individuals in the study), that the ratio will exceed 1.
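
Given posterior draws of the two means for one gene, the summary P(ε_g > 1 | D) can be estimated as the share of draws in which the ratio exceeds 1. The draws below are made-up numbers for illustration only.

```python
def prob_ratio_exceeds_one(psi1_draws, psi2_draws):
    """Estimate P(eps_g > 1 | D) as the share of draws with psi2 / psi1 > 1."""
    ratios = [p2 / p1 for p1, p2 in zip(psi1_draws, psi2_draws)]
    return sum(r > 1 for r in ratios) / len(ratios)

# Hypothetical posterior draws of Psi_1g and Psi_2g for a single gene.
psi1 = [100.0, 120.0, 90.0, 110.0]
psi2 = [150.0, 115.0, 130.0, 140.0]
prob = prob_ratio_exceeds_one(psi1, psi2)
print(prob)  # 0.75
```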

Priors • The purpose of the prior in this situation is to create a correlation between the genes for a given individual. • The priors are hierarchical: there are different priors for different parameters, and some parameters of interest are incorporated into other priors. In some cases, the values are based on information from the data.

Gene Selection: Applying the Posterior • Compute the posterior probability P(ε_g > 1 | D) for g = 1, …, G. • Compare these probabilities to a threshold γ. This threshold might be 0.9, 0.8, 0.7, etc. • Once the threshold has established which genes are different enough between tissues, develop a sub-model of the genes that describes which are different and which are not. • Different levels of γ will create different sub-models.
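
The selection step can be sketched as a simple filter. Flagging a gene when its probability is at least γ follows the slide; also flagging it when the probability is at most 1 – γ (ratio very likely below 1) is an assumption added here, and the probabilities are hypothetical.

```python
def select_genes(probs, gamma):
    """Return indices of genes declared different at threshold gamma.

    A gene is flagged if P(eps_g > 1 | D) >= gamma, or (assumed two-sided
    rule) if that probability is <= 1 - gamma.
    """
    return [g for g, pr in enumerate(probs) if pr >= gamma or pr <= 1 - gamma]

# Hypothetical posterior probabilities for five genes.
probs = [0.95, 0.50, 0.85, 0.02, 0.75]
for gamma in (0.9, 0.8, 0.7):
    print(gamma, select_genes(probs, gamma))
# 0.9 [0, 3]
# 0.8 [0, 2, 3]
# 0.7 [0, 2, 3, 4]
```

Lowering γ admits more genes, which is why each threshold defines a different sub-model.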

Back to Bayes: Step 3 • Step 3 was to evaluate our process. In this case, we use the L measure to evaluate the sub-models. • The model with the smallest L measure is the best-fitting model. • It assesses goodness of fit based on: ⇒ how well the model predictions compare to the observed data ⇒ the variability of the predictions
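
The two ingredients of the L measure can be combined as in the sketch below, which adds the variance of the predictions to a weighted squared distance between predictions and observations. The weight nu and this exact form are assumptions; the slide does not give the definition used in the paper, and the data are made up.

```python
def l_measure(observed, predictive_draws, nu=0.5):
    """Sum over data points of predictive variance plus nu * squared prediction error.

    predictive_draws[k] holds hypothetical predictive draws for observed[k].
    """
    total = 0.0
    for x_obs, draws in zip(observed, predictive_draws):
        mean = sum(draws) / len(draws)
        var = sum((z - mean) ** 2 for z in draws) / len(draws)   # variability of predictions
        total += var + nu * (mean - x_obs) ** 2                  # closeness to observed data
    return total

observed = [1.0, 2.0]
model_a = [[0.0, 2.0], [2.0, 2.0]]   # unbiased but variable predictions
model_b = [[3.0, 3.0], [2.0, 2.0]]   # precise but biased predictions
print(l_measure(observed, model_a))  # 1.0
print(l_measure(observed, model_b))  # 2.0
```

Under these assumptions model_a would be preferred, since it has the smaller value.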

Sampling From the Posterior: Gibbs Sampler • Generating the mean expression levels for each type of tissue for each gene requires the parameters μ and σ². • The Gibbs sampler makes use of conditional distributions; in our case these stem from the priors. • The algorithm will ultimately yield μ, σ², b0, μ0, e, and u0 for each tissue type. All but μ and σ² are integrated out, and the resulting μ and σ² can be passed into the posterior equation.
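
The idea of alternating draws from conditional distributions can be shown with a toy Gibbs sampler for the mean and variance of normal data. This is not the paper's sampler: the normal prior on the mean, the inverse-gamma prior on the variance, and all hyperparameter names and values below are hypothetical.

```python
import random

def gibbs_normal(data, n_iter, rng, prior_mean=0.0, prior_var=100.0,
                 shape0=2.0, rate0=2.0):
    """Toy Gibbs sampler for (mu, sigma2) of normally distributed data."""
    n = len(data)
    ybar = sum(data) / n
    mu, sigma2 = ybar, 1.0                           # starting values
    draws = []
    for _ in range(n_iter):
        # Draw mu from its normal conditional given sigma2 and the data.
        precision = 1.0 / prior_var + n / sigma2
        mean = (prior_mean / prior_var + n * ybar / sigma2) / precision
        mu = rng.gauss(mean, (1.0 / precision) ** 0.5)
        # Draw sigma2 from its inverse-gamma conditional given mu and the data.
        shape = shape0 + n / 2.0
        rate = rate0 + 0.5 * sum((y - mu) ** 2 for y in data)
        sigma2 = 1.0 / rng.gammavariate(shape, 1.0 / rate)
        draws.append((mu, sigma2))
    return draws

rng = random.Random(1)
data = [rng.gauss(5.0, 2.0) for _ in range(200)]     # simulated data, true mean 5
draws = gibbs_normal(data, n_iter=2000, rng=rng)
kept = draws[500:]                                   # discard burn-in draws
post_mean_mu = sum(d[0] for d in kept) / len(kept)
```

Each pass updates one parameter while holding the other fixed, and the retained draws approximate the joint posterior.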

The Results: Table 2 Number of genes to be declared different based on Several Choices of Hyperparameters and Various Choices of γ0

The Results: Table 3 • Number of genes to be declared different based on Several Choices of Hyperparameters

The Results: Tables 4 & 5 • These tables compare nonparametric results (based on no prior knowledge of the parameters in the distribution, i.e., the μ and σ² that we got from our priors) with the results of our algorithm. • Table 4 compares genes identified using informative priors, and Table 5 compares genes identified using moderate (less informative) priors. • The percentages are the posterior probabilities; these correspond to the thresholds. • The sum is the number of genes that overlapped; we can see that the lower the threshold, the more genes overlap. • Comparing Table 4 to Table 5, we can see that a less informative prior will result in more genes overlapping (which supports the result of analyzing the L statistic in Table 3).

The Results: Table 6 • That is not to say that more genes passing the test (of being able to help distinguish cancerous tissue from non-cancerous tissue) is better; the threshold applies more discretion in declaring a gene different, and the L statistic tells us the goodness of fit. We need both.

The Results: Table 7 • Using the full model (i.e., no threshold), change how informative the prior is and compare the resulting L measures.

Conclusion • Apply a Gibbs sampler to sample from a hierarchical class of prior distributions. • Use the results to sample from the posterior distribution and produce a summary of the results that describes how likely the gene is to be different based on tissue type. • Use thresholds to decide which genes are different enough to make a model of genes that can be applied to this problem. • Assess the model with the L measure to check the goodness of fit.

Bibliography • ‘Bayesian Models for Gene Expression With DNA Microarray Data’, Ibrahim, Chen, and Gray, Journal of the American Statistical Association, March 2002; 97(457) • Monte Carlo Strategies in Scientific Computing, Liu, Springer-Verlag New York, Inc., 2001
