Gene expression analysis

Gene expression analysis Hyunju Lee School of Information and Communications From Campbell and Heyer, “Discovering Genomics, Proteomics, & Bioinformatics,” 2nd ed., wileylss From Baxevanis and Ouellette, “Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins,” 3rd ed., Wiley-Liss, 2005. From Xianghong Zhou’s class notes.

Technologies for measuring gene expressions • Northern blotting • Real Time PCR • DNA microarrays or Gene Chips • RNA-Seq

The usage of microarray technology Microarray technology can measure the expression level of the whole transcriptome AT ONCE * Transcriptome: the complete set of all transcripts in a cell at any given time point

What Can Microarray Analysis tell us: • Which genes are involved in which biological processes? • Changes in gene expression are important in many biological contexts: • Development • Cancer • Other diseases • Environmental adaptation • The quantitative description of cellular response to external perturbations

cDNA array technology (1) • Array preparation • Probes for cDNA arrays are usually products of PCR generated from cDNA libraries or clone collections • Probes are printed onto glass slides or nylon membranes as spots at defined locations. • Target preparation Schulze, A, and J Downward. "Navigating Gene Expression using Microarrays --A Technology Review." Nat Cell Biol. 3, no. 8 (August 2001): E190-5.

cDNA array technology (2)

cDNA array technology (3) • Samples • Cells are grown in two different conditions, such as in the presence and absence of oxygen. • mRNA is harvested from each population of cells, and the two mRNA populations are separately converted into cDNAs by the enzyme reverse transcriptase. • The nucleotides used to make the cDNA include either a green dye called Cy3 or a red dye called Cy5. Therefore, the two populations of cDNAs are colored either green or red, each color representing the transcriptome from one population of cells.

cDNA array technology (4) • Hybridize • The two populations of cDNAs (green and red) are mixed and incubated with the DNA chip so that complementary regions form base pairs. • After a long incubation (typically overnight), the cDNAs that did not bind to any spots are washed off and the chip is allowed to dry in the dark.

cDNA array technology (5) • Scan • When the microarray is dry, it is put into a scanner that uses light and sensors to record the location and two-color intensities for each spot. • The two color images (one green and one red) are stored in a computer for image analysis. • The two images are merged into a new image, in which yellow spots indicate the open reading frames (ORFs) that are transcribed in both transcriptomes.

Reading spotted microarrays • Locate spots in the image. • Compute relative expression levels. • Throw out poorly measured spots. • Normalize each channel. • Calculate log expression ratios. Section of cDNA image: some spots run into each other; these spots have excessively large areas

Quantification of expression • For each spot on the slide we calculate Red intensity = R(foreground) - R(background) • and Green intensity = G(foreground) - G(background) • and combine them in the log (base 2) ratio Log2(Red intensity / Green intensity)
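The per-spot calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular image-analysis package: the function name `log_ratio`, the example intensities, and the choice to return `None` when a background-corrected intensity is not positive are all assumptions made for this sketch.

```python
import math

def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Background-corrected log2(Red / Green) for a single spot.

    Returns None when either corrected intensity is not positive,
    because the log ratio is undefined there (assumed handling).
    """
    red = r_fg - r_bg      # Red intensity = R(foreground) - R(background)
    green = g_fg - g_bg    # Green intensity = G(foreground) - G(background)
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

# Hypothetical intensities: corrected red (800) is twice corrected green (400)
print(log_ratio(r_fg=900, r_bg=100, g_fg=500, g_bg=100))
```

A positive value means the red channel is stronger, a negative value means the green channel is stronger, and 0 means equal intensity in both channels.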

Gene expression data • Measured on p genes for n slides: p is O(10,000), and n is O(10-100), but growing. • These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale. Gene expression level of gene 5 in slide 4 = Log2(Red intensity / Green intensity)

Two approaches towards microarray experiments in cell biology Schulze, A, and J Downward. "Navigating Gene Expression using Microarrays --A Technology Review." Nat Cell Biol. 3, no. 8 (August 2001): E190-5.

RNA Seq • RNA-seq is rapidly gaining ground on microarray technology in terms of popularity • Sequence and align RNA fragments • Generate counts for genes/exons/regions • Perform comparative analysis (e.g., differential expression)

RNA Seq • Quantification of transcripts • RPKM = reads (or fragments) per kilobase of exon per million mapped reads • 1 RPKM ~ 1 copy in a cell • Figure: example with 10 million mapped reads and transcripts 1 kb and 2 kb in length (Nature Immunology 13, 802-807)
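The RPKM definition can be written out as a short Python sketch. The function name `rpkm` and the read counts (10 and 20) are hypothetical values chosen for illustration; only the 10 million mapped reads and the 1 kb and 2 kb lengths come from the slide.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    kilobases = exon_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_of_reads)

total = 10_000_000  # 10 million mapped reads

# Hypothetical counts: a 2 kb transcript collects twice the reads of a
# 1 kb transcript at the same abundance, so both get the same RPKM.
print(rpkm(read_count=10, exon_length_bp=1_000, total_mapped_reads=total))
print(rpkm(read_count=20, exon_length_bp=2_000, total_mapped_reads=total))
```

Dividing by transcript length and by sequencing depth is what makes values comparable between genes of different lengths and between libraries of different sizes.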

Clustering From Sorin Draghici, “Data Analysis Tools for DNA Microarrays,” Terry Speed, “Statistical analysis of gene expression microarray data,” and Dr. Xianghong Zhou’s lecture slides.

To classify breast carcinomas based on the variations in gene expression derived from cDNA microarray and to correlate tumor characteristics to clinical outcome. * These plots were generated using Michael Eisen’s Cluster and Treeview packages.

Why cluster? • Cluster genes (rows) • Measure expression at multiple time-points, different conditions, etc. • Similar expression patterns may suggest similar functions of genes • Cluster samples (columns) • e.g., expression levels of thousands of genes for each tumor sample • Similar expression patterns may suggest biological relationship among samples

Clustering • Cluster analysis: the number of groups is not known in advance; the goal is to establish groups and then analyze group membership • Clustering is the process of grouping together similar entities

Distance measures • Given vectors x=(x1, …, xn), y=(y1, …, yn) • Euclidean distance • Correlation distance
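The formulas for these two measures did not survive in the slide text. The standard definitions, which agree with the worked example given later in the deck, are sketched below; the symbols $d_E$, $d_c$, $r$, $\bar{x}$ and $\bar{y}$ are notation assumed here, with $r$ the Pearson correlation coefficient.

```latex
d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

d_c(x, y) = 1 - r(x, y), \qquad
r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
               {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
                \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Because $r$ lies between -1 and 1, the correlation distance lies between 0 (perfectly correlated profiles) and 2 (perfectly anti-correlated profiles).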

Euclidean distance The Euclidean distance is computed in accordance with the Pythagorean theorem. • Here n is the number of dimensions in the data vector. • For instance: • Number of time-points/conditions (when clustering genes) • Number of genes (when clustering samples) • Figure: right triangle between the points (x1, x2) and (y1, y2), with legs b = x1 - y1 and a = x2 - y2

Correlation distance The location/scale invariance of the 1-correlation dissimilarity makes it a popular choice for microarray data.

Correlation distance • We might care more about the overall shape of expression profiles than about the actual magnitudes • That is, we want to consider genes similar when they go “up” and “down” together

Which distance measure to use? • Euclidean distance: the usual distance as we know it from our environment. It can be used as a reference when summarizing the other distances • Correlation distance: will look for similar variation as opposed to similar numerical values • An example: x1 = (1,2,3,4,5), x2 = (100,200,300,400,500), x3 = (5,4,3,2,1) • dc(x1,x2) = 1 - 1 = 0, dc(x1,x3) = 1 - (-1) = 2 • dE(x1,x2) = 734.2, dE(x1,x3) = 6.32
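The numbers in this example can be reproduced with a short Python sketch. The function names are chosen for illustration, and the correlation distance is taken to be 1 minus the Pearson correlation, which is an assumption consistent with the values on the slide.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """1 - Pearson correlation (assumes neither vector is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - cov / (spread_x * spread_y)

x1 = (1, 2, 3, 4, 5)
x2 = (100, 200, 300, 400, 500)
x3 = (5, 4, 3, 2, 1)

print(round(correlation_distance(x1, x2), 2))  # same shape, different scale
print(round(correlation_distance(x1, x3), 2))  # opposite shape
print(round(euclidean_distance(x1, x2), 1))
print(round(euclidean_distance(x1, x3), 2))
```

The two measures rank the pairs in opposite order: x1 and x2 are identical in shape but far apart in magnitude, while x1 and x3 are close in magnitude but move in opposite directions.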

Two basic types of clustering methods

Partitioning methods • Partition the data into a prespecified number k of mutually exclusive and exhaustive groups. • Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within-cluster sums of squares. • Examples: k-means, self-organizing maps (SOM), etc.

K-means clustering Select k initial seeds; k is predetermined.

K-means clustering Step 1: Assign each data point to the cluster whose seed is closest to it Step 2: Calculate the mean of each cluster Step 3: Move each seed to the mean of its cluster Step 4: Repeat Steps 1-3 until the seeds no longer change
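The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the seeds are passed in explicitly rather than chosen at random, Euclidean distance is used, an empty cluster keeps its old seed, and the function name `kmeans`, the `max_iter` safeguard and the example points are all hypothetical.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Plain k-means on tuples of coordinates, starting from given seeds."""
    centers = [tuple(s) for s in seeds]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Step 1: assign each point to the cluster with the closest seed
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Steps 2-3: move each seed to the mean of its cluster
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(old)  # assumed: empty cluster keeps its seed
        # Step 4: stop once the seeds no longer change
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D data with two obvious groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, seeds=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```

Different seeds can lead to different final clusters, so in practice the algorithm is often run several times from different starting points.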

Agglomerative hierarchical clustering • Bottom-up algorithm • Start with the objects as clusters. • In each iteration, merge the two clusters with the minimal distance from each other - until you are left with a single cluster comprising all objects. • But what is the distance between two clusters?

Distances between clusters used for hierarchical clustering • Calculation of the distance between two clusters is based on the pairwise distances between members of the clusters. • Single linkage: smallest pairwise distance • Complete linkage: largest pairwise distance • Centroid linkage: distance between the two centroids • Average linkage: average pairwise distance • Complete linkage gives preference to compact/spherical clusters. Single linkage can produce long stretched clusters. • Figure: linkage types in hierarchical clustering. Left to right: centroid linkage, single linkage, complete linkage, and average linkage
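The four linkage rules can be compared side by side in a small Python sketch. The function names, the use of Euclidean distance between members, and the two example clusters are assumptions made for illustration.

```python
import math
from itertools import product

def pairwise_distances(a, b):
    """Euclidean distances between every member of a and every member of b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def centroid(cluster):
    """Point whose coordinates are the means of the members' coordinates."""
    return tuple(sum(v) / len(cluster) for v in zip(*cluster))

def single_linkage(a, b):
    return min(pairwise_distances(a, b))

def complete_linkage(a, b):
    return max(pairwise_distances(a, b))

def average_linkage(a, b):
    d = pairwise_distances(a, b)
    return sum(d) / len(d)

def centroid_linkage(a, b):
    return math.dist(centroid(a), centroid(b))

# Hypothetical clusters lying on a line, so distances are easy to check by eye
A = [(0, 0), (1, 0)]
B = [(3, 0), (5, 0)]
for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, rule(A, B))
```

For these two clusters the pairwise distances are 2, 3, 4 and 5, so single linkage reports the smallest of them and complete linkage the largest, while the average and centroid rules fall in between.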

Centroid • The centroid of a group of patterns is the point that has each coordinate equal to the mean of the corresponding coordinates of the given patterns. • For instance, assume that the 3 experiments are in the same cluster, Exp1 = (1,2,3), Exp2 = (2,3,4) and Exp3 = (3,4,5); then this cluster has its centroid at ((1+2+3)/3, (2+3+4)/3, (3+4+5)/3) = (2,3,4)

Cutting tree diagrams • The height of a node in the dendrogram represents the distance between its two child clusters. • A hierarchical clustering diagram may be used to divide the data into a predetermined number of clusters. The division may be done by cutting the tree at a certain depth (distance from the root).
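A bottom-up clustering that stops once a predetermined number of clusters remains gives the same groups as building the full tree and cutting it where that many branches are left. The sketch below illustrates this with single linkage; the function names, the choice of linkage, and the one-dimensional example data are assumptions, and the quadratic search for the closest pair is written for clarity rather than speed.

```python
import math
from itertools import combinations

def single_linkage(a, b):
    """Smallest Euclidean distance between a member of a and a member of b."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, k, linkage=single_linkage):
    """Merge the two closest clusters repeatedly until k clusters remain.

    Returns the clusters and the list of merge heights, i.e. the node
    heights a dendrogram would show for the merges performed.
    """
    clusters = [[p] for p in points]   # start with every object as a cluster
    heights = []
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        heights.append(linkage(clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # i < j, so index i stays valid
        del clusters[j]
    return clusters, heights

# Hypothetical one-dimensional expression values
points = [(0,), (1,), (5,), (6,), (20,)]
clusters, heights = agglomerate(points, k=2)
print(clusters)
print(heights)
```

Setting k=1 performs every merge and returns the full sequence of merge heights, from which the tree can be cut at any chosen distance.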

Figure: bottom-up methods vs. top-down methods; centroid linkage and mean linkage

Comparison of clustering algorithms • Hierarchical clustering + Widely used. + Easy to understand. + Does not require the number of clusters a priori. – Difficult to implement well. – Requires post-processing – Greediness can lock in early mistakes. – There is no reason to think that expression data is organized hierarchically.

Identifying differentially expressed genes From Sorin Draghici, “Data Analysis Tools for DNA Microarrays,” and Dr. Xianghong Zhou’s lecture slides.

Introduction • Many microarray experiments are carried out to find genes which are differentially expressed between two (or more) samples of cells: • cells (from the liver, say), in a mouse with a gene knocked out, compared with liver cells in a normal mouse of the same strain • tumor cells in some organ (say the liver), compared with normal cells from the same organ • cells from an organism (say yeast) after a treatment (say by heat, or cold, or a drug) compared with cells of the same kind in the untreated state • cells from some part of a developing organ or organism at one time, compared with cells of the same kind at a later time, and so on

Identifying differentially expressed genes • Problem: given samples in two groups A and B, identify the genes that are differentially expressed between A and B. • Methods: Fold change (for a gene g, its mean expression in A divided by its mean expression in B); t-test, to pick up genes with maximal difference in mean expression between sample groups and minimal variance of expression within sample groups Image by O. Troyanskaya

Fold change • Fold change between control and experiment • An arbitrary threshold, such as 2-fold or 3-fold, is chosen, and a difference is considered significant if it is larger than the threshold. • Figures: fold change on a scatter plot (axes: control, experiment) and fold change on a ratio-intensity plot (axes: control * experiment, control / experiment)

Fold change (cont.) • Drawbacks • The fold threshold is chosen arbitrarily and may often be inappropriate. • Microarray technology tends to have a poor signal-to-noise ratio for genes with low expression levels, so fold changes computed for such genes are unreliable.
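Fold-change selection can be sketched in Python as follows. The function names, the 2-fold default threshold, the symmetric treatment of up- and down-regulation, and the example intensities are assumptions for illustration; the inputs are taken to be positive, non-log intensities.

```python
def fold_change(group_a, group_b):
    """Mean expression in group A divided by mean expression in group B."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return mean_a / mean_b

def is_differential(group_a, group_b, threshold=2.0):
    """True if the gene is up or down by at least `threshold`-fold."""
    fc = fold_change(group_a, group_b)
    return fc >= threshold or fc <= 1 / threshold

# Hypothetical intensities for one gene across three replicates per group
experiment = [400, 420, 380]
control = [100, 110, 90]
print(fold_change(experiment, control))
print(is_differential(experiment, control))
```

Note that the rule looks only at the ratio of means: it takes no account of the variability within each group, which is the gap the t-test addresses.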

t-test • A t-test is an inferential test that determines if there is a significant difference between the means of two data sets. • In other words, a t-test decides if the two data sets come from the same population or from different populations. • When we are looking at the differences between scores for two groups, we have to judge the difference between their means relative to the spread or variability of their scores. The t-test does just this.

Statistical Analysis of the t-test

What is p-value? • Definitions of p-value The probability of observing a test statistic as extreme as or more extreme than the observed value, assuming that the null hypothesis is true. Thus, low p-values indicate high statistical significance.

The p-values for the two-sample t-test

Hypothesis Testing 1. Clearly define the hypothesis 2. Generate two hypotheses - Null hypothesis - Research hypothesis: reflects our expectation 3. Calculate statistical significance - The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one obtained. - The significance level is the probability of wrongly rejecting a true null hypothesis that we are prepared to accept in our study.

Example 1. State the problem. Given a single gene (e.g. AC002378), is this gene expressed differently between cancer and healthy subjects? 2. State the null and alternative hypotheses. Null hypothesis H0: there is no difference in the expression of this gene in cancer patients vs. control subjects and all measurements come from a single distribution. Research hypothesis Ha: there are two distributions, one that describes the expression of the given gene in cancer patients and one that describes the expression of the same gene in control subjects. 3. Choose the level of significance. We choose to work at the 5% significance level. 4. Find the appropriate statistical model and test statistic.

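Example We assume that the two samples have equal variance and calculate the pooled variance. P-value (the probability of obtaining such a value by chance if the null hypothesis is true): 0.04. Since 0.04 < 0.05, we reject the null hypothesis at the 5% significance level.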
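The equal-variance calculation can be sketched in Python. The data below are hypothetical and are not the measurements behind the p-value of 0.04 on the slide; the function name is also an assumption. The sketch stops at the t statistic and its degrees of freedom, which would then be looked up in a t table (or passed to a statistics package) to obtain the p-value.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances in both groups.

    Returns (t, degrees_of_freedom).
    """
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / dof
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error, dof

# Hypothetical log expression values of one gene in two groups
cancer = [1, 2, 3, 4, 5]
healthy = [3, 4, 5, 6, 7]
t, dof = pooled_t_statistic(cancer, healthy)
print(t, dof)
```

The statistic divides the difference in means by a measure of spread, so a given difference counts for more when the measurements within each group are tightly clustered.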

T-test table
