1 / 49

Quick R Tips

Quick R Tips. How to find out what packages are available library() How to find out what packages are actually installed locally (.packages()). Biological question Differentially expressed genes Sample class prediction etc. Experimental design. Microarray experiment. Image analysis.

jenn
Download Presentation

Quick R Tips

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Quick R Tips • How to find out what packages are available • library() • How to find out what packages are actually installed locally • (.packages())

  2. Biological question Differentially expressed genes Sample class prediction etc. Experimental design Microarray experiment Image analysis Normalization Estimation Testing Clustering Discrimination Biological verification and interpretation

  3. Microarray Data Flow Microarray experiment Unsupervised Analysis – clustering Image Analysis Database Supervised Analysis Data Selection & Missing value estimation Normalization & Centering Networks & Data Integration Data Matrix Decomposition techniques

  4. A note about Affymetrix (1-color) pre-processing Two “standard” methods • MAS 5.0 (now GCOS/GDAS) by Affymetrix (compares PM and MM probes) • RMA by Speed group (UC Berkeley) (ignores MM probes) cross-chip sequence specific background correction within-probe set aggregation of intensity values within-chip

  5. Why normalize? • Microarray data have significant systematic variation both within arrays and between arrays that is not truebiological variation • Accurate comparison of genes’ relative expression within and across conditions requires normalization of effects • Sources of variation: • Spatial location on the array • Dye biases which vary with spot intensity • Plate origin • Printing/spotting quality • Experimenter

  6. Normalization – Thoughts • There are many different ways to normalize data • Global median, LOWESS, LOESS, RMA etc • By print tip, spatial, etc • BUT: don’t expect it to fix bad data! • Won’t make up for lack of replicates • Won’t make up for horrible slides

  7. #Create a boxplot of the normalized data boxplot(mydata[-1], main = "Normalized Intensities", xlab="Array", ylab="Intensities", col="blue") #To save the boxplot as a jpeg file jpeg("normal_boxplot.jpg") boxplot(mydata[-1], main = "Normalized Intensities", xlab="Array", ylab="Intensities", col="blue") dev.off()

  8. Microarray Data Analysis(slides used with permission of Dr. John Quackenbush, Dana Farber –creator of MeV software )http://www.tm4.org/mev/

  9. Classification: • Hierarchical clustering • K-means clustering Coherence: • PCA • Relevance Network Differential gene expression: • T-test • Analysis of Variance • Significance of Microarray (SAM)

  10. Hierarchical Clustering • A type of cluster analysis • There is both “divisive” and “agglomerative” HC…agglomerative is most commonly used • Group objects that are “close” to one another based on some distance/similarity metric • Clusters are created and linked based on a metric that evaluates the cluster-to-cluster distance • Results are displayed as a dendrogram

  11. g1 g1 g1 g8 g2 g8 g3 g4 g2 g2 g3 g4 g5 g4 g3 g5 g6 g5 g7 g6 g6 g7 g8 g7 Hierarchical Clustering g1 is most like g8 g4 is most like {g1, g8} (HCL-2)

  12. g1 g1 g1 g8 g8 g8 g4 g4 g4 g2 g5 g2 g3 g3 g7 g5 g2 g5 g6 g7 g3 g7 g6 g6 Hierarchical Clustering g5 is most like g7 {g5,g7} is most like {g1, g4, g8} (HCL-3)

  13. g1 g8 g4 g5 g7 g2 g3 g6 Hierarchical Tree (HCL-4)

  14. Hierarchical Clustering During construction of the hierarchy, decisions must be made to determine which clusters should be joined. The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods. (HCL-5)

  15. Agglomerative Linkage Methods • Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked. • Three linkage methods that are commonly used are: • Single Linkage • Average Linkage • Complete Linkage (HCL-6)

  16. Single Ave. Complete Comparison of Linkage Methods (HCL-10)

  17. Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6 Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Exp 2 Exp 2 Exp 3 Exp 4 Exp 4 Exp 4 Exp 1 Exp 1 Exp 3 Exp 5 Exp 5 Exp 6 Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Bootstrapping (ST) Bootstrapping – resampling with replacement Original expression matrix: Various bootstrapped matrices (by experiments): Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6

  18. Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6 Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Exp 1 Exp 3 Exp 4 Exp 5 Exp 6 Exp 1 Exp 2 Exp 3 Exp 4 Exp 6 Gene 1 Gene 1 Gene 2 Gene 2 Gene 3 Gene 3 Gene 4 Gene 4 Gene 5 Gene 5 Gene 6 Gene 6 Jackknifing – resampling without replacement Jackknifing (ST) Original expression matrix: Various jackknifed matrices (by experiments):

  19. Analysis of Bootstrapped and Jackknifed Support Trees • Bootstrapped or jackknifed expression matrices are created many times by randomly resampling the original expression matrix, using either the bootstrap or jackknife procedure. • Each time, hierarchical trees are created from the resampled matrices. • The trees are compared to the tree obtained from the original data set. • The more frequently a given cluster from the original tree is found in the resampled trees, the stronger the support for the cluster. • As each resampled matrix lacks some of the original data, high support for a cluster means that the clustering is not biased by a small subset of the data.

  20. Hierarchical Clustering in R

  21. Step 1: Data matrix • First you need a numeric matrix • Typical array data set will have samples as columns and genes as rows • We want to be sure our data are in the form of an expression matrix • Use Biobase library/package • See http://www.bioconductor.org/packages/2.2/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf > exprs<-as.matrix(data, header=TRUE, sep="\t", row.names=1, as.is=TRUE)

  22. Step 2: Calculate Distance Matrix • Default dist() method in R uses rows as the vectors..but we want the distance between samples….i.e., the columns of our matrix. • There is a handy package to help us at MD Anderson called oompaBase source("http://bioinformatics.mdanderson.org/OOMPA/oompaLite.R") oompaLite() oompainstall(groupName="all") • Once installed, be sure to locally activate the libraries library(oompaBase) library(ClassDiscovery) library(ClassComparison) • oompaBase also requires the mclust and cobs packages…download these from CRAN

  23. Use the function distanceMatrix() to create a distance matrix of your samples…. • Uses the expression set created in Step 1 as input • Remember that there are many different types of distance metrics to choose from! • See help(distanceMatrix) x<- distanceMatrix(exprs,'pearson')

  24. Step 3: Cluster • Use the hclust() function to create a hierarchical cluster based on your distance matrix, x, created in Step 2. > y<-hclust(x,method="complete") > plot(y)

  25. Testing for Differential Gene Expression with the T-test

  26. Get the multtest package from CRAN • Package contains data from the Golub leukemia microarray data set (ALL v AML) • 38 arrays • 27 from lymphoblastic • 11 from myeloid http://people.cryst.bbk.ac.uk/wernisch/macourse/

  27. library(multtest) • data(golub) • golub.cl • Generate the T statistic • teststat <-mt.teststat(golub, golub.cl) • Convert into P-values • rawp0 <-2*pt(abs(teststat),lower.tail=F, df=38-2) • Correct for multiple testing and show the ten most significant genes • procs <-c(“Bonferroni”, “BH”) • res<-mt.rawp2adjp((rawp0), procs) • res$adjp[1:10,] http://people.cryst.bbk.ac.uk/wernisch/macourse/

  28. K-Means / K-Medians Clustering (KMC)– 1 G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13 1. Specify number of clusters, e.g., 5. 2. Randomly assign genes to clusters.

  29. G3 G6 G1 G8 G4 G5 G2 G10 G9 G12 G13 G11 G7 K-Means Clustering – 2 3. Calculate mean / median expression profile of each cluster. 4. Shuffle genes among clusters such that each gene is now in the cluster whose mean / median expression profile (calculated in step 3) is the closest to that gene’s expression profile. 5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached. K-Means / K-Medians is most useful when the user has an a-priori hypothesis about the number of clusters the genes should group into.

  30. Principal Components (PCAG and PCAE) – 1 • PCA simplifies the “views” of the data. • Suppose we have measurements for each gene on multiple • experiments. • Suppose some of the experiments are correlated. • PCA will ignore the redundant experiments, and will take a • weighted average of some of the experiments, thus possibly making • the trends in the data more interpretable. • 5. The components can be thought of as axes in n-dimensional • space, where n is the number of components. Each axis represents a • different trend in the data.

  31. x z y “Cloud” of data points (e.g., genes) in 3-dimensional space Data points resolved along 3 principal component axes. PCAG and PCAE - 2 In this example, x-axis could mean a continuum from over-to under-expression (“blue” and “green” genes over-expressed, yellow genes under-expressed) y-axis could mean that “gray” genes are over-expressed in first five expts and under expressed in The remaining expts, while “brown” genes are under-expressed in the first five expts, and over-expressed in the remaining expts. z-axis might represent different cyclic patterns, e.g., “red” genes might be over-expressed in odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true for “purple” genes. Interpretation of components is somewhat subjective.

  32. Relevance Networks 10 H = -Sp(x)log2(p(x)) x=1 Set of genes whose expression profiles are predictive of one another. Can be used to identify negative correlations between genes Genes with low entropy (least variable across experiments) are excluded from analysis.

  33. E C A D E B C A D B .75 .92 .15 .37 .02 .28 .63 .51 .40 .11 Relevance Networks Tmin = 0.50 The expression pattern of each gene compared to that of every other gene. The remaining relationships between genes define the subnets Tmax = 0.90 Correlation coefficients outside the boundaries defined by the minimum and maximum thresholds are eliminated. The ability of each gene to predict the expression of each other gene is assigned a correlation coefficient

  34. Group A Group B Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6 Exp 1 Exp 2 Exp 5 Exp 3 Exp 4 Exp 6 Gene 1 Gene 1 Gene 2 Gene 2 Gene 3 Gene 3 Gene 4 Gene 4 Gene 5 Gene 5 Gene 6 Gene 6 • Assign experiments to two groups, e.g., in the expression matrix • below, assign Experiments 1, 2 and 5 to group A, and • experiments 3, 4 and 6 to group B. T-Tests (TTEST) – Between subjects (or unpaired) - 1 2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?

  35. TTEST – Between subjects - 2 3. Calculate t-statistic for each gene 4. Calculate probability value of the t-statistic for each gene either from: A. Theoretical t-distribution OR B. Permutation tests.

  36. Group A Group B Exp 1 Exp 2 Exp 5 Exp 3 Exp 4 Exp 6 Gene 1 Group A Group B Exp 1 Exp 2 Exp 5 Exp 3 Exp 4 Exp 6 Gene 1 TTEST - Between subjects - 3 Permutation tests i) For each gene, compute t-statistic ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Original grouping Randomized grouping

  37. TTEST - Between subjects - 4 Permutation tests - continued iii) Compute t-statistic for the randomized gene iv) Repeat steps i-iii n times (where n is specified by the user). v) Let x = the number of times the absolute value of the original t-statistic exceeds the absolute values of the randomized t-statistic over n randomizations. vi) Then, the p-value associated with the gene = 1 – (x/n)

  38. TTEST - Between subjects - 5 • 5. Determine whether a gene’s expression levels are significantly • different between the two groups by one of three methods: • Just alpha: If the calculated p-value for a gene is less than • or equal to the user-input alpha (critical p-value), the gene is • considered significant. • OR • Use Bonferroni corrections to reduce the probability of • erroneously classifying non-significant genes as significant. • B) Standard Bonferroni correction: The user-input alpha is divided • by the total number of genes to give a critical p-value that is used • as above.

  39. TTEST - Between subjects – 6 5C) Adjusted Bonferroni: i) The t-values for all the genes are ranked in descending order. ii) For the gene with the highest t-value, the critical p-value becomes (alpha / N), where N is the total number of genes; for the gene with the second-highest t-value, the critical p-value will be (alpha/ N-1), and so on.

  40. The problem of multiple testing • (adapted from presentation by Anja von Heydebreck, Max–Planck–Institute for Molecular Genetics, • Dept. Computational Molecular Biology, Berlin, Germany • http://www.bioconductor.org/workshops/Heidelberg02/mult.pdf) • Let’s imagine there are 10,000 genes on a chip, AND • None of them is differentially expressed. • Suppose we use a statistical test for differential • expression, where we consider a gene to be differentially expressed if it meets the criterion at a • p-value of p < 0.05.

  41. The problem of multiple testing – 2 • Let’s say that applying this test to gene “G1” yields a p-value of p = 0.01 • Remember that a p-value of 0.01 means that there is a 1% chance that the gene is not differentially expressed, i.e., • Even though we conclude that the gene is differentially expressed (because p < 0.05), there is a 1% chance that our conclusion is wrong. • We might be willing to live with such a low probability • of being wrong • BUT .....

  42. The problem of multiple testing – 3 • We are testing 10,000 genes, not just one!!! • Even though none of the genes is differentially expressed, about 5% of the genes (i.e., 500 genes) will be erroneously concluded to be differentially expressed, because we have decided to “live with” a p-value of 0.05 • If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn’t sound too good.

  43. The problem of multiple testing - 4 • There are “tricks” we can use to reduce the severity of • this problem. • They all involve “slashing” the p-value for each test • (i.e., gene), so that while the critical p-value for the entire • data set might still equal 0.05, each gene will be • evaluated at a lower p-value.

  44. Don’t get too hung up on p-values. • Ultimately, what matters is biological relevance. • P-values should help you evaluate the strength of the • evidence, rather than being used as an absolute yardstick • of significance. Statistical significance is not necessarily • the same as biological significance.

  45. Significance analysis of microarrays (SAM) • SAM can be used to pick out significant genes based on differential expression between sets of samples.

  46. SAM -2 • SAM gives estimates of the False Discovery Rate (FDR), which is the proportion of genes likely to have been wrongly identified by chance as being significant. • It is a very interactive algorithm – allows users to dynamically change thresholds for significance (through the tuning parameter delta) after looking at the distribution of the test statistic. • The ability to dynamically alter the input parameters based on immediate visual feedback, even before completing the analysis, should make the data-mining process more sensitive.

  47. SAM designs Two-class unpaired: to pick out genes whose mean expression level is significantly different between two groups of samples (analogous to between subjects t-test). Two-class paired: samples are split into two groups, and there is a 1-to-1 correspondence between an sample in group A and one in group B (analogous to paired t-test).

  48. SAM designs - 2 Multi-class: picks up genes whose mean expression is different across > 2 groups of samples (analogous to one-way ANOVA) Censored survival: picks up genes whose expression levels are correlated with duration of survival. One-class: picks up genes whose mean expression across experiments is different from a user-specified mean.

  49. Significant positive genes (i.e., mean expression of group B > mean expression of group A) in red “Observed d = expected d” line Tuning parameter “delta” limits, can be dynamically changed by using the slider bar or entering a value in the text field. The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Any gene beyond the first gene in the +ve or –ve direction on the x-axis (including the first gene), whose observed exceeds the expected by at least delta, is considered significant. Significant negative genes (i.e., mean expression of group A > mean expression of group B) in green SAM Two-Class Unpaired– 4

More Related