
Essential Statistics in Biology: Getting the Numbers Right


Presentation Transcript


  1. Essential Statistics in Biology: Getting the Numbers Right • Raphael Gottardo • Clinical Research Institute of Montreal (IRCM) • raphael.gottardo@ircm.qc.ca • http://www.rglab.org

  2. Outline • Exploratory Data Analysis • 1-2 sample t-tests, multiple testing • Clustering • SVD/PCA • Frequentists vs. Bayesians

  3. Clustering (multivariate analysis)

  4. Outline • Basics of clustering • Hierarchical clustering • K-means • Model-based clustering

  5. What is it? Clustering is the grouping of similar objects: a data set is partitioned into subsets (clusters) so that the data in each subset are "close" to one another, usually according to some defined distance measure. Examples: clustering web pages (WWW), clustering genes.
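
As a concrete illustration of a distance measure, the sketch below (toy points, not the lecture data) shows the pairwise Euclidean distances R computes with dist():

m <- matrix(c(0, 0,
              1, 0,
              10, 10), ncol = 2, byrow = TRUE)  # three toy points in 2-D
dist(m, method = "euclidean")                   # points 1 and 2 are "close", point 3 is far from both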

  6. Hierarchical clustering Given N items and a distance metric • Assign each item to its own cluster and initialize the between-cluster distance matrix with the distances between items • Find the closest pair of clusters and merge them into a single cluster • Compute the new distances between clusters • Repeat steps 2-3 until all items are joined in a single cluster
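
A minimal R sketch of these steps (the toy matrix is made up for illustration; the real example follows on slide 11):

set.seed(1)
x <- matrix(rnorm(20), ncol = 2)        # 10 toy items, 2 measurements each
D <- dist(x, method = "euclidean")      # step 1: distance matrix between items
hc <- hclust(D, method = "single")      # steps 2-4: merge the closest pair until one cluster remains
plot(hc)                                # dendrogram recording the merge order
cutree(hc, k = 3)                       # cut the tree to recover 3 clusters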

  7. Single linkage The distance between clusters is defined as the shortest distance from any member of one cluster to any member of the other cluster. [Figure: Cluster 1, Cluster 2, distance d]

  8. Complete linkage The distance between clusters is defined as the greatest distance from any member of one cluster to any member of the other cluster. [Figure: Cluster 1, Cluster 2, distance d]

  9. Average linkage The distance between clusters is defined as the average distance from any member of one cluster to any member of the other cluster. [Figure: Cluster 1, Cluster 2, d = average of all pairwise distances]
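
The three linkage rules can be compared directly in R by changing the method argument of hclust(); a small sketch on assumed toy data:

set.seed(1)
x <- matrix(rnorm(30), ncol = 2)           # 15 toy points
D <- dist(x)
hc.s <- hclust(D, method = "single")       # shortest inter-cluster distance
hc.c <- hclust(D, method = "complete")     # greatest inter-cluster distance
hc.a <- hclust(D, method = "average")      # mean of all pairwise inter-cluster distances
table(single = cutree(hc.s, 3), complete = cutree(hc.c, 3))  # the resulting partitions generally differ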

  10. Example Cell-cycle dataset (Cho et al. 1998): expression levels of ~6000 genes during the cell cycle, measured at 17 time points (2 cell cycles)

  11. Example
# Read the first 50 genes (time points in columns 3-19) and compute Euclidean distances
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip=1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")
# Single linkage: dendrogram, 4-cluster cut, and expression profiles per cluster
hc.single <- hclust(D.cho, method = "single", members=NULL)
plot(hc.single)
rect.hclust(hc.single, k=4)
class.single <- cutree(hc.single, k = 4)
par(mfrow=c(2,2))
matplot(t(cho.data[class.single==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.single==2,]), type="l", xlab="time", ylab="log expression value")
matplot(as.matrix(cho.data[class.single==3,]), type="l", xlab="time", ylab="log expression value")  # as.matrix() rather than t(): handles a cluster that reduces to a single gene
matplot(t(cho.data[class.single==4,]), type="l", xlab="time", ylab="log expression value")
# Complete linkage
hc.complete <- hclust(D.cho, method = "complete", members=NULL)
plot(hc.complete)
rect.hclust(hc.complete, k=4)
class.complete <- cutree(hc.complete, k = 4)
par(mfrow=c(2,2))
matplot(t(cho.data[class.complete==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.complete==2,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.complete==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.complete==4,]), type="l", xlab="time", ylab="log expression value")
# Average linkage
hc.average <- hclust(D.cho, method = "average", members=NULL)
plot(hc.average)
rect.hclust(hc.average, k=4)
class.average <- cutree(hc.average, k = 4)
par(mfrow=c(2,2))
matplot(t(cho.data[class.average==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.average==2,]), type="l", xlab="time", ylab="log expression value")
matplot(as.matrix(cho.data[class.average==3,]), type="l", xlab="time", ylab="log expression value")  # as.matrix(): single-gene cluster
matplot(t(cho.data[class.average==4,]), type="l", xlab="time", ylab="log expression value")
# K-means with 4 clusters (used in the later example slides)
set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow=c(2,2))
matplot(t(cho.data[km.cho$cluster==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[km.cho$cluster==2,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[km.cho$cluster==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[km.cho$cluster==4,]), type="l", xlab="time", ylab="log expression value")

  12. Example: single linkage

  13. Example: single linkage, k=2

  14. Example: single linkage, k=3

  15. Example: single linkage, k=4

  16. Example: single linkage, k=5

  17. Example: single linkage, k=25

  18. Example: single linkage, k=4 (expression profiles of clusters 1-4)

  19. Example: complete linkage, k=4

  20. Example: complete linkage, k=4 (expression profiles of clusters 1-4)

  21. K-means Given N items, assume K clusters. The goal is to minimize $\sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \mu_k \|^2$ over the possible assignments $C_1, \dots, C_K$ and centroids $\mu_1, \dots, \mu_K$; $\mu_k$ represents the location of cluster k.
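
This objective is what kmeans() in R reports as the within-cluster sum of squares; a quick check on assumed toy data:

set.seed(1)
x <- matrix(rnorm(100), ncol = 2)     # 50 toy points
fit <- kmeans(x, centers = 3)
fit$withinss                          # sum of squared distances to the centroid, per cluster
fit$tot.withinss                      # the objective K-means tries to minimize
# check cluster 1 by hand: subtract its centroid and sum the squares
sum(scale(x[fit$cluster == 1, ], center = fit$centers[1, ], scale = FALSE)^2)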

  22. K-means - algorithm • Divide the data into K initial clusters and initialize the centroids with the means of these clusters • Assign each item to the cluster with the closest centroid • When all objects have been assigned, recalculate the centroids (means) • Repeat steps 2-3 until the centroids no longer move
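
A bare-bones, hand-rolled version of these steps (a sketch only, on assumed toy data; empty clusters are not handled). The next slide runs the real kmeans() function instead.

kmeans.by.hand <- function(x, K, max.iter = 50) {    # illustrative helper, not part of the lecture code
  cluster <- sample(1:K, nrow(x), replace = TRUE)            # step 1: random initial partition
  for (it in 1:max.iter) {
    centers <- t(sapply(1:K, function(k)                     # centroids = means of the current clusters
      colMeans(x[cluster == k, , drop = FALSE])))
    d <- as.matrix(dist(rbind(centers, x)))[-(1:K), 1:K]     # item-to-centroid distances
    new.cluster <- apply(d, 1, which.min)                    # step 2: assign each item to the closest centroid
    if (all(new.cluster == cluster)) break                   # step 4: stop when nothing moves
    cluster <- new.cluster                                   # step 3: recompute and repeat
  }
  list(cluster = cluster, centers = centers)
}
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)
km <- kmeans.by.hand(x, K = 2)
table(km$cluster)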

  23. K-means - algorithm
# Animate the first few K-means iterations on simulated data
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))   # two groups of 50 points
colnames(x) <- c("x", "y")
set.seed(100)
for (i in 1:4) {
  set.seed(100)                                          # same random start each pass, one more iteration allowed
  cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = i)   # 5 random initial centroids
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)        # current centroid positions
  Sys.sleep(2)                                           # pause so the iterations can be watched
}

  24. Example: clusters 1-4. Why?

  25. Example: clusters 1-4

  26. Summary • K-means and hierarchical clustering methods are useful techniques • Fast and easy to implement • Beware of the memory requirements of hierarchical clustering • A bit "ad hoc": • Number of clusters? • Distance metric? • Good clustering?

  27. Model-based clustering • Based on probability models (e.g. normal mixture models) • Lets us say what a "good" clustering is • Compare several models • Estimate the number of clusters!

  28. Model-based clustering Yeung et al. (2001) Multivariate observations $y_1, \dots, y_N$, K clusters. Assume observation i belongs to cluster k; then $y_i \sim \mathcal{N}(\mu_k, \Sigma_k)$, that is, each cluster can be represented by a multivariate normal distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$.
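
A toy sketch of this model: simulate two multivariate normal clusters and fit a 2-component mixture with the mclust package (all means, covariances and sample sizes below are assumptions for illustration).

library(MASS)     # mvrnorm() for multivariate normal samples
library(mclust)   # Mclust() for normal mixture models
set.seed(1)
y1 <- mvrnorm(100, mu = c(0, 0), Sigma = diag(2))        # cluster 1: y_i ~ N(mu_1, Sigma_1)
y2 <- mvrnorm(100, mu = c(4, 4), Sigma = 0.5 * diag(2))  # cluster 2: y_i ~ N(mu_2, Sigma_2)
y <- rbind(y1, y2)
fit <- Mclust(y, G = 2)                                  # 2-component multivariate normal mixture
fit$parameters$mean                                      # estimated cluster means
table(fit$classification, rep(1:2, each = 100))          # recovered vs. true cluster labels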

  29. Model-based clustering Banfield and Raftery (1993): eigenvalue decomposition of the covariance matrix, $\Sigma_k = \lambda_k D_k A_k D_k^T$, where $\lambda_k$ controls the volume, $D_k$ the orientation and $A_k$ the shape of cluster k.
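
A small numeric illustration of this decomposition (the volume, shape and orientation values below are arbitrary):

lambda <- 2                                        # volume
A <- diag(c(1, 0.25)); A <- A / det(A)^(1/2)       # shape: diagonal matrix rescaled so det(A) = 1
theta <- pi / 6
D <- matrix(c(cos(theta), sin(theta),
              -sin(theta), cos(theta)), 2, 2)      # orientation: rotation matrix
Sigma <- lambda * D %*% A %*% t(D)                 # Sigma_k = lambda_k D_k A_k D_k^T
eigen(Sigma)$values                                # lambda times the shape eigenvalues: 4 and 1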

  30. Model-based clustering Common covariance parametrizations: • EII: equal volume, spherical ($\Sigma_k = \lambda I$) • VII: unequal volume, spherical ($\Sigma_k = \lambda_k I$) • EEE: equal volume, shape and orientation ($\Sigma_k = \lambda D A D^T$) • VVV: unconstrained ($\Sigma_k = \lambda_k D_k A_k D_k^T$)

  31. Estimation Likelihood (mixture model): $L(\theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \phi(y_i \mid \mu_k, \Sigma_k)$, where $\pi_k$ are the mixing proportions and $\phi$ is the multivariate normal density. Given the number of clusters and the covariance structure, the EM algorithm can be used to maximize it. Mclust R package available from CRAN
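
In current versions of mclust the fitting function is Mclust() (the later slides use the older EMclust() call); a minimal sketch with the number of clusters and the covariance structure fixed, on assumed toy data:

library(mclust)
set.seed(1)
z <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))   # toy data with two groups
fit <- Mclust(z, G = 2, modelNames = "EII")          # EM fit: 2 clusters, spherical equal-volume model
fit$parameters$pro                                   # estimated mixing proportions
head(fit$z)                                          # EM "responsibilities": P(cluster k | observation i)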

  32. Model selection Which model is appropriate? • Which covariance structure? • How many clusters? Compare the different models using BIC

  33. Model selection We wish to compare two models $M_1$ and $M_2$ with parameters $\theta_1$ and $\theta_2$ respectively. Given the observed data D, define the integrated likelihood $p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k$, the probability of observing the data given model $M_k$. NB: $\theta_1$ and $\theta_2$ might have different dimensions

  34. Model selection To compare two models $M_1$ and $M_2$, use the ratio of integrated likelihoods $p(D \mid M_1)/p(D \mid M_2)$. The integral is difficult to compute! Bayesian information criterion: $2 \log p(D \mid M_k) \approx 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log(N) = \mathrm{BIC}_k$, where $p(D \mid \hat{\theta}_k, M_k)$ is the maximized likelihood and $\nu_k$ is the number of parameters in model $M_k$

  35. Model selection Bayesian information criterion: $\mathrm{BIC}_k = 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log(N)$, a measure of fit minus a penalty term for model complexity. A large BIC score indicates strong evidence for the corresponding model. BIC can be used to choose the number of clusters and the covariance parametrization (Mclust)
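
A minimal sketch of BIC-based selection with the current mclust interface (mclustBIC()); the toy data and candidate models below are assumptions, and the next slide does the same thing for the yeast data with the older EMclust() call.

library(mclust)
set.seed(1)
z <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))           # toy two-group data
bic <- mclustBIC(z, G = 1:6, modelNames = c("EII", "EEI", "VVV"))
plot(bic)        # BIC against the number of clusters, one curve per covariance structure
summary(bic)     # best models: the largest BIC wins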

  36. Example revisited [Figure: BIC curves, panels 1 (EII) and 2 (EEI)]
library(mclust)
# BIC over numbers of clusters for two covariance structures
# cho.data.std: standardized version of cho.data (construction not shown on the slides)
cho.mclust.bic <- EMclust(cho.data.std, modelNames = c("EII", "EEI"))
plot(cho.mclust.bic)
# Fit the EII model with 4 clusters and extract the classification
cho.mclust <- EMclust(cho.data.std, 4, "EII")
sum.cho <- summary(cho.mclust, cho.data.std)

  37. Example revisited
# Expression profiles of the 4 clusters found by the EII model
par(mfrow=c(2,2))
matplot(t(cho.data[sum.cho$classification==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==2,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==4,]), type="l", xlab="time", ylab="log expression value")

  38. Example revisited: EII, 4 clusters (expression profiles of clusters 1-4)

  39. Example revisited
# Refit with the EEI model and 3 clusters
cho.mclust <- EMclust(cho.data.std, 3, "EEI")
sum.cho <- summary(cho.mclust, cho.data.std)
par(mfrow=c(2,2))
matplot(t(cho.data[sum.cho$classification==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==2,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==3,]), type="l", xlab="time", ylab="log expression value")

  40. Example revisited: EEI, 3 clusters (expression profiles of clusters 1-3)

  41. Summary • Model-based clustering is a nice alternative to heuristic clustering algorithms • BIC can be used for choosing the covariance structure and the number of clusters

  42. Conclusion • We have seen a few clustering algorithms • There are many others • Two-way clustering • Plaid model ... • Clustering is a useful tool and ... a dangerous weapon • To be consumed with moderation!
