## Statistical analysis of array data: Dimensionality reduction, Clustering


Katja Astikainen, Riikka Kaven, 25.2.2005

**Contents**

- Problems and approaches
- Dimensionality reduction by PCA
- Clustering overview
- Hierarchical clustering
- K-means
- Mixture models and EM

**Problems and approaches**

- The basic idea is to find patterns of expression across multiple genes and experiments.
- Models of expression are used, for example, to classify diseases more precisely (disease classification, disease stage).
- Expression patterns can be used to explore cellular pathways.
- With gene expression modelling and clustering of conditions (experiments), one can find genes that are co-regulated.
- Clustering methods can also be used for sequence alignments.
- There are several methods for this, but we are going to introduce:
  - Principal Component Analysis (PCA)
  - Clustering (hierarchical, K-means, EM)

**Dimensionality reduction by PCA**

PCA is a statistical data analysis technique. It is a method to reduce dimensionality, to identify new meaningful underlying variables, to compress the data, and to visualize the data.

- We have N data points $x_1,\dots,x_N$ in an M-dimensional space, where the $x_i$ are gene expression vectors.
- With PCA we can reduce the dimension to K, which is usually much lower than M.
- Imagine taking a three-dimensional cloud of data points and rotating it so that you can view it from different perspectives. Certain views would let you separate the data into groups better than others.
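To make the reduction from M dimensions to K concrete, here is a minimal sketch rather than the authors' implementation. It assumes one standard route to PCA, namely eigen-decomposition of the sample covariance matrix, and uses plain power iteration with deflation so that only the Python standard library is needed. The function names and the toy `cloud` data are illustrative assumptions.

```python
import math
import random


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def covariance_matrix(points):
    """Return the mean vector and the sample covariance matrix of the points."""
    n, m = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(m)]
    centered = [[p[j] - mean[j] for j in range(m)] for p in points]
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(m)] for a in range(m)]
    return mean, cov


def top_eigenvector(cov, iters=500, seed=0):
    """Power iteration: leading unit eigenvector and eigenvalue of cov."""
    rng = random.Random(seed)
    u = normalize([rng.random() + 0.1 for _ in cov])
    for _ in range(iters):
        w = mat_vec(cov, u)
        if not any(w):  # zero matrix: no variance left in any direction
            break
        u = normalize(w)
    eigenvalue = sum(a * b for a, b in zip(u, mat_vec(cov, u)))
    return u, eigenvalue


def pca(points, k):
    """Return (mean, components, eigenvalues) for the k leading components."""
    mean, cov = covariance_matrix(points)
    m = len(cov)
    components, eigenvalues = [], []
    for _ in range(k):
        u, lam = top_eigenvector(cov)
        components.append(u)
        eigenvalues.append(lam)
        # Deflation: subtract the variance already explained by u.
        cov = [[cov[a][b] - lam * u[a] * u[b] for b in range(m)]
               for a in range(m)]
    return mean, components, eigenvalues


def project(x, mean, components):
    """Coordinates of x along each principal component (after centering)."""
    return [sum(u[j] * (x[j] - mean[j]) for j in range(len(x)))
            for u in components]


# Toy 3-D cloud (made-up numbers) reduced to K = 2 dimensions.
cloud = [[2.0, 4.1, 3.9], [1.0, 2.0, 2.1], [3.0, 5.9, 6.0],
         [4.0, 8.1, 7.9], [0.0, 0.1, -0.1]]
mean, components, eigenvalues = pca(cloud, k=2)
print("eigenvalues:", eigenvalues)
print("2-D coordinates:", [project(x, mean, components) for x in cloud])
```

Power iteration is chosen here only to keep the sketch dependency-free; in practice a linear algebra library's symmetric eigen-solver would normally be used instead.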
With PCA we can ignore some of the redundant experiments (those with low variance), or replace them with an average, with little loss of information.

- We are looking for a unit vector $u_1$ such that, on average, the squared length of the projection of the $x$s along $u_1$ is maximal (vectors are column vectors).
- In general, once the first components $u_1,\dots,u_{k-1}$ have been determined, the next component $u_k$ is the one that maximizes the residual variance.
- The principal components of an expression vector $x$ are given by $c_i = u_i^T x$.

How can we find the eigenvectors $u_i$?

- We look for the eigenvectors that capture the most informative part of the data, that is, the directions of maximal variance.
- First, calculate the covariance matrix.
- Find the eigenvalues and eigenvectors $u_k$ of the covariance matrix.
  - Each eigenvalue, relative to the sum of all eigenvalues, measures the proportion of the variance explained by the corresponding eigenvector.
- Select the $u_i$ that are the eigenvectors of the sample covariance matrix associated with the K largest eigenvalues.
  - These eigenvectors explain most of the variance in the data.
  - They reveal the important features and patterns in the data.
  - For data visualization, use two- or three-dimensional spaces.

**Clustering overview**

- Clustering comprises data analysis methods for discovering patterns and underlying cluster structures.
- There are different kinds of methods, such as hierarchical clustering, partitioning-based k-means, and the Self-Organizing Map (SOM).
- No single method is best for every data set.
- Clustering methods (like k-means) are unsupervised methods:
  - there is no information about the true clusters or their number
  - clustering algorithms are used for analysing the data
  - the discovered clusters are only estimates of the truth (often the result is a local optimum)
- Data types
  - Typically the clustered data is numerical vector data, like gene expression data (expression vectors).
  - Numerical data can also be represented in relative coordinates.
  - Data might also be qualitative (nominal), which makes comparing data elements more challenging.
- The number of clusters is often unknown.
  - One way to estimate the number of clusters is to analyse the data by PCA; you might use the eigenvectors to estimate the number of clusters.
  - Another way is to make guesses and justify the chosen number of clusters by good results (whatever they are).
- Similarity measures
  - Pearson correlation (dot product of normalized vectors)
- Distance measures
  - Euclidean (the natural distance between two vectors)
- It is important to use appropriate distance or similarity measures.
  - In Euclidean space two vectors might be close to each other while their correlation is 0, for example (1,0,0,0,0,0,0,0,0,0) and (0,0,0,0,0,0,0,0,0,1).

Cost function and probabilistic interpretation:

- To compare different ways of clustering the same data, we need some kind of cost function for the clustering algorithm.
- The goal of clustering is to minimize this cost function.
- In general, the cost function depends on quantities such as:
  - the centers of the clusters
  - the distance of each point in a cluster to the cluster center
  - the average degree of similarity of the points in a cluster
- Cost functions are algorithm-specific, so comparing the results of different clustering algorithms might be almost impossible.
- Probabilistic models have some advantages, so they are often used in cost functions.
- A popular choice for the clustering cost function is the negative log-likelihood of an underlying probabilistic model.

**Hierarchical clustering**

- The basic idea is to construct a hierarchical tree that consists of nested clusters.
- The algorithm is a bottom-up method: clustering starts from single data points (genes) and stops when all data points are in the same cluster (the root of the tree).
- Clustering begins by computing pairwise similarities between the data points; once clusters have been formed, similarities are computed between clusters.
- The merging step is repeated at most N-1 times: the leaf nodes (genes) form the first pairs, and the tree becomes a binary tree.

Hierarchical clustering: phases

1. Calculate the pairwise similarities between the data points and store them in a matrix.
2. Find the two data points (nodes in the tree) that are closest to each other, or most similar.
3. Group them together to make a new cluster.
4. Calculate the average vector of the data points, which is the expression profile of the cluster (the inner node of the tree that joins the leaf nodes, i.e. the data point vectors).
5. Calculate the new correlation matrix, that is, the pairwise similarities between the new cluster and the other clusters.

**Tree Visualization**

- With hierarchical clustering we can find the clusters of data points as a dendrogram, but the constructed tree is not yet in an optimal order.
- After finding the dendrogram, which tells the similarity between nodes and genes, the final, optimal linear order of the nodes can be found with dynamic programming.

Tree visualization with dynamic programming [2]

- Goal: quickly and easily arrange the data (genes by experiments) for further inspection.
- Greedily join the nearest cluster pair [3]. "Nearest" is measured with the correlation coefficient (normalized dot product); other measures can be used as well.
- Optimal ordering: minimize the summed distance between consecutive genes (criterion suggested by Eisen).
- Figure: example tree over the leaves A B C D E; the slides show the leaf orders A B C D E, A C B D E and B A C E D.

Hierarchical clustering: dynamic programming

- The optimal linear ordering of the gene expression vectors can be computed in $O(N^4)$ steps.
- We would like to maximize the similarity between neighbouring leaves, $\sum_{i=1}^{N-1} C(G_{\phi(i)}, G_{\phi(i+1)})$, where $C$ is the similarity measure and $G_{\phi(i)}$ is the $i$th leaf when the tree is ordered according to $\phi$.
- The algorithm works from the bottom up towards the root by recursively computing $M(V,U,W)$, the cost of the optimal ordering of the subtree rooted at $V$ with leftmost leaf $U$ and rightmost leaf $W$ [1].
- The dynamic programming recurrence is given by

$$M(V,U,W) = \max_{R \in V_l,\; S \in V_r} \left[ M(V_l,U,R) + M(V_r,S,W) + C(R,S) \right]$$

  where $V_l$ and $V_r$ are the left and right subtrees of $V$, and $R$ and $S$ range over their leaves.
- The optimal cost $M(V)$ for $V$ is obtained by maximizing over all pairs $U$, $W$.
- The global optimal cost is obtained recursively when $V$ is the root of the tree, and the optimal tree can be found by standard backtracking [1].

**k-means algorithm**

- The data points are divided into k clusters.
- By iterating, find a set of centroids $C=\{v_1,\dots,v_K\}$ that minimizes the squared distances between the expression vectors $x_1,\dots,x_N$ and the centroid $\mathrm{REP}[x_j,C]$ of the cluster they belong to:

$$\sum_{j=1}^{N} d^2\left(x_j, \mathrm{REP}[x_j,C]\right)$$

  where the distance measure $d$ is Euclidean. In practice the result is an approximation (a local optimum).
- Each expression vector belongs to exactly one cluster.

k-means algorithm: phases

1. Initially, put the expression vectors randomly into k clusters.
2. Define each cluster's centroid as the average of the expression vectors that belong to the cluster.
3. Compute the distances between the expression vectors and the centroids.
4. Move every expression vector into the cluster with the closest centroid.
5. Define new centroids for the clusters. If the centroids are stable, or some other stopping criterion is met, stop the algorithm. Otherwise repeat steps 3-5.

Figure 4 [4]: K-means example: 1) Expression vectors are randomly divided into three clusters. 2) Define the centroids. 3) Compute the distances of the expression vectors to the centroids. 4) Compute the new locations of the centroids. 5) Compute the distances of the expression vectors to the centroids. 6) Compute the new locations of the centroids and finish the clustering, because the centroids have stabilized.
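The k-means phases listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation behind Figure 4: the function names, the fixed random seed, and the choice to re-seed an emptied cluster with a random data point are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def k_means(vectors, k, max_iter=100, seed=0):
    """Cluster vectors into k groups; returns (labels, centroids).

    Assumes len(vectors) >= k.
    """
    rng = random.Random(seed)
    # Step 1: random initial assignment; dealing labels round-robin and
    # shuffling them keeps every cluster non-empty at the start.
    labels = [i % k for i in range(len(vectors))]
    rng.shuffle(labels)
    centroids = None
    for _ in range(max_iter):
        # Steps 2 and 5: each centroid is the mean of its cluster members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            # Assumption: an emptied cluster is re-seeded with a random point.
            new_centroids.append(centroid(members) if members
                                 else list(rng.choice(vectors)))
        # Stopping criterion: the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
        # Steps 3 and 4: move every vector to the closest centroid.
        labels = [min(range(k),
                      key=lambda c: squared_distance(v, centroids[c]))
                  for v in vectors]
    return labels, centroids


# Toy expression vectors (made-up numbers) forming two obvious groups.
vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = k_means(vectors, k=2)
print("labels:", labels)
print("centroids:", centroids)
```

Because the result depends on the random start and is only a local optimum, a common practice is to run the procedure with several seeds and keep the clustering with the smallest summed squared distance.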
In Figure 4, the clusters formed are circled.

**Mixture models and EM**

- The EM algorithm is based on modelling complex distributions as a combination of simple Gaussian distributions, one per cluster.
- The K-means algorithm is an online approximation of the EM algorithm.
  - It maximizes the quadratic log-likelihood (minimizes the squared distances of the data points to their cluster centroids).
- The EM algorithm is used to optimize the center of each cluster (weighted variance is maximal), which means that we find the maximum likelihood estimate for the center of the cluster's Gaussian distribution.
- Some initial guesses have to be made before starting:
  - the number of clusters (k)
  - the initial centers of the clusters

The algorithm is an iterative process with two optimization tasks:

- E-step: the membership probabilities (hidden variables) of each data point for each mixture component (cluster) are Estimated. The maximum likelihood estimate of the mixing coefficient is the sample mean of the conditional probabilities that $d_i$ comes from model $k$.
- M-step: K separate estimation problems of Maximizing the log-likelihood of component $k$, with weights given by the estimated membership probabilities. In the M-step, the means of the Gaussian distributions are estimated so that they maximize the likelihood of the models.

**References**

[1] Baldi, P. and Hatfield, G. W., DNA Microarrays and Gene Expression, Cambridge University Press, 2002, 73-96.

[2] URL http://www-2.cs.cmu.edu/~zivbj/class04/lecture11.ppt

[3] Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D., Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A, 95 (1998), 14863-8.

[4] Gasch, A. P. and Eisen, M. B., Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology, 3, 11 (2002), 1-22. URL http://citeseer.ist.psu.edu/gasch02exploring.html.
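As a supplementary sketch for the mixture-model slides, the E-step and M-step can be written out for the simplest case. The code below assumes one-dimensional data and a mixture of Gaussians with per-component mean, variance and mixing coefficient; the function names, the unit starting variances, the variance floor `min_var` and the toy data are assumptions for illustration, not part of the original slides.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gaussian_mixture(data, initial_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture; returns (weights, means, variances, resp)."""
    k = len(initial_means)
    means = list(initial_means)   # initial guess for the cluster centers
    variances = [1.0] * k         # assumption: start from unit variances
    weights = [1.0 / k] * k       # assumption: equal mixing coefficients
    resp = []
    for _ in range(n_iter):
        # E-step: membership probability of each data point for each component.
        resp = []
        for x in data:
            joint = [weights[c] * gaussian_pdf(x, means[c], variances[c])
                     for c in range(k)]
            total = sum(joint)
            if total == 0.0:      # numerical underflow: fall back to uniform
                resp.append([1.0 / k] * k)
            else:
                resp.append([j / total for j in joint])
        # M-step: k separate weighted maximum likelihood problems.
        for c in range(k):
            r_c = sum(r[c] for r in resp)
            if r_c == 0.0:        # component owns no probability mass; keep it
                continue
            # Mixing coefficient: sample mean of the membership probabilities.
            weights[c] = r_c / len(data)
            # Mean: membership-weighted average of the data points.
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / r_c
            # Variance: membership-weighted spread, floored for stability.
            variances[c] = max(min_var,
                               sum(r[c] * (x - means[c]) ** 2
                                   for r, x in zip(resp, data)) / r_c)
    return weights, means, variances, resp


# Toy data (made-up numbers) drawn around two centers, with rough initial guesses.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, means, variances, resp = em_gaussian_mixture(data, [0.0, 6.0])
print("mixing coefficients:", weights)
print("means:", means)
```

Replacing the soft membership probabilities with hard 0/1 assignments and fixing equal variances turns the mean update into the k-means centroid update, which is the sense in which k-means approximates EM.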