
Bioinformatics Cluster Analysis

Bioinformatics Cluster Analysis. Mentee: Joonoh Lim Mentor: Sanketh Shetty. Background. Cluster analysis is an unsupervised method of determining groupings (clusters) in data sets. In biology, cluster analysis is used to study genes and gene expressions.


Presentation Transcript


  1. Bioinformatics Cluster Analysis • Mentee: Joonoh Lim • Mentor: Sanketh Shetty

  2. Background • Cluster analysis is an unsupervised method of determining groupings (clusters) in data sets. • In biology, cluster analysis is used to study genes and gene expression. • There are three categories of gene expression data clustering: gene-based, sample-based, and subspace clustering. • The data set is usually obtained from DNA microarrays.

  3. DNA Microarray

  4. Establishing Data Set • Gene-based: 15 x 15 x 8 → 225 x 8 • Sample-based: 15 x 15 x 8 → 8 x 225
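
The reshaping on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the slide does not say what the three axes are, so the code assumes a 15 x 15 grid of spots (genes), each measured under 8 conditions, and fills the array with synthetic placeholder values. The names `cube`, `to_gene_based`, and `to_sample_based` are hypothetical.

```python
# Assumption: a 15 x 15 grid of spots (genes), each measured under
# 8 conditions. The values are synthetic placeholders, not real data.
ROWS, COLS, CONDITIONS = 15, 15, 8
cube = [[[float(r * COLS + c) + 0.1 * k for k in range(CONDITIONS)]
         for c in range(COLS)]
        for r in range(ROWS)]


def to_gene_based(cube):
    """Flatten the grid so that each row is one gene: (15 * 15) x 8."""
    return [list(spot) for row in cube for spot in row]


def to_sample_based(cube):
    """Transpose the gene-based matrix so that each row is one sample: 8 x (15 * 15)."""
    gene_based = to_gene_based(cube)
    return [list(column) for column in zip(*gene_based)]


gene_based = to_gene_based(cube)
sample_based = to_sample_based(cube)
print(len(gene_based), len(gene_based[0]))      # 225 8
print(len(sample_based), len(sample_based[0]))  # 8 225
```

In the gene-based layout the objects being clustered are the 225 genes; in the sample-based layout they are the 8 samples, each described by 225 values.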

  5. Types of Clustering Algorithms • Partitional methods: K-means clustering, affinity propagation, spectral clustering, mean-shift clustering, normalized cuts, Gaussian mixture models • Hierarchical methods: single linkage, complete linkage, average linkage

  6. Proximity measure • Defines the similarity between data objects. • Examples: Euclidean distance, Pearson’s correlation coefficient, jackknife correlation, Spearman’s rank-order correlation, city block distance (Manhattan distance), angular separation, etc. • We use Euclidean distance. The Euclidean distance between points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ is defined as: $$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
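
As a small illustration of two of the proximity measures listed on this slide, the sketch below computes the Euclidean distance and the city block (Manhattan) distance between two points. The function names are hypothetical and the points are made-up values.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """City block distance: summed absolute coordinate differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) for a, b in zip(x, y))


print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(manhattan_distance([0, 0], [3, 4]))  # 7
```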

  7. Hierarchical Clustering • Single linkage: at each step, merge the two clusters whose closest members are the minimum distance apart. Source: http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
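
A minimal sketch of agglomerative clustering, assuming Euclidean distance and a deliberately naive search over all cluster pairs, may help make the linkage rules concrete. It is not the implementation used for the slides; `agglomerate` and `cluster_distance` are hypothetical names and the points are toy values.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(points, a, b, linkage):
    """Distance between clusters a and b (lists of point indices)."""
    dists = [euclidean(points[i], points[j]) for i in a for j in b]
    if linkage == "single":      # closest pair of members
        return min(dists)
    if linkage == "complete":    # farthest pair of members
        return max(dists)
    if linkage == "average":     # mean over all member pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(
                points, clusters[pair[0]], clusters[pair[1]], linkage),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(points, 3, linkage="single"))  # [[0, 1], [2, 3], [4]]
```

Recording the order of the merges and the distance at each merge is what a dendrogram displays; this sketch only returns the final grouping.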

  8. Hierarchical Clustering Example: Colon cancer data • Using complete linkage • Dendrogram

  9. K-means Clustering Source: www.cs.cmu.edu/~awm
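
The slide itself only shows a figure, so the following is a hedged sketch of the standard K-means iteration (Lloyd's algorithm) with Euclidean distance, not the code behind the presentation. The function name `k_means`, its parameters, and the toy points are assumptions made for illustration.

```python
import math
import random


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def k_means(points, k, initial_centroids=None, max_iter=100, seed=0):
    """Alternate assignment and centroid update until labels stop changing."""
    rng = random.Random(seed)
    # Assumption: if no starting centroids are given, pick k data points at random.
    centroids = [tuple(c) for c in (initial_centroids or rng.sample(points, k))]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2, initial_centroids=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the starting centroids, which is one reason K-means is often run several times with different initializations.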

  10. K-means Clustering Example: Colon cancer data • K = 5

  11. K-means Clustering Example: Colon cancer data • K = 10

  12. K-means Clustering Example: Colon cancer data • K = 15

  13. K-means Clustering Example: Colon cancer data • K = 30

  14. Determining the number of clusters • One widely used method is the “elbow” method. • The elbow method plots the percentage of variance explained against the number of clusters and looks for the point beyond which adding more clusters no longer adds much information. • The percentage of variance explained is the ratio of the between-group variance to the total variance.
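
The ratio described on this slide can be computed directly from a set of cluster labels. The sketch below is an illustration under assumptions: `variance_explained` is a hypothetical helper, the points are toy values, and the labelings for K = 1, 2, 3 are assigned by hand rather than produced by a clustering algorithm.

```python
def mean_point(points):
    return tuple(sum(dim) / len(points) for dim in zip(*points))


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def variance_explained(points, labels):
    """Between-group sum of squares divided by total sum of squares."""
    overall = mean_point(points)
    total = sum(squared_distance(p, overall) for p in points)
    between = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        # Each cluster contributes its size times the squared distance
        # between its centroid and the overall mean.
        between += len(members) * squared_distance(mean_point(members), overall)
    return between / total


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# Hand-assigned labelings, for illustration only.
labelings = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}
explained = {k: variance_explained(points, labels)
             for k, labels in labelings.items()}
for k in sorted(explained):
    print(k, round(100 * explained[k], 2))
```

For this toy data nearly all of the variance is explained at K = 2 and the gain from K = 3 is small, so the elbow would be read at K = 2.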

  15. Elbow Method (Criterion) Source: Wikipedia

  16. Challenges and Future Research Directions • There is no single “best” algorithm. • The performance of a clustering algorithm depends strongly on both the data distribution and the application requirements. • Clustering is generally an “unsupervised” learning problem. • However, some “partial” knowledge is often available, such as the functions of some genes. • If a clustering algorithm could integrate such partial knowledge as “clustering constraints”, we could expect more biologically meaningful and reliable results.

  17. Questions? Thank you!
