
Chap. 17 Clustering



Presentation Transcript


  1. Chap. 17 Clustering

  2. Objectives of Data Analysis • Classification • Pattern recognition • Diagnostics • Trends • Outliers • Quality control • Discrimination • Regression • Data comparisons

  3. Clustering • A family of related sequences that evolved from a common ancestor is studied with phylogenetic trees showing the order of evolution • Criteria needed • Closeness between sequences • The number of clusters • Hierarchical clustering algorithm • k-means

  4. Example: Amino Acid (AA) - Basic

  5. Clustering of AAs • How many clusters? • Use 4 AA groups • Good for acidic and basic • P in the polar group • The nonpolar group is widespread • Similarities of AAs determine the ease of substitutions • Some alignment tools show similar AAs in colors • A more systematic approach is needed

  6. Physico-Chemical Properties • Physico-chemical properties of AAs determine protein structures • (1) Size (volume) • (2) Partial volume • Measures the expanded volume in solution when dissolved • (3) Bulkiness • The ratio of side-chain volume to its length: the average cross-sectional area of the side chain • (4) pH of the isoelectric point of the AA (pI) • (5) Hydrophobicity • (6) Polarity index • (7) Surface area • (8) Fraction of area • The fraction of the accessible surface area that is buried in the interior in a set of known crystal structures

  7. Red: acidic Orange: basic Green: polar (hydrophilic) Yellow: non-polar (hydrophobic)

  8. Hierarchical Clustering • Hierarchical Clustering Algorithm • Each point forms its own cluster, initially • Join the two clusters with the highest similarity to form a single larger cluster • Recompute similarities between all clusters • Repeat the two steps above until all points are connected to clusters • Criteria of similarity? • Use scaled coordinates z • Vector zi from the origin to each data point i with length |zi|² = ∑k zik² • Use the cosine of the angle between two points for similarity • cosθij = ∑k zik zjk / (|zi| |zj|) • For n elements, an n×n distance matrix d
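The cosine-similarity criterion above can be sketched in a few lines of Python; the function name and the example vectors are illustrative, not taken from the slides.

import numpy as np

def cosine_similarity(zi, zj):
    # cos(theta_ij) = sum_k z_ik * z_jk / (|z_i| * |z_j|)
    zi, zj = np.asarray(zi, dtype=float), np.asarray(zj, dtype=float)
    return float(zi @ zj / (np.linalg.norm(zi) * np.linalg.norm(zj)))

# Two hypothetical scaled property vectors; a value near 1 means "very similar"
z_a = [0.2, 1.1, -0.5]
z_b = [0.3, 0.9, -0.4]
print(cosine_similarity(z_a, z_b))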

  9. Hierarchical_Clustering(d, n)
        Form n clusters, each with 1 element
        Construct a graph T by assigning an isolated vertex to each cluster
        while there is more than 1 cluster
            Find the two closest clusters C1 and C2
            Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
            Compute the distance from C to all other clusters
            Add a new vertex C to T
            Remove the rows and columns of d for C1 and C2, and add a row and column for C
        return T
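A minimal Python sketch of this procedure, assuming group-average linkage when distances are recomputed (the pseudocode leaves the update rule open); the function name and the return value (the merge history) are illustrative.

import numpy as np

def hierarchical_clustering(d, n):
    # d is an n x n distance matrix; returns the merge history as
    # (cluster_a, cluster_b, new_cluster_id) tuples.
    d = np.asarray(d, dtype=float)
    clusters = {i: [i] for i in range(n)}              # cluster id -> member points
    dist = {(i, j): d[i, j] for i in range(n) for j in range(i + 1, n)}
    merges, next_id = [], n
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)                 # two closest clusters
        merged = clusters.pop(a) + clusters.pop(b)
        # group-average distance from the new cluster to every remaining cluster
        new_dist = {(c, next_id): np.mean([d[i, j] for i in merged for j in members])
                    for c, members in clusters.items()}
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        clusters[next_id] = merged
        merges.append((a, b, next_id))
        next_id += 1
    return merges

The merge history plays the role of the tree T in the pseudocode: each tuple records which two clusters were joined and the id of the new vertex.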

  10. Hierarchical Clustering • Generates a set of clusters within clusters • The result can be arranged as a tree • Each node is where two smaller clusters join • CLUTO package with cosine and group-average rules • Red/green indicates values significantly higher/lower than the average • Dark colors indicate values close to the average

  11. Clustering of properties: properties can be ordered to illustrate groups of correlated properties • red on both the pI and polarity scales • green on hydrophobicity and pI (can be separated into two smaller clusters) • green on volume and surface area • C is unusual in protein structures due to its potential to form disulfide bonds between pairs of cysteine residues (thus, difficult to interchange with other residues) • Hydrophobic • The two largest AAs

  12. 6 clusters

  13. Dayhoff Clustering - 1978 • http://www.dayhoff.cc/index.html • In the PAM matrix, the probabilities of pairs of amino acids appearing together were considered • Pairs of amino acids that tend to appear together are grouped into a cluster • Six clusters • (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY) • Contrast with the clusters obtained via hierarchical clustering • (KRH) (EDQN) (PTSGA) (C) (VLIMF) (WY)

  14. (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)

  15. Murphy, Wallqvist, Levy, 2000 • To study protein folding • Used the BLOSUM50 similarity matrix • Determine the correlation coefficient between the similarity-matrix rows of every pair of AAs • e.g., C_AV = ∑i M_A,i M_V,i / √(∑i M_A,i² · ∑i M_V,i²), with the summation over i taken over the 20 AAs • Group the two AAs with the highest correlation coefficient, then either add the AA with the next-highest correlation to an existing group or start a new group
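A short Python sketch of this correlation computation, assuming rows of the similarity matrix are compared with the uncentered (cosine-style) correlation written above; the 3×3 matrix is a toy example, not actual BLOSUM50 values.

import numpy as np

def row_correlation(M, a, v):
    # C_av = sum_i M[a,i]*M[v,i] / sqrt(sum_i M[a,i]^2 * sum_i M[v,i]^2)
    M = np.asarray(M, dtype=float)
    num = np.dot(M[a], M[v])
    den = np.sqrt(np.dot(M[a], M[a]) * np.dot(M[v], M[v]))
    return num / den

# Illustrative 3x3 similarity matrix; larger correlations suggest rows
# (amino acids) that are candidates for the same group
M = [[ 5, -1,  0],
     [-1,  6,  2],
     [ 0,  2,  4]]
print(row_correlation(M, 1, 2))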

  16. k-means Clustering • The number of clusters, k, is known ahead of time • Minimize the squared error between data points and the k cluster centers • No polynomial-time algorithm is known • Heuristic – Lloyd's algorithm • Initially partition the n points arbitrarily among k centers, then move points between clusters • Converges to a local minimum; may move many points in each iteration

  k-means Clustering Problem: Given n data points, find k center points minimizing the squared error distortion d(V, X) = (1/n) ∑i d(vi, X)², where d(vi, X) is the distance from vi to the closest center in X
  Input: A set V of n data points and a parameter k
  Output: A set X of k center points minimizing d(V, X) over all possible choices of X
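A compact sketch of Lloyd's heuristic, assuming the points are stored as rows of a NumPy array; the initialization (k data points chosen at random as centers) and the iteration cap are illustrative choices, not prescribed by the slides.

import numpy as np

def lloyd_kmeans(V, k, n_iter=100, seed=0):
    # Alternate between assigning points to their nearest center and
    # moving each center to the mean of its assigned points.
    rng = np.random.default_rng(seed)
    V = np.asarray(V, dtype=float)
    X = V[rng.choice(len(V), size=k, replace=False)]   # arbitrary initial centers
    for _ in range(n_iter):
        labels = np.argmin(((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2), axis=1)
        new_X = np.array([V[labels == j].mean(axis=0) if np.any(labels == j) else X[j]
                          for j in range(k)])
        if np.allclose(new_X, X):
            break
        X = new_X
    # final assignment for the returned centers
    labels = np.argmin(((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2), axis=1)
    return X, labels

Note that this converges only to a local minimum of the distortion, which is why the result can depend on the initial centers.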

  17. k-means Clustering - 2 • Consider every possible partition of the n elements into k clusters • Each partition P has a cost, Cost(P) • Move one point in each iteration

  Progressive_Greedy_k-means(n)
        Select an arbitrary partition P into k clusters
        while forever
            bestChange ← 0
            for every cluster C
                for every element i not in C
                    if moving i to C reduces Cost(P)
                        if Δ(i → C) > bestChange
                            bestChange ← Δ(i → C)
                            i* ← i
                            C* ← C
            if bestChange > 0
                change partition P by moving i* to C*
            else
                return P
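A straightforward (and deliberately unoptimized) Python sketch of this progressive greedy heuristic, assuming the cost of a partition is the squared error to each cluster's mean; the function names are illustrative.

import numpy as np

def partition_cost(V, partition, k):
    # Squared-error cost: each cluster's center is the mean of its points.
    total = 0.0
    for j in range(k):
        pts = V[partition == j]
        if len(pts):
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def progressive_greedy_kmeans(V, k, seed=0):
    # Repeatedly move the single point whose reassignment reduces the
    # cost the most, until no move improves the partition.
    rng = np.random.default_rng(seed)
    V = np.asarray(V, dtype=float)
    partition = rng.integers(0, k, size=len(V))        # arbitrary initial partition
    current = partition_cost(V, partition, k)
    while True:
        best_change, best_move = 0.0, None
        for i in range(len(V)):
            for c in range(k):
                if partition[i] == c:
                    continue
                trial = partition.copy()
                trial[i] = c
                change = current - partition_cost(V, trial, k)   # positive = improvement
                if change > best_change:
                    best_change, best_move = change, (i, c)
        if best_move is None:
            return partition
        i, c = best_move
        partition[i] = c
        current -= best_change

Unlike Lloyd's algorithm, this variant moves only one point per iteration, so each step is guaranteed to lower the cost, at the price of many more iterations.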
