Chap. 17 Clustering

Objectives of Data Analysis • Classification • Pattern recognition • Diagnostics • Trends • Outliers • Quality control • Discrimination • Regression • Data comparisons

Clustering • A family of related sequences that evolved from a common ancestor is studied with phylogenetic trees, which show the order of evolution • Criteria needed • Closeness between sequences • The number of clusters • Hierarchical clustering algorithm • k-means clustering

Example: Amino Acid (AA) - Basic

Clustering of AAs • How many clusters? • Use 4 AA groups • Good for acidic and basic • P in polar group • Nonpolar group is widely spread • Similarities of AA’s determine the ease of substitutions • Some alignment tools show similar AA’s in colors • Needs a more systematic approach

Physico-Chemical Properties • Physico-chemical properties of AA’s determine protein structures • (1) Size in volume • (2) Partial volume • Measures the expanded volume in solution when dissolved • (3) Bulkiness • The ratio of side chain volume to its length: average cross-sectional area of the side chain • (4) pH of the isoelectric point of the AA (pI) • (5) Hydrophobicity • (6) Polarity index • (7) Surface area • (8) Fraction of area • Fraction of the accessible surface area that is buried in the interior in a set of known crystal structures

Red: acidic Orange: basic Green: polar (hydrophilic) Yellow: non-polar (hydrophobic)

Hierarchical Clustering • Hierarchical clustering algorithm • Initially, each point forms its own cluster • Join the two clusters with the highest similarity to form a single larger cluster • Recompute the similarities between all clusters • Repeat the two steps above until all points are joined into a single cluster • Criteria of similarity? • Use scaled coordinates z • Vector z_i from the origin to each data point i, with squared length $|z_i|^2 = \sum_k z_{ik}^2$ • Use the cosine of the angle between two points as the similarity • $\cos\theta_{ij} = \sum_k z_{ik} z_{jk} / (|z_i| |z_j|)$ • n elements, n × n distance matrix d
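The similarity criterion above can be sketched in a few lines of Python. This is only an illustration: the slides do not say how the coordinates are scaled, so the z-score scaling in `scale_columns` is an assumption, and both function names are hypothetical.

```python
import math

def scale_columns(rows):
    """Scale each column (property) to zero mean and unit variance.

    Assumption: "scaled coordinates z" means z-scores; the slides do not
    specify the scaling. Constant columns are mapped to 0.
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n)
            for c, m in zip(cols, means)]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in rows]

def cosine_similarity(zi, zj):
    """cos(theta_ij) = sum_k z_ik * z_jk / (|z_i| * |z_j|).

    Zero-length vectors are not handled and raise ZeroDivisionError.
    """
    dot = sum(a * b for a, b in zip(zi, zj))
    norm_i = math.sqrt(sum(a * a for a in zi))
    norm_j = math.sqrt(sum(b * b for b in zj))
    return dot / (norm_i * norm_j)
```

A similarity of 1 means the two scaled property vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.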

Hierarchical_Clustering (d, n)
  Form n clusters, each with 1 element
  Construct a graph T by assigning an isolated vertex to each cluster
  while there is more than 1 cluster
    Find the two closest clusters C1 and C2
    Merge C1 and C2 into new cluster C with |C1| + |C2| elements
    Compute distance from C to all other clusters
    Add a new vertex C to T and connect it to vertices C1 and C2
    Remove rows and columns of d for C1 and C2, and add a row and column for C
  return T
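A minimal runnable sketch of the pseudocode above, in Python. The pseudocode leaves "compute distance from C to all other clusters" open; this sketch assumes the group-average rule (size-weighted mean of the two old distances). The tree T is returned as nested tuples rather than as a graph, which is also a choice made here for brevity.

```python
def hierarchical_clustering(d):
    """Agglomerative clustering on a symmetric n x n distance matrix d.

    Returns the merge tree as nested tuples of point indices.
    Assumption: cluster-to-cluster distances use the group-average rule.
    """
    n = len(d)
    # cluster id -> (subtree, number of elements)
    clusters = {i: (i, 1) for i in range(n)}
    # pairwise distances, keyed by (smaller id, larger id)
    dist = {(i, j): d[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # find the two closest clusters
        a, b = min(dist, key=dist.get)
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # distance from the merged cluster to every remaining cluster
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[(c, next_id)] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
            )
        # drop entries for the two merged clusters, add entries for the new one
        dist = {key: v for key, v in dist.items()
                if a not in key and b not in key}
        dist.update(new_dist)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree
```

For four points on a line at positions 0, 1, 5 and 6, the two nearby pairs merge first and the two pairs are joined last.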

Hierarchical Clustering • Generates a set of clusters within clusters • The result can be arranged as a tree • Each node marks where two smaller clusters join • CLUTO package with cosine and group-average rules • Red/green indicates values significantly higher/lower than the average • Dark colors are close to the average

Clustering of properties: properties can be ordered illustrating groups of properties that are correlated • red on both pI and polarity scale • green on hydrophobicity and pI (can be separated into two smaller clusters) • green on volume and surface area • C is unusual in protein structure due to its potential to form disulfide bonds between pairs of cysteine residues (thus, difficult to interchange for other residues) • Hydrophobic • Two largest AA’s

6 clusters

Dayhoff Clustering - 1978 • http://www.dayhoff.cc/index.html • The PAM matrix considers the probabilities of pairs of amino acids appearing together • Pairs of amino acids that tend to appear together are grouped into a cluster • Six clusters • (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY) • Contrast with the clusters from hierarchical clustering • (KRH) (EDQN) (PTSGA) (C) (VLIMF) (WY)

(KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)

Murphy, Wallqvist, Levy, 2000 • To study protein folding • Used the BLOSUM50 similarity matrix • Determine correlation coefficients between similarity matrix elements for all pairs of AA’s • e.g., $C_{AV} = \sum_i M_{A,i} M_{V,i} / \sqrt{(\sum_i M_{A,i}^2)(\sum_i M_{V,i}^2)}$, with the summation over i taken over the 20 AA’s • Group the two AA’s with the highest CC, then add the AA with the next-highest CC either to an existing group or to a new group
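The correlation step can be illustrated as follows. This is a sketch under assumptions: it takes the correlation coefficient in its normalized form (the sum of products divided by the square root of the product of the two sums of squares), it represents the similarity matrix as a dict of dicts, and the function names are hypothetical. No BLOSUM50 values are included; the caller supplies the matrix.

```python
import math

def correlation(M, a, b):
    """Normalized correlation between rows a and b of similarity matrix M.

    M is a dict of dicts, M[x][i] being the similarity score of x and i.
    Assumption: normalization is by the square root of the product of the
    two sums of squares.
    """
    keys = list(M[a])
    num = sum(M[a][i] * M[b][i] for i in keys)
    den = math.sqrt(sum(M[a][i] ** 2 for i in keys)
                    * sum(M[b][i] ** 2 for i in keys))
    return num / den

def most_correlated_pair(M):
    """Return (coefficient, a, b) for the pair with the highest correlation."""
    letters = sorted(M)
    pairs = [(correlation(M, a, b), a, b)
             for idx, a in enumerate(letters)
             for b in letters[idx + 1:]]
    return max(pairs)
```

Calling `most_correlated_pair` on the full 20 × 20 matrix would give the first pair to group; the later grouping decisions depend on thresholds that the slide does not state, so they are left out of the sketch.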

k-means Clustering • The number of clusters, k, is known ahead of time • Minimize the squared error between the data points and the k cluster centers • No known polynomial-time algorithm • Heuristic: Lloyd algorithm • Initially partition the n points arbitrarily among k centers, then move some points between clusters • Converges to a local minimum; may move many points in each iteration

k-means Clustering Problem
Given n data points, find k center points minimizing the squared error distortion, $d(V, X) = \sum_{i=1}^{n} d(v_i, X)^2 / n$, where $d(v_i, X)$ is the distance from $v_i$ to the closest center in X
input: A set V of n data points and a parameter k
output: A set X consisting of k center points minimizing d(V, X) over all possible choices of X
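A minimal sketch of the Lloyd heuristic and the distortion it tries to reduce, in plain Python. For reproducibility the starting centers are passed in by the caller, which is an assumption of this sketch (the slide starts from an arbitrary partition); `max_iter` and the function names are also illustrative.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def distortion(points, centers):
    """d(V, X): mean squared distance from each point to its closest center."""
    return sum(min(squared_distance(p, c) for c in centers)
               for p in points) / len(points)

def lloyd(points, centers, max_iter=100):
    """Alternate assignment and center-update steps until centers stop moving.

    Assumption: initial centers are supplied by the caller. The result is a
    local minimum of the distortion, not necessarily the global one.
    """
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # assignment step: each point joins its nearest center
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: squared_distance(p, centers[j]))
            groups[nearest].append(p)
        # update step: each center moves to the mean of its group;
        # a center with no points stays where it is
        new_centers = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Because every point is reassigned in each pass, many points can change cluster in a single iteration, which is the behavior the slide contrasts with one-point-at-a-time moves.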

k-means Clustering-2 • Consider every possible partition P of the n elements into k clusters • Each partition has a cost, Cost(P) • Move one point in each iteration

Progressive_Greedy_k-means(n)
  Select an arbitrary partition P into k clusters
  while forever
    bestChange ← 0
    for every cluster C
      for every element i not in C
        if moving i to C reduces Cost(P)
          if Δ(i → C) > bestChange
            bestChange ← Δ(i → C)
            i* ← i
            C* ← C
    if bestChange > 0
      change partition P by moving i* to C*
    else
      return P
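The progressive greedy procedure can be sketched in Python as below. Two assumptions are made here that the pseudocode does not state: Cost(P) is taken to be the squared error distortion with each cluster's mean as its center, and moves that would empty a cluster are skipped so that k clusters are preserved. The partition is represented as a list of cluster labels, one per point.

```python
def cost(points, assign, k):
    """Cost(P): mean squared distance of each point to its cluster mean.

    Assumption: cluster centers are the means of the current clusters.
    """
    total = 0.0
    for c in range(k):
        members = [p for p, a in zip(points, assign) if a == c]
        if not members:
            continue
        center = [sum(x) / len(members) for x in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(p, center))
                     for p in members)
    return total / len(points)

def progressive_greedy_kmeans(points, assign, k):
    """Repeatedly apply the single move i -> C that lowers the cost the most."""
    assign = list(assign)
    while True:
        current = cost(points, assign, k)
        best_change, best_move = 0.0, None
        for c in range(k):
            for i, a in enumerate(assign):
                # skip points already in C and moves that would empty a cluster
                if a == c or assign.count(a) == 1:
                    continue
                trial = assign[:i] + [c] + assign[i + 1:]
                change = current - cost(points, trial, k)
                if change > best_change:
                    best_change, best_move = change, (i, c)
        if best_move is None:
            return assign
        i, c = best_move
        assign[i] = c
```

Each accepted move strictly lowers the cost and there are finitely many partitions, so the loop terminates, though only at a local minimum. Recomputing the full cost for every trial move is simple but slow; it is kept here for clarity.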
