Chap. 17 Clustering

PowerPoint Slideshow about ' Chap. 17 Clustering' - lupita

Objectives of Data Analysis

  • Classification

    • Pattern recognition

    • Diagnostics

  • Trends

    • Outliers

    • Quality control

  • Discrimination

  • Regression

    • Data comparisons


Clustering

  • A family of related sequences that evolved from a common ancestor is studied with phylogenetic trees, which show the order of evolution

  • Criteria needed

    • Closeness between sequences

    • The number of clusters

  • Hierarchical Clustering Algorithm

  • k-means

Clustering of AAs

  • How many clusters?

    • Use 4 AA groups

    • Works well for the acidic and basic groups

    • P (proline) falls in the polar group

    • The nonpolar group is widespread

  • Similarities of AA’s determine the ease of substitutions

  • Some alignment tools show similar AA’s in colors

  • Needs a more systematic approach

Physico-Chemical Properties

  • Physico-chemical properties of AA determine protein structures

    • (1) Size in volume

    • (2) Partial volume

      • Measures the expanded volume in solution when dissolved

    • (3) Bulkiness

      • The ratio of side chain volume to its length: average cross-sectional area of the side chain

    • (4) pH of isoelectric point of AA (pI)

    • (5) Hydrophobicity

    • (6) Polarity index

    • (7) Surface area

    • (8) Fraction of area

      • Fraction of the accessible surface area that is buried in the interior in a set of known crystal structures

[Figure legend: red = acidic, orange = basic, green = polar, yellow = non-polar]


Hierarchical Clustering

  • Hierarchical Clustering Algorithm

    • Each point forms its own cluster, initially

    • Join two clusters with the highest similarity to form a single larger cluster

    • Recompute similarities between all clusters

    • Repeat the two steps above until all points are joined into a single cluster

  • Criteria of similarities?

    • Use scaled coordinates z

      • Vector $z_i$ from the origin to each data point i, with squared length $|z_i|^2 = \sum_k z_{ik}^2$

      • Use the cosine of the angle between two points as the similarity

        • $\cos\theta_{ij} = \sum_k z_{ik} z_{jk} / (|z_i| |z_j|)$

    • n elements, n × n distance matrix d
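
The scaling and cosine rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "scaled" is taken to mean that each property is centered on its mean and divided by its standard deviation, and the distance is taken as 1 - cos θ. The function names and the toy property values are hypothetical and are not from the slides.

```python
import math

def scale_columns(rows):
    """Center each property (column) on its mean and divide by its standard deviation.

    Assumes no column is constant, otherwise the division below fails.
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def cosine(zi, zj):
    """cos(theta_ij) = sum_k z_ik z_jk / (|z_i| |z_j|)."""
    dot = sum(a * b for a, b in zip(zi, zj))
    norm_i = math.sqrt(sum(a * a for a in zi))
    norm_j = math.sqrt(sum(b * b for b in zj))
    return dot / (norm_i * norm_j)

def distance_matrix(z):
    """n x n matrix d with d[i][j] = 1 - cos(theta_ij) (an assumed conversion)."""
    return [[1.0 - cosine(a, b) for b in z] for a in z]

# Toy data: three points with two made-up properties each.
rows = [[1.0, 10.0], [2.0, 20.0], [6.0, 0.0]]
d = distance_matrix(scale_columns(rows))
```

With this conversion, two points whose scaled vectors point the same way have distance 0, and points whose vectors are opposite have distance 2.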

Hierarchical_Clustering (d, n)
    Form n clusters, each with 1 element
    Construct a graph T by assigning an isolated vertex to each cluster
    while there is more than 1 cluster
        Find the two closest clusters C1 and C2
        Merge C1 and C2 into new cluster C with |C1| + |C2| elements
        Compute distance from C to all other clusters
        Add a new vertex C to T and connect it to vertices C1 and C2
        Remove rows and columns of d for C1 and C2, and add a row and column for C
    return T
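
The pseudocode above leaves open how the distance from a merged cluster to the others is computed. The following is a minimal runnable sketch that assumes the group-average rule (the mean of all pairwise distances between members). Instead of updating rows and columns of d, it recomputes averages from the original matrix, which is simpler but slower. The function name, the nested-tuple tree representation, and the toy matrix are illustrative choices only.

```python
def hierarchical_clustering(d, labels):
    """Agglomerative clustering on a symmetric distance matrix d.

    Returns the tree T as nested tuples, where each tuple joins two subtrees.
    Assumption: cluster-to-cluster distance is the group average.
    """
    # Each cluster id maps to (subtree, list of member indices into d).
    clusters = {i: (labels[i], [i]) for i in range(len(labels))}
    next_id = len(labels)
    while len(clusters) > 1:
        # Find the two closest clusters C1 and C2.
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = clusters[ids[x]][1], clusters[ids[y]][1]
                avg = sum(d[i][j] for i in a for j in b) / (len(a) * len(b))
                if best is None or avg < best[0]:
                    best = (avg, ids[x], ids[y])
        _, c1, c2 = best
        # Merge C1 and C2 into a new cluster C and add it as a new tree node.
        t1, m1 = clusters.pop(c1)
        t2, m2 = clusters.pop(c2)
        clusters[next_id] = ((t1, t2), m1 + m2)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

# Toy matrix (made-up distances): a and b are close, c and d are close.
toy = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]
print(hierarchical_clustering(toy, ["a", "b", "c", "d"]))
# prints (('a', 'b'), ('c', 'd'))
```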

Hierarchical Clustering

  • Generates a set of clusters within clusters

  • Algorithm can be arranged as a tree

    • Each node marks where two smaller clusters join

  • CLUTO package with cosine and group-average rules

    • Red/green indicates values significantly higher/lower than the average

    • Dark colors close to the average

  • red on both pI and polarity scale

  • green on hydrophobicity and pI (can be separated into two smaller clusters)

  • green on volume and surface area

  • C is unusual in protein structure due to its potential to form disulfide bonds between pairs of cysteine residues (thus, difficult to interchange for other residues)

  • Hydrophobic

  • Two largest AA’s

  • 6 clusters illustrating groups of properties that are correlated

Dayhoff Clustering - 1978


  • The PAM matrix considers the probabilities of pairs of amino acids appearing together

  • Pairs of amino acids that tend to appear together are grouped into a cluster

  • six clusters

    • (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)

  • Contrast with the clusters obtained via hierarchical clustering

    • (KRH) (EDQN) (PTSGA) (C) (VLIMF) (WY)

Murphy, Wallqvist, Levy, 2000

  • To study protein folding

  • Used BLOSUM50 similarity matrix

    • Determine correlation coefficients between similarity matrix elements for all pairs of AA’s

      • e.g., $C_{AV} = \sum_i M_{A,i} M_{V,i} / \sqrt{(\sum_i M_{A,i} M_{A,i}) (\sum_i M_{V,i} M_{V,i})}$, where the sum over i runs over the 20 AA’s

    • Group the two AA’s with the highest CC, then take the AA with the next-highest CC and either add it to an existing group or start a new group
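
As a rough illustration of the correlation step, the sketch below computes the coefficient between two rows of a similarity matrix and picks the most correlated pair. It assumes the denominator is the square root of the product of the two sums of squares, so that a row correlates with itself at exactly 1. The three-letter matrix uses made-up numbers, not BLOSUM50 values, and the function names are hypothetical.

```python
import math
from itertools import combinations

def correlation(M, a, b):
    """C_ab = sum_i M[a][i] M[b][i] / sqrt(sum_i M[a][i]^2 * sum_i M[b][i]^2)."""
    num = sum(M[a][i] * M[b][i] for i in M[a])
    den = math.sqrt(sum(M[a][i] ** 2 for i in M[a]) * sum(M[b][i] ** 2 for i in M[b]))
    return num / den

def most_correlated_pair(M):
    """Return the pair of letters with the highest correlation coefficient."""
    return max(combinations(sorted(M), 2), key=lambda p: correlation(M, *p))

# Made-up similarity scores for three letters (NOT real BLOSUM50 entries).
M = {
    "A": {"A": 5, "V": 0, "D": -2},
    "V": {"A": 0, "V": 5, "D": -4},
    "D": {"A": -2, "V": -4, "D": 8},
}
first_group = most_correlated_pair(M)
```

On this toy matrix the first group would be A and V, because both score negatively against D in the same way; the real grouping would use all 20 rows of the chosen matrix.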

k-means Clustering

  • The number of clusters, k, is known ahead of time

  • Minimize the squared error between the data points and the k cluster centers

  • No known polynomial-time algorithm

    • Heuristics – Lloyd algorithm

      • Initially assign the n points arbitrarily to k centers, then move some points between clusters

      • Converges to a local minimum; may move many points in each iteration
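
One common way to write the Lloyd heuristic is as an alternation between assigning points to their nearest center and moving each center to the mean of its points. The sketch below follows that form. It assumes the caller supplies the starting centers (rather than an arbitrary starting partition), and the names `lloyd` and `sq_dist` are illustrative.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centers, max_iter=100):
    """Alternate assignment and center-update steps until assignments stop changing."""
    centers = list(centers)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point moved, so a local minimum has been reached
        assignment = new_assignment
        # Update step: move each center to the mean of its assigned points.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, assignment = lloyd(points, [(0, 0), (10, 10)])
```

Because every point is reassigned in each pass, many points can change cluster in a single iteration, and the result depends on the starting centers.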

k-means Clustering Problem

Given n data points, find k center points minimizing the squared error


$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$

input: A set V of n data points and a parameter k

output: A set X consisting of k center points minimizing d(V,X) over all possible choices of X
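
To make the objective concrete, here is a small sketch that evaluates the squared error for a given set of centers. It assumes that d(v, X) means the Euclidean distance from v to the nearest center in X; the function names and the sample points are illustrative.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def squared_error(V, X):
    """d(V, X): mean over all points of the squared distance to the nearest center."""
    return sum(min(sq_dist(v, x) for x in X) for v in V) / len(V)

V = [(0, 0), (2, 0), (10, 0)]
X = [(1, 0), (10, 0)]
err = squared_error(V, X)  # (1 + 1 + 0) / 3
```

The k-means problem asks for the X of size k that makes this value as small as possible; the function above only scores one candidate X.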

k-means Clustering-2

  • Consider every possible partition P of the n elements into k clusters

  • Each partition has a cost, Cost(P)

    • Move one point in each iteration


Select an arbitrary partition P into k clusters
while forever
    bestChange ← 0
    for every cluster C
        for every element i not in C
            if moving i to C reduces Cost(P)
                if Δ(i → C) > bestChange
                    bestChange ← Δ(i → C)
                    i* ← i
                    C* ← C
    if bestChange > 0
        change partition P by moving i* to C*
    else
        return P
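
The greedy single-move procedure above can be sketched in Python as follows. This version assumes Cost(P) is the mean squared distance from each point to the centroid of its cluster, and it recomputes the full cost for every candidate move, which is easy to follow but inefficient. The function names and the sample data are illustrative.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cost(points, partition, k):
    """Mean squared distance from each point to the centroid of its cluster."""
    total = 0.0
    for c in range(k):
        members = [p for p, a in zip(points, partition) if a == c]
        if not members:
            continue  # an empty cluster contributes nothing
        center = [sum(x) / len(members) for x in zip(*members)]
        total += sum(sq_dist(p, center) for p in members)
    return total / len(points)

def greedy_kmeans(points, partition, k):
    """Repeatedly apply the single move that reduces the cost the most."""
    partition = list(partition)
    while True:
        base = cost(points, partition, k)
        best_change, best_i, best_c = 0.0, None, None
        for c in range(k):
            for i in range(len(points)):
                if partition[i] == c:
                    continue  # element i is already in cluster c
                trial = partition[:i] + [c] + partition[i + 1:]
                delta = base - cost(points, trial, k)  # reduction in cost
                if delta > best_change:
                    best_change, best_i, best_c = delta, i, c
        if best_change > 0:
            partition[best_i] = best_c
        else:
            return partition

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
result = greedy_kmeans(points, [0, 1, 0, 1], 2)
```

Starting from a deliberately poor partition that mixes the two groups, the sketch moves one point per iteration until no single move lowers the cost.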