Chap. 17 Clustering


Objectives of Data Analysis

- Classification
- Pattern recognition
- Diagnostics

- Trends
- Outliers
- Quality control

- Discrimination
- Regression
- Data comparisons

Clustering

- A family of related sequences that evolved from a common ancestor is studied with phylogenetic trees, which show the order of evolution
- Criteria needed
- Closeness between sequences
- The number of clusters

- Hierarchical Clustering Algorithm
- k-means

Example: Amino Acid (AA) - Basic

Clustering of AAs

- How many clusters?
- Use 4 AA groups
- Good for acidic and basic
- P in polar group
- Nonpolar group is wide spread

- Similarity between AAs determines the ease of substitution
- Some alignment tools show similar AAs in the same color
- A more systematic approach is needed

Physico-Chemical Properties

- Physico-chemical properties of AA determine protein structures
- (1) Size in volume
- (2) Partial volume
- Measures the expanded volume in solution when dissolved

- (3) Bulkiness
- The ratio of side chain volume to its length: average cross-sectional area of the side chain

- (4) pH of isoelectric point of AA (pI)
- (5) Hydrophobicity
- (6) Polarity index
- (7) Surface area
- (8) Fraction of area
- Fraction of the accessible surface area that is buried in the interior in a set of known crystal structures

Red: acidic

Orange: basic

Green: polar (hydrophilic)

Yellow: non-polar (hydrophobic)

Hierarchical Clustering

- Hierarchical Clustering Algorithm
- Each point forms its own cluster, initially
- Join two clusters with the highest similarity to form a single larger cluster
- Recompute similarities between all clusters
- Repeat the two steps above until all points are joined into a single cluster

- Criteria for similarity?
- Use scaled coordinates z
- Vector $z_i$ from the origin to each data point $i$, with squared length $|z_i|^2 = \sum_k z_{ik}^2$
- Use the cosine of the angle between two points as the similarity
- $\cos\theta_{ij} = \sum_k z_{ik} z_{jk} / (|z_i| |z_j|)$

- n elements, n × n distance matrix d

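The cosine criterion above can be sketched in a few lines of Python. This is only an illustration: the function name `cosine_similarity` is hypothetical, and the vectors are assumed to hold coordinates that have already been scaled (the slides do not say which scaling is used).

```python
import math

def cosine_similarity(zi, zj):
    """Cosine of the angle between two coordinate vectors.

    Assumes zi and zj are already-scaled coordinates of equal length
    and that neither is the zero vector.
    """
    dot = sum(a * b for a, b in zip(zi, zj))       # sum_k z_ik * z_jk
    norm_i = math.sqrt(sum(a * a for a in zi))     # |z_i|
    norm_j = math.sqrt(sum(b * b for b in zj))     # |z_j|
    return dot / (norm_i * norm_j)
```

Vectors pointing in the same direction give a similarity of 1, orthogonal vectors give 0, and opposite vectors give -1, regardless of their lengths.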

Hierarchical_Clustering(d, n)
    Form n clusters, each with 1 element
    Construct a graph T by assigning an isolated vertex to each cluster
    while there is more than 1 cluster
        Find the two closest clusters C1 and C2
        Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
        Compute the distance from C to all other clusters
        Add a new vertex C to T and connect it to vertices C1 and C2
        Remove the rows and columns of d for C1 and C2, and add a row and column for C
    return T
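The pseudocode leaves open how the distance from the merged cluster to the others is computed. The Python sketch below assumes a group-average (size-weighted mean) rule, which is one common choice and not necessarily the one intended here. The function name, the dictionary-based bookkeeping, and the nested-tuple representation of the tree T are all illustrative choices.

```python
def hierarchical_clustering(d):
    """Agglomerative clustering on a symmetric n x n distance matrix d.

    Returns the merge tree as nested tuples of element indices.
    Assumption: cluster-to-cluster distance is the group average.
    """
    n = len(d)
    trees = {i: i for i in range(n)}    # cluster id -> subtree
    sizes = {i: 1 for i in range(n)}    # cluster id -> number of elements
    dist = {(i, j): d[i][j] for i in range(n) for j in range(i + 1, n)}

    def pair(a, b):
        # Distances are stored once per unordered pair, smaller id first.
        return (a, b) if a < b else (b, a)

    next_id = n
    while len(trees) > 1:
        c1, c2 = min(dist, key=dist.get)        # two closest clusters
        c = next_id
        next_id += 1
        for o in trees:
            if o in (c1, c2):
                continue
            d1 = dist.pop(pair(c1, o))          # drop old rows/columns
            d2 = dist.pop(pair(c2, o))
            total = sizes[c1] + sizes[c2]
            dist[pair(o, c)] = (sizes[c1] * d1 + sizes[c2] * d2) / total
        del dist[pair(c1, c2)]
        trees[c] = (trees.pop(c1), trees.pop(c2))   # new vertex joins C1 and C2
        sizes[c] = sizes.pop(c1) + sizes.pop(c2)
    return next(iter(trees.values()))
```

For four points on a line at 0, 1, 5 and 6, the two nearby pairs merge first and are then joined at the root.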

Hierarchical Clustering

- Generates a set of clusters within clusters
- Algorithm can be arranged as a tree
- Each node marks where two smaller clusters join

- CLUTO package with cosine and group-average rules
- Red/green indicates values significantly higher/lower than the average
- Dark colors indicate values close to the average

- Clustering of properties: properties can be ordered illustrating groups of properties that are correlated

- red on both pI and polarity scale
- green on hydrophobicity and pI (can be separated into two smaller clusters)
- green on volume and surface area
- C is unusual in protein structure due to its potential to form disulfide bonds between pairs of cysteine residues (thus, difficult to interchange for other residues)
- Hydrophobic
- Two largest AA’s

- 6 clusters

Dayhoff Clustering - 1978

- http://www.dayhoff.cc/index.html
- The PAM matrix considers the probabilities of pairs of amino acids appearing together
- Pairs of amino acids that tend to appear together are grouped into a cluster
- six clusters
- (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)

- Contrast with the clusters obtained via hierarchical clustering
- (KRH) (EDQN) (PTSGA) (C) (VLIMF) (WY)


Murphy, Wallqvist, Levy, 2000

- To study protein folding
- Used BLOSUM50 similarity matrix
- Determine correlation coefficients between similarity matrix elements for all pairs of AA’s
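- e.g., $C_{AV} = \frac{\sum_i M_{A,i} M_{V,i}}{\sqrt{\left(\sum_i M_{A,i} M_{A,i}\right)\left(\sum_i M_{V,i} M_{V,i}\right)}}$, where the sum over $i$ runs over the 20 AAs

- Group the two AAs with the highest CC, then repeatedly take the next AA with the highest CC and either add it to an existing group or start a new group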

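The correlation coefficient between two rows of a similarity matrix can be sketched as follows. The square-root normalization in the denominator is an assumption (it is the usual normalized form, which keeps the value between -1 and 1). The function name `row_correlation` and the dict-of-dicts layout are hypothetical.

```python
import math

def row_correlation(m, a, b):
    """Normalized correlation between rows a and b of similarity matrix m.

    m is a dict of dicts: m[x][i] is the similarity score of x and i.
    Assumption: the denominator is the square root of the product of
    the two sums of squares.
    """
    num = sum(m[a][i] * m[b][i] for i in m[a])
    sum_aa = sum(m[a][i] ** 2 for i in m[a])
    sum_bb = sum(m[b][i] ** 2 for i in m[a])
    return num / math.sqrt(sum_aa * sum_bb)

# Toy 3-letter matrix with invented scores; these are NOT BLOSUM50 values.
toy = {
    "X": {"X": 2, "Y": 1, "Z": -1},
    "Y": {"X": 1, "Y": 2, "Z": -1},
    "Z": {"X": -1, "Y": -1, "Z": 3},
}
```

With a real matrix, the same function would be evaluated for all pairs of the 20 AAs, and the pair with the highest value would seed the first group.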

k-means Clustering

- The number of clusters, k, is known ahead of time
- Minimize the squared errors between data points and k cluster centers
- No polynomial-time algorithm is known
- Heuristics – Lloyd algorithm
- Initially partition the n points arbitrarily among k centers, then move some points between clusters
- Converges to a local minimum; may move many points in each iteration

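A minimal sketch of the Lloyd heuristic is given below. It assumes Euclidean distance and starts from a caller-supplied list of initial centers rather than from an arbitrary partition, which is a common variant. The names `lloyd`, `sq_dist`, and `max_iter` are illustrative.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centers, max_iter=100):
    """Alternate between assigning points and recomputing centers.

    points: list of coordinate tuples; centers: list of k initial centers.
    Returns (centers, clusters) once the centers stop changing or after
    max_iter rounds. The result is a local minimum, not necessarily the best.
    """
    for _ in range(max_iter):
        # Assignment step: every point goes to its closest center,
        # so many points may change cluster in a single iteration.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Different initial centers can lead to different final clusterings, so in practice the heuristic is often run several times from different starting points.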

k-means Clustering Problem

Given n data points, find k center points minimizing the squared error distortion,

$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$

where $d(v_i, X)$ is the distance from $v_i$ to the closest center in X.

input: A set V of n data points and a parameter k

output: A set X consisting of k center points minimizing d(V, X) over all possible choices of X
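The distortion can be evaluated directly for any candidate set of centers. The sketch below assumes Euclidean distance and takes d(v_i, X) to be the distance from v_i to its closest center. The function name is hypothetical.

```python
def squared_error_distortion(points, centers):
    """Mean squared distance from each point to its closest center.

    Assumes Euclidean distance; points and centers are lists of tuples.
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return sum(min(sq_dist(v, x) for x in centers) for v in points) / len(points)
```

For the two points (0, 0) and (0, 2), a single center at (0, 1) gives a distortion of 1.0, while placing one center on each point gives 0.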

k-means Clustering (2)

- Consider every possible partition of n elements into k clusters
- Each partition P has a cost, cost(P)
- Move one point in each iteration

Progressive_Greedy_k-means(k)
    Select an arbitrary partition P into k clusters
    while forever
        bestChange ← 0
        for every cluster C
            for every element i not in C
                if moving i to C reduces Cost(P)
                    if Δ(i → C) > bestChange
                        bestChange ← Δ(i → C)
                        i* ← i
                        C* ← C
        if bestChange > 0
            change partition P by moving i* to C*
        else
            return P
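The greedy procedure above can be sketched in Python as follows. The slides do not define Cost(P), so this sketch assumes it is the mean squared Euclidean distance from each point to the centroid of its own cluster. The partition is represented as a list of cluster labels, and a tiny tolerance replaces the strict comparison with 0 to guard against floating-point noise. All names are illustrative.

```python
def cost(points, labels, k):
    """Assumed cost: mean squared distance to each point's cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        centroid = [sum(x) / len(members) for x in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(p, centroid)) for p in members)
    return total / len(points)

def progressive_greedy_k_means(points, labels, k, eps=1e-12):
    """Repeatedly apply the single move i -> C that reduces the cost most.

    labels[i] is the cluster of points[i] in the initial partition.
    Stops when no single move improves the cost by more than eps.
    """
    labels = list(labels)
    while True:
        current = cost(points, labels, k)
        best_change, best_move = eps, None
        for c in range(k):
            for i, lab in enumerate(labels):
                if lab == c:
                    continue  # element i is already in cluster c
                trial = labels[:i] + [c] + labels[i + 1:]
                change = current - cost(points, trial, k)  # Delta(i -> C)
                if change > best_change:
                    best_change, best_move = change, (i, c)
        if best_move is None:
            return labels
        i, c = best_move
        labels[i] = c
```

Recomputing the full cost for every trial move keeps the sketch short but is slow; a practical version would update the cost incrementally.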