Protein Structure Similarity

PowerPoint Slideshow about 'Protein Structure Similarity' - erika


Presentation Transcript
Computation of Best Matches

Two “simultaneous” subproblems

  • Find maximal correspondence set C
  • Find alignment transform T

Chicken-and-egg issue:

  • Each subproblem is relatively simple:
    • If we knew C, we could compute T
    • If we knew T, we could get C by proximity
  • But the combination is hard !!!
Note: finding the alignment transform T only requires computing 6 parameters (3 for the rotation, 3 for the translation).
Find Alignment Transform
  • Two sets of points $A = \{a_1, \ldots, a_n\}$ and $B = \{b_1, \ldots, b_n\}$
  • Correspondence pairs $(a_i, b_i)$
  • Find $T = \arg\min_T \mathrm{RMSD}(A, T(B))$
  • O(n) closed-form solution [Arun, Huang, and Blostein, 87] [Horn, 87] [Horn, Hilden, and Negahdaripour, 88]
O(n) SVD-Based Algorithm
  • T combines translation t and rotation R, such that $T(b_i) = t + R\,b_i$
  • $\bar{b} = \frac{1}{n}\sum_{i=1}^{n} b_i$ [mean of the $b_i$'s]
  • Place the origin of the coordinate system at $\bar{b}$
  • $\min_T \mathrm{RMSD}(A, T(B))$ simplifies to (up to some constants): $\min_{t,R} \left[ \sum_{i=1}^{n} \|a_i - t\|^2 - 2 \sum_{i=1}^{n} a_i^T R\, b_i \right]$
  • t and R can be computed separately
  • $t = \bar{a}$ [mean of the $a_i$'s]

[Arun, Huang, and Blostein, 87]

O(n) SVD-Based Algorithm
  • $A_{3 \times n} = [a_1 - \bar{a}, \ldots, a_n - \bar{a}]$, $B_{3 \times n} = [b_1 - \bar{b}, \ldots, b_n - \bar{b}]$
  • Compute the SVD of the 3×3 correlation matrix $AB^T$: $AB^T = UDV^T$, where D is a diagonal matrix with decreasing non-negative entries (singular values) along the diagonal
  • If $\det(U)\det(V) = 1$ then $S = I$, else $S = \mathrm{diag}(1, 1, -1)$
  • $R = USV^T$

[Arun, Huang, and Blostein, 87]
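
The SVD steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `align` and `rmsd` are made up here, points are stored as rows (so the correlation matrix is formed as a product of transposed arrays), and the translation is written in the original coordinates as t = ā − R b̄, which reduces to t = ā once the origin is placed at b̄.

```python
import numpy as np

def align(A, B):
    """Return (R, t) minimizing sum_i ||a_i - (R b_i + t)||^2 for paired rows of A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    a_bar, b_bar = A.mean(axis=0), B.mean(axis=0)
    # 3x3 correlation matrix sum_i (a_i - a_bar)(b_i - b_bar)^T (points are rows here).
    H = (A - a_bar).T @ (B - b_bar)
    U, _, Vt = np.linalg.svd(H)
    # S = I unless U V^T would be a reflection; then flip the last axis.
    s = 1.0 if np.linalg.det(U) * np.linalg.det(Vt) > 0 else -1.0
    R = U @ np.diag([1.0, 1.0, s]) @ Vt
    t = a_bar - R @ b_bar
    return R, t

def rmsd(A, B, R, t):
    """Root-mean-square deviation between A and the transformed copy of B."""
    diff = np.asarray(A, dtype=float) - (np.asarray(B, dtype=float) @ R.T + t)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

The cost is linear in n because the only step that touches all points is the accumulation of the 3×3 matrix; the SVD itself works on a fixed-size matrix.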


  • [Arun, Huang, and Blostein, 87] → rotation matrix
  • [Horn, 87] → quaternion
Trial-and-Error Approach to Protein Structure Comparison

Guess small correspondence set → Compute T → Apply T → Update correspondence set (correspondence from proximity) → back to Compute T
Trial-and-Error Approach to Protein Structure Comparison
  1. Set CS to a seed correspondence set (small set sufficient to generate an alignment transform)
  2. Compute the alignment transform T for CS and apply T to the second protein B
  3. Update CS to include all pairs of features that are close to each other
  4. If CS has changed, then return to Step 2, else return (CS,T)
Trial-and-Error Approach to Protein Structure Comparison

- result = nil

- Iterate N times:

  1. Set CS to a seed correspondence set (small set sufficient to generate an alignment transform)
  2. Compute the alignment transform T for CS and apply T to the second protein B
  3. Update CS to include all pairs of features that are close to each other
  4. If CS has changed, then return to Step 2, else result ← result ∪ {(CS,T)}

- Return result
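
One pass of this loop (a single seed) might look like the sketch below. It makes several assumptions that are not in the slides: features are plain 3-D points, "close" means nearer than a threshold `eps`, each point of A is simply paired with its nearest transformed point of B (no one-to-one or backbone-order check), and the helper names `fit_transform` and `trial_and_error` are invented for the example.

```python
import numpy as np

def fit_transform(P, Q):
    """Rigid (R, t) minimizing sum_i ||p_i - (R q_i + t)||^2, via the SVD method."""
    p0, q0 = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - p0).T @ (Q - q0))
    s = 1.0 if np.linalg.det(U) * np.linalg.det(Vt) > 0 else -1.0
    R = U @ np.diag([1.0, 1.0, s]) @ Vt
    return R, p0 - R @ q0

def trial_and_error(A, B, seed, eps=1.0, max_iter=50):
    """Grow a seed correspondence set by alternating 'compute T' and 'update CS'."""
    cs = sorted(seed)
    R, t = np.eye(3), np.zeros(3)
    for _ in range(max_iter):
        ia = [i for i, _ in cs]
        ib = [j for _, j in cs]
        R, t = fit_transform(A[ia], B[ib])   # Step 2: compute T for CS ...
        TB = B @ R.T + t                     # ... and apply it to B
        # Step 3: pair each point of A with its nearest point of T(B), if close enough.
        dist = np.linalg.norm(A[:, None, :] - TB[None, :, :], axis=2)
        nearest = dist.argmin(axis=1)
        new_cs = sorted((i, int(j)) for i, j in enumerate(nearest) if dist[i, j] < eps)
        # Step 4: stop once CS no longer changes (or becomes too small to fit T).
        if new_cs == cs or len(new_cs) < 3:
            break
        cs = new_cs
    return cs, (R, t)
```

Calling `trial_and_error` once per seed and collecting the returned pairs corresponds to the outer "Iterate N times" loop.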

Seed Generation from Fragment
  • From distance matrices

E.g., DALI [Holm and Sander, 1996]

Using Distance Matrices (DALI)
  • Distances are invariant to rigid-body transformations
  • DALI [Holm and Sander, 1996] looks for similar hexapeptides by searching for similar 7x7 Cα-Cα distance matrices
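
The invariance claim is easy to check numerically. The snippet below is only an illustration of that property, not of DALI itself: the fragment coordinates are made up, and `rigid_move` and `matrix_difference` are hypothetical helper names.

```python
import math

def distance_matrix(points):
    """All pairwise Euclidean distances between the given points."""
    return [[math.dist(p, q) for q in points] for p in points]

def matrix_difference(D1, D2):
    """Largest absolute entry-wise difference between two equally sized matrices."""
    return max(abs(x - y) for r1, r2 in zip(D1, D2) for x, y in zip(r1, r2))

def rigid_move(points, angle, shift):
    """Rotate by `angle` about the z axis, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in points]

# Made-up coordinates standing in for six consecutive C-alpha atoms.
fragment = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.5, 3.4, 0.0),
            (9.1, 4.2, 1.0), (10.9, 7.5, 1.8), (14.5, 8.0, 3.0)]
moved = rigid_move(fragment, 1.1, (5.0, -2.0, 7.0))
stretched = [(2.0 * x, y, z) for x, y, z in fragment]  # not a rigid motion
```

Comparing `distance_matrix(fragment)` with `distance_matrix(moved)` gives a difference at the level of rounding error, while the stretched copy differs by several Å, which is why distance matrices can be compared without first finding a superposition.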
Seed Generation from Fragment
  • From distance matrices
    E.g., DALI [Holm and Sander, 1996]
  • From secondary structure elements (SSEs)
    E.g., LOCK [Singh and Brutlag, 1997]
  • From voting scheme (using geometric hashing)
    E.g., 3dSEARCH [Singh and Brutlag, 2000]

LOCK

A.P. Singh and D.L. Brutlag. Hierarchical Protein Structure Superposition Using Both Secondary and Atomic Representations. Proc. ISMB, pp. 284-293, 1997.

LOCK2: J. Shapiro and D.L. Brutlag. FoldMiner: Structural Motif Discovery Using an Improved Superposition Algorithm. Protein Science, 13:278-294, 2004.

http://motif.stanford.edu/lock2/

LOCK
  • Two levels of features: SSEs and Cα atoms
  • Stage 1 (SSE alignment): Initial alignment is computed using SSEs represented as vectors
  • Stage 2 (atom alignment): Alignment is refined using Cα atoms represented as points
Rationale for LOCK
  • Using types of features is an effective way to reduce combinatorial explosion and computation
  • SSEs, which are responsible for most of the stability and functionality of the proteins, are more meaningful and better conserved than types of atoms and amino-acids
  • If 2 structures are similar, some of their SSEs should form similar substructures
  • Drawback: It narrows down the set of possible applications, e.g., can’t find small motifs at atomic level
Vector-Based Representation

[Figure: structure with β-strands, loops, and α-helices labeled]

One vector per SSE (helix, strand, loop)

Vector-Based Representation
  • DSSP [Kabsch and Sander, 1983] classifies residues into helices/strands
  • For an α-helix starting at residue i: $X_{origin} = (0.74\,X_i + X_{i+1} + X_{i+2} + 0.74\,X_{i+3}) / 3.48$, where $X_i$ is the position of the Cα atom of residue i
    (angle between two consecutive residues is 100° → factor 0.74)
  • Similar computation for $X_{end}$ and for β-strands
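
As a small worked example of the weighted average, the sketch below evaluates the formula on four consecutive Cα positions. The function name `helix_axis_point` is hypothetical, and only the origin formula quoted on the slide is implemented; how the end point and β-strands are handled is not spelled out in the slides, so it is left out.

```python
WEIGHTS = (0.74, 1.0, 1.0, 0.74)  # the weights from the slide; they sum to 3.48

def helix_axis_point(ca, i):
    """Weighted average of the C-alpha positions of residues i .. i+3."""
    window = ca[i:i + 4]
    if len(window) < 4:
        raise ValueError("need four consecutive C-alpha positions")
    total = sum(WEIGHTS)
    return tuple(sum(w * p[k] for w, p in zip(WEIGHTS, window)) / total
                 for k in range(3))
```

For an idealized helix (100° per residue around the z axis) the result lands almost exactly on the axis, which is the purpose of the 0.74 factor.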
Scoring Similarity

Assume that i and p have been aligned. What is the score of the alignment of k and r?

  • Position-independent differences:
    • |angle(i,k)-angle(p,r)|
    • |angle(i,j)-angle(p,q)|
    • |angle(j,k)-angle(q,r)|
    • |distance(i,k)-distance(p,r)|
    • |length(k)-length(r)|
  • Position-dependent differences:
    • angle(k,r)
    • distance(k,r)
  • Scores are additive: $\mathrm{Score} = \sum_i S(d_i)$, with $S(d_i) = \frac{2M_i}{1 + (d_i/d_i^0)^2} - M_i$
    • $M_i$: maximal score
    • $d_i^0$: value of $d_i$ for which the score is 0
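
The shape of the score function can be checked directly: it equals the maximal score at zero difference, crosses zero at the chosen cutoff, and tends to minus the maximal score for large differences. The sketch below uses hypothetical names (`sse_score`, `total_score`) and takes the per-term parameters as plain arguments; the actual parameter values used by LOCK are not given in the slides.

```python
def sse_score(d, max_score, d_zero):
    """S(d) = 2M / (1 + (d / d0)^2) - M for one difference term d."""
    return 2.0 * max_score / (1.0 + (d / d_zero) ** 2) - max_score

def total_score(terms):
    """Additive score over (d, max_score, d_zero) triples."""
    return sum(sse_score(d, m, d0) for d, m, d0 in terms)
```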

Stage 1: SSE Alignment
  • For every pair of SSE vectors of protein A, find all pairs of vectors in B that align well using orientation-independent scores → seed correspondence sets
  • For each correspondence set:
    • Find alignment transform (e.g., using start, middle, and end points of vectors) and apply it to B
    • Find correspondence set with maximal score
      (record transform T and correspondence set CS that yield the maximal score)

Stage 1: SSE Alignment

[Figure: search tree of correspondence sets rooted at the seed (i,p), (j,q), with branches adding pairs such as (k,r), (k,s), (k,t), (l,s), (l,t), (m,t)]

  • A = (i, j, k, l, m)
  • B = (p, q, r, s, t)
  • Seed correspondence {(i,p),(j,q)}
  • Simultaneous gaps in both structures are not allowed (not in SCOP2)
  • Terminate a path when score of new correspondence is negative
  • Re-compute new transform with each new correspondence (?)
Stage 2: Atom (Core) Alignment
  1. Construct correspondence pairs of atoms:
    • Atom i of A corresponds to atom j of T(B) iff i is the closest atom in A to j and j is the closest atom in T(B) to i
    • The distance between i and T(j) is < ε (3 Å)
  2. Prune correspondence set to the largest subset of correspondence pairs that follow the backbone alignment constraint
  3. Re-compute T to be the transform that minimizes the RMSD of the atoms in the correspondence set
  4. Iterate steps 1-3 until the RMSD converges
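
The first step (mutual nearest neighbors within a distance cutoff) can be sketched as follows. This is an illustrative brute-force version with an invented function name; it assumes the transform has already been applied to B, and it does not include the backbone-order pruning or the re-fitting of T.

```python
import math

def mutual_nearest_pairs(A, TB, eps=3.0):
    """Pairs (i, j) where A[i] and TB[j] are each other's nearest atom and closer than eps."""
    def nearest(p, pts):
        return min(range(len(pts)), key=lambda k: math.dist(p, pts[k]))

    pairs = []
    for i, a in enumerate(A):
        j = nearest(a, TB)
        # Keep the pair only if the relation is mutual and within the cutoff.
        if nearest(TB[j], A) == i and math.dist(a, TB[j]) < eps:
            pairs.append((i, j))
    return pairs
```

The mutual condition prevents two atoms of A from claiming the same atom of T(B), which a one-sided nearest-neighbor rule would allow.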
Experimental Results
  • 685 protein structures from PDB such that each pair has less than 25% sequence identity
  • 3 families of folds (based on SCOP classification):
    • myoglobins (11 structures), ~20% amino acid identity
    • TIM barrels (50 structures)
    • immunoglobulins (38 structures)
  • Goal: Given one query protein in each family, find the other members of the family (3×685 = 2055 alignments)
  • Method: For each query, sort the 685 structures by score (computed by LOCK). Select the top k proteins. Count members of family (true positives) and non-members (false positives)
[Figure: results per family, with panels Myoglobins (11), TIM-barrels (50), Immunoglobulins (38)]

Alignment of 50 TIM barrels

[Figure: α-helices in red, β-strands in yellow]

Alignments of 31 Immunoglobulins

[Figure: only β-strands are shown]

Running Time
  • ~ 1 ms per seed correspondence
  • ~ 1 h to search 10,000 protein structures
  • ~ 100s of days to compare all pairs of proteins in PDB
  • → Geometric hashing to speed up stage 1

Voting Scheme with Hash Table
  • Many-to-many comparison requires a better organization of computation to avoid repeating the same computation again and again
  • Pre-computation: Index proteins in hash table
  • Query phase: Voting scheme using hash table
  • Several variants on this theme: 3d-Lookup [Holm and Sander, 1995], 3dSEARCH [Singh 2002]
Indexing Target Structures in Hash Table (3dSEARCH [Singh 2002])
  • Hash table: 3-D regular grid of cubic bins (~2 Å)
  • For each target structure
    • For each pair of vectors (i,j)
      • Compute a coordinate system
      • Place an entry for each other vector k into the bin containing the coordinates of the midpoint of the vector (or average of coordinates of origin, middle, and end points). Store ID of coordinate system + k’s orientation and type (α or β) in the entry.
[Figure: vectors u and v expressed in two different coordinate systems]

Grid is same for all coordinate systems

  • Grid is sparsely occupied → hash table
  • A structure with n SSEs contributes n(n-1)(n-2) entries. Each vector is represented (n-1)(n-2) times
  • 10,000 structures with 10 SSEs each yield ~7M entries
Voting Using Hash Table

Given a query structure

  • For each pair of vectors (i,j)
    • Compute a coordinate system
    • For each other vector k
      • Retrieve the bin accessed by this vector and the neighboring bins
      • For every entry (vector) in those bins that has the same orientation and type as k, add a vote for the coordinate system stored in the entry
  • Sort target structures based on max number of votes received by any of its coordinate systems
  • → Small number of target structures. Use LOCK for better alignment
  • Hours of pure LOCK are reduced to seconds
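
The indexing and voting phases can be illustrated with a toy implementation. Everything below is a sketch under stated assumptions rather than the 3dSEARCH code: an SSE is a `(start, end, type)` tuple, the coordinate system built from a pair (i, j) is one arbitrary but rigid-motion-invariant choice (origin at the midpoint of i, x along i, z normal to i and to the line joining the two midpoints), entries store only the type (the orientation check mentioned on the slide is omitted), and all function names are invented.

```python
import math
from collections import defaultdict
from itertools import permutations, product

BIN = 2.0  # edge length of a cubic bin in Å (value from the slide)

def sub(u, v): return tuple(a - b for a, b in zip(u, v))
def dot(u, v): return sum(a * b for a, b in zip(u, v))
def cross(u, v):
    return (u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0])
def midpoint(sse): return tuple((a + b) / 2 for a, b in zip(sse[0], sse[1]))

def frame(vi, vj):
    """Assumed frame for the pair (vi, vj); returns None if the pair is degenerate."""
    origin = midpoint(vi)
    x = sub(vi[1], vi[0])
    z = cross(x, sub(midpoint(vj), origin))
    nx, nz = math.sqrt(dot(x, x)), math.sqrt(dot(z, z))
    if nx < 1e-9 or nz < 1e-9:
        return None
    x = tuple(c / nx for c in x)
    z = tuple(c / nz for c in z)
    return origin, (x, cross(z, x), z)

def bins(sses, i, j):
    """Yield (bin, type) for every other vector k, expressed in the frame of (i, j)."""
    f = frame(sses[i], sses[j])
    if f is None:
        return
    origin, axes = f
    for k, sse in enumerate(sses):
        if k not in (i, j):
            rel = sub(midpoint(sse), origin)
            yield tuple(math.floor(dot(rel, ax) / BIN) for ax in axes), sse[2]

def build_index(targets):
    """Pre-computation: hash every target vector in every pair-defined frame."""
    table = defaultdict(list)
    for name, sses in targets.items():
        for i, j in permutations(range(len(sses)), 2):
            for b, kind in bins(sses, i, j):
                table[b].append((name, i, j, kind))
    return table

def vote(table, query):
    """Query phase: rank targets by the best vote count of any of their frames."""
    best = defaultdict(int)
    for i, j in permutations(range(len(query)), 2):
        tally = defaultdict(int)
        for b, kind in bins(query, i, j):
            # Look in the bin itself and its 26 neighbors.
            for off in product((-1, 0, 1), repeat=3):
                key = tuple(c + o for c, o in zip(b, off))
                for name, ti, tj, tkind in table.get(key, ()):
                    if tkind == kind:
                        tally[(name, ti, tj)] += 1
        for (name, _, _), n in tally.items():
            best[name] = max(best[name], n)
    return sorted(best.items(), key=lambda kv: -kv[1])
```

Because the index is built once, each query only pays for its own n(n-1)(n-2) lookups; the top-ranked targets would then be passed to a full alignment method such as LOCK.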
Advantages of Voting System
  • Very efficient in practice for many-to-many comparisons
  • Can establish correspondence between partial, disconnected substructures
  • Parallel implementation is straightforward
  • Independent of the order in which vectors are considered
  • Drawback (?): May establish correspondences that do not satisfy the backbone sequence constraint
Problem #4: Find Pharmacophore in Ligands
  • Given:
    • Collection of N (= 5 to 10) small flexible ligands with similar activity (binding at same sites)
    • A set of low-energy conformations (dozens to few hundreds) for each ligand
  • Find a substructure (pharmacophore) that has a match in at least one conformation of each ligand

[Figures: Inhibitor binding to HIV protease; Benzamidine binding to beta-Trypsin (3ptb)]
[Figure: ligand structures with O and H atoms labeled and the shared pharmacophore highlighted]

Pharmacophore and Rational Drug Design
  • Pharmacophore identification is a form of “reverse engineering” to get a model of a binding site
  • A pharmacophore can be used to modify ligands into more potent drugs and/or to screen large databases of ligands for “leads”
Three Simultaneous Problems
  • Conformations?
  • Correspondence?
  • Transform?
  • But ligands are small molecules
Software
  • DISCO [Martin et al., 1993]
  • DISCOtech and GASP [Tripos, Inc.]
  • CATALYST and HIPHOP [Accelrys et al.; Green et al., 1994; Barnum et al., 1996]
  • RAPID: P.W. Finn, L.E. Kavraki, J.C. Latombe, R. Motwani, C. Shelton, S. Venkatasubramanian, and A. Yao. RAPID: Randomized Pharmacophore Identification for Drug Design. Computational Geometry: Theory and Applications, 10, pp. 263-272, 1998
[Figure: ligands M1, M2, and M3]

Pairwise Comparison

Multi-Probe({M1,…,MN})

  1. Extract invariants from M1 and M2 by calling Pair-Probe(P1,P2) on every pair of conformations of the two ligands
  2. Test each candidate invariant S obtained at Step 1 against every ligand Mi, i = 3,…,N by calling Pair-Probe(S,P) on S and each conformation P of Mi
Pair-Probe

  • n: smallest number of atoms/features in a ligand
  • α: given constant (0 < α ≤ 1)
  • P1 and P2: conformations of two distinct ligands (or candidate invariant)

Pair-Probe(P1,P2)

Perform s times:

  • Pick a triplet of atoms at random from P1
  • Determine three atoms in P2 congruent to this triplet; compute the alignment transform T
  • Iterate: Apply T to P2; determine the atoms in P1 matching those in P2; update T
  • If the number of matching atoms exceeds αn, then return this atom set as a candidate invariant S
Magnitude of s
  • $\Pr[\text{picking 3 atoms in invariant}] \ge \alpha^3$
  • $\Pr[\text{failing to find invariant}] \le (1 - \alpha^3)^s$
  • We want: $(1 - \alpha^3)^s \le \gamma$ (γ is the acceptable probability of failure)
  • $s \ge \ln(\gamma) / \ln(1 - \alpha^3)$
  • Since $x < -\ln(1 - x)$ for $0 < x < 1$, it suffices to take $s \ge \ln(1/\gamma) / \alpha^3$
  • For $\gamma = 10^{-2}$ and $\alpha = 0.3$, we get $s \approx 180$
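
The two bounds on s can be evaluated directly. The helper names below are made up for this illustration. For a failure probability of 0.01 and a constant of 0.3 the simplified bound evaluates to 171 trials, the same order of magnitude as the figure of about 180 quoted on the slide.

```python
import math

def trials_simple(alpha, gamma):
    """Sufficient number of trials from the simplified bound ln(1/gamma) / alpha^3."""
    return math.ceil(math.log(1.0 / gamma) / alpha ** 3)

def trials_exact(alpha, gamma):
    """Smallest s with (1 - alpha^3)^s <= gamma."""
    return math.ceil(math.log(gamma) / math.log(1.0 - alpha ** 3))
```

The cubic dependence on the constant is what drives the cost: halving it multiplies the number of random triplets by roughly eight.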
Some Results

[Figure: four structures labeled 1TLP, 4TMN, 5TMN, and 6TMN]

  • 63 to 69 atoms with 10 to 15 torsional degrees of freedom
  • Feature: every non-H atom → ~30 features of 6 types (atom types)
  • Invariant in active conformations: 7-atom pharmacophore + 7-atom scaffolding

[Table: columns #conf, t(s), #4 through #14; values not available in this transcript]

Hausdorff Distance

[Figure: point sets A and B with the directed distance $d_H(A,B)$]

  • Two sets of points $A = \{a_1, \ldots, a_n\}$ and $B = \{b_1, \ldots, b_m\}$ in $\mathbb{R}^k$
  • $d_H(A,B) = \max_{a \in A} \min_{b \in B} \|a - b\|$
  • $D_H(A,B) = \max\{d_H(A,B), d_H(B,A)\}$
  • Variation for shape similarity: $\Delta_H(A,B) = \min_T D_H(A, T(B))$
  • But efficient algorithms only exist for planar sets of points
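
For fixed point sets (no minimization over T), the two definitions translate into a brute-force O(nm) computation. The function names are chosen for this sketch only.

```python
import math

def directed_hausdorff(A, B):
    """d_H(A, B): the largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """D_H(A, B): the symmetric version, the larger of the two directed distances."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

The directed distance is not symmetric: if A is a subset of B, then the distance from A to B is 0 while the distance from B to A can be large, which is why the symmetric maximum is used.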
Other Idea: Minimize cost of transforming A into B
  • Old idea:
    • Graphics: Morphing distance
    • Computer vision: Earth Mover’s distance [Rubner, Tomasi, and Guibas, 1998]
  • Protein similarity:
    • Isotopic distance [Erdmann, 2004]
Structure Alignment Isotopies
  • Two curves are isotopic if one can be deformed into the other without self-collision
  • Example: Polygonal curve with n vertices
  • One may think of structure alignment as an isotopy deforming one structure into the other
  • Two structures are similar if the isotopy is “small”

M.A. Erdmann. Protein Similarity from Knot Theory: Geometric Convolution and Line Weavings, CMU Tech. Rep. CMU-CS-04-138.

“Small” Isotopy
  • Model a structure as a set of polygonal lines (e.g., vertices are Cα atoms)
  • Two structures A and B are (T,δ)-isotopic if there exists an isotopy deforming A into T(B) in such a way that no vertex of A moves farther than δ from its initial or final location

[Erdmann 2004]

Similarity Measure
  • $d_T(A,B) = \inf\{\delta \mid A \text{ is } (T,\delta)\text{-isotopic to } B\}$
  • $d(A,B) = \inf_T d_T(A,B)$
  • d is computable [Erdmann, 2004]
  • But as complex as path planning, hence exponential in the number of degrees of freedom
  • Possibility of approximating d using probabilistic roadmaps?
Topology of Line Weavings

[Figure: α-helix axes of 1xis and 1nar]

[Erdmann 2004]

[Figure: 2 topologically equivalent line weavings]

3 equivalence classes for 4 lines

[Erdmann 2004]

[Figure: 2 non-equivalent line weavings]

2 equivalence classes for 3 lines

Why Is Topology Interesting?

Two conformations may be geometrically close (small RMSD) but may require a long continuous deformation to map one into the other (without steric clashes)

Conclusion
  • Automatic computation of structure similarity is essential due to the rapid growth of the PDB and other molecule (e.g., ligand) libraries
  • As the growth of new protein structures outpaces that of new folds, detecting structural similarity will have to be much more fine-grained than it is today
  • Biological discoveries will likely lie in local, possibly rare structure similarities, rather than in global fold-level classification
  • Need for better understanding of applications and radically new approaches
  • Still a lot of work ...