Seminar in structural bioinformatics

Seminar in structural bioinformatics Multiple structural alignment of proteins By Elad Kaspani

Multiple structural alignment

Outline • Introduction • What is Multiple structural alignment? • Why do we need Multiple structural alignment? • Pairwise Vs. Multiple structural alignment • MASS - Multiple structural alignment by secondary structures • Problem definition • General strategy • Algorithm description

Outline Cont. • MASS - Multiple structural alignment by secondary structures • Algorithm outline • Complexity • Results Discussion • Summary & Conclusions

Introduction • Proteins sharing a common substructure may have a similar function. • What is multiple structural alignment? • Discussion – we already have pairwise alignment, isn’t that enough?

Pairwise Vs. Multiple structural alignment • Many algorithms exist for the pairwise structural alignment task • Only a few methods are available for aligning multiple structures • Most of them are based on a series of pairwise comparisons • SSAPm (Taylor et al., 1994) • Prism (Yang and Honig, 2000b) • STAMP (Russell and Barton, 1992)

What do we want? • Classification of existing and newly discovered proteins • Gaining insights into evolutionary relations between proteins • Detecting motifs common to a group of proteins that share a certain function • Structure prediction algorithms

What’s wrong with methods based on a series of pairwise comparisons?

Multiple structural alignment • These methods are limited! • In each pairwise comparison, the only available information is about the two molecules being compared • Alignments that are optimal for the whole set can be disregarded • Dynamic programming disadvantage: it depends on the sequence order of the polypeptide chain • We can’t see the woods for the trees

What do we do then? • MASS: multiple structural alignment by secondary structures

MASS • Considers all the given structures at the same time • Exploiting the secondary structure representation - reduced time complexity • Does not require that all the input molecules be aligned • Capable of detecting structural motifs shared only by a subset of the molecules

MASS • Can find non-sequential and even non-topological structural motifs • Suitable for a broad range of applications • Filters noisy results • Highly efficient and robust • Other multiple-based methods • (Escalier et al., 1998) • MUSTA (Leibowitz et al., 2001) • MultiProt (Shatsky et al., 2002)

Secondary structure elements (SSE)

Basic terms • Rigid transformation • Q: a subset of points • T(Q) = R(Q) + t, where R is a 3x3 rotation matrix and t is a translation vector • ε-congruence • For ε > 0, find the two largest subsets P and Q of the input sets, and a rigid transformation T, such that distance(P, T(Q)) < ε • How do we measure distance? • RMSD
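The terms above can be illustrated with a minimal sketch, assuming points are stored as (n, 3) NumPy arrays and that P and Q are already paired point by point. The function names and the threshold value are hypothetical and chosen only for illustration.

```python
import numpy as np

def apply_rigid(points, R, t):
    """Apply T(Q) = R·Q + t to an (n, 3) array of points."""
    return points @ R.T + t

def rmsd(P, Q):
    """Root-mean-square deviation between two (n, 3) arrays of paired points."""
    diff = P - Q
    return float(np.sqrt((diff * diff).sum(axis=1).mean()))

# Rotation by 90 degrees about the z-axis, followed by a shift along x.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([1.0, 0.0, 0.0])

Q = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
P = apply_rigid(Q, R, t) + 0.1   # a slightly perturbed rigid copy of Q

epsilon = 0.5                    # hypothetical threshold (Å), not from the slides
is_congruent = rmsd(P, apply_rigid(Q, R, t)) < epsilon
```

Here `is_congruent` is true because every coordinate of P deviates from T(Q) by only 0.1.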

Problem Definition • The pairwise case: • given two proteins, each represented by a set of points in 3D space • each point is associated with an atom’s position • find the largest point set that is congruent to a subset of points in each of the two proteins • In computational geometry: the largest common point set (LCP) problem

Problem Definition • The multiple case: • given a collection of m point sets, • find the largest set of points of which an ε-congruent copy appears in each of the input sets • Unfortunately, it’s NP-hard • We want not only the largest set of points, but also smaller common substructures

Problem Definition • The multiple subset case: • find solutions where only a subset of the input proteins is well aligned • this complicates the problem! (why?) • the number of subsets is exponential • trade-off between the size of the subset and the size of its core (match list) • scoring function (core size L, number of proteins k): f(L, k) = k · L²

The algorithm :

Method • Input : • a set of m proteins P1, P2, . . . , Pm. • For each protein • the sequence of the 3D coordinates of atoms • assignment of SSE types to each residue • Output : • The multiple alignments with the largest cores, according to the scoring function.

General strategy • We want multiple alignments with at least two SSEs • Bases – ordered pairs of SSEs whose ε-congruent copies appear in several proteins • We look for a set of ε-congruent bases {b1, b2, . . . , bk}, from proteins Pi1, Pi2, . . . , Pik, respectively. • The first base (b1) is our pivot

General strategy – cont. • Compute all the k − 1 rigid transformations between this base and the others • Result: (T12, T13, . . . , T1k) defines a multiple alignment between Pi1, Pi2, . . . , Pik • The core may contain more than one base • we will get several alignments with almost the same transformations • (one alignment per base in the core)

General strategy – cont. • Cluster the initial multiple base alignments • Merge the alignments in each cluster: the core of the new alignment is the union of the cores of the original alignments • We get a smaller set of multiple alignments • Extend the clustered alignments • Find additional matching residues • Give a score to each alignment • Report the highest scoring alignments

Algorithm outline

Algorithm outline - stage 1 • Representation of secondary structure elements: • Axis representation for SSEs • The least squares line from all the Cα atoms • Direction & length determined by protein structure
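A minimal sketch of the axis representation, assuming the Cα coordinates of one SSE are given in chain order. The least-squares line is obtained here with an SVD, and the segment endpoints are taken from the extreme projections; both choices are illustrative assumptions rather than the documented MASS implementation.

```python
import numpy as np

def sse_axis(ca_coords):
    """Fit a least-squares line to Cα coordinates; return (start, end) of the axis segment."""
    ca = np.asarray(ca_coords, dtype=float)
    centroid = ca.mean(axis=0)
    # The first right singular vector is the direction of the least-squares line.
    _, _, vt = np.linalg.svd(ca - centroid)
    direction = vt[0]
    # Orient the axis from the first residue towards the last one (assumption).
    if np.dot(ca[-1] - ca[0], direction) < 0:
        direction = -direction
    # Project the atoms on the line; the extreme projections bound the segment.
    proj = (ca - centroid) @ direction
    return centroid + proj.min() * direction, centroid + proj.max() * direction
```

The returned pair of points gives both the direction and the length of the SSE axis.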

Algorithm outline – stage 2 • Detection of multiple base alignments: • Use Geometric Hashing to detect bases whose ε-congruent copies appear in several proteins • Each base has fingerprint • invariant to a 3D rigid transformation • the types of the two SSEs • the angle between their axial vectors • the midpoint-to-midpoint distance • their line distance

Base fingerprint

Algorithm outline – stage 2 • Almost-congruent bases have similar fingerprints • the types of their SSEs are the same • the difference between their midpoint-to-midpoint and line distances is up to 1.5 Å • difference between their angles is up to 0.3 radians • reside close to each other in the grid
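The fingerprint and the similarity test can be sketched as follows. The sketch assumes each SSE is a hypothetical tuple (type, axis start, axis end) and interprets "line distance" as the shortest distance between the two infinite lines carrying the axes; that interpretation is an assumption.

```python
import numpy as np

def base_fingerprint(sse_a, sse_b):
    """sse = (type, start, end). Returns (type_a, type_b, angle, midpoint_dist, line_dist)."""
    type_a, a0, a1 = sse_a
    type_b, b0, b1 = sse_b
    a0, a1, b0, b1 = (np.asarray(p, dtype=float) for p in (a0, a1, b0, b1))
    u, v = a1 - a0, b1 - b0
    cos_angle = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    angle = float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    midpoint_dist = float(np.linalg.norm((a0 + a1) / 2 - (b0 + b1) / 2))
    normal = np.cross(u, v)
    if np.linalg.norm(normal) < 1e-9:
        # Parallel axes: distance from a point of one line to the other line.
        line_dist = float(np.linalg.norm(np.cross(b0 - a0, u)) / np.linalg.norm(u))
    else:
        # Skew or crossing axes: project the offset on the common normal.
        line_dist = float(abs(np.dot(b0 - a0, normal)) / np.linalg.norm(normal))
    return type_a, type_b, angle, midpoint_dist, line_dist

def almost_congruent(fp1, fp2, dist_tol=1.5, angle_tol=0.3):
    """Similarity test with the tolerances quoted above (Å and radians)."""
    return (fp1[:2] == fp2[:2]
            and abs(fp1[2] - fp2[2]) <= angle_tol
            and abs(fp1[3] - fp2[3]) <= dist_tol
            and abs(fp1[4] - fp2[4]) <= dist_tol)
```

All three numeric components depend only on relative geometry, so the fingerprint does not change when both SSEs undergo the same rigid transformation.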

Algorithm outline – stage 2 • For each grid bin, extract all the bases of the bin and of adjacent bins • Group them together in the same base bucket • Base bucket - stores bases in columns according to the protein they belong to • Bases derived from the same protein are stored in the same column
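A minimal sketch of the base bucket construction, assuming fingerprints are tuples (type_a, type_b, angle, midpoint_dist, line_dist) and that the bin widths equal the tolerances quoted earlier. The bin widths, the data layout and the rule of keeping only buckets that span at least two proteins are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

DIST_BIN, ANGLE_BIN = 1.5, 0.3   # assumed bin widths (Å, radians)

def grid_key(fp):
    """Map a fingerprint to its grid bin; SSE types are part of the key."""
    type_a, type_b, angle, mid, line = fp
    return (type_a, type_b,
            int(angle // ANGLE_BIN), int(mid // DIST_BIN), int(line // DIST_BIN))

def build_base_buckets(bases):
    """bases: iterable of (protein_id, base_id, fingerprint).
    Returns {bin key: {protein_id: [base_id, ...]}}, one column per protein."""
    grid = defaultdict(list)
    for protein_id, base_id, fp in bases:
        grid[grid_key(fp)].append((protein_id, base_id))
    buckets = {}
    for key in grid:
        ta, tb, i, j, k = key
        columns = defaultdict(list)
        # Collect the bases of this bin and of all adjacent bins.
        for di, dj, dk in product((-1, 0, 1), repeat=3):
            for protein_id, base_id in grid.get((ta, tb, i + di, j + dj, k + dk), []):
                columns[protein_id].append(base_id)
        if len(columns) >= 2:   # a bucket needs bases from at least two proteins
            buckets[key] = dict(columns)
    return buckets
```

Looking at adjacent bins as well as the bin itself prevents two similar bases from being separated merely because they fall on opposite sides of a bin boundary.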

Base bucket Almost-congruent bases are stored in the same base bucket

Stage 2 cont. • A collection of almost-congruent bases, each belonging to a different column, induces a local multiple alignment between the respective proteins • The core consists of at least two SSEs • One base is selected as the pivot • The rest of the bases are superimposed on it • Selection of the pivot may influence the alignment • Optional – try each base as the pivot

Stage 2 cont. • Multiple alignment is defined by an underlying set of pairwise alignments • For each base bucket we compute all the alignments between two bases taken from two different columns • find the transformation between two bases that aligns the maximal number of atoms with minimal RMSD
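Once a correspondence between atoms of two bases is fixed, the minimal-RMSD transformation can be computed in closed form. The sketch below uses the standard Kabsch (SVD) method as a stand-in; the slides do not say which fitting procedure MASS uses, and the search for the correspondence that maximises the number of aligned atoms is not shown.

```python
import numpy as np

def kabsch(P, Q):
    """Return (R, t) minimising the RMSD between P and R·Q + t for paired (n, 3) arrays."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (Q - q_mean).T @ (P - p_mean)          # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = p_mean - R @ q_mean
    return R, t
```

Applying `Q @ R.T + t` then superimposes the second base onto the first.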

Cα atomic level

Stage 3 - Clustering • For a pair of proteins that share more than one base • We get several alignments with almost the same transformation, but a different local SSE core • Cluster all the local base alignments to find the ones with similar transformations • Merge them into a new global alignment • The match list (core) of the new global alignment • is the union of the original local match lists • its transformation is the one that aligns the SSEs with minimal RMSD
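A minimal sketch of the clustering step, assuming each local alignment is a hypothetical tuple (R, t, match_set). The greedy scheme and both thresholds are illustrative assumptions, and the sketch keeps the first transformation of each cluster instead of refitting it for minimal RMSD.

```python
import numpy as np

def rotation_angle(R1, R2):
    """Angle (radians) of the relative rotation between R1 and R2."""
    cos_theta = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def cluster_alignments(alignments, max_angle=0.3, max_shift=1.5):
    """Greedily group alignments with similar transformations; cores are merged by union."""
    clusters = []   # each cluster is [R_representative, t_representative, merged_core]
    for R, t, core in alignments:
        for cluster in clusters:
            if (rotation_angle(cluster[0], R) <= max_angle
                    and np.linalg.norm(cluster[1] - t) <= max_shift):
                cluster[2] |= set(core)     # union of the local match lists
                break
        else:
            clusters.append([R, t, set(core)])
    return clusters
```

After merging, the transformation of each cluster would still have to be recomputed over the merged core to satisfy the minimal-RMSD requirement.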

Stage 4 - Global extension • Now the core of each pairwise alignment is a set of SSEs • We then extend these alignments by finding additional matching residues • These residues do not necessarily belong to SSEs • We want to extend the cores of these alignments by detecting corresponding Cα atoms • We want to transform the second protein so that it is fully superimposed onto the pivot protein

Stage 4 - Global extension • Detect, in linear time, close pairs of Cα atoms, one atom from each protein • These atom pairs are added to the alignment’s match list • The transformation of the alignment is refined by employing the Least-Squares Fitting method
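One way to find close atom pairs in linear time is spatial hashing, sketched below. It assumes the second protein has already been transformed onto the pivot and that atom density is bounded, so each grid cell holds only a few atoms. The cutoff value and the one-to-one greedy matching are assumptions, not details taken from the slides.

```python
from collections import defaultdict
from itertools import product
import numpy as np

def close_pairs(pivot_ca, moved_ca, cutoff=3.0):
    """Return [(i, j)] with |pivot_ca[i] - moved_ca[j]| <= cutoff; each pivot atom is used once."""
    def cell(p):
        return tuple(int(c) for c in np.floor(p / cutoff))

    grid = defaultdict(list)
    for i, p in enumerate(pivot_ca):
        grid[cell(p)].append(i)

    pairs, used = [], set()
    for j, q in enumerate(moved_ca):
        cx, cy, cz = cell(q)
        best, best_d = None, cutoff
        # Only the 27 surrounding cells can contain atoms within the cutoff.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for i in grid.get((cx + dx, cy + dy, cz + dz), []):
                d = np.linalg.norm(pivot_ca[i] - q)
                if i not in used and d <= best_d:
                    best, best_d = i, d
        if best is not None:
            used.add(best)
            pairs.append((best, j))
    return pairs
```

The resulting pairs can be appended to the match list and passed to a least-squares fit to refine the transformation.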

Stage 5 – Filtering & Scoring • Computing the best global multiple alignments • What are the best global multiple alignments? • Number of aligned molecules Vs. core size • core size Vs. size of the smallest molecule • number of possible multiple alignments defined by the base buckets is exponential • We do not compute all of them

Stage 5 – Filtering & Scoring • Heuristic solution: • For each base bucket (BB), compute the set of best multiple alignments recursively over the columns • For a set of multiple base alignments obtained in the last stage (b1, . . . , bk) • Check whether there is a base, bk+1, from the current column that improves the alignment’s score • Core(b1, . . . , bk+1) = Core(b1, . . . , bk) ∩ Core(b1, bk+1)
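The core update rule can be sketched as a greedy pass over the columns. This is a simplification of the recursive search: it keeps one alignment per pivot, represents each pairwise core Core(b1, b) as the set of pivot residues it matches, and takes the scoring function as a parameter. All names are hypothetical.

```python
def extend_over_columns(pivot_residues, columns, score):
    """columns: list of (protein_id, [core_with_pivot, ...]), one core per base in the column.
    score(core_size, num_proteins) is supplied by the caller."""
    core, members = set(pivot_residues), ["pivot"]
    for protein_id, cores in columns:
        # Core(b1..bk+1) = Core(b1..bk) ∩ Core(b1, bk+1); keep the best base of the column.
        candidate = max((core & c for c in cores), key=len, default=None)
        if (candidate is not None
                and score(len(candidate), len(members) + 1) > score(len(core), len(members))):
            core, members = candidate, members + [protein_id]
    return members, core
```

For example, with a placeholder score `lambda L, k: k * L`, a protein is added only when the gain in the number of aligned proteins outweighs the shrinkage of the common core.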

Stage 5 – Filtering & Scoring • Our scoring function • Core size: L • Number of proteins: k • f(L, k) = k · L² • Report the highest scoring alignments • Finish!

Complexity • Worst case complexity: • (i) m is the number of proteins • (ii) k is the number of residues in an SSE • (iii) s and n are the number of SSEs and the number of residues found in each protein, respectively. • n ~ 300, k ~ 10, s ~ 15 • The number of bases for each protein is O(s²)

Complexity • For each pair of proteins we construct, cluster and extend O(s⁴) pairwise alignments. • This results in O(m²(s⁴k³ + s⁸ log s + s⁴n)) time, where O(m²) is the number of ways of pairing two proteins • In practice, the complexity is much smaller • We only construct the pairwise alignments defined by the BBs, and the clustering reduces their number even more
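To get a feel for the three terms of the worst-case bound, the typical sizes quoted above can be plugged in. This is plain arithmetic on the stated bound; the variable names and the choice of log base 2 are assumptions, and constant factors are ignored.

```python
import math

n, k, s = 300, 10, 15                   # typical sizes quoted above

term_sse_fit = s**4 * k**3              # the s^4 k^3 term
term_cluster = s**8 * math.log2(s)      # the s^8 log s term
term_extend = s**4 * n                  # the s^4 n term

for name, value in [("s^4 k^3", term_sse_fit),
                    ("s^8 log s", term_cluster),
                    ("s^4 n", term_extend)]:
    print(f"{name:>10}: {value:.2e} operations per protein pair")
```

The s⁸ log s term dominates the worst case by a wide margin, which is consistent with the remark that restricting work to the base buckets and clustering matters in practice.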

Complexity • The number of evaluated multiple alignments is linear in the number of bases • Each base can be a pivot for only one multiple alignment • We have O(ms²) bases • It takes O(ms²n) time to construct a single multiple alignment and O(m²s⁴n) time to construct all of them • The running time of the entire algorithm is bounded by O(m²s⁴(k³ + s⁴ log s + n)), but experiments show that the actual running time is significantly lower

Algorithm outline (reminder)

Results and Discussion

Experiment 1 • Example 1 - Detection of subset alignments and their use for structural classification • We have used MASS to align a set of 12 structures from two families: • Cofilin-like (CL) • Gelsolin-like (GL) • The two families are related structurally but not sequentially

Experiment 1 • The 12-molecule ensemble contains: • four CL structures • eight GL • The running time of MASS on this ensemble was 36 sec. • (Pentium 4 1800 MHz processor)

Experiment 1: core Vs. # Molecules

Experiment 1: Results (A) The structural alignment of all 12 proteins of the ensemble. (B) A subset alignment between only the eight GL proteins.

Experiment 1: Results (C) A subset alignment between only the four CL structures. (D) A subset alignment between only three out of the four CL structures.

Results Discussion • As expected, the maximal core size decreases as the number of aligned molecules increases • The dependence is not linear. Large decreases occur: • Between three and four molecules • Between four and five molecules • Between eight and nine molecules

Experiment 2 • Non-topological motif detection • The ensembles share a common SSE motif but differ in topology. • In topological motifs, the order and the direction of the corresponding SSEs along the polypeptide chain are conserved, while in non-topological motifs they are not.
