This document discusses multiple sequence alignment (MSA) methodologies within computational biology, focusing on the optimization of alignment processes to minimize specific objective functions, such as the Sum of Pair Score (SP-Score). It differentiates between exact algorithms using dynamic programming and heuristic approaches to address the NP-completeness of the problem. Additionally, the concept of compact sets is explored, along with examples from protein sequences that illustrate hierarchical MSA and relevant complexities in alignment problems.
Multiple Sequence Alignment Based on Compact Set Department of Computer Science National Tsing Hua University Chuan Yi Tang
Multiple Sequence Alignment
• Given a set of sequences, the MSA problem is to find an alignment of the sequences such that some objective function is minimized
• e.g. the Sum of Pairs Score (SP-Score)
• Example:
S1: ATTCG
S2: AGTCG
S3: ATCAG
MSA:
S’1: A T - T C - G
S’2: A - G T C - G
S’3: A T - - C A G
Pairwise costs: (S’1, S’2) = 2, (S’1, S’3) = 2, (S’2, S’3) = 4; Cost = 2 + 2 + 4 = 8
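The cost of 8 in the example above can be reproduced with a short sketch. It assumes a unit-cost scheme that the slide does not state explicitly: a column costs 1 when the two symbols differ (mismatch, or a residue against a gap) and 0 otherwise, including gap against gap. The function names `pair_cost` and `sp_score` are illustrative only.

```python
from itertools import combinations


def pair_cost(row_a, row_b):
    """Cost of one induced pairwise alignment under the assumed unit-cost scheme."""
    # A column costs 1 when the symbols differ; identical symbols
    # (including gap against gap) cost 0.
    return sum(1 for x, y in zip(row_a, row_b) if x != y)


def sp_score(alignment):
    """Sum of Pairs score: total cost over all pairs of aligned rows."""
    return sum(pair_cost(a, b) for a, b in combinations(alignment, 2))


# The three aligned rows from the slide, with spaces removed.
alignment = ["AT-TC-G", "A-GTC-G", "AT--CAG"]
print(sp_score(alignment))
```

With other substitution or gap costs only `pair_cost` would change; the summation over all pairs stays the same.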
MSA with SP-Score: Exact Algorithm and Heuristics
• k: number of sequences; n: length of the sequences
• Exact (using dynamic programming)
• O((2n)^k): D. Sankoff, Simultaneous solution of the RNA folding, alignment and protosequence problems, SIAM J. Appl. Math. (1985)
• Heuristics
• D.F. Feng, R.F. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol. 25, 351-360 (1987)
• S.F. Altschul, D.J. Lipman, Trees, stars, and multiple biological sequence alignment, SIAM J. Appl. Math. (1989)
• D.J. Lipman, S.F. Altschul, A tool for multiple sequence alignment, Proc. Nat. Acad. Sci. U.S.A. (1989)
• S.C. Chan, A.K.C. Wong, D.K.Y. Chiu, A survey of multiple sequence comparison methods, Bull. Math. Biol. (1992)
MSA with SP-Score: Complexity
• L. Wang, T. Jiang (McMaster University, Hamilton, Ontario, Canada), On the complexity of multiple sequence alignment, J. Comput. Biol. 1(4):337-48 (Winter 1994). The paper studies the computational complexity of two popular problems in multiple sequence alignment:
1. multiple alignment with SP-Score => NP-complete (non-metric)
2. multiple tree alignment => MAX SNP-hard
• P. Bonizzoni, G. Della Vedova, The complexity of multiple sequence alignment with SP-score that is a metric, Theoretical Computer Science 259 (2001) 63-79:
1. multiple alignment with SP-Score => NP-complete (metric)
MSA with SP-Score: Approximation
• Approximation algorithms:
• Performance ratio of 2-2/k: D. Gusfield, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bull. Math. Biol. (1993)
• Performance ratio of 2-3/k: P. Pevzner, Multiple alignment, communication cost, and graph matching, SIAM J. Appl. Math. (1992)
• Performance ratio of 2-l/k (assembling l-way alignments, l ≤ k): V. Bafna, E.L. Lawler and P. Pevzner, Approximation algorithms for multiple sequence alignment, Theor. Comput. Sci. (1997)
• Polynomial Time Approximation Scheme (PTAS):
• MSA within a constant band, allowing only a constant number of insertion and deletion gaps of arbitrary length per sequence on average: M. Li, B. Ma and L. Wang, Near optimal alignment within a band in polynomial time, STOC 2000
Compact Set Definition
• Let S be the set of n objects {S1, S2, S3, …, Sn} and let D(Si, Sj) denote the distance between Si and Sj in the distance matrix D.
• Consider any subset C of S. If every distance between an element of C and an element outside C is larger than the largest distance between two elements of C, then C is called a compact set.
• Properties:
• The entire set S is a compact set.
• Each set consisting of a single object is also a compact set.
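The definition can be checked directly on a small matrix. The sketch below is a brute-force illustration only, not the optimal algorithm of Dekel, Hu and Ouyang cited later in the deck; the 4-object distance matrix and the names `is_compact` and `nontrivial_compact_sets` are made up for the example.

```python
from itertools import combinations


def is_compact(dist, subset):
    """True if every border distance exceeds the largest distance inside subset."""
    inside = sorted(subset)
    outside = [i for i in range(len(dist)) if i not in subset]
    # The whole set and every singleton are compact by the stated properties.
    if not outside or len(inside) <= 1:
        return True
    max_inside = max(dist[a][b] for a, b in combinations(inside, 2))
    min_border = min(dist[a][b] for a in inside for b in outside)
    return min_border > max_inside


def nontrivial_compact_sets(dist):
    """Enumerate compact sets of size 2..n-1 by brute force (exponential time)."""
    n = len(dist)
    found = []
    for size in range(2, n):
        for subset in combinations(range(n), size):
            if is_compact(dist, set(subset)):
                found.append(set(subset))
    return found


# Hypothetical symmetric distance matrix over four objects 0..3.
D = [
    [0, 1, 5, 6],
    [1, 0, 5, 7],
    [5, 5, 0, 2],
    [6, 7, 2, 0],
]
print(nontrivial_compact_sets(D))
```

Here {0, 1} and {2, 3} are compact because their inside distances (1 and 2) are smaller than every distance leaving the set, while {1, 2} is not, since object 1 is closer to object 0 than to object 2.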
Compact Set Example
[Figure: distance matrix and graph over S1 to S6 showing Compact Set 1, Compact Set 2 and Compact Set 3. For Compact Set 3, the maximal inside edge is 10 and the minimal border edge is 11.]
Compact Set Example (cont’d) • Compact sets are hierarchical
MSA & Compact Set • Consider 12 Protein sequences example: • S1 :MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESAMKKIEEHNTLVFIVSNDANKYQIKDAVHKLYNVQALKVNTLITPLQQKKAYVRLTADYDALDVANKIGVI • S2 :SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDVVESEYDVTVVDVNTQITPEAEKKATVKLSAEDDAQDVASRIGVF • S3 :SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEVADAVEEQYDVTVEQVNTQNTMDGEKKAVVRLSEDDDAQEVASRIGVF • S4 :MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNTLVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDALDVANKIGII • S5 :MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNTLVFKVSLKANKYQIKKAVKELYEVDVLSVNTLVRPNGTKKAYVRLTADFDALDIANRIGYI • S6 : MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEAIKQLFNAEVAEVNTNITPKGQKKAYIKLKDEYNAGEVAASLGIY • S7 :MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTESAMKKIEDNNTLVFIVDIKADKKKIKDAVKKMYDIQTKKVNTLIRPDGTKKAYVRLTPDYDALDVANKIGII • S8 :MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAMKKVEDGNTLVFQVDIKANKHQIKQAVKDLYEVDVLAVNTLIRPNGTKKAYVRLTADHDALDIANKIGYI • S9 :MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLNTESAMKKIEDNNTLLFIVDLKANKRQIADAVKKLYDVTPLRVNTLIRPDGKKKAFVRLTPEVDALDIANKIGFI • S10 :MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNTLVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDALDVANKIGII • S11 :APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNILVFQVSMKANKYQIKKAVKELYEVDVLKVNTLVRPNGTKKAYVRLTADYDALDIANRIGYI • S12 :MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYPLTTDKAMKKIEENNTLTFIVDSRANKTEIKKAIRKLYQVKTVKVNTLIRPDGLKKAYIRLSASYDALDTANKMGLV Original sequence
MSA & Compact Set (cont’d)
[Figures: original distance matrix; original compact set tree.]
• A good MSA should preserve the compact sets as well
MSA & Compact Set(con’t) • S1’ :-----------------MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESA… • S2’ :---------------------------------------------------------------------------------SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDV… • S3’ :--------------------------------------------------------------------------------SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEV… • S4’ :--------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTES… • S5’ :----------------------MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAMKK… • S6’ :------------------------------------------------------------------------------MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEA… • S7’ : ----------MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTE… • S8’ :----------------------MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAMKK… • S9’ :------MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLN… • S10’ :--------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTE… • S11’ : -----------------------APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSETAMKK… • S12’ :MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYP… MSA by MSA1
MSA & Compact Set(con’t) • S1’ : ------------MAPSAPAKTA-KALDAKKKVVKGK-RTTHR--R--QV--R---TSVHFRRPVTLKTARQARFPRKSAPK-TSKMDHFR-IIQHPL… • S2’ : ---------------------------------------------------------------------------------------S--SIIDYPLVTEKAMDEMDFQNKLQFIVDID- AAK… • S3’ : ---------------------------------------------------------------------------------------SW-DVIKHPHVTEKAMNDMDFQNKLQFAVD-DRA… • S4’ : MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIKFP… • S5’ : -----------------MAPST-KATAAKKAVVKGT-NG--K--KALKV--R---TSASFRLPKTLKLARSPKYATKAVPH-YNRLDSYK-VIEQPITSET… • S6’ : -------------------------------------------------------------------------------------MDAF-DVIKTPIVSEKTMKLIEEENRLVFYVER-KATK… • S7’ : MAP-A--KAD-PS-KKSDPK-A-QAAKVAKAVKSG--STLKK--KSQKI--R---TKVTFHRPKTLKKDRNPKYPRISAPG-RNKLDQY-GILKYP… • S8’ : -----------------MAPST-KAASAKKAVVKGS-NG--S--KALKV--R---TSTTFRLPKTLKLTRAPKYARKAVPH-YQRLDNYK-VIVAPIASET… • S9’ : MPPKSSTKAE-PKASSAKTQVA-KAKSAKKAVVKGT-SS--K--TQRRI--R---TSVTFRRPKTLRLSRKPKYPRTSVPH-APRMDAYRTLVR… • S10’ : MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIKF… • S11’ : ------------------APSA-KATAAKKAVVKGT-NG--K--KALKV--R---TSATFRLPKTLKLARAPKYASKAVPH-YNRLDSYK-VIEQPITSET… • S12’ : ------MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVK-PSSNVSAIKNKWDAFR… MSA by MSA2
MSA & Compact Set (cont’d) [Figures: compact set tree by MSA1; distance matrix by MSA1.]
MSA & Compact Set (cont’d) [Figures: compact set tree by MSA2; distance matrix by MSA2.]
Measure of Compact Set Preservation
• How can we measure compact set preservation quantitatively?
• N1: number of original compact set relations
• N2: number of relations preserved after MSA
• Compact Set Preservation = N2 / N1
Measure of Compact Set Preservation (cont’d)
[Figures: distance matrix; compact set tree.]
• Original compact set relations: 1 2 4, 1 2 5, 1 3 4, 1 3 5, 2 3 4, 2 3 5, 1 2 3, 4 5 1, 4 5 2, 4 5 3
• N1 = 10
Measure of Compact Set Preservation (cont’d)
[Figures: distance matrix after MSA; compact set tree after MSA.]
• Original relations: 1 2 4, 1 2 5, 1 3 4, 1 3 5, 2 3 4, 2 3 5, 1 2 3, 4 5 1, 4 5 2, 4 5 3
• Relations after MSA (× marks a relation that is not preserved): 1 2 4, 1 2 5, 1 4 3 ×, 3 5 1 ×, 2 4 3 ×, 3 5 2 ×, 1 2 3, 1 4 5 ×, 2 4 5 ×, 3 5 4 ×
• N2 = 10 - 7 = 3
• Compact Set Preservation = 3/10
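The count N2 = 3 can be re-derived from the two lists of triples on the slide. The sketch assumes a reading that the slide does not spell out: in a triple "a b c", the first two objects lie together in a compact set that excludes the third, so the order of the first two does not matter but the third position does. The helper name `as_relations` is illustrative.

```python
def as_relations(triples):
    """Normalize 'a b c' triples to (unordered close pair, outsider)."""
    return {(frozenset((a, b)), c) for a, b, c in triples}


# Triples copied from the slides, in slide order.
original = [(1, 2, 4), (1, 2, 5), (1, 3, 4), (1, 3, 5), (2, 3, 4),
            (2, 3, 5), (1, 2, 3), (4, 5, 1), (4, 5, 2), (4, 5, 3)]
after_msa = [(1, 2, 4), (1, 2, 5), (1, 4, 3), (3, 5, 1), (2, 4, 3),
             (3, 5, 2), (1, 2, 3), (1, 4, 5), (2, 4, 5), (3, 5, 4)]

preserved = as_relations(original) & as_relations(after_msa)
n1 = len(as_relations(original))
n2 = len(preserved)
print(n2, n1, n2 / n1)
```

Under this reading the preserved relations are 1 2 3, 1 2 4 and 1 2 5, which agrees with the seven × marks on the slide.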
Why Pairwise Compact Sets?
• The evolutionary tree is the real judge
• An evolutionary tree tends to minimize the total length of its evolutionary edges (the tree size) derived from pairwise distances, which suggests that it is compact
• This holds in experiments
Compact Set Relation Preserved Rate for Evolutionary Tree
• Rate = (# of relations preserved in the evolutionary tree) / (# of compact set relations of the pairwise distance)
• Larger is better
Compact Set Evaluation Algorithm
• Step 1: Construct the original compact set tree T and the compact set tree after MSA T’ [1].
• Step 2: Traverse T’ in preorder to generate the compact set relations after MSA, R’, and mark the corresponding entries in the hash table H’.
• Step 3: Traverse T in preorder to generate the original compact set relations R, and check for each relation in R whether its entry is marked in the hash table H’.
• Total Time Complexity = O( ), where n is the number of sequences
• Reference:
• 1. E. Dekel, J. Hu and W. Ouyang, An optimal algorithm for finding compact sets, Inform. Process. Lett. 44 (1992) 285-289
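Steps 2 and 3 can be sketched as follows, taking the two trees from Step 1 as given. Several things here are assumptions rather than the authors' implementation: trees are written as nested Python tuples with sequence numbers as leaves, a relation (a, b, c) is generated for every pair a, b inside a compact set and every c outside it, and a Python `set` stands in for the hash table. The two example trees are the ones inferred from the 5-sequence example earlier in the deck.

```python
from itertools import combinations


def leaves(node):
    """Collect the leaf labels under a node of a nested-tuple tree."""
    if isinstance(node, tuple):
        collected = []
        for child in node:
            collected.extend(leaves(child))
        return collected
    return [node]


def relations(tree):
    """Preorder traversal that records (a, b, c): a, b in a compact set, c outside."""
    universe = set(leaves(tree))
    table = set()  # plays the role of the hash table

    def visit(node):
        if not isinstance(node, tuple):
            return
        members = leaves(node)
        outsiders = universe - set(members)  # empty at the root
        for a, b in combinations(sorted(members), 2):
            for c in outsiders:
                table.add((a, b, c))
        for child in node:
            visit(child)

    visit(tree)
    return table


def preservation(original_tree, msa_tree):
    """Return (N2, N1): preserved relations and original relations."""
    marked = relations(msa_tree)         # Step 2
    original = relations(original_tree)  # Step 3
    return sum(1 for r in original if r in marked), len(original)


# Trees inferred from the slide example (an assumption).
T = (((1, 2), 3), (4, 5))
T_after = (((1, 2), 4), (3, 5))
print(preservation(T, T_after))
```

This naive version recomputes the leaves of every subtree and enumerates all triples, so it is meant to illustrate the bookkeeping, not to match the time bound on the slide.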
Our Strategy for MSA
• Progressive alignment (Feng and Doolittle, 1987) with neighbor first (using the Minimal Spanning Tree (MST) Kruskal merging order)
• Set-to-set alignment. Once a gap, always a gap.
[Figure: Kruskal merging order tree over S1, S2, S3, S4 with merges numbered 1, 2, 3; the two merged groups are labeled set1 and set2.]
Alignment of S1 and S2:
S1:AACAGACTTA-
S2:-ACAGACTTAA
Alignment of S3 and S4:
S3:----ACAGACTCCA
S4:TTTAAAAGTC----
Set-to-set alignment of the two groups:
S1:---AACAGACTT-A-
S2:----ACAGACTT-AA
S3:----ACAGACTCCA-
S4:TTTAAAAGTC-----
Q: Why do we use the MST Kruskal order?
A1: It has a structure similar to the compact set tree. [Figure: MST order merge tree compared with the compact set tree.]
A2: The MST Kruskal order is easy to obtain.
Score function
[Figure: the alignment below, annotated with the score cases: match, mismatch, gap-open, gap-extended, begin-gap, end-gap.]
---AACAGACTT-A-
----ACAGAC---AA
----ACAGACTCCA-
TTTAAAAGTC-C---
Strategy of set-to-set alignment
Score(8, 8) = Max{ Score(7, 7) + (α8:β8), Score(7, 8) + (α8:G3), Score(8, 7) + (G2:β8) }
* (α8:β8) = (G,C) + (G,-) + (G,G) + (-,C) + (-,-) + (-,G) = (-10) + (-15) + (10) + (-15) + (0) + (-15) = -45
Here α8 and β8 are the 8th columns of setα and setβ, and G2 and G3 are all-gap columns of 2 and 3 rows respectively.
Time complexity of the setα to setβ alignment = O(sα * sβ * lα * lβ) = (2*3*8*8) in this example, where sα, sβ are the numbers of sequences in setα and setβ respectively, and lα, lβ are the lengths of the aligned sequences in setα and setβ respectively.
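The recurrence above can be sketched in a few lines. The scores used (match +10, mismatch -10, residue against gap -15, gap against gap 0) are read off the worked (α8:β8) example; treating every gap with the same flat penalty is a simplification assumed here, so the gap-open, gap-extended, begin-gap and end-gap cases of the score function are not modeled. The function names are illustrative.

```python
MATCH, MISMATCH, GAP = 10, -10, -15  # values taken from the worked example


def pair_score(x, y):
    """Score of one symbol from set alpha against one symbol from set beta."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def column_score(col_a, col_b):
    """Sum of all cross-set pair scores between two alignment columns."""
    return sum(pair_score(x, y) for x in col_a for y in col_b)


def set_to_set_score(set_a, set_b):
    """Best score of aligning two already-aligned sets, column against column."""
    la, lb = len(set_a[0]), len(set_b[0])
    cols_a = ["".join(row[i] for row in set_a) for i in range(la)]
    cols_b = ["".join(row[j] for row in set_b) for j in range(lb)]
    gap_a = "-" * len(set_a)  # all-gap column inserted into set alpha
    gap_b = "-" * len(set_b)  # all-gap column inserted into set beta
    score = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        score[i][0] = score[i - 1][0] + column_score(cols_a[i - 1], gap_b)
    for j in range(1, lb + 1):
        score[0][j] = score[0][j - 1] + column_score(gap_a, cols_b[j - 1])
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + column_score(cols_a[i - 1], gap_b),
                score[i][j - 1] + column_score(gap_a, cols_b[j - 1]),
            )
    return score[la][lb]


# The column pair from the slide: alpha8 = (G, -), beta8 = (C, -, G).
print(column_score("G-", "C-G"))
```

Each table cell evaluates column scores over sα * sβ symbol pairs and there are lα * lβ cells, which is where the stated O(sα * sβ * lα * lβ) cost comes from.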
Time Complexity of our strategy
• The worst case occurs when the binary merge tree is balanced.
• The total set-to-set time complexity is bounded by
• where l is the length of the resulting aligned sequences and n is the number of sequences.
• The worst-case time complexity = O(n^2 l^2)
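One way to arrive at the O(n^2 l^2) figure is sketched below. This is a reconstruction under stated assumptions, not necessarily the bound shown on the original slide: n is a power of two, the merge tree is perfectly balanced, every intermediate alignment has length at most l, and each merge of sets with s_α and s_β sequences costs s_α s_β l^2. At level j (counted from the leaves) there are n/2^j merges, each joining two sets of 2^{j-1} sequences.

```latex
\sum_{j=1}^{\log_2 n} \frac{n}{2^{j}} \left(2^{j-1}\right)^{2} l^{2}
  \;=\; n\, l^{2} \sum_{j=1}^{\log_2 n} 2^{\,j-2}
  \;=\; n\, l^{2} \cdot \frac{n-1}{2}
  \;=\; \frac{n(n-1)}{2}\, l^{2}
  \;=\; O\!\left(n^{2} l^{2}\right)
```

The total n(n-1)/2 equals the number of sequence pairs, which is expected because each pair of sequences is scored against each other in exactly one merge.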
Useful MSA Tools • GCG (Genetics Computer Group): PileUp • http://gcg.nhri.org.tw:8003/gcg-bin/seqweb.cgi • ClustalW • http://clustalw.genome.ad.jp/
Clustal W • Pairwise alignment • Calculate distance matrix • Construct the unrooted Neighbor-Joining (NJ) tree • Construct the rooted NJ tree • rooted at “mid-point” • Progressive alignment • Align following the rooted NJ tree • set-to-set alignment
SP Score Result • The ClustalW results and our results are better than GCG’s • Larger is better
Compact Set Relation Failure Rate Result • Failure rate = (# of relations not preserved) / (# of source compact set relations) • Smaller is better
Three-point Relative Scale Preserved Rate • For every triple of species A, B, C, we check whether the relative order of their pairwise distances in the original distance matrix is identical to that in the MSA distance matrix.
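A minimal sketch of this measure follows. It assumes that the "relative distance relation" of a triple means the ranking of its three pairwise distances from smallest to largest; the slide does not define it more precisely. Both matrices and the function names are hypothetical, and ties are resolved by the fixed pair order because Python's sort is stable.

```python
from itertools import combinations


def triple_order(dist, a, b, c):
    """Rank the three pairs of a triple from closest to farthest."""
    pairs = [(a, b), (a, c), (b, c)]
    return tuple(sorted(pairs, key=lambda p: dist[p[0]][p[1]]))


def three_point_preserved_rate(d_orig, d_msa):
    """Fraction of triples whose pairwise-distance ranking is unchanged."""
    triples = list(combinations(range(len(d_orig)), 3))
    same = sum(
        1 for t in triples
        if triple_order(d_orig, *t) == triple_order(d_msa, *t)
    )
    return same / len(triples)


# Hypothetical original distances and distances measured on an MSA.
d_orig = [
    [0, 1, 4, 6],
    [1, 0, 5, 7],
    [4, 5, 0, 2],
    [6, 7, 2, 0],
]
d_msa = [
    [0, 8, 4, 6],
    [8, 0, 5, 7],
    [4, 5, 0, 2],
    [6, 7, 2, 0],
]
print(three_point_preserved_rate(d_orig, d_msa))
```

In this example only the distance between objects 0 and 1 changes, which alters the ranking in the two triples that contain both of them and leaves the other two triples intact.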
I Believe Tree Only • One might still not believe that the original pairwise distance is a good judge • One might believe only the true evolutionary tree
Compact Set Relation Failure Rate • Take the 12 protein sequences as an example • Failure rate = (# of relations not preserved) / (# of source compact set relations) • [Table: failure rate by distance and MSA method.] • Smaller is better
Future Work
• Are our measurement and algorithms really good? Simulations and a Web service
• Does our MSA by set-to-set alignment satisfy some approximation property? Theoretical proof
• How can we reduce the running time? Hardwired dynamic programming, e.g. PARACEL http://www.paracel.com/