
Multiple Sequence Alignment Based on Compact Set

Chuan Yi Tang, Department of Computer Science, National Tsing Hua University.



  1. Multiple Sequence Alignment Based on Compact Set. Chuan Yi Tang, Department of Computer Science, National Tsing Hua University

  2. Multiple Sequence Alignment
     • Given a set of sequences, the MSA problem is to find an alignment of the sequences such that some objective function is minimized, e.g. the sum-of-pairs (SP) score.
     • Example input:
         S1: ATTCG
         S2: AGTCG
         S3: ATCAG
     • One alignment of these sequences:
         S’1: A T - T C - G
         S’2: A - G T C - G
         S’3: A T - - C A G
     • The three pairwise costs in this alignment are 2, 2 and 4, so Cost = 8.
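
The cost of 8 on this slide can be reproduced with a small sketch. It assumes a unit cost model, which the slide does not state explicitly: every column in which two aligned rows differ (a mismatch or a residue against a gap) costs 1, and a gap against a gap costs 0. The function names are illustrative only.

```python
from itertools import combinations

def pair_cost(row_a, row_b):
    """Unit-cost distance between two aligned rows (assumed cost model)."""
    assert len(row_a) == len(row_b), "aligned rows must have equal length"
    # A column costs 1 when the two symbols differ; gap/gap columns cost 0.
    return sum(1 for x, y in zip(row_a, row_b) if x != y)

def sp_cost(rows):
    """Sum-of-pairs cost: add up the pairwise costs over all pairs of rows."""
    return sum(pair_cost(a, b) for a, b in combinations(rows, 2))

# The alignment S'1, S'2, S'3 from the slide, written without spaces.
alignment = ["AT-TC-G", "A-GTC-G", "AT--CAG"]

for a, b in combinations(alignment, 2):
    print(a, b, pair_cost(a, b))
print("SP cost:", sp_cost(alignment))
```

Under this assumption the three pairwise costs come out as 2, 2 and 4, matching the numbers on the slide.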

  3. MSA with SP-Score: Exact Algorithm and Heuristics
     • k: number of sequences; n: length of the sequences
     • Exact (using dynamic programming)
       • O((2n)^k): D. Sankoff, Simultaneous solution of the RNA folding, alignment and protosequence problems, SIAM J. Appl. Math. (1985)
     • Heuristics
       • D.F. Feng, R.F. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol. 25, 351-360 (1987)
       • S.F. Altschul, D.J. Lipman, Trees, stars, and multiple biological sequence alignment, SIAM J. Appl. Math. (1989)
       • D.J. Lipman, S.F. Altschul, A tool for multiple sequence alignment, Proc. Nat. Acad. Sci. U.S.A. (1989)
       • S.C. Chan, A.K.C. Wong, D.K.Y. Chiu, A survey of multiple sequence comparison methods, Bull. Math. Biol. (1992)

  4. MSA with SP-Score: Complexity
     • L. Wang, T. Jiang (McMaster University, Hamilton, Ontario, Canada), On the complexity of multiple sequence alignment, J. Comput. Biol. 1(4):337-48 (Winter 1994). Studies the computational complexity of two popular problems in multiple sequence alignment:
       1. multiple alignment with SP-score => NP-complete (non-metric)
       2. multiple tree alignment => MAX SNP-hard
     • P. Bonizzoni, G. Della Vedova, The complexity of multiple sequence alignment with SP-score that is a metric, Theoretical Computer Science 259 (2001) 63-79:
       1. multiple alignment with SP-score => NP-complete (metric)

  5. MSA with SP-Score: Approximation
     • Approximation algorithms:
       • Performance ratio of 2-2/k: D. Gusfield, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bull. Math. Biol. (1993)
       • Performance ratio of 2-3/k: P. Pevzner, Multiple alignment, communication cost, and graph matching, SIAM J. Appl. Math. (1992)
       • Performance ratio of 2-l/k (assembling l-way alignments, l ≤ k): V. Bafna, E.L. Lawler and P. Pevzner, Approximation algorithms for multiple sequence alignment, Theor. Comput. Sci. (1997)
     • Polynomial Time Approximation Scheme (PTAS):
       • For MSA within a constant band that allows, on average, only a constant number of insertion and deletion gaps of arbitrary length per sequence: M. Li, B. Ma and L. Wang, Near optimal multiple alignment within a band in polynomial time, STOC 2000.

  6. Compact Set Definition
     • Let S be the set of n objects {S1, S2, S3, …, Sn} and let D(Si, Sj) denote the distance between Si and Sj in the distance matrix D.
     • Consider any subset C of S. If every distance between an element of C and an element outside C is larger than the longest distance between two elements of C, then C is called a compact set.
     • Properties:
       • The entire set S is a compact set.
       • Each set consisting of a single object is also a compact set.
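
The definition can be checked directly against a distance matrix. The sketch below is a naive test, not an efficient algorithm; the function name `is_compact` and the 4-object matrix are made up for illustration, and objects are indexed from 0.

```python
def is_compact(subset, dist):
    """Return True if `subset` (indices into `dist`) is a compact set.

    Compact means: the largest distance inside the subset is strictly
    smaller than the smallest distance from the subset to the outside.
    """
    inside = set(subset)
    outside = set(range(len(dist))) - inside
    # Singletons have no inside distance; treat their maximum as 0.
    max_inside = max(
        (dist[i][j] for i in inside for j in inside if i < j), default=0
    )
    # The whole set has no border edge; treat its minimum as infinity.
    min_border = min(
        (dist[i][j] for i in inside for j in outside), default=float("inf")
    )
    return max_inside < min_border

# Hypothetical symmetric distance matrix for four objects.
D = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]

print(is_compact({0, 1}, D))        # inside max 1, border min 5
print(is_compact({1, 2}, D))        # inside max 5, border min 1
print(is_compact({0, 1, 2, 3}, D))  # the entire set
```

With positive off-diagonal distances, the two properties on the slide (the entire set and every singleton are compact) fall out of the two default values.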

  7. Compact Set Example
     [Figure: distance matrix and graph over S1 to S6 showing Compact Set 1, Compact Set 2 and Compact Set 3. For Compact Set 3, the maximal inside edge is 10 and the minimal border edge is 11.]

  8. Compact Set Example (cont’d)
     • Compact sets are hierarchical: any two compact sets are either disjoint or nested, so they can be drawn as a tree.
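
One way to see the hierarchy is to enumerate all compact sets. The sketch below is not the optimal algorithm of Dekel, Hu and Ouyang cited later in the deck; it relies on the observation that every compact set appears as a component while edges are merged in increasing order of distance (Kruskal's order), and it simply tests each such component against the definition. The matrix is hypothetical and all names are illustrative.

```python
def compact_sets(dist):
    """Enumerate compact sets by testing every Kruskal merge component."""
    n = len(dist)

    def is_compact(c):
        out = set(range(n)) - c
        max_in = max((dist[i][j] for i in c for j in c if i < j), default=0)
        min_out = min((dist[i][j] for i in c for j in out), default=float("inf"))
        return max_in < min_out

    parent = list(range(n))
    members = {i: frozenset([i]) for i in range(n)}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Singletons are compact when off-diagonal distances are positive.
    found = [frozenset([i]) for i in range(n)]
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue
        parent[rj] = ri
        members[ri] = members[ri] | members.pop(rj)
        if is_compact(members[ri]):
            found.append(members[ri])
    return found

# Hypothetical symmetric distance matrix for four objects.
D = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
sets = compact_sets(D)
for c in sets:
    print(sorted(c))
```

For this matrix the non-trivial compact sets are {0, 1} and {2, 3}, both nested inside the entire set, which is the tree shape the slide refers to.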

  9. MSA & Compact Set • Consider 12 Protein sequences example: • S1 :MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESAMKKIEEHNTLVFIVSNDANKYQIKDAVHKLYNVQALKVNTLITPLQQKKAYVRLTADYDALDVANKIGVI • S2 :SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDVVESEYDVTVVDVNTQITPEAEKKATVKLSAEDDAQDVASRIGVF • S3 :SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEVADAVEEQYDVTVEQVNTQNTMDGEKKAVVRLSEDDDAQEVASRIGVF • S4 :MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNTLVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDALDVANKIGII • S5 :MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNTLVFKVSLKANKYQIKKAVKELYEVDVLSVNTLVRPNGTKKAYVRLTADFDALDIANRIGYI • S6 : MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEAIKQLFNAEVAEVNTNITPKGQKKAYIKLKDEYNAGEVAASLGIY • S7 :MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTESAMKKIEDNNTLVFIVDIKADKKKIKDAVKKMYDIQTKKVNTLIRPDGTKKAYVRLTPDYDALDVANKIGII • S8 :MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAMKKVEDGNTLVFQVDIKANKHQIKQAVKDLYEVDVLAVNTLIRPNGTKKAYVRLTADHDALDIANKIGYI • S9 :MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLNTESAMKKIEDNNTLLFIVDLKANKRQIADAVKKLYDVTPLRVNTLIRPDGKKKAFVRLTPEVDALDIANKIGFI • S10 :MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNTLVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDALDVANKIGII • S11 :APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNILVFQVSMKANKYQIKKAVKELYEVDVLKVNTLVRPNGTKKAYVRLTADYDALDIANRIGYI • S12 :MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYPLTTDKAMKKIEENNTLTFIVDSRANKTEIKKAIRKLYQVKTVKVNTLIRPDGLKKAYIRLSASYDALDTANKMGLV Original sequence

  10. MSA & Compact Set (cont’d)
      [Figure: original distance matrix and original compact set tree]
      • A good MSA should preserve the compact sets as well.

  11. MSA & Compact Set(con’t) • S1’ :-----------------MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESA… • S2’ :---------------------------------------------------------------------------------SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDV… • S3’ :--------------------------------------------------------------------------------SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEV… • S4’ :--------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTES… • S5’ :----------------------MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAMKK… • S6’ :------------------------------------------------------------------------------MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEA… • S7’ : ----------MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTE… • S8’ :----------------------MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAMKK… • S9’ :------MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLN… • S10’ :--------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTE… • S11’ : -----------------------APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSETAMKK… • S12’ :MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYP… MSA by MSA1

  12. MSA & Compact Set(con’t) • S1’ : ------------MAPSAPAKTA-KALDAKKKVVKGK-RTTHR--R--QV--R---TSVHFRRPVTLKTARQARFPRKSAPK-TSKMDHFR-IIQHPL… • S2’ : ---------------------------------------------------------------------------------------S--SIIDYPLVTEKAMDEMDFQNKLQFIVDID- AAK… • S3’ : ---------------------------------------------------------------------------------------SW-DVIKHPHVTEKAMNDMDFQNKLQFAVD-DRA… • S4’ : MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIKFP… • S5’ : -----------------MAPST-KATAAKKAVVKGT-NG--K--KALKV--R---TSASFRLPKTLKLARSPKYATKAVPH-YNRLDSYK-VIEQPITSET… • S6’ : -------------------------------------------------------------------------------------MDAF-DVIKTPIVSEKTMKLIEEENRLVFYVER-KATK… • S7’ : MAP-A--KAD-PS-KKSDPK-A-QAAKVAKAVKSG--STLKK--KSQKI--R---TKVTFHRPKTLKKDRNPKYPRISAPG-RNKLDQY-GILKYP… • S8’ : -----------------MAPST-KAASAKKAVVKGS-NG--S--KALKV--R---TSTTFRLPKTLKLTRAPKYARKAVPH-YQRLDNYK-VIVAPIASET… • S9’ : MPPKSSTKAE-PKASSAKTQVA-KAKSAKKAVVKGT-SS--K--TQRRI--R---TSVTFRRPKTLRLSRKPKYPRTSVPH-APRMDAYRTLVR… • S10’ : MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIKF… • S11’ : ------------------APSA-KATAAKKAVVKGT-NG--K--KALKV--R---TSATFRLPKTLKLARAPKYASKAVPH-YNRLDSYK-VIEQPITSET… • S12’ : ------MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVK-PSSNVSAIKNKWDAFR… MSA by MSA2

  13. MSA & Compact Set (cont’d)
      [Figure: distance matrix and compact set tree obtained from MSA1]

  14. MSA & Compact Set (cont’d)
      [Figure: distance matrix and compact set tree obtained from MSA2]

  15. Measure of Compact Set Preservation
      • How can we measure compact set preservation quantitatively?
      • N1: number of original compact set relations
      • N2: number of those relations preserved after MSA
      • Compact set preservation estimate = N2 / N1

  16. Measure of Compact Set Preservation (cont’d)
      [Figure: distance matrix and its compact set tree]
      • Original compact set relations: (1 2 4), (1 2 5), (1 3 4), (1 3 5), (2 3 4), (2 3 5), (1 2 3), (4 5 1), (4 5 2), (4 5 3)
      • N1 = 10

  17. Measure of Compact Set Preservation (cont’d)
      [Figure: distance matrix after MSA and compact set tree after MSA]
      • Relations before and after MSA (× marks a relation that is not preserved):
          Original    After MSA
          1 2 4       1 2 4
          1 2 5       1 2 5
          1 3 4       1 4 3    ×
          1 3 5       3 5 1    ×
          2 3 4       2 4 3    ×
          2 3 5       3 5 2    ×
          1 2 3       1 2 3
          4 5 1       1 4 5    ×
          4 5 2       2 4 5    ×
          4 5 3       3 5 4    ×
      • N2 = 10 - 7 = 3
      • Compact set preservation estimate = 3/10
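
The count on this slide can be checked mechanically. The sketch below reads each relation "a b c" as "a and b lie together in a compact set that excludes c"; that reading is an interpretation of the slides, not something they state, and the helper name `normalize` is illustrative. The two lists are copied from slides 16 and 17.

```python
def normalize(triples):
    """Treat the first two entries as an unordered pair, the third as the outsider."""
    return {(frozenset(t[:2]), t[2]) for t in triples}

original = [
    (1, 2, 4), (1, 2, 5), (1, 3, 4), (1, 3, 5), (2, 3, 4),
    (2, 3, 5), (1, 2, 3), (4, 5, 1), (4, 5, 2), (4, 5, 3),
]
after_msa = [
    (1, 2, 4), (1, 2, 5), (1, 4, 3), (3, 5, 1), (2, 4, 3),
    (3, 5, 2), (1, 2, 3), (1, 4, 5), (2, 4, 5), (3, 5, 4),
]

n1 = len(normalize(original))                         # original relations
n2 = len(normalize(original) & normalize(after_msa))  # relations that survive
print(f"N1 = {n1}, N2 = {n2}, preservation = {n2}/{n1}")
```

Only (1 2 4), (1 2 5) and (1 2 3) survive, which gives the 3/10 stated on the slide.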

  18. Why Pairwise Compact Sets?
      • The evolutionary tree is the real judge.
      • An evolutionary tree tends to minimize the total length of its evolutionary edges (the tree size) computed from pairwise distances, which suggests that it is compact.
      • This holds in our experiments.

  19. Compact Set Relation Preserved Rate for Evolutionary Tree
      • Rate = (number of relations preserved in the evolutionary tree) / (number of compact set relations of the pairwise distances)
      • Larger is better.

  20. Compact Set Evaluation Algorithm
      • Step 1: Construct the original compact set tree T and the compact set tree after MSA, T’ [1].
      • Step 2: Traverse T’ in preorder to generate the compact set relations after MSA, R’, and mark the corresponding entries in the hash table H’.
      • Step 3: Traverse T in preorder to generate the original compact set relations R, and check for each relation in R whether its entry is marked in H’.
      • Total time complexity = O( ), where n is the number of sequences
      • Reference:
        1. E. Dekel, J. Hu and W. Ouyang, An optimal algorithm for finding compact sets, Inform. Process. Lett. 44 (1992) 285-289
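
Steps 2 and 3 can be sketched as follows, assuming the two compact set trees are already available (Step 1 is not implemented here). The trees are written as nested tuples with integer leaves, and a Python `set` stands in for the hash table H’. The two example trees are a reconstruction chosen so that they generate the relation lists shown on slides 16 and 17; they are not taken from the figures. The code recomputes leaf sets at every node, so it is a readable sketch rather than an efficient implementation.

```python
from itertools import combinations

def leaves(node):
    """Set of leaf labels below a node (a leaf is an int, an inner node a tuple)."""
    if isinstance(node, int):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def relations(tree):
    """Preorder traversal that emits (a, b, c): a, b inside a compact set, c outside."""
    all_leaves = leaves(tree)
    table = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, int):
            continue
        inside = leaves(node)
        for a, b in combinations(sorted(inside), 2):
            for c in all_leaves - inside:
                table.add((a, b, c))
        stack.extend(reversed(node))  # visit children left to right
    return table

def preservation(tree_before, tree_after):
    marked = relations(tree_after)      # Step 2: mark relations of T'
    original = relations(tree_before)   # Step 3: generate relations of T
    kept = sum(1 for r in original if r in marked)
    return kept, len(original)

T = (((1, 2), 3), (4, 5))        # reconstructed original tree (assumption)
T_after = (((1, 2), 4), (3, 5))  # reconstructed tree after MSA (assumption)
print(preservation(T, T_after))
```

On these two trees the function reports 3 preserved relations out of 10, consistent with the worked example earlier in the deck.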

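  21. Our Strategy for MSA
      • Progressive alignment (Feng and Doolittle, 1987) with nearest neighbors first, using the Kruskal merging order of the Minimal Spanning Tree (MST)
      • Set-to-set alignment: once a gap, always a gap.
      • Example (Kruskal merging order tree over S1, S2, S3, S4 with merges numbered 1 to 3; the two aligned subsets are labelled set1 and set2 in the figure):
          Aligned subset {S1, S2}:
            S1: AACAGACTTA-
            S2: -ACAGACTTAA
          Aligned subset {S3, S4}:
            S3: ----ACAGACTCCA
            S4: TTTAAAAGTC----
          After the final set-to-set merge:
            S1: ---AACAGACTT-A-
            S2: ----ACAGACTT-AA
            S3: ----ACAGACTCCA-
            S4: TTTAAAAGTC-----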
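
The Kruskal merging order itself is easy to obtain from a distance matrix. The sketch below uses a small union-find and returns the pairs of groups in the order they would be aligned set-to-set; the distance values are hypothetical, the function name is illustrative, and indices 0 to 3 stand for S1 to S4.

```python
def kruskal_merge_order(dist):
    """Return the list of (group, group) merges in Kruskal (MST) order."""
    n = len(dist)
    parent = list(range(n))
    group = {i: (i,) for i in range(n)}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # The two current groups are the sets to align at this step.
            merges.append((group[ri], group[rj]))
            parent[rj] = ri
            group[ri] = group[ri] + group.pop(rj)
    return merges

# Hypothetical pairwise distances between S1..S4 (indices 0..3).
D = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
for step, (a, b) in enumerate(kruskal_merge_order(D), start=1):
    print(step, a, b)
```

With these distances S1 and S2 are merged first, S3 and S4 second, and the two resulting sets last, which mirrors the three-step tree in the example.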

  22. Q: Why do we use the MST Kruskal order?
      • A1: Its merge tree has a structure similar to the compact set tree. [Figure: MST order merge tree next to the compact set tree]
      • A2: The MST Kruskal order is easy to obtain.

  23. Score function
      • Scored events (labelled on the example alignment in the figure): match, mismatch, gap-open, gap-extended, begin-gap, end-gap
      • Example alignment:
          ---AACAGACTT-A-
          ----ACAGAC---AA
          ----ACAGACTCCA-
          TTTAAAAGTC-C---

  24. Strategy of set-to-set alignment
      • Score(8, 8) = Max{ Score(7, 7) + (α8:β8), Score(7, 8) + (α8:G3), Score(8, 7) + (G2:β8) }
      • (α8:β8) = (G,C)+(G,-)+(G,G)+(-,C)+(-,-)+(-,G) = (-10)+(-15)+(10)+(-15)+(0)+(-15) = -45
      • Time complexity of aligning set α to set β = O(sα·sβ·lα·lβ), here 2·3·8·8, where sα, sβ are the numbers of sequences in set α and set β respectively, and lα, lβ are the lengths of the resulting sequences in set α and set β respectively.
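
The recurrence and the -45 column score can be sketched in a few lines. The scoring values (match 10, mismatch -10, residue against gap -15, gap against gap 0) are read off the worked example on this slide. The sketch makes two simplifying assumptions: G2 and G3 are taken to be all-gap columns of 2 and 3 rows, and a single linear gap penalty replaces the separate gap-open, gap-extended, begin-gap and end-gap scores of the previous slide. All function names are illustrative.

```python
MATCH, MISMATCH, GAP = 10, -10, -15  # values from the slide's worked example

def pair_score(x, y):
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH

def column_score(col_a, col_b):
    """Sum of scores over every cross pair between two alignment columns."""
    return sum(pair_score(x, y) for x in col_a for y in col_b)

def set_to_set_score(set_a, set_b):
    """Best score for aligning two already-aligned sets column by column."""
    la, lb = len(set_a[0]), len(set_b[0])
    gaps_a, gaps_b = "-" * len(set_a), "-" * len(set_b)
    cols_a = ["".join(s[i] for s in set_a) for i in range(la)]
    cols_b = ["".join(s[j] for s in set_b) for j in range(lb)]
    score = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        score[i][0] = score[i - 1][0] + column_score(cols_a[i - 1], gaps_b)
    for j in range(1, lb + 1):
        score[0][j] = score[0][j - 1] + column_score(gaps_a, cols_b[j - 1])
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + column_score(cols_a[i - 1], gaps_b),  # column of a vs gaps
                score[i][j - 1] + column_score(gaps_a, cols_b[j - 1]),  # gaps vs column of b
            )
    return score[la][lb]

# Column 8 of the slide: set alpha contributes (G, -), set beta contributes (C, -, G).
print(column_score("G-", "C-G"))
print(set_to_set_score(["AC"], ["A"]))
```

Each cell evaluates three column scores of sα·sβ pairs each, and there are lα·lβ cells, which is where the sα·sβ·lα·lβ term comes from.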

  25. Time Complexity of Our Strategy
      • The worst case occurs when the binary merge tree is balanced.
      • The total set-to-set time complexity is bounded in terms of l and n, where l is the length of the resulting sequences and n is the number of sequences.
      • Worst-case time complexity = O(n^2 l^2)

  26. Useful MSA Tools
      • GCG (Genetics Computer Group): PileUp
        • http://gcg.nhri.org.tw:8003/gcg-bin/seqweb.cgi
      • ClustalW
        • http://clustalw.genome.ad.jp/

  27. Clustal W
      • Pairwise alignment
        • Calculate the distance matrix
      • Construct the unrooted Neighbor-Joining (NJ) tree
      • Construct the rooted NJ tree
        • Rooted at the “mid-point”
      • Progressive alignment
        • Align following the rooted NJ tree
        • Set-to-set alignment

  28. Experiment

  29. SP Score Result
      • ClustalW and our method give better results than GCG.
      • Larger is better.

  30. Compact Set Relation Failure Rate Result
      • Failure rate = (number of relations not preserved) / (number of source compact set relations)
      • Smaller is better.

  31. Three-point Relative Scale Preserved Rate
      • For every three species A, B and C, we check whether their relative distance relation is identical in the original distance matrix and in the MSA distance matrix.
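
The slide does not spell out what "relative distance relation" means, so the sketch below fixes one plausible reading as an assumption: for a triple (A, B, C), the relation is the ordering of the three pairwise distances d(A,B), d(A,C) and d(B,C), and it is preserved when that ordering is the same in both matrices. Both matrices and all names are hypothetical.

```python
from itertools import combinations

def triple_pattern(dist, a, b, c):
    """Sign pattern describing how the three pairwise distances are ordered."""
    d = (dist[a][b], dist[a][c], dist[b][c])
    # For each pair of distances record -1, 0 or 1 (smaller, equal, larger).
    return tuple((x > y) - (x < y) for x, y in combinations(d, 2))

def three_point_preserved_rate(d_orig, d_msa):
    """Fraction of triples whose distance ordering is identical in both matrices."""
    triples = list(combinations(range(len(d_orig)), 3))
    same = sum(
        triple_pattern(d_orig, *t) == triple_pattern(d_msa, *t) for t in triples
    )
    return same / len(triples)

# Hypothetical original distances and distances measured on an MSA.
D_orig = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
D_msa = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 9],
    [7, 8, 9, 0],
]
print(three_point_preserved_rate(D_orig, D_msa))
```

In this example only the distance between the last two objects changes, which flips the ordering in the two triples that contain both of them, so half of the four triples are preserved.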

  32. I Believe the Tree Only
      • One might still not believe that the original pairwise distance is a good judge.
      • One might believe only the true evolutionary tree.

  33. Compact Set Relation Failure Rate
      • Taking the 12 protein sequences as an example
      • Failure rate = (number of relations not preserved) / (number of source compact set relations)
      • [Chart axes: Distance, MSA_Method]
      • Smaller is better.

  34. Future Work
      • Are our measurement and algorithms really good? (Simulations and Web service)
      • Does our MSA by set-to-set alignment satisfy some approximation property? (Theoretical proof)
      • How can we reduce the running time? (Hardwired dynamic programming, e.g. PARACEL, http://www.paracel.com/)
