Sequence Comparison
• Introduction
• Comparison
• Homology -- Analogy
• Identity -- Similarity
• Pairwise -- Multiple
• Scoring Matrices
• Gap -- indel
• Global -- Local
• Manual alignment, dot plot
• visual inspection
• Dynamic programming
• Needleman-Wunsch
• exhaustive global alignment
• Smith-Waterman
• exhaustive local alignment
• Multiple alignment
• Database search
• BLAST
• FASTA
Sequence Comparison Multiple alignment (Multiple sequence alignment: MSA)
Sequence Comparison Multiple alignment Multiple sequence alignment - Computational complexity
[Figure: alignment of short example sequences; labels: V S N S, _ S N A, A N S]
Sequence Comparison Multiple alignment Multiple sequence alignment - Computational complexity
Alignment of protein sequences with 200 amino acids using dynamic programming
# of sequences   CPU time (approx.)
2                1 sec
4                10^4 sec (2.8 hours)
5                10^6 sec (11.6 days)
6                10^8 sec (3.2 years)
7                10^10 sec (317 years)
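Read as powers of ten, the CPU times on this slide grow by a factor of about 100 for every added sequence. The sketch below simply extrapolates that pattern; the function names and the 1-second baseline for two sequences are assumptions taken from the slide's figures, not a measured cost model.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def estimated_cpu_seconds(n_sequences: int) -> int:
    """Extrapolate the slide's pattern: 1 s for 2 sequences, x100 per extra sequence."""
    if n_sequences < 2:
        raise ValueError("need at least two sequences")
    return 100 ** (n_sequences - 2)


def humanize(seconds: float) -> str:
    """Express a duration in the largest unit (years, days, hours) that fits."""
    for unit, size in (("years", SECONDS_PER_YEAR), ("days", 86400), ("hours", 3600)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.0f} sec"


for n in range(2, 8):
    print(n, "sequences:", humanize(estimated_cpu_seconds(n)))
```

The exponential growth is the reason exhaustive dynamic programming is not used for more than a handful of sequences.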
Sequence Comparison Multiple alignment Approximate methods for MSA
• Multidimensional dynamic programming (MSA, Lipman 1988)
• Progressive alignments (ClustalW, Higgins 1996; PileUp, Genetics Computer Group (GCG))
• Local alignments (e.g. DiAlign, Morgenstern 1996; lots of others)
• Iterative methods (e.g. PRRP, Gotoh 1996)
• Statistical methods (e.g. Bayesian Hidden Markov Models)
Sequence Comparison Multiple alignment Multiple sequence alignment - Programs
[Figure: MSA programs grouped by approach. Category labels: Progressive, Multidimensional dynamic programming, Tree based, Non tree based, Iterative, GAs, HMMs. Programs shown: Clustal, T-Coffee, DCA, MSA, Combalign, Dalign, OMA, Interalign, Prrp, GA, SAGA, Sam, HMMER]
Sequence Comparison Multiple alignment Multiple sequence alignment - Computational complexity
Program    Seq type   Alignment      Method                  Comment
ClustalW   Prot/DNA   Global         Progressive             No format limitation; runs on Windows too
PileUp     Prot/DNA   Global         Progressive             Limited by the format; UNIX based
MultAlin   Prot/DNA   Global         Progressive/Iterative   Limited by the format
T-COFFEE   Prot/DNA   Global/local   Progressive             Can be slow
Sequence Comparison Multiple alignment Multiple sequence alignment – ClustalW (X)
• ClustalW uses a progressive algorithm: instead of aligning all sequences at once, it adds them step by step.
• Pairwise comparison of all sequences to align.
• "Clustering by similarity" resulting in a dendrogram.
• Following the dendrogram topology, ClustalW aligns the most similar pairs first.
• Each alignment is replaced by a consensus sequence and further aligned as if it were a single sequence.
• ClustalW treats multiple alignments like single sequences and aligns them progressively two-by-two.
• Thus, alignment errors early in the procedure propagate throughout the whole MSA.
Sequence Comparison Multiple alignment Multiple sequence alignment – ClustalW (X)
Principle: Pairwise Alignment (1 + 2, 1 + 3, 1 + 4, 2 + 3, 2 + 4, 3 + 4) → Guide Tree → Multiple Alignment by adding sequences
Sequence Comparison Multiple alignment Multiple sequence alignment – ClustalW (X)
Pairwise comparison of all sequences: 1 : 2, 1 : 3, 1 : 4, 1 : 5, 2 : 3, 2 : 4, 2 : 5, 3 : 4, 3 : 5, 4 : 5
Similarity score of every pair → distance score of every pair
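The all-against-all comparison can be sketched as follows. This is a simplified illustration, not ClustalW's actual scoring: it assumes the sequences are already the same length (so no pairwise alignment step is needed), uses fractional identity as the similarity score, and takes distance = 1 - similarity. The toy sequences and function names are hypothetical.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length in this sketch")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_distances(seqs: dict) -> dict:
    """Distance (1 - identity) for every unordered pair of sequence names."""
    return {
        (i, j): 1.0 - identity(seqs[i], seqs[j])
        for i, j in combinations(seqs, 2)
    }


# Hypothetical toy data: five sequences give the ten pairs listed on the slide.
toy = {
    "1": "GTCCGCAG",
    "2": "GTCCGCAA",
    "3": "GTTCGCAA",
    "4": "ATTCGTAA",
    "5": "ATTAGTAA",
}
distances = pairwise_distances(toy)
for pair, d in distances.items():
    print(pair, round(d, 3))
```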
Sequence Comparison Multiple alignment Multiple sequence alignment – ClustalW (X)
Distance Matrix: displays distances of all sequence pairs (sequences 1 to 5 against sequences 1 to 5).
[Figure: guide tree with leaves 1, 5, 2, 3, 4]
Sequence Comparison Multiple alignment Multiple sequence alignment – ClustalW (X)
[Figure: guide tree with leaves 1, 5, 2, 3, 4]
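One simple way to turn a distance matrix into a guide tree is average-linkage clustering (UPGMA-style): repeatedly merge the two closest clusters. The sketch below illustrates that idea only; it is an assumption that this resembles the clustering used here, and ClustalW's own tree-building method may differ. The leaf names and distances are hypothetical.

```python
from itertools import combinations


def average_distance(leaves_a, leaves_b, dist):
    """Mean leaf-to-leaf distance between two clusters."""
    total = sum(dist[frozenset((a, b))] for a in leaves_a for b in leaves_b)
    return total / (len(leaves_a) * len(leaves_b))


def build_guide_tree(names, dist):
    """Return a nested-tuple tree built by repeatedly merging the closest clusters."""
    # Each cluster is (tree, leaves): tree is a leaf name or a nested pair.
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(clusters[p[0]][1], clusters[p[1]][1], dist),
        )
        (tree_i, leaves_i), (tree_j, leaves_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), leaves_i + leaves_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical distances: A/B are close, C/D are close, the two groups are far apart.
toy_dist = {
    frozenset(("A", "B")): 0.1,
    frozenset(("C", "D")): 0.2,
    frozenset(("A", "C")): 0.6,
    frozenset(("A", "D")): 0.6,
    frozenset(("B", "C")): 0.6,
    frozenset(("B", "D")): 0.6,
}
tree = build_guide_tree(["A", "B", "C", "D"], toy_dist)
print(tree)
```

The nested tuples give the order in which a progressive aligner would join sequences: the innermost pairs first, then the groups.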
Sequence Comparison Multiple alignment Multiple sequence alignment – ClustalW (X)
Existing alignment and next sequence to add:
GTCCG-CAGG
TT-CGCC-GG
TTACTTCCAGG
. . . and new gaps are inserted:
GTCCG--CAGG
TT-CGC-C-GG
TTACTTCCAGG
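When a sequence is added to an existing alignment, a new gap is inserted as a whole column, so every row already in the alignment receives it at the same position. A minimal sketch of that step, using the two-row alignment from the slide (the function name and the choice of column index are illustrative assumptions):

```python
def insert_gap_column(alignment, column):
    """Insert a gap character at the same column in every row of an alignment."""
    return [row[:column] + "-" + row[column:] for row in alignment]


before = ["GTCCG-CAGG", "TT-CGCC-GG"]
after = insert_gap_column(before, 6)
for row in after + ["TTACTTCCAGG"]:
    print(row)
```

Gaps placed in earlier steps are never removed later, which is one reason early alignment errors persist in the final result.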
Sequence Comparison Multiple alignment Multiple sequence alignment – ClustalW (X)
Two existing alignments:
GTCCG--CAGG
TT-CGC-C-GG
TTACTTCCAGG

ATCT--CAAT
CTGTCCCTAG
Combined alignment:
GTCCG--CAGG
TT-CGC-C-GG
TTACTTCCAGG
ATC-T--CAAT
CTG-TCCCTAG
Sequence Comparison Multiple alignment Multiple sequence alignment – ClustalW (X)
[Figure annotation: core, loops]
CLUSTAL W (1.74) multiple sequence alignment
sp|P20472|PRVA_HUMAN EDIKKAVGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFILKG
sp|P32848|PRVA_MOUSE EDIKKAIGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSILKG
sp|P18087|PRVA_RANCA GDISKAVEAFAAPDS--FNHKKFFEMCG------LKSKGPDVMKQVFGILDQDRSGFIEEDELCLMLKG
sp|P02629|PRVA_LATCH EDIDKALNTFKEAGS--FDHHKFFNLVG------LKGKPDDTLKEVFGILDQDKSGYIEEEELKFVLKG
sp|P02616|PRVB_AMPME KDIEAALSSVKAAES--FNYKTFFTKCG------LAGKPTDQVKKVFDILDQDKSGYIEEDELQLFLKN
sp|P51879|ONCO_MOUSE DDIAAALQECQDPDT--FEPQKFFQTSG------LSKMSASQLKDIFQFIDNDQSGYLDEDELKYFLQR
sp|P56503|PRVB_MERBI ADVAAALKACEAADS--FNYKAFFAKVG------LTAKSADDIKKAFFVIDQDKSGFIEEDELKLFLQV
sp|P59747|PRVB_SCOJP AEVTAALDGCKAAGS--FDHKKFFKACG------LSGKSTDEVKKAFAIIDQDKSGFIEEEELKLFLQN
sp|P02620|PRVB_MERME ADITAALAACKAEGS--FKHGEFFTKIG------LKGKSAADIKKVFGIIDQDKSDFVEEDELKLFLQN
sp|P02630|PRVA_RAJCL ADITKALEQCAAG----FHHTAFFKASG------LSKKSDAELAEIFNVLDGDQSGYIEVEELKNFLKC
sp|P02586|TPCS_RABIT EELDAIIEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEIFR-
:: : :. *: : . * .:* : ..::: :** .::
* A star indicates an entirely conserved column.
: A colon indicates columns where all residues have roughly the same size and hydropathy.
. A period indicates columns where the size or the hydropathy has been preserved in the course of evolution.
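The star marker is the easiest of the three consensus symbols to reproduce: a column gets a star when every row carries the same residue. The sketch below covers only that case (the colon and period markers depend on residue groupings that are not spelled out here). It uses a small five-row DNA example, and the function name is a hypothetical one.

```python
def star_line(rows):
    """Return a marker line with '*' under each fully conserved, gap-free column."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return "".join(
        "*" if len(set(column)) == 1 and "-" not in column else " "
        for column in zip(*rows)
    )


rows = [
    "GTCCG--CAGG",
    "TT-CGC-C-GG",
    "TTACTTCCAGG",
    "ATC-T--CAAT",
    "CTG-TCCCTAG",
]
for row in rows:
    print(row)
print(star_line(rows))
```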
Sequence Comparison Multiple alignment Multiple sequence alignment – ClustalW (X)
>hso
MNTWKEAIGQEKQQPYFQHILQQVQQARQSGRTIYPPQEEVFSAFRLTEFDQVRVVILGQDPYHGV
NQAHGLAFSVKPGIAPPPSLVNIYKELSTDIMGFQTPSHGYLVGWAKQGVLLLNTVLTVEQGLAHSHANF
GWETFTDRVIHVLNEQRDHLVFLLWGSHAQKKGQFIDRTKHCVLTSPHPSPLSAHRGFFGCRHFSKTNQY
LRHHNLTEINWQLPMTI
>pmu
MKTWKDVIGTEKTQPYFKHILDQVHQARASGKIVYPPPQEVFSAFQLTEFEAVKVVIIGQDPYHGPNQAH
GLAFSVKPGVVPPPSLMNMYKELTQDIEGFQIPNHGYLVPWAEQGVLLLNTVLTVEQGKAHSHASFGWET
FTDRVIAALNAQREKLVFLLWGSHAQKKGQFIDRQKHCVFTAPHPSPLSAHRGFLGCRHFSKTNAYLMAQ
GLSPIQWQLASL
>hdu
MNSWTEAIGEEKVQPYFQQLLQQVYQARASGKIIYPPQHEVFSAFALTDFKAVKVVILGQDPYHGPNQAH
GLAFSVKPSVVPPPSLVNIYKELAQDIAGFQVPSHGYLIDWAKQGVLLLNTVLTVQQGMAHSHATLGWEI
FTDKVIAQLNDHRENLVFLLWGSHAQKKGQFINRSRHCVLTAPHPSPLSAHRGFFGCQHFSKANAYLQSK
GIATINWQLPLVV
>apl
MNNWTEALGEEKQQPYFQHILQQVHQERMNGVTVFPPQKEVFSAFALTEFKDVKVVILGQDPYHGPNQAH
GLAFSVKPPVAPPPSLVNMYKELAQDVEGFQIPNHGYLVDWAKQGVLLLNTVLTVRQGQAHSHANFGWEI
FTDKVIAQLNQHRENLVFLLWGSHAQKKGQFIDRSRHCVLTAPHPSPLSAYRGFFGCKHFSKTNRYLLSK
GIAPINWQLRLEIDY
>hin
MKNWTDVIGTEKAQPYFQHTLQQVHLARASGKTIYPPQEDVFNAFKYTAFEDVKVVILGQDPYHGPNQAH
GLAFSVKPEVAIPPSLLNIYKELTQDISGFQMPSNGYLVKWAEQGVLLLNTVLTVERGMAHSHANLGWER
FTDKVIAVLNEHREKLVFLLWGSHAQKKGQMIDRTRHLVLTAPHPSPLSAHRGFFGCRHFSKTNSYLESH
GIKPIDWQI
>sfl
MANELTWHDVLAEEKQQPYFLNTLQTVASERQSGVTIYPPQKDVFNAFRFTELGDVKVVILGQDPYHGPG
QAHGLAFSVRPGIAIPPSLLNMYKELENTIPGFTRPNHGYLESWARQGVLLLNTVLTVRAGQAHSHASLG
WETFTDKVISLINQHREGVVFLLWGSHAQKKGAIIDKQRHHVLKAPHPSPLSAHRGFFGCNHFVLANQWL
EQRGETPIDWMPVLPAECE
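The input shown on this slide is in FASTA format: a header line starting with ">" followed by one or more lines of sequence. A minimal reader could look like the sketch below; the function name is hypothetical, and it takes only the first word after ">" as the sequence name.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {key: "".join(parts) for key, parts in records.items()}


example = """>hin
MKNWTDVIG
TEKAQ
>sfl
MANELTW
"""
sequences = parse_fasta(example)
for key, seq in sequences.items():
    print(key, len(seq), seq)
```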
Sequence Comparison Multiple alignment Multiple sequence alignment – ClustalW (X)
Clustal file format: vibrio.aln
CLUSTAL X (1.81) multiple sequence alignment
hdu ----------------------MN---SWTEAIGEEKVQPYFQQLLQQVYQARASGKIIY
apl ----------------------MN---NWTEALGEEKQQPYFQHILQQVHQERMNGVTVF
hso ----------------------MN---TWKEAIGQEKQQPYFQHILQQVQQARQSGRTIY
pmu ----------------------MK---TWKDVIGTEKTQPYFKHILDQVHQARASGKIVY
hin ----------------------MK---NWTDVIGTEKAQPYFQHTLQQVHLARASGKTIY
sfl ----------------------MANELTWHDVLAEEKQQPYFLNTLQTVASERQSGVTIY
eco -----------------------ANELTWHDVLAEEKQQPHFLNTLQTVASERQSGVTIY
sen ----------------------MATELTWHDVLADEKQQPYFINTLHTVAGERQSGITVY
vvu ----------------------MTQQLTWHDVIGAEKEQSYFQQTLNFVEAERQAGKVIY
vpa ----------------------MNQSPTWHDVIGEEKKQSYFVDTLNFVEAERAAGKAIY
vch ----------------------MSESLTWHDVIGNEKQQAYFQQTLQFVESQRQAGKVIY
ype ----------------------MSPSLTWHDVIGQEKEQPYFKDTLAYVAAERRAGKTIY
vfi ----------------------MA--LTWNSIISAEKKKAYYQSMSEKIDAQRSLGKSIF
vsa ----------------------MN--TSWNDILETEKEKPYYQEMMTYINEARSQGKKIF
son --------------------------MTWPAFIDHQRTQPYYQQLIAFVNQERQVGKVIY
cbl --------------------MPK---LTWQLLLSQEKNLPYFKNIFTILNQQKKSGKIIY
bap --------------------MDNRTLLNWSSILKNEKKKYYFINIINHLFFERQK-KMIF
cbu -------------------MTTMAETQTWQTVLGEEKQEPYFQEILDFVKKERKAGKIIY
dra --MTDQPDLFGLAPDAPRPIIPANLPEDWQEALLPEFSAPYFHELTDFLRQERKE-YTIY
xax --MTE-------------GEGRIQLEPSWKARVGDWLLRPQMRELSAFLRQRKAAGARVF
xca --MTE-------------GEGRIQLEPSWKARVGEWLLQPQMQELSAFLRQRKAANARVF
xfa --MNEQGKAINSS-----AESRIQLESSWKAHVGNWLLRPEMRDLSSFLRARKVAGVSVY
pfl MTMTA--------------DDRIKLEPSWKEALRAEFDQPYMTELRTFLQQERAAGKEIY
psy --MTS--------------DDRIKLEPSWKEALRDEFEQPYMAQLREFLRQEHAAGKEIY
ppu --MTD--------------DDRIKLEPSWKAALRGEFDQPYMHQLREFLRGEYAAGKEIY
pae --MTDN-------------DDRIKLEASWKEALREEFDKPYMKQLGEFLRQEKAAGKAIF
avi --MGRV-------------EDRVRLEASWKEALHDEFEKPYMQELSDFLRREKAAGKEIY
mde --MQPN-------------GKHVQLCESWMQQIGQEFEQPYMAELKAFLLREKKAGKTIY
* : : ::
Clustal file format: vibrio.dnd
( hso:0.11940, ( hdu:0.08584, apl:0.08905) :0.03531) :0.00478,
pmu:0.11739) :0.00668,
hin:0.10800) :0.04106,
( ( ( sfl:0.00482, eco:0.00833) :0.03744, sen:0.05007) :0.11285,
( ype:0.12645, ( ( vvu:0.07310, vpa:0.07734) :0.03829, vch:0.09446) :0.02842) :0.00533) :0.01680) :0.01604,
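The .dnd file stores the guide tree as nested parentheses, with each name followed by a colon and its branch length. As a rough sketch, the leaf names and their branch lengths can be pulled out with a regular expression even from a fragment of the tree; this is not a full tree parser, and the pattern and function name are assumptions made for illustration.

```python
import re

# A leaf is a name (starting with a letter or underscore) followed by ':' and a number.
# Internal branch lengths follow a ')' instead of a name, so they are not matched.
LEAF = re.compile(r"([A-Za-z_]\w*)\s*:\s*(\d+(?:\.\d+)?)")


def leaf_branch_lengths(tree_text: str) -> dict:
    """Map each leaf name in the tree text to its branch length."""
    return {name: float(length) for name, length in LEAF.findall(tree_text)}


fragment = "( ( sfl:0.00482, eco:0.00833) :0.03744, sen:0.05007) :0.11285,"
lengths = leaf_branch_lengths(fragment)
for name, length in lengths.items():
    print(name, length)
```

Short branch lengths, such as those of sfl and eco, indicate sequences that are nearly identical and are therefore joined first.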
Sequence Comparison Multiple alignment Multiple sequence alignment – T-Coffee
• T-Coffee uses a principle similar to that of ClustalW.
• Yields more accurate alignments at the cost of computing time.
• Builds a progressive alignment as ClustalW does, but
• Creates a library containing a complete collection of global (ClustalW) and local (Lalign) alignments and thus
• Compares segments across the entire data set
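The library idea can be sketched as a table of residue pairs collected from many pairwise alignments, where a pair that is supported by several alignments accumulates more weight. The code below is a simplified illustration under stated assumptions: each alignment is weighted by its percent identity, local alignments carry the start offsets of their segments, and all names and the input tuple layout are hypothetical rather than T-Coffee's actual data structures.

```python
from collections import defaultdict


def aligned_pairs(row_a, row_b, start_a=0, start_b=0):
    """Yield (i, j) residue index pairs that sit in the same column of two gapped rows."""
    i, j = start_a, start_b
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            yield i, j
        if x != "-":
            i += 1
        if y != "-":
            j += 1


def percent_identity(row_a, row_b):
    """Percent of gap-free columns in which both rows carry the same residue."""
    columns = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def build_library(alignments):
    """alignments: iterable of (name_a, start_a, row_a, name_b, start_b, row_b)."""
    library = defaultdict(float)
    for name_a, start_a, row_a, name_b, start_b, row_b in alignments:
        weight = percent_identity(row_a, row_b)
        for i, j in aligned_pairs(row_a, row_b, start_a, start_b):
            library[(name_a, i, name_b, j)] += weight
    return dict(library)


library = build_library([
    ("s1", 0, "GTCCG-CAGG", "s2", 0, "TT-CGCC-GG"),  # global alignment
    ("s1", 7, "GG", "s2", 6, "GG"),                  # local alignment of the last residues
])
for pair, weight in sorted(library.items()):
    print(pair, round(weight, 1))
```

Pairs found by both the global and the local alignment end up with the highest weight, which is the sense in which segments are compared across the entire data set.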
Sequence Comparison Multiple alignment Multiple sequence alignment - T-Coffee
Color scale of the output, from RED (high-quality segments) through YELLOW and GREEN to BLUE (regions that you have no reason to trust).