Sequence comparison and Phylogeny - PowerPoint PPT Presentation

sequence comparison and phylogeny n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Sequence comparison and Phylogeny PowerPoint Presentation
Download Presentation
Sequence comparison and Phylogeny

play fullscreen
1 / 149
Sequence comparison and Phylogeny
153 Views
Download Presentation
chet
Download Presentation

Sequence comparison and Phylogeny

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Sequence comparisonand Phylogeny based on Chapter 4 Lesk, Introduction to Bioinformatics

  2. Contents Motivation Sequence comparison and alignments Dot plots Dynamic programming Substitution matrices Dynamic programming: Local and global alignments and gaps BLAST Significance of alignments Multiple sequence alignments Phylogenetic trees

  3. Motivation From where are we? Recent Africa vs. Multi-regional Hypothese In 1999 Encephalitis caused by the West Nile Virus broke out in New York. How did the virus come to New York? How did the nucleus get into the eucaryotic cells? To answer such questions we will need sequence comparison and phylogenetic trees

  4. Sequence SNLIPSPPTYNSAHDYISWESFSNVSYYTRILPSHLIDSPPTYNADYGYKSWEAFSNLSYYTRALP*.** ******: : * ***:***:***** ** MSRSLLLWFLLFLLLLPPLPVLLADPGAPTPVN--------------MLARALLLCAVLALSHTAN :* :* * . .*

  5. Sequence Similarity Searches Sequence similarity can be due to a common evolutionary ancestor… E.g. globin genes in chimpanzees and humans … or common function E.g. v-sys oncogene in simian sarcoma virus leading to cancer in monkeys and the seemingly unrelated growth stimulating protein PDGF, which stimulates cell growth (first success of similarity idea, 1983) In general: If an unknown sequence is found, deduce its function/structure indirectly by finding similar sequences, whose function/structure is known Assumption: Evolution changes sequences “slowly” often maintaining main features of a sequence’s function/structure

  6. Sequence alignment Substitutions, insertions and deletions can be interpreted in evolutionary terms But: distinguish chance similarity and real biological relationship CCGTAA TCGTAA CCGTAT TCGTAC TTGTAA TAGTAC TCGTAG

  7. Evolution Convergent evolution: same sequence evolved from different ancestors Back evolution - mutate to a previous sequence CCGTAA TCGTAA CCGTAT TAGTAA TCGTAC TAGTAC TAGTAC TCGTAG CCGTAA

  8. Similarity vs. Homology Any sequence can be similar Sequences homologues if evolved from common ancestor Homologous sequences: Orthologs: similar biological function Paralogs: different biological function (after gene duplication), e.g. lysozyme and α-lactalbumin, a mammalian regulatory protein Assumption: Similarity indicator for homology Note, altered function of the expressed protein will determine if the organism will survive to reproduce, and hence pass on the altered gene

  9. Sequence alignments Given two or more sequences, we wish to Measure their similarity Determine the residue-residue correspondences Observe patterns of conservation and variability Infer evolutionary relationships

  10. What is the best alignment? Uninformative: -------gctgaacgctataatc------- Without gaps: gctgaacgctataatc With gaps: gctga-a--cg --ct-ataatc Another one: gctg-aa-cg -ctataatc- Formally: The best alignments have only a minimal number of mismatches (insertions, deletions, replace) We need a method to systematically explore and to compute alignments

  11. Scores for an alignment Sequence Identity: Percentage of matches Score each match, mismatch, gap opening, gap extension [attg] a t t g [accc] a c - - Example match +1 mismatch -1 Gap opening -3 Gap extension -1 Uninformative: 0%, score= -19-------gctgaacgctataatc------- Without gaps: 25%, score= -4 gctgaacgctataatc With gaps: 0%, score= -19gctga-a--cg --ct-ataatc Another one: 50%, score= -8 gctg-aa-cg -ctataatc-

  12. Scores for an alignment Sequence Identity: Percentage of matches Score each match, mismatch, gap opening, gap extension [attg] a t t g [accc] a c - - Example match +2 mismatch -1 Gap opening -1 Gap extension -1 Uninformative: 0%, score= -15-------gctgaacgctataatc------- Without gaps: 25%, score= -2gctgaacgctataatc With gaps: 0%, score= -11gctga-a--cg --ct-ataatc Another one: 50%, score= 5gctg-aa-cg -ctataatc-

  13. Dot plots

  14. Dot plots A convenient way of comparing 2 sequences visually Use matrix, put 1 sequence on X-axis, 1 on Y-axis Cells with identical characters filled with a ‘1’, non-identical with ‘0’ (simplest scheme - could have weights)

  15. Dot plots D O R O T H Y C R O W F O O T H O D G K I N D O R O T H Y H O D G K I N

  16. Dot plots D O R O T H Y C R O W F O O T H O D G K I N D D D O O O O O O O R R R O O O O O O O T T T H H H Y Y H H H O O O O O O O D D D G G K K I I N N

  17. Interpreting dot plots What do identical sequences look like? What do unrelated sequences look like? What do distantly related sequences look like? What does reverse sequence look like? Relevant for detections of stems in RNA structure What does a palindrome look like? Relevant for restriction enzymes What do repeats look like? What does a protein with domains A and B and another one with domains B and C look like?

  18. Dot plot for identical sequences D O R O T H Y H O D G K I N D D D O O O O R R O O O O T T H H H Y Y H H H O O O O D D D G G K K I I N N

  19. Dotplot for unrelated sequences D O R O T H Y H O D G K I N O O O O T T T T O O O O D D D I I E T T E R R

  20. Dotplot for distantly related sequences D O R O T H Y H O D G K I N T T I I M O O O O T T H H H Y Y J E N N K K I I N N

  21. Dotplot for reverse sequences Relevant to identify stems in RNA structures Plot sequence against its reverse complement

  22. Dotplot for reverse sequences D O R O T H Y H O D G K I N N N I I K K G G D D D O O O O H H H Y Y H H H T T O O O O R R O O O O D D

  23. Palindromes and restriction enzymes Madam, I'm Adam Able was I ere I saw Elba (supposedly said by Napoleon) Doc note I dissent, a fast never prevents a fatness, I diet on cod. Because DNA is double stranded and the strands run antiparallel, palindromes are defined as any double stranded DNA in which reading 5’ to 3’ both are the same The HindIII cutting site: 5'-AAGCTT-3' 3'-TTCGAA-5' The EcoRI cutting site: 5'-GAATTC-3' 3'-CTTAAG-5'

  24. Dotplot of a Palindrome M A D A M M M M A A A D D A A A M M M

  25. Dotplot of repeats T W E N T Y O N E T W E N T Y T W O T T T T T T W W W W E E E E N N N N T T T T T T Y Y Y T T T T T T W W W W O O O T T T T T T W W W W E E E E N N N N T T T T T T Y Y Y O O O N N N N E E E E

  26. Dotplot of Repeats/Palindrome M A D A M I M A D A M M M M M M A A A A A D D D A A A A A M M M M M I I M M M M M A A A A A D D D A A A A A M M M M M

  27. Dotplot for shared domain D O R O T H Y H O D G K I N D D O O O O R R O O O O T T H H H Y Y M I I L L E R R

  28. Result Dot plot dorothycrowfoothodgkin d* * o * * * ** * r * * o * * * ** * t * * h * * y * h * * o * * * ** * d* * g * k * i * n *

  29. Dotplots Window size 15 Dot if 6 matches in window

  30. Window size 15 • Dot if • 6 matches in window Cacain and Caricain, two proteases from papaya

  31. Cacain and Caricain, two proteases from papaya >gi|1942644|pdb|1MEG| Crystal Structure Of A Caricain D158e Mutant In Complex With E-64 Length = 216 Score = 271 bits (693), Expect = 1e-73 Identities = 142/216 (65%), Positives = 168/216 (77%), Gaps = 4/216 (1%) Query: 1 IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRS 60 +PE VDWR+KGAVTPV++QGSCGSCWAFSAV T+EGI KIRTG L + SEQEL+DC+RRS Sbjct: 1 LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSEQELVDCERRS 60 Query: 61 YGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120 +GC GGYP AL+ VA+ GIH R+ YPY+ Q CR+++ G KT GV +VQP N+G Sbjct: 61 HGCKGGYPPYALEYVAKNGIHLRSKYPYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120 Query: 121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV----GYGPNYILIKNS 176 LL +IA QPVSVV+++ G+ FQLY+GGIF GPCG KV+HAV AV G YILIKNS Sbjct: 121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGYGKSGGKGYILIKNS 180 Query: 177 WGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 212 WGT WGE GYIRIKR GNS GVCGLY SS+YP KN Sbjct: 181 WGTAWGEKGYIRIKRAPGNSPGVCGLYKSSYYPTKN 216 1 lpenvdwrkk gavtpvrhqg scgscwafsa vatveginki rtgklvelse qelvdcerrs 61 hgckggyppy aleyvakngi hlrskypyka kqgtcrakqv ggpivktsgv grvqpnnegn 121 llnaiakqpv svvveskgrp fqlykggife gpcgtkveha vtavgygksg gkgyilikns 181 wgtawgekgy irikrapgns pgvcglykss yyptkn

  32. Window size 15 • Dot if • 6 matches in window Cacain and Cruzain, a protease from Trypanosoma cruzi

  33. >gi|2624670|pdb|1AIM| Cruzain Inhibited By Benzoyl-Tyrosine-Alanine- Fluoromethylketone Length = 215 Score = 121 bits (303), Expect = 3e-28 Identities = 78/202 (38%), Positives = 107/202 (52%), Gaps = 13/202 (6%) Query: 2 PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRSY 61 P VDWR +GAVT VK+QG CGSCWAFSA+ +E + L SEQ L+ CD+ Sbjct: 2 PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFLAGHPLTNLSEQMLVSCDKTDS 61 Query: 62 GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRYCRSREKGPYAAKTDGVRQVQP 115 GC+GG +A + + Q ++ ++YPY EG+ C + A T V Q Sbjct: 62 GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPPCTTSGHTVGATITGHVELPQD 121 Query: 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYGPN----YI 171 Q A ++ N PV+V + A+ + Y GG+ +DH V VGY + Y Sbjct: 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVSEALDHGVLLVGYNDSAAVPYW 178 Query: 172 LIKNSWGTGWGENGYIRIKRGT 193 +IKNSW T WGE GYIRI +G+ Sbjct: 179 IIKNSWTTQWGEEGYIRIAKGS 200 Cacain and Cruzain, a protease from Trypanosoma

  34. Window size 15 • Dot if • 6 matches in window Cacain and Cathepsin, a human protease

  35. gi|7546546|pdb|1EF7|B Chain B, Crystal Structure Of Human Cathepsin X Length = 242 Score = 52.0 bits (123), Expect = 2e-07 Identities = 60/231 (25%), Positives = 94/231 (40%), Gaps = 34/231 (14%) Query: 1 IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVTIEGIIKIRTGNL---NQYSEQ 51 +P+ DWR V + +NQ CGSCWA ++ + I I+ S Q Sbjct: 1 LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSAMADRINIKRKGAWPSTLLSVQ 60 Query: 52 ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCR--------SREKGPY 103 ++DC C GG S Q+GI Y+ + C + K + Sbjct: 61 NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQAKDQECDKFNQCGTCNEFKECH 119 Query: 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNK-V 157 A + + +V Y + + AN P+S + A + Y GGI+ + Sbjct: 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMATER-LANYTGGIYAEYQDTTYI 178 Query: 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI-----KRGTGNSYGV 199 +H V+ G+ G Y +++NSWG WGE G++RI K G G Y + Sbjct: 179 NHVVSVAGWGISDGTEYWIVRNSWGEPWGERGWLRIVTSTYKDGKGARYNL 229 Cacain and Cathepsin, a human protease

  36. Window size 5 • Dot if • 2 matches in window Cacain and Cathepsin, a human protease

  37. Window size 1 • Dot if • 1 match in window Cacain and Cathepsin, a human protease

  38. Dynamic programming

  39. From Dotplots to Alignments Obvious best alignment: DOROTHYCROWFOOTHODGKIN DOROTHY--------HODGKIN D O R O T H Y C R O W F O O T H O D G K I N D D D O O O O O O O R R R O O O O O O O T T T H H H Y Y H H H O O O O O O O D D D G G K K I I N N

  40. From Dotplots to Alignments Find “best” path from top left corner to bottom right Moving “east” corresponds to “-” in the second sequence Moving “south” corresponds to “-” in the first sequence Moving “southeast” corresponds to a match (if the characters are the same) or a mismatch (otherwise) Can we automate this?

  41. From Dotplots to Alignments Algorithm (Dynamic Programming): Insert a row 0 and column 0 initialised with 0 Starting from the top left, move down row by row from row 1 and right column by column from column 1 visiting each cell Consider The value of the cell north The value of the cell west The value of the cell northwest if the row/column character mismatch 1 + the value of the cell northwest if the row/column character match Put down the maximum of these values as the value for the current cell Trace back the path with the highest values from the bottom right to the top left and output the alignment

  42. From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 1 A2 T3 C4 T5 G6 A7 T

  43. From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 02 T 03 C 04 T 05 G 06 A 07 T 0 Insert a row 0 and column 0 initialised with 0

  44. From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 2 T 03 C 04 T 05 G 06 A 07 T 0 0 0 1 1 1 • Consider • Value north • Value west • Value northwest if the row/column character mismatch • 1 + value northwest if the row/column character match • Put down the maximum of these values for current celll

  45. From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4

  46. Reading the Alignment 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 -tgcat-a- at-c-tgat

  47. Reading the Alignment: there are more than one possibility 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 ---tgcata atctg-at-

  48. Formally:Longest Common Subsequence LCS What is the length s(V,W) of the longest common subsequence of two sequencesV=v1..vn and W=w1..wm ? Find sequences of indices1 ≤ i1 < … < ik ≤ n and 1 ≤ j1 < … < jk ≤ msuch that vit = wjt for 1 ≤ t ≤ k How? Dynamic programming: si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and si-1,jsi,j = maxsi,j-1si-1,j-1 + 1, if vi = wj Then s(V,W) = sn,m is the length of the LCS {

  49. Example LCS 0 1 2 3 4 5 6 T G C A T A0 1 A2 T3 C4 T5 G6 A7 T

  50. Example LCS: 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 02 T 03 C 04 T 05 G 06 A 07 T 0 Initialisation: si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m