Sequence comparison - PowerPoint PPT Presentation

sequence comparison n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Sequence comparison PowerPoint Presentation
Download Presentation
Sequence comparison

play fullscreen
1 / 107
Sequence comparison
160 Views
Download Presentation
tejana
Download Presentation

Sequence comparison

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Sequence comparison based on Chapter 4 Lesk, Introduction to Bioinformatics

  2. Contents Motivation Sequence comparison and alignments Dot plots Dynamic programming Substitution matrices Dynamic programming: Local and global alignments and gaps BLAST Significance of alignments Multiple sequence alignments

  3. Motivation From where are we? Recent Africa vs. Multi-regional Hypothese In 1999 Encephalitis caused by the West Nile Virus broke out in New York. How did the virus come to New York? How did the nucleus get into the eucaryotic cells? To answer such questions we will need sequence comparison and phylogenetic trees

  4. Sequence alignment Substitutions, insertions and deletions can be interpreted in evolutionary terms But: distinguish chance similarity and real biological relationship CCGTAA TCGTAA CCGTAT TCGTAC TTGTAA TAGTAC TCGTAG

  5. Evolution Convergent evolution: same sequence evolved from different ancestors Back evolution - mutate to a previous sequence CCGTAA TCGTAA CCGTAT TAGTAA TCGTAC TAGTAC TAGTAC TCGTAG CCGTAA

  6. Sequence alignments Given two or more sequences, we wish to Measure their similarity Determine the residue-residue correspondences Observe patterns of conservation and variability Infer evolutionary relationships

  7. What is the best alignment? Uninformative: -------gctgaacgctataatc------- Without gaps: gctgaacgctataatc With gaps: gctga-a--cg --ct-ataatc Another one: gctg-aa-cg -ctataatc- Formally: The best alignments have only a minimal number of mismatches (insertions, deletions, replace) We need a method to systematically explore and to compute alignments

  8. Scores for an alignment Sequence Identity: Percentage of matches Score each match, mismatch, gap opening, gap extension [attg] a t t g [accc] a c - - Example match +1 mismatch -1 Gap opening -3 Gap extension -1 Uninformative: 0%, score= -19-------gctgaacgctataatc------- Without gaps: 25%, score= -4 gctgaacgctataatc With gaps: 0%, score= -19gctga-a--cg --ct-ataatc Another one: 50%, score= -8 gctg-aa-cg -ctataatc-

  9. Scores for an alignment Sequence Identity: Percentage of matches Score each match, mismatch, gap opening, gap extension [attg] a t t g [accc] a c - - Example match +2 mismatch -1 Gap opening -1 Gap extension -1 Uninformative: 0%, score= -15-------gctgaacgctataatc------- Without gaps: 25%, score= -2gctgaacgctataatc With gaps: 0%, score= -11gctga-a--cg --ct-ataatc Another one: 50%, score= 5gctg-aa-cg -ctataatc-

  10. Dot plots

  11. Dot plots A convenient way of comparing 2 sequences visually Use matrix, put 1 sequence on X-axis, 1 on Y-axis Cells with identical characters filled with a ‘1’, non-identical with ‘0’ (simplest scheme - could have weights)

  12. Dot plots D O R O T H Y C R O W F O O T H O D G K I N D O R O T H Y H O D G K I N

  13. Dot plots D O R O T H Y C R O W F O O T H O D G K I N D D D O O O O O O O R R R O O O O O O O T T T H H H Y Y H H H O O O O O O O D D D G G K K I I N N

  14. Interpreting dot plots What do identical sequences look like? What do unrelated sequences look like? What do distantly related sequences look like? What does reverse sequence look like? Relevant for detections of stems in RNA structure What does a palindrome look like? Relevant for restriction enzymes What do repeats look like? What does a protein with domains A and B and another one with domains B and C look like?

  15. Dot plot for identical sequences D O R O T H Y H O D G K I N D D D O O O O R R O O O O T T H H H Y Y H H H O O O O D D D G G K K I I N N

  16. Dotplot for unrelated sequences D O R O T H Y H O D G K I N O O O O T T T T O O O O D D D I I E T T E R R

  17. Dotplot for distantly related sequences D O R O T H Y H O D G K I N T T I I M O O O O T T H H H Y Y J E N N K K I I N N

  18. Dotplot for reverse sequences Relevant to identify stems in RNA structures Plot sequence against its reverse complement

  19. Dotplot for reverse sequences D O R O T H Y H O D G K I N N N I I K K G G D D D O O O O H H H Y Y H H H T T O O O O R R O O O O D D

  20. Dotplot of a Palindrome M A D A M M M M A A A D D A A A M M M

  21. Dotplot of repeats T W E N T Y O N E T W E N T Y T W O T T T T T T W W W W E E E E N N N N T T T T T T Y Y Y T T T T T T W W W W O O O T T T T T T W W W W E E E E N N N N T T T T T T Y Y Y O O O N N N N E E E E

  22. Dotplot of Repeats/Palindrome M A D A M I M A D A M M M M M M A A A A A D D D A A A A A M M M M M I I M M M M M A A A A A D D D A A A A A M M M M M

  23. Dotplot for shared domain D O R O T H Y H O D G K I N D D O O O O R R O O O O T T H H H Y Y M I I L L E R R

  24. Result Dot plot dorothycrowfoothodgkin d* * o * * * ** * r * * o * * * ** * t * * h * * y * h * * o * * * ** * d* * g * k * i * n *

  25. Dotplots Window size 15 Dot if 6 matches in window

  26. Window size 15 • Dot if • 6 matches in window Cacain and Caricain, two proteases from papaya

  27. Cacain and Caricain, two proteases from papaya >gi|1942644|pdb|1MEG| Crystal Structure Of A Caricain D158e Mutant In Complex With E-64 Length = 216 Score = 271 bits (693), Expect = 1e-73 Identities = 142/216 (65%), Positives = 168/216 (77%), Gaps = 4/216 (1%) Query: 1 IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRS 60 +PE VDWR+KGAVTPV++QGSCGSCWAFSAV T+EGI KIRTG L + SEQEL+DC+RRS Sbjct: 1 LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSEQELVDCERRS 60 Query: 61 YGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120 +GC GGYP AL+ VA+ GIH R+ YPY+ Q CR+++ G KT GV +VQP N+G Sbjct: 61 HGCKGGYPPYALEYVAKNGIHLRSKYPYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120 Query: 121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV----GYGPNYILIKNS 176 LL +IA QPVSVV+++ G+ FQLY+GGIF GPCG KV+HAV AV G YILIKNS Sbjct: 121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGYGKSGGKGYILIKNS 180 Query: 177 WGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 212 WGT WGE GYIRIKR GNS GVCGLY SS+YP KN Sbjct: 181 WGTAWGEKGYIRIKRAPGNSPGVCGLYKSSYYPTKN 216 1 lpenvdwrkk gavtpvrhqg scgscwafsa vatveginki rtgklvelse qelvdcerrs 61 hgckggyppy aleyvakngi hlrskypyka kqgtcrakqv ggpivktsgv grvqpnnegn 121 llnaiakqpv svvveskgrp fqlykggife gpcgtkveha vtavgygksg gkgyilikns 181 wgtawgekgy irikrapgns pgvcglykss yyptkn

  28. Window size 15 • Dot if • 6 matches in window Cacain and Cruzain, a protease from Trypanosoma cruzi

  29. >gi|2624670|pdb|1AIM| Cruzain Inhibited By Benzoyl-Tyrosine-Alanine- Fluoromethylketone Length = 215 Score = 121 bits (303), Expect = 3e-28 Identities = 78/202 (38%), Positives = 107/202 (52%), Gaps = 13/202 (6%) Query: 2 PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRSY 61 P VDWR +GAVT VK+QG CGSCWAFSA+ +E + L SEQ L+ CD+ Sbjct: 2 PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFLAGHPLTNLSEQMLVSCDKTDS 61 Query: 62 GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRYCRSREKGPYAAKTDGVRQVQP 115 GC+GG +A + + Q ++ ++YPY EG+ C + A T V Q Sbjct: 62 GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPPCTTSGHTVGATITGHVELPQD 121 Query: 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYGPN----YI 171 Q A ++ N PV+V + A+ + Y GG+ +DH V VGY + Y Sbjct: 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVSEALDHGVLLVGYNDSAAVPYW 178 Query: 172 LIKNSWGTGWGENGYIRIKRGT 193 +IKNSW T WGE GYIRI +G+ Sbjct: 179 IIKNSWTTQWGEEGYIRIAKGS 200 Cacain and Cruzain, a protease from Trypanosoma

  30. Window size 15 • Dot if • 6 matches in window Cacain and Cathepsin, a human protease

  31. gi|7546546|pdb|1EF7|B Chain B, Crystal Structure Of Human Cathepsin X Length = 242 Score = 52.0 bits (123), Expect = 2e-07 Identities = 60/231 (25%), Positives = 94/231 (40%), Gaps = 34/231 (14%) Query: 1 IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVTIEGIIKIRTGNL---NQYSEQ 51 +P+ DWR V + +NQ CGSCWA ++ + I I+ S Q Sbjct: 1 LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSAMADRINIKRKGAWPSTLLSVQ 60 Query: 52 ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCR--------SREKGPY 103 ++DC C GG S Q+GI Y+ + C + K + Sbjct: 61 NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQAKDQECDKFNQCGTCNEFKECH 119 Query: 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNK-V 157 A + + +V Y + + AN P+S + A + Y GGI+ + Sbjct: 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMATER-LANYTGGIYAEYQDTTYI 178 Query: 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI-----KRGTGNSYGV 199 +H V+ G+ G Y +++NSWG WGE G++RI K G G Y + Sbjct: 179 NHVVSVAGWGISDGTEYWIVRNSWGEPWGERGWLRIVTSTYKDGKGARYNL 229 Cacain and Cathepsin, a human protease

  32. Window size 5 • Dot if • 2 matches in window Cacain and Cathepsin, a human protease

  33. Window size 1 • Dot if • 1 match in window Cacain and Cathepsin, a human protease

  34. Dynamic programming

  35. From Dotplots to Alignments Obvious best alignment: DOROTHYCROWFOOTHODGKIN DOROTHY--------HODGKIN D O R O T H Y C R O W F O O T H O D G K I N D D D O O O O O O O R R R O O O O O O O T T T H H H Y Y H H H O O O O O O O D D D G G K K I I N N

  36. From Dotplots to Alignments Find “best” path from top left corner to bottom right Moving “east” corresponds to “-” in the second sequence Moving “south” corresponds to “-” in the first sequence Moving “southeast” corresponds to a match (if the characters are the same) or a mismatch (otherwise) Can we automate this?

  37. From Dotplots to Alignments Algorithm (Dynamic Programming): Insert a row 0 and column 0 initialised with 0 Starting from the top left, move down row by row from row 1 and right column by column from column 1 visiting each cell Consider The value of the cell north The value of the cell west The value of the cell northwest if the row/column character mismatch 1 + the value of the cell northwest if the row/column character match Put down the maximum of these values as the value for the current cell Trace back the path with the highest values from the bottom right to the top left and output the alignment

  38. From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 1 A2 T3 C4 T5 G6 A7 T

  39. From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 02 T 03 C 04 T 05 G 06 A 07 T 0 Insert a row 0 and column 0 initialised with 0

  40. From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 2 T 03 C 04 T 05 G 06 A 07 T 0 0 0 1 1 1 • Consider • Value north • Value west • Value northwest if the row/column character mismatch • 1 + value northwest if the row/column character match • Put down the maximum of these values for current celll

  41. From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4

  42. Reading the Alignment 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 -tgcat-a- at-c-tgat

  43. Reading the Alignment: there are more than one possibility 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 ---tgcata atctg-at-

  44. Formally:Longest Common Subsequence LCS What is the length s(V,W) of the longest common subsequence of two sequencesV=v1..vn and W=w1..wm ? Find sequences of indices1 ≤ i1 < … < ik ≤ n and 1 ≤ j1 < … < jk ≤ msuch that vit = wjt for 1 ≤ t ≤ k How? Dynamic programming: si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and si-1,jsi,j = maxsi,j-1si-1,j-1 + 1, if vi = wj Then s(V,W) = sn,m is the length of the LCS {

  45. Example LCS 0 1 2 3 4 5 6 T G C A T A0 1 A2 T3 C4 T5 G6 A7 T

  46. Example LCS: 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 02 T 03 C 04 T 05 G 06 A 07 T 0 Initialisation: si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m

  47. Example LCS: 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 03 C 04 T 05 G 06 A 07 T 0 Computing each cell: si-1,jsi,j= maxsi,j-1si-1,j-1 + 1, if vi = wj {

  48. Example LCS: 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 Computing each cell: si-1,jsi,j= maxsi,j-1si-1,j-1 + 1, if vi = wj {

  49. LCS Algorithm LCS(V,W) For i = 0 to n si,0 = 0 For j = 0 to m s0,j = 0 For i = 1 to n For j = 1 to m If vi = wjand si-1,j-1 +1≥si-1,j and si-1,j-1 +1≥si,j-1 Then si,j= si-1,j-1 +1 bi,j= North West Else if si-1,j ≥si,j-1 Then si,j= si-1,j bi,j= North Else si,j= si,j-1 bi,j= West Return s and b Complexity: LCS has quadratic complexity: O(n m)

  50. Printing the alignment of LCS PRINT-LCS(b,V,i,j) If i=0 or j=0 Then Return If bi,j = North West Then PRINT-LCS(V,b,i-1,j-1) Print vi Else if bi,j = North Then PRINT-LCS(V,b,i-1,j) Else PRINT-LCS(V,b,i,j-1)