Sequence Comparison

Presentation Transcript


  1. Sequence Comparison Pair-wise Similarities

  2. Sequence Comparison • Graphical Alignments • Compare • DotPlot • Pairwise alignments • BestFit • Gap

  3. Similarity vs. Homology • Similarity • Two sequences which resemble each other • Can be measured • Degrees of similarity exist • Homology • Two sequences which are similar due to common evolutionary origin • Must be inferred • All or none

  4. Paralogous vs. Orthologous Relationships • Both imply homology • Orthologs • Sequences that have evolved from a common ancestor following speciation • Paralogs • Sequences that have evolved within a single line of descent following gene duplication

  5. Match Criterion • Is there a similarity between two sequences? • Identical symbols (nucleotides or amino acids) • Related symbols (amino acids) • Do gaps/rearrangements allow for a higher degree of similarity?

  6. Dot Plots • Allow comparison of two sequences in all registers • Produce a graph (DotPlot) of sequence similarities • The human brain interprets the results

  7. GCG DotPlots • Compare • Compares the sequences • Output is a text table containing the comparison information • DotPlot • Produces a graph of Compare's results

  8. Simple 1:1 DotPlot [Figure: the sequence SEQUENCEANALYSISPRIMER plotted against itself (horizontal axis left to right, vertical axis bottom to top); a dot marks every pair of identical letters.]
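
The 1:1 comparison on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not GCG's Compare/DotPlot code: the function names `dot_plot` and `render` are hypothetical, and a dot is placed wherever the two symbols are identical.

```python
def dot_plot(seq_x, seq_y):
    """Return the set of (x, y) index pairs whose symbols are identical."""
    return {(x, y)
            for x, a in enumerate(seq_x)
            for y, b in enumerate(seq_y)
            if a == b}


def render(seq_x, seq_y, dots):
    """Draw the plot as text, with seq_y running bottom to top."""
    rows = []
    for y in reversed(range(len(seq_y))):
        cells = "".join("•" if (x, y) in dots else " "
                        for x in range(len(seq_x)))
        rows.append(f"{seq_y[y]} {cells}")
    rows.append("  " + seq_x)  # horizontal axis labels
    return "\n".join(rows)


seq = "SEQUENCEANALYSISPRIMER"
dots = dot_plot(seq, seq)
print(render(seq, seq, dots))
```

Comparing a sequence with itself always fills the main diagonal; the off-diagonal dots are chance matches of single letters, which is the "noise" that word and window comparisons are meant to suppress.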

  9. Stringency and Specificity • Degree to which a program's parameters are set to detect more distant similarities • Degree to which a program's parameters are set to exclude unrelated “background” similarities or “noise”

  10. High Stringency • Low background noise • Only relatively close matches detected

  11. Low Stringency • High background noise • Distant relationships detected

  12. Word Match Comparisons • Identifies short, perfect matches (words) • ktup (k-tuple) • Fast • 1,000 times faster than window/stringency comparison • Less sensitive than window/stringency
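
A minimal sketch of a word (k-tuple) comparison, assuming the usual approach of indexing every word of one sequence in a hash table and then looking up each word of the other. The name `word_matches` is hypothetical; this is not the GCG implementation.

```python
from collections import defaultdict


def word_matches(seq_x, seq_y, word_size=2):
    """Return (x, y) start positions where a perfect word match begins."""
    # Index every word of seq_y by its start position.
    index = defaultdict(list)
    for y in range(len(seq_y) - word_size + 1):
        index[seq_y[y:y + word_size]].append(y)

    # Look up each word of seq_x; only identical words produce a hit.
    hits = []
    for x in range(len(seq_x) - word_size + 1):
        for y in index.get(seq_x[x:x + word_size], ()):
            hits.append((x, y))
    return hits


seq = "SEQUENCEANALYSISPRIMER"
hits = word_matches(seq, seq, word_size=2)
```

The speed comes from the lookup: each word is found directly instead of being compared against every position. The cost is sensitivity, since a similarity without any perfect word of the chosen size is not detected.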

  13. Word DotPlot -WordSize=2 [Figure: early step of the word comparison (word size 2) of SEQUENCEANALYSISPRIMER against itself, with a single matching word marked by a dot.]

  14. Word DotPlot -WordSize=2 [Figure: completed word comparison of SEQUENCEANALYSISPRIMER against itself; one dot per row.]

  15. Word DotPlot -WordSize=2

  16. Window/Stringency Comparisons • Identifies a given number of matches (stringency) • Over a given range (window) • Slow • High sensitivity
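
A minimal sketch of a window/stringency comparison. It assumes the window is anchored at its first position and slides along every diagonal; GCG may position the window differently, and the name `window_matches` is hypothetical.

```python
def window_matches(seq_x, seq_y, window=4, stringency=2):
    """Return (x, y, score) for every window position meeting the stringency.

    score is the number of identical symbols among the `window` pairs
    starting at seq_x[x] and seq_y[y].
    """
    hits = []
    for x in range(len(seq_x) - window + 1):
        for y in range(len(seq_y) - window + 1):
            score = sum(seq_x[x + k] == seq_y[y + k] for k in range(window))
            if score >= stringency:
                hits.append((x, y, score))
    return hits


seq = "SEQUENCEANALYSISPRIMER"
loose = window_matches(seq, seq, window=4, stringency=1)
default = window_matches(seq, seq, window=4, stringency=2)
strict = window_matches(seq, seq, window=4, stringency=4)
```

Raising the stringency for a fixed window removes background dots; lowering it admits weaker similarities along with more noise. Every position pair is scored, which is why this approach is slower than a word comparison.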

  17. Window DotPlot 4/2 [Figure: step of the window comparison of SEQUENCEANALYSISPRIMER against itself; window annotated 4/4.]

  18. Window DotPlot 4/2 [Figure: next step of the same comparison; window annotated 0/0.]

  19. Window DotPlot 4/2 [Figure: next step of the same comparison; window annotated 0/4.]

  20. Window DotPlot 4/2 [Figure: next step of the same comparison; window annotated 4/4.]

  21. Window DotPlot 4/2 [Figure: next step of the same comparison; window annotated 2/4.]

  22. Window DotPlot 4/2

  23. Window DotPlot 4/1

  24. Window DotPlot 4/3

  25. Symbol Comparison Tables (Scoring Matrices) • What is a match? • Define match values for all GCG symbols • Nucleotides • Amino acids • Located in GenRunData:*.cmp

  26. Nucleotide Tables • Programs use different tables depending on the alignment algorithm in use • Matches and mismatches receive different values

  27. compardna.cmp • Compare • Match=1 • Mismatch=0 • Ambiguity symbols with any overlap between the sets of nucleotides are considered matches
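
The overlap rule can be sketched as a set intersection over the standard IUB/IUPAC ambiguity codes. The function name `compare_score` is hypothetical; U is treated as T, and N/X are left out, as in the table for Compare.

```python
# Standard IUB/IUPAC nucleotide ambiguity codes (U treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def compare_score(a, b):
    """Score 1 if the two symbols share at least one nucleotide, else 0."""
    return 1 if set(IUPAC[a]) & set(IUPAC[b]) else 0
```

For example, `compare_score("A", "R")` is 1 because R stands for A or G, while `compare_score("A", "B")` is 0 because B stands for anything except A.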

  28. !!NA_SCORING_MATRIX_RECT 1.0
      Default scoring matrix used by COMPARE for the comparison of nucleic acid sequences. This table scores a match for any overlap between any IUB nucleic acid ambiguity symbols EXCEPT X/N.
      February 20, 1996 14:33 ..

         A  B  C  D  G  H  K  M  R  S  T  U  V  W  Y
      A  1  0  0  1  0  1  0  1  1  0  0  0  1  1  0
      B  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1
      C  0  1  1  0  0  1  0  1  0  1  0  0  1  0  1
      D  1  1  0  1  1  1  1  1  1  1  1  1  1  1  1
      G  0  1  0  1  1  0  1  0  1  1  0  0  1  0  0
      H  1  1  1  1  0  1  1  1  1  1  1  1  1  1  1
      K  0  1  0  1  1  1  1  0  1  1  1  1  1  1  1
      M  1  1  1  1  0  1  0  1  1  1  0  0  1  1  1
      R  1  1  0  1  1  1  1  1  1  1  0  0  1  1  0
      S  0  1  1  1  1  1  1  1  1  1  0  0  1  0  1
      T  0  1  0  1  0  1  1  0  0  0  1  1  0  1  1
      U  0  1  0  1  0  1  1  0  0  0  1  1  0  1  1
      V  1  1  1  1  1  1  1  1  1  1  0  0  1  1  1
      W  1  1  0  1  0  1  1  1  1  0  1  1  1  1  1
      Y  0  1  1  1  0  1  1  1  0  1  1  1  1  1  1

  29. nwsgapdna.cmp • Gap • Match=10 • Mismatch=0 • Gap penalties • Gap Create • Gap Extend
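
A sketch of how these values combine when an already gapped alignment is scored. It assumes a gap of n symbols costs Gap Create + n × Gap Extend, and uses the defaults from the nwsgapdna.cmp header (50 and 3); the function name `score_alignment` is hypothetical, and only unambiguous bases are handled.

```python
def score_alignment(top, bottom, match=10, mismatch=0,
                    gap_create=50, gap_extend=3):
    """Score two equal-length aligned strings in which '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    open_gap = None  # which row currently carries an open gap, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != open_gap:       # a new gap starts here
                score -= gap_create
                open_gap = row
            score -= gap_extend       # charged for every gap symbol
        else:
            open_gap = None
            score += match if a == b else mismatch
    return score


print(score_alignment("ACGTACGT", "ACG--CGT"))
```

With a creation penalty of 50 and a match value of 10, opening a gap only pays off when it brings at least six additional matches into register.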

  30. !!NA_SCORING_MATRIX_RECT 1.0
      Default scoring matrix used by GAP for the comparison of nucleic acid sequences.
      { GAP_CREATE 50 GAP_EXTEND 3 }

          A   B   C   D   G   H   K   M   N   R   S   T   U   V   W   X   Y
      A  10   0   0  10   0  10   0  10  10  10   0   0   0  10  10  10   0
      B   0  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
      C   0  10  10   0   0  10   0  10  10   0  10   0   0  10   0  10  10
      D  10  10   0  10  10  10  10  10  10  10  10  10  10  10  10  10  10
      G   0  10   0  10  10   0  10   0  10  10  10   0   0  10   0  10   0
      H  10  10  10  10   0  10  10  10  10  10  10  10  10  10  10  10  10
      K   0  10   0  10  10  10  10   0  10  10  10  10  10  10  10  10  10
      M  10  10  10  10   0  10   0  10  10  10  10   0   0  10  10  10  10
      N  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
      R  10  10   0  10  10  10  10  10  10  10  10   0   0  10  10  10   0
      S   0  10  10  10  10  10  10  10  10  10  10   0   0  10   0  10  10
      T   0  10   0  10   0  10  10   0  10   0   0  10  10   0  10  10  10
      U   0  10   0  10   0  10  10   0  10   0   0  10  10   0  10  10  10
      V  10  10  10  10  10  10  10  10  10  10  10   0   0  10  10  10  10
      W  10  10   0  10   0  10  10  10  10  10   0  10  10  10  10  10  10
      X  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
      Y   0  10  10  10   0  10  10  10  10   0  10  10  10  10  10  10  10

  31. swgapdna.cmp • BestFit • Match=10 • Mismatch=-9 • Negative numbers prevent extension of an alignment once the sequences diverge
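
The effect of the negative mismatch value can be seen in a small Smith-Waterman-style local alignment score. This is a simplified sketch, not BestFit itself: it uses a single linear penalty per gap symbol (an assumption made to keep the example short) and handles only unambiguous bases; the name `local_score` is hypothetical.

```python
def local_score(a, b, match=10, mismatch=-9, gap=-50):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets an alignment restart instead of carrying
            # a negative running score through a diverged region.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(local_score("TTTTACGTACGTCCCC", "GGGGACGTACGTAAAA"))
```

A single mismatch inside a good region is tolerated (it costs 9 but the following matches earn 10 each), whereas a run of mismatches drives the running score down to the floor, so the reported alignment ends where the sequences diverge.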

  32. !!NA_SCORING_MATRIX_RECT 1.0
      Default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences.
      February 20, 1996 14:35 ..
      { GAP_CREATE 50 GAP_EXTEND 3 }

          A   B   C   D   G   H   K   M   N   R   S   T   U   V   W   X   Y
      A  10  -9  -9  10  -9  10  -9  10  10  10  -9  -9  -9  10  10  10  -9
      B  -9  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
      C  -9  10  10  -9  -9  10  -9  10  10  -9  10  -9  -9  10  -9  10  10
      D  10  10  -9  10  10  10  10  10  10  10  10  10  10  10  10  10  10
      G  -9  10  -9  10  10  -9  10  -9  10  10  10  -9  -9  10  -9  10  -9
      H  10  10  10  10  -9  10  10  10  10  10  10  10  10  10  10  10  10
      K  -9  10  -9  10  10  10  10  -9  10  10  10  10  10  10  10  10  10
      M  10  10  10  10  -9  10  -9  10  10  10  10  -9  -9  10  10  10  10
      N  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
      R  10  10  -9  10  10  10  10  10  10  10  10  -9  -9  10  10  10  -9
      S  -9  10  10  10  10  10  10  10  10  10  10  -9  -9  10  -9  10  10
      T  -9  10  -9  10  -9  10  10  -9  10  -9  -9  10  10  -9  10  10  10
      U  -9  10  -9  10  -9  10  10  -9  10  -9  -9  10  10  -9  10  10  10
      V  10  10  10  10  10  10  10  10  10  10  10  -9  -9  10  10  10  10
      W  10  10  -9  10  -9  10  10  10  10  10  -9  10  10  10  10  10  10
      X  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
      Y  -9  10  10  10  -9  10  10  10  10  -9  10  10  10  10  10  10  10

  33. Amino Acid Tables • Measure of similarity between amino acids • Not a simple match/mismatch relationship • Values vary depending on degree of relatedness • Based on evolution, chemistry, or structure

  34. PAM250 Dayhoff Matrix • Based on evolutionary relationships • Derived empirically by comparing amino acid usage between closely related proteins • At least 85% identical

  35. PAM • Accepted Point Mutations • PAM-1 Matrix • 1 "evolutionary" event • Allow 1 residue out of 100 to change • 1% Difference • What is the probability that a given residue will change to any other?

  36. PAM250 Matrix • Allow 250 "evolutionary" events • 80% Difference • Account for more distant relationships • Can construct any PAM-N matrix
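
Constructing a PAM-N matrix amounts to multiplying the PAM-1 transition matrix by itself N times. The sketch below uses a toy PAM-1 (an assumption, not Dayhoff's data) in which each of 20 symbols changes with probability 1% and is equally likely to become any other symbol. The real matrix uses observed, non-uniform substitution frequencies, so the numbers differ, but the mechanism is the same.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Apply the one-step transition matrix n times (PAM-N = PAM-1 ** n)."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]  # identity matrix
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


SIZE = 20      # number of amino acids
CHANGE = 0.01  # toy assumption: 1% chance of change per PAM unit

toy_pam1 = [[1 - CHANGE if i == j else CHANGE / (SIZE - 1)
             for j in range(SIZE)] for i in range(SIZE)]
toy_pam250 = pam_n(toy_pam1, 250)

# Probability that a position shows a different residue after 250 PAMs.
difference = 1 - toy_pam250[0][0]
print(f"observed difference after 250 PAMs: {difference:.0%}")
```

Because repeated changes at one position can overwrite each other or revert, 250 accepted mutations per 100 residues still leave a fraction of positions unchanged, which is why the observed difference stays well below 100%.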

  37. pam250.cmp • Matches vary depending on the degree of conservation of any particular amino acid • A - A: 2 • W - W: 17 • Mismatches: vary depending on degree of relatedness between amino acids • Phe - Tyr: 7 • Leu - Ile: 2 • Cys - Leu: -6
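
A sketch of how such a table is used: each aligned pair is looked up and the values are summed. Only the five pairs quoted on this slide are included, and the names `PAM250_SUBSET`, `pair_score` and `ungapped_score` are hypothetical.

```python
# The five PAM250 values quoted on the slide (one-letter codes).
PAM250_SUBSET = {
    ("A", "A"): 2,    # Ala - Ala
    ("W", "W"): 17,   # Trp - Trp
    ("F", "Y"): 7,    # Phe - Tyr
    ("L", "I"): 2,    # Leu - Ile
    ("C", "L"): -6,   # Cys - Leu
}


def pair_score(a, b, table=PAM250_SUBSET):
    """Look up a symmetric substitution score."""
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]  # raises KeyError for pairs not in this subset


def ungapped_score(seq1, seq2, table=PAM250_SUBSET):
    """Sum the pair scores of two equal-length, ungapped sequences."""
    return sum(pair_score(a, b, table) for a, b in zip(seq1, seq2))


print(ungapped_score("AWFLC", "AWYIL"))
```

The identical pair W-W contributes far more than A-A, and the conservative substitution F-Y scores higher than some identities, which is the sense in which the relationship is "not a simple match/mismatch".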

  38. PAM250 amino acid substitution matrix.
      { GAP_CREATE 12 GAP_EXTEND 4 }

          A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   Z
      A   2   0  -2   0   0  -4   1  -1  -1  -1  -2  -1   0   1   0  -2   1   1   0  -6  -3   0
      B   0   2  -4   3   2  -5   0   1  -2   1  -3  -2   2  -1   1  -1   0   0  -2  -5  -3   2
      C  -2  -4  12  -5  -5  -4  -3  -3  -2  -5  -6  -5  -4  -3  -5  -4   0  -2  -2  -8   0  -5
      D   0   3  -5   4   3  -6   1   1  -2   0  -4  -3   2  -1   2  -1   0   0  -2  -7  -4   3
      E   0   2  -5   3   4  -5   0   1  -2   0  -3  -2   1  -1   2  -1   0   0  -2  -7  -4   3
      F  -4  -5  -4  -6  -5   9  -5  -2   1  -5   2   0  -4  -5  -5  -4  -3  -3  -1   0   7  -5
      G   1   0  -3   1   0  -5   5  -2  -3  -2  -4  -3   0  -1  -1  -3   1   0  -1  -7  -5  -1
      H  -1   1  -3   1   1  -2  -2   6  -2   0  -2  -2   2   0   3   2  -1  -1  -2  -3   0   2
      I  -1  -2  -2  -2  -2   1  -3  -2   5  -2   2   2  -2  -2  -2  -2  -1   0   4  -5  -1  -2
      K  -1   1  -5   0   0  -5  -2   0  -2   5  -3   0   1  -1   1   3   0   0  -2  -3  -4   0
      L  -2  -3  -6  -4  -3   2  -4  -2   2  -3   6   4  -3  -3  -2  -3  -3  -2   2  -2  -1  -3
      M  -1  -2  -5  -3  -2   0  -3  -2   2   0   4   6  -2  -2  -1   0  -2  -1   2  -4  -2  -2
      N   0   2  -4   2   1  -4   0   2  -2   1  -3  -2   2  -1   1   0   1   0  -2  -4  -2   1
      P   1  -1  -3  -1  -1  -5  -1   0  -2  -1  -3  -2  -1   6   0   0   1   0  -1  -6  -5   0
      Q   0   1  -5   2   2  -5  -1   3  -2   1  -2  -1   1   0   4   1  -1  -1  -2  -5  -4   3
      R  -2  -1  -4  -1  -1  -4  -3   2  -2   3  -3   0   0   0   1   6   0  -1  -2   2  -4   0
      S   1   0   0   0   0  -3   1  -1  -1   0  -3  -2   1   1  -1   0   2   1  -1  -2  -3   0
      T   1   0  -2   0   0  -3   0  -1   0   0  -2  -1   0   0  -1  -1   1   3   0  -5  -3  -1
      V   0  -2  -2  -2  -2  -1  -1  -2   4  -2   2   2  -2  -1  -2  -2  -1   0   4  -6  -2  -2
      W  -6  -5  -8  -7  -7   0  -7  -3  -5  -3  -2  -4  -4  -6  -5   2  -2  -5  -6  17   0  -6
      Y  -3  -3   0  -4  -4   7  -5   0  -1  -4  -1  -2  -2  -5  -4  -4  -3  -3  -2   0  10  -4
      Z   0   2  -5   3   3  -5  -1   2  -2   0  -3  -2   1   0   3   0   0  -1  -2  -6  -4   3

  39. Problems • Constructed in 1978 • Dataset much less complete than today • Used mainly small, globular proteins • Set of proteins used much more closely related than most relationships people are attempting to identify • Assumes all positions are equally mutable • Actually have conserved and unconserved positions

  40. BLOSUM Tables • Blocks substitution matrix • Derived from aligned Blocks of related sequences • 2000 blocks • 500 different protein groups • Created from an all vs. all comparison of the protein database

  41. BLOSUM Reference • Amino acid substitution matrices from protein blocks. Henikoff, S. and Henikoff, J. G. (1992). Proc. Natl. Acad. Sci. USA 89: 10915-10919.

  42. Blocks • Aligned, ungapped conserved region of a protein family • Calculate the frequency with which any amino acid can appear at each position • Compute the probability that any amino acid can substitute for any other
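
A toy sketch of the block-based calculation, following the log-odds scheme described in the BLOSUM reference: count the residue pairs in every column, compare the observed pair frequency with the frequency expected by chance, and take twice the base-2 logarithm of the ratio. The clustering of similar sequences that the real procedure performs is omitted, the block is invented for illustration, and `blosum_like_scores` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations
from math import log2


def blosum_like_scores(block):
    """Log-odds scores (half-bit units) from one ungapped block."""
    # Count every pair of residues that share a column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, derived from the pair counts.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


toy_block = ["ASAL", "ASAL", "ASAL", "ASAL", "ASAL", "ASSL"]  # invented data
scores = blosum_like_scores(toy_block)
print(scores)
```

Pairs seen more often than chance predicts get positive scores and pairs seen less often get negative ones; pairs that never occur in the block receive no score in this sketch.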

  43. BLOSUM Advantages • Frequencies obtained from protein blocks constructed regardless of evolutionary distance • Blocks represent regions of conserved sequence similarities • Conservation due to functional constraints • Calculated frequencies reflect functional constraints • Much larger data set used than for the PAM matrix

  44. BLOSUM62 Table • Default table for almost all amino acid comparisons • FastA and TFastA use blosum50 • Many other blosum tables are available • In GenMoreData

  45. BLOSUM62 amino acid substitution matrix.
      { GAP_CREATE 12 GAP_EXTEND 4 }

          A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
      A   4  -2   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -1  -2  -1
      B  -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
      C   0  -3   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -1  -2  -4
      D  -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
      E  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5
      F  -2  -3  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1  -1   3  -3
      G   0  -1  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -1  -3  -2
      H  -2  -1  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2  -1   2   0
      I  -1  -3  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -1  -3
      K  -1  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -1  -2   1
      L  -1  -4  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -1  -3
      M  -1  -3  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -1  -2
      N  -2   1  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -1  -2   0
      P  -1  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -1  -3  -1
      Q  -1   0  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1  -1   2
      R  -1  -2  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -1  -2   0
      S   1   0  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -1  -2   0
      T   0  -1  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -1  -2  -1
      V   0  -3  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  -1  -2
      W  -3  -4  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11  -1   2  -3
      X  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
      Y  -2  -3  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2  -1   7  -2
      Z  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5

  46. StructGapPep.cmp • Alternative table • Based upon amino acid substitutions after superposition of homologous protein structures • Closely related amino acids have alpha-carbon atoms close to one another after superposition of the structures • Useful for finding weak similarities between proteins (?)

  47. Other Tables • Genetic Code Matrix • How many nucleotide changes required to switch between amino acids • Chemical Similarity • Side Chain
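
A sketch of how a genetic code matrix entry can be computed: for two amino acids, take the smallest number of nucleotide differences between any of their codons. It uses the standard genetic code; the name `min_nucleotide_changes` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")

# Map each amino acid (or '*' for stop) to the list of its codons.
CODONS = {}
for bases, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(bases))


def min_nucleotide_changes(aa1, aa2):
    """Fewest base changes needed to turn a codon for aa1 into one for aa2."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in CODONS[aa1]
               for c2 in CODONS[aa2])


print(min_nucleotide_changes("F", "Y"))  # Phe -> Tyr
print(min_nucleotide_changes("M", "W"))  # Met -> Trp
```

Phe (TTT, TTC) and Tyr (TAT, TAC) differ at one codon position, whereas Met (ATG) and Trp (TGG) differ at two.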

  48. Compare

  49. Compare • Compares two sequences for regions of similarity • Uses either a word or window/stringency (default) comparison • Produces a table of overlapping points of similarity • DotPlot plots the points on a graph