
Lesson 3


lily

Presentation Transcript


  1. Lesson 3 Aligning sequences and searching databases

  2. Homology • Similarity between objects due to a common ancestry

  3. Sequence homology • Similarity between sequences that results from a common ancestor
       VLSPAVKWAKVGAHAAGHG
       VLSEAVLWAKVEADVAGHG
     • Basic assumption: Sequence homology → similar structure/function

  4. Sequence alignment Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

  5. Homology [Figure: gene tree with nodes labeled G, one of them labeled G1,G2] • Ortholog – homolog that arose via speciation (usually with similar function) • Paralog – homolog that arose from gene duplication • Orthologs – 2 homologs from different species • Paralogs – 2 homologs within the same species

  6. How close? • Rule of thumb: • Proteins are homologous if over 25% identical (length >100) • DNA sequences are homologous if over 70% identical
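
The identity thresholds above depend on how percent identity is computed. A minimal sketch, assuming two already-aligned sequences of equal length and taking the alignment length as the denominator (other conventions exist); the function name `percent_identity` is hypothetical, and the input is the pair of fragments from slide 3:

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned columns holding the same residue.

    Assumption: gap columns ('-') never count as identical, and the
    denominator is the full alignment length.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * same / len(a)


# The two fragments from slide 3 agree in 14 of 19 positions.
pid = percent_identity("VLSPAVKWAKVGAHAAGHG", "VLSEAVLWAKVEADVAGHG")
print(round(pid, 1))  # 73.7
```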

  7. Twilight zone • < 20% identity in proteins: the sequences may or may not be homologous • (Note that about 5% identity is expected purely by chance!)

  8. Why sequence alignment? Predict characteristics of a protein – use the structure/function of known proteins to predict the structure/function of an unknown protein

  9. Sequence modifications Sequences change in the course of evolution due to random mutations. Three types of mutations: • Insertion – an insertion of one or more nucleotides into the sequence: AAGA → AAGTA • Deletion – a deletion of one or more nucleotides from the sequence: AAGA → AGA • Substitution – a replacement of one nucleotide by another: AAGA → AACA • Insertion or deletion? → Indel

  10. Local vs. Global • Global alignment – finds the best alignment across the entire length of the two sequences; it forces alignment even in regions that differ.
        ADLGAVFALCDRYFQ
        ||||     |||| |
        ADLGRTQN-CDRYYQ
      • Local alignment – finds regions of similarity in parts of the sequences; it returns only the regions of good alignment.
        ADLG CDRYFQ
        |||| |||| |
        ADLG CDRYYQ

  11. When global and when local?

  12. Global alignment • PTK2 protein tyrosine kinase 2 of human and rhesus monkey

  13. Protein tyrosine kinase domain

  14. Protein tyrosine kinase domain • Human PTK2 and leukocyte tyrosine kinase • Both function as tyrosine kinases, in completely different contexts • Ancient duplication

  15. Global alignment of PTK and LTK X

  16. Local alignment of PTK and LTK

  17. Pairwise alignment
        AAGCTGAATTCGAA
        AGGCTCATTTCTGA
      One possible alignment:
        AAGCTGAATT-C-GAA
        AGGCT-CATTTCTGA-

  18. This alignment includes: • 2 mismatches • 4 indels (gaps) • 10 perfect matches
        AAGCTGAATT-C-GAA
        AGGCT-CATTTCTGA-

  19. Choosing an alignment: • Many different alignments are possible:
        AAGCTGAATTCGAA
        AGGCTCATTTCTGA

        A-AGCTGAATTC--GAA
        AG-GCTCA-TTTCTGA-

        AAGCTGAATT-C-GAA
        AGGCT-CATTTCTGA-
      Which alignment is better?

  20. Alignment scoring - scoring of sequence similarity: • Assumes independence between positions • Each position is considered separately • Scores each position • Positive if identical (match) • Negative if different (mismatch) or gap (indel) • Total score = sum of position scores • Can be positive or negative

  21. Example - naïve scoring system: • Perfect match: +1 • Mismatch: -2 • Indel (gap): -1
        AAGCTGAATT-C-GAA
        AGGCT-CATTTCTGA-
        Score = (+1)x10 + (-2)x2 + (-1)x4 = 2

        A-AGCTGAATTC--GAA
        AG-GCTCA-TTTCTGA-
        Score = (+1)x9 + (-2)x2 + (-1)x6 = -1
      Higher score → better alignment
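
The arithmetic on slide 21 can be checked with a short sketch. It scores each column independently, as slide 20 describes, using the slide's values (+1, -2, -1) as defaults; the function name `score_alignment` is hypothetical.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = -2, gap: int = -1):
    """Score a gapped pairwise alignment column by column.

    Returns (total score, matches, mismatches, gap columns).
    """
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1          # indel column
        elif x == y:
            matches += 1       # identical residues
        else:
            mismatches += 1    # substitution
    total = matches * match + mismatches * mismatch + gaps * gap
    return total, matches, mismatches, gaps


print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # (2, 10, 2, 4)
print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # (-1, 9, 2, 6)
```

Under the same scheme, the ungapped alignment from slide 19 has 9 matches and 5 mismatches, for a score of -1.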

  22. Scoring system: • The choice of +1, -2, and -1 scores is quite arbitrary • Different scoring systems → different alignments • Scoring systems implicitly represent a particular theory of evolution • Some mismatches are more plausible • Transition vs. Transversion • Lys→Arg ≠ Lys→Cys • Gap extension ≠ Gap opening

  23. Scoring matrix • Representing the scoring system as a table or matrix n × n (n is the number of letters the alphabet contains. n=4 for nucleotides, n=20 for amino acids) • symmetric

  24. DNA scoring matrices • Uniform substitutions between all nucleotides [Figure: scoring matrix with "Match" and "Mismatch" entries labeled]

  25. DNA scoring matrices Can take into account biological phenomena such as: • Transition-transversion
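
As an illustration of a matrix that distinguishes transitions from transversions, here is a minimal sketch. The score values (+1, -1, -2) are arbitrary assumptions chosen only to make transitions cheaper than transversions; they are not taken from the slides, and `dna_scoring_matrix` is a hypothetical name.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_scoring_matrix(match: int = 1, transition: int = -1,
                       transversion: int = -2) -> dict:
    """Build a symmetric 4 x 4 nucleotide scoring matrix as a dict.

    Transitions (purine <-> purine or pyrimidine <-> pyrimidine) get a
    milder penalty than transversions (purine <-> pyrimidine).
    """
    matrix = {}
    for x in "ACGT":
        for y in "ACGT":
            if x == y:
                score = match
            elif {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
                score = transition
            else:
                score = transversion
            matrix[(x, y)] = score
    return matrix


matrix = dna_scoring_matrix()
print(matrix[("A", "G")])  # transition: -1
print(matrix[("A", "C")])  # transversion: -2
```

Setting `transition` equal to `transversion` reduces this to the uniform matrix of slide 24.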

  26. Amino-acid scoring matrices • Take into account physico-chemical properties

  27. Amino-acid substitutions matrices • Actual substitutions: • Based on empirical data • Commonly used by many bioinformatics programs • PAM & BLOSUM

  28. Protein matrices – actual substitutions The idea: Given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other
        M G Y D E
        M G Y D E
        M G Y E E
        M G Y D E
        M G Y Q E
        M G Y D E
        M G Y E E
        M G Y E E
      In the fourth column, E and D are found in 7 of the 8 sequences
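
The counting idea on slide 28 can be sketched as follows. This is a simplified illustration that tallies residues and unordered residue pairs within one alignment column; it is not the actual PAM or BLOSUM procedure, and `column_pair_counts` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations

# The eight aligned sequences from slide 28.
ALIGNMENT = ["MGYDE", "MGYDE", "MGYEE", "MGYDE",
             "MGYQE", "MGYDE", "MGYEE", "MGYEE"]


def column_pair_counts(seqs, col):
    """Count residues and unordered residue pairs in one column (0-based)."""
    column = [s[col] for s in seqs]
    pairs = Counter()
    for x, y in combinations(column, 2):
        pairs[tuple(sorted((x, y)))] += 1
    return Counter(column), pairs


residues, pairs = column_pair_counts(ALIGNMENT, 3)  # fourth column
print(residues)           # D appears 4 times, E 3 times, Q once
print(pairs[("D", "E")])  # 12 D/E pairs out of 28 sequence pairs
```

Pairs that co-occur often in columns of related sequences, such as D/E here, would end up with high substitution scores.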

  29. PAM Matrix - Point Accepted Mutations • Based on a database of 1,572 changes in 71 groups of closely related proteins (85% identity) • Alignment was easy • Counted the number of the substitutions per amino-acid pair (20 x 20) • Found that common substitutions occurred between chemically similar amino acids

  30. PAM Matrices • Family of matrices PAM 80, PAM 120, PAM 250 • The number on the PAM matrix represents evolutionary distance • Larger numbers are for larger distances

  31. Example: PAM 250 Similar amino acids have greater score

  32. PAM - limitations • Based on a single, limited dataset • Examines only proteins with few differences (85% identity) • Based mainly on small globular proteins, so the matrix is biased

  33. BLOSUM • Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset • BLOSUM observes significantly more replacements than PAM, even for infrequent pairs

  34. BLOSUM: BLOcks SUbstitution Matrix • Based on the BLOCKS database • ~2000 blocks from 500 families of related proteins • Families of proteins with identical function • Blocks are short conserved patterns of 3-60 aa without gaps
        AABCDA----BBCDA
        DABCDA----BBCBB
        BBBCDA-AA-BCCAA
        AAACDA-A--CBCDB
        CCBADA---DBBDCC
        AAACAA----BBCCC

  35. BLOSUM • Each block represents a sequence alignment with different identity percentage • For each block the amino-acid substitution rates were calculated to create the BLOSUM matrix

  36. BLOSUM Matrices • BLOSUMn is based on sequences that share at least n percent identity • BLOSUM62 represents closer sequences than BLOSUM45

  37. Example: BLOSUM62, derived from blocks in which the sequences share at least 62% identity

  38. PAM vs. BLOSUM (approximately equivalent matrices, listed from closer to more distant sequences)
        PAM100 = BLOSUM90
        PAM120 = BLOSUM80
        PAM160 = BLOSUM60
        PAM200 = BLOSUM52
        PAM250 = BLOSUM45

  39. Scoring system = substitution matrix + gap penalty

  40. Gap penalty • We penalize gaps • Scoring for gap opening & gap extension: • Gap-extension penalty < gap-open penalty
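
A minimal sketch of an affine gap penalty, where opening a gap costs more than extending it. The values -10 and -1 are illustrative assumptions, and this sketch charges the opening penalty for the first gap position and the extension penalty for each further position; some tools instead charge open plus length times extend. The name `affine_gap_cost` is hypothetical.

```python
def affine_gap_cost(length: int, gap_open: int = -10, gap_extend: int = -1) -> int:
    """Cost of one contiguous gap of `length` positions.

    Assumed convention: first position costs gap_open, each further
    position costs gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


print(affine_gap_cost(3))      # one gap of length 3: -12
print(3 * affine_gap_cost(1))  # three separate gaps of length 1: -30
```

Because extension is cheap, a single long gap is preferred over several scattered short ones.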

  41. Optimal alignment algorithms • Needleman-Wunsch (global) • Smith-Waterman (local)
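
Both algorithms fill a dynamic-programming table. The sketch below computes only the optimal score, not the alignment itself. It assumes the naïve scoring from slide 21 and a linear gap penalty rather than separate open and extend penalties; the function names are hypothetical.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -2, gap: int = -1) -> int:
    """Optimal global alignment score (linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: leading gaps in a
    for i in range(1, len(a) + 1):
        curr = [i * gap]                         # column 0: leading gaps in b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -2, gap: int = -1) -> int:
    """Optimal local alignment score: cells are floored at 0 and the
    best cell anywhere in the table is returned."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(needleman_wunsch_score("ACGT", "ACGT"))            # 4
print(smith_waterman_score("GGGACGTCCC", "TTTACGTAAA"))  # 4 (the shared ACGT)
```

The only differences between the two are the initialization, the floor at zero, and where the final score is read from.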

  42. Alignment Search Space • The “search space” (number of possible gapped alignments) for optimally aligning two sequences is exponential in the length of the sequences (n). • If n=100, there are 100^100 = 10^200 (a 1 followed by 200 zeros) different alignments! • Average protein length is about n=250!
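
The arithmetic behind the slide's figure is easy to verify with Python's arbitrary-precision integers. This only checks the stated number; it is not a count of actual alignments.

```python
n = 100
count = n ** n                 # the slide's estimate for n = 100

print(count == 10 ** 200)      # True, since 100 = 10^2
print(len(str(count)))         # 201 decimal digits: a 1 followed by 200 zeros
```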

  43. Searching databases

  44. Searching a sequence database • Using a sequence as a query to find homologous sequences in a sequence database

  45. Query sequence: DNA or protein? • For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. • Which is preferable?

  46. Protein is better! • Selection (and hence conservation) works (mostly) on the protein level: CTT TCA = Leu-Ser, TTG AGT = Leu-Ser
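
The slide's example can be made concrete: the two DNA sequences share only one of six nucleotides, yet they encode the same dipeptide. A minimal sketch with a deliberately partial codon table containing just the four codons from the slide; `translate` is a hypothetical helper.

```python
# Partial codon table: only the codons used on the slide.
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon (reading frame starts at 0)."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "-".join(CODON_TABLE[c] for c in codons)


seq1, seq2 = "CTTTCA", "TTGAGT"
identical = sum(1 for x, y in zip(seq1, seq2) if x == y)

print(translate(seq1), translate(seq2))  # Leu-Ser Leu-Ser
print(identical, "of", len(seq1))        # 1 of 6 nucleotides identical
```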

  47. Query type • Nucleotides: a four letter alphabet • Amino acids: a twenty letter alphabet • Two random DNA sequences will, on average, have 25% identity • Two random protein sequences will, on average, have 5% identity
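
The 25% and 5% figures follow from the alphabet sizes if every letter is assumed equally likely and the sequences are compared position by position without gaps. A minimal sketch under those assumptions; real residue frequencies are not uniform, and `expected_identity` is a hypothetical name.

```python
def expected_identity(freqs) -> float:
    """Probability that two independent random letters are the same."""
    return sum(p * p for p in freqs)


dna = expected_identity([1 / 4] * 4)        # four equally likely nucleotides
protein = expected_identity([1 / 20] * 20)  # twenty equally likely amino acids

print(f"{dna:.0%}")      # 25%
print(f"{protein:.0%}")  # 5%
```

The lower chance-level identity for proteins is one reason a protein hit is more informative than a DNA hit with the same percent identity.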

  48. Conclusions • Using the amino-acid sequence is preferable for homology search • Why use a nucleotide sequence after all? • No ORF found, e.g. newly sequenced genome • No similar protein sequences were found • Specific DNA databases are available (EST)

  49. Some terminology • Query sequence - the sequence with which we are searching • Hit– a sequence found in the database, suspected as homologous

  50. How do we search a database? • Assume we perform pairwise alignment of the query against all the sequences in the database • Exact pairwise alignment is O(mn) ≈ O(n^2) (m – length of sequence 1, n – length of sequence 2)
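
To see why exhaustive exact alignment is costly, one can count the dynamic-programming cells needed to align a query against every database sequence. A rough sketch; the database of 1,000 sequences is a made-up example, and `dp_cells` is a hypothetical name.

```python
def dp_cells(query_len: int, db_lengths) -> int:
    """Total table cells for exact pairwise alignment of one query
    against every database sequence (m * n cells per pair)."""
    return sum(query_len * n for n in db_lengths)


# Hypothetical database: 1,000 proteins of length 250, query of length 250.
cells = dp_cells(250, [250] * 1000)
print(cells)  # 62500000
```

The cost grows linearly with the total size of the database, so databases with millions of sequences make this approach slow.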
