Homology and sequence alignment. - PowerPoint PPT Presentation

Presentation Transcript

  1. Homology and sequence alignment.

  2. Homology Homology = Similarity between objects due to a common ancestry Hund = Dog, Schwein = Pig

  3. Sequence homology Similarity between sequences as a result of common ancestry.
     VLSPAVKWAKVGAHAAGHG
     ||| || |||| |  ||||
     VLSEAVLWAKVEADVAGHG

  4. Sequence alignment Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.

  5. Why align?
     VLSPAVKWAKV
     ||| || ||||
     VLSEAVLWAKV
     • To detect whether two sequences are homologous. If so, homology may indicate similarity in function (and structure).
     • Required for evolutionary studies (e.g., tree reconstruction).
     • To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).

  6. Sequence alignment If two sequences share a common ancestor (for example, human and dog hemoglobin), we can represent their evolutionary relationship using a tree.
     VLSPAV-WAKV
     ||| || ||||
     VLSEAVLWAKV
     [Tree: leaves VLSEAVLWAKV and VLSPAV-WAKV]

  7. Perfect match A perfect match suggests that no change has occurred from the common ancestor (although this is not always the case).
     VLSPAV-WAKV
     ||| || ||||
     VLSEAVLWAKV
     [Tree: leaves VLSEAVLWAKV and VLSPAV-WAKV]

  8. A substitution A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it has occurred).
     VLSPAV-WAKV
     ||| || ||||
     VLSEAVLWAKV
     [Tree: leaves VLSEAVLWAKV and VLSPAV-WAKV]

  9. Indel Option 1: The ancestor had L and it was lost here. In such a case, the event was a deletion.
     VLSPAV-WAKV
     ||| || ||||
     VLSEAVLWAKV
     [Tree: ancestor VLSEAVLWAKV; leaves VLSEAVLWAKV and VLSPAV-WAKV]

  10. Indel Option 2: The ancestor was shorter and the L was inserted here. In such a case, the event was an insertion.
      VLSPAV-WAKV
      ||| || ||||
      VLSEAVLWAKV
      [Tree: ancestor VLSEAVWAKV, with L inserted; leaves VLSEAVLWAKV and VLSPAV-WAKV]

  11. Indel Normally, given two sequences, we cannot tell whether the event was an insertion or a deletion, so we call it an indel.
      [Tree: "Deletion? Insertion?"; leaves VLSEAVLWAKV and VLSPAV-WAKV]

  12. Global vs. Local Global alignment: finds the best alignment across the entire two sequences. Local alignment: finds regions of similarity in parts of the sequences.
      Global alignment forces alignment in regions which differ:
      ADLGAVFALCDRYFQ
      ||||     |||| |
      ADLGRTQN-CDRYYQ
      Local alignment will return only regions of good alignment:
      ADLG CDRYFQ
      |||| |||| |
      ADLG CDRYYQ

  13. Global alignment PTK2 protein tyrosine kinase 2 of human and rhesus monkey

  14. Proteins are comprised of domains Human PTK2 : Domain A Domain B Protein tyrosine kinase domain

  15. Protein tyrosine kinase domain In leukocytes, a different gene for tyrosine kinase is expressed. Domain A Domain X Protein tyrosine kinase domain

  16. The sequence similarity is restricted to a single domain PTK2 Domain A Protein tyrosine kinase domain Domain B Domain X Protein tyrosine kinase domain Leukocyte TK

  17. Global alignment of PTK and LTK X

  18. Local alignment of PTK and LTK

  19. Conclusions Use global alignment when the two sequences share the same overall sequence arrangement. Use local alignment to detect regions of similarity.
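
The difference between the two modes can be sketched with a small dynamic-programming scorer. This is an illustration only: the slides do not name an algorithm, so the recurrences below are the textbook Needleman-Wunsch (global) and Smith-Waterman (local) ones, and the scoring values (+1, -2, -1) and the function name `alignment_score` are assumptions chosen for the example.

```python
def alignment_score(s, t, local=False, match=1, mismatch=-2, gap=-1):
    """Best alignment score of s and t with a linear gap cost.

    local=False: global alignment (both sequences aligned end to end).
    local=True:  local alignment (best-scoring pair of substrings).
    """
    cols = len(t) + 1
    # Row 0: leading gaps cost `gap` each in global mode, nothing in local mode.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, cols):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            cur.append(cell)
        prev = cur
    return best if local else prev[-1]


a, b = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", alignment_score(a, b))
print("local: ", alignment_score(a, b, local=True))
```

The local score can never be lower than the global one, because the local mode is free to ignore the poorly matching middle region that the global mode is forced to align.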

  20. How alignments are computed


  22. AAGCTGAATT-C-GAA
      AGGCT-CATTTCTGA-
      This alignment includes: 2 mismatches, 4 indels (gaps), 10 perfect matches.

  23. Choosing an alignment for a pair of sequences Many different alignments are possible for 2 sequences:
      AAGCTGAATTCGAA
      AGGCTCATTTCTGA

      A-AGCTGAATTC--GAA
      AG-GCTCA-TTTCTGA-

      AAGCTGAATT-C-GAA
      AGGCT-CATTTCTGA-
      Which alignment is better?

  24. Scoring system (naïve) Perfect match: +1. Mismatch: -2. Indel (gap): -1.
      AAGCTGAATT-C-GAA
      AGGCT-CATTTCTGA-
      Score = (+1)x10 + (-2)x2 + (-1)x4 = 2

      A-AGCTGAATTC--GAA
      AG-GCTCA-TTTCTGA-
      Score = (+1)x9 + (-2)x2 + (-1)x6 = -1
      Higher score -> better alignment.
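
The two scores on this slide can be reproduced with a few lines of code. The sketch below is only an illustration of the naive scheme (+1 match, -2 mismatch, -1 gap); the function name `score_alignment` is made up for the example.

```python
def score_alignment(top, bottom, match=1, mismatch=-2, gap=-1):
    """Sum the per-column scores of two gapped strings of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # indel column
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # substitution
    return total


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)  # 2 -1
```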

  25. Alignment scoring - scoring of sequence similarity: • Assumes independence between positions: • each position is considered separately • Scores each position: • Positive if identical (match) • Negative if different (mismatch or gap) • Total score = sum of position scores • Can be positive or negative

  26. Scoring system • In the example above, the choice of +1 for match, -2 for mismatch, and -1 for gap is quite arbitrary • Different scoring systems -> different alignments • We want a good scoring system…

  27. DNA scoring matrices Can take into account biological phenomena such as: Transition-transversion

  28. Amino-acid scoring matrices Take into account physico-chemical properties

  29. Scoring gaps (I) In advanced algorithms, two gaps of one amino-acid are given a different score than one gap of two amino acids. This is solved by giving a penalty to each gap that is opened. Gap extension penalty < Gap opening penalty
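
A minimal sketch of this opening/extension idea follows. The penalty values (-3 to open, -1 to extend) and the convention that the first gap position pays the opening penalty while each further position pays the extension penalty are assumptions made for the example; real tools use their own values and conventions.

```python
def affine_gap_score(top, bottom, match=1, mismatch=-2,
                     gap_open=-3, gap_extend=-1):
    """Score a gapped alignment with separate gap-open and gap-extend costs."""
    total = 0
    gap_row = None  # row ("top"/"bottom") holding the gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is cheaper than opening a new one.
            total += gap_extend if row == gap_row else gap_open
            gap_row = row
        else:
            total += match if a == b else mismatch
            gap_row = None
    return total


one_long_gap = affine_gap_score("AC--GT", "ACTTGT")    # 4 matches, 1 open + 1 extend
two_short_gaps = affine_gap_score("A-C-GT", "ATCTGT")  # 4 matches, 2 opens
print(one_long_gap, two_short_gaps)  # 0 -2
```

Both alignments contain two gap characters, but the single two-residue gap scores higher than two separate one-residue gaps, which is the behaviour the slide describes.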

  30. Homology versus chance similarity How to check if the score is significant? A. Take the two sequences -> compute their score. B. Take one sequence, randomly shuffle it -> find its score with the second sequence. Repeat 100,000 times. If the score in A is in the top 5% of the scores in B -> the similarity is significant.
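
The shuffling procedure can be sketched as below. To keep the example short, the score is a deliberately simple stand-in (the number of identical positions in an ungapped comparison); any alignment score could be passed in instead. The names `identity_count` and `shuffle_test` and the default of 1,000 shuffles (rather than the 100,000 on the slide) are assumptions for illustration.

```python
import random


def identity_count(a, b):
    """Stand-in score: number of identical positions, compared without gaps."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_test(seq_a, seq_b, score=identity_count, n_shuffles=1000, seed=0):
    """Return the observed score and the fraction of shuffles scoring as high."""
    rng = random.Random(seed)
    observed = score(seq_a, seq_b)               # step A
    letters = list(seq_a)
    at_least_as_good = 0
    for _ in range(n_shuffles):                  # step B, repeated
        rng.shuffle(letters)                     # same composition, random order
        if score("".join(letters), seq_b) >= observed:
            at_least_as_good += 1
    return observed, at_least_as_good / n_shuffles


observed, fraction = shuffle_test("VLSPAVKWAKVGAHAAGHG", "VLSEAVLWAKVEADVAGHG")
print(observed, fraction)
print("significant" if fraction <= 0.05 else "not significant")
```

Shuffling keeps the amino-acid composition of the sequence fixed, so the test asks whether the order of residues, not just their frequencies, explains the score.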

  31. How close? Rule of thumb: Proteins are homologous if they are at least 25% identical (for length > 100). DNA sequences are homologous if they are at least 70% identical.

  32. Twilight zone < 25% identity in proteins – may be homologous and may not be…. (Note that 5% identity will be obtained completely by chance!)

  33. Searching a sequence database Idea: To find sequences homologous to a sequence of interest, compute its pairwise alignment against all known sequences in a database, and detect the best-scoring significant homologs. The same idea in short: use your sequence as a query to find homologous sequences in a sequence database.

  34. Some terminology Query sequence: the sequence with which we are searching. Hit: a sequence found in the database, suspected to be homologous.

  35. Query sequence: DNA or protein? For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. Which is preferable?

  36. Protein is better! Selection (and hence conservation) works (mostly) at the protein level: CTT TCA = Leu-Ser; TTG AGT = Leu-Ser
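
The example on this slide can be checked directly: the two DNA sequences share only one of six positions, yet they encode the same dipeptide. The sketch below uses a four-entry excerpt of the standard genetic code (just the codons that appear on the slide); the helper names are made up for the example.

```python
# Excerpt of the standard genetic code: only the codons used on the slide.
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna):
    """Translate a DNA string codon by codon using the excerpt above."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return "-".join(CODON_TABLE[c] for c in codons)


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(translate("CTTTCA"), translate("TTGAGT"))  # Leu-Ser Leu-Ser
print(identity("CTTTCA", "TTGAGT"))              # 1 of 6 positions identical
```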

  37. Query type • Nucleotides: a four letter alphabet • Amino acids: a twenty letter alphabet • Two random DNA sequences will, on average, have 25% identity • Two random protein sequences will, on average, have 5% identity
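
The 25% and 5% figures follow if every letter of the alphabet is assumed to be equally likely (an idealization; real nucleotide and amino-acid frequencies are not uniform). Two random letters then match with probability 1/k for an alphabet of size k, as this small sketch shows; `expected_identity` is a name chosen for the example.

```python
from fractions import Fraction


def expected_identity(alphabet_size):
    """Chance that two independent, uniformly random letters are identical."""
    p = Fraction(1, alphabet_size)
    # Sum over all letters of P(both positions carry that letter) = k * p^2.
    return alphabet_size * p * p


print(expected_identity(4))   # 1/4  -> 25% for DNA
print(expected_identity(20))  # 1/20 -> 5% for proteins
```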

  38. Conclusion The amino-acid sequence is often preferable for homology search

  39. How do we search a database? If each pairwise alignment takes 1/10 of a second, and if the database contains 10^7 sequences, it will take 10^6 seconds ≈ 11.6 days to complete one search. 150,000 searches (at least!!) are performed per day. >82,000,000 sequence records in GenBank.
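
The arithmetic behind this estimate, written out as a quick check (the variable names are illustrative only):

```python
alignments_per_second = 10        # 1/10 of a second per pairwise alignment
database_size = 10**7             # sequences in the database
seconds_per_day = 24 * 60 * 60

total_seconds = database_size / alignments_per_second
days = total_seconds / seconds_per_day
print(f"{total_seconds:.0f} seconds = {days:.1f} days per search")
```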

  40. Conclusion Using the exact comparison pairwise alignment algorithm between the query and all DB entries – too slow

  41. Heuristic Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (although the result is usually not too bad) but takes less time than the exact algorithm

  42. BLAST

  43. BLAST BLAST: Basic Local Alignment Search Tool. A heuristic for searching a database for similar sequences

  44. DNA or Protein All types of searches are possible. Query: DNA or protein. Database: DNA or protein. • blastn: nucleotide vs. nucleotide • blastp: protein vs. protein • blastx: translated query vs. protein database • tblastn: protein vs. translated nucleotide database • tblastx: translated query vs. translated database. Translated databases: TrEMBL, GenPept

  45. BLAST - underlying hypothesis The underlying hypothesis: when two sequences are similar there are short ungapped regions of high similarity between them The heuristic: Discard irrelevant sequences Perform exact local alignment only with the remaining sequences

  46. How do we discard irrelevant sequences quickly? Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA) Save the words in a look-up table that can be searched quickly WTD TDF DFG FGY GYP … WTDFGYPAILKGGTAC
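
A word look-up table of this kind can be sketched with a dictionary that maps each word to the places where it occurs. This is a simplified illustration, not BLAST's actual data structure: it uses exact word matches only, and the second database sequence, the query, and the function names are made up for the example.

```python
from collections import defaultdict


def words(seq, w=3):
    """All overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def build_index(database, w=3):
    """Map each word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in enumerate(words(seq, w)):
            index[word].append((name, pos))
    return index


def candidates(query, index, w=3):
    """Names of database sequences sharing at least one exact word with the query."""
    hits = set()
    for word in words(query, w):
        for name, _pos in index.get(word, []):
            hits.add(name)
    return hits


database = {"seq1": "WTDFGYPAILKGGTAC", "seq2": "MKVLAAGIVG"}  # seq2 is hypothetical
index = build_index(database)
print(words("WTDFGYPAILKGGTAC")[:5])  # ['WTD', 'TDF', 'DFG', 'FGY', 'GYP']
print(candidates("AADFGYAA", index))  # {'seq1'}
```

Building the index costs one pass over the database, after which each query word is a single dictionary look-up instead of a scan of every sequence.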

  47. BLAST: discarding sequences When the user enters a query sequence, it is also divided into words Search the database for consecutive neighboring words

  48. Neighbor words Neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level. GFC (20) GFB GPC (11) WAC (5)

  49. E-value The number of times we will theoretically find an alignment with a score ≥ Y of a random sequence vs. a random database. Theoretically, we could trust any result with an E-value ≤ 1. In practice, BLAST uses estimations. E-values of 10^-4 and lower indicate a significant homology. E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous). E-values between 10^-2 and 1 do not indicate a good homology.
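
The rule-of-thumb thresholds on this slide can be written as a small helper for triaging hits. The function name and the wording of the labels are assumptions for the example; the slide gives no guidance above 1, so that label is one reading of "we could trust any result with an E-value ≤ 1".

```python
def interpret_e_value(e_value):
    """Classify a BLAST E-value using the thresholds from the slide."""
    if e_value <= 1e-4:
        return "significant homology"
    if e_value <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    if e_value <= 1:
        return "does not indicate a good homology"
    return "expected by chance"  # assumption: the slide does not cover E > 1


for e in (1e-30, 5e-3, 0.5, 10):
    print(e, "->", interpret_e_value(e))
```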

  50. Web servers for pairwise alignment