Before we begin…

Presentation Transcript


  2. Pairwise Sequence Alignment: Lesson 2

  3. What is sequence alignment?
     Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.

       MVNLTSDEKTAVLALWNKVDVEDCGGE
       || ||  ||||| ||| || |   |||
       MVHLTPEEKTAVNALWGKVNVDAVGGE
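The identity line between two aligned sequences can be recomputed mechanically. The sketch below is only an illustration, not part of the original slides: the helper name `identity_line` is hypothetical, and it marks identical characters only (similar but non-identical residues are left blank).

```python
def identity_line(row_a: str, row_b: str) -> str:
    """Return a string with '|' under every column where both rows hold the same letter."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # A gap character never counts as an identity, even if both rows have one.
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(row_a, row_b)
    )


top = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
bottom = "MVHLTPEEKTAVNALWGKVNVDAVGGE"
line = identity_line(top, bottom)

print(top)
print(line)
print(bottom)
print(line.count("|"), "identical positions out of", len(top))
```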

  4. Why sequence alignment?
     • Predict the characteristics of a protein: use structure or function information from known proteins with similar sequences, available in databases, to predict the structure or function of an unknown protein.
     • Assumption: similar sequences produce similar proteins.

  5. Local vs. Global
     • Global alignment: finds the best alignment across the entire length of both sequences.
     • Local alignment: finds regions of high similarity in parts of the sequences.

     Global alignment forces alignment in regions that differ:

       ADLGAVFALCDRYFQ
       ||||     |||| |
       ADLGRTQN-CDRYYQ

     Local alignment concentrates on regions of high similarity:

       ADLG   CDRYFQ
       ||||   |||| |
       ADLG   CDRYYQ

  6. Sequence evolution
     In the course of evolution, sequences change from the ancestral sequence through random mutations. Three types of changes:
     • Insertion: one or more letters are inserted into the sequence. AAGA → AAGTA

  7. Sequence evolution
     In the course of evolution, sequences change from the ancestral sequence through random mutations. Three types of changes:
     • Insertion: one or more letters are inserted into the sequence. AAGA → AAGTA
     • Deletion: one or more letters are deleted from the sequence. AAGA → AGA

  8. Evolutionary changes in sequences
     In the course of evolution, sequences change from the ancestral sequence through random mutations. Three types of mutations:
     • Insertion: one or more letters are inserted into the sequence. AAGA → AAGTA
     • Deletion: one or more letters are deleted from the sequence. AAGA → AGA
     • Substitution: one or more letters are replaced by others. AAGA → AACA
     Insertion + Deletion = Indel
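As a small illustration, the three mutation types can be written as string operations. This is a sketch only: the function names and the 0-based `pos` argument are assumptions made for the example, not anything defined in the slides.

```python
def insert(seq: str, pos: int, letters: str) -> str:
    """Insert one or more letters before position pos (0-based)."""
    return seq[:pos] + letters + seq[pos:]


def delete(seq: str, pos: int, length: int = 1) -> str:
    """Delete `length` letters starting at position pos (0-based)."""
    return seq[:pos] + seq[pos + length:]


def substitute(seq: str, pos: int, letter: str) -> str:
    """Replace the single letter at position pos (0-based) with another."""
    return seq[:pos] + letter + seq[pos + 1:]


ancestor = "AAGA"
print(insert(ancestor, 3, "T"))      # AAGTA
print(delete(ancestor, 0))           # AGA
print(substitute(ancestor, 2, "C"))  # AACA
```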

  9. Sequence alignment
     Two sequences:

       AAGCTGAATTCGAA
       AGGCTCATTTCTGA

     One possible alignment:

       AAGCTGAATT-C-GAA
       AGGCT-CATTTCTGA-

     This alignment includes:
     • 2 mismatches
     • 4 indels (gaps)
     • 10 perfect matches

  10. Choosing an alignment
     Many different alignments are possible for the same two sequences:

       AAGCTGAATTCGAA
       AGGCTCATTTCTGA

       AAGCTGAATT-C-GAA
       AGGCT-CATTTCTGA-

       A-AGCTGAATTC--GAA
       AG-GCTCA-TTTCTGA-

     Which alignment is better?

  11. Scoring an alignment: example
     A naïve scoring system:
     • Match: +1
     • Mismatch: -2
     • Indel: -1

       AAGCTGAATT-C-GAA
       AGGCT-CATTTCTGA-
       Score = (+1)×10 + (-2)×2 + (-1)×4 = 2

       A-AGCTGAATTC--GAA
       AG-GCTCA-TTTCTGA-
       Score = (+1)×9 + (-2)×2 + (-1)×6 = -1

     Higher score → better alignment
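The naïve scoring system can be checked with a few lines of code. This is a minimal sketch under the slide's scheme (match +1, mismatch -2, indel -1). The function name `score_alignment` is hypothetical, and any column containing a gap is counted as one indel.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-2, indel=-1):
    """Score two aligned rows column by column; return (score, column counts)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            counts["indel"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    score = (match * counts["match"]
             + mismatch * counts["mismatch"]
             + indel * counts["indel"])
    return score, counts


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first)   # (2, {'match': 10, 'mismatch': 2, 'indel': 4})
print(second)  # (-1, {'match': 9, 'mismatch': 2, 'indel': 6})
```

Under this scheme the first alignment scores higher, so it is the better of the two.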

  12. Scoring system
     • Different scoring systems can produce different optimal alignments.
     • A scoring system implicitly represents a particular theory of similarity or dissimilarity between sequence characters: evolution based or physico-chemical properties based.
     • Some mismatches are more plausible than others:
       • Transition vs. transversion
       • Lys→Arg ≠ Lys→Cys
     • Gap extension vs. gap opening

  13. Substitution matrices
     • Nucleic acids:
       • Transition-transversion
     • Amino acids:
       • Evolution (empirical data) based: PAM, BLOSUM
       • Physico-chemical properties based: Grantham, McLachlan
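For nucleic acids, a transition swaps a purine for a purine (A↔G) or a pyrimidine for a pyrimidine (C↔T); every other substitution is a transversion. The sketch below turns that rule into a tiny scoring function. The function names and the score values (+1, -1, -2) are assumptions chosen for illustration, not values given in the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x: str, y: str) -> str:
    """Classify a pair of DNA bases as identity, transition, or transversion."""
    x, y = x.upper(), y.upper()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if x == y:
        return "identity"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def substitution_score(x, y, match=1, transition=-1, transversion=-2):
    """Hypothetical scores: transitions are penalized less than transversions."""
    kind = classify_substitution(x, y)
    return {"identity": match,
            "transition": transition,
            "transversion": transversion}[kind]


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("G", "C"))  # transversion
print(substitution_score("C", "T"))     # -1
```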

  14. PAM matrices
     • A family of matrices: PAM80, PAM120, PAM250
     • The number in a PAM matrix name represents evolutionary distance.
     • Greater numbers denote greater distances.

  15. Which PAM matrix to use?
     • Low PAM numbers: strong similarities
     • High PAM numbers: weak similarities
     • PAM120 for general use (40% identity)
     • PAM60 for close relations (60% identity)
     • PAM250 for distant relations (20% identity)
     • If uncertain, try several different matrices:
       • PAM40, PAM120, PAM250

  16. PAM: limitations
     • Based on only one original dataset
     • Examines proteins with few differences (85% identity)
     • Based mainly on small globular proteins, so the matrix is biased

  17. BLOSUM matrices
     • The different BLOSUMn matrices are calculated independently from the BLOCKS database.
     • BLOSUMn is built from blocks in which sequences sharing at least n percent identity are clustered and counted as a single sequence.
     • BLOSUM62 represents closer sequences than BLOSUM45.

  18. Example: BLOSUM62, derived from blocks in which sequences sharing at least 62% identity are clustered together

  19. Which BLOSUM matrix to use?
     • Low BLOSUM numbers for distant sequences
     • High BLOSUM numbers for similar sequences
     • BLOSUM62 for general use
     • BLOSUM80 for close relations
     • BLOSUM45 for distant relations

  20. PAM vs. BLOSUM
     Equivalent matrices, ordered from closer to more distant sequences:
     • PAM100 = BLOSUM90
     • PAM120 = BLOSUM80
     • PAM160 = BLOSUM60
     • PAM200 = BLOSUM52
     • PAM250 = BLOSUM45

  21. Gap penalty
     • Gaps are penalized.
     • Gap opening and gap extension receive different scores:
       • Insertions and deletions are rare in evolution.
       • But once they occur, they are easy to extend.
     • Gap-extension penalty < gap-opening penalty
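The difference between opening and extending a gap can be sketched as an affine penalty. The values below (-10 to open, -1 to extend) and the convention that a gap of length L costs open + (L - 1) × extend are assumptions for illustration. Tools differ in both the values and the exact convention.

```python
import re


def gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Affine penalty for one gap: the first position opens, the rest extend."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + (length - 1) * gap_extend


def total_gap_penalty(aligned_row, gap_open=-10, gap_extend=-1):
    """Sum the affine penalties over every run of '-' in one aligned row."""
    return sum(gap_penalty(len(run), gap_open, gap_extend)
               for run in re.findall(r"-+", aligned_row))


print(gap_penalty(3))                    # -12: one gap of length 3
print(3 * gap_penalty(1))                # -30: three separate gaps of length 1
print(total_gap_penalty("AC---GT-A"))    # -22: one gap of 3 plus one gap of 1
```

With these numbers a single long gap is much cheaper than several short ones, which reflects the idea that indels are rare but easy to extend.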

  22. Web servers for pairwise alignment

  23. BLAST 2 sequences (bl2seq) at NCBI
     • Produces a local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine.
     • Uses a heuristic rather than an exact algorithm.

  24. Back to NCBI

  25. BLAST – bl2seq

  26. Bl2seq: query
     • blastn: nucleotide
     • blastp: protein

  27. Bl2seq results

  28. Bl2seq results (figure labels: match, similarity, dissimilarity, gaps, low complexity)

  29. Bl2seq results
     • Bit score: a score for the alignment based on the number of identities, similarities, etc.
     • Expect value (E-value): the number of alignments with the same or a better score that one can expect to see by chance when searching a database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is real.
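The relation between bit score and E-value can be sketched with the standard formula E = m × n × 2^(-S'), where S' is the bit score, m the query length, and n the database length. This is a simplified illustration: the function name is hypothetical, and it uses raw lengths, whereas BLAST applies corrected "effective" lengths, so reported E-values will differ somewhat.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: E = m * n * 2**(-S') with raw (uncorrected) lengths."""
    return query_len * db_len * 2.0 ** (-bit_score)


e_small_db = expected_hits(30, query_len=100, db_len=1_000_000)
e_large_db = expected_hits(30, query_len=100, db_len=2_000_000)
e_higher_score = expected_hits(31, query_len=100, db_len=1_000_000)

print(e_small_db)      # about 0.093
print(e_large_db)      # doubling the database doubles E
print(e_higher_score)  # one extra bit halves E
```

This shows why the same bit score is less convincing against a larger database.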

  30. BLAST – programs
     • Query: DNA or protein
     • Database: DNA or protein

  31. BLAST – Blastp

  32. Blastp - results

  33. Blastp – results (cont’)

  34. Blastp – acquiring sequences

  35. blastp – acquiring sequences (cont’)


  37. Searching for remote homologs
     • Sometimes BLAST isn't enough.
     • For a large protein family, BLAST may find only the close members, while we also want the more distant ones.
     • PSI-BLAST
     • Profile HMMs (not discussed in this exercise)

  38. PSI-BLAST
     • Position-Specific Iterated BLAST
     • Workflow: regular BLAST → construct a profile from the BLAST results → BLAST search with the profile → final results
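The "construct a profile" step can be illustrated with a toy position-specific scoring matrix. This is a sketch only and not PSI-BLAST's actual method: it assumes gap-free hits of equal length, a uniform background frequency, and a simple pseudocount of 1, and the function names and example hits are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_hits, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    length = len(aligned_hits[0])
    if any(len(hit) != length for hit in aligned_hits):
        raise ValueError("all hits must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for col in range(length):
        counts = Counter(hit[col] for hit in aligned_hits)
        total = len(aligned_hits) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_against_profile(profile, seq):
    """Sum the per-position scores of a gap-free sequence of profile length."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(column[res] for column, res in zip(profile, seq))


hits = ["MVHLT", "MVNLT", "MVHLS"]  # made-up hits from a first search round
profile = build_profile(hits)
print(score_against_profile(profile, "MVHLT"))  # family-like: positive score
print(score_against_profile(profile, "AAAAA"))  # unrelated: negative score
```

Searching with such a profile rewards residues seen at each position in earlier hits, which is how each iteration can reach more distant family members.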

  39. PSI-BLAST
     • Advantage: PSI-BLAST looks for sequences that are close to the query and learns from them to extend the circle of friends.
     • Disadvantage: if we obtain a wrong hit, we will reach unrelated sequences (contamination). This gets worse with each iteration.

  40. BLAST – PSI-Blast

  41. PSI-Blast - results