1 / 37

Sequence Alignments

Sequence Alignments. DNA Replication. Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set DNA polymerase is the enzyme that copies DNA Reads the old strand in the 3 ´ to 5 ´ direction. Over time, genes accumulate mutations.

helen
Download Presentation

Sequence Alignments

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sequence Alignments

  2. DNA Replication • Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set • DNA polymerase is the enzyme that copies DNA • Reads the old strand in the 3´ to 5´ direction

  3. Over time, genes accumulate mutations • Environmental factors • Radiation • Oxidation • Mistakes in replication or repair • Deletions, Duplications • Insertions • Inversions • Point mutations

  4. Deletions • Codon deletion:ACG ATA GCG TAT GTA TAG CCG… • Effect depends on the protein, position, etc. • Almost always deleterious • Sometimes lethal • Frame shift mutation:ACG ATA GCG TAT GTA TAG CCG…ACG ATA GCG ATG TAT AGC CG?… • Almost always lethal

  5. Why align sequences? • The draft human genome is available • Automated gene finding is possible • Gene: AGTACGTATCGTATAGCGTAA • What does it do? • One approach: Is there a similar gene in another species? • Align sequences with known genes • Find the gene with the “best” match

  6. Are there other sequences like this one? 1) Huge public databases - GenBank, Swissprot, etc. 2) Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes 3) Similarity searching is based on alignment 4) BLAST and FASTA provide rapid similarity searching a. rapid = approximate (heuristic) b. false + and - scores

  7. Similarity ≠ Homology 1) 25% similarity ≥ 100 AAs is strong evidence for homology 2) Homology is an evolutionary statement which means “descent from a common ancestor” • common 3D structure • usually common function • homology is all or nothing, you cannot say "50% homologous"

  8. GlobalvsLocalsimilarity 1) Global similarity uses complete aligned sequences - total % matches • GCGGAP program, Needleman & Wunch algorithm 2) Local similarity looks for best internal matching region between 2 sequences • GCGBESTFIT program, • Smith-Waterman algorithm, • BLAST and FASTA 3) dynamic programming • optimal computer solution, not approximate

  9. Search with Protein, not DNA Sequences 1) 4 DNA bases vs. 20 amino acids - less chance similarity 2) can have varying degrees of similarity between different AAs - # of mutations, chemical similarity, PAM matrix 3) protein databanks are much smaller than DNA databanks

  10. Similarity is Based on Dot Plots 1) two sequences on vertical and horizontal axes of graph 2) put dots wherever there is a match 3) diagonal line is region of identity (local alignment) 4) apply a window filter - look at a group of bases, must meet % identity to get a dot

  11. Simple Dot Plot

  12. Dot plot filtered with 4 base window and 75% identity

  13. Dot plot of real data

  14. Indels • Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known:ACGTCTGATACGCCGTATCGTCTATCTACGTCTGAT---CCGTATCGTCTATCT

  15. The Genetic Code Substitutions are mutations accepted by natural selection. Synonymous: CGC CGA Non-synonymous: GAU  GAA

  16. Comparing two sequences • Point mutations, easy:ACGTCTGATACGCCGTATAGTCTATCTACGTCTGATTCGCCCTATCGTCTATCT • Indels are difficult, must align sequences:ACGTCTGATACGCCGTATAGTCTATCTCTGATTCGCATCGTCTATCTACGTCTGATACGCCGTATAGTCTATCT----CTGATTCGC---ATCGTCTATCT

  17. Scoring a sequence alignment • Match score: +1 • Mismatch score: +0 • Gap penalty: –1ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || ||||||||----CTGATTCGC---ATCGTCTATCT • Matches: 18 × (+1) • Mismatches: 2 × 0 • Gaps: 7 × (– 1) Score = +11

  18. Origination and length penalties • We want to find alignments that are evolutionarily likely. • Which of the following alignments seems more likely to you?ACGTCTGATACGCCGTATAGTCTATCTACGTCTGAT-------ATAGTCTATCTACGTCTGATACGCCGTATAGTCTATCTAC-T-TGA--CG-CGT-TA-TCTATCT • We can achieve this by penalizing more for a new gap, than for extending an existing gap  

  19. Scoring a sequence alignment (2) • Match/mismatch score: +1/+0 • Origination/length penalty: –2/–1ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || ||||||||----CTGATTCGC---ATCGTCTATCT • Matches: 18 × (+1) • Mismatches: 2 × 0 • Origination: 2 × (–2) • Length: 7 × (–1) Score = +7

  20. Scoring Similarity 1) Can only score aligned sequences 2) DNA is usually scored as identical or not 3) modified scoring for gaps - single vs. multiple base gaps (gap extension) 4) AAs have varying degrees of similarity • a. # of mutations to convert one to another • b. chemical similarity • c. observed mutation frequencies 5) PAM matrix calculated from observed mutations in protein families

  21. DNA Scoring Matrix Identity BLAST Transition/Transversion

  22. The PAM 250 scoring matrix

  23. How can we find an optimal alignment? • Finding the alignment is computationally hard:ACGTCTGATACGCCGTATAGTCTATCTCTGAT---TCG—CATCGTC--T-ATCT • C(27,7) gap positions = ~888,000 possibilities • It’s possible, as long as we don’t repeat our work! • Dynamic programming: The Needleman & Wunsch algorithm

  24. What is the optimal alignment? • ACTCGACAGTAG • Match: +1 • Mismatch: 0 • Gap: –1

  25. G +1 ACTCG ACAGTAG -1 ACTC- ACAGTAG- -1 ACTCGG ACAGTA The dynamic programming concept • Suppose we are aligning:ACTCGACAGTAG • Last position choices:

  26. We can use a table • Suppose we are aligning:A with A… A 0 -1 A -1

  27. Needleman-Wunsch: Step 1 • Each sequence along one axis • Mismatch penalty multiples in first row/column • 0 in [1,1]

  28. Needleman-Wunsch: Step 2 • Vertical/Horiz. move: Score + (simple) gap penalty • Diagonal move: Score + match/mismatch score • Take the MAX of the three possibilities

  29. Needleman-Wunsch: Step 2 (cont’d) • Fill out the rest of the table likewise…

  30. Needleman-Wunsch: Step 2 (cont’d) • Fill out the rest of the table likewise… • The optimal alignment score is calculated in the lower-right corner

  31. But what is the optimal alignment • To reconstruct the optimal alignment, we must determine of where the MAX at each step came from…

  32. A path corresponds to an alignment • = GAP in top sequence • = GAP in left sequence • = ALIGN both positions • One path from the previous table: • Corresponding alignment (start at the end):AC--TCG ACAGTAG Score = +2

  33. Semi-global alignment • Suppose we are aligning:GCGGGCG • Which do you prefer?G-CG -GCGGGCG GGCG • Semi-global alignment allows gaps at the ends for free.

  34. Semi-global alignment • Semi-global alignment allows gaps at the ends for free. • Initialize first row and column to all 0’s • Allow free horizontal/vertical moves in last row and column

  35. Local alignment • Global alignments – score the entire alignment • Semi-global alignments – allow unscored gaps at the beginning or end of either sequence • Local alignment – find the best matching subsequence • CGATGAAATGGA • This is achieved by allowing a 4th alternative at each position in the table: zero, if alternative neg. • Smith-Waterman Algorithm (1981).

  36. Local alignment • Mismatch = –1 this time CGATGAAATGGA

  37. Practice Problem • Find an optimal alignment for these two sequences: GCGGTT GCGT • Match: +1 • Mismatch: 0 • Gap: –1

More Related