
Pairwise alignment

Learn the general concepts in alignment, how to read a dotplot, scoring matrices and gap penalties, basic dynamic programming algorithms, and more in this comprehensive guide. Explore the applications of sequence alignment, including sequence/genome assembly, functional annotation, and molecular evolution. Understand different types of alignments, such as global and local alignments, and the differences between amino acid and DNA alignments. Discover scoring methods and matrices like PAM and BLOSUM.


Presentation Transcript


  1. Pairwise alignment • Biology 162 Computational Genetics • Todd Vision • Fall 2004 • 31 Aug 2004

  2. Preview • General concepts in alignment • How to read a dotplot • Scoring matrices and gap penalties • Basic dynamic programming algorithms • Needleman-Wunsch • Smith-Waterman • Using more realistic gap penalties

  3. Homology • Two sequences are homologous if they are descended from a common ancestral sequence. • Homology is either all or nothing • Only similarity can be quantitative • An alignment is a model of positional homology between nucleotide or amino acid residues

  4. Some applications of sequence alignment • Sequence/genome assembly • Locating exons within genomic sequences • Functional annotation by homology search against a database • Identification of conserved signatures/motifs/domains • Molecular evolution and phylogenetics • Structural homology modeling

  5. Alignments classified by • Span • Global, encompassing full-length sequences • Local, restricted to conserved segments • Number of sequences • Pairwise, involving only two sequences • Multiple, involving more than two • Algorithm • Optimality guarantee • Heuristic

  6. Amino acids versus DNA • DNA sequences give much worse alignments than amino acid sequences • Fewer letters • Less realistic scoring matrices • Some applications can align codons • How large would that scoring matrix be? • If that’s not possible • Use aa alignment to guide DNA alignment of coding sequences • Use conceptual translations (6 potential coding frames) for database searches

  7. Dotplots: phage λ cI vs. P22 c2 repressor • Panels A, B, C • Window size: 1, 11, 25 • Stringency: 1, 7, 15 [figure: three dotplots]
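The window and stringency settings on the dotplot slide can be illustrated with a minimal sketch. It assumes the usual reading of the two parameters: a dot is drawn at (i, j) when at least `stringency` of the `window` residues starting at x[i] and y[j] are identical. The function names and the text rendering are illustrative and not taken from the slides.

```python
def dotplot(x, y, window=1, stringency=1):
    """Return the set of 0-based (i, j) window starts that earn a dot."""
    dots = set()
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            # Count identical residues inside the two windows.
            matches = sum(1 for k in range(window) if x[i + k] == y[j + k])
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(x, y, dots):
    """Draw the dotplot as text: one row per residue of x, one column per residue of y."""
    lines = ["  " + y]
    for i, residue in enumerate(x):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(y)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    x, y = "cgtgcgt", "ctgtgat"
    print(render(x, y, dotplot(x, y, window=1, stringency=1)))
    print()
    print(render(x, y, dotplot(x, y, window=3, stringency=3)))
```

Raising the window and stringency removes isolated chance matches and leaves only the diagonal runs, which is the effect the three panels are meant to show.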

  8. Internal repeats: human LDL protein • Window size = 23 bp • Stringency = 7 bp

  9. Inversions

  10. Hanging ends • Overlap • Nesting [figure: sketches for sequences x and y]

  11. Twilight Zone [figure: scale of percent amino acid identity, from 100 down to 0]

  12. Scoring an alignment • Possible relationships at a position • Match (identity) • Mismatch (substitution) • Gap (insertion/deletion, or indel) • A scoring matrix is used for matches and mismatches • Typically binary for nucleic acids • PAM, BLOSUM, & others for amino acids • Gap penalties must be “tacked on” • The alignment score is the sum of the scores at each position in the alignment, including gaps
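The "sum of the scores at each position" rule can be sketched as follows, assuming a binary nucleotide scoring scheme and a fixed per-position gap score. The default values (+1, -1, -2) are the ones quoted on the Needleman-Wunsch example slide; the function name is illustrative.

```python
def score_alignment(row_x, row_y, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores of a pairwise alignment written with '-' for gaps."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap        # indel column
        elif a == b:
            total += match      # identity
        else:
            total += mismatch   # substitution
    return total


if __name__ == "__main__":
    # Five matches, one mismatch, two gap columns.
    print(score_alignment("cgtgcg-t", "c-tgtgat"))
```

For amino acids the `match`/`mismatch` branch would be replaced by a lookup in a matrix such as PAM or BLOSUM.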

  13. LOD scores • Let p_ab be the expected frequency of aligned residue pair a and b among all aligned residues • Let q_a be the frequency of individual amino acid a
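The score formula itself is missing from the transcript. In the standard log-odds formulation, using the two quantities defined on the slide, the score for aligning a with b compares the frequency of the pair among aligned residues with the frequency expected if the two residues were paired by chance:

```latex
s(a,b) = \log \frac{p_{ab}}{q_a \, q_b}
```

The score is positive when the pair occurs more often in real alignments than chance predicts and negative when it occurs less often. Published matrices are typically scaled and rounded to integers; the base of the logarithm and the scaling factor are conventions that vary between matrices.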

  14. Point Accepted Mutation (PAM) matrices • Trained on alignments of closely related proteins • PAM1 implies 1 substitution per 100 amino acids • PAM250 = (PAM1)^250 • Training set strongly biased toward globular proteins (more suitable matrices are available for more specific protein classes)

  15. Choosing the right PAM matrix • Low PAM values discriminate among amino acids more dramatically • As the exponent increases, values within a row converge on amino acid frequencies • Choice of matrix typically depends on observed % identity • Classic chicken and egg problem • PAM250 corresponds to 20% identity • Assumes substitution rate is equal among sites (Poisson model), which we know to be false
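The statement that rows converge on the background frequencies as the exponent grows can be checked on a toy example. The sketch below is not a real PAM matrix: it assumes a hypothetical three-letter alphabet, made-up frequencies `FREQS`, and a simple model in which each residue is redrawn from the background with probability `RATE` per step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, exponent):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = mat_mul(result, m)
    return result


# Hypothetical background frequencies for a 3-letter alphabet (assumption).
FREQS = [0.5, 0.3, 0.2]
RATE = 0.01  # probability per step that a residue is redrawn (assumption)

# Toy one-step transition matrix: row i gives P(i -> j) after one step.
TOY_PAM1 = [[(1 - RATE if i == j else 0.0) + RATE * FREQS[j] for j in range(3)]
            for i in range(3)]

if __name__ == "__main__":
    for exponent in (1, 250, 2000):
        first_row = mat_pow(TOY_PAM1, exponent)[0]
        print(exponent, [round(p, 3) for p in first_row])
```

At a low exponent the diagonal dominates, so the matrix discriminates sharply between residues; at a high exponent every row approaches `FREQS` and the matrix carries little information.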

  16. BLOSUM scoring matrices • Trained on ungapped alignments (blocks) of divergent sequences to capture ‘long-term’ substitution patterns • Named BLOSUMx, where x (from 0 to 100) is the percent identity threshold used to cluster the sequences in each block: sequences at least x% identical are merged into a single cluster before substitutions are counted. (The smaller the value of x, the more divergent the sequences). • Note that numbers have opposite meanings for PAM and BLOSUM! • BLOSUM62 is in wide use (e.g., it is the default in BLAST)

  17. BLOSUM62

        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
     A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
     R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
     N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
     D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
     C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
     Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
     E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
     G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
     H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
     I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
     L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
     K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
     M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
     F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
     P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
     S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
     T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
     W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
     Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
     V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
     B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
     Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
     X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
     * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

  18. Gap penalties • Naïve score • Each gap position receives independent penalty of d • Affine scores • Score depends on length of contiguous gap • Gap opening penalty d • Gap extension penalty e
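The difference between the two gap models is easiest to see by scoring gaps of increasing length. The sketch below treats d and e as positive penalties and uses the affine form -d-(g-1)e given later in the slides for a gap of length g; the numeric values d=10 and e=1 are illustrative assumptions, not values from the lecture.

```python
def linear_gap_score(length, d):
    """Naive model: every gap position costs the same penalty d."""
    return -d * length


def affine_gap_score(length, d, e):
    """Affine model: d to open the gap, e for each further position."""
    if length == 0:
        return 0
    return -d - (length - 1) * e


if __name__ == "__main__":
    for g in (1, 2, 5, 10):
        print(g, linear_gap_score(g, d=10), affine_gap_score(g, d=10, e=1))
```

Under the affine model one long gap is much cheaper than several short gaps of the same total length, which better reflects single insertion or deletion events.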

  19. Dynamic programming • A problem solving technique that employs recursion to solve a larger problem by solving a nested set of similar subproblems

  20. Application to pairwise alignment • Imagine • We know the score for the first i-1 and j-1 residues of sequences x and y • i-1 and j-1 are aligned in the optimal alignment • There are three possibilities for the next position in the alignment • A gap in sequence x • A gap in sequence y • A match or mismatch between i and j • The maximum scoring alignment among these has to be in the optimal global alignment

  21. Overview of algorithm • We can use this fact to recursively fill out a matrix containing the score F(i,j) of the optimal alignment for every pair of residues i and j • We also store a pointer to one of three previously filled out cells in the matrix, forming a path graph • The optimal global alignment must be a path within the path graph • It can be found by performing a traceback from the final cell in the recursion

  22. Path Graph Diagonal moves represent matches and mismatches Horizontal and vertical moves represent gaps (indels)

  23. Needleman-Wunsch recursion • Three cases for each cell: x_i and y_j match (or mismatch), x_i aligned to a gap, y_j aligned to a gap • Worked example: let s(C,C)=1, s(C,G)=s(G,C)=-2, and d=-1 [figure: worked example for x: AC and y: AC]
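The recursion formula did not survive in the transcript. In its standard form, and assuming d denotes a positive gap penalty that is subtracted (the convention of the affine formula later in the slides), it reads:

```latex
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{$x_i$ aligned to $y_j$} \\
F(i-1,\,j) - d             & \text{$x_i$ aligned to a gap} \\
F(i,\,j-1) - d             & \text{$y_j$ aligned to a gap}
\end{cases}
\qquad
F(0,0) = 0,\quad F(i,0) = -id,\quad F(0,j) = -jd
```

Under this convention the gap score of -1 quoted in the worked example corresponds to d = 1.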

  24. Needleman-Wunsch • Initialize F(0,0) = 0 • Use pointers to remember path • Match=+1, Mismatch=-1, Gap=-2 • Ties between pointers are broken by an arbitrary order of precedence • Example sequences: cgtgcgt and ctgtgat [figure: matrix being filled]

  25. Needleman-Wunsch: completed path graph [figure: path graph for cgtgcgt and ctgtgat]

  26. Needleman-Wunsch: traceback • Optimal global alignment:

       cgtgcg-t
       | || | |
       c-tgtgat
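The fill and traceback steps can be put together in a short sketch. It assumes the scores from the example slide (match +1, mismatch -1, gap -2) and a tie-breaking order of diagonal, then up, then left; that order is an assumption, so the alignment returned may be a different co-optimal one from the alignment on the slide, although the score is the same.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap score; returns (score, row_x, row_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    # Borders: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        F[i][0], ptr[i][0] = i * gap, "up"
    for j in range(1, n + 1):
        F[0][j], ptr[0][j] = j * gap, "left"
    # Fill: each cell takes the best of its three predecessors.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "diag"),  # x_i aligned to y_j
                (F[i - 1][j] + gap, "up"),      # x_i aligned to a gap
                (F[i][j - 1] + gap, "left"),    # y_j aligned to a gap
            ]
            # max() keeps the first maximum, which fixes the precedence order.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback from the final cell to the origin.
    row_x, row_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(row_x)), "".join(reversed(row_y))


if __name__ == "__main__":
    score, row_x, row_y = needleman_wunsch("cgtgcgt", "ctgtgat")
    print(score)
    print(row_x)
    print(row_y)
```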

  27. Complexity of algorithm • For sequences of length m and n • We consider 3 options at each cell • We store mn scores and pointers • We trace back no more than m+n steps • 3mn + m + n in time, 3mn in memory • O(mn) in both time and memory • If m=n, O(n^2)

  28. Smith-Waterman algorithm • Local pairwise alignment • Cells with negative scores are set to zero • Traceback from highest scoring cell • Stop when 0 is encountered • Also O(nm)

  29. Smith-Waterman recursion • Four cases for each cell: score ≤ 0 (cell is set to 0), x_i and y_j match, x_i aligned to a gap, y_j aligned to a gap
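The formula on this slide is also missing from the transcript. In the standard formulation, again assuming d is a positive gap penalty, the local recursion adds a zero option to the global one and starts every border cell at zero:

```latex
F(i,j) = \max
\begin{cases}
0 & \text{start a new alignment} \\
F(i-1,\,j-1) + s(x_i, y_j) & \text{$x_i$ aligned to $y_j$} \\
F(i-1,\,j) - d             & \text{$x_i$ aligned to a gap} \\
F(i,\,j-1) - d             & \text{$y_j$ aligned to a gap}
\end{cases}
\qquad
F(i,0) = F(0,j) = 0
```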

  30. Smith-Waterman algorithm • Sequences: cgtgcgt and ctgtgat • Optimal local alignment:

        cgtgcgt
         |||
       ctgtgat
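A minimal sketch of the local algorithm follows, assuming the same scores as the global example (match +1, mismatch -1, gap -2). Instead of storing pointers, the traceback re-derives each move from the scores, which is a design choice made here for brevity rather than something stated on the slides.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-2):
    """Local alignment with a linear gap score; returns (score, row_x, row_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Negative scores are clipped to zero.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback from the highest-scoring cell until a zero is reached.
    row_x, row_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


if __name__ == "__main__":
    print(smith_waterman("cgtgcgt", "ctgtgat"))
```

With these scores the best local alignment of the two example sequences is the shared segment gtg.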

  31. Guaranteeing a local alignment • Use of SW algorithm alone does not guarantee “local” behavior • Sensitive to the scoring function (should be negative for random sequences) • Use of LOD scores helps ensure this • Gap penalties must also be chosen carefully • If it is cheaper to insert a gap than to tolerate a mismatch, then gaps will be inserted where no alignment is possible

  32. More realistic gap penalties • General gap function γ(g) • Requires O(n^3) operations

  33. Affine gap penalties • Gap score: γ(g) = -d - (g-1)e • Can be done in O(mn) • We need to keep track of three scores (and pointers) at each cell
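The "three scores at each cell" can be sketched as three matrices, one per state of the last alignment column. This is a score-only sketch of the commonly used three-state formulation, with d and e as positive penalties; the matrix names `M`, `Ix`, `Iy` and the choice to allow a gap in one sequence to follow a gap in the other directly are assumptions, not details from the slides.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, d=2, e=1):
    """Optimal global alignment score with gap score -d - (g - 1) * e per gap of length g."""
    m, n = len(x), len(y)
    # M: x_i aligned to y_j; Ix: x_i aligned to a gap; Iy: y_j aligned to a gap.
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = -d - (i - 1) * e
    for j in range(1, n + 1):
        Iy[0][j] = -d - (j - 1) * e
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # Opening a gap costs d; extending an existing one costs e.
            Ix[i][j] = max(M[i - 1][j] - d, Iy[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(M[i][j - 1] - d, Ix[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(M[m][n], Ix[m][n], Iy[m][n])


if __name__ == "__main__":
    # One gap of length 4: 4 matches - 3 (open) - 3 * 1 (extend).
    print(affine_global_score("acgtacgt", "acgt", d=3, e=1))
```

Setting e equal to d reduces the model to the linear gap score, which gives a convenient consistency check against the basic Needleman-Wunsch result.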

  34. A theme with variations • Overlapping or nested sequences • Do not penalize hanging ends • Repeated sequences • Asymmetric algorithm can find multiple local alignments of x in y or y in x • The basic idea admits many variations
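One of these variations, not penalizing hanging ends, needs only two changes to the global algorithm: the first row and column start at zero, and the result is read from the best cell in the last row or last column. The sketch below is score-only and assumes a linear gap score with the same illustrative values used in the earlier examples.

```python
def overlap_score(x, y, match=1, mismatch=-1, gap=-2):
    """Best alignment score when unaligned ends of either sequence are free."""
    m, n = len(x), len(y)
    # First row and column stay at 0: leading overhangs cost nothing.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Trailing overhangs are free: end anywhere on the last row or last column.
    last_row = max(F[m])
    last_col = max(F[i][n] for i in range(m + 1))
    return max(last_row, last_col)


if __name__ == "__main__":
    # The suffix "cgtg" of the first sequence overlaps the prefix of the second.
    print(overlap_score("aaaacgtg", "cgtgtttt"))
```

The same boundary conditions also handle nesting, where one sequence lies entirely inside the other.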

  35. Things to keep in mind about pairwise alignments • There may be multiple optima • Optimality is only guaranteed with respect to the scoring function – the alignment may still be biologically wrong! • O(mn) is still too big when n is the size of a major sequence database

  36. Summary • Dotplots are an excellent visual tool to decide whether and what kind of alignment is appropriate • PAM and BLOSUM series matrices provide empirical LOD values for scoring alignments • Different flavors of alignment are produced by variants on a basic dynamic programming algorithm • Needleman-Wunsch for global alignments • Smith-Waterman for local alignments • Affine gap penalties balance biological realism with computational feasibility

  37. Reading assignment • Nicholas et al. (2002) Strategies for multiple sequence alignment. Biotechniques 32, 572-591 (handout)
