1 / 38

Introduction to the theory of sequence alignment

Introduction to the theory of sequence alignment. Yves Moreau Master of Artificial Intelligence Katholieke Universiteit Leuven 2003-2004. References.

sbland
Download Presentation

Introduction to the theory of sequence alignment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to the theory of sequence alignment Yves Moreau Master of Artificial Intelligence Katholieke Universiteit Leuven 2003-2004

  2. References • R. Durbin, A. Krogh, S. Eddy, G. Mitchinson, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Oxford University Press, 1998 • S.F. Altschul et al., Basic Local Alignment Search Tool, J. Mol. Biol., No. 215, pp. 403-410, 1990 • S.F. Altschul et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, Vol. 25, No. 17, pp. 3389-3402, 1997

  3. Overview • Alignment of two sequences • DNA • Proteins • Similarity vs. homology • Similarity • Homology • Orthology • Paralogy • Elements of an alignment • Dynamic programming

  4. Overview • Global alignment • Needleman-Wunsch algorithm • Local alignment • Smith-Waterman algorithm • Affine gap penalty • Substitution matrices • PAM • BLOSUM • Gap penalty • Significance evaluation • BLAST

  5. Biological basis for alignment

  6. BLAST for discovery

  7. Evolution of sequence databases • Genbank • SWISSProt

  8. Molecular evolution • Genomes through imperfect replication and natural selection • Gene duplications create gene families

  9. Similarity vs. homology • Sequences are similar if they are sufficiently resembling at the sequence level (DNA, protein, …) • Similarity can arise from • Homology • Convergence (functional constraints) • Chance • Sequences are homologous if they arise from a common ancestor • Homologous sequences are paralogous if their differences involve a gene duplication event • Homologous sequences are orthologous iftheir differences are not related to a gene duplication

  10. leghemoglobin - lupin myoglobin - whale b-globin - chicken a-globin - mouse b-globin - human d-globin - human b-globin - mouse a-globin - chimp Orthology vs. paralogy

  11. Phylogeny • Relationships between genes and proteins can be inferred on the basis of their sequences • Reconstruction of molecular evolution = phylogeny

  12. Homology for the prediction of structure and function • Homologous proteins have comparable structures • Homologous proteins potentially have similar functions (ortholog: similar cellular role; paralog: similar biochemical function)

  13. Homology for prediction with DNA • Conserved regions arise from evolutionary pressure and are therefore functionally important • Genes • Control regions • Comparative genomics • Genes can be predicted by comparing genomes at an appropriate evolutionary distance (e.g., mouse and human)

  14. Principles of pairwise alignment

  15. Elements of an alignment • Types of alignments • DNA vs. protein • Pairwise va. multiple alignment • Global alignment • Local alignment • Scoring model for alignments • Substitutions • Gaps (insertions, deletions) • Substitution matrix and gap penalty • Algorithm • Dynamic programming • Heuristic • Significance evaluation HEAGAWGHE-E --P-AW-HEAE

  16. Global alignment x y

  17. Strong homology Low similarity / structural homology Chance similarity Global alignment • Alignment of ‘human alpha globin’ against ‘human beta globin’, ‘lupin leghemoglobin’ and ‘glutathionine S-transferase homolog F11G11.2’(‘+’ for good substitutions) HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL HBA_HUMAN GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++H+ KV + +A ++ +L+ L+++H+ K LGB2_LUPLU NNPEFQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL GS+ + G + +D L ++ H+ D+ A +AL D ++AH+ F11G11.2 GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE

  18. Local alignment x y

  19. Substitution matrix and gap penalty • The alignment of two residues can be more or less likely • To compute the quality of an alignment, we assign a gain or a penalty to the alignment of two residues • Gaps also have a penalty BLOSUM50 substitution matrix HEAGAWGHE-E --P-AW-HEAE

  20. Substitution matrix for DNA • Standard

  21. -8 -8 -8 -8 -8 Dynamic programming • To align is to find the minimum penalty / maximum score path through the penalty table = DYNAMIC PROGRAMMING • Substitution matrix = BLOSUM 50 • Gap penalty = -8 HEAGAWGHE-E --P-AW-HEAE

  22. Shortest path from C1 to C8 Shortest path from C5 to C8 Shortest path from C1 to C5 3 C6 5 C4 C2 5 5 2 4 3 2 C8 C1 3 C5 C7 5 7 C3 6 4 Dynamic programming • Belman’s optimality principle • Example: finding the shortest train route between two cities

  23. Alignment algorithms

  24. Global alignment • Needleman-Wunsch algorithm • Progressively complete the table F(i,j) (!!! column, row) that keeps track of the maximum score for the alignment of sequence xup to xi to sequence y up to yj • Substitution matrix s(x, y) and gap penalty d • Recurrence I G A xi A I G A xi G A xi - - L G V yj G V yj - - S L G V yj { F(i-1,j-1) + s(xi, yj) substitution F(i,j) = max { F(i-1,j) – d deletion { F(i,j-1) – d insertion F(0,0) = 0

  25. Start from to left Complete progressively by recurrence Use traceback pointers s(xi, yj) – d { F(i-1,j-1) + s(xi, yj)F(i,j) = max { F(i-1,j) – d { F(i,j-1) – d – d

  26. Local alignment • Smith-Waterman algorithm • Best alignment between subsequences of x and y • If the current alignment has a negative score, it is better to start a new alignment { 0 restart{ F(i-1,j-1) + s(xi, yj) substitution F(i,j) = max { F(i-1,j) – d deletion { F(i,j-1) – d insertion F(0,0) = 0

  27. Start from top left Complete progressively by recurrence Traceback from the highest score and stop at zero AWGHE AW-HE

  28. Significance analysis • When is the score of an alignment statistically significant? • Let us look at the distribution of N alignment scores S w.r.t. random sequences • For an ungapped alignment, the score of a match is the sum of many i.i.d. random contributions and follows a normal distribution • For a normal distribution, the distribution of the maximumMN of a series of N random samples follows the extreme value distribution (EVD)P(MN <= x) = exp(–KNel(x-m))

  29. Significance analysis • For gapped alignments the EVD has the following form (even though the random contributions are not normally distributed)P(S<=x) = exp(Kmne-lS)withn length of the query, m length of the database • Ungapped alignement: parameters derived from Pi and s(i,j) • Gapped alignment: parameters estimated by regression • An alignment is significant if its probability is sufficiently small (e.g., P<0.01)

  30. Substitution matrices • How can we choose a reasonable substitution matrix? • Look at a set of confirmed alignments (with gaps) and compute the amino acid frequences qa, the substitution frequences pab, and the gap function f(g) • Likelihood model (drop the gapped positions) • Random sequences: P(x,y|R) = PiqxiPjqyj • Alignment: P(x,y|M) = Pipxiyi • Odds ratios: P(x,y|M)/P(x,y|R) = Pipxiyi/(PiqxiPjqyj ) • Log-odds score: S(x,y) = Sis(xi,yi) with s(a,b) = log(pab/qaqb) • Substitution matrix s(a,b) = log(pab/qaqb)

  31. PAM matrix • Point Accepted Mutations matrix • Problems • Alignments are not independent for related proteins • Different alignments correspond to different evolution times • PAM1 matrix • Tree of protein families • Estimate ancestral sequences • Estimate mutations at short evolutionary distance • Scale to a substitution matrix • 1% Point Accepted Mutations (PAM1) • PAM250 is 250% Point Accepted Mutations (~20% similarity) = 250ste power of PAM1

  32. BLOSUM matrix • BLOCKS SUbstitution Matrix • PAM does not work so well at large evolutionary distances • Ungapped alignments of protein families from the BLOCKS database • Group sequences with more than L% identical amino acids (e.g., BLOSUM62) • Use the substitution frequency of amino acids between the different groups (with correction for the group size) to derive the substitution matrix

  33. BLAST • For large databases, Smith-Waterman local alignment is too slow • Basic Local Alignment Search Tool (BLAST) is a fast heuristic algorithm for local alignment (http://www.ncbi.nlm.nih.gov/Entrez) • BLASTP – protein query on protein database • BLASTN – nucleotide query on nucleotide database • BLASTX – translated nucleotide query on protein database (translation into the six reading frames) • TBLASTN – protein query on translated nucleotide db • TBLASTX – translated nucleotide query on translated nucleotide db

  34. BLASTP • Step 1: Find all words of length w (e.g., w=3) for which there is a match in the query sequence with score at least T (e.g., T=11) for the chosen substitution matrix (e.g., BLOSUM62 with gap penalty 10+g) • Step 2: Use a finite state automaton to find all matches with the word list in the database (hits)

  35. BLASTP • Step 3: Check which hits have another hit without overlap within a distance of A (e.g., A=40) (the distance must be identical on the query and on the target) (two-hits) • Step 4: Extend the left hit of the two-hits in both directions by ungapped alignment ; stop the extension when the score drops by Xg (e.g., Xg=40) under the best score so far (high scoring segment pair HSP)

  36. BLASTP • Step 5: Extend the HSPs with normalized score above Sg (Sg=22 bits) by ungapped alignment ; stop the extension when the score drops by Xg (e.g., Xg=40) under the best score so far ; select the best gapped local alignment • Step 6: Compute the significance of the alignments ; for the significant alignments, repeat the gapped alignment with a higher dropoff parameter Xg for more accuracy

  37. + + + + + + + + + + + Hits + + + + Two-hits + + + + + + + + Local alignment + + + + + + + + + + + + BLASTP Query Target

  38. BLASTP example

More Related