
Module A: Fundamental Algorithms in Sequence Analysis


Presentation Transcript


  1. Module A: Fundamental Algorithms in Sequence Analysis. Section 1: Sequence Alignments. Srinivas Aluru

  2. "Biology easily has 500 years of exciting problems to work on." - Donald E. Knuth

  3. Biological Data DNA: • Self-replicating • Codes for proteins Proteins: • Perform most functions in living organisms BBSI Summer School - Iowa State University

  4. DNA: Sequence of nucleotides. Nucleotide: Deoxyribose sugar + Phosphate + Base. Nucleotides: A, T, G, and C. [Figure: chemical structure of a nucleotide, with the sugar carbons labeled 1’ to 5’.]

  5. [Figure: two antiparallel DNA strands with phosphate (P) backbones and labeled 5’ and 3’ ends; bases A, C, G on one strand pair with T, G, C on the other.]


  7. For computational purposes, DNA = a sequence over the alphabet {A,C,G,T}
       5’ A T T C G G G A A T G C A T G C C A 3’
       3’ T A A G C C C T T A C G T A C G G T 5’
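The two strands on this slide are base-by-base complements of each other. A minimal Python sketch of that relationship follows; the function names `complement` and `reverse_complement` are illustrative choices, not taken from the slides.

```python
# Watson-Crick base pairing over the alphabet {A, C, G, T}.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement(strand: str) -> str:
    """Complement each base, keeping the left-to-right order of the input."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Complementary strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]

top = "ATTCGGGAATGCATGCCA"   # 5' -> 3' strand from the slide
bottom = complement(top)     # 3' -> 5' strand as drawn beneath it
print(bottom)
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check on the pairing table.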

  8. Proteins: Chains of amino acid residues. There are 20 different amino acids. Functions: • Tissue building blocks (structural proteins) • Catalysts (enzymes) • Oxygen transport • Antibody defense


  11. Example
       RNA:     AUG GGA GAG CUA UGA
       Protein: Met Gly Glu Leu STOP
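The codon-by-codon reading in this example can be sketched in Python. The table below is deliberately partial: it holds only the five codons shown on the slide (the full genetic code has 64), and the name `translate` is a hypothetical helper.

```python
# Partial codon table: only the five codons that appear on the slide.
CODON_TABLE = {"AUG": "Met", "GGA": "Gly", "GAG": "Glu", "CUA": "Leu", "UGA": "STOP"}

def translate(rna: str) -> list[str]:
    """Read the RNA three bases at a time and stop at the first STOP codon."""
    rna = rna.replace(" ", "")
    protein = []
    for start in range(0, len(rna) - 2, 3):
        residue = CODON_TABLE[rna[start:start + 3]]  # KeyError for codons not in the partial table
        protein.append(residue)
        if residue == "STOP":
            break
    return protein

print(translate("AUG GGA GAG CUA UGA"))
```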


  13. Challenges in Computational Biology • Obtain the genome of an organism. • Identify and annotate genes. • Find the sequences, three-dimensional structures, and functions of proteins. • Find sequences of proteins that have desired three-dimensional structures. • Compare DNA sequences and protein sequences for similarity. • Study the evolution of sequences and species.

  14. Sequence Comparison Caveats. [Figure caption: Magenta regions are structurally equivalent with enterotoxin (top left).] http://www.sbg.bio.ic.ac.uk/AH/explanation.html

  15. Pairwise Sequence Alignment Problem: Find similarity between two sequences. Variations: • Given two sequences, determine whether parts of them are similar (local alignment). • Given a large sequence and a short sequence, determine whether the short sequence is similar to a stretch of the long sequence.

  16. Alignments • Show one sequence placed above another such that similarity is revealed. Example:
       A: C A T - T C A - C
       B: C - T C G C A G C

  17. Measuring Similarity. Score: A measure of alignment quality
         C   A   T   -   T   C   A   -   C
         C   -   T   C   G   C   A   G   C
       --------------------------------------
        10  -5  10  -5  -2  10  10  -5  10     Total = 33
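The column-by-column total on this slide can be reproduced with a short Python sketch. It assumes the scoring stated later in the deck (+10 for a match, -2 for a mismatch, -5 for a gap); `score_alignment` is an illustrative name.

```python
def score_alignment(row_a: str, row_b: str, match: int = 10, mismatch: int = -2, gap: int = -5) -> int:
    """Sum per-column scores of an alignment given as two equal-length rows with '-' for gaps."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # a base aligned with a gap
        elif a == b:
            total += match      # identical bases
        else:
            total += mismatch   # different bases
    return total

print(score_alignment("CAT-TCA-C", "C-TCGCAGC"))
```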

  18. Pairwise Global Alignment. T[i,j] = Score of optimally aligning the first i bases of s with the first j bases of t.

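  19. Calculating Alignments. The last column of an optimal alignment of the first i bases of s with the first j bases of t falls into one of three cases:
       Case 1: Match s[i] with t[j] (the rest aligns the first i-1 bases of s with the first j-1 bases of t)
         s: C A T T C A C
         t: C - T T C A G
       Case 2: Match t[j] with a gap (the rest aligns the first i bases of s with the first j-1 bases of t)
         s: C A T T C A C -
         t: C - T T C A - G
       Case 3: Match s[i] with a gap (the rest aligns the first i-1 bases of s with the first j bases of t)
         s: C A T T C A - C
         t: C - T T C A G -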
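The three cases translate directly into a dynamic programming recurrence over T[i,j]. The following is a minimal Python sketch, assuming the deck's scoring of +10 for a match, -2 for a mismatch, and -5 for a gap; `global_alignment_table` is a hypothetical name, and the sketch fills the table without performing a traceback.

```python
def global_alignment_table(s: str, t: str, match: int = 10, mismatch: int = -2, gap: int = -5):
    """Fill T where T[i][j] is the best score for aligning s[:i] with t[:j]."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = i * gap                 # first i bases of s against gaps only
    for j in range(1, n + 1):
        T[0][j] = j * gap                 # first j bases of t against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(
                T[i - 1][j - 1] + pair,   # Case 1: s[i] matched with t[j]
                T[i][j - 1] + gap,        # Case 2: t[j] matched with a gap
                T[i - 1][j] + gap,        # Case 3: s[i] matched with a gap
            )
    return T

T = global_alignment_table("CATTCAC", "CTCGCAGC")
print(T[-1][-1])   # optimal global alignment score
```

The table uses O(mn) time and space, one cell per pair (i, j).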

  20. Initial table for s = CATTCAC (rows) and t = CTCGCAGC (columns), with the first two cells of row C filled in:

            λ    C    T    C    G    C    A    G    C
       λ    0   -5  -10  -15  -20  -25  -30  -35  -40
       C   -5   10    5
       A  -10
       T  -15
       T  -20
       C  -25
       A  -30
       C  -35

      +10 for match, -2 for mismatch, -5 for gap

  21. [Table: dynamic programming table for s = CATTCAC (rows) and t = CTCGCAGC (columns); two cells are marked with *. Cell values are not preserved.] Traceback yields both optimal alignments in this example

  22. End-gap free alignment • We often don’t want to penalize gaps at the start or end of the alignment, especially when comparing short and long sequences • Same as global alignment, except: • Initialize with zeros (free gaps at start) • Locate max in the last row/column (free gaps at end)

  23. End-gap free table for s = CATTCAG (rows) and t = CTCGCAGC (columns):

            λ    C    T    C    G    C    A    G    C
       λ    0    0    0    0    0    0    0    0    0
       C    0   10    5   10    5   10    5    0   10
       A    0    5    8    5    8    5   20   15   10
       T    0    0   15   10    5    6   15   18   13
       T    0   -2   10   13    8    3   10   13   16
       C    0   10    5   20   15   18   13    8   23
       A    0    5    8   15   18   13   28   23   18
       G    0    0    3   10   25   20   23   38   33

      +10 for match, -2 for mismatch, -5 for gap
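The end-gap free variant differs from global alignment only in its initialization and in where the final score is read. A minimal Python sketch, assuming the same +10/-2/-5 scoring; the function names are illustrative.

```python
def end_gap_free_table(s: str, t: str, match: int = 10, mismatch: int = -2, gap: int = -5):
    """Same recurrence as global alignment, but the top row and left column stay zero."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]   # zeros: gaps at the start are free
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + pair, T[i][j - 1] + gap, T[i - 1][j] + gap)
    return T

def end_gap_free_score(s: str, t: str, **scoring) -> int:
    """Best value in the last row or last column: gaps at the end are free."""
    T = end_gap_free_table(s, t, **scoring)
    return max(max(T[-1]), max(row[-1] for row in T))

print(end_gap_free_score("CATTCAG", "CTCGCAGC"))
```

For the sequences on this slide the maximum lies in the last row, one column before the end, so the final C of t is left unaligned at no cost.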

  24. Local Alignment. T[i,j] = Score of optimally aligning a suffix of the first i bases of s with a suffix of the first j bases of t. Initialize top row and leftmost column to zero.

  25. [Table: local alignment table for s = CATTCAC (rows) and t = CTCGCAGC (columns). Cell values are not preserved.] +1 for a match, -1 for a mismatch, -5 for a gap
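A local alignment table can be filled with the same three cases plus a floor of zero, which lets an alignment start anywhere. The zero floor is the standard Smith-Waterman formulation and is an assumption here, since the slides state only the zero initialization. A minimal Python sketch with the slide's scoring (+1, -1, -5) follows; `local_alignment_score` is an illustrative name.

```python
def local_alignment_score(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -5) -> int:
    """Best score over all pairs of substrings of s and t."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]   # top row and leftmost column are zero
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(
                0,                        # empty suffixes: start a new alignment here
                T[i - 1][j - 1] + pair,
                T[i][j - 1] + gap,
                T[i - 1][j] + gap,
            )
            best = max(best, T[i][j])     # the answer can end at any cell
    return best

print(local_alignment_score("CATTCAC", "CTCGCAGC"))
```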

  26. Some Results • Most pairwise sequence alignment problems can be solved in O(mn) time. • Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]. • Two highly similar sequences can be aligned in O(dn) time, where d is a measure of the distance between the sequences [Landau86].

  27. Reducing space requirements • O(mn) tables are often the limiting factor in computing large alignments • There is a linear space technique that only doubles the time required [Hirschberg77]

  28. First rows of the end-gap free table for s = CATTCAG (rows) and t = CTCGCAGC (columns):

            λ    C    T    C    G    C    A    G    C
       λ    0    0    0    0    0    0    0    0    0
       C    0   10    5   10    5   10    5    0   10
       A    0    5    8    5    8    5   20   15   10

      IDEA: We only need the previous row to calculate the next
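The previous-row idea can be sketched in Python for the global alignment score, assuming the +10/-2/-5 scoring. This sketch returns only the score; recovering the alignment itself in linear space needs the divide-and-conquer technique the slides attribute to [Hirschberg77] and [Myers88], which is not shown here. The name `global_score_two_rows` is illustrative.

```python
def global_score_two_rows(s: str, t: str, match: int = 10, mismatch: int = -2, gap: int = -5) -> int:
    """Global alignment score using two rows of the table, O(len(t)) space."""
    prev = [j * gap for j in range(len(t) + 1)]       # row 0 of the table
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)               # column 0 of row i
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair, curr[j - 1] + gap, prev[j] + gap)
        prev = curr                                   # row i becomes the previous row
    return prev[-1]

print(global_score_two_rows("CATTCAC", "CTCGCAGC"))
```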

  29. Linear-space Alignments. mn + mn/2 + mn/4 + mn/8 + mn/16 + … = 2mn
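The total is a geometric series. Assuming each level of the divide-and-conquer recursion fills at most half the table area of the level above it, the work can be written as:

```latex
\[
  mn + \frac{mn}{2} + \frac{mn}{4} + \frac{mn}{8} + \cdots
  \;=\; mn \sum_{i=0}^{\infty} \left(\frac{1}{2}\right)^{i}
  \;=\; mn \cdot \frac{1}{1 - \frac{1}{2}}
  \;=\; 2mn .
\]
```

This is why the linear space technique only doubles the running time.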

  30. Affine Gap Penalty Functions. Gap penalty = h + gk, where k = length of a maximal sequence of gaps, h = gap opening penalty, g = gap continuation penalty
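To make the formula concrete, the sketch below charges h + gk for every maximal run of gaps in one row of an alignment. The values h = 4 and g = 1 are arbitrary illustrative choices, not values from the slides, and `affine_gap_penalty` is a hypothetical helper.

```python
import re

def affine_gap_penalty(aligned_row: str, h: int, g: int) -> int:
    """Total penalty: h + g*k for each maximal run of k consecutive '-' characters."""
    return sum(h + g * len(run) for run in re.findall(r"-+", aligned_row))

# Two separate gaps of length 1 cost more than one gap of length 2:
print(affine_gap_penalty("CAT-TCA-C", h=4, g=1))   # two runs, k = 1 each
print(affine_gap_penalty("CAT--TCAC", h=4, g=1))   # one run, k = 2
```

With a positive opening penalty, one long gap is cheaper than several short ones of the same total length.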

  31. PAM matrices • Percent Accepted Mutation. A unit introduced by Dayhoff et al. to quantify evolutionary change within a protein sequence [Dayhoff78]. • A PAM unit is the amount of evolution which will on average change 1% of the amino acids within a protein sequence.

  32. PAM250 scoring matrix

  33. BLOSUM matrices • Blocks Substitution Matrix. Scores for each position are obtained from frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff92]. • For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity.

  34. Comparison • PAM is based on an evolutionary model using phylogenetic trees • BLOSUM assumes no evolutionary model, but rather relies on conserved “blocks” of proteins

  35. Multiple Sequence Alignment
       VTISCTGSSSNIGAGNHVKWYQQLPG
       VTISCTGTSSNIGSITVNWYQQLPG
       LRLSCSSSGFIFSSYAMYWVRQAPG
       LSLTCTVSGTSFDDYYSTWVRQPPG
       PEVTCVVVDVSHEDPQVKFNWYVDG
       ATLVCLISDFYPGAVTVAWKADS
       ATLVCLISDFYPGAVTVAWKADS
       AALGCLVKDYFPEPVTVSWNSG-
       VSLTCLVKGFYPSDIAVEWESNG-

  36. Induced Pairwise Alignment
       S1  S - T I S C T G - S - N I
       S2  L - T I - C N G S S - N I
       S3  L R T I S C S G F S Q N I
      Induced pairwise alignment of S1 and S2:
       S1  S T I S C T G - S N I
       S2  L T I - C N G S S N I
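The induced alignment keeps the two chosen rows and drops every column in which both rows hold a gap. A minimal Python sketch follows; `induced_pairwise` is an illustrative name.

```python
def induced_pairwise(row_a: str, row_b: str) -> tuple[str, str]:
    """Project a multiple alignment onto two rows, removing columns that are gaps in both."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

s1 = "S-TISCTG-S-NI"   # rows S1 and S2 of the multiple alignment on the slide
s2 = "L-TI-CNGSS-NI"
print(induced_pairwise(s1, s2))
```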

  37. Sum-of-Pairs Scoring Function. Score of multiple alignment = Σ_{i<j} score(S_i, S_j), where score(S_i, S_j) = score of the induced pairwise alignment of S_i and S_j
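A sum-of-pairs score adds up the scores of all induced pairwise alignments. The sketch below assumes the DNA scoring used earlier in the deck (+10, -2, -5) and skips columns where both rows hold a gap, as in the induced alignment; both choices are assumptions for illustration, and the function names are hypothetical.

```python
from itertools import combinations

def pair_score(row_a: str, row_b: str, match: int = 10, mismatch: int = -2, gap: int = -5) -> int:
    """Score of the pairwise alignment induced by two rows of a multiple alignment."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue                      # column absent from the induced alignment
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

def sum_of_pairs(rows: list[str], **scoring) -> int:
    """Sum the induced pairwise scores over every unordered pair of rows."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(rows, 2))

print(sum_of_pairs(["AC-", "A-G", "ACG"]))
```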

  38. Multiple Alignment. Run-time of dynamic programming solution = O(2^k n^k), where n = length of each sequence, k = number of sequences. Space, O(n^k), is prohibitively large! Example: 6 sequences of length 100 → 6.4 × 10^13 calculations!
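The figure in this example follows from the run-time bound with the constant factor taken as 1. A small Python sketch of the arithmetic; `dp_calculations` is an illustrative name.

```python
def dp_calculations(n: int, k: int) -> int:
    """Estimate 2^k * n^k: n^k table cells, each examining on the order of 2^k neighbors."""
    return (2 ** k) * (n ** k)

def dp_cells(n: int, k: int) -> int:
    """Number of table cells, n^k, which drives the space requirement."""
    return n ** k

print(f"{dp_calculations(100, 6):.1e} calculations, {dp_cells(100, 6):.1e} cells")
```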

  39. Carrillo-Lipman Heuristic. L = Lower bound on multiple alignment score. If [condition not preserved in this transcript], then T[i1,i2,…,ik] cannot be on an optimal path.

  40. Multiple Alignment to a Phylogenetic Tree • A tree showing the evolutionary relationship between sequences is available. • Compute a multiple alignment such that for each edge (i,j) in the tree, the induced alignment between Si and Sj = the optimal alignment between Si and Sj.

  41. Examples: Primates, Darwin’s Finches. http://members.aol.com/darwinpage/trees.htm

  42. Multiple Alignment to a Tree • Build the multiple alignment incrementally. • To add a new sequence, an edge should connect it in the tree to a sequence already incorporated in the multiple alignment. • Insert the new sequence according to its optimal alignment with the other sequence connected by the edge. • Adjust other sequences in the multiple alignment. • Run-time = time for k pairwise alignments.

  43. Searching Biological Databases. BLAST (Basic Local Alignment Search Tool) http://www.ncbi.nlm.nih.gov • BLASTN (DNA) • BLASTP (Protein) • BLASTX (DNA against Protein) • PSI-BLAST (Position Specific Iterative BLAST)

  44. Multiple Alignment Software • ClustalW (http://www.ebi.ac.uk/clustalw) • MSA (http://softlib.rice.edu/softlib/msa.html) • HMMER (http://hmmer.wustl.edu/) • SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)

  45. References
       • [Dayhoff78] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, A model of evolutionary change in proteins, Atlas of Protein Sequence and Structure, 5:345-352, 1978.
       • [Henikoff92] S. Henikoff and J. G. Henikoff, Amino acid substitution matrices from protein blocks, Proc. Natl. Acad. Sci., 89:10915-10919, 1992.
       • [Hirschberg77] D. S. Hirschberg, Algorithms for the longest common subsequence problem, J. ACM, 24:664-675, 1977.
       • [Landau86] G. M. Landau and U. Vishkin, Efficient string matching with k mismatches, Theoretical Computer Science, 43:239-249, 1986.
       • [Myers88] E. Myers and W. Miller, Optimal alignments in linear space, Computer Applications in the Biosciences, 4(1):11-17, 1988.
