Applied Bioinformatics

Presentation Transcript


  1. Applied Bioinformatics Week 3

  2. Theory I • Similarity • Dot plot

  3. 3.2 On sequence alignment • Sequence alignment is the most important task in bioinformatics! • Introduction to Bioinformatics, Lecture 3: Sequence Alignment, http://www.personeel.unimaas.nl/Westra/Education/BioInf/slides_of_bioinformatics.htm

  4. 3.2 On sequence alignment Sequence alignment is important for: • prediction of function • database searching • gene finding • sequence divergence • sequence assembly

  5. 3.3 On sequence similarity • Homology: genes that derive from a common ancestral gene are called homologs • Orthologous genes are homologous genes in different organisms (separated by a speciation event) • Paralogous genes are homologous genes in one organism that derive from gene duplication • Gene duplication: one gene is duplicated into multiple copies, which are therefore free to evolve and assume new functions

  6. HOMOLOGOUS and PARALOGOUS

  7. HOMOLOGOUS and PARALOGOUS

  8. HOMOLOGOUS and PARALOGOUS versus ANALOGOUS

  9. plants ? globin Ath-g analogs

  10. Causes for sequence (dis)similarity • Substitution (point mutation): a nucleotide at a certain location is replaced by another nucleotide (e.g. ATA → AGA) • Insertion: at a certain location one new nucleotide is inserted between two existing nucleotides (e.g. AA → AGA) • Deletion: at a certain location one existing nucleotide is deleted (e.g. ACTG → AC-G) • Indel: an insertion or a deletion

  11. Similarity • We can only measure current similarity • We can form hypothesi

  12. Similarity Searching • DotPlot • Needleman-Wunsch • Smith-Waterman • FASTA • BLAST

  13. Dot Plot • Writing one sequence horizontally • Writing the other vertically • At each intersection with equal nucleotides make a dot in the matrix
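
The construction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the course: the function names `dot_plot` and `render` are hypothetical, and the two example sequences are the ones given on the practice slide later in the deck.

```python
def dot_plot(horizontal: str, vertical: str) -> list[list[bool]]:
    """Return a matrix that is True wherever the two sequences carry the same nucleotide.

    Rows follow the vertical sequence, columns follow the horizontal sequence.
    """
    return [[v == h for h in horizontal] for v in vertical]


def render(matrix: list[list[bool]], horizontal: str, vertical: str) -> str:
    """Draw the matrix as text: '*' marks a dot, '.' marks an empty cell."""
    lines = ["  " + horizontal]
    for base, row in zip(vertical, matrix):
        lines.append(base + " " + "".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


if __name__ == "__main__":
    seq_h = "ACGTGTGCGTTTGAAC"  # written horizontally
    seq_v = "GGGTGTTCGTTTAAAC"  # written vertically
    print(render(dot_plot(seq_h, seq_v), seq_h, seq_v))
```

Runs of dots along a diagonal indicate stretches where the two sequences agree position by position.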

  14. Dot Plot

  15. Dot Plot • Messy? • Strong similarities can be visually enhanced • Select a window size and a similarity score for that window (e.g. 10 and 8) • Create a new matrix with dots where the window score >= 8
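
The window filter described above might look like the following sketch. The slide does not say where in the window the dot is placed; here it is assumed to sit at the window's starting cell, and the name `windowed_dot_plot` is hypothetical.

```python
def windowed_dot_plot(horizontal: str, vertical: str,
                      window: int = 10, threshold: int = 8) -> list[list[bool]]:
    """Keep a dot at (i, j) only if the diagonal window starting there
    contains at least `threshold` matching positions.

    Assumption: the dot is anchored at the window start, so the result has
    len(vertical) - window + 1 rows and len(horizontal) - window + 1 columns.
    """
    rows = len(vertical) - window + 1
    cols = len(horizontal) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Count matches along the diagonal of length `window`.
            matches = sum(vertical[i + k] == horizontal[j + k] for k in range(window))
            row.append(matches >= threshold)
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    # The practice sequences from later in the deck, with a window of 3.
    for row in windowed_dot_plot("ACGTGTGCGTTTGAAC", "GGGTGTTCGTTTAAAC", 3, 3):
        print("".join("*" if dot else "." for dot in row))
```

Raising the threshold towards the window size removes more of the scattered single dots and leaves only the longer diagonal runs.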

  16. Dot Plot

  17. Dot Plot Interpretation

  18. Creating a Dot Plot

  19. End Theory I • Mindmapping • 10 min break

  20. Practice I • Dot plot

  21. Dot Plot • ACGTGTGCGTTTGAAC • GGGTGTTCGTTTAAAC • Make a Dot plot for the two sequences above • Use a window of 3 to refine the view • Can you use Excel? • Get any two DNA sequences and try the tool below • http://www.vivo.colostate.edu/molkit/dnadot/

  22. Similarity Searching • DotPlot • Needleman-Wunsch • Smith-Waterman • FASTA • BLAST

  23. Definitions • Optimal alignment: the one that exhibits the most correspondences, i.e. the alignment with the highest score. It may or may not be biologically meaningful. • Global alignment: Needleman-Wunsch (1970) maximizes the number of matches between the sequences along their entire length. • Local alignment: Smith-Waterman (1981) gives the highest-scoring local match between two sequences.

  24. How can we find an optimal alignment? • ACGTCTGATACGCCGTATAGTCTATCT aligned with CTGAT---TCG-CATCGTC--T-ATCT • How many possible alignments? C(27,7) gap positions ≈ 888,000 possibilities • Dynamic programming: the Needleman & Wunsch algorithm

  25. Time Complexity • Consider two sequences: AAGT and AGTC • How many possible alignments do the two sequences have? • $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$; for $n = 4$ this gives $\binom{8}{4} = 70$
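
The two counts quoted on these slides can be checked quickly with Python's standard library. This is only a sanity check; the helper names `central_binomial` and `stirling_estimate` are made up for illustration, and the estimate uses the standard Stirling approximation of the binomial coefficient.

```python
from math import comb, pi, sqrt


def central_binomial(n: int) -> int:
    """Exact value of (2n)! / (n!)^2, the count used on the slide."""
    return comb(2 * n, n)


def stirling_estimate(n: int) -> float:
    """Approximation 2^(2n) / sqrt(pi * n), showing the exponential growth in n."""
    return 2 ** (2 * n) / sqrt(pi * n)


if __name__ == "__main__":
    print(central_binomial(4))          # two sequences of length 4
    print(round(stirling_estimate(4)))  # rough estimate of the same number
    print(comb(27, 7))                  # 7 gap positions among 27 columns
```

Because the count grows roughly like 4^n, enumerating every alignment is not feasible for realistic sequence lengths, which is the motivation for dynamic programming.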

  26. Scoring a sequence alignment • Match/mismatch score: +1/0 • Gap open/extension penalty: –2/–1
      ACGTCTGATACGCCGTATAGTCTATCT
          ||||| |||   || ||||||||
      ----CTGATTCGC---ATCGTCTATCT
      • Matches: 18 × (+1) • Mismatches: 2 × 0 • Open: 2 × (–2) • Extension: 5 × (–1) • Score = +9
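
The arithmetic on this slide can be reproduced with a short scoring function. This is a sketch under the slide's convention, in which the first position of each gap run pays the open penalty and every further position pays the extension penalty; the name `score_alignment` is hypothetical.

```python
def score_alignment(top: str, bottom: str, match: int = 1, mismatch: int = 0,
                    gap_open: int = -2, gap_extend: int = -1) -> int:
    """Score two equal-length aligned strings in which '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap = None  # which sequence carried a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            current_gap = "top" if a == "-" else "bottom"
            # A gap run continuing in the same sequence is an extension.
            score += gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            score += match if a == b else mismatch
            previous_gap = None
    return score


if __name__ == "__main__":
    print(score_alignment("ACGTCTGATACGCCGTATAGTCTATCT",
                          "----CTGATTCGC---ATCGTCTATCT"))
```

The alignment has gap runs of length 4 and 3, which gives the two opens and five extensions listed on the slide.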

  27. Pairwise Global Alignment • Computationally: • Given: a pair of sequences (strings of characters) • Output: an alignment that maximizes the similarity

  28. Needleman-Wunsch Alg

  29. Needleman-Wunsch Alg

  30. Needleman-Wunsch Alg

  31. Needleman-Wunsch Alg

  32. Needleman-Wunsch Alg

  33. Needleman-Wunsch Alg

  34. Needleman-Wunsch Alg

  35. Needleman-Wunsch Alg

  36. Needleman-Wunsch Alg • Which Alignment is better? • For scoring use: • Match 1 • Mismatch 0 • Gap open -2 • Gap extension -1 • How can substitution matrices be integrated?

  37. Needleman & Wunsch • Place each sequence along one axis • Place score 0 in the upper-left corner • Fill in the first row and column with multiples of the gap penalty • Fill in the rest of the matrix with the maximum of three possible moves: • Vertical move: score + gap penalty • Horizontal move: score + gap penalty • Diagonal move: score + match/mismatch score • The optimal alignment score is in the lower-right corner • To reconstruct the optimal alignment, trace back where the maximum at each step came from, and stop when you reach the origin.

  38. Three steps in Needleman-Wunsch Algorithm • Initialization • Scoring • Trace back (Alignment) • The two DNA sequences to be globally aligned are: ATCG (x = 4, length of sequence 1) and TCG (y = 3, length of sequence 2) Pooja Anshul Saxena, University of Mississippi

  39. Scoring Scheme • Match Score = +1 • Mismatch Score = -1 • Gap penalty = -1 • Substitution Matrix

  40. Initialization Step • Create a matrix with X + 1 rows and Y + 1 columns • The first row and the first column of the score matrix are filled with multiples of the gap penalty

  41. Scoring • The score of any cell C(i, j) is the maximum of: • score_diag = C(i-1, j-1) + S(i, j) • score_up = C(i-1, j) + g • score_left = C(i, j-1) + g • where S(i, j) is the substitution score for the letters at positions i and j, and g is the gap penalty

  42. Scoring …. • Example: the calculation for the cell C(2, 2): • score_diag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1 • score_up = C(i-1, j) + g = -1 + (-1) = -2 • score_left = C(i, j-1) + g = -1 + (-1) = -2 • so C(2, 2) = max(-1, -2, -2) = -1
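
The initialization and scoring steps can be sketched as follows for the example pair ATCG and TCG with the scoring scheme given earlier (match +1, mismatch -1, gap -1). The names `substitution` and `fill_matrix` are hypothetical. Note that the code counts rows and columns from 0, so the cell the slide calls C(2, 2) is `C[1][1]` here.

```python
def substitution(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Substitution score S for a pair of letters."""
    return match if a == b else mismatch


def fill_matrix(seq1: str, seq2: str, gap: int = -1) -> list[list[int]]:
    """Build the (len(seq1) + 1) x (len(seq2) + 1) score matrix."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    C = [[0] * cols for _ in range(rows)]
    # Initialization: first column and first row hold multiples of the gap penalty.
    for i in range(1, rows):
        C[i][0] = i * gap
    for j in range(1, cols):
        C[0][j] = j * gap
    # Scoring: each cell is the maximum of the three possible moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score_diag = C[i - 1][j - 1] + substitution(seq1[i - 1], seq2[j - 1])
            score_up = C[i - 1][j] + gap
            score_left = C[i][j - 1] + gap
            C[i][j] = max(score_diag, score_up, score_left)
    return C


if __name__ == "__main__":
    for row in fill_matrix("ATCG", "TCG"):
        print(row)
```

The value in the lower-right corner of the printed matrix is the score of the optimal global alignment.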

  43. Scoring …. • Final Scoring Matrix

  44. Trace back • The trace back step determines the actual alignment(s) that result in the maximum score • There are likely to be multiple maximal alignments • Trace back starts from the last cell, i.e. position X, Y in the matrix • Gives alignment in reverse order

  45. Trace back …. • There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left • Trace back takes the current cell and looks to the neighbor cells that could be direct predecessors. This means it looks to the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1). The algorithm for trace back chooses as the next cell in the sequence one of the possible predecessors

  46. Trace back …. • The only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of:
      Seq 1: G
             |
      Seq 2: G

  47. Trace back …. • Final Trace back • Best Alignment:
      A T C G
        | | |
      - T C G
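
Putting the three steps together, a minimal Needleman-Wunsch sketch with trace back could look like this. It assumes the linear gap penalty from the example's scoring scheme and returns only one of the possibly several optimal alignments, preferring the diagonal move when there is a tie; the name `needleman_wunsch` and the tie-breaking order are choices made for this illustration. Sequence 1 runs along the rows, following the initialization slide.

```python
def needleman_wunsch(seq1: str, seq2: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> tuple[int, str, str]:
    """Return (score, aligned seq1, aligned seq2) for one optimal global alignment."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    C = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        C[i][0] = i * gap
    for j in range(1, cols):
        C[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s, C[i - 1][j] + gap, C[i][j - 1] + gap)

    # Trace back from the last cell to the origin; the alignment is built in reverse.
    top, bottom = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and C[i][j] == C[i - 1][j - 1] + s:
            top.append(seq1[i - 1])      # diagonal: match or mismatch
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and C[i][j] == C[i - 1][j] + gap:
            top.append(seq1[i - 1])      # up: letter of seq1 against a gap
            bottom.append("-")
            i -= 1
        else:
            top.append("-")              # left: letter of seq2 against a gap
            bottom.append(seq2[j - 1])
            j -= 1
    return C[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, aligned1, aligned2 = needleman_wunsch("ATCG", "TCG")
    print(score)
    print(aligned1)
    print(aligned2)
```

For the example pair this yields three matches and one gap, for a total score of 2.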

  48. Similarity Searching • DotPlot • Needleman-Wunsch • Smith-Waterman • FASTA • BLAST

  49. Local Alignment • Problem first formulated: • Smith and Waterman (1981) • Problem: • Find an optimal alignment between a substring of s and a substring of t • Algorithm: • is a variant of the basic algorithm for global alignment
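
As a rough illustration of how the local variant differs from the global algorithm: cell scores are never allowed to drop below 0, the best score may occur anywhere in the matrix, and trace back stops at the first cell with score 0. The sketch below assumes the same simple scoring scheme as the global example (match +1, mismatch -1, linear gap -1), which is an assumption rather than something stated on this slide; the name `smith_waterman` is hypothetical.

```python
def smith_waterman(s: str, t: str, match: int = 1,
                   mismatch: int = -1, gap: int = -1) -> tuple[int, str, str]:
    """Return (score, aligned substring of s, aligned substring of t)."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("AAATCGAAA", "GGTCGGG"))
```

In the usage example the two made-up sequences share only the short stretch TCG, which is what the local alignment picks out while ignoring the unrelated flanks.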

  50. Motivation • Searching for unknown domains or motifs within proteins from different families • Proteins encoded by homeobox genes (conserved in only one region, the homeodomain, about 60 amino acids long) • Identifying active sites of enzymes • Comparing long stretches of anonymous DNA • Querying databases where the query is much shorter than the sequences in the database • Analyzing repeated elements within a single sequence
