
Design and creation of multiple sequence alignments Unit 13

Design and creation of multiple sequence alignments Unit 13. BIOL221T : Advanced Bioinformatics for Biotechnology. Irene Gabashvili, PhD. Dot Plot (Matrix) for Sequence comparison. Reminders from Previous Lectures. DOTPLOTS. DOROTHYCROWFOOTHODGKIN. DOROTHYHODGKIN.



Presentation Transcript


  1. Design and creation of multiple sequence alignments, Unit 13. BIOL221T: Advanced Bioinformatics for Biotechnology. Irene Gabashvili, PhD

  2. Dot Plot (Matrix) for Sequence comparison Reminders from Previous Lectures

  3. DOTPLOTS • DOROTHYCROWFOOTHODGKIN DOROTHYHODGKIN
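As a rough illustration of how a dot matrix is built from the two strings on this slide, the sketch below marks every cell where the letters are identical. The function name `dot_matrix` and the `*`/`.` symbols are choices made for this example, not part of the lecture.

```python
def dot_matrix(seq_x, seq_y):
    """Return text rows; row i marks the columns j where seq_y[i] == seq_x[j]."""
    return ["".join("*" if a == b else "." for b in seq_x) for a in seq_y]

long_name = "DOROTHYCROWFOOTHODGKIN"
short_name = "DOROTHYHODGKIN"
rows = dot_matrix(long_name, short_name)

# Print the matrix with the longer sequence along the top.
print("  " + long_name)
for letter, row in zip(short_name, rows):
    print(letter + " " + row)
```

The two diagonal runs of `*` (DOROTHY and HODGKIN) are offset from each other by the length of the inserted CROWFOOT, which is how a gap appears in a dot plot.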

  4. Dot Matrix: Self Comparison

  5. Dot Matrix: Self Comparison Identity diagonal

  6. Dot Matrix: Self Comparison

  7. Dot Matrix: Self Comparison Direct Repeat Identity diagonal

  8. Dot Matrix: Point Mutation

  9. Dot Matrix: Point Mutation Point mutation Main diagonal

  10. Dot Matrix: Gap

  11. Dot Matrix: Gap Deletion/Insertion Main diagonal

  12. Dot Matrix: Rearrangement

  13. Dot Matrix: Rearrangement Main diagonal

  14. Dot Plot Analysis • Advantages • Simple and fast. • Can detect DNA rearrangement • Disadvantages • No numerical values produced • Subjective interpretation

  15. Problems of Sequence Alignment • How to score? Match, Mismatch and Gap • Example: +1 for each match, 0 for mismatch and -2 for each internal gap (gap penalty), 0 for terminal gap (similarity score).

  16. Computational measures • Distance measure • 0 for a match • 1 for a mismatch or gap • Lowest best • Another measure • 2 for a match • -1 for a mismatch, -2 for a gap • highest best
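The two measures on this slide can be compared on one gapped alignment. This is a minimal sketch; the aligned rows `GATTACA` / `GA-TCCA` are a made-up example and the function names are assumptions for illustration.

```python
def distance_score(row_a, row_b):
    """Distance measure: 0 per match, 1 per mismatch or gap (lowest is best)."""
    return sum(0 if a == b and a != "-" else 1 for a, b in zip(row_a, row_b))

def similarity_score(row_a, row_b, match=2, mismatch=-1, gap=-2):
    """Similarity measure: 2 per match, -1 per mismatch, -2 per gap (highest is best)."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

# Hypothetical aligned rows: 5 matches, 1 mismatch, 1 gap.
row_a = "GATTACA"
row_b = "GA-TCCA"
print(distance_score(row_a, row_b), similarity_score(row_a, row_b))
```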

  17. Gap Penalties • Linear score f(g) = -gd • Affine score f(g) = -d - (g-1)e • d = gap open penalty, e = gap extend penalty, g = gap length • Example gap penalty values used: d = 500, e = 50
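The difference between the two penalty functions is easiest to see numerically. The sketch below implements both formulas from this slide with its example values (d = 500, e = 50); the function names are illustrative.

```python
def linear_gap(g, d):
    """Linear score f(g) = -g*d for a gap of length g."""
    return -g * d

def affine_gap(g, d, e):
    """Affine score f(g) = -d - (g-1)*e: opening costs d, each extension costs e."""
    return -d - (g - 1) * e

for g in (1, 2, 3, 10):
    print(g, linear_gap(g, d=500), affine_gap(g, d=500, e=50))
```

With these values a gap of length 3 costs 1500 under the linear model but only 600 under the affine model, so the affine model favors one long gap over several short ones.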

  18. Example from Lab-Feb20: -1 for each terminal gap, -2 for each internal gap (gap penalty); Blosum(A,A) = 4; Blosum(A,P) = -1; Blosum(A,W) = -3; Blosum(P,P) = 7; Blosum(P,W) = -4 • AWAP over -APP: -1-3-1+7 = 2 (one terminal gap, 2 mismatches) • AWAP- over --APP: -3+4+7 = 8 (3 terminal gaps, no mismatches); best if the internal gap penalty is high • AWAP over A-PP: -2+4-1+7 = 8 (one internal gap, 1 mismatch); best if the terminal gap penalty is high
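The arithmetic of this lab example can be checked with a short script. This is a sketch under stated assumptions: the three alignments are read as AWAP over -APP, AWAP- over --APP, and AWAP over A-PP, and a gap counts as terminal when it lies before the first or after the last residue of the gapped row. The names `BLOSUM_SUBSET` and `score_alignment` are made up for the example.

```python
# Only the substitution scores quoted on the slide; keys are alphabetically sorted pairs.
BLOSUM_SUBSET = {("A", "A"): 4, ("A", "P"): -1, ("A", "W"): -3,
                 ("P", "P"): 7, ("P", "W"): -4}

def residue_span(row):
    """Indices of the first and last non-gap characters in an aligned row."""
    positions = [i for i, c in enumerate(row) if c != "-"]
    return positions[0], positions[-1]

def score_alignment(top, bottom, terminal_gap=-1, internal_gap=-2):
    total = 0
    for i, (a, b) in enumerate(zip(top, bottom)):
        if a == "-" or b == "-":
            # The gap is internal only if residues of the gapped row lie on both sides.
            first, last = residue_span(top if a == "-" else bottom)
            total += internal_gap if first < i < last else terminal_gap
        else:
            total += BLOSUM_SUBSET[tuple(sorted((a, b)))]
    return total

for top, bottom in [("AWAP", "-APP"), ("AWAP-", "--APP"), ("AWAP", "A-PP")]:
    print(top, bottom, score_alignment(top, bottom))
```

Raising either penalty through the `terminal_gap` or `internal_gap` argument breaks the tie between the second and third alignments, which is the point the slide makes.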

  19. How to find the alignment with the best score?

  20. Finding the alignment with the best score • Brute-force approach = calculating the scores of all possible alignments and selecting the best ones. • For two 1000-bp DNA sequences, the number of possible alignments is about 10^600, so the brute-force approach is impossible.
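One way to arrive at a number of this size is to count the ways of interleaving two sequences of length n, which is the binomial coefficient C(2n, n). Whether the slide's figure was derived this way is an assumption; the sketch only shows that this count has the stated order of magnitude.

```python
import math

n = 1000  # length of each sequence
interleavings = math.comb(2 * n, n)  # C(2000, 1000), computed exactly

# The number of decimal digits minus one is the power of ten.
exponent = len(str(interleavings)) - 1
print(f"C(2000, 1000) is on the order of 10^{exponent}")
```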

  21. Dynamic programming Methods • Find the best alignment without calculating all possible alignments. • The method is EXACT. • Original method by Needleman & Wunsch performs global alignment. • Modification by Smith & Waterman performs local alignment.

  22. Needleman&Wunsch Methods (match=1, mismatch=0, gap=-2)
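A minimal sketch of global alignment by dynamic programming, using the scores in this slide's title (match = 1, mismatch = 0, gap = -2). It assumes every gap, including terminal ones, costs the same; the test sequences are made up for illustration.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=0, gap=-2):
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # align two residues
                              score[i - 1][j] + gap,            # gap in seq_b
                              score[i][j - 1] + gap)            # gap in seq_a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(seq_a[i - 1]); bottom.append(seq_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq_a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(seq_b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

best_score, row_top, row_bottom = needleman_wunsch("GATTACA", "GATCA")
print(best_score)
print(row_top)
print(row_bottom)
```

The table has (m+1) x (n+1) cells, so the work grows with the product of the sequence lengths rather than with the number of possible alignments.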

  23. Local Alignment with Smith-Waterman Algorithm • Adding one modification: any negative score is changed to 0. That is, an alignment is not extended unless the score is positive.

  24. Smith-Waterman Methods (match=1, mismatch=0, gap=-2)

  25. Smith-Waterman Methods (match=1, mismatch=0, gap=-2)
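A minimal sketch of the local variant with the same scores (match = 1, mismatch = 0, gap = -2). Compared with global alignment, scores are floored at 0 and the traceback starts at the highest-scoring cell and stops at the first 0. The test sequences are invented for the example.

```python
def smith_waterman(seq_a, seq_b, match=1, mismatch=0, gap=-2):
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]  # borders stay 0: no penalty for starting late
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,  # the Smith-Waterman modification
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(seq_a[i - 1]); bottom.append(seq_b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(seq_a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(seq_b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

local_score, local_top, local_bottom = smith_waterman("TTTTACGTTTT", "GGGACGGGG")
print(local_score, local_top, local_bottom)
```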

  26. Scoring schemes • Dynamic programming guarantees an optimal result for any given scoring scheme, but the biological basis of scoring schemes is weak, except for the fact that insertions/deletions are rarer than substitutions and are scored accordingly.

  27. Match-Mismatch score • DNA • Transitions are more frequent than transversions (e.g., for M. tuberculosis SNPs ~ 2:1) and can be scored accordingly. • In practice, base transitions and transversions are usually scored equally. • Proteins • Substitution matrix such as PAM or BLOSUM

  28. Transitions & Transversions • Transition: A nucleotide substitution from one purine to another purine (eg, A->G), or from one pyrimidine to another pyrimidine (eg, T->C). • Transversion: A nucleotide substitution from a purine to a pyrimidine (eg, A->C), or vice versa (eg, T->G).
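The two definitions reduce to one test: a substitution is a transition when both bases belong to the same chemical class. A small sketch, with a function name chosen for this example:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Classify a single-base substitution as a transition or a transversion."""
    if ref == alt:
        raise ValueError("not a substitution: both bases are identical")
    pair = {ref, alt}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"

for ref, alt in [("A", "G"), ("T", "C"), ("A", "C"), ("T", "G")]:
    print(f"{ref}->{alt}: {classify_substitution(ref, alt)}")
```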

  29. Transitions & Transversions • Purines • Pyrimidines

  30. Gap penalty • Linear model = ak • Affine model = a0 + a1k, where a0 = gap opening penalty, a1k = gap extension penalty, and a1 < a0 • More biologically realistic models need gap penalty functions whose cost per added residue decreases with gap length, such as a0 + a1 log k. Computational complexity prohibits their common use.

  31. More advanced scoring systems • Position-dependent scores: use a different matrix (and penalty) at different positions in proteins. The functional importance of protein regions affects divergence. • Structure-dependent scores.

  32. Software providing ALIGNMENT tools • MATLAB: Bioinformatics toolbox [GlobalScore, GlobalAlignment] = nwalign(humanProtein,... mouseProtein) … swalign showalignment(GlobalAlignment) • ORACLE 10g BLAST functions: blastn, blastp, blastx, etc

  33. Types of Algorithms • Heuristic: A heuristic is an algorithm that will yield reasonable results, even if it is not provably optimal or lacks even a performance guarantee. In most cases, heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs. • Dynamic Programming: An algorithm for finding optimal alignments given an additive alignment score. These types of algorithms are guaranteed to find the optimal-scoring alignment or set of alignments. • HMM: Based on probability theory; very versatile.

  34. http://www.soe.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html

  35. Hidden Markov Model (HMM)

  36. Markov chain • Chain of events in which the probability of each event depends only on the preceding event. • Assumption: DNA can be viewed as a Markov chain. The probability of A, T, G, or C appearing at each position depends on the kind of nucleotide at the preceding position.

  37. Markov chain is defined by • P(A|A) = probability of a base being A if the preceding base is A. • P(T|G) = probability of a base being T if the preceding base is G. • And so on. So a DNA Markov chain is defined by 16 probabilities.
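The 16 probabilities can be estimated from a sequence by counting adjacent base pairs. The sketch below assumes simple frequency counting with no pseudocounts; the example sequence is the one shown later in the Hidden Markov Model slide, and the function name is illustrative.

```python
from collections import Counter

def transition_probabilities(dna):
    """Estimate P(next | previous) for all 16 ordered base pairs by counting."""
    dna = dna.upper()
    pair_counts = Counter(zip(dna, dna[1:]))  # (previous, next) for each adjacent pair
    probs = {}
    for prev in "ACGT":
        total = sum(pair_counts[(prev, nxt)] for nxt in "ACGT")
        for nxt in "ACGT":
            # A base that never appears as a predecessor gets probability 0.0 here.
            probs[(prev, nxt)] = pair_counts[(prev, nxt)] / total if total else 0.0
    return probs

probs = transition_probabilities("attactggcggccgcgtcgatctg")
for prev in "ACGT":
    print(prev, [round(probs[(prev, nxt)], 2) for nxt in "ACGT"])
```

Each printed row lists P(A|prev), P(C|prev), P(G|prev), P(T|prev) and sums to 1 whenever `prev` occurs before another base.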

  38. Markov Chain Model of DNA. Each arrow is defined by a transition probability. [Figure: states G, A, T, C connected by arrows]

  39. Hidden Markov Model • Hidden: State path, e.g., NNNNNNNNCCCCCCCCCCCNNNNN • Not hidden: DNA sequence, e.g., attactggcggccgcgtcgatctg • The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.

  40. Algorithm to find Most Probable State Path (Decoding) • If parameters are known, • Viterbi algorithm. • Posterior decoding
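A minimal sketch of Viterbi decoding for a two-state model with states N and C, as in the state path shown earlier. All start, transition, and emission probabilities below are invented for illustration (C is assumed to be GC-rich, N AT-rich); they are not parameters from the lecture.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path, working in log space to avoid underflow."""
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor state for ending in state s at position t.
            prev = max(states, key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][observations[t]]))
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path))

# Hypothetical parameters, chosen only to make the example run.
states = "NC"
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {"N": {"a": 0.35, "t": 0.35, "g": 0.15, "c": 0.15},
          "C": {"a": 0.15, "t": 0.15, "g": 0.35, "c": 0.35}}

sequence = "attactggcggccgcgtcgatctg"
print(sequence)
print(viterbi(sequence, states, start_p, trans_p, emit_p))
```

The decoded path depends entirely on the chosen parameters, which is why the parameter estimation step matters.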

  41. Estimation of parameters • Usually a “training set” of sequences is required. • The “training set” may be • Sequences of known state • Sequences of unknown state. Parameters are set arbitrarily and re-estimated iteratively until the changes are minimal.

  42. HMM for identifying coding DNA sequences [Figure: two states, Coding (exon) and Non-Coding (intron), each with G, A, T, C]

  43. Hidden Markov Model for Coding Sequence predictions • Hidden: State path (I = intron, X = exon), e.g., IIIIIIIIXXXXXXXXXXXXIIIIIIIIIIIIIIIIIIIIIIIIXXXXXXXXXXXX • Not hidden: DNA sequence, e.g., attactggcggccgcgtcgatctgggtcttaggtadtgtacggcccctcgtaggca • The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.

  44. Training Sets for HMM coding sequence prediction • The best ones come from experimental work • The best ones come from the same species

  45. HMM for Spliced Alignment (between genomic and EST sequences) [Figure: Paired (exon) state with G/G, A/A, T/T, C/C; Unpaired (intron) state with G, A, T, C]

  46. Selection of Alignment Programs • Global vs Local • Pairwise (1-1), database searching (1-many), module searching (1-1 many loci), multiple • Distance between query and database • Number of queries, size of databases • Exact vs Heuristic

  47. Multiple sequence alignment • Multiple sequence alignment • Dynamic programming: restricted to 3-4 sequences at most. • Progressive sequence alignment: ClustalW, X. • Divide and conquer methodology • HMM • Others • Constructing common patterns • Consensus: TATAAT • Weight matrix • Input (from training set) for HMM methods • Input for PSI-BLAST

  48. Multiple Sequence Alignments: Creation and Analysis (Chapter 12, B&O: Protein Alignment) • What is a Multiple Alignment? • Structural or Evolutionary? (the two do not necessarily correspond; not really possible) • How to multiply align? • How to generate alignments? • Tools

  49. Significance of an Alignment Score • Statistical methods are used to evaluate the significance of an alignment score: Z-score, P-value and E-value • Z-score = (score - mean)/std. dev. Measures how unusual the original match is. Z ≥ 5 is significant. • P-value measures the probability that the alignment is no better than random (Z and P depend on the distribution of the scores). P ≤ 10^-100: exact match. • E-value is the expected number of sequences that give the same Z-score or better (E = P × size of the database). E ≤ 0.02: sequences probably homologous.
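The Z-score and E-value formulas on this slide can be written out directly. In this sketch the background scores (for example, from aligning shuffled sequences), the P-value, and the database size are all hypothetical numbers; the population standard deviation is used, which is an assumption since the slide does not say which form is meant.

```python
from statistics import mean, pstdev

def z_score(score, background_scores):
    """Z = (score - mean) / std. dev. of the background score distribution."""
    return (score - mean(background_scores)) / pstdev(background_scores)

def e_value(p_value, database_size):
    """E = P x size of the database."""
    return p_value * database_size

# Hypothetical background scores and observed score.
background = [10, 12, 14, 16, 18]
z = z_score(28, background)
e = e_value(1e-6, 10_000)
print(f"Z = {z:.2f}, significant by the Z >= 5 rule: {z >= 5}")
print(f"E = {e:.3g}, probably homologous by the E <= 0.02 rule: {e <= 0.02}")
```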

  50. Aligning more than 2 sequences • Sequences should not be very different in length • Sequences should be edited down to the regions that are most similar (PSI-BLAST does it automatically, but not all tools do) • Random alignment of pairs of sequences helps in assessing similarities
