Design and creation of multiple sequence alignmentsUnit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD
Dot Plot (Matrix) for Sequence comparison Reminders from Previous Lectures
DOTPLOTS • DOROTHYCROWFOOTHODGKIN DOROTHYHODGKIN
Dot Matrix: Self Comparison Identity diagonal
Dot Matrix: Self Comparison Direct Repeat Identity diagonal
Dot Matrix: Point Mutation Point mutation Main diagonal
Dot Matrix: Gap Deletion/Insertion Main diagonal
Dot Matrix: Rearrangement Main diagonal
Dot Plot Analysis • Advantages • Simple and fast. • Can detect DNA rearrangement • Disadvantages • No numerical values produced • Subjective interpretation
Problems of Sequence Alignment • How to score? Match, Mismatch and Gap • Example: +1 for each match, 0 for mismatch and -2 for each internal gap (gap penalty), 0 for terminal gap (similarity score).
Computational measures • Distance measure • 0 for a match • 1 for a mismatch or gap • Lowest best • Another measure • 2 for a match • -1 for a mismatch, -2 for a gap • highest best
Gap Penalties • Gap penalties • Linear score f(g) = - gd • Affine score f(g) = - d – (g-1) e • d = gap open penalty e = gap extend penalty • g = gap length • Example Gap penalty values used: • d = 500 • e = 50
Example from Lab-Feb20: -1 for terminal gap, -2 for for each internal gap (gap penalty)Blosum(A,A) = 4; Blosum(A,P) = -1; Blosum(A,W) = -3; Blosum(P,P) = 7; Blosum(P,W) = -4 • AWAP-1-3-1+7=2 (one terminal gap, 2 mismatches) - APP • AWAP - -3+4+7=8 (3 terminal gaps, no mismatches) - -APP best if gap penalty (inside) is high • AWAP -2+4-1+7=8 (one internal gap, 1 mismatch) A - PP best if terminal gap is high
Finding alignment with best score • Brute force approach= calculating scores of all possible alignment and select the best ones. • For two 1000-bp DNA sequence, the number of possible alignment is 10600. Brute force approach is impossible.
Dynamic programming Methods • Finding the best alignment without calculating all possible alignment. • The method is EXACT. • Original method by Needleman&Wunsch performs global alignment. • Modification by Smith&Waterman performs local alignment.
Local Alignment with Smith-Waterman Algorithm • Adding one modification: Any negative score are changed to 0. That is alignment will not be done unless the score is positive
Scoring schemes Although dynamic programming guarantee correct results for each scoring scheme. The biological basis of scoring scheme is weak, except for the fact that insertion/deletion is rarer than substitutions and scored accordingly
Match-Mismatch score • DNA • Transition is more frequent than transversion (e.g., for M. tuberculosis SNP ~ 2:1)and can be scored accordingly. • In practice base transition and transversion are usually scored equally. • Proteins • Substitution matrix such as PAM or BLOSUM
Transitions & Transversions • Transition: A nucleotide substitution from one purine to another purine (eg, A->G), or from one pyrimidine to another pyrimidine (eg, T->C). • Transversion: A nucleotide substitution from a purine to a pyrimidine (eg, A->C), or vice versa (eg, T->G).
Transitions & Transversions • Purines • Pyrimidines
Gap penalty • Linear model = ak • Affine model = a0+ a1k, a0= gap opening penallty, a1k= gap extension penalty. a1<a0 • More biologically realistic modelsneed exponentially decrease gap penalty functions such as a0+ a1Logk. Computational complexity prohibits its common use.
More advance scoring system • Position dependent scores, use different matrix (and penalty) at different position in proteins. Functional importance of protein regions affect divergence • Structure dependent scores.
Software providing ALIGNMENT tools • MATLAB: Bioinformatics toolbox [GlobalScore, GlobalAlignment] = nwalign(humanProtein,... mouseProtein) … swalign showalignment(GlobalAlignment) • ORACLE 10g BLAST functions: blastn, blastp, blastx, etc
Types of Algorithms • Heuristic A heuristic is an algorithm that will yield reasonable results, even if it is not provably optimal or lacks even a performance guarantee. In most cases, heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs. • Dynamic Programming The algorithm for finding optimal alignments given an additive alignment score dynamically These type of algorithms are guaranteed to find the optimal scoring alignment or set of alignments. • HMM - Based on Probability Theory – very versatile.
Markov chain • Chain of events, in which the probability of each event depends only on apreceding event. • Assumption: DNA can be viewed as a Markov chain. Probability of A, T, G, or C appearing in each position depend on kind of nucleotide in the preceding position.
Markov chain is defined by • P(A|A) = probability of a base being A if the preceding base is A. • P(T|G) = probability of a base being T if the preceding base is G. • And so on.So a DNA Markov chain is defined by 16 probabilities.
Markov Chain Model of DNA. Each arrow is defined by a transition probability. G A T C
Hidden Markov Model • Hidden: State path e.g.,NNNNNNNNCCCCCCCCCCCNNNNN • Not hidden: DNA sequence e.g.,attactggcggccgcgtcgatctg • The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.
Algorithm to find Most Probable State Path (Decoding) • If parameters are known, • Viterbi algorithm. • Posterior decoding
Estimation of parameters • Usually a “training set” of sequences are required. • The “training set” may be • Sequences of known state • Sequences of unknown state. Parameters are arbitrarily set and reiterated until state changes are minimal.
G A T C HMM for identifying coding DNA Sequences G A T C Coding (exon) Non-Coding (intron)
Hidden Markov Model for Coding Sequence predictions • Hidden: State path(I=intron, X=exon) e.g.,IIIIIIIIXXXXXXXXXXXXIIIIIIIIIIIIIIIIIIIIIIIIXXXXXXXXXXXX • Not hidden: DNA sequence e.g.,attactggcggccgcgtcgatctgggtcttaggtadtgtacggcccctcgtaggca • The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.
Training Sets for HMM coding sequences prediction • Best come from experimental works • Best come from the same species
G/G A/A T/T C/C HMM for Spliced Alignment (between genomic and EST sequences) G A T C Paired (exon) Unpaired (intron)
Selections of Alignment Programs • Global vs Local • Pairwise (1-1), database searching (1-many), module searching (1-1 many loci), mulitiple • Distance between query and database • Number of query, size of databases • Exact vs Heuristic
Multiple sequence alignment • Multiple sequence alignment • Dynamic programming: restricted to 3-4 sequences at most. • Progressive sequence alignment: ClustalW, X. • Divide and conquer methodology • HMM • Others • Constructing common patterns • Consensus: TATAAT • Weight matrix • Input (from training set) for HMM methods • Input for PSI-BLAST
Multiple Sequence Alignments: Creation and Analysis Chapter 12, B&O – Protein Alignment • What is a Multiple Alignment? • Structural or Evolutionary? (not necessarily correspond, not really possible) • How to multiply align? • How to generate alignments? • Tools
Significance of an Alignment Score • Statistical methods used to evaluate the significance of an alignment score • Z-score, P-value and E-value • Significance of Score • Z- score = (score – mean)/std. dev • Measures how unusual our original match is. Z 5 are significant. • P- value measures probability that the alignment is no better than random. (Z and P depends on the distribution of the scores) • P 10-100 exact match. • E- value is the expected number of sequences that give the same Z- score or better. (E = P x size of the database) • E 0.02 sequences probably homologous
Aligning more than 2 sequences Sequences should not be very different in length Should be edited down to regions that are most similar (PSI-BLAST does it automatically, but not all tools do) Random alignment of pairs of sequences helps assessing similarities