Design and creation of multiple sequence alignments Unit 13

Design and creation of multiple sequence alignmentsUnit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD

Dot Plot (Matrix) for Sequence comparison Reminders from Previous Lectures

DOTPLOTS • DOROTHYCROWFOOTHODGKIN DOROTHYHODGKIN

Dot Matrix: Self Comparison

Dot Matrix: Self Comparison Identity diagonal

Dot Matrix: Self Comparison

Dot Matrix: Self Comparison Direct Repeat Identity diagonal

Dot Matrix: Point Mutation

Dot Matrix: Point Mutation Point mutation Main diagonal

Dot Matrix: Gap

Dot Matrix: Gap Deletion/Insertion Main diagonal

Dot Matrix: Rearrangement

Dot Matrix: Rearrangement Main diagonal

Dot Plot Analysis • Advantages • Simple and fast. • Can detect DNA rearrangement • Disadvantages • No numerical values produced • Subjective interpretation

Problems of Sequence Alignment • How to score? Match, Mismatch and Gap • Example: +1 for each match, 0 for mismatch and -2 for each internal gap (gap penalty), 0 for terminal gap (similarity score).

Computational measures • Distance measure • 0 for a match • 1 for a mismatch or gap • Lowest best • Another measure • 2 for a match • -1 for a mismatch, -2 for a gap • highest best

Gap Penalties • Gap penalties • Linear score f(g) = - gd • Affine score f(g) = - d – (g-1) e • d = gap open penalty e = gap extend penalty • g = gap length • Example Gap penalty values used: • d = 500 • e = 50

Example from Lab-Feb20: -1 for terminal gap, -2 for for each internal gap (gap penalty)Blosum(A,A) = 4; Blosum(A,P) = -1; Blosum(A,W) = -3; Blosum(P,P) = 7; Blosum(P,W) = -4 • AWAP-1-3-1+7=2 (one terminal gap, 2 mismatches) - APP • AWAP - -3+4+7=8 (3 terminal gaps, no mismatches) - -APP best if gap penalty (inside) is high • AWAP -2+4-1+7=8 (one internal gap, 1 mismatch) A - PP best if terminal gap is high

How to find the alignment with the best score?

Finding alignment with best score • Brute force approach= calculating scores of all possible alignment and select the best ones. • For two 1000-bp DNA sequence, the number of possible alignment is 10600. Brute force approach is impossible.

Dynamic programming Methods • Finding the best alignment without calculating all possible alignment. • The method is EXACT. • Original method by Needleman&Wunsch performs global alignment. • Modification by Smith&Waterman performs local alignment.

Needleman&Wunsch Methods (match=1, mismatch=0, gap=-2)

Local Alignment with Smith-Waterman Algorithm • Adding one modification: Any negative score are changed to 0. That is alignment will not be done unless the score is positive

Smith-Waterman Methods (match=1, mismatch=0, gap=-2)

Scoring schemes Although dynamic programming guarantee correct results for each scoring scheme. The biological basis of scoring scheme is weak, except for the fact that insertion/deletion is rarer than substitutions and scored accordingly

Match-Mismatch score • DNA • Transition is more frequent than transversion (e.g., for M. tuberculosis SNP ~ 2:1)and can be scored accordingly. • In practice base transition and transversion are usually scored equally. • Proteins • Substitution matrix such as PAM or BLOSUM

Transitions & Transversions • Transition: A nucleotide substitution from one purine to another purine (eg, A->G), or from one pyrimidine to another pyrimidine (eg, T->C). • Transversion: A nucleotide substitution from a purine to a pyrimidine (eg, A->C), or vice versa (eg, T->G).

Transitions & Transversions • Purines • Pyrimidines

Gap penalty • Linear model = ak • Affine model = a0+ a1k, a0= gap opening penallty, a1k= gap extension penalty. a1<a0 • More biologically realistic modelsneed exponentially decrease gap penalty functions such as a0+ a1Logk. Computational complexity prohibits its common use.

More advance scoring system • Position dependent scores, use different matrix (and penalty) at different position in proteins. Functional importance of protein regions affect divergence • Structure dependent scores.

Software providing ALIGNMENT tools • MATLAB: Bioinformatics toolbox [GlobalScore, GlobalAlignment] = nwalign(humanProtein,... mouseProtein) … swalign showalignment(GlobalAlignment) • ORACLE 10g BLAST functions: blastn, blastp, blastx, etc

Types of Algorithms • Heuristic A heuristic is an algorithm that will yield reasonable results, even if it is not provably optimal or lacks even a performance guarantee. In most cases, heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs. • Dynamic Programming The algorithm for finding optimal alignments given an additive alignment score dynamically These type of algorithms are guaranteed to find the optimal scoring alignment or set of alignments. • HMM - Based on Probability Theory – very versatile.

http://www.soe.ucsc.edu/research/compbio/HMM-apps/HMM-applications.htmlhttp://www.soe.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html

Hidden Markov Model (HMM)

Markov chain • Chain of events, in which the probability of each event depends only on apreceding event. • Assumption: DNA can be viewed as a Markov chain. Probability of A, T, G, or C appearing in each position depend on kind of nucleotide in the preceding position.

Markov chain is defined by • P(A|A) = probability of a base being A if the preceding base is A. • P(T|G) = probability of a base being T if the preceding base is G. • And so on.So a DNA Markov chain is defined by 16 probabilities.

Markov Chain Model of DNA. Each arrow is defined by a transition probability. G A T C

Hidden Markov Model • Hidden: State path e.g.,NNNNNNNNCCCCCCCCCCCNNNNN • Not hidden: DNA sequence e.g.,attactggcggccgcgtcgatctg • The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.

Algorithm to find Most Probable State Path (Decoding) • If parameters are known, • Viterbi algorithm. • Posterior decoding

Estimation of parameters • Usually a “training set” of sequences are required. • The “training set” may be • Sequences of known state • Sequences of unknown state. Parameters are arbitrarily set and reiterated until state changes are minimal.

G A T C HMM for identifying coding DNA Sequences G A T C Coding (exon) Non-Coding (intron)

Hidden Markov Model for Coding Sequence predictions • Hidden: State path(I=intron, X=exon) e.g.,IIIIIIIIXXXXXXXXXXXXIIIIIIIIIIIIIIIIIIIIIIIIXXXXXXXXXXXX • Not hidden: DNA sequence e.g.,attactggcggccgcgtcgatctgggtcttaggtadtgtacggcccctcgtaggca • The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.

Training Sets for HMM coding sequences prediction • Best come from experimental works • Best come from the same species

G/G A/A T/T C/C HMM for Spliced Alignment (between genomic and EST sequences) G A T C Paired (exon) Unpaired (intron)

Selections of Alignment Programs • Global vs Local • Pairwise (1-1), database searching (1-many), module searching (1-1 many loci), mulitiple • Distance between query and database • Number of query, size of databases • Exact vs Heuristic

Multiple sequence alignment • Multiple sequence alignment • Dynamic programming: restricted to 3-4 sequences at most. • Progressive sequence alignment: ClustalW, X. • Divide and conquer methodology • HMM • Others • Constructing common patterns • Consensus: TATAAT • Weight matrix • Input (from training set) for HMM methods • Input for PSI-BLAST

Multiple Sequence Alignments: Creation and Analysis Chapter 12, B&O – Protein Alignment • What is a Multiple Alignment? • Structural or Evolutionary? (not necessarily correspond, not really possible) • How to multiply align? • How to generate alignments? • Tools

Significance of an Alignment Score • Statistical methods used to evaluate the significance of an alignment score • Z-score, P-value and E-value • Significance of Score • Z- score = (score – mean)/std. dev • Measures how unusual our original match is. Z  5 are significant. • P- value measures probability that the alignment is no better than random. (Z and P depends on the distribution of the scores) • P  10-100 exact match. • E- value is the expected number of sequences that give the same Z- score or better. (E = P x size of the database) • E  0.02 sequences probably homologous

Aligning more than 2 sequences Sequences should not be very different in length Should be edited down to regions that are most similar (PSI-BLAST does it automatically, but not all tools do) Random alignment of pairs of sequences helps assessing similarities

Design and creation of multiple sequence alignments Unit 13