Create Presentation
Download Presentation

Download Presentation

Design and creation of multiple sequence alignments Unit 13

Download Presentation
## Design and creation of multiple sequence alignments Unit 13

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Design and creation of multiple sequence alignmentsUnit 13**BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD**Dot Plot (Matrix) for Sequence comparison**Reminders from Previous Lectures**DOTPLOTS**• DOROTHYCROWFOOTHODGKIN DOROTHYHODGKIN**Dot Matrix: Self Comparison**Identity diagonal**Dot Matrix: Self Comparison**Direct Repeat Identity diagonal**Dot Matrix: Point Mutation**Point mutation Main diagonal**Dot Matrix: Gap**Deletion/Insertion Main diagonal**Dot Matrix: Rearrangement**Main diagonal**Dot Plot Analysis**• Advantages • Simple and fast. • Can detect DNA rearrangement • Disadvantages • No numerical values produced • Subjective interpretation**Problems of Sequence Alignment**• How to score? Match, Mismatch and Gap • Example: +1 for each match, 0 for mismatch and -2 for each internal gap (gap penalty), 0 for terminal gap (similarity score).**Computational measures**• Distance measure • 0 for a match • 1 for a mismatch or gap • Lowest best • Another measure • 2 for a match • -1 for a mismatch, -2 for a gap • highest best**Gap Penalties**• Gap penalties • Linear score f(g) = - gd • Affine score f(g) = - d – (g-1) e • d = gap open penalty e = gap extend penalty • g = gap length • Example Gap penalty values used: • d = 500 • e = 50**Example from Lab-Feb20:**-1 for terminal gap, -2 for for each internal gap (gap penalty)Blosum(A,A) = 4; Blosum(A,P) = -1; Blosum(A,W) = -3; Blosum(P,P) = 7; Blosum(P,W) = -4 • AWAP-1-3-1+7=2 (one terminal gap, 2 mismatches) - APP • AWAP - -3+4+7=8 (3 terminal gaps, no mismatches) - -APP best if gap penalty (inside) is high • AWAP -2+4-1+7=8 (one internal gap, 1 mismatch) A - PP best if terminal gap is high**Finding alignment with best score**• Brute force approach= calculating scores of all possible alignment and select the best ones. • For two 1000-bp DNA sequence, the number of possible alignment is 10600. Brute force approach is impossible.**Dynamic programming Methods**• Finding the best alignment without calculating all possible alignment. • The method is EXACT. • Original method by Needleman&Wunsch performs global alignment. • Modification by Smith&Waterman performs local alignment.**Local Alignment with Smith-Waterman Algorithm**• Adding one modification: Any negative score are changed to 0. That is alignment will not be done unless the score is positive**Scoring schemes**Although dynamic programming guarantee correct results for each scoring scheme. The biological basis of scoring scheme is weak, except for the fact that insertion/deletion is rarer than substitutions and scored accordingly**Match-Mismatch score**• DNA • Transition is more frequent than transversion (e.g., for M. tuberculosis SNP ~ 2:1)and can be scored accordingly. • In practice base transition and transversion are usually scored equally. • Proteins • Substitution matrix such as PAM or BLOSUM**Transitions & Transversions**• Transition: A nucleotide substitution from one purine to another purine (eg, A->G), or from one pyrimidine to another pyrimidine (eg, T->C). • Transversion: A nucleotide substitution from a purine to a pyrimidine (eg, A->C), or vice versa (eg, T->G).**Transitions & Transversions**• Purines • Pyrimidines**Gap penalty**• Linear model = ak • Affine model = a0+ a1k, a0= gap opening penallty, a1k= gap extension penalty. a1<a0 • More biologically realistic modelsneed exponentially decrease gap penalty functions such as a0+ a1Logk. Computational complexity prohibits its common use.**More advance scoring system**• Position dependent scores, use different matrix (and penalty) at different position in proteins. Functional importance of protein regions affect divergence • Structure dependent scores.**Software providing ALIGNMENT tools**• MATLAB: Bioinformatics toolbox [GlobalScore, GlobalAlignment] = nwalign(humanProtein,... mouseProtein) … swalign showalignment(GlobalAlignment) • ORACLE 10g BLAST functions: blastn, blastp, blastx, etc**Types of Algorithms**• Heuristic A heuristic is an algorithm that will yield reasonable results, even if it is not provably optimal or lacks even a performance guarantee. In most cases, heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs. • Dynamic Programming The algorithm for finding optimal alignments given an additive alignment score dynamically These type of algorithms are guaranteed to find the optimal scoring alignment or set of alignments. • HMM - Based on Probability Theory – very versatile.**http://www.soe.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html**http://www.soe.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html**Markov chain**• Chain of events, in which the probability of each event depends only on apreceding event. • Assumption: DNA can be viewed as a Markov chain. Probability of A, T, G, or C appearing in each position depend on kind of nucleotide in the preceding position.**Markov chain is defined by**• P(A|A) = probability of a base being A if the preceding base is A. • P(T|G) = probability of a base being T if the preceding base is G. • And so on.So a DNA Markov chain is defined by 16 probabilities.**Markov Chain Model of DNA. Each arrow is defined by a**transition probability. G A T C**Hidden Markov Model**• Hidden: State path e.g.,NNNNNNNNCCCCCCCCCCCNNNNN • Not hidden: DNA sequence e.g.,attactggcggccgcgtcgatctg • The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.**Algorithm to find Most Probable State Path (Decoding)**• If parameters are known, • Viterbi algorithm. • Posterior decoding**Estimation of parameters**• Usually a “training set” of sequences are required. • The “training set” may be • Sequences of known state • Sequences of unknown state. Parameters are arbitrarily set and reiterated until state changes are minimal.**G**A T C HMM for identifying coding DNA Sequences G A T C Coding (exon) Non-Coding (intron)**Hidden Markov Model for Coding Sequence predictions**• Hidden: State path(I=intron, X=exon) e.g.,IIIIIIIIXXXXXXXXXXXXIIIIIIIIIIIIIIIIIIIIIIIIXXXXXXXXXXXX • Not hidden: DNA sequence e.g.,attactggcggccgcgtcgatctgggtcttaggtadtgtacggcccctcgtaggca • The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.**Training Sets for HMM coding sequences prediction**• Best come from experimental works • Best come from the same species**G/G**A/A T/T C/C HMM for Spliced Alignment (between genomic and EST sequences) G A T C Paired (exon) Unpaired (intron)**Selections of Alignment Programs**• Global vs Local • Pairwise (1-1), database searching (1-many), module searching (1-1 many loci), mulitiple • Distance between query and database • Number of query, size of databases • Exact vs Heuristic**Multiple sequence alignment**• Multiple sequence alignment • Dynamic programming: restricted to 3-4 sequences at most. • Progressive sequence alignment: ClustalW, X. • Divide and conquer methodology • HMM • Others • Constructing common patterns • Consensus: TATAAT • Weight matrix • Input (from training set) for HMM methods • Input for PSI-BLAST**Multiple Sequence Alignments: Creation and Analysis**Chapter 12, B&O – Protein Alignment • What is a Multiple Alignment? • Structural or Evolutionary? (not necessarily correspond, not really possible) • How to multiply align? • How to generate alignments? • Tools**Significance of an Alignment Score**• Statistical methods used to evaluate the significance of an alignment score • Z-score, P-value and E-value • Significance of Score • Z- score = (score – mean)/std. dev • Measures how unusual our original match is. Z 5 are significant. • P- value measures probability that the alignment is no better than random. (Z and P depends on the distribution of the scores) • P 10-100 exact match. • E- value is the expected number of sequences that give the same Z- score or better. (E = P x size of the database) • E 0.02 sequences probably homologous**Aligning more than 2 sequences**Sequences should not be very different in length Should be edited down to regions that are most similar (PSI-BLAST does it automatically, but not all tools do) Random alignment of pairs of sequences helps assessing similarities