http://creativecommons.org/licenses/by-sa/2.0/. Multiple Alignments & Molecular Evolution. Prof:Rui Alves ralves@cmb.udl.es 973702406 Dept Ciencies Mediques Basiques, 1st Floor, Room 1.08 Website of the Course: http://web.udl.es/usuaris/pg193845/Courses/Bioinformatics_2007/


  1. http://creativecommons.org/licenses/by-sa/2.0/

  2. Multiple Alignments & Molecular Evolution Prof:Rui Alves ralves@cmb.udl.es 973702406 Dept Ciencies Mediques Basiques, 1st Floor, Room 1.08 Website of the Course:http://web.udl.es/usuaris/pg193845/Courses/Bioinformatics_2007/ Course: http://10.100.14.36/Student_Server/

  3. Part I: Multiple Alignments

  4. Pairwise Alignment • We have seen how pairwise alignments are made. • Dynamic programming is an efficient algorithm for finding the optimal alignment: • Break the problem into smaller subproblems • Solve the subproblems optimally, recursively • Use the optimal solutions to construct an optimal solution for the original problem • Alignments require a substitution (scoring) matrix and a scheme for penalizing gaps.
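The three dynamic programming steps above can be sketched as a global (Needleman-Wunsch style) alignment score. This is a minimal illustration, not the course's implementation: the function name and the toy scores (match=1, mismatch=-1, gap=-2) are assumptions chosen for readability, standing in for a real substitution matrix and gap scheme.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (toy scoring, linear gaps)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j] (a subproblem)
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

Each cell reuses the optimal answers of three smaller subproblems, which is what keeps the search at roughly len(a) * len(b) steps instead of enumerating every alignment.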

  5. Sub. Matrix: Basic idea • Probability of substitution (mutation)

  6. PAM Matrices • A family of matrices (PAM-N) • Based upon an evolutionary model • The score for a substitution of nucleotides/amino acids is based on how much we expect that substitution to be observed after a certain length of evolutionary time • The scores are derived by a Markov model – i.e., the probability that one amino acid will change to another is not affected by changes that occurred at an earlier stage of evolutionary history

  7. Nucleic acid PAM matrices • PAM = point accepted mutation • 1 PAM = 1% probability of mutation at each sequence position. • A uniform PAM1 matrix for a family of closely related proteins:

  8. How did they get the values for PAM-1? • Look at 71 groups of protein sequences where the proteins in each group are at least 85% similar (Why these groups?) • Compute relative mutability of each amino acid – probability of change • From relative mutability, compute mutability probability for each amino acid pair X,Y– probability that X will change to Y over a certain evolutionary time • Normalize the mutability probability for each pair to a value between 0 and 1

  9. Transitions and transversions • Transitions (A ↔ G or C ↔ T) are more likely than transversions (A ↔ T or G ↔ C) • Assume that transitions are three times as likely:
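A nucleotide PAM1 matrix under this assumption can be sketched as follows. The sketch assumes that each base has a 1% total chance of changing, and that "three times as likely" means the single transition is three times as likely as each of the two transversions; the names `nucleotide_pam1`, `total_change` and `ratio` are illustrative only.

```python
BASES = "ACGT"
PURINES = {"A", "G"}  # C and T are pyrimidines


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return x != y and ((x in PURINES) == (y in PURINES))


def nucleotide_pam1(total_change=0.01, ratio=3.0):
    """Return {(from, to): probability} with transitions `ratio` times likelier."""
    # One transition partner and two transversion partners per base:
    # ratio * t + 2 * t = total_change
    transversion = total_change / (ratio + 2)
    transition = ratio * transversion
    matrix = {}
    for x in BASES:
        for y in BASES:
            if x == y:
                matrix[(x, y)] = 1 - total_change
            elif is_transition(x, y):
                matrix[(x, y)] = transition
            else:
                matrix[(x, y)] = transversion
    return matrix


pam1 = nucleotide_pam1()
for x in BASES:
    print(x, [round(pam1[(x, y)], 3) for y in BASES])
```

With these assumptions each row still sums to 1, so the matrix remains a valid table of substitution probabilities.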

  10. PAM-N Matrices • N is a measure of evolutionary distance • PAM-1 is modeled on an estimate of how long in evolutionary time it would take one amino acid out of 100 to change. That length of time is called 1 PAM unit, roughly 10 million years (abbreviated my). • Values in a PAM-1 matrix show the probability that an amino acid will change over 10 my. • To get the PAM-N matrix for any N, multiply PAM-(N-1) by PAM-1.
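The rule "multiply PAM-(N-1) by PAM-1" is repeated matrix multiplication. The sketch below applies it to a uniform 4x4 nucleotide PAM1 matrix (0.99 on the diagonal, the remaining 1% split equally), which is an assumed toy input; `mat_mul` and `pam_n` are hypothetical helper names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """PAM-N = PAM-(N-1) x PAM-1, starting from PAM-1 (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Assumed toy input: uniform PAM1, every substitution equally likely.
UNIFORM_PAM1 = [[0.99 if i == j else 0.01 / 3 for j in range(4)]
                for i in range(4)]

pam2 = pam_n(UNIFORM_PAM1, 2)
print([round(v, 5) for v in pam2[0]])
```

The diagonal of PAM-2 is slightly above 0.99 squared because a position can change and then change back, which is why N PAM units correspond to fewer than N% observed differences.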

  11. Distant relatives • If a family of proteins is, say, 80% homologous, use a PAM 2.

  12. Computing Relative Mutability – A Measure of the Likelihood that an Amino Acid Will Mutate For each amino acid: • changes = number of times the amino acid changed into something else • exposure to mutation = (percentage occurrence of the amino acid in the group of sequences being analyzed) * (frequency of amino acid changes in the group) • relative mutability = (changes / exposure to mutation) / 100

  13. Computing Relative Mutability of A: • changes = number of times A changes into something else = 4 • % occurrence of A in group = 10 / 63 = 0.159 • frequency of all amino acid changes in group = 6 * 2 = 12 (Note: Count changes backwards and forwards.) • exposure to mutation = (% occurrence of A in group) * (frequency of all amino acid changes in group) = 12 * 0.159 • relative mutability = (changes / exposure to mutation) / 100 = (4 / (12 * 0.159)) / 100 = 2.09 / 100 = 0.0209 • Example from Fundamental Concepts of Bioinformatics by Krane and Raymer.
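The same arithmetic as a small function, using the counts from the worked example (4 changes of A, 10 occurrences of A among 63 residues, 12 counted changes). The function and parameter names are illustrative. Note that the unrounded calculation gives about 0.0210; the 0.0209 on the slide comes from rounding 10/63 to 0.159 first.

```python
def relative_mutability(changes, occurrences, total_residues, total_changes):
    """(changes / exposure to mutation) / 100, as defined on the slides."""
    frequency = occurrences / total_residues   # fraction of residues that are X
    exposure = frequency * total_changes       # exposure to mutation
    return (changes / exposure) / 100


r_a = relative_mutability(changes=4, occurrences=10,
                          total_residues=63, total_changes=12)
print(round(r_a, 4))
```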

  14. How can we understand relative mutability intuitively? • relative mutability = changes / exposure to mutation = the number of times A changed in proportion to the probability that it COULD have changed • Exposure to mutation: there were 6 times when something changed in the tree. Each time, that change could have been A changing to something else, or something else changing to A, giving 12 chances for a change involving A. But A appears in a sequence only 0.159 of the time.

  15. Computing Mutability Probability Between Amino Acid Pairs For each pair of amino acids X and Y: r = relative mutability of X c = num times X becomes Y or vice versa p = num changes involving X mutability probability of X to Y = (r * c) / p

  16. Computing Mutability Probability that A will change to G: r = relative mutability of A = .0209 c = num times A becomes G or vice versa = 3 p = num changes involving A = 4 mutability probability of A to G = (r * c) / p = (0.0209 * 3) / 4 = 0.0156
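A sketch of the (r * c) / p calculation that tallies c and p from a list of observed changes. The list below is hypothetical: it was made up only so that the counts match the example (6 changes in total, 4 involving A, 3 of them between A and G). Evaluating (0.0209 * 3) / 4 directly gives 0.015675, which the slide reports as 0.0156.

```python
def mutability_probability(r, changes, x, y):
    """(r * c) / p, with c and p counted from (from, to) change pairs."""
    p = sum(1 for pair in changes if x in pair)          # changes involving X
    c = sum(1 for pair in changes if set(pair) == {x, y})  # X<->Y changes
    return (r * c) / p


# Hypothetical tally of changes read off a tree (illustrative data only).
observed_changes = [("A", "G"), ("G", "A"), ("A", "G"),
                    ("A", "D"), ("C", "S"), ("L", "I")]

p_a_to_g = mutability_probability(0.0209, observed_changes, "A", "G")
print(round(p_a_to_g, 6))
```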

  17. Normalizing Mutability Probability, X to Y • For each Y among all amino acids, compute mutability probability of X to Y as described above • Get a total of these 20 probabilities. Divide them by a normalizing factor such that the probability that X will NOT change is 99% and the sum of probabilities that it will change to any other amino acid is 1% • These are the numbers that go in the PAM-1 matrix!
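One way to read the normalization step is as a rescaling of the raw X-to-Y values so that they sum to 1%, with 99% left on the diagonal. The sketch below makes that assumption explicit; the raw numbers for D are invented, and only two target amino acids are shown instead of all 19.

```python
def normalize_row(raw, x, total_change=0.01):
    """Scale raw X->Y values to sum to total_change; X->X gets the rest."""
    off_diagonal = {y: p for y, p in raw.items() if y != x}
    scale = total_change / sum(off_diagonal.values())   # normalizing factor
    row = {y: p * scale for y, p in off_diagonal.items()}
    row[x] = 1 - total_change                           # probability of no change
    return row


# Hypothetical raw mutability probabilities for A (illustrative values).
raw_a = {"G": 0.0157, "D": 0.0052}
pam1_row_a = normalize_row(raw_a, "A")
print({k: round(v, 5) for k, v in pam1_row_a.items()})
```

The rescaling keeps the relative likelihood of each substitution intact while fixing the overall rate at one accepted change per 100 positions.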

  18. Converting Mutability Probabilities to Log Odds Score for X to Y • Compute the relative frequency of change for X to Y as follows: • Get the X to Y mutability probability • Divide by the % frequency of X in the sequence data • Convert to log base 10, multiply by 10 • In our example, we get log10(0.0156/0.1587) = log10(0.098) • To compute log10(0.098), solve for x: 10^x = 0.098, so x = -1.01, since 10^(-1.01) = 1/10^(1.01) = 0.098 • Compute log odds score for Y to X • Take the average of these two values
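The log odds steps, written out with the numbers from the example. The function names are illustrative, and the Y-to-X value passed to the averaging step is a made-up placeholder, since the slides do not give it.

```python
import math


def log_odds(prob_x_to_y, freq_x, scale=10):
    """scale * log10(mutability probability / frequency of X)."""
    return scale * math.log10(prob_x_to_y / freq_x)


def symmetric_score(score_xy, score_yx):
    """Average the X->Y and Y->X scores into a single matrix entry."""
    return (score_xy + score_yx) / 2


score_a_to_g = log_odds(0.0156, 0.1587)
print(round(score_a_to_g, 2))

# Hypothetical G->A score, used only to show the averaging step.
print(round(symmetric_score(score_a_to_g, -9.0), 2))
```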

  19. Usefulness of Log Odds Scores • A score of 0 indicates that the change from one amino acid to another occurs as often as expected by chance • A negative score means that the change occurs less often than expected by chance • A positive score means that the change occurs more often than expected by chance • Because the scores are in log form, they can be added (i.e., the chance that X will change to Y and then Y to Z)

  20. Disadvantages of PAM Matrices • An alignment tree must be constructed first, implying some circularity in the analysis • The original PAM-1 matrix was based on a limited number of families, not necessarily representative of all protein families • The Markov model does not take into account that multi-step mutations should be treated differently from single-step ones

  21. Most Commonly-Used Amino Acid Substitution Matrices • PAM (Percent Accepted Mutation, also called Dayhoff Amino Acid Substitution Matrix) • BLOSUM (BLOcks amino acid SUbstitution Matrix)

  22. BLOSUM Scoring Matrices • Based on a larger set of protein families than PAM (about 500 families). The proteins in the families are known to be biochemically related. • Focuses on blocks of conserved amino acid patterns in these families • Designed to find conserved domains in protein families • BLOSUM matrices with lower numbers are more useful for scoring matches in pairs that are expected to be less closely related through evolution – e.g., BLOSUM50 is used for more distantly-related proteins than BLOSUM62. (This is the opposite of the PAM matrices.)

  23. BLOSUM Matrices • Target frequencies are identified directly and not by extrapolation • Sequences more than x% identical are collapsed into a single sequence • BLOSUM 50: >=50% Identity • BLOSUM 62: >=62% Identity

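  24. Building a BLOSUM Matrix • BLOSUM 62: • Collapse sequences that have more than 62% identity into one • Calculate the probability of a given pair of AAs being in the same column (qij) • Calculate the frequency of a given AA (fi) • Calculate the log odds ratio sij = log2(qij / eij), where eij is the frequency of the pair expected by chance from the individual frequencies (fi * fi for identical residues, 2 * fi * fj otherwise). This is the value that goes into the BLOSUM matrix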
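A toy version of these steps, assuming the sequences in the block have already been collapsed by identity. The sketch compares each observed pair frequency with the frequency expected by chance from the individual residue frequencies (f_i squared for identical residues, 2 * f_i * f_j otherwise), which is an assumption about how the expectation is formed; it also omits the scaling and rounding applied to published matrices. The block itself is invented.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_log_odds(block):
    """Log2 odds scores from a list of equal-length aligned sequences."""
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):       # every pair in the column
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    f = Counter()                                  # residue frequencies
    for (a, b), q_ab in q.items():
        if a == b:
            f[a] += q_ab
        else:
            f[a] += q_ab / 2
            f[b] += q_ab / 2

    scores = {}
    for (a, b), q_ab in q.items():
        expected = f[a] * f[b] if a == b else 2 * f[a] * f[b]
        scores[(a, b)] = math.log2(q_ab / expected)
    return scores


toy_block = ["AS", "AS", "AS", "SS"]               # invented example block
scores = blosum_log_odds(toy_block)
print({pair: round(s, 2) for pair, s in sorted(scores.items())})
```

In this toy block the conserved pairs A/A and S/S come out positive and the A/S substitution negative, which is the pattern the log odds ratio is meant to capture.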

  25. BLOSUM50

  26. What matrix to choose? • BLOSUM Matrices perform better in local similarity searches • BLOSUM 62 is the default matrix used for database searching

  27. Gap Penalty (Gap Scoring)

  28. Gap Penalties • Gaps in the alignment are necessary to increase the score. • They must be penalized; however, if the penalty is too high, no gaps will appear • On the other hand, if it is too low, gaps everywhere!!! • The default settings of programs are usually OK for their default scoring matrices

  29. Once a gap, can we widen it?
      >gi|729942|sp|P40601|LIP1_PHOLU Lipase 1 precursor (Triacylglycerol lipase) Length = 645
      Score = 33.5 bits (75), Expect = 5.9
      Identities = 32/180 (17%), Positives = 70/180 (38%), Gaps = 9/180 (5%)
      Query: 2038 IYSLYGLYNVPYENLFVEAIASYSDNKIRSKSRRVIATTLETVGYQTANGKYKSESYTGQ 2097
                  +++ YGL+ Y+ ++ Y D K +R ++ + N + G+
      Sbjct: 441  VFTAYGLWRY-YDKGWISGDLHYLDMKYEDITRGIVLNDW----LRKENASTSGHQWGGR 495
      Query: 2098 LMAGYTYMMPENINLTPLAGLRYSTIKDKGYKETGTTYQNLTVKGKNYNTFDGLLGAKVS 2157
                  + AG+ + + +P+ + KGY+E+G + + Y++ G LG ++
      Sbjct: 496  ITAGWDIPLTSAVTTSPIIQYAWDKSYVKGYRESGNNSTAMHFGEQRYDSQVGTLGWRLD 555
      Query: 2158 SNINVNEIVLTPELYAMVDYAFKNKVSAIDARLQGMTAPLPTNSFKQSKTSFDVGVGVTA 2217
                  +N P ++ F +K I + + + S KQ + +G+ A
      Sbjct: 556  TNFG----YFNPYAEVRFNHQFGDKRYQIRSAINSTQTSFVSESQKQDTHWREYTIGMNA 611
      Real gaps are often more than one letter long.

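  30. Affine gap penalty • Example: LETVGY aligned with W----L, where the four gap positions cost -5 -1 -1 -1 (one gap opening followed by three extensions) • Separate penalties for gap opening and gap extension. • This requires modifying the DP algorithm to store three values in each box.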

  31. Scoring Gap Penalties • Linear Gap Penalty Score • Affine Penalty Score • Opening a gap is costly; extending it not so much (open=12; extension=1)
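The two schemes can be compared with a short sketch. It assumes the convention suggested by the -5 -1 -1 -1 example on the affine gap slide: the first gap position pays the opening cost and each further position pays the extension cost. Other programs count the opening cost in addition to an extension for the first position, so treat the exact formula as an assumption.

```python
def linear_gap_penalty(length, per_position=12):
    """Every gap position costs the same."""
    return per_position * length


def affine_gap_penalty(length, gap_open=12, gap_extend=1):
    """First gap position costs gap_open, each further one gap_extend."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for length in (1, 4, 10):
    print(length, linear_gap_penalty(length), affine_gap_penalty(length))
```

With open=12 and extension=1, a single 10-residue gap costs 21 under the affine scheme, far less than ten separate one-residue gaps (120), which favors a few long gaps over many short ones.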

  32. Multiple Sequence Alignment

  33. MSA Introduction • Goal of protein sequence alignment: • To discover “biological” (structural / functional) similarities • If sequence similarity is weak, pairwise alignment can fail to identify important features (e.g., interaction residues) • Simultaneous comparison of many sequences often finds similarities that are invisible in pairwise alignment (PA).

  34. Why do we care about sequence alignment? • Identify regions of a gene (or protein) susceptible to mutation and regions where residue replacement does not change function. • Information about the evolution of organisms. • Orthologs are genes that are evolutionarily related, have a similar function, but now appear in different species. • Homologous genes (genes that share an evolutionary origin) have similar sequences. • Paralogs are evolutionarily related (share an origin) but no longer have the same function. • You can uncover either orthologs or paralogs through sequence alignment.

  35. Multiple Sequence Alignment • Often applied to proteins (not very good with DNA) • Proteins that are similar in sequence are often similar in structure and function • Sequence changes more rapidly in evolution than does structure and function.

  36. Work with proteins, if at all possible! • Twenty match symbols versus four, plus similarity! Way better signal to noise. • Also guarantees no indels are placed within codons. So translate, then align. • Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.

  37. Overview of Methods • Dynamic programming – too computationally expensive to do a complete search; uses heuristics • Progressive – starts with pair-wise alignment of most similar sequences; adds to that (LOCAL OPTIMIZATION) • Iterative – make an initial alignment of groups of sequences, adds to these (e.g. genetic algorithms) (GLOBAL OPTIMIZATION) • Locally conserved patterns • Statistical and probabilistic methods

  38. Dynamic Programming • Computational complexity – even worse than for pair-wise alignment because we’re finding all the paths through an n-dimensional hyperspace (Remember the matrix; now add many dimensions) • Can align fewer than 20 relatively short (200-300 residues) protein sequences in a reasonable amount of time; not much beyond that
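A back-of-the-envelope sketch of why the n-dimensional search grows so quickly. It assumes the straightforward extension of the pairwise matrix: one axis per sequence, and each cell reachable from every combination of "advance or gap" across the sequences except the all-gap one. The function names are illustrative.

```python
def dp_cells(seq_length, n_sequences):
    """Cells in the n-dimensional DP lattice for sequences of equal length."""
    return (seq_length + 1) ** n_sequences


def predecessors_per_cell(n_sequences):
    """Neighbouring cells examined to fill one interior cell."""
    return 2 ** n_sequences - 1


for n in (2, 3, 5):
    print(n, dp_cells(300, n), predecessors_per_cell(n))
```

Going from 2 to 5 sequences of 300 residues takes the lattice from about 9 x 10^4 cells to more than 10^12, which is why exhaustive DP needs the search-space reductions described next.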

  39. A Heuristic for Reducing the Search Space in Dynamic Programming • Consider the pair-wise alignments of each pair of sequences. • Create alignments from these scores. • Consider a multiple sequence alignment built from the individual pairwise alignments. • These alignments circumscribe a space in which to search for a good (but not necessarily optimal) alignment of all n sequences.

  40. The details • Create an “alignment of alignments” (AOA) based on pair-wise alignments (Pairs of sequences that have the best scores are paired first in the tree.) • Do a “first-cut” msa by incrementally doing pair-wise alignments in the order of “alikeness” of sequences as indicated by the AOA. Most alike sequences aligned first. • Use the pair-wise alignments and the “first-cut” msa to circumscribe a space within which to do a full msa that searches through this solution space. • The score for a given alignment of all the sequences is the sum of the scores for each pair, where each of the pair-wise scores is multiplied by a weight є indicating how far the pair-wise score differs from the first-cut msa alignment score.
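The weighted sum over all pairs described in the last bullet can be sketched as a sum-of-pairs score. The per-column scores (match=1, mismatch=-1, gap=-2) are toy values standing in for a substitution matrix, the weights are supplied by the caller rather than derived from a first-cut MSA, and all names are illustrative.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two rows of an MSA column by column (toy scoring)."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue                    # gap against gap is ignored
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(msa, weights=None):
    """Weighted sum of the pairwise scores implied by the MSA rows."""
    total = 0.0
    for i, j in combinations(range(len(msa)), 2):
        weight = 1.0 if weights is None else weights[(i, j)]
        total += weight * pair_score(msa[i], msa[j])
    return total


msa = ["AC-T", "ACGT", "A-GT"]
print(sum_of_pairs(msa))
print(sum_of_pairs(msa, weights={(0, 1): 2.0, (0, 2): 1.0, (1, 2): 1.0}))
```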

  41. Heuristic Dynamic Programming Method for MSA • Does not guarantee an optimal alignment of all the sequences in the group. • Does get an optimal alignment within the space chosen.

  42. Progressive Methods • Similar to dynamic programming method in that it uses the first step (i.e., it creates an AOA, aligns the most-alike pair, and incrementally adds sequences to the alignment.) • Differs from dynamic programming method for MSA in that it doesn’t refine the “first-cut” MSA by doing a full search through the reduced search space. (This is the computationally expensive part of DP MSA.)

  43. Progressive Method: the details • Generally proceeds as follows: • Choose a starting pair of sequences and align them • Align each next sequence to those already aligned, one at a time • Heuristic method – doesn’t guarantee an optimal alignment • Details vary in implementation: • How to choose the first sequence to align? • Align all subsequent sequences cumulatively or in subfamilies? • How to score?
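One simple answer to "how to choose the first sequence" is to order sequences by pairwise distance. The sketch below starts from the closest pair and then repeatedly adds the remaining sequence closest to anything already included. This is only one possible ordering rule, not the procedure of any particular program, and the distance matrix is invented for illustration.

```python
def progressive_order(distances, names):
    """Order in which sequences would join a progressive alignment."""
    n = len(names)
    # Start with the closest pair of sequences.
    first, second = min(
        ((a, b) for a in range(n) for b in range(a + 1, n)),
        key=lambda pair: distances[pair[0]][pair[1]],
    )
    order = [first, second]
    remaining = set(range(n)) - {first, second}
    while remaining:
        # Add the sequence nearest to any sequence already in the alignment.
        nearest = min(sorted(remaining),
                      key=lambda k: min(distances[k][m] for m in order))
        order.append(nearest)
        remaining.remove(nearest)
    return [names[k] for k in order]


# Hypothetical symmetric distance matrix (illustrative values only).
names = ["s1", "s2", "s3", "s4"]
distances = [
    [0.0, 0.6, 0.2, 0.7],
    [0.6, 0.0, 0.5, 0.1],
    [0.2, 0.5, 0.0, 0.4],
    [0.7, 0.1, 0.4, 0.0],
]
print(progressive_order(distances, names))
```

Because each sequence is aligned once to what is already there, an early misplaced gap is carried into every later step, which is the practical meaning of the heuristic not guaranteeing an optimal alignment.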

  44. ClustalW • Based on phylogenetic analysis • An AOA is created using a pairwise distance matrix and the neighbor-joining algorithm • The most closely-related pairs of sequences are aligned using dynamic programming • Each of the alignments is analyzed and a profile of it is created • Alignment profiles are aligned progressively for a total alignment • W in ClustalW refers to a weighting of scores depending on how far a sequence is from the root on the AOA

  45. ClustalW Procedure AOA

  46. “Once a gap, always a gap”

  47. Basic Steps in Progressive Alignment “Once a gap, always a gap”
