
Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx



Presentation Transcript


  1. Multiple Sequence Alignments and Phylogeny. Bioinformatics. Dr. Víctor Treviño, vtrevino@itesm.mx

  2. Sequence Similarity: Reasons to Perform Sequence Similarity Analysis and Searches. Within a protein sequence, some regions are more conserved than others. The more conserved a region is, the more important it is likely to be: • for function • for 3D structure • for localization • for modification • for interaction • for regulation/control • for transcriptional regulation (in DNA)

  3. Sequence Alignment. A procedure for comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for similar patterns that occur in the same order in the sequences. • Identical residues (nt or aa) are placed in the same column • Non-identical residues can be placed in the same column or indicated as gaps. Figure label: overall similarity. Sources: Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press; Wikipedia; http://www-personal.umich.edu/~lpt/fgf/fgfrcomp.htm

  4. Multiple Sequence Analysis: Additional Uses. • Interesting regions • Promoter regions • Consensus sequence for probe design

  5. Multiple Sequence Alignment - MSA

  6. Multiple Sequence Alignment - MSA. Dynamic programming is designed for two sequences • It would take quite a long time for three or more (see the MSA program). Figure axes: Sequence A, Sequence B, Sequence C
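As a rough illustration of why dynamic programming suits two sequences but becomes expensive for more, the sketch below computes a global (Needleman-Wunsch style) alignment score and counts the cells a full dynamic-programming grid would need. The scoring values (match=1, mismatch=-1, gap=-2) and the function names are assumptions chosen for the example, not taken from the slides.

```python
from math import prod


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of two strings (linear gap penalty)."""
    # previous[j] is the best score for aligning the processed prefix of a with b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = previous[j] + gap        # residue of a aligned to a gap
            left = current[j - 1] + gap   # residue of b aligned to a gap
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


def dp_cells(lengths):
    """Number of cells in a full DP grid over sequences of the given lengths."""
    return prod(n + 1 for n in lengths)


print(global_alignment_score("ACGT", "AGT"))
print(dp_cells([100, 100]), dp_cells([100, 100, 100]))
```

The grid grows as the product of the sequence lengths, so each extra sequence multiplies the work by roughly one sequence length.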

  7. Relation between MSA and Evolutionary Tree Reconstruction

  8. Multiple Sequence Alignment: Methods. Extensions of sequence pair alignment • MSA • Progressive Methods • CLUSTALW • Iterative Methods • Hidden Markov Models (HMM)

  9. Multiple Sequence Alignment - MSA. Algorithm: • Calculate all pair-wise alignment scores (alignment costs). • Use the scores (costs) to predict a tree. • Calculate pair weights based on the tree. • Produce a heuristic msa based on the tree. • Calculate the maximum epsilon for each sequence pair. • Determine the spatial positions that must be calculated to obtain the optimal alignment. • Perform the optimal alignment. • Report the epsilon found compared to the maximum epsilon. The epsilon for a given sequence pair is the difference between the score of the alignment of that pair in the msa and the score of the optimal pair-wise alignment. The bigger the value of epsilon, the more the msa diverges from the pair-wise alignment and the smaller the contribution of that alignment to the msa. For example, if an extra copy of one of the sequences is added to the alignment project, then epsilon for sequence pairs that do not include that sequence will increase, indicating a lesser role because the contributions of that pair have been out-voted by the alike sequences.
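The epsilon described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses similarity scores rather than costs, a linear gap penalty, and defines epsilon as the optimal pair-wise score minus the score of the pair as aligned inside the msa, so that it is never negative. The function names and scoring values are hypothetical and do not come from the MSA program.

```python
def induced_pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score the pair-wise alignment that two msa rows induce on each other."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a column that is a gap in both rows does not count
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total


def optimal_pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of two ungapped sequences."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def epsilon(row_a, row_b):
    """Optimal pair-wise score minus the score of the pair inside the msa."""
    best = optimal_pair_score(row_a.replace("-", ""), row_b.replace("-", ""))
    return best - induced_pair_score(row_a, row_b)


# Two rows taken from a hypothetical msa; the gaps were forced by other sequences.
print(epsilon("AC-GT", "A-CGT"))
```

An epsilon of zero means the msa aligns that pair exactly as well as the best pair-wise alignment would.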

  10. Progressive Multiple Sequence Alignment. Dynamic programming is designed for two sequences • It would take quite a long time for three or more (see the MSA program). Therefore: • Align all sequences pair-wise • Determine the "distance" between each pair • Align the two most similar sequences and keep that alignment • Take the next most similar sequence or group and repeat the same steps until all sequences have been included • E.g., for sequences S1 to S5: • (S3+S4)=c1, • (S1+S2)=c2 • (c1+c2)=c3 • (c3+S5)=final
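The merge order in the S1 to S5 example can be reproduced with a small greedy clustering sketch. The distance values below are invented so that the order matches the slide, and average linkage between clusters is an assumption made for simplicity; the slides do not specify how CLUSTALW builds its guide tree.

```python
from itertools import combinations


def merge_order(names, dist):
    """Greedy guide-tree order: repeatedly join the two closest clusters.

    dist maps frozenset({a, b}) to the pair-wise distance of sequences a and b.
    Returns a list of (left, right, new_cluster_name) tuples.
    """
    clusters = {name: [name] for name in names}
    merges = []

    def average_distance(c1, c2):
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    step = 0
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        step += 1
        new_name = f"c{step}"
        clusters[new_name] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, new_name))
    return merges


# Hypothetical distances: S3/S4 closest, then S1/S2, with S5 far from everything.
names = ["S1", "S2", "S3", "S4", "S5"]
dist = {frozenset(p): 8.0 for p in combinations(names, 2)}
for p in [("S1", "S3"), ("S1", "S4"), ("S2", "S3"), ("S2", "S4")]:
    dist[frozenset(p)] = 4.0
dist[frozenset(("S1", "S2"))] = 2.0
dist[frozenset(("S3", "S4"))] = 1.0

order = merge_order(names, dist)
for left, right, new_name in order:
    print(f"({left}+{right})={new_name}")
```

Here the last cluster, c4, plays the role of "final" in the slide. A real progressive aligner would perform a profile alignment at each merge instead of only recording the order.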

  11. Progressive Multiple Sequence Alignment - CLUSTALW. CLUSTALW method (figure labels: "then normalized to largest = 1"; "Alignment score for column")

  12. Progressive Multiple Sequence Alignment - CLUSTALW (figure with steps 1, 2, 3)

  13. Progressive Multiple Sequence Alignment: Problems. • Dependency on the most similar sequences • Errors in the first alignments propagate to the final result (nested problems) when even the most similar sequences are actually quite different • So CLUSTALW works best for closely related sequences • Choice of suitable scoring matrices

  14. Iterative Multiple Sequence Alignment. • Tries to correct for the dependency on the most similar sequences in progressive methods • Repeatedly realigns subgroups, then aligns these into the global alignment • Subgroups are chosen based on tree ordering, separation of sequences, or random group selection

  15. Iterative Multiple Sequence Alignment

  16. HMM Multiple Sequence Alignment

  17. Multiple Sequence Alignment - Programs

  18. Multiple Sequence Alignment - Overview

  19. Phylogeny Analysis and Prediction from DNA/Protein Sequences. • Determination of how the family might have been derived during evolution • Sequences are depicted as branches on a tree • Very similar sequences are located as neighbours on a branch • The goal is to discover all the branching relationships and the branch lengths

  20. Phylogeny Analysis and Prediction from DNA/Protein Sequences. • Phylogenetic relationships among genes can help to predict which ones might have an equivalent function • Phylogenetic analysis may also be used to follow the changes occurring in a rapidly changing species, such as a virus • Important for discovering function, 3D structure, localization, modification, interaction, regulation/control, and transcriptional regulation

  21. Phylogeny Analysis and Prediction from DNA/Protein Sequences. Related to sequence alignment

  22. Sequence Similarity – Evolutionary Relationship

  23. Genome Complexity

  24. Genome Complexity

  25. Evolutionary Tree. • An evolutionary tree is a two-dimensional graph showing evolutionary relationships among organisms • The separate sequences are referred to as taxa (singular: taxon), defined as phylogenetically distinct units on the tree • The tree is composed of outer branches (or leaves) representing the taxa, and nodes and branches representing relationships among the taxa

  26. Evolutionary Tree. • A and B are derived from a common ancestor • Each node in the tree represents a splitting of the evolutionary path of the gene into two different species that are reproductively isolated

  27. Evolutionary Tree. • Beyond the split, any further evolutionary changes in each new branch are independent of those in the other new branch • The length of each branch to the next node represents the number of sequence changes that occurred prior to the next level of separation

  28. Evolutionary Tree. • A uniform mutation rate is the Molecular Clock Hypothesis, suitable for closely related species • Special cases could use non-uniform rates • The root is defined by including a taxon that we are reasonably sure branched off earlier than the others

  29. Evolutionary Tree. • The sum of all the branch lengths in a tree is referred to as the tree length. • The tree is also a bifurcating or binary tree, in that only two branches emanate from each node. • Trees can have more than two branches emanating from a node if the events separating taxa are so close that they cannot be resolved, or to simplify the tree. • The unrooted tree also shows the evolutionary relationships among sequences A–D, but it does not reveal the location of the oldest ancestry.

  30. Evolutionary Tree. The number of possible rooted trees increases very rapidly with the number of sequences or taxa
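To make "very rapidly" concrete, the sketch below counts labeled bifurcating trees using the standard formulas: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n taxa, where !! is the double factorial over odd numbers. The function names are illustrative.

```python
def odd_double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n_taxa):
    """Number of distinct rooted bifurcating trees for n labeled taxa (n >= 2)."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    return odd_double_factorial(2 * n_taxa - 3)


def unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees for n labeled taxa (n >= 3)."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    return odd_double_factorial(2 * n_taxa - 5)


for n in (3, 4, 5, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```

With only 10 taxa there are already more than 34 million rooted trees, which is why exhaustive tree searches are limited to small numbers of sequences.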

  31. Methods to Build Evolutionary Trees. Goal: to find the evolutionary tree or trees that best account for the observed variation in a group of sequences • Maximum Parsimony • Distance • Maximum Likelihood

  32. Method Selection

  33. Considerations. • The alignment should not contain a large number of gaps • Phylogenetic methods analyze conserved regions that are represented in all the sequences (local alignments)

  34. Maximum Parsimony (or Minimum Evolution). • Predicts the evolutionary tree by minimizing the number of steps required to generate the observed sequence changes • Requires a multiple sequence alignment • The method examines each informative position for each possible tree • An informative position has the same residue in at least two sequences but not in all (more precisely, at least two different residues each occur in at least two sequences) • Used for sequences that are quite similar and for a small number of sequences
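A minimal sketch of how informative positions could be picked out of an alignment. It applies the usual parsimony criterion (at least two different residues, each present in at least two sequences); ignoring gap characters as missing data is an assumption of this example, and the function names are illustrative.

```python
from collections import Counter


def is_informative(column):
    """True if at least two different residues each occur in two or more sequences."""
    counts = Counter(residue for residue in column if residue != "-")
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_sites(alignment):
    """Return the 0-based indices of parsimony-informative columns."""
    return [i for i, column in enumerate(zip(*alignment)) if is_informative(column)]


# Toy alignment of four sequences (rows) and four columns.
alignment = [
    "AAGA",
    "AGCA",
    "AGAG",
    "AGAG",
]
print(informative_sites(alignment))
```

Columns where all residues agree, or where every variant occurs only once, cannot favour one tree over another and are skipped.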

  35. Maximum Parsimony (or Minimum Evolution) (figure label: non-informative sites)

  36. Distance Methods. • Employ the number of changes between each pair of sequences • Sequence pairs that have the smallest number of sequence changes are "neighbours" sharing a node in the tree • Closely related to the multiple sequence alignment method (CLUSTALW), which produces distance matrices that are then analysed by distance methods • Remember distance vs. similarity (and gaps)
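As an illustration of the input that distance methods work on, the sketch below builds a matrix of uncorrected distances (the fraction of differing positions) from aligned sequences. Skipping any column where either sequence has a gap is one possible convention, assumed here for simplicity; the function names are illustrative.

```python
def p_distance(a, b):
    """Fraction of compared positions at which two aligned sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Symmetric matrix of pair-wise distances for a list of aligned sequences."""
    n = len(alignment)
    return [[p_distance(alignment[i], alignment[j]) for j in range(n)] for i in range(n)]


alignment = ["ACGT", "ACGA", "TC-A"]
for row in distance_matrix(alignment):
    print(row)
```

Similarity is the complement of this measure (1 minus the distance), which is why the treatment of gaps affects both.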

  37. Distance Methods "Idealized"

  38. Distance Algorithms. • Fitch and Margoliash Method • Neighbor-joining Method • Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
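Of the three algorithms, UPGMA is the simplest to sketch: join the two closest clusters, place their node at half the distance between them, and average the distances to every other cluster weighted by cluster size. The code below is a bare-bones illustration with invented distances, not a replacement for PHYLIP or similar programs. Note that UPGMA assumes a uniform rate of change (a molecular clock).

```python
from itertools import combinations


def upgma(names, pair_dist):
    """Build a UPGMA tree.

    pair_dist maps a pair of names, e.g. ("A", "B"), to their distance.
    Returns (tree, merges): tree is a nested tuple of names, and merges
    lists each new node with its height above the leaves.
    """
    dist = {frozenset(pair): value for pair, value in pair_dist.items()}
    size = {name: 1 for name in names}
    active = list(names)
    merges = []
    while len(active) > 1:
        a, b = min(combinations(active, 2), key=lambda p: dist[frozenset(p)])
        height = dist[frozenset((a, b))] / 2
        new = (a, b)
        for c in active:
            if c != a and c != b:
                # Size-weighted average of the distances from a and b to c
                dist[frozenset((new, c))] = (
                    size[a] * dist[frozenset((a, c))] + size[b] * dist[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        active = [c for c in active if c != a and c != b] + [new]
        merges.append((new, height))
    return active[0], merges


# Hypothetical distances for four taxa.
names = ["A", "B", "C", "D"]
pair_dist = {
    ("A", "B"): 2.0, ("C", "D"): 4.0,
    ("A", "C"): 10.0, ("A", "D"): 10.0, ("B", "C"): 10.0, ("B", "D"): 10.0,
}
tree, merges = upgma(names, pair_dist)
print(tree)
print([height for _, height in merges])
```

Neighbor-joining and Fitch-Margoliash relax the equal-rate assumption, at the cost of a more involved update step.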

  39. Distance Algorithm. Choosing an outgroup improves prediction because the methods are informed about the "order" of the outgroup

  40. Maximum Likelihood. • Uses probability calculations for the number of sequence changes • Analysis is performed for each informative residue (as in maximum parsimony) • All possible trees are considered (so it is suited to a small number of sequences) • Considers variations in mutation rates, so it can be used for more distant sequences • Main disadvantage: computation time

  41. Maximum Likelihood. Needs a model that provides estimates of substitution rates for each residue pair
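As one concrete example of such a model, the sketch below uses the Jukes-Cantor (JC69) DNA model, in which every substitution is equally likely. The slides do not name a specific model, so choosing JC69 is an assumption made purely for illustration. The code gives the probability of observing residue y opposite residue x at evolutionary distance d (expected substitutions per site), the log-likelihood of an aligned pair, and the distance that maximizes it.

```python
import math


def jc69_prob(x, y, d):
    """Probability that base x is replaced by base y after distance d under JC69."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def pair_log_likelihood(a, b, d):
    """Log-likelihood of two aligned, ungapped DNA sequences at distance d > 0."""
    # 0.25 is the equilibrium frequency of each base under JC69
    return sum(math.log(0.25 * jc69_prob(x, y, d)) for x, y in zip(a, b))


def jc69_distance(p):
    """Maximum-likelihood distance for an observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


a = "ACGTACGTAC"
b = "ACGTACGTAT"  # differs from a at one of ten sites
d_hat = jc69_distance(0.1)
print(d_hat, pair_log_likelihood(a, b, d_hat))
```

Full maximum-likelihood tree building applies the same idea to every branch of every candidate tree, which explains the computation time noted above.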

  42. Reliability of Phylogenetic Predictions. • Bootstrap method: randomly resamples the columns of the alignment, with replacement, and rebuilds the tree each time (robustness test) • Good evidence if a branch is conserved in more than 70% of the predictions • Collapse branches and confirm tree length • Compare distinct methods and parameters
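The resampling step of the bootstrap can be sketched as follows. Each replicate draws alignment columns at random with replacement to produce a new alignment of the same length. The `has_grouping` argument is a hypothetical callback standing in for "build a tree from this replicate and check whether the branch of interest is present"; tree building itself is outside this sketch.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one replicate: columns drawn with replacement, same length as the input."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, has_grouping, replicates=100, seed=0):
    """Fraction of replicates in which has_grouping(replicate) returns True."""
    rng = random.Random(seed)
    hits = sum(
        bool(has_grouping(bootstrap_alignment(alignment, rng)))
        for _ in range(replicates)
    )
    return hits / replicates


alignment = ["ACGT", "ACGA", "TCGA"]
replicate = bootstrap_alignment(alignment, random.Random(1))
print(replicate)
```

A support value above 0.7 for a branch would correspond to the 70% rule of thumb mentioned in the slide.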

  43. "Classic" Programs. • PHYLIP http://evolution.genetics.washington.edu/phylip.html • PAUP http://paup.csit.fsu.edu/downl.html • Phylemon http://phylemon.bioinfo.cipf.es/cgi-bin/tools.cgi

  44. Phylemon WEB Service

  45. Programs – Web Services http://bioinformatics.ca/links_directory/index.php?search=phylogeny&submit=Search+Directory

  46. Programs – Web Services http://bioinformatics.ca/links_directory/index.php?search=phylogeny&submit=Search+Directory

  47. Book

  48. Exercise/Homework 12: MSA + Trees. • Select a gene • Get the sequence in at least 7 species • Select a site (Phylemon) • Perform the multiple sequence alignment (ClustalW) • Perform the phylogeny to obtain a tree • Use at least 2 tree methods • Try at least 3 parameter changes • Use DNA or protein sequences • Report results and discussion

  49. Papers to review. • Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis – Löytynoja, Goldman, Science 2008 • Insertions and deletions treated as different events

  50. Papers Pending for This Session
