Msa multiple sequence alignment

MSA- multiple sequence alignment - PowerPoint PPT Presentation



MSA- multiple sequence alignment. Aligning many sequences is often preferable to pairwise comparisons. Problem- Computational complexity of multiple alignments grows rapidly with the number of sequences being aligned.


Presentation Transcript


MSA- multiple sequence alignment

  • Aligning many sequences is often preferable to pairwise comparisons.

  • Problem- Computational complexity of multiple alignments grows rapidly with the number of sequences being aligned.
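A rough sense of how fast the cost grows, as a minimal sketch. It assumes an exact alignment computed by straightforward dynamic programming over every combination of positions, which the slides do not specify; `dp_cells` is a hypothetical helper name.

```python
def dp_cells(num_sequences: int, length: int) -> int:
    """Cells in the full dynamic-programming lattice for an exact alignment.

    Assumes every sequence has the same length; each sequence adds one
    dimension of size (length + 1) to the lattice.
    """
    return (length + 1) ** num_sequences


for n in (2, 3, 5, 10):
    print(n, "sequences of length 100:", dp_cells(n, 100), "cells")
```

Each added sequence multiplies the work by roughly the sequence length, which is why exact methods stop being practical after a handful of sequences.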



“Even using supercomputers or networks of workstations, multiple sequence alignment is an intractable problem for more than 20 or so sequences of average length and complexity.”


As a result, alignment methods using heuristics have been developed. These methods (including ClustalW) cannot guarantee an optimal alignment, but can find near-optimal alignments for larger numbers of sequences.


CLUSTALW

  • Developed in 1988 (as the original Clustal program; the ClustalW version followed in 1994)

  • Begins by aligning closely related sequences and then adds increasingly divergent sequences to produce a complete MSA.


  • http://www.ncbi.nlm.nih.gov/

  • http://www.ebi.ac.uk/clustalw/


Introduction to Molecular Phylogeny*

*Phylogeny- the evolutionary history of a group


Mutations Happen!

3 types possible:

  • Deleterious

  • Advantageous

  • ???


Important Point:

  • Much of the variation that is observed among individuals must have little beneficial or detrimental effect and be essentially selectively neutral.

  • Deleterious mutations are screened out. Advantageous mutations are rare.


Functional Constraints?

  • Portions of genes that are especially important are said to be under functional constraint and tend to accumulate changes very slowly.

  • Ex.: histone proteins- practically every amino acid is important. A yeast histone can replace a human histone.


Relative Rate of Change within the β-globin gene (4 mammals)


Basis of Molecular Phylogenetics

  • The evolution of species can be modeled as a bifurcating process- speciation is initiated when two populations become reproductively isolated.


Basis of Molecular Phylogenetics

  • Once these two populations cease to interbreed, it is inevitable that they diverge due to random mutational processes.


Basis of Molecular Phylogenetics

  • Over time, this branching process may repeat itself.

  • A species is said to be related to some other species with which it shares a direct common ancestor.


Basis of Molecular Phylogenetics

  • The amount of DNA sequence difference between a pair of organisms should indicate how recently those two organisms shared a common ancestor.


Basis of Molecular Phylogenetics

  • The longer two populations remain reproductively isolated, the more DNA divergence will occur.

  • The longer two populations remain reproductively isolated, the more protein divergence will occur.


Molecular Phylogeny is relatively new.

  • Evolution by Natural Selection- Darwin/Wallace 1858

  • Molecular Phylogeny 1960s ??


How it started . . .

  • In 1959, scientists determined the three-dimensional structures of two proteins that are found in almost every animal: hemoglobin and myoglobin.

  • During the next two decades, myoglobin and hemoglobin sequences were determined for dozens of mammals, birds, reptiles, amphibians, fish, etc.


What they found . . .

  • “This tree agreed completely with observations derived from paleontology and anatomy about the common descent of the corresponding organisms.”*

  • *from Science and Creationism: A View from the National Academy of Sciences, 2nd Ed., 1999.


Organisms with high degrees of molecular similarity are expected to be more closely related than those that are dissimilar.


Advantages of Molecular Phylogeny

  • Can be used to decipher relationships between all living things

  • Relying on anatomy can be misleading- Similar traits can evolve in organisms that are not closely related (e.g. convergent evolution led to eyes in vertebrates, insects, and molluscs).


Word of Caution

Phylogenetic analysis is controversial. There is a wide variety of methods for analyzing the data, and even the experts often disagree on which method is best.


Why so controversial??

2 Reasons:


#1 - Molecular vs. Classical

  • How much weight is given to molecular phylogenetic data when it conflicts with the findings of the traditional taxonomist??


The phylogeny of whales:

. . .


Parking lot “A” at 2:00

Parking lot “A” at 4:00

How many cars changed spaces during this 2 hour interval?


Parking lot “A” at 2:00

Parking lot “A” at 4:00

#2- Molecular Phylogeny requires statistical estimations.


Phylogenetic Data Analysis requires 4 steps

  • 1) Alignment

  • 2) Determine the substitution model

  • 3) Tree Building

  • 4) Tree Evaluation


STEP 1- Alignment

  • Molecular phylogenetic analysis is dependent on a good alignment. An evolutionary tree based on an improper alignment is an erroneous tree.


Homology

It is critical to phylogenetic analysis that homologous characters be compared across species.

Webster’s New Collegiate- Fundamental similarity of structure due to descent from a common ancestral form.


Compare homologous genes and homologous characters:

  • For DNA and proteins, this means that gaps must be placed correctly in multiple alignments to ensure that the same position is being compared for each species.


Homologous Genes? When could you accidentally compare nonhomologous genes?

  • Be careful if you are comparing genes that are members of a gene family.

  • Comparing a tubulin-3 from one species with a tubulin-6 from another will not generate accurate results.


What to align?

  • Phylogenetic trees are generated by comparing DNA or protein. The molecule of choice depends on the question you are attempting to answer.


DNA

  • contains more evolutionary information than protein:

  • ATT GCG AAA CAC

  •   *   *   *  *

  • ATA GCC AAG CTC


Protein

(same region analyzed: only 1 difference)

  • Ile-Ala-Lys-His

  • Ile-Ala-Lys-Leu
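The DNA and protein comparison above can be checked with a short sketch. The codon table below is only the slice of the standard genetic code needed for this example, and the helper names are hypothetical.

```python
# Only the codons needed for this example (standard genetic code).
CODON_TABLE = {
    "ATT": "Ile", "ATA": "Ile",
    "GCG": "Ala", "GCC": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "CAC": "His", "CTC": "Leu",
}


def translate(dna: str) -> list:
    """Translate a DNA string (spaces ignored) into three-letter amino acids."""
    dna = dna.replace(" ", "")
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3)]


def count_differences(a, b) -> int:
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


seq1 = "ATT GCG AAA CAC"
seq2 = "ATA GCC AAG CTC"

dna_diffs = count_differences(seq1.replace(" ", ""), seq2.replace(" ", ""))
protein_diffs = count_differences(translate(seq1), translate(seq2))
print("DNA differences:", dna_diffs)          # 4
print("Protein differences:", protein_diffs)  # 1
```

Three of the four base changes are silent (same amino acid), so they are visible only at the DNA level.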


DNA

  • high rate of base substitution makes DNA best for very short-term studies, e.g. of closely related species


*Homoplasy

  • Return of a character to its original state, thus masking intervening mutational events. Every fourth mutation should result in a homoplasy.


Protein

  • more reliable alignment than DNA:

    fewer homoplasies than DNA

  • lower rate of substitution than DNA; better for wide species comparisons


rRNA= ribosomal RNA

  • Best for very long-term evolutionary studies spanning biological kingdoms

  • Selective processes constraining sequence evolution should be roughly the same across species boundaries




Step 3- Tree Building

Tree terminology:

Nodes: branching points

Branches: lines

Topology: branching pattern



Unrooted trees explain phylogenetic relationships; they say nothing about the directions of evolution- the order of descent


There are two main tree drawing methods.

- Character Methods

- Distance Methods

Both approaches are widely used and work well with most data sets.


Distance methods

Distance- a measure of the overall pairwise difference between two data sets.

The raw material for tree reconstruction is tabular summaries of the pairwise differences between all data sets to be analyzed.


In distance methods, the first step is to calculate a matrix of all pairwise differences between a set of sequences.
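A minimal sketch of that first step. The distance used here is the simple proportion of differing sites (p-distance), which is an assumption; the slides do not name a specific distance measure. The toy alignment and function names are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned, ungapped positions at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs: dict) -> dict:
    """Symmetric table of pairwise distances, keyed by sequence name."""
    names = list(seqs)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = p_distance(seqs[a], seqs[b])
    return matrix


# Hypothetical toy alignment (already aligned, equal lengths).
aligned = {
    "A": "ATTGCGAAACAC",
    "B": "ATAGCCAAGCTC",
    "C": "ATTGCGAAGCAC",
}

matrix = distance_matrix(aligned)
for name, row in matrix.items():
    print(name, [round(row[other], 3) for other in aligned])
```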


Distance methods

  • Sequence pairs that have the smallest number of sequence changes between them are identified as ‘neighbors’. On a tree, these sequences share a common ancestor and are joined by a short branch.


UPGMA, pairwise distance and neighbor joining are distance methods.

  • They progressively group sequences, starting with those that are most alike.

  • UPGMA = unweighted-pair-group method with arithmetic mean


Phylogenetic trees based on distance methods.

  • The two sequences that are closest together are connected at a node.

  • The process is repeated until all sequences are joined.

  • Addition of the last sequence defines the root of the tree.
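The joining procedure above can be sketched as UPGMA clustering. This is a minimal illustration under stated assumptions: distances are supplied as a square list of lists, the tree is returned as nested tuples without branch lengths, and the function name and toy matrix are hypothetical.

```python
def upgma(labels, matrix):
    """Cluster by UPGMA; returns the tree as nested tuples of labels."""
    clusters = [(label, 1) for label in labels]  # (subtree, number of leaves)
    d = [row[:] for row in matrix]
    while len(clusters) > 1:
        n = len(clusters)
        # Find the two closest clusters.
        i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: d[ij[0]][ij[1]])
        (ti, ni), (tj, nj) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Distance from the merged cluster to each remaining cluster:
        # the average over all leaf pairs (weighted by cluster size).
        new_row = [(ni * d[i][k] + nj * d[j][k]) / (ni + nj) for k in keep]
        d = [[d[r][c] for c in keep] for r in keep]
        for row, value in zip(d, new_row):
            row.append(value)
        d.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [((ti, tj), ni + nj)]
    return clusters[0][0]


# Hypothetical distances between four sequences A, B, C, D.
labels = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(labels, matrix))  # ('D', ('C', ('A', 'B')))
```

A and B are joined first, C is added next, and D joins last, so the final join places the root between D and everything else.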


The branch lengths may reflect the degree of similarity (and theoretically reflect evolutionary time).

  • Scaled trees- trees in which branch lengths are proportional to the number of sequence differences.

  • In the best of cases, scaled trees are additive (the physical length of branches connecting any two nodes is an accurate representation of their accumulated differences).


Phylogenetic trees based on distance methods.

  • Relatively simple.

  • Problem:

    • May not be accurate!!


Character Methods

“There is no denying that distance-based methods ‘look at the big picture’ and pointedly ignore much potentially valuable information.”


Character Methods

Analyses of individual characters are translated into evolutionary trees.

Character- a well-defined feature that can exist in a limited number of different states. (Ex. DNA and protein sequences)


The concept of parsimony is at the heart of all character-based methods of phylogenetic reconstruction.

  • Parsimony: the process of attaching preference to one evolutionary pathway over another on the basis of which pathway requires the invocation of the smallest number of mutational events.


Character-based methods of phylogenetic reconstruction.

  • “The relationship that requires the fewest number of mutations to explain the current state of affairs is most likely to be correct”




Final step:

After sequences are aligned, algorithms model each tree.


Maximum parsimony is a character method

  • Character methods require a multiple sequence alignment. Analysis of informative ‘characters’ is used to construct an evolutionary tree.


Maximum Parsimony: A general scientific criterion for choosing among competing hypotheses states that we should accept the hypothesis that explains the data most simply and efficiently.

  • The tree requiring the _______ number of nucleic acid or amino acid substitutions is selected.


Maximum Parsimony:

  • The algorithm searches for a tree that requires the smallest number of changes to explain the differences observed among the groups under study.
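One common way to count the changes a given tree requires is Fitch's small-parsimony method. The slides do not name a counting algorithm, so treat this as an illustrative sketch; the four-taxon alignment and the function names are hypothetical.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum changes) for one site.

    tree is a leaf name or a pair (left, right); states maps leaf -> base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is required at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total minimum number of changes over all sites of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical aligned sequences for four taxa.
alignment = {"A": "AGT", "B": "AGC", "C": "GAT", "D": "GAC"}

# The three possible groupings of four taxa.
trees = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
scores = [parsimony_score(t, alignment) for t in trees]
print(scores)  # [4, 5, 6]
```

Under the parsimony criterion the first grouping, which needs only 4 changes, would be selected.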


Character methods are best suited for . . .

  • Sequences that are quite similar.

  • Small number of sequences

    The method is computationally time consuming as all possible trees are examined.
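To see why examining all possible trees limits the number of sequences, the count of distinct unrooted bifurcating trees for n sequences is (2n - 5)!! (the product of the odd numbers up to 2n - 5). A small sketch with a hypothetical function name:

```python
def unrooted_tree_count(n_taxa: int) -> int:
    """Number of distinct unrooted bifurcating trees: (2n - 5)!! for n >= 3."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, "sequences:", unrooted_tree_count(n), "trees")
```

Ten sequences already give about two million trees to score.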


Phylogenetic trees based on maximum likelihood:

The aim is to find the tree (among all possible trees) that has the highest likelihood of producing the observed data (statistical methods).


Phylogenetic trees based on maximum likelihood

These are similar to maximum parsimony methods but also take into account the likelihood of specific mutations (e.g. A → G).


Mutation Rates Vary:

  • Transitions (purine to purine or pyrimidine to pyrimidine) occur more frequently than transversions (purine to pyrimidine or pyrimidine to purine).
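A short sketch of how the two classes of substitution can be told apart and counted between two aligned sequences. The function names are hypothetical, and the sequences are the toy DNA example from the earlier slide, so the counts say nothing about real frequencies.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a: str, b: str) -> str:
    """Label a pair of aligned bases as identical, transition or transversion."""
    if a == b:
        return "identical"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(seq1: str, seq2: str) -> dict:
    """Count transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        kind = classify(a, b)
        if kind != "identical":
            counts[kind] += 1
    return counts


counts = count_substitutions("ATTGCGAAACAC", "ATAGCCAAGCTC")
print(counts)  # {'transition': 1, 'transversion': 3}
```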




Programs take shortcuts.

  • When a large number of trees are being compared, it is impractical to score every tree in full. A shortcut algorithm establishes an upper limit. As it evaluates other trees, it throws out any tree exceeding the upper bound before the calculation is completed.
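The upper-bound idea can be sketched as follows. This is a simplification: real branch-and-bound programs also prune partially built trees, whereas this sketch only abandons the scoring of a complete candidate tree once its running total passes the best score so far. It assumes parsimony scoring with Fitch's method; the names and toy data are hypothetical.

```python
def site_changes(tree, states):
    """Fitch count for one site: (candidate states, minimum changes)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = site_changes(tree[0], states)
    right_set, right_cost = site_changes(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def bounded_score(tree, alignment, upper_bound):
    """Score a tree site by site; return None once the bound is exceeded."""
    total = 0
    length = len(next(iter(alignment.values())))
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += site_changes(tree, column)[1]
        if total > upper_bound:
            return None  # abandoned before the calculation is completed
    return total


def best_tree(trees, alignment):
    """Keep the lowest-scoring tree, using its score as the running bound."""
    best, bound = None, float("inf")
    for tree in trees:
        score = bounded_score(tree, alignment, bound)
        if score is not None and score < bound:
            best, bound = tree, score
    return best, bound


alignment = {"A": "AGT", "B": "AGC", "C": "GAT", "D": "GAC"}
trees = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
print(best_tree(trees, alignment))  # ((('A', 'B'), ('C', 'D')), 4)
```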




Tree Evaluation

Every ‘tree drawing program’ will generate a tree. The important question is whether or not the tree drawn is the right one.

  • In some cases, there are many trees of similar probabilities.


Vertebrate β-globins:


Bootstrap method of assessing tree reliability:

Inferred tree is constructed from data set.

Re-run the calculation on resampled versions of the data (alignment columns drawn at random with replacement).

Resampling is repeated many (100-1000) times.


Bootstrap method

Bootstrap trees are constructed from the resampled data sets.

Each bootstrap tree is compared to the original inferred tree.

The percentage of bootstrap trees supporting a node is determined for each node in the tree.
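A minimal sketch of the resampling loop. To keep it short, a full tree is not rebuilt for each replicate; instead the "grouping" checked is simply which pair of sequences is closest, a stand-in chosen for this illustration. The alignment, seed and function names are hypothetical.

```python
import random


def resample_columns(alignment, rng):
    """Draw alignment columns at random, with replacement, to the same length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Pair of sequences with the fewest differences (ties: first alphabetically)."""
    names = sorted(alignment)

    def diffs(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))

    pairs = ((a, b) for i, a in enumerate(names) for b in names[i + 1:])
    return min(pairs, key=lambda p: diffs(*p))


def bootstrap_support(alignment, pair, replicates=200, seed=0):
    """Percentage of resampled data sets in which `pair` is still the closest."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(resample_columns(alignment, rng)) == pair
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


alignment = {
    "A": "ATTGCGAAACAC",
    "B": "ATTGCGAAGCAC",
    "C": "ATAGCCAAGCTC",
}
original = closest_pair(alignment)
print(original, bootstrap_support(alignment, original))
```

A real analysis would rebuild the whole tree for every replicate and report the percentage for each node.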


Molecular Clock

Addition of time to phylogenetic tree. Units of time are often in millions of years.

Assumption- substitution rates are constant over millions of years.
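Under that constant-rate assumption, a divergence time follows directly from a distance. In this sketch the factor of 2 reflects changes accumulating along both lineages since the split; the rate and distance values are made-up numbers for illustration only.

```python
def divergence_time(distance: float, rate: float) -> float:
    """Years since two lineages split, assuming a constant rate on both.

    distance: substitutions per site between the two sequences
    rate:     substitutions per site per year along a single lineage
    """
    return distance / (2.0 * rate)


# Hypothetical values: 2% sequence divergence, 1e-9 substitutions/site/year.
years = divergence_time(0.02, 1e-9)
print(f"{years / 1e6:.1f} million years")
```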


Molecular Clock

Rates of molecular evolution for genes with similar functional constraints can be quite uniform. (Clock may run at different rates in different proteins.)


The End



  • Evolutionary biology also has benefited greatly from genome-sequencing projects. The wealth of new genome data is helping to better resolve the tree of life, particularly its major branches. This has been especially true for prokaryotes, where more than 80 genomes have been sequenced so far and the results have greatly improved our view of the early history of life.



Phylogenetic trees based on neighbor joining.

  • Also utilizes a ‘distance matrix’

  • Neighbor joining algorithm searches for sets of neighbors that minimize the total length of the tree.

  • Can produce reasonable trees, especially when evolutionary distances are short.
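The neighbor search can be sketched with the standard neighbor-joining selection criterion, Q(i, j) = (n - 2) d(i, j) - sum of row i - sum of row j, where the pair with the smallest Q is joined first. Only this first selection step is shown; the distance values and the function name are hypothetical.

```python
def nj_pair(labels, d):
    """Return (Q value, label_i, label_j) for the pair neighbor joining picks first."""
    n = len(labels)
    totals = [sum(row) for row in d]  # total distance from each sequence to all others
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, labels[i], labels[j])
    return best


# Hypothetical distance matrix for five sequences.
labels = ["a", "b", "c", "d", "e"]
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pair(labels, d))  # (-50, 'a', 'b')
```

Note that d and e have the smallest raw distance (3), yet a and b are chosen: the criterion also accounts for how far each candidate pair is from everything else.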



  • For vertebrates, many thorny issues remain to be resolved, such as the phylogeny of families and other major groups in the tree of life. For example, it is not yet known whether humans are closer to mice or to cattle because different results have been obtained with different gene analyses. On the other hand, there is no guarantee that complete genome sequences will immediately solve all phylogenetic questions, as evidenced by the continuing debate over the relationships among humans, flies, and nematodes. We will need to develop new statistical methods and bioinformatics tools to handle the greater volume of data and to unravel the complexities of molecular evolution.


  • Choice of individual genes or proteins.


Distance matrices:

  • Scoring matrices include values for all possible substitutions. Each mismatch between two sequences adds to the distance, and each identity subtracts from the distance.

