## Phylogenetic Analysis

**General comments on phylogenetics**
• Phylogenetics is the branch of biology that deals with evolutionary relatedness
• It uses some measure of evolutionary relatedness, e.g., morphological features
• Phylogenetics on sequence data is an attempt to reconstruct the evolutionary history of those sequences
• Relationships between individual sequences are not necessarily the same as those between the organisms they are found in
• The ultimate goal is to be able to use sequence data from many sequences to give information about the phylogenetic history of organisms
• Phylogenetic relationships are usually depicted as trees, with branches representing ancestors of “children”; the bottom of the tree (individual organisms) consists of leaves. Individual branch points are nodes.

**Phylogenetic trees**
[Figure: an unrooted tree and a rooted tree, each relating A, B, C, and D; the rooted tree carries a time axis, while the direction of time in the unrooted tree is left as a question.]
• We will only consider binary trees: edges split only into two branches (daughter edges)
• Rooted trees have an explicit ancestor; the direction of time is explicit in these trees
• Unrooted trees do not have an explicit ancestor; the direction of time is undetermined in such trees

**Types of phylogenetic analysis methods**
• Phenetic: trees are constructed based on observed characteristics, not on evolutionary history (distance methods)
• Cladistic: trees are constructed by fitting observed characteristics to some model of evolutionary history (parsimony and maximum likelihood methods)

**Similarity and Homology**
• The evolutionary relationship between sequences is inferred from the similarity of the sequences
• Similarity is a measurable quantity (e.g., % identity, alignment score, etc.)
• Homology is the inference from sequence similarity data that sequences are evolutionarily related

**Sequence alignments**
• Aligning sequences gives information about:
  • Similarity
  • Areas of sequences that are conserved through evolution

**The real problem …**
• How do we compare sequences?
• Seq 1: CTGCACTA
• Seq 2: CACTA
• or C---ACTA
• Scoring tries to approximate evolution: scores for substitutions and for gaps (insertions/deletions)
• Score = sum of terms for substitutions and for gaps (sequence treated as a character string)

**Sequence alignment I**
• Simplest scoring: 1 for match, 0 for no match
• CTGCACTA aligned with CACTA (ungapped): Score = 5
• CTGCACTA aligned with C---ACTA: Score = 5

**Sequence alignment II**
• Slightly more advanced scoring: +1 for match, 0 for no match, -1 for gap
• CTGCACTA aligned with CACTA (ungapped): Score = 5
• CTGCACTA aligned with C---ACTA: Score = 2

**Identity scoring matrices**
Simple form:

|   | G | C | A | T |
|---|---|---|---|---|
| G | 1 | 0 | 0 | 0 |
| C | 0 | 1 | 0 | 0 |
| A | 0 | 0 | 1 | 0 |
| T | 0 | 0 | 0 | 1 |

With mismatch penalty:

|   | G | C | A | T |
|---|---|---|---|---|
| G | 1 | -1 | -1 | -1 |
| C | -1 | 1 | -1 | -1 |
| A | -1 | -1 | 1 | -1 |
| T | -1 | -1 | -1 | 1 |

**In-class exercise II**
• Using the “advanced scoring method”, calculate the scores for the following pairs of nucleotide sequences:

**What about proteins?**
• The chemistry of amino acids means that some substitutions in the sequence are better than others
• Substitution matrix: empirically derived scores for the frequency of substitution of each amino acid by each of the 19 others

**In-class exercise III**
• Using the BLOSUM62 substitution matrix and a gap penalty of -2, score the following pairs of protein sequences (do not penalize end gaps)

**Dynamic programming: strategy**
• Break the alignment problem into small pieces
• Optimize the first piece
• Then extend into the second piece; since the first piece is already optimized, the program only needs to optimize the extension
• Continue until the end of the comparison

**Why multiple alignments?**
• Alignment of more than two sequences
• Usually gives better information about conserved regions and function (more data)
• Better estimate of significance when using a sequence of unknown function
• Must use multiple alignments when establishing phylogenetic relationships

**Dynamic programming extended to many dimensions?**
• No: it uses up too much computer time and space
For example, with sequences of 200 amino acids, a pairwise alignment must evaluate $4 \times 10^{4}$ matrix elements
• With 3 sequences, $8 \times 10^{6}$ matrix elements
• With 6 sequences, $6.4 \times 10^{13}$ matrix elements

**Need to find more efficient method**
• Give up the guarantee of an optimal alignment in exchange for a good alignment that is found much faster

**Feng-Doolittle algorithm**
• Does all pairwise alignments and scores them
• Converts pairwise scores to “distances”
• $D = -\log S_{\mathrm{eff}} = -\log\left[(S_{\mathrm{obs}} - S_{\mathrm{rand}})/(S_{\max} - S_{\mathrm{rand}})\right]$
• $S_{\mathrm{obs}}$ = pairwise alignment score
• $S_{\mathrm{rand}}$ = expected score for random alignment
• $S_{\max}$ = average of self-alignments of the two sequences
• As $S_{\mathrm{obs}}$ approaches $S_{\mathrm{rand}}$ (increasing evolutionary distance), $S_{\mathrm{eff}}$ goes down; to make the distance measure positive, use the $-\log$
• Once the distances have been calculated, construct a guide tree (more in the phylogeny class), which tells what order to group the sequences in
• Sequences can be aligned with sequences or groups; groups can be aligned with groups
• Sequence-sequence alignments: dynamic programming
• Sequence-group alignments: all possible pairwise alignments between the sequence and the group are tried; the highest scoring pair determines how the sequence gets aligned to the group
• Group-group alignments: all possible pairwise alignments of sequences between the groups are tried; the highest scoring pair determines how the groups get aligned

**Example**
[Figure: guide tree over Seq1 to Seq5 showing Alignment 1, Alignment 2, and Alignment 3 combining into the final alignment.]
• Notice that this method does not guarantee the optimum alignment, just a good one. Gaps are preserved from alignment to alignment: “once a gap, always a gap”

**Distance methods**
• Measuring distance: just as when we talked about multiple alignment, distance represents all the differences at the various positions; these differences can be treated as equal or weighted according to empirical knowledge of substitution rates
• Another way to say this is that there is a set of distances $d_{ij}$ between each pair of sequences $i, j$ in the dataset.
• $d_{ij}$ can be the fraction $f$ of sites $u$ where residues $x_i$ and $x_j$ differ; or $d_{ij}$ can be such a fraction weighted in some way (e.g., Jukes-Cantor distance)

**Clustering algorithms**
• UPGMA: this is the distance clustering method that is used in pileup to make the guide tree
• $d_{ij}$ is the average distance between pairs of sequences found in two clusters, $C_i$ and $C_j$
• Text’s notation: $|C_i|$ = number of sequences in $C_i$
• The algorithm in the text means just what we said before: find the closest distance between two sequences and cluster those; then find the next closest distance and cluster those; as sequences are added to existing clusters, find the average distance between existing clusters
• Work through the notation!
• UPGMA assumes a molecular clock mechanism of evolution

**Neighbor-joining**
• Corrects for UPGMA’s assumption of the same rate of evolution for each branch by modifying the distance matrix to reflect different rates of change
• The net difference between sequence $i$ and all other sequences is $r_i = \sum_k d_{ik}$
• The rate-corrected distance matrix is then $M_{ij} = d_{ij} - (r_i + r_j)/(n - 2)$
• Join the two sequences $i$ and $j$ whose $M_{ij}$ is minimal; then calculate the distance from this new node $k$ to every other sequence $m$ using $d_{km} = (d_{im} + d_{jm} - d_{ij})/2$
• Again correct for rates and join nodes

**In-class exercise I**
• Retrieve the file named phylo2 from bioinfI.list in my directory
• Open it in the editor and select all the sequences
• Select Functions > Evolution > PAUPSearch; in Tree Optimality Criterion choose distance; in Method for Obtaining Best Tree choose heuristic. Leave everything else as default (make sure the bootstrap option is not selected)
• Select Run.
Inspect the output.

**Parsimony methods**
• Parsimony methods are based on the idea that the most probable evolutionary pathway is the one that requires the smallest number of changes from some ancestral state
• For sequences, this implies treating each position separately and finding the minimal number of substitutions at each position

**Example of parsimonious tree building**
• The tree on the left requires only one change and the tree on the right requires two: the left tree is the most parsimonious
• Parsimony methods assign a cost to each tree available to the dataset, then screen those trees and select the most parsimonious
• Screening all the trees available to even a smallish dataset would take too much time; the branch and bound method builds trees with increasing numbers of leaves but abandons a topology whenever the current tree has a bigger cost than the best complete tree found so far

**In-class exercise II**
• Use the same data set and program as in exercise I, but choose maximum parsimony. Use heuristic for the tree building method.
• Inspect your tree. Compare it to the distance-generated tree.

**Maximum likelihood methods**
• Maximum likelihood reconstructs a tree according to an explicit model of evolution.
For the given model, no other method will work as well
• But such models must be simple, because the method is computationally intensive
• Actually, all the other methods discussed implicitly use a simple model of evolution similar to the typical model made explicit in maximum likelihood:
  • All sites selectively neutral
  • All mutate independently, forward and reverse rates equal, given by $\mu$
  • Also assume discrete generations and sites change independently
• Given this model, we can calculate the probability that a site with initial nucleotide $i$ will change to nucleotide $j$ within time $t$:
• $P^{t}_{ij} = \delta_{ij} e^{-\mu t} + (1 - e^{-\mu t})\, g_j$, where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise, and where $g_j$ is the equilibrium frequency of nucleotide $j$
• The likelihood that some site is in state $i$ at the $k$th node of a tree is $L_i(k)$
• The likelihoods for all states for each site for each node are calculated separately; the product of the likelihoods for each site gives the overall likelihood for the observed data
• Different tree topologies are searched to find the highest overall likelihood
• Maximum likelihood is perhaps the “gold standard” for phylogenetic analysis, but because of its computational intensity it can only be used for select data and only after much initial fine tuning of many parameters of sequence alignments
• Often used to distinguish between several already generated trees

**Assessing trees**
• The bootstrap: randomly sample all positions (columns in an alignment) with replacement, meaning some columns can be repeated, while conserving the number of positions; build a large dataset of these randomized samples
• Then use your method (distance, parsimony, likelihood) to generate another tree
• Do this a thousand or so times
• Note that if the assumptions the method is based on hold, you should always get the same tree from the bootstrapped alignments as you did originally
• The frequency of some feature of your phylogeny in the bootstrapped set gives some measure of the confidence you can have in this feature
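
**Sketch: scoring an alignment**
The scoring rules from the Sequence alignment I and II slides can be checked with a short Python sketch. It assumes the two sequences are already aligned to equal length, with `-` marking a gap; the function name `score_alignment` and its parameter names are illustrative and not part of the lecture.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Sum per-column scores for two pre-aligned, equal-length strings.

    A column containing '-' is scored as a gap; otherwise it is a
    match or a mismatch. Default values follow the "advanced" scheme
    on the slides (+1 match, 0 mismatch, -1 gap).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# Simplest scheme (gaps cost nothing) and advanced scheme (gap = -1)
print(score_alignment("CTGCACTA", "C---ACTA", gap=0))  # 5
print(score_alignment("CTGCACTA", "C---ACTA"))         # 2
```

Passing `mismatch=-1` gives the identity matrix with a mismatch penalty.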
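
**Sketch: Feng-Doolittle distance**
The conversion from pairwise scores to distances in the Feng-Doolittle slides can be written out directly. This is a minimal sketch: the slides do not state the base of the logarithm, so the natural log is an assumption here, and the scores in the example are made-up numbers.

```python
import math


def feng_doolittle_distance(s_obs, s_rand, s_max):
    """Return D = -log(S_eff), with S_eff = (S_obs - S_rand) / (S_max - S_rand).

    Natural log is an assumption; the slides only write "log".
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        raise ValueError("observed score is no better than random")
    return -math.log(s_eff)


# Hypothetical scores: identical sequences give distance 0,
# and the distance grows as the observed score falls toward random.
print(feng_doolittle_distance(s_obs=100, s_rand=20, s_max=100))
print(feng_doolittle_distance(s_obs=60, s_rand=20, s_max=100))
```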
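
**Sketch: pairwise distances**
The two kinds of distance mentioned under Distance methods, a plain fraction of differing sites and a weighted version such as Jukes-Cantor, can be sketched as follows. The Jukes-Cantor correction used is the standard nucleotide form, $d = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}f)$; skipping columns that contain a gap is an assumption of this sketch, not something the slides specify.

```python
import math


def fraction_differing(x, y):
    """Fraction f of compared sites at which two aligned sequences differ.

    Columns with a gap ('-') in either sequence are skipped (assumption).
    """
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to equal length")
    compared = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in compared) / len(compared)


def jukes_cantor(f):
    """Jukes-Cantor corrected distance for nucleotides."""
    if f >= 0.75:
        raise ValueError("fraction too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


f = fraction_differing("ACGTACGT", "ACGTACGA")
print(f, jukes_cantor(f))
```

The corrected distance is always at least as large as the raw fraction, because it accounts for multiple changes at the same site.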
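
**Sketch: one neighbor-joining step**
The neighbor-joining formulas from the clustering slides (net difference, rate-corrected matrix, and distance to the new node) can be followed through one joining step. This sketch assumes a symmetric dict-of-dicts distance matrix; the four-sequence distances are hypothetical, and branch lengths, which a complete implementation would also report, are left out.

```python
from itertools import combinations


def neighbor_joining_step(d):
    """Join the pair with minimal rate-corrected distance.

    d is a symmetric dict of dicts: d[a][b] is the distance from a to b.
    Returns (i, j, new_d), where new_d replaces i and j with one node.
    """
    nodes = list(d)
    n = len(nodes)
    if n < 3:
        raise ValueError("need at least three nodes")
    # Net difference between each node and all the others
    r = {i: sum(d[i][k] for k in nodes if k != i) for i in nodes}
    # Rate-corrected distances
    m = {(i, j): d[i][j] - (r[i] + r[j]) / (n - 2)
         for i, j in combinations(nodes, 2)}
    i, j = min(m, key=m.get)
    new = f"({i},{j})"
    rest = [k for k in nodes if k not in (i, j)]
    new_d = {a: {b: d[a][b] for b in rest if b != a} for a in rest}
    new_d[new] = {}
    for k in rest:
        dist = (d[i][k] + d[j][k] - d[i][j]) / 2
        new_d[new][k] = dist
        new_d[k][new] = dist
    return i, j, new_d


# Hypothetical distances for four sequences with unequal rates
pairs = {("A", "B"): 3, ("A", "C"): 4, ("A", "D"): 7,
         ("B", "C"): 3, ("B", "D"): 6, ("C", "D"): 5}
d = {x: {} for x in "ABCD"}
for (a, b), value in pairs.items():
    d[a][b] = value
    d[b][a] = value

i, j, new_d = neighbor_joining_step(d)
print(i, j, new_d)
```

Calling the function again on `new_d` repeats the "correct for rates and join" cycle until only two nodes remain.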
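
**Sketch: counting changes at one site**
The left-tree versus right-tree comparison in the parsimony example can be reproduced by counting the minimum number of changes a given tree needs at one site. The slides do not name a counting method; this sketch swaps in Fitch's algorithm, and the nested-tuple tree encoding and the A/G site pattern are illustrative assumptions.

```python
def fitch(tree):
    """Return (possible_states, min_changes) for one site.

    A leaf is a one-character string; an internal node is a
    (left, right) tuple. This is Fitch's algorithm for binary trees.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_changes = fitch(left)
    right_states, right_changes = fitch(right)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    # No shared state: one substitution is needed on this branch pair
    return left_states | right_states, left_changes + right_changes + 1


# Same four leaf states, two different topologies
print(fitch((("A", "A"), ("G", "G")))[1])  # 1
print(fitch((("A", "G"), ("A", "G")))[1])  # 2
```

Summing this count over all alignment columns gives the cost that a parsimony search compares between trees.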
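
**Sketch: substitution probability**
The substitution probability from the maximum likelihood slides can be evaluated numerically to see how it behaves. In this sketch the mutation rate is called `mu` and the equilibrium frequencies are passed as a dict `g`; the uniform frequencies and the rate value are assumptions chosen for illustration.

```python
import math


def substitution_probability(i, j, t, mu, g):
    """Probability that a site starting as i is j after time t.

    Implements delta_ij * exp(-mu*t) + (1 - exp(-mu*t)) * g[j].
    """
    decay = math.exp(-mu * t)
    same = 1.0 if i == j else 0.0
    return same * decay + (1.0 - decay) * g[j]


# Assumed equilibrium frequencies (uniform) and rate
g = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
mu = 0.5

for t in (0.0, 1.0, 1000.0):
    row = [substitution_probability("A", j, t, mu, g) for j in "ACGT"]
    print(t, row, sum(row))
```

At `t = 0` nothing has changed, each row sums to 1 for any `t`, and for large `t` the probabilities approach the equilibrium frequencies regardless of the starting nucleotide.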
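
**Sketch: bootstrap resampling**
The column resampling described under Assessing trees can be sketched in a few lines. The alignment is assumed to be a list of equal-length strings, one per sequence; the function name and the toy alignment are hypothetical, and the tree-building step is left to whichever method (distance, parsimony, likelihood) is being assessed.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the column count."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows must have the same length")
    # Draw column indices with replacement; some repeat, some are left out
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in cols) for row in alignment]


# Toy alignment (hypothetical data); a fixed seed makes the run repeatable
alignment = ["ACGTAC", "ACGTTC", "ACCTAC"]
rng = random.Random(0)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]
print(replicates[0])
```

Each replicate would then be passed to the same tree-building method, and the fraction of replicate trees containing a given grouping is the bootstrap support for that grouping.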