Bioinformatics Algorithms and Data Structures

Bioinformatics Algorithms and Data Structures Chapter 17.1: Strings and Evolutionary Trees Lecturer: Dr. Rose Slides by: Dr. Rose April 8, 2003

Strings and Evolutionary Trees “…the great Tree of Life fills with its dead and broken branches the crust of the earth, and covers the surface with its ever-branching and beautiful ramifications.” - Darwin

Strings and Evolutionary Trees There are three competing approaches to constructing classification trees: • Evolutionary taxonomy • Numerical taxonomy • Cladistics

Strings and Evolutionary Trees • Evolutionary taxonomy • classification informed by evolutionary theory • Fills in internal nodes corresponding to common ancestors.

Strings and Evolutionary Trees • Numerical taxonomy (Phenetics) • Studies the relationship between groups of organisms based on their degree of similarity. • Similarity can be in terms of molecular, phenotypic or anatomical data. • The resulting graph, a tree-like network, is called a phenogram. • Maximum Likelihood method.

Strings and Evolutionary Trees • Cladistics • Characterized by character-state methods, e.g., maximum parsimony. • Guiding principle: • not all character states shared by organisms provide evolutionary information • Important to restrict consideration to evolutionarily significant states.

Strings and Evolutionary Trees • Cladistics continued • Willi Hennig Society: • http://www.cladistics.org/

Strings and Evolutionary Trees Tree building algorithms: • Distance-based methods • Input: distance data such as sequence edit distance • Output: weighted tree with pairwise distances matching evolutionary distance • We will consider data that is: • Ultrametric (section 17.1) • Additive but not ultrametric (section 17.2) • Nonadditive data (no section)

Strings and Evolutionary Trees Tree building algorithms: continued • Maximum-parsimony methods • Character-based methods • Input: character data (often aligned sequences) • Output: tree with • input taxa at leaves • Inferred taxa at internal nodes • Goal: minimize the total cost of mutations • maximize parsimony. • Seeks a tree that has the minimum cost over all possible trees

Ultrametric trees and distances Before discussing ultrametric distances, consider additive distances: Defn. Additive distances are distances that can be fitted to an unrooted tree such that every pairwise distance between taxa equals the sum of the branch lengths connecting them. (Table and figure from http://imbs.massey.ac.nz/Research/MolEvol/Farside/DNA/00312.html)
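To make the definition of additive distances concrete, here is a minimal Python sketch that derives pairwise distances from a weighted unrooted tree by summing branch lengths along the connecting path. The tree, its branch lengths, and the names `tree_distances`, `EDGES`, `TAXA` and `ADDITIVE` are illustrative assumptions, not data from the slides.

```python
from collections import defaultdict

def tree_distances(edges, taxa):
    """Pairwise path-length distances between taxa in a weighted tree.

    edges: iterable of (node_u, node_v, branch_length).
    taxa:  the leaf names whose pairwise distances are wanted.
    """
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def distances_from(source):
        # In a tree there is exactly one path between two nodes,
        # so a plain traversal accumulates the correct branch sum.
        dist = {source: 0.0}
        stack = [source]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    stack.append(nxt)
        return dist

    result = {}
    for a in taxa:
        d = distances_from(a)
        for b in taxa:
            result[(a, b)] = d[b]
    return result

# Hypothetical unrooted tree: leaves A, B hang off internal node x,
# leaves C, D hang off internal node y, and x-y is an internal branch.
EDGES = [("A", "x", 2.0), ("B", "x", 3.0), ("x", "y", 1.0),
         ("C", "y", 4.0), ("D", "y", 1.0)]
TAXA = ["A", "B", "C", "D"]
ADDITIVE = tree_distances(EDGES, TAXA)
```

Any distance table produced this way is additive by construction; the harder direction, fitting a tree to a given table, is the subject of the tree-building algorithms.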

Ultrametric trees and distances Ultrametric distances are more constrained than additive distances. Defn. Ultrametric distances are distances that: • fit a tree so that the distance between any two taxa is equal to the sum of the branches joining them. • for any three taxa i, j and k, the two largest distances are equal, i.e., • If d_ik > d_jk then d_ik = d_ij • Else if d_ik > d_ij then d_ik = d_jk • Else d_ij = d_jk
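The three-taxa condition can be checked mechanically. The following is a minimal Python sketch, assuming the distances are given as a square list of lists; the function name `is_ultrametric`, the tolerance `tol`, and both example matrices are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """True if D is symmetric and, for every triple of taxa,
    the two largest of the three pairwise distances are equal."""
    n = len(D)
    for i in range(n):
        for j in range(n):
            if abs(D[i][j] - D[j][i]) > tol:
                return False
    for i, j, k in combinations(range(n), 3):
        # After sorting, the last two values are the two largest.
        _, second, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if abs(largest - second) > tol:
            return False
    return True

# Hypothetical ultrametric matrix over taxa A, B, C, D.
ULTRA = [[0, 2, 5, 5],
         [2, 0, 5, 5],
         [5, 5, 0, 3],
         [5, 5, 3, 0]]

# Hypothetical matrix that is not ultrametric:
# the triple (A, B, C) has distances 5, 7, 8.
NOT_ULTRA = [[0, 5, 7],
             [5, 0, 8],
             [7, 8, 0]]
```

Here `is_ultrametric(ULTRA)` holds and `is_ultrametric(NOT_ULTRA)` does not. The check visits every triple, so it takes time cubic in the number of taxa.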

Ultrametric trees and distances Q: What is an ultrametric tree? An ultrametric tree T for an n-by-n symmetric distance matrix D has the following properties: • T has n leaves, each labeled by a unique row of D. • Each internal node is labeled by one entry from D and has at least two children. • The numbers labeling internal nodes strictly decrease along any path from the root to a leaf. • For any two leaves i and j, D(i, j) is the label of the least common ancestor of i and j in T. • The distances in D must be ultrametric.

Ultrametric trees and distances Consider the following example from the textbook: • Verify the ultrametric condition, i.e., for any three taxa, two of the distances are equal and at least as large as the third.

Ultrametric trees and distances Interpretation of ultrametric as evolutionary trees: • The leaves are the existing OTUs • The internal nodes are the divergence events A divergence event is a point where the evolutionary histories of two OTUs split.

Ultrametric trees and distances Q: If taxa A and B diverge at time t, which statements are implied by the meaning of divergence? • A is the ancestor of B • B is the ancestor of A • Neither A nor B is an ancestor of the other. • Neither A nor B has a living ancestor.

Ultrametric trees and distances If the branching order & time of each divergence is known: • The label at each internal node is the time of the divergence event corresponding to that node. • The labels from the root to leaves must be strictly increasing. • D(i, j) is the time that taxa i and j diverged. • The author calls T a min-ultrametric tree.

Ultrametric trees and distances Equivalently, if the evolutionary history is known: • The label at each internal node is the time that has passed since the divergence event corresponding to that node. • The labels from the root to leaves must be strictly decreasing. • D(i, j) is the time since taxa i and j diverged. • T is an ultrametric tree for D.

Ultrametric trees and distances Defn. The symmetric matrix D defines an ultrametric distance iff for any three indices i, j and k, the two largest distances are equal, i.e., • If d_ik > d_jk then d_ik = d_ij • Else if d_ik > d_ij then d_ik = d_jk • Else d_ij = d_jk Call D ultrametric if it defines ultrametric distances. Thm. Distance matrix D has an ultrametric tree iff D is an ultrametric matrix. (Proof page 451)

Ultrametric trees and distances Proof. • (if T is ultrametric then D is ultrametric) If T is ultrametric: (draw T) • each internal node v is labeled D(i, j), where i and j are leaves and v is their least common ancestor. • For any three leaves i, j, k in T, let u be their least common ancestor; then: • u is labeled by at least two of {D(i, j), D(i, k), D(j, k)}, i.e., two of these are equal. • Furthermore, the remaining one of {D(i, j), D(i, k), D(j, k)} is no larger, because it labels u or a descendant of u. Therefore D is ultrametric.
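This direction of the proof can be illustrated with a small Python sketch that reads D off a labeled tree by taking the label of each pair's least common ancestor. The nested-tuple tree encoding, the function name `matrix_from_tree`, and the example `TREE` are assumptions made for illustration only.

```python
def matrix_from_tree(tree):
    """Read D(i, j) off a labeled rooted tree.

    A tree is either a leaf name (str) or a pair (label, [subtrees]).
    Returns (leaves, D) where D maps ordered leaf pairs to the label
    of their least common ancestor.
    """
    D = {}

    def visit(node):
        if isinstance(node, str):
            return [node]
        label, children = node
        groups = [visit(child) for child in children]
        # Leaves from different children meet for the first time here,
        # so this node is their least common ancestor.
        for x in range(len(groups)):
            for y in range(x + 1, len(groups)):
                for a in groups[x]:
                    for b in groups[y]:
                        D[(a, b)] = D[(b, a)] = label
        return [leaf for group in groups for leaf in group]

    leaves = visit(tree)
    return leaves, D

# Hypothetical ultrametric tree: labels decrease from root to leaves.
TREE = (5, [(2, ["A", "B"]), (3, ["C", "D"])])
LEAVES, D_FROM_TREE = matrix_from_tree(TREE)
```

For this tree, D(A, B) = 2, D(C, D) = 3, and every pair that spans the root gets the label 5, so in each triple the two largest values coincide.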

Ultrametric trees and distances Proof. (if D is ultrametric then there is an ultrametric T) If D is ultrametric: • The number of distinct entries d in each row i defines the number of nodes from the root to leaf i. • Each node in this path is labeled in decreasing order with a distinct label. • Any node v on this path labeled D(i, j) must be the least common ancestor of leaves i and j.

Ultrametric trees and distances Proof. continued • The path to leaf i partitions the n-1 remaining leaves in d-1 classes. • Each distinct node on the path to i is labeled by the distance from i to to the leaves in that partition. Example:

Ultrametric trees and distances Proof. continued • We want to recursively find the ultrametric tree for each of the d-1 partitions and then combine them. • Consider the partition defined by internal node v. • Let j be a leaf contained in this partition. • Let l be some other leaf. There are three cases: • l is in the same partition as j. • l is in a partition between i and node v. • l is in a partition between node v and the root.

Ultrametric trees and distances Proof. continued • The three cases: Let i = A, j = F • l is in the same partition as j. example: l = B • l is in a partition between i and v. example: l = D • l is in a partition between v and the root. example: l = C

Ultrametric trees and distances Proof. continued Case 1: (i = A, j = F, l = B) D(i, j) = D(i, l) thus D(j, l)  D(i, j). Why?  So we can add the subtree containing j & l. knowing that D(j, l) is correct.

Ultrametric trees and distances Typos in text Proof. continued Case 2: (i = A, j = F, l = D) D(i, l) <D(i, j) thus D(i, j) = D(j, l)  So we can add the subtree at v containing j knowing that D(j, l) is correct, i.e. v is labeled correctly.

Ultrametric trees and distances Typos in text Proof. continued Case 3: (i = A, j = F, l = C) D(i, l) >D(i, j) thus D(i, l) = D(j, l)  So we can add the subtree at v containing j knowing that D(j, l) is correct, i.e., it labels their least common ancestor .

Ultrametric trees and distances Proof. continued In each of the three cases, the ultrametric tree defined by v can be correctly attached to v. Hence, using recursion, we can construct the ultrametric tree T for D.

Ultrametric trees and distances Gusfield presents two related theorems: • Thm. If D is ultrametric, then the ultrametric tree for D is unique. This is a consequence of the fact that the nodes that appear on the path to a given leaf i must appear in every ultrametric tree for D. • Thm. If D is ultrametric, then the ultrametric tree for D can be constructed in O(n^2) time.

Ultrametric trees and distances Given ultrametric data we can: • Reconstruct evolutionary history. • Find the relative divergence times. • Find the exact tree topology. Q: How do we get ultrametric data? Consider the molecular clock theory.

Molecular Clock Theory • Proposed by Emile Zuckerkandl and Linus Pauling. • Idea: accepted mutations occur at a constant rate for a given protein. • There are three important issues: • Accepted mutations are those that still allow the protein to function properly. Lethal mutations will be selected against and should not accumulate.

Molecular Clock Theory • The theoretical clock rate is protein specific. • Different proteins will have different clocks. • Gusfield mentions hemoglobin and cytochrome c. • Both are stable and similar in all mammals. • However, hemoglobin mutates faster than cytochrome c.

Molecular Clock Theory • The implication is that the number of mutations is proportional to the length of the time interval. • Requirement: the interval must be “long enough”. • The length of an interval can be measured by the number of mutations. • This requires that the clock be calibrated.

Molecular Clock Theory Assumptions: • all DNA mutates at the same rate • Observed accepted rate differences are due to different constraints: • Natural selection at the organism level • Physical chemistry at the molecular level

Molecular Clock Theory Q: How would we use the molecular clock to collect ultrametric data? • Find a common protein for two taxa of interest. • Determine the number of accepted mutational differences, say k. • By molecular clock theory, each taxon contributed k/2 accepted mutations.

Molecular Clock Theory • Do this for each pair of n taxa. • Result: n choose 2 numbers satisfying the requirement for an ultrametric tree. • The ultrametric tree will describe the true evolutionary history for the n taxa. Great huh?
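The data-collection procedure can be sketched in Python under a strong simplifying assumption: that the number of accepted mutational differences k can be approximated by counting mismatched positions between aligned sequences of equal length. The sequences below are invented toy strings, not real protein data, and the names `mutational_differences`, `pairwise_differences` and `TOY` are hypothetical.

```python
from itertools import combinations

def mutational_differences(seq_a, seq_b):
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)

def pairwise_differences(sequences):
    """Return k for each of the n-choose-2 pairs of taxa."""
    return {
        (a, b): mutational_differences(sequences[a], sequences[b])
        for a, b in combinations(sorted(sequences), 2)
    }

# Invented toy sequences for three hypothetical taxa.
TOY = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "TCGAACGG",
}
PAIRS = pairwise_differences(TOY)

# Under the clock assumption each lineage contributed half of k.
PER_LINEAGE = {pair: k / 2 for pair, k in PAIRS.items()}
```

For these toy strings the three values of k are 1, 3 and 3, so the two largest are equal, as the ultrametric condition requires. Counting mismatches ignores positions that mutated more than once, so it can only approximate the true number of accepted mutations.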

Molecular Clock Theory If only the molecular clock theory were correct. • In the real world, the situation is complicated. • Molecular clock rates can and do diverge. • Sometimes there are common mutation rates. • Sometimes the mutation rates diverge.

Additive Distance Trees
