Bioinformatics Algorithms and Data Structures

Chapter 17.2-3: Strings and Evolutionary Trees. Lecturer: Dr. Rose. Slides by: Dr. Rose. April 5, 2007

Next Homework: Due 4/19/07 • #3 • #8 Note: there are three parts to this question. Only answer the first two parts, i.e., • show that if D is ultrametric and D(i, i)=0 for each i, then D is also additive. • Show the converse is not true. • #14 (Grad students only.)

Additive Distance Trees • Real data is rarely ultrametric. • A weaker constraint is that data be additive. • Recall: Additive distances are distances which: • can be fitted to an unrooted tree such that • pairwise taxa distances are equal to the sum of the branch lengths connecting them.

Additive Distance Trees • Consider the relationship between additive and ultrametric trees: • Q: Are all ultrametric trees additive? • A: Yes. • Q: Are all additive trees ultrametric? • A: No.

Additive Distance Trees • Assuming that: • D is a symmetric n by n distance matrix • D contains only zero values on the diagonal • D contains only positive off-diagonal values • T is an edge-weighted tree with at least n nodes, n of which are labeled with the rows of D, then: • Defn. T is an additive tree for D if, for every pair of labeled nodes (i, j), the path from i to j has total weight exactly D(i, j).
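The definition above can be checked mechanically. The following is a minimal sketch, assuming the tree is given as a list of (u, v, weight) edges and D as a nested dictionary keyed by the labeled nodes; the function names and the input layout are illustrative choices, not part of the slides. The sketch trusts that the edge list really forms a tree and compares distances exactly, so it suits integer weights.

```python
from collections import defaultdict

def path_weights(edges, source):
    """Total path weight from source to every node of a weighted tree."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = {source: 0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:          # in a tree, each node is reached once
                dist[v] = dist[u] + w
                stack.append(v)
    return dist

def is_additive_tree(edges, D):
    """True if every pair of labeled nodes (i, j) has path weight D[i][j]."""
    labels = list(D)
    for i in labels:
        dist = path_weights(edges, i)
        if any(dist.get(j) != D[i][j] for j in labels):
            return False
    return True

# Three taxa a, b, c joined through one unlabeled internal node x.
star = [("a", "x", 1), ("b", "x", 2), ("c", "x", 3)]
D = {"a": {"a": 0, "b": 3, "c": 4},
     "b": {"a": 3, "b": 0, "c": 5},
     "c": {"a": 4, "b": 5, "c": 0}}
print(is_additive_tree(star, D))
```

Note that the internal node x has no row in D, so this tree is additive for D without being compact.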

Additive Distance Trees Additive tree problem, given: • symmetric matrix D • zero entries on the diagonal • Positive off-diagonal values Find additive tree T for D or determine that one does not exist. Imagine that you have a distance matrix D representing evolutionary distance between pairs of taxa. Q: Do you expect the additive tree for D to be unique?

Additive Distance Trees Q: Is there a unique additive tree for D? If you think the answer is yes, why? If you think the answer is no, why? Consider what we know about D and T: • Is T’s branching pattern consistent with D? (y/n) • Are the edge lengths in T consistent with D? (y/n) • Does D specify directed edges? (y/n) • Does D imply directed edges? (y/n)

Additive Distance Trees (Table and figure from http://imbs.massey.ac.nz/Research/MolEvol/Farside/DNA/00312.html)

Additive Distance Trees Concept: Additive tree problem Given: n by n symmetrical matrix D with zero diagonal entries positive off-diagonal values Find: additive tree or determine none exists.

Additive Distance Trees Concept: Compact Additive tree problem Given: n by n symmetrical matrix D with zero diagonal entries positive off-diagonal values Find: additive tree with exactly n nodes. Q: What does this definition say about the topology of the tree? A: For every node, there must be a corresponding row in D.

Additive Distance Trees Consider the symmetrical matrix D above and the tree T. Q: Is T an additive tree for D? Q: Is T a compact additive tree for D?

Additive Distance Trees Consider the symmetrical matrix D above. Q: Does D have an additive tree? Q: Does D have a compact additive tree?

Additive Distance Trees Defn. Let G(D) be the n-node complete graph corresponding to D where nodes are labeled 1 – n and edges have weight D(i, j). Thm. If there is a compact additive tree T for D, then T must be the unique minimum spanning tree of G(D).

Additive Distance Trees Proof. Let: T be a compact additive tree for D. e = (x, y) be any edge of G(D) not in T. We know: The path from x to y in T has total weight D(x, y). The weight of e is also D(x, y). Since e is not in T, the path from x to y in T has at least two edges, each with positive weight, so the weight of e is strictly greater than the weight of any edge on that path.

Additive Distance Trees Proof (continued). Assume that there is some other minimum spanning tree T´ containing e. Removing e splits the nodes of T´ into two sets, S & S´. WLOG, S contains x & S´ contains y. The path from x to y in T must cross from S to S´, so it contains an edge e´ connecting a node in S to a node in S´. Hence the weight of e´ is less than the weight of e.

Additive Distance Trees Proof (continued). Create a new spanning tree T´´ by removing e from T´ and adding e´. The total edge weight of T´´ is less than that of T´. This contradicts the assumption that T´ is a minimum spanning tree. Hence no minimum spanning tree contains an edge that is not in T, so T must itself be the unique minimum spanning tree of G(D).

Additive Distance Trees How can we use this theorem to solve the compact additive tree problem in O(n^2) time? Answer: • Construct G(D) from D. • Use an O(n^2) minimum spanning tree algorithm, such as Prim’s algorithm, that extends a single growing tree T. • Each time an edge e = (x, y) is added to T, where x is already in T and y is new: • Compute d(i, y) = d(i, x) + D(x, y) for all i in T. This takes O(n) per iteration and O(n^2) for all of T. • Verify that d(i, y) = D(i, y). If any check fails, D has no compact additive tree.
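The procedure above can be sketched as follows. This is an illustrative sketch, assuming D is a list of lists with nodes numbered from 0 and with distances that can be compared exactly (for example integers); the function name and the return convention (an edge list, or None when no compact additive tree exists) are choices made here, not taken from the slides.

```python
def compact_additive_tree(D):
    """Return the edges (x, y, weight) of a compact additive tree for D, or None."""
    n = len(D)
    in_tree = [False] * n
    best = [float("inf")] * n      # cheapest edge weight from T to each outside node
    parent = [None] * n            # tree endpoint of that cheapest edge
    d = [[0] * n for _ in range(n)]  # path weights in T between nodes already added
    order = []                     # nodes of T in the order Prim's algorithm added them
    edges = []
    if n:
        best[0] = 0                # start the growing tree at node 0
    for _ in range(n):
        # Prim step: pick the outside node y closest to the tree (O(n) scan).
        y = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        x = parent[y]
        if x is not None:
            edges.append((x, y, D[x][y]))
            for i in order:
                # d(i, y) = d(i, x) + D(x, y), then verify against D(i, y).
                d[i][y] = d[y][i] = d[i][x] + D[x][y]
                if d[i][y] != D[i][y]:
                    return None
        in_tree[y] = True
        order.append(y)
        for v in range(n):
            if not in_tree[v] and D[y][v] < best[v]:
                best[v] = D[y][v]
                parent[v] = y
    return edges

# A path 0 -1- 1 -2- 2 is a compact additive tree for this matrix.
print(compact_additive_tree([[0, 1, 3], [1, 0, 2], [3, 2, 0]]))
# Distances that need an extra internal node have no compact additive tree.
print(compact_additive_tree([[0, 3, 4], [3, 0, 5], [4, 5, 0]]))
```

Each of the n iterations does O(n) work for the scan, the distance updates, and the key updates, which gives the O(n^2) total.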

Parsimony Q: What is parsimony? A: Parsimony: extreme or excessive frugality. Q: So what does frugal mean? A: Frugal: thrifty, economical. In this chapter, parsimony is a character-based method for reconstructing evolutionary history. Characters are attributes, traits In this section we will look at highly constrained trees that express evolutionary history.

Parsimony • Can be used to deduce evolutionary trees • Specifies branching order • Does not specify divergence times • Can be used as basis for a taxonomy This section is a limited introduction to maximum parsimony problems: • Binary-character problems • Focus on perfect phylogeny problem

Parsimony Defn. Let M be an n by m, binary matrix representing n objects with m character traits. • Since M is binary, each character trait has two possible states, 1 or 0. • Cell (p, i) of M has value 1 iff object p has character i. • M has a flavor similar to the old chestnut animal guessing program that uses a binary tree.

Parsimony Defn. a phylogenetic tree for M is a rooted tree T with exactly n leaves such that: • Each of the n leaves is labeled by exactly one object. • Each of the m character-traits labels exactly one edge of T. • For any object p, the character-traits labeling the edges along the path from the root to p are exactly those character-traits whose state is one.

Parsimony Consider the matrices below: • Does either M1 or M2 have a phylogenetic tree? • If so, what does the tree look like?

Parsimony Q: What is the interpretation of the phylogenetic tree? A: It is an estimate of the divergent evolutionary history of the objects. (does not give time) • The root represents an ancestor with none of the m character-traits. • Each character-trait transitions from 0 to 1 only once. • No character-trait ever transitions from 1 to 0.

Parsimony Q: In what sense are phylogenetic trees parsimonious? A: Each character-trait labels exactly 1 edge of the tree. The biological assumptions are: • The root represents an ancestor with none of the m character-traits. • Each character-trait transitions from 0 to 1 only once. • No character-trait ever transitions from 1 to 0.

Parsimony Q: What character-traits can be used? Morphological features: • (from: http://anthro.palomar.edu/hominid/australo_2.htm) • (Also see: http://www.cfsan.fda.gov/~frf/rfe3pc00.html )

Parsimony Q: What character-traits can be used? • Morphological features: • Gross anatomical features • OTU-specific esoterica • DNA-based characters • specific substring patterns • Specific nucleotides in fixed positions • See pages 460 & 461 for more discussion

Parsimony Defn. Perfect phylogeny problem: given the binary matrix M, determine whether there is a phylogenetic tree for M and, if there is one, build it. • We will discuss an O(nm)-time algorithm. • First we need to preprocess M: • Consider each column as a binary number, with the most significant bit (msb) in row 1. • Sort the columns in decreasing order. • Let M´ denote the reordered matrix M.
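The preprocessing step can be sketched as below. This is a small illustration, assuming M is a non-empty list of rows of 0/1 integers; the function name is hypothetical. For brevity it uses Python's built-in comparison sort, which costs O(nm log m) rather than the O(nm) of the radix sort used in the lecture's algorithm.

```python
def sort_columns(M):
    """Return (M_prime, order): the columns of M sorted in decreasing order
    when each column is read as a binary number with row 1 as the msb."""
    n, m = len(M), len(M[0])
    # A column read top to bottom as a tuple compares exactly like the
    # corresponding binary number with the first row as most significant bit.
    cols = [tuple(M[p][j] for p in range(n)) for j in range(m)]
    order = sorted(range(m), key=lambda j: cols[j], reverse=True)
    M_prime = [[M[p][j] for j in order] for p in range(n)]
    return M_prime, order

M = [[1, 1, 0],
     [0, 1, 0],
     [0, 0, 1]]
M_prime, order = sort_columns(M)
print(order)    # original column indices, left to right in M'
print(M_prime)
```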

Parsimony Example.

Parsimony Defn. for any column k of M´, let Ok be the set of objects with a one in column k. Obs. If Oj is a proper subset of Ok, then column k must be to the left of column j in M´.

Parsimony Thm. Matrix M´ has a phylogenetic tree iff for every pair of columns i, j, either Oi and Oj are disjoint or one contains the other. Proof. (Sketch starting on next slide)

Parsimony Proof (⇒). Let T be the phylogenetic tree for M´. Consider characters i, j. Let ej be the edge on which character j transitions from 0 to 1. Let ei be the edge on which character i transitions from 0 to 1. Objects with character i are below ei in T. Objects with character j are below ej in T.

Parsimony Proof (⇒, continued). There are 4 possible cases: • Case 1: ei = ej. • Case 2: ei is on the path from the root to ej. • Case 3: ej is on the path from the root to ei. • Case 4: The paths diverge before reaching ei or ej. In case 1, Oi = Oj. In case 2, Oj ⊆ Oi since all objects possessing j possess i. In case 3, Oi ⊆ Oj since all objects possessing i possess j. In case 4, Oi ∩ Oj = ∅.

Parsimony Proof (⇐). Assume that for all i, j, Oi & Oj are disjoint or one contains the other. • Consider objects p and q. • Let k be the largest character (column label in M´) common to both. • All characters i < k possessed by p are also possessed by q. • All characters i < k possessed by q are also possessed by p. • So they share exactly the same characters up to k, and none thereafter.

Parsimony Proof (⇐, continued). Assume that for all i, j, Oi & Oj are disjoint or one contains the other. • Label each p with the string that is the concatenation of the column numbers for which it has nonzero entries. Likewise for q. • Append $ to the string so that no string is a prefix of any other. • p & q have a common prefix but diverge after k. • The keyword tree (sans failure links) for the n objects in M´ specifies a perfect phylogeny for M´.
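The condition in the theorem can also be tested directly, pair by pair. The sketch below is only an illustration of the statement, assuming M is a list of rows of 0/1 integers; the function name is hypothetical. It compares all pairs of columns and so runs in O(n m^2) time, which is slower than the O(nm) algorithm discussed in this section.

```python
from itertools import combinations

def has_phylogenetic_tree(M):
    """True iff every pair of column sets Oi, Oj is disjoint or nested."""
    n = len(M)
    m = len(M[0]) if M else 0
    # O[k] is the set of objects (row indices) with a one in column k.
    O = [{p for p in range(n) if M[p][k] == 1} for k in range(m)]
    for Oi, Oj in combinations(O, 2):
        if not (Oi.isdisjoint(Oj) or Oi <= Oj or Oj <= Oi):
            return False
    return True

print(has_phylogenetic_tree([[1, 1, 0], [0, 1, 0], [0, 0, 1]]))
print(has_phylogenetic_tree([[1, 1], [1, 0], [0, 1]]))
```

In the second matrix the two columns share object 0 but neither set contains the other, so the test fails.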

Parsimony O(nm) alg. for the perfect phylogeny problem: Step 1: • Reorder columns of M in descending order using radix sort. • Let M´ be the resulting matrix. • Label each column by its column position in M´. Q: Why do you think we are using radix sort? A: Radix sort is O(nm). Also, it can be applied to numbers with an arbitrary number of digits.

Parsimony Step 2: • For each row p of M´, construct the string consisting of the characters, in sorted (increasing) order, that p possesses. • Recall that in step 1 we labeled each character by its column position. • The string for a given row will be the concatenation of the column labels for which the row has the value one.

Parsimony Step 3: • Build the keyword tree T for the n strings from step 2. Recall that the keyword tree for set P is a rooted directed tree K satisfying: • Each edge is labeled with one character. • Any two edges out of the same node have distinct labels. • Every pattern Pi in P maps to some node v of K s.t. the path from the root to v spells out Pi. • Every leaf in K is mapped to by some pattern in P.

Keyword Trees Example: From textbook P = {potato, poetry, pottery, science, school}
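As a rough illustration of the keyword tree for this pattern set, the sketch below stores the tree as nested dictionaries, one dictionary per node and one key per outgoing edge label. The representation, the function name, and the use of a '$' key to mark the node a pattern maps to are assumptions made here; the sketch also assumes no pattern contains '$'.

```python
def build_keyword_tree(patterns):
    """Build a keyword tree (without failure links) as nested dictionaries."""
    root = {}
    for pat in patterns:
        node = root
        for ch in pat:
            # Follow the edge labeled ch, creating it if it does not exist,
            # so two edges out of one node never share a label.
            node = node.setdefault(ch, {})
        node["$"] = pat   # the path from the root to this node spells pat
    return root

tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
# Edge labels leaving the node reached by "pot": 'a' (potato) and 't' (pottery).
print(sorted(tree["p"]["o"]["t"]))
```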

Parsimony Step 4: • Test whether T is a perfect phylogeny for M. • Verify that T has exactly n leaves such that: • Each of the n leaves is labeled by exactly one object. • Each of the m character-traits labels exactly one edge of T. • For any object p, the character-traits labeling the edges along the path from the root to p are exactly those character-traits whose state is one.
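The whole procedure (sort the columns, build one string per object, build the keyword tree, test it) can be sketched in a few lines. This is a simplified sketch under several assumptions that are not in the slides: M is a non-empty list of rows of 0/1 integers, every column contains at least one 1, the tree is stored as nested dictionaries, a '$' edge leads to the leaf holding the object's row index, and objects with identical rows are recorded at the same leaf. It uses a comparison sort instead of radix sort, so it does not achieve the O(nm) bound. The test is folded into tree construction: the sketch gives up as soon as a character would label a second edge.

```python
def perfect_phylogeny(M):
    """Return a keyword tree that is a perfect phylogeny for M, or None."""
    n, m = len(M), len(M[0])
    # Step 1: sort columns in decreasing order as binary numbers (row 1 = msb).
    order = sorted(range(m), key=lambda j: [M[p][j] for p in range(n)],
                   reverse=True)
    # Step 2: for each object, the increasing list of column labels in M'
    # (positions in the sorted order) where the object has a one.
    strings = [[k for k, j in enumerate(order) if M[p][j] == 1]
               for p in range(n)]
    # Steps 3 and 4: build the keyword tree, checking that every character
    # labels exactly one edge.
    root = {}
    used = set()                      # characters that already label an edge
    for p, s in enumerate(strings):
        node = root
        for k in s:
            if k not in node:
                if k in used:         # k would label a second edge
                    return None
                used.add(k)
                node[k] = {}
            node = node[k]
        node.setdefault("$", []).append(p)   # leaf for object p
    return root

print(perfect_phylogeny([[1, 1, 0], [0, 1, 0], [0, 0, 1]]))
print(perfect_phylogeny([[1, 1], [1, 0], [0, 1]]))
```

The integer keys in the returned tree are column positions in M´, not in the original M; the `order` list inside the function maps them back.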

Tree Compatibility Suppose you have two different phylogenetic trees. Note: even for the same set of taxa we can derive different trees by basing the comparison on different proteins. Q: How can we determine if they describe a consistent evolutionary history? Q: How can we combine them into a single tree? This section addresses these questions.

Tree Compatibility Defn. Phylogenetic tree refinement: A phylogenetic tree T´ is a refinement of T if T can be obtained by a series of contractions of edges of T´. Nutshell: T´ agrees with T, but expresses additional evolutionary history.

Tree Compatibility Tree refinement: T1 & T2? T1 & T3? T1 & T4? Etc?

Tree Compatibility Defn. Phylogenetic tree compatibility: Trees T1 and T2 are compatible if there exists a phylogenetic tree T3 refining both T1 and T2.

Tree Compatibility Tree compatibility problem: Given two trees, T1 and T2: • determine if they are compatible. • if so, return the refinement tree T3. We will consider a matrix method for finding T3.

Tree Compatibility Consider a binary matrix representation of a phylogenetic tree: • There is one row for each object (OTU) • There is one column for each internal node • Entry (i, j) is one iff the leaf for object i is in the subtree rooted at j. Q: Would an example help? A: Ok, then suggest a simple phylogenetic tree.
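A small sketch of this matrix representation follows. It assumes the tree is given as a dictionary mapping each internal node to its list of children, with leaves named after the objects; this input layout and the function name are illustrative. The slide does not say whether the root gets a column; the sketch treats the root as an internal node and includes it, which is an assumption.

```python
def tree_to_matrix(children, root, objects):
    """Return (M, internal): M[i][j] is 1 iff objects[i] is a leaf in the
    subtree rooted at internal[j]."""
    leaves_below = {}

    def collect(v):
        kids = children.get(v, [])
        if not kids:
            leaves_below[v] = {v}     # a leaf covers only itself
        else:
            leaves_below[v] = set().union(*(collect(c) for c in kids))
        return leaves_below[v]

    collect(root)
    internal = [v for v in children if children[v]]
    M = [[1 if obj in leaves_below[v] else 0 for v in internal]
         for obj in objects]
    return M, internal

# The tree ((a, b), c): root r has children u and c; u has children a and b.
children = {"r": ["u", "c"], "u": ["a", "b"]}
M, internal = tree_to_matrix(children, "r", ["a", "b", "c"])
print(internal)
print(M)
```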

Tree Compatibility Let M1 be the matrix representation of T1 and similarly M2 for T2. Let M3 be the matrix formed by taking the union of the columns of M1 and M2. Q: What is meant by taking the union of columns? A: M3 will contain: • all columns found only in M1 • all columns found only in M2 • One copy of all columns appearing in both M1 and M2 • Obviously, columns will have a different order
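The union of columns can be sketched as below, assuming M1 and M2 are lists of rows of 0/1 integers whose rows list the same objects in the same order; the function name is hypothetical. Columns are kept in order of first appearance, and a column that is repeated, whether across the two matrices or within one of them, is kept once.

```python
def union_columns(M1, M2):
    """Return M3 whose columns are the distinct columns of M1 and M2."""
    n = len(M1)
    seen = set()
    cols = []
    for M in (M1, M2):
        for col in zip(*M):           # zip(*M) iterates over columns as tuples
            if col not in seen:
                seen.add(col)
                cols.append(col)
    return [[col[p] for col in cols] for p in range(n)]

M1 = [[1, 1], [1, 1], [1, 0]]   # tree ((a, b), c), root column first
M2 = [[1, 0], [1, 1], [1, 1]]   # tree (a, (b, c)), root column first
print(union_columns(M1, M2))
```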

Tree Compatibility Q: What should M3 look like? What about T3?

Tree Compatibility Q: Do you agree?

Tree Compatibility Note: In refining T3 to produce T4, the columns of M4 that come from M3 are unaffected.

Tree Compatibility Theorem: T1 and T2 are compatible iff there is a phylogenetic tree for M3. A phylogenetic tree T3 for M3 is a refinement of both T1 and T2.
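Combined with the earlier theorem on pairs of columns, this gives a simple compatibility test: form the union of the columns and check that every pair of column sets is disjoint or nested. The sketch below assumes M1 and M2 are lists of rows of 0/1 integers over the same objects in the same row order; the function name is hypothetical, and the pairwise check is quadratic in the number of columns rather than the faster method from this section. It only answers yes or no and does not build T3.

```python
from itertools import combinations

def compatible(M1, M2):
    """True iff the union of the columns of M1 and M2 has a phylogenetic tree."""
    # Distinct columns of both matrices, each as a tuple read top to bottom.
    cols = {tuple(c) for M in (M1, M2) for c in zip(*M)}
    # For each column, the set of objects (row indices) with a one.
    sets = [{p for p, bit in enumerate(c) if bit} for c in cols]
    return all(a.isdisjoint(b) or a <= b or b <= a
               for a, b in combinations(sets, 2))

T1 = [[1, 1], [1, 1], [1, 0]]   # ((a, b), c)
T2 = [[1, 0], [1, 1], [1, 1]]   # (a, (b, c))
star = [[1], [1], [1]]          # a, b, c all attached to the root
print(compatible(T1, T2))
print(compatible(T1, star))
```

The first pair is incompatible because the groups {a, b} and {b, c} overlap without nesting; the star tree is compatible with T1, which refines it.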
