CSCI6904 Genomics and Biological Computing

CSCI6904 Genomics and Biological Computing: Phylogenetics

Phylogeny
• A non-biological example
• Howe, CJ., Barbrook, AC., Spencer, M., Robinson, P., Bordalejo, B. and Mooney, LR., 2001, Manuscript Evolution, Trends in Genetics, 17(3), 147-152

An analogy that works well… Phylogeny

Phylogeny: An analogy that works well…
• Thanks to Gutenberg and his invention of the printing press, the rate at which manuscripts evolve has decreased by many orders of magnitude (to nearly 0, actually).
• Raw data
• The data has to be encoded in a slightly different manner, as it is preferable to treat words as characters. Consequently, the alphabet is of an unmanageable size.

The data is collected from extant manuscripts: Phylogeny

… and aligned so all characters are homologous: Phylogeny

Phylogeny: What can be discovered:
• Which manuscript is the closest to the original draft?
• Are all known manuscripts found in, say, Belgium descendants of a single copy of the manuscript?
• What would be the most likely text in the (long lost) original version?

What can be discovered:
• Whatever happened to the first chapter?
• In the case to the left, there is evidence from “phylogenetic” analysis that the first half of the prologue of the manuscript El was taken from a different source than the rest of the text.
• In genomics, if a gene gets misplaced in a tree, it may indicate that the gene was acquired by transfer rather than heredity.

There is evidence that the transfer of a single gene transformed a benign bacterium into Yersinia pestis, the agent of the “black death”.

“The study, published in the April 26 issue of Science, shows that an enzyme called phospholipase D (PLD), previously known as Yersinia murine toxin, allows Y pestis to survive in the midgut of the rat flea. By acquiring the gene that encodes PLD, "the bacterium gradually changed from a germ that causes a mild human stomach illness acquired via contaminated food or water to the flea-borne agent of the 'Black Death,' which in the 14th century killed one-fourth of Europe's population," the NIH said in a news release.”

Hinnebusch BJ, Rudolph AE, Cherepanov P, et al. Role of Yersinia murine toxin in survival of Yersinia pestis in the midgut of the flea vector. Science 2002;296(5568):733-5

Phylogeny: Strategies
• Discrete character approaches
  • Parsimonious criterion
  • Model likelihood criterion
  • Hypothesis likelihood criterion
• Distance-based clustering
  • Least-square
  • Neighbor-Joining / UPGMA (implicit topology)
  • Minimum Evolution

UPGMA’s shortcoming: The “molecular clock” assumption can be rejected in most cases. In this example, unequally evolving sequences cluster according to their rate of evolution rather than according to the history of the genes.
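The clustering artifact can be reproduced on a toy case. The sketch below is a minimal UPGMA, assuming a hypothetical distance matrix that is exactly additive on the tree AB|CD (leaf branches A=1, B=5, C=1, D=6, internal branch 1), so that A and C evolve slowly while B and D evolve quickly. The function name `upgma` and the taxon labels are illustrative, not taken from the slides.

```python
from itertools import combinations

def upgma(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    """
    clusters = {name: 1 for name in names}   # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        u = (i, j)
        ni, nj = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            d[frozenset((u, k))] = (ni * d[frozenset((i, k))]
                                    + nj * d[frozenset((j, k))]) / (ni + nj)
        clusters[u] = ni + nj
    return next(iter(clusters))

# Hypothetical additive distances on the true tree AB|CD (unequal rates).
pairs = {("A", "B"): 6, ("A", "C"): 3, ("A", "D"): 8,
         ("B", "C"): 7, ("B", "D"): 12, ("C", "D"): 7}
dist = {frozenset(p): v for p, v in pairs.items()}

upgma_tree = upgma(["A", "B", "C", "D"], dist)
print(upgma_tree)
```

With these distances the two slowly evolving taxa, A and C, are joined first, even though the distances were generated from a tree in which A is the sister of B.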

Neighbor-Joining algorithm: Guaranteed to recover the true tree if the distance matrix is an exact reflection of the tree. How realistic is it to assume that these distances behave as such?
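A compact sketch of neighbor-joining (topology only, no branch lengths) illustrates the guarantee on an exactly additive matrix. The matrix is hypothetical: it is additive on the tree AB|CD with leaf branches A=1, B=5, C=1, D=6 and internal branch 1. The function name and data layout are assumptions made for this example.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return the unrooted NJ topology as a tuple of three subtrees.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    """
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) for x in nodes}
        # Pick the pair minimising Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        u = (i, j)
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (d[frozenset((i, k))]
                                              + d[frozenset((j, k))]
                                              - d[frozenset((i, j))])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    return tuple(nodes)

# Hypothetical additive distances on the true tree AB|CD (unequal rates).
pairs = {("A", "B"): 6, ("A", "C"): 3, ("A", "D"): 8,
         ("B", "C"): 7, ("B", "D"): 12, ("C", "D"): 7}
dist = {frozenset(p): v for p, v in pairs.items()}

nj_tree = neighbor_joining(["A", "B", "C", "D"], dist)
print(nj_tree)
```

Here A is grouped with B despite the unequal rates, because the Q criterion corrects each pairwise distance by how far both members are from everything else.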

Distance metrics between sequences: The triangle inequality is rarely respected, especially if any of D(A,B), D(B,C) are large. The reason: saturation.

What is saturation?
Time: 1 2 3 4 5 6 7
A:    A ----------- P
B:    A F A H K H P
In both cases, if only time steps 1 and 7 are known, the most likely distance will be the same.

Saturation is theoretically expected
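One way to see why saturation is expected is to plot the observed fraction of differing sites against the true distance. The sketch below assumes the Jukes-Cantor model, under which the expected observed difference for a true distance d (in substitutions per site) is p = 3/4 (1 - exp(-4d/3)). The function name is illustrative.

```python
import math

def expected_p_distance(d):
    """Expected fraction of differing sites under Jukes-Cantor
    for a true distance d (expected substitutions per site)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"true distance {d:4.1f} -> expected observed difference "
          f"{expected_p_distance(d):.3f}")
```

The observed difference flattens toward 0.75, so large true distances become nearly indistinguishable from one another.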

Maximum likelihood distances: The following describes the evaluation of distances using the maximum likelihood criterion. This is the best method to evaluate distances between biological sequences.

For nucleotides, there is a limited number of substitutions: Jukes-Cantor model. [Figure: substitutions among A, G, C and T.] Matrix with 1 expected substitution per 100 sites.

For nucleotides, there is a limited number of substitutions: Jukes-Cantor model. Given two (short) sequences:
C C A T
C C G T
P1 =
The likelihood that these two sequences are related is then:
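As a rough illustration of this calculation, the sketch below evaluates the likelihood of the two sequences at a fixed distance. It assumes the closed-form Jukes-Cantor probabilities, equal base frequencies of 1/4, and a per-site term of the form pi(x) * P(x -> y); the slide's own P1 matrix and formula were not recoverable, so these choices are assumptions.

```python
import math

def jc_prob(a, b, d):
    """Jukes-Cantor probability of observing b given a, at distance d
    (expected substitutions per site)."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def pair_likelihood(s1, s2, d):
    """Likelihood that two aligned sequences are separated by distance d,
    treating sites as independent (assumed form: product of pi * P)."""
    like = 1.0
    for a, b in zip(s1, s2):
        like *= 0.25 * jc_prob(a, b, d)
    return like

# d = 0.01 corresponds to about 1 expected substitution per 100 sites.
print(jc_prob("A", "A", 0.01), jc_prob("A", "G", 0.01))
print(pair_likelihood("CCAT", "CCGT", 0.01))
```

Three sites match and one differs, so the likelihood is the product of three "same" terms and one "different" term, each weighted by the base frequency.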

For nucleotides, there is a limited number of substitutions: Jukes-Cantor model. Given two (short) sequences:
C C A T
C C G T
P1 =
What if the distance implied by P1 is not realistic/representative?

Extrapolation of probability matrices: As we have seen for the PAM matrix a few weeks back, we can obtain a pij for any multiple of PAM1 by matrix multiplication.
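The extrapolation can be checked numerically. This sketch assumes a Jukes-Cantor matrix at distance 0.01 stands in for the one-unit matrix, multiplies it by itself 100 times, and compares the result with the closed-form matrix at distance 1.0. Helper names are illustrative.

```python
import math

def jc_matrix(d):
    """4x4 Jukes-Cantor probability matrix at distance d."""
    e = math.exp(-4.0 * d / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(P, n):
    """Multiply P by itself n times, starting from the identity."""
    result = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        result = matmul(result, P)
    return result

P1 = jc_matrix(0.01)             # one "unit" of distance
P100 = matrix_power(P1, 100)     # extrapolated to 100 units
print(P100[0][0], jc_matrix(1.0)[0][0])
```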

Extrapolation of probability matrices: There will then be a different probability associated with each possible distance.

Extrapolation of probability matrices: For branch length l over k sites, the probability is a function of the distance between two sequences. There is thus a value of the distance (l) that maximizes the probability of observing two related sequences. In other words, there is a value of t that maximizes the likelihood that two sequences are related.
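The maximizing distance can be found numerically. The sketch below assumes the Jukes-Cantor model with equal base frequencies and uses a simple ternary search on the log-likelihood, which has a single peak for this model when fewer than three quarters of the sites differ. It then compares the result with the known closed-form Jukes-Cantor estimate, d = -3/4 ln(1 - 4p/3), where p is the fraction of differing sites. Function names and search bounds are assumptions.

```python
import math

def log_likelihood(s1, s2, d):
    """Jukes-Cantor log-likelihood of two aligned sequences at distance d."""
    e = math.exp(-4.0 * d / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return sum(math.log(0.25 * (same if a == b else diff))
               for a, b in zip(s1, s2))

def ml_distance(s1, s2, lo=1e-6, hi=10.0, iterations=200):
    """Ternary search for the distance that maximises the log-likelihood."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(s1, s2, m1) < log_likelihood(s1, s2, m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

s1, s2 = "CCAT", "CCGT"
p = sum(a != b for a, b in zip(s1, s2)) / len(s1)   # observed difference
numeric = ml_distance(s1, s2)
closed_form = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(numeric, closed_form)
```

For richer models there is no closed form, and the numerical search (or a better optimizer) is the only option.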

Arbitrary P matrices from Q: Q is the log(P) matrix for an arbitrary unit of distance.
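If Q is the matrix logarithm of P for one unit of distance, then P for any distance t is the matrix exponential of Q times t. The sketch below assumes a Jukes-Cantor rate matrix scaled to one expected substitution per unit of distance and computes the exponential with a truncated Taylor series, which is adequate for small matrices and moderate t. The helper names and the number of terms are assumptions.

```python
import math

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(Q, t, terms=40):
    """exp(Q t) via the Taylor series sum_k (Q t)^k / k!."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, Qt)]
        result = [[r + x for r, x in zip(res_row, term_row)]
                  for res_row, term_row in zip(result, term)]
    return result

# Jukes-Cantor rate matrix: off-diagonal 1/3, diagonal -1 (rows sum to 0).
Q_JC = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

P = mat_exp(Q_JC, 0.5)
print(P[0][0], 0.25 + 0.75 * math.exp(-4.0 * 0.5 / 3.0))
```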

• Vector of frequencies for each character (can be estimated from the input dataset)
• A matrix of relative rates of substitution (from a large amount of empirical data (PAM, JTT) or optimized (WAG))
In practice, the model can be custom-built for an input dataset.

Extrapolation of probability matrices: Now, imagine that two sequences are unrelated.
• The real branch length (t) is equal to +∞.
• The BL estimate will necessarily converge to a smaller value, because some sites are identical by coincidence.

Even random sequences are going to have “matches”, although the likelihood distance should tend to large values in this case.

Even random sequences are going to have “matches”
• Saturation should be compensated for in ML distances.
• However, because of:
  • Non-homogeneous frequencies
  • Rate heterogeneity
  • Change in the P matrix over time
  • Non-independence of characters in a sequence
• long distances are still a bit contentious to evaluate.

Time reversibility is also assumed
Time: 1 2 3 4 5 6 7
A:    A A A P P P P
B:    A A A H K H P
Without time reversibility assumed, it would be impossible to measure a distance between two sequences without involving an undefined bifurcation.

Time reversibility is also assumed: In practice, this means that the entries in our matrices of substitution have to be symmetrical, such that the relative rate from i to j equals the relative rate from j to i. This is also practical from a bioinformatics perspective, since it cuts in half the number of parameters in the model.
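A small check makes the symmetry requirement concrete. The sketch below assumes the common construction in which each rate is a symmetric relative rate multiplied by the frequency of the target character, q(a, b) = r(a, b) * pi(b); the frequencies and relative rates are hypothetical values chosen for illustration. It verifies the reversibility condition pi(a) q(a, b) = pi(b) q(b, a).

```python
BASES = "ACGT"

# Hypothetical base frequencies and symmetric relative rates
# (6 values instead of 12, because r(a, b) = r(b, a)).
pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
rates = {frozenset("AC"): 1.0, frozenset("AG"): 4.0, frozenset("AT"): 1.0,
         frozenset("CG"): 1.0, frozenset("CT"): 4.0, frozenset("GT"): 1.0}

def build_q(pi, rates):
    """Rate matrix with q[a][b] = r(a, b) * pi[b]; each row sums to zero."""
    q = {a: {} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a != b:
                q[a][b] = rates[frozenset((a, b))] * pi[b]
        q[a][a] = -sum(q[a].values())
    return q

Q = build_q(pi, rates)

# Reversibility: the flow from a to b equals the flow from b to a.
reversible = all(abs(pi[a] * Q[a][b] - pi[b] * Q[b][a]) < 1e-12
                 for a in BASES for b in BASES)
print("time reversible:", reversible)
print("free relative rates:", len(rates), "instead of", 4 * 3)
```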

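Another distance-based method that intuitively makes sense: the least-squares method. Minimize the sum over all pairs (i, j) of w_ij (D_ij - d_ij)^2, where w_ij is a weight, D_ij is the D matrix entry, and d_ij is the sum of all t along the path from i to j.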

One last distance-based method that we would intuitively use. Once abstracted: we are looking for an acyclic, binary graph with n terminal vertices that conforms best to a set of n² constraints. [Figure: tree over the terminal nodes i, o, f, j, x with branch lengths t1, t2, t3, t4, t5, t12, t45 and t345.] There is a danger of time traveling, with some tk < 0.

One last distance-based method that we would intuitively use. Once abstracted: although there are n terminal nodes, there will be 2n-1 total nodes in the tree/graph (rooted tree). [Figure: the same tree over i, o, f, j, x, shown in several steps for the pairs io, jx and fjx, with each branch (t1, t2, t3, t4, t5, t12, t45, t345) marked as in the path or not in the path.]

One last distance-based method that we would intuitively use. There is a straightforward solution to this linear algebra problem. [Figure: tree over i, o, f, j, x with branches t1, t2, t3, t4, t5, t12, t45 and t345, marked as in the path or not in the path for the pairs io, jx and fjx.]
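The linear algebra can be sketched as follows, under several assumptions: the tree is treated as unrooted, with topology ((i,o),f,(j,x)) and seven branches; t1 to t5 are taken to be the terminal branches of i, o, f, j, x in that order; all weights are 1; and the "true" branch lengths used to generate the distances are hypothetical. Each branch is described by the set of leaves on one side of it, and a branch lies on the path between two leaves exactly when it separates them.

```python
from itertools import combinations

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]

def least_squares_branches(taxa, splits, D):
    """Unweighted least-squares branch lengths for a fixed topology."""
    pairs = list(combinations(taxa, 2))
    # A[p][k] = 1 if branch k is in the path between the two leaves of pair p.
    A = [[1.0 if (a in side) != (b in side) else 0.0 for side in splits.values()]
         for a, b in pairs]
    d = [D[frozenset(p)] for p in pairs]
    m = len(splits)
    AtA = [[sum(row[r] * row[c] for row in A) for c in range(m)] for r in range(m)]
    Atd = [sum(row[r] * y for row, y in zip(A, d)) for r in range(m)]
    return dict(zip(splits, solve(AtA, Atd)))   # normal equations

taxa = ["i", "o", "f", "j", "x"]
splits = {"t1": {"i"}, "t2": {"o"}, "t3": {"f"}, "t4": {"j"}, "t5": {"x"},
          "t12": {"i", "o"}, "t45": {"j", "x"}}
true_t = {"t1": 0.10, "t2": 0.20, "t3": 0.30, "t4": 0.15, "t5": 0.25,
          "t12": 0.05, "t45": 0.40}   # hypothetical values

# Build an exactly additive distance matrix from the hypothetical lengths.
D = {frozenset((a, b)): sum(true_t[k] for k, side in splits.items()
                            if (a in side) != (b in side))
     for a, b in combinations(taxa, 2)}

fitted = least_squares_branches(taxa, splits, D)
print(fitted)
```

With noisy distances the same code returns the best-fitting lengths, and nothing prevents some of them from being negative.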

One last distance-based method that we would intuitively use: Minimum Evolution. Can be used as a selection criterion between least-squares tree topologies. This is done by selecting, among a collection of suitable topologies, the one that minimizes the total tree length (the sum of all tk). [Figure: tree over i, o, f, j, x with branches t1, t2, t3, t4, t5, t12, t45 and t345.]

Tree space: Unlike UPGMA and NJ, the problem with the previous method is that you have to provide a topology prior to the calculation…
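To give a sense of how large tree space is, the sketch below counts the strictly bifurcating topologies for n labelled terminal nodes, using the standard results (2n-5)!! for unrooted trees and (2n-3)!! for rooted trees. The function names are illustrative.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (or 2)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def unrooted_topologies(n):
    """Number of unrooted binary trees on n >= 3 labelled leaves."""
    return double_factorial(2 * n - 5)

def rooted_topologies(n):
    """Number of rooted binary trees on n >= 2 labelled leaves."""
    return double_factorial(2 * n - 3)

for n in (4, 5, 10, 20):
    print(n, unrooted_topologies(n), rooted_topologies(n))
```

Evaluating a criterion on every topology quickly becomes impossible, which is why methods that need a topology as input are paired with a search strategy.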

Phylogeny: Strategies
• Discrete character approaches
  • Parsimonious criterion
  • Model likelihood criterion
  • Bayesian statistics
• Distance-based clustering
  • Least-square
  • Neighbor-Joining / UPGMA (implicit topology)
  • Minimum Evolution

Phylogeny
• Discrete-character signal versus distance
• Distance: Uses the characters and a function to evaluate distance metrics. These are used to determine the length of the branches/edges between nodes/vertices. The internal nodes/edges are simply there to maximally reconcile the distance data into a binary tree.
• Character: Uses discrete characters, implicitly or explicitly, to define the state of each node.

Parsimony
• Intuitive method that can be run manually
• Assumes that everything observed in the data is connected by the most straightforward relationships.

Parsimony: Algorithm
• Postorder tree traversal: from the terminal nodes toward the “center”.
• At each node:
  • Create the intersection of the sets of observations in the immediate descendant nodes.
  • If the intersection is empty, create a set that is the union of the two descendants' sets, and add one to the count of changes recorded.
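The traversal described above (often called Fitch's algorithm) can be sketched in a few lines. The tree is written as nested pairs of leaf names, and the leaf names and the two-column alignment are hypothetical; the first column uses the characters A, C, G, C, C.

```python
def fitch(node, states):
    """Return (state set, number of changes) for one site.

    node:   a leaf name (str) or a pair (left, right) of subtrees.
    states: dict mapping leaf name -> observed character at this site.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_changes = fitch(node[0], states)
    right_set, right_changes = fitch(node[1], states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # Empty intersection: take the union and record one change.
    return left_set | right_set, left_changes + right_changes + 1

def parsimony_score(tree, alignment):
    """Total number of changes over all sites (columns)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
               for i in range(n_sites))

tree = (("s1", "s2"), ("s3", ("s4", "s5")))
alignment = {"s1": "AC", "s2": "CC", "s3": "GC", "s4": "CT", "s5": "CT"}

print(parsimony_score(tree, alignment))
```

Running the same function over several candidate topologies and keeping the smallest score gives the most parsimonious tree among them.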

Parsimony: Algorithm
• The most parsimonious tree is the topology that minimizes the number of changes needed to explain the data over all sites (columns).
• Statistics
  • Consistency
  • Retention

Parsimony: Side-effects
• The reconstruction assumes that the most parsimonious explanation is the correct one.
• It also assumes that all changes have a similar “cost”.
• Therefore, the parsimony method does not seem to be designed to deal with saturation.

Maximum likelihood criterion: Abstraction
• We have a collection of items (sequences). We know that all the instances in the collection are stochastically derived from a unique parent in the hierarchy. We also have a model for this stochastic process, represented as a Markov process.
• We are thus looking for a tree (topology + distances) that will maximize the likelihood of the data, given the Markov process.

For nucleotides, there is a limited number of substitutions: Jukes-Cantor model. Given two (short) sequences:
C C A T
C C G T
P1 =
What if the distance implied by P1 is not realistic/representative?

Extrapolation of probability matrices: There will be an optimal distance between two sequences.

There will be an optimal distance between two sequences: distance to an internal node. If the sequence at only one of the nodes is known, the other end could be any of the possible characters:

Model-based phylogeny [Figure: tree with terminal nodes A, C, G, C, C, internal nodes w, x, y, z and branch lengths t1 to t8.]
• It is possible to compute the likelihood of internal nodes by summing over all possibilities.

Model-based phylogeny [Figure: same tree.]
• The structure of the equation, once the summations are pushed as far right as possible, is the same as the structure of the tree.

Model-based phylogeny [Figure: same tree.]
• The calculation at one node thus depends on the conditional likelihood of each possible character S in the children nodes.

Model-based phylogeny [Figure: same tree.] For terminal nodes: For internal nodes: This is done for each site i. The log(L) values are stored rather than L.
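The recursion over the tree can be sketched as follows. This is a minimal version under stated assumptions: Jukes-Cantor probabilities with equal base frequencies, a terminal node given conditional likelihood 1 for its observed character and 0 otherwise, and an internal node computed as the product over its children of the sum over the child's characters. The tree encoding, leaf names, branch lengths and alignment are hypothetical and do not reproduce the slide's t1 to t8.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of b given a over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def conditional(node, site):
    """Conditional likelihood of each character at this node.

    node: a leaf name (str), or a tuple of (child, branch_length) pairs.
    site: dict mapping leaf name -> observed character.
    """
    if isinstance(node, str):                       # terminal node
        return {s: 1.0 if s == site[node] else 0.0 for s in BASES}
    result = {s: 1.0 for s in BASES}                # internal node
    for child, t in node:
        child_like = conditional(child, site)
        for s in BASES:
            result[s] *= sum(jc_prob(s, x, t) * child_like[x] for x in BASES)
    return result

def log_likelihood(tree, alignment):
    """Sum of per-site log-likelihoods (logs are stored, not raw products)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        site = {name: seq[i] for name, seq in alignment.items()}
        root = conditional(tree, site)
        total += math.log(sum(0.25 * root[s] for s in BASES))
    return total

# Hypothetical five-leaf tree and alignment.
w = (("s1", 0.1), ("s2", 0.2))
y = (("s4", 0.1), ("s5", 0.1))
z = (("s3", 0.3), (y, 0.2))
tree = ((w, 0.1), (z, 0.1))
alignment = {"s1": "AC", "s2": "CC", "s3": "GC", "s4": "CT", "s5": "CT"}

tree_log_likelihood = log_likelihood(tree, alignment)
print(tree_log_likelihood)
```

As a sanity check, a tree with only two leaves gives the same value as the pairwise likelihood at a distance equal to the sum of the two branches.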
