Different Models for Phylogenetic Networks: how do they relate?

My kind of network: ARG from population genetics. Dan Gusfield, September 11, 2007, Isaac Newton Institute

Sequence Recombination [Figure: single crossover recombination of P = 10100 and S = 01011 at point 5, giving the recombinant 10101.] A recombination of P and S at recombination point 5. The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix). The order of the sites in the sequences is fixed, so a recombination is an ordered event.
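The single-crossover rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `single_crossover` and the 1-based `point` argument are assumptions made here, not part of the talk.

```python
def single_crossover(prefix_parent: str, suffix_parent: str, point: int) -> str:
    """Recombine two equal-length binary sequences at a 1-based crossover point.

    Sites 1 .. point-1 come from prefix_parent (P);
    sites point .. end come from suffix_parent (S).
    """
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have the same number of sites")
    if not 1 <= point <= len(prefix_parent):
        raise ValueError("crossover point must be a valid site index")
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]


# The example from the slide: P = 10100, S = 01011, recombination point 5.
recombinant = single_crossover("10100", "01011", 5)
print(recombinant)
```

Swapping the roles of the two parents gives a different sequence, which is why the recombination is an ordered event.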

Network with Recombination: ARG The ancestral sequence may be specified or not. At most one mutation per site. [Figure: an ARG over sites 1-5 rooted at 00000, with a single recombination node at crossover point 5 whose parents are marked P and S; the sequences shown are 10100, 10000, 01011, 01010, 00010 and 10101.] The black sequences could be derived on a tree, but when the red sequence is added, sites 4 and 5 become incompatible; one recombination suffices.
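The incompatibility of sites 4 and 5 can be checked with the standard four-gamete test: two binary sites are incompatible when all four combinations 00, 01, 10 and 11 occur among the sequences. The sketch below assumes that the red sequence is the recombinant 10101 and that the other five sequences are the black ones; the function name is hypothetical.

```python
def sites_incompatible(sequences, i, j):
    """Four-gamete test for 1-based sites i and j.

    Returns True when all four state pairs 00, 01, 10, 11 occur.
    """
    pairs = {(seq[i - 1], seq[j - 1]) for seq in sequences}
    return len(pairs) == 4


# Assumption: these are the black sequences, and 10101 is the red one.
black = ["10100", "10000", "01011", "01010", "00010"]
red = "10101"

print(sites_incompatible(black, 4, 5))          # black sequences only
print(sites_incompatible(black + [red], 4, 5))  # after adding the red sequence
```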

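A tree-like network for the same sequences generated by the prior network. [Figure: a network whose edges are labeled with sites 1-5 and whose recombination nodes have parents marked p and s. Sequences shown: a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]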

A Min ARG for Kreitman’s data [Figure: ARG created by SHRUB.]

Basic Problem: Minimizing the number of recombination events (recombination nodes) • Problem: Given a set of sequences M, find an ARG generating M, minimizing the number of recombinations used to generate M. • There is no existence problem: with enough recombinations any M can be generated on an ARG. • Remember that the linear order of the sites on the sequences is fixed.

Minimization is an NP-hard Problem There is no known efficient solution to this problem and there likely will never be one. What we do: Solve small data-sets optimally with algorithms that are not provably efficient but work well in practice; Efficiently compute lower and upper bounds on the number of needed recombinations.

Multiple crossovers [Figure: two parental sequences and a recombinant sequence formed by two crossovers (two breakpoints).] Multiple crossover recombination is equivalent to letting each site at a recombination node pick its parent from the two parents of the node.
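The per-site view of multiple crossovers can be sketched as follows. The encoding of the choice as a string over the letters "P" and "S" is an assumption made for this illustration, as are the function names; the number of crossovers is then the number of places where the chosen parent switches.

```python
def recombine_by_choice(p: str, s: str, choice: str) -> str:
    """Build a recombinant where site k is copied from p if choice[k] == 'P', else from s."""
    if not (len(p) == len(s) == len(choice)):
        raise ValueError("parents and choice string must have equal length")
    return "".join(p[k] if c == "P" else s[k] for k, c in enumerate(choice))


def breakpoints(choice: str) -> int:
    """Count the crossovers implied by a per-site parent choice."""
    return sum(1 for a, b in zip(choice, choice[1:]) if a != b)


p, s = "10100", "01011"
print(recombine_by_choice(p, s, "PPPPS"), breakpoints("PPPPS"))  # single crossover
print(recombine_by_choice(p, s, "PSSPP"), breakpoints("PSSPP"))  # two crossovers
```

A single crossover is the special case in which the choice string is a run of P followed by a run of S.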

Relation to other models? One of Daniel Huson’s models has as input a set S of m binary splits on n taxa, where the splits are presented in some arbitrary order. A phylogenetic network for S is a DAG with n leaves labeled by the taxa, where each “reticulation” node has two incoming edges, the root has zero incoming edges, and all other nodes have one incoming edge. The DAG “contains” a tree T if T is created by removing exactly one incoming edge at each reticulation node. For each input split Si, the DAG must contain a tree Ti with an edge e whose removal creates split Si. Call this the S-model. The main problem is to find a DAG for S using the minimum number of reticulation nodes.

Reducing the S-model to the ARG model? Although the presented order of the splits S is arbitrary, if we consider S as a set of n binary sequences, where the order of the sites is fixed, we can consider this as data M to an ARG building program. An ARG for M, with the all-zero root sequence, defines a required DAG for the set of splits. Each split s corresponds to a site in M. In more detail, suppose s is a particular split in S, and N is an ARG for M. We have to identify a tree T in N with an edge e whose removal creates the split s. We abuse terminology by sometimes saying that s is a set consisting of the rows (taxa) with state 1 at site s.
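As a small illustration of this reduction, the sketch below turns a list of splits into one binary sequence per taxon, with one site per split in the presented order. Each split is represented by its set of state-1 taxa, following the terminology above. The taxa, the splits and the function name are all hypothetical.

```python
def splits_to_sequences(taxa, splits):
    """Return {taxon: binary string}, with one site per split in the given order.

    Each split is given as the set of taxa that have state 1 at that site.
    """
    return {
        t: "".join("1" if t in side else "0" for side in splits)
        for t in taxa
    }


# Hypothetical input: 4 taxa and 3 splits.
taxa = ["a", "b", "c", "d"]
splits = [{"a", "b"}, {"b", "c"}, {"d"}]

M = splits_to_sequences(taxa, splits)
for taxon, seq in M.items():
    print(taxon, seq)
```

The resulting dictionary plays the role of the data M that would be handed to an ARG-building program.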

Each recombination node x in N either reaches, via some directed path in N, a leaf in set s, or it does not. If it does, then at least one parent of x must have a sequence with state 1 at site s. Choose the edge into x from that parent (choose either parent if both have a 1 at s). For a recombination node x that does not lead to a leaf in s, if one parent of x has a sequence with state 0 at site s, then choose the edge into x from that parent. Otherwise, choose the edge into x arbitrarily. The result is a tree T, and of course T contains the edge e labeled with s (i.e., where s mutates from state 0 to state 1). Tree T and edge e in T define the split s. So N can be considered as a DAG that defines S under the S-model.

Hence, the minimum number of reticulations needed under the S-model is less than or equal to the minimum number of recombinations needed in an ARG, given M, whose root sequence is the all-zero sequence. (If we use a different root sequence, we might be able to reduce the number of recombinations.)

If the ARG builder requires that recombinations only allow a single crossover, then the minimum number of recombination events in an ARG for M may be larger than the minimum number of reticulation events in a DAG for the splits. However, if multiple crossovers are allowed at any recombination event in the ARG, then the two minima are the same. To see this, we have to look at reductions in the reverse direction, from the ARG model to the S model.

A DAG considered as an ARG Given a set of sequences M, consider that data as a set of splits S where the order of sites is given by M. Suppose there is a DAG D for S with k reticulation nodes. We want to convert D to an ARG for M with k recombination nodes, where multiple crossovers are allowed at each node. Details: Given the trees in D that contain the required splits, we create a multiple crossover recombination event at any recombination node by picking, for each site i, the parent specified at that node by tree Ti.

Consequences Any property that holds for any M in the ARG model holds for any S in the S-model. Any property that holds for any M in the ARG model with multiple crossovers, but not single crossovers, still holds for any S in the S-model. Any property that holds for any S in the S-model holds for any M in the ARG model when multiple crossovers are allowed, but might not hold when only single-crossovers are allowed. Algorithmic consequences follow from these.

Daniel’s Second model Input is a set of trees ST, and the DAG must contain each of the trees. Given a DAG D for ST, D can be converted to an ARG if multiple crossovers are allowed at each recombination event. This follows from the discussion of the S-model. But unlike the S-model, even if there is an ARG for M derived from S, where S is the set of splits in the set of trees, that ARG does not necessarily define a DAG containing all the trees in ST. The problem really is not the relationship of the DAG and the ARG models, but the relationship between the two kinds of DAG problems.

Confusions over order and crossovers Allowing multiple crossovers is equivalent to saying that the sites have no fixed order. But an ARG problem in which the order of the sites can be changed, while each recombination is still a single crossover, is not equivalent to allowing multiple crossovers. For example, we have a theorem that says that if the sites have no fixed order (and so can be ordered in a particular advantageous way) then there is a fully-decomposed (don’t ask!) ARG that minimizes the number of single-crossover recombinations over any ARG. But for a while we confused the fact that the sites can be reordered with the incorrect thought that the result applied to multiple crossovers, and hence to the S-model.

Tandy’s model?

Feb. 11, 2010 Clusters? The previous slides were made before Daniel Huson started using the terminology of hard and soft clusters, so I am not positive about this, but I think a cluster is really just a split. A cluster is a set, and we can implement that as a split by giving the taxa in the cluster the state 1 and the other taxa the state 0, defining a split. Then everything else goes through as before.
