530 likes | 717 Views
Combinatorial and graph-theoretic problems in evolutionary tree reconstruction. Tandy Warnow Department of Computer Sciences University of Texas at Austin. Reconstructing the “Tree” of Life. Handling large datasets: millions of species. The “Tree of Life” is not really a tree:
 
                
                E N D
Combinatorial and graph-theoretic problems in evolutionary tree reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin
Reconstructing the “Tree” of Life Handling large datasets: millions of species The “Tree of Life” is not really a tree: reticulate evolution
Evolution informs about everything in biology • Big genome sequencing projects just produce data -- so what? • Evolutionary history relates all organisms and genes, and helps us understand and predict • interactions between genes (genetic networks) • drug design • predicting functions of genes • influenza vaccine development • origins and spread of disease • origins and migrations of humans
Challenges in estimating phylogenies • Computational: almost all “good” approaches for estimating phylogenies involve solving NP-hard problems • Statistical • Data
Major methods for phylogeny reconstruction • Biology: heuristics for NP-hard optimization problems • Linguistics: an exact algorithm for an NP-hard optimization problem
Outline for the rest of the talk • NP-hard and polynomial time problems • Phylogeny reconstruction in biology: the challenge is to develop better heuristics for NP-hard problems • Phylogeny reconstruction in linguistics: the NP-hard perfect phylogeny problem, and how we solve it exactly
A polynomial-time problem • 2-colorability: Given graph G = (V,E), determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color. • Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored. • Running time: O(n+m) time, where n is the number of vertices and m is the number of edges.
A polynomial-time problem • 2-colorability: Given graph G = (V,E), determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color. • Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored. • Running time: O(n+m) time, where n is the number of vertices and m is the number of edges.
A polynomial-time problem • 2-colorability: Given graph G = (V,E), determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color. • Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored. • Running time: O(n+m) time, where n is the number of vertices and m is the number of edges.
What about this? • 3-colorability: Given graph G, determine if we assign red,blue, and green to the vertices in G so that no edge connects vertices of the same color.
What about this? • 3-colorability: Given graph G, determine if we can assign red,blue, and green to the vertices in G so that no edge connects vertices of the same color. A brute-force solution seems to require O(3n) time, where n is the number of vertices.
Some decision problems can be solved in polynomial time: • Can graph G be 2-colored? • Does graph G have a Eulerian tour? • Some decision problems seem to not be solvable in polynomial time: • Can graph G be 3-colored? • Does graph G have a Hamiltonian cycle?
What about this? • 3-colorability: Given graph G, determine if we can assign red,blue, and green to the vertices in G so that no edge connects vertices of the same color. • This problem is provably NP-hard. What does this mean?
P vs. NP, continued • The “big” question in theoretical computer science is: • Is it possible to solve an NP-hard problem in polynomial time? • If the answer is “yes”, then all NP-hard problems can be solved in polynomial time, so P=NP. This is generally not believed.
Coping with NP-hard problems Since NP-hard problems may not be solvable in polynomial time, the options are: • Solve the problem exactly (but use lots of time on some inputs) • Use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway)
General comments for NP-hard optimization problems • Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time. • You may not know when you have an optimal solution, if you use a heuristic. • Sometimes exact solutions may not be necessary, and approximate solutions may suffice. But, how good an approximation do you need?
-3 mil yrs AAGACTT AAGACTT -2 mil yrs AAGGCCT AAGGCCT AAGGCCT AAGGCCT TGGACTT TGGACTT TGGACTT TGGACTT -1 mil yrs AGGGCAT AGGGCAT AGGGCAT TAGCCCT TAGCCCT TAGCCCT AGCACTT AGCACTT AGCACTT today AGGGCAT TAGCCCA TAGACTT AGCACAA AGCGCTT AGGGCAT TAGCCCA TAGACTT AGCACAA AGCGCTT DNA Sequence Evolution
Molecular Systematics U V W X Y AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT X U Y V W
Maximum Parsimony • Given set S of sequences of the same length over the nucleotide alphabet {A,C,T,G}, find tree leaf-labelled by S with other DNA sequences (of the same length) labelling internal nodes, so as to minimize the “length” of the tree (the sum of the Hamming distances on the edges). • NP-hard! 20
Solving NP-hard problems exactly is … unlikely • Number of (unrooted) binary trees on n leaves is (2n-5)!! • If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia
Research: we try to develop better heuristics Current best techniques DCM boosted version of best techniques Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset
Summary (so far) • Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima. • The challenge here is to find better heuristics, since exact solutions are very unlikely to ever be achievable on large datasets.
Phylogenies of Languages • Languages evolve over time, just as biological species do (geographic and other separations induce changes that over time make different dialects incomprehensible -- and new languages appear) • The result can be modelled as a rooted tree • The interesting thing is that many characteristics of languages evolve without back mutation or parallel evolution -- so a “perfect phylogeny” is possible!
Historical Linguistic Data • A character is a function that maps a set of languages, L, to a set of states. • Three kinds of characters: • Phonological (sound changes) • Lexical (meanings based on a wordlist) • Morphological (grammatical features)
Cognate Classes • Two words w1 and w2 are in the same cognate class, if they evolved from the same word through sound changes. • French “champ” and Italian “champo” are both descendants of Latin “campus”; thus the two words belong to the same cognate class. • Spanish “mucho” and English “much” are not in the same cognate class.
The Ringe-Warnow Model of Language Evolution • The nodes of the tree which contain elements of the same cognate class should form a rooted connected subgraph of the true tree • The model is known as the Character Compatibility or Perfect Phylogeny.
Perfect Phylogeny • A phylogeny T for a set S of taxa is a perfect phylogeny if each state of each character occupies a subtree (no character has back-mutations or parallel evolution) 30
Perfect phylogenies, cont. • A=(0,0), B=(0,1), C=(1,3), D=(1,2) has a perfect phylogeny! • A=(0,0), B=(0,1), C=(1,0), D=(1,1) does not have a perfect phylogeny!
A perfect phylogeny • A = 0 0 • B = 0 1 • C = 1 3 • D = 1 2 A C D B
A perfect phylogeny • A = 0 0 • B = 0 1 • C = 1 3 • D = 1 2 • E = 0 3 • F = 1 3 A C E F D B
The Perfect Phylogeny Problem • Given a set S of taxa (species, languages, etc.) determine if a perfect phylogeny T exists for S. • The problem of determining whether a perfect phylogeny exists is NP-hard (McMorris et al. 1994, Steel 1991).
Triangulated Graphs • A graph is triangulated if it has no simple cycles of size four or more.
Triangulating Colored Graphs:An Example A graph that can be c-triangulated
Triangulating Colored Graphs:An Example A graph that can be c-triangulated
Triangulating Colored Graphs:An Example A graph that cannot be c-triangulated
Triangulating Colored Graphs (TCG) Triangulating Colored Graphs: given a vertex-colored graph G, determine if G can be c-triangulated.
The PP and TCG Problems • Buneman’s Theorem:A perfect phylogeny exists for a set S if and only if the associated character state intersection graph can be c-triangulated. • The PP and TCG problems are polynomially equivalent and NP-hard. 40
A no-instance of Perfect Phylogeny • A = 00 • B = 01 • C = 10 • D = 11 0 1 0 1 An input to perfect phylogeny (left) of four sequences described by two characters, and its partition intersection graph. Note that the partition intersection graph is 2-colored.
Solving the PP Problem Using Buneman’s Theorem “Yes” Instance of PP: c1 c2 c3 s1 3 2 1 s2 1 2 2 s3 1 1 3 s4 2 1 1
Solving the PP Problem Using Buneman’s Theorem “Yes” Instance of PP: c1 c2 c3 s1 3 2 1 s2 1 2 2 s3 1 1 3 s4 2 1 1
Some special cases are easy • Binary character perfect phylogeny solvable in linear time • r-state characters solvable in polynomial time for each r (combinatorial algorithm) • Two character perfect phylogeny solvable in polynomial time (produces 2-colored graph) • k-character perfect phylogeny solvable in polynomial time for each k (produces k-colored graphs -- connections to Robertson-Seymour graph minor theory) 44
Constructing trees in historical linguistics • Maximum Compatibility: given the input matrix for the set S of languages described by the set C of characters, find a tree T leaf-labelled by S on which a maximum number of the characters in C are compatible (i.e., evolve without homoplasy). • NP-hard.
The Indo-European (IE) Dataset • 24 languages • 22 phonological characters, 15 morphological characters, and 333 lexical characters. • Total number of working characters is 390 (multiple character coding, and parallel development) • A phylogenetic tree T on the IE dataset (Ringe, Taylor and Warnow) • T is compatible with all but 16 characters • Resolves most of the significant controversies in Indo-European evolution; shows however that Germanic is a problem (not treelike)
Phylogenetic Tree of the IE Dataset (Ringe, Warnow, and Taylor)
Explaining remaining incompatibilies • We modelled the remaining incompatibilities as undetected borrowing between languages. • This leads to the mathematical model of “perfect phylogenetic networks”
“Perfect Phylogenetic Network” for IENakhleh et al., Language 2005