An introduction to maximum parsimony and compatibility

An introduction to maximum parsimony and compatibility. Trevor Bruen, PhD Candidate, McGill Centre for Bioinformatics. Overview. The point of this talk is to give a sense of how discrete mathematics enters into phylogenetic and genetic inference.


Presentation Transcript


  1. An introduction to maximum parsimony and compatibility Trevor Bruen PhD Candidate McGill Centre for Bioinformatics

  2. Overview • The point of this talk is to give a sense of how discrete mathematics enters into phylogenetic and genetic inference. • I will illustrate these ideas by describing two approaches in detail, namely maximum compatibility and maximum parsimony. • I will also show how ideas from these two criteria can be used to develop applications such as bounds and tests for recombination. • My goal is to give a basis for further study in this area and greater insight into these methods.

  3. Outline • Introduction to compatibility and parsimony • Overview of basic notation/concepts • Compatibility • Compatibility as a graph theory problem • Compatibility for pairs of characters • Interpretation of compatibility • Parsimony • Parsimony score with connections to graph theory • Connections between parsimony and compatibility • Homoplasy • Parsimony for pairs of characters • Connections between SPRs/TBRs and parsimony • Applications to recombination • Parsimony as a consensus method

  4. Introduction • Maximum parsimony and maximum compatibility are criteria used in phylogenetics, linguistics and population genetics • In phylogenetics the goal is to infer an evolutionary tree • In linguistics the goal is often the same • Population genetics uses compatibility to study recombination • For general phylogenetic inference with molecular data, likelihood (probability-based) methods are generally preferred. • BUT compatibility and parsimony are computationally tractable. • ALSO the mathematics behind parsimony and compatibility is very well developed. We can show that parsimony=likelihood in certain circumstances (Tuffley and Steel 1997). This gives us insight into where to go in terms of research.

  5. Formalism • A character is a mapping from a set of taxa to a set of states. • In this case, X={S1,S2,S3,S4} • Also, C={A,C} • Informally, a character is a “column” in a multiple sequence alignment

  6. Binary Character / Splits • If a character has two states then it induces a split of the taxa set. • Example: Let X be the taxa set {S1,S2,S3,S4}. Let C be the state set {A,C}. • Then {S1,S2} | {S3,S4} is the split induced by the first character. • In general a character induces a partition of the taxa set into equivalence classes, one class per state
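
The idea of a character as a mapping that induces a split can be sketched in a few lines of Python. This is only an illustration: the function name `induced_partition` and the concrete character (S1 and S2 in state A, S3 and S4 in state C) are assumptions chosen to match the split {S1,S2} | {S3,S4} mentioned on the slide.

```python
from collections import defaultdict

def induced_partition(character):
    """Return the blocks (sets of taxa sharing a state) induced by a character."""
    blocks = defaultdict(set)
    for taxon, state in character.items():
        blocks[state].add(taxon)
    return {state: frozenset(taxa) for state, taxa in blocks.items()}

# Hypothetical character on X = {S1, S2, S3, S4} with state set {A, C}.
chi = {"S1": "A", "S2": "A", "S3": "C", "S4": "C"}
split = induced_partition(chi)
```

A character with two states gives two blocks (a split); a character with more states gives one block per state.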

  7. Tree and Labeling • Informally we would like to be able to mathematically describe a tree and a labeling structure. • In graph theory a tree T=(V,E) is a connected graph with no cycles. • Informally, we would also like to be able to attach taxa (members of X) to our tree (specifically, to the leaves). • Define a labeling function (such that leaves of V(T) are labeled by members of X): $\phi : X \to V(T)$

  8. X-Trees • An X-tree consists of a pair (T, phi), where phi is a labeling function that labels the leaves of T. • Recall:

  9. Extensions • Informally, we have an X-tree consisting of the pair (T,phi). We also have a character chi. We need to relate the character to the tree. • Define an extension of a character as a function (which is consistent at the leaves with chi): $\bar{\chi} : V(T) \to C$ with $\bar{\chi}(\phi(x)) = \chi(x)$ for every taxon $x \in X$, where $\phi(x)$ is the leaf labeled by x. • Informally, an extension provides a description of how the internal vertices are labeled.

  10. Quick Summary • Summary so far: • X-trees are trees along with functions labeling the leaves with members of X • A character is a function from X into a state set C • An extension is a labeling of the vertices of T with states of C

  11. Compatibility - Definition • A character is compatible with a tree if and only if there exists an extension of the character to the tree so that the subgraphs induced by each of the states are connected. • Example: • First tree: the character is compatible with the tree • Second tree: the character is incompatible, since the two A’s are disconnected

  12. Compatibility • Problem definition: Given a sequence of characters, determine whether there exists a tree on which all characters are compatible. • Related problem: Given a sequence of characters, determine the largest set of characters that are compatible with some tree

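  13. Intersection Graph • Suppose we have a sequence of characters $\chi_1, \dots, \chi_k$ on X. • Then each character $\chi_i$ induces a partition of X, i.e. the set $\pi(\chi_i)$ of its blocks (maximal sets of taxa sharing a state). • Create a graph where the vertex set consists of the pairs $(\chi_i, A)$ with $A$ a block of $\pi(\chi_i)$. • There is an edge between two vertices $(\chi_i, A)$ and $(\chi_j, B)$ if and only if the intersection $A \cap B$ is non-empty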
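
The construction above can be sketched as follows. This is a minimal illustration, not the speaker's code: the name `intersection_graph`, the dictionary representation of characters, and the labeling of a vertex as (character index, state) are all assumptions.

```python
from itertools import combinations

def intersection_graph(characters):
    """Build the intersection graph of a list of characters.

    Each character is a dict taxon -> state. A vertex is (character index, state),
    standing for the block of taxa having that state in that character.
    """
    blocks = {}
    for i, chi in enumerate(characters):
        for taxon, state in chi.items():
            blocks.setdefault((i, state), set()).add(taxon)
    vertices = sorted(blocks)
    # Two blocks are joined when they share at least one taxon. Blocks of the
    # same character are disjoint, so they are never joined.
    edges = {(u, v) for u, v in combinations(vertices, 2) if blocks[u] & blocks[v]}
    return vertices, edges

# Two hypothetical binary characters on four taxa.
taxa = ["S1", "S2", "S3", "S4"]
chi1 = dict(zip(taxa, "AABB"))
chi2 = dict(zip(taxa, "ABAB"))
vertices, edges = intersection_graph([chi1, chi2])
```

For this pair every block of the first character meets every block of the second, so the graph is a four-cycle.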

  14. Intersection Graph • Whether a sequence of characters is compatible can be determined directly from the intersection graph. • First we need to define two concepts: a chordal graph and a restricted chordal completion of the intersection graph.

  15. Chordal Graphs • A graph G=(V,E) is a chordal graph if every cycle with at least four vertices contains a chord (an edge connecting two non-consecutive vertices of the cycle). • A chordalization of a graph G=(V,E) is a graph G’=(V,E’) where $E \subseteq E'$ such that G’ is chordal
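
Chordality can be tested with a short sketch. The talk does not say how to test it; the approach below uses the standard fact that a graph is chordal exactly when its vertices can be removed one at a time, each being simplicial (its remaining neighbours form a clique) at the moment of removal. The function name `is_chordal` is an assumption, and the code favours clarity over speed.

```python
def is_chordal(vertices, edges):
    """Return True if the undirected graph is chordal (simplicial elimination)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def is_clique(nodes):
        return all(b in adj[a] for a in nodes for b in nodes if a != b)

    remaining = set(vertices)
    while remaining:
        # Look for a vertex whose remaining neighbours are pairwise adjacent.
        simplicial = next(
            (v for v in remaining if is_clique(adj[v] & remaining)), None
        )
        if simplicial is None:
            return False  # no simplicial vertex left, so a chordless cycle exists
        remaining.remove(simplicial)
    return True

square = [(1, 2), (2, 3), (3, 4), (4, 1)]   # a four-cycle without a chord
square_with_chord = square + [(1, 3)]        # the same cycle plus one chord
```

Here `is_chordal([1, 2, 3, 4], square)` is false, while adding the chord (1, 3) gives a chordalization of the square.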

  16. Restricted Chordal Completions • Imagine the vertices of our graph G=(V,E) are colored. Then a restricted chordalization of G is a graph G’=(V,E’) with $E \subseteq E'$, where G’ is chordal and all edges of G’ connect vertices of different colors.

  17. Restricted chordal completions • A restricted chordal completion of the intersection graph is a chordalization in which there is no edge between vertices that come from the same character. • In this case, the “colors” correspond to characters

  18. Main Theorem for Compatibility • Let $\mathcal{C}$ be a collection of characters. Then $\mathcal{C}$ is compatible if and only if there is a restricted chordal completion of the intersection graph of $\mathcal{C}$.

  19. Pairs of Characters • A simple corollary of the main theorem arises when we restrict our attention to two characters. • Corollary: Two characters are compatible if and only if their intersection graph G is acyclic. • Proof: (backward direction) If G is acyclic then it is already chordal, so G is its own restricted chordal completion and the characters are compatible. (forward direction) Conversely, suppose G contains a cycle. G is bipartite, so the cycle has at least four vertices, and any chordal completion of G must contain a three-cycle. But with only two characters, two of the three vertices of a three-cycle come from the same character, so no restricted completion of G can contain a three-cycle. Hence if the characters are compatible, G is acyclic.
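
The corollary gives a direct pairwise test, sketched below under some assumptions: both characters are dictionaries over the same taxa, the name `pairwise_compatible` is hypothetical, and cycle detection uses a small union-find (a choice made here, not something taken from the talk).

```python
def pairwise_compatible(chi1, chi2):
    """Two characters are compatible iff their intersection graph has no cycle."""
    # A block of chi1 meets a block of chi2 exactly when some taxon carries
    # both states, so each distinct state pair contributes one edge.
    edges = {(("chi1", chi1[x]), ("chi2", chi2[x])) for x in chi1}

    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return False  # this edge closes a cycle
        parent[root_u] = root_v
    return True

taxa = ["S1", "S2", "S3", "S4"]
aabb = dict(zip(taxa, "AABB"))
aaab = dict(zip(taxa, "AAAB"))
abab = dict(zip(taxa, "ABAB"))
```

With these hypothetical characters, `aabb` and `aaab` are compatible (their graph is a path), while `aabb` and `abab` are not (their graph is a four-cycle). For binary characters this agrees with the familiar four-gamete condition.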

  20. Interpretation • Recall: a character is compatible with an X-tree if and only if there exists an extension of the character to the tree such that the subgraphs induced by each of the states are connected. • Informally speaking this is a very strict condition. This corresponds to an “all or nothing” condition - either a character is compatible with a tree or it isn’t. Relaxing this condition is the subject of the next section.

  21. Parsimony • Informally: given a leaf-labeled tree and a character, how can we define the fit of the character to the tree? • Consider a character $\chi$, along with an extension $\bar{\chi}$ to a leaf-labeled tree T. Then the length of the extension is the number of edges $\{u,v\}$ where $\bar{\chi}(u) \neq \bar{\chi}(v)$. • Define the parsimony score of a character on a tree as the length of a minimal extension of the character to the tree. Denote this value by $l(\chi, T)$.
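
The definitions of extension length and parsimony score can be made concrete with a brute-force sketch that tries every labeling of the internal vertices. It is only meant for tiny trees; the names `extension_length` and `parsimony_score`, the edge-list tree format, and the internal vertex names `u` and `v` are assumptions for illustration, and leaves are identified with their taxon names.

```python
from itertools import product

def extension_length(edges, labeling):
    """Number of edges whose two endpoints carry different states."""
    return sum(1 for u, v in edges if labeling[u] != labeling[v])

def parsimony_score(edges, chi):
    """Length of a minimal extension of chi, found by exhaustive search."""
    vertices = {v for edge in edges for v in edge}
    internal = sorted(vertices - set(chi))   # vertices not labeled by chi
    states = sorted(set(chi.values()))       # states seen at the leaves suffice
    return min(
        extension_length(edges, {**chi, **dict(zip(internal, assignment))})
        for assignment in product(states, repeat=len(internal))
    )

# Hypothetical four-taxon tree: S1 and S2 hang off u, S3 and S4 hang off v.
tree = [("S1", "u"), ("S2", "u"), ("u", "v"), ("S3", "v"), ("S4", "v")]
taxa = ["S1", "S2", "S3", "S4"]
score_aacc = parsimony_score(tree, dict(zip(taxa, "AACC")))
score_acac = parsimony_score(tree, dict(zip(taxa, "ACAC")))
```

On this tree the character AACC needs one change and ACAC needs two. The search is exponential in the number of internal vertices, so it is a definition check rather than a practical method.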

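  22. Parsimony • Then the maximum parsimony score for a set of characters $\chi_1, \dots, \chi_k$ on a tree T is defined as: $l(\chi_1, \dots, \chi_k; T) = \sum_{i=1}^{k} l(\chi_i, T)$, where $l(\chi_i, T)$ is the parsimony score of $\chi_i$ on T. • The tree that minimizes this score is referred to as the maximum parsimony tree.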
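
For four taxa there are only three unrooted binary trees, so the maximum parsimony tree can be found by scoring each one. The sketch below uses Fitch's algorithm to score a single character on a binary tree; Fitch's algorithm is a standard technique swapped in here, not something the talk describes. Trees are written as nested pairs (rooted on an arbitrary edge, which does not change the score), and the three example characters are hypothetical.

```python
def fitch_score(tree, chi):
    """Parsimony score of one character on a binary tree given as nested pairs."""
    def visit(node):
        if isinstance(node, str):            # leaf: its state set is its own state
            return {chi[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:                           # children can agree: no extra change
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1
    return visit(tree)[1]

def total_score(tree, characters):
    """Sum of the per-character parsimony scores on one tree."""
    return sum(fitch_score(tree, chi) for chi in characters)

taxa = ["S1", "S2", "S3", "S4"]
quartets = {
    "S1S2|S3S4": (("S1", "S2"), ("S3", "S4")),
    "S1S3|S2S4": (("S1", "S3"), ("S2", "S4")),
    "S1S4|S2S3": (("S1", "S4"), ("S2", "S3")),
}
characters = [dict(zip(taxa, s)) for s in ("AACC", "AACC", "ACAC")]
scores = {name: total_score(tree, characters) for name, tree in quartets.items()}
best_tree = min(scores, key=scores.get)
```

Two of the three characters support the split S1S2|S3S4, so that tree has the lowest total score. Exhaustive search over trees is only feasible for very small taxon sets.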

  23. Parsimony and graph theory • A minimal cut-set for a leaf-labeled tree T=(V,E) and a character $\chi$ is a minimal set of edges whose removal ensures that if $\chi(x) \neq \chi(y)$ then x and y are in different components. • Claim: There is a bijection between the set of minimal cut-sets and the set of minimal extensions. So the cardinality of a minimal cut-set is equal to the parsimony score.

  24. Parsimony and Graph Theory • Recall Menger’s Theorem (1927): Let G=(V,E) be a graph with V1 and V2 as two disjoint subsets of V. Then the minimum number of edges whose removal from G leaves vertices of V1 and V2 in different components is equal to the maximum number of edge-disjoint paths between V1 and V2. • Corollary: For a binary character, the maximum number of edge-disjoint paths between the leaves in one state and the leaves in the other state equals the parsimony score.

  25. Compatibility and parsimony • Recall: Let $\mathcal{C}$ be a collection of characters. Then $\mathcal{C}$ is compatible if and only if there is a restricted chordal completion of the intersection graph. • Question: How can we characterize parsimony with respect to an intersection graph?

  26. Compatibility Graph • Recall: Each character $\chi$ induces a partition of X, i.e. the set of maximal subsets of taxa that share a state. • A block for a character $\chi$ is a subset of taxa on which $\chi$ is constant. • Thus we may identify the blocks of $\chi$ with the vertices of the intersection graph.

  27. Character Refinement • A character $\chi'$ refines another character $\chi$ if $\chi'(x) = \chi'(y)$ implies $\chi(x) = \chi(y)$ • Thus characters that refine other characters correspond to refinements of the partition

  28. Compatibility and Parsimony • Recall: Let $\mathcal{C}$ be a collection of characters. Then $\mathcal{C}$ is compatible if and only if there is a restricted chordal completion of the intersection graph. • Main:

  29. Special Case: Two characters • Recall: Two characters are compatible if and only if the intersection graph, G for both characters is acyclic • Using the previous theorem we can show that the parsimony score for two characters corresponds to: where k is the number of components in the graph. • Note: This score corresponds to the maximum parsimony score over all trees.

  30. Homoplasy • Recall: The parsimony score of a character on a tree corresponds to the minimum number of changes of the character on the tree. • Informally: What is an intuitive way to think about the parsimony score? • Define the homoplasy of a character $\chi$ on a tree T as $h(\chi, T) = l(\chi, T) - (|\chi(X)| - 1)$, where $l(\chi, T)$ is the parsimony score and $|\chi(X)|$ is the number of states of $\chi$.

  31. Homoplasy • Note that $h(\chi, T) \geq 0$, with equality if and only if $\chi$ is convex on T • Informally: Homoplasy corresponds to the number of “extra” mutations of the character on the tree. These “extra” mutations correspond to recurrent mutations • Informally: Thus a character is not compatible with a tree iff it cannot be placed on the tree without “extra” mutations.
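
A tiny sketch of the "extra mutations" reading of homoplasy follows. It assumes the usual textbook definition (parsimony score minus the number of states, plus one) and takes the parsimony score as an input rather than computing it; the function name `homoplasy` and the example scores (for a quartet tree grouping S1 with S2 and S3 with S4) are assumptions.

```python
def homoplasy(parsimony_score, chi):
    """Extra changes beyond the minimum of (number of states - 1).

    Assumes the definition h = l - (r - 1), where l is the parsimony score
    of chi on the tree and r is the number of distinct states of chi.
    """
    number_of_states = len(set(chi.values()))
    return parsimony_score - (number_of_states - 1)

taxa = ["S1", "S2", "S3", "S4"]
convex = dict(zip(taxa, "AACC"))      # needs 1 change on the tree S1S2|S3S4
recurrent = dict(zip(taxa, "ACAC"))   # needs 2 changes on the same tree
h_convex = homoplasy(1, convex)
h_recurrent = homoplasy(2, recurrent)
```

The first character has no homoplasy on that tree, so it is compatible with it; the second needs one extra change, so it is not.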

  32. Homoplasy For Two Characters • Recall: The parsimony score for a pair of characters can be found directly from the bipartite intersection graph. • Recall: This score corresponds to an optimum over all trees. • Thus for two characters, we can define a pairwise homoplasy score as • Recall: Up to now homoplasy refers to “extra” mutations on a tree.

  33. A second look at homoplasy • Example: Two characters with a pairwise homoplasy score equal to one. • Informally: We have seen that the homoplasy corresponds to the number of “extra” mutations on a tree. • But in certain situations, this is biologically implausible. The state 1 may correspond to a mutation that has only arisen once. In this case, the fact that the pairs of characters are incompatible can be explained by a recombination event. • This will be defined more precisely later.

  34. A quick aside - tree distances. • Differences between leaf-labeled trees can be measured using various metrics - e.g. the subtree prune and regraft (SPR) distance • A “subtree prune and regraft” corresponds to a specific rearrangement of a tree. • For two leaf-labeled trees, dSPR(T1, T2) is the minimum number of SPR operations needed to transform T1 into T2

  35. Homoplasy for two characters • Theorem: If $\chi_1$ and $\chi_2$ are two characters then their pairwise homoplasy score corresponds to the minimum number of SPRs from any leaf-labeled tree on which $\chi_1$ is compatible to any leaf-labeled tree on which $\chi_2$ is compatible! • Informally: Thus we have a whole new interpretation of homoplasy.

  36. Application - Testing for Recombination • If recombination has occurred, sites will have different histories • Nearby sites will tend to have “greater” genealogical correlation than distant sites • Idea: If recombination has occurred, genealogical correlation will be partially reflected by a tendency for pairs of closely linked sites to have less homoplasy than distant sites

  37. Test for Recombination • Idea: We would like to distinguish between two possibilities - recurrent mutation and recombination. • Idea: Use the previous observations to develop a test for recombination. • H0: A single history describes all sites. • H0 ’: Nearby sites share no more compatibility than arbitrary pairs of sites • Use a statistic to capture this information and solve analytically for p-values

  38. Application: Parsimony and supertrees • Supertree: MRP - parsimony with characters that represent trees. • What does homoplasy mean in this context? Courtesy of TREE 12:315-322

  39. Parsimony as a consensus tree • Recall: If $\chi_1$ and $\chi_2$ are two characters then their pairwise homoplasy score corresponds to the minimum number of SPRs from any leaf-labeled tree on which $\chi_1$ is compatible to any leaf-labeled tree on which $\chi_2$ is compatible. • Informally: This can be generalized to show that the maximum parsimony tree for a set of characters minimizes the SPR distance to the sets of trees on which each character is compatible…

  40. Acknowledgements • Thanks for listening! • Background and further reading: • Phylogenetics, Semple and Steel (book 2003) • Some results I presented are not in this book - they are from my own work. Please talk to me if you are interested. • I have many other references - please see me if interested.
