
Phylogenetics I


erika

Presentation Transcript


    1. Phylogenetics I

    2. Evolution

    Evolution of new organisms is driven by mutations and by selection bias. Mutations: the DNA sequence can be changed by single-base changes, deletion or insertion of DNA segments, etc.

    3. Theory of Evolution

    Basic idea: speciation events lead to the creation of different species. Speciation is caused by physical separation into groups in which different genetic variants become dominant. Any two species share a (possibly distant) common ancestor.

    4. The Tree of Life

    [Figure: primate evolution.] A phylogeny, also called a phylogenetic tree, is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species.

    6. Morphological vs. Molecular

    Classical phylogenetic analysis uses morphological features: number of legs, lengths of legs, etc. Modern biological methods allow the use of molecular features: gene sequences and protein sequences.

    Morphological topology: [Figure: tree with the groups Archonta and Ungulata (based on McKenna and Bell, 1997).]

    From sequences to a phylogenetic tree:

    Rat      QEPGGLVVPPTDA
    Rabbit   QEPGGMVVPPTDA
    Gorilla  QEPGGLVVPPTDA
    Cat      REPGGLVVPPTEG

    There are many possible types of sequences to use (e.g. mitochondrial vs. nuclear proteins). [Figures: mitochondrial topology (based on Pupko et al.); nuclear topology (tree by Madsen; based on a Pupko et al. slide).]

    11. Phylogenetic trees

    Leaves: current-day species (or taxa, the plural of taxon). Internal vertices: hypothetical common ancestors. Edge lengths: "time" from one speciation to the next.

    12. Twists in molecular phylogenies

    We have to emphasize that gene/protein sequences can be homologous for several different reasons. Orthologs: sequences diverged after a speciation event. Paralogs: sequences diverged after a duplication event. Xenologs: sequences diverged after a horizontal transfer (e.g., by a virus).

    13. Paralogs

    Consider the evolutionary tree of three taxa, and assume that at some point in the past a gene duplication event occurred. [Figure: three-taxon tree with the gene duplication marked.]

    14. Paralogs

    [Figure: gene tree with leaves 1A, 2A, 3A, 3B, 2B, 1B, with the speciation events and the gene duplication marked.] The gene evolution is described by this tree (A and B are the copies of the same gene).

    15. Paralogs

    [Figure: the same gene tree (leaves 1A, 2A, 3A, 3B, 2B, 1B) with the speciation events and the gene duplication marked.] If we happen to consider genes 1A, 2B, and 3A of species 1, 2, 3, we get a wrong tree that does not represent the phylogeny of the host species.

    16. Types of Trees

    A natural model to consider is that of rooted trees. [Figure: rooted tree with the common ancestor at the root.]

    17. Types of trees

    An unrooted tree represents the same phylogeny without the root node. Depending on the model, data from current-day species does not distinguish between different placements of the root.

    Rooted versus unrooted trees: [Figure: rooted trees a, b and c, and one unrooted tree.] The unrooted tree represents the three rooted trees.

    19. Total numbers of trees

    For n taxa: Rooted bifurcating trees: $(2n-3)!! = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$. Unrooted bifurcating trees: $(2n-5)!!$. [Figure: tree shapes.]
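
    To get a feel for how quickly these counts grow, here is a minimal sketch of the two formulas. The function names are illustrative and not from the slides; the counts assume labeled leaves and strictly bifurcating trees.

```python
from math import factorial


def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ... ; defined as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_rooted_trees(n: int) -> int:
    """Rooted bifurcating trees on n >= 2 labeled leaves: (2n-3)!!."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n - 3)


def count_unrooted_trees(n: int) -> int:
    """Unrooted bifurcating trees on n >= 3 labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n - 5)


def count_rooted_closed_form(n: int) -> int:
    """Same count via the closed form (2n-3)! / (2^(n-2) * (n-2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, count_rooted_trees(n), count_unrooted_trees(n))
```

    Note that the number of unrooted trees on n leaves equals the number of rooted trees on n-1 leaves, which is why exhaustive search over topologies is only feasible for a handful of taxa.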

    20. Positioning Roots in Unrooted Trees

    We can estimate the position of the root by introducing an outgroup: a set of species that are definitely distant from all the species of interest. [Figure: tree over Aardvark, Bison, Chimp, Dog, Elephant and Falcon, with the proposed root marked.]

    21. Type of Data

    Distance-based: the input is a matrix of distances between species. A distance can be the fraction of residues on which two sequences disagree, or the alignment score between them, or ... Character-based: examine each character (e.g., residue) separately.

    22. Two methods of tree construction

    Distance: a weighted tree that realizes the distances between the objects. Parsimony: a tree with the minimum total number of character changes between nodes. We start with distance-based methods, considering the following question: given a set of species (leaves in a supposed tree) and distances between them, construct a phylogeny which best “fits” the distances.

    23. Distance Matrix

    Given n species, we can compute the n x n distance matrix $D_{ij}$. $D_{ij}$ may be defined as the edit distance between a gene in species i and the same gene in species j, where the gene of interest is sequenced for all n species.
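
    As a minimal sketch, the following builds a distance matrix using the simplest measure mentioned earlier, the fraction of residues on which two sequences disagree, rather than a full edit distance. It uses the four protein fragments from the "From sequences to a phylogenetic tree" slide and assumes they are already aligned and of equal length. The names `p_distance` and `distance_matrix` are illustrative.

```python
def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def distance_matrix(seqs: dict[str, str]) -> tuple[list[str], list[list[float]]]:
    """Return the species names and the symmetric matrix of pairwise distances."""
    names = list(seqs)
    matrix = [[p_distance(seqs[s], seqs[t]) for t in names] for s in names]
    return names, matrix


# Aligned fragments copied from the slide.
SEQS = {
    "Rat": "QEPGGLVVPPTDA",
    "Rabbit": "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat": "REPGGLVVPPTEG",
}

if __name__ == "__main__":
    names, D = distance_matrix(SEQS)
    for name, row in zip(names, D):
        print(f"{name:8s}", " ".join(f"{d:.3f}" for d in row))
```

    On such a short fragment Rat and Gorilla come out identical, which illustrates why longer sequences (or several genes) are needed before the distances carry a usable signal.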

    24. The distance between two sequences

    Protein sequences: PAM, BLOSUM. DNA sequences: Jukes-Cantor, Kimura 2-Parameter, HKY.

    25. General Stationary Time-reversible Model

    R = [rate matrix]. Time reversibility: $\pi_i r_{ij} = \pi_j r_{ji}$. (Diagonal elements are set so that each row sums to zero.)

    26. General Stationary Time-reversible Model

    $P(t) = e^{Rt}$. Given rates, one can find transition probabilities, and vice versa.

    27. Jukes-Cantor

    R = [rate matrix]

    28. Jukes-Cantor

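    P(no mutation) = $e^{-\frac{4}{3}ut}$. P(at least one mutation) = $1 - e^{-\frac{4}{3}ut}$. $D_s = \frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$. $D \approx ut = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D_s\right)$.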
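
    A small sketch of the Jukes-Cantor correction, assuming the observed fraction of differing sites is below its saturation value of 3/4 (at or above that value the logarithm is undefined). The function names are illustrative.

```python
import math


def jc_expected_difference(d: float) -> float:
    """Expected fraction of differing sites after d = ut substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc_distance(p: float) -> float:
    """Jukes-Cantor distance from the observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("observed fraction must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.3, 0.6):
        print(f"observed {p:.2f} -> corrected {jc_distance(p):.4f}")
```

    The corrected distance is always at least as large as the observed fraction, because it accounts for multiple substitutions at the same site; the gap widens as the sequences diverge.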

    29. Kimura 2-Parameter

    R = [rate matrix over A, C, G, T]. a/b = transition/transversion bias; R = a/(2b), as implied by a = R/(R+1) and b = 0.5/(R+1); a + 2b = 1 per unit time.

    30. Kimura 2-Parameter

    $a = \frac{R}{R+1}$, $b = \frac{0.5}{R+1}$

    31. HKY (Hasegawa, Kishino, Yano)

    R = [rate matrix]. k = transversion / transition. Some rules of thumb: Use simpler models with shorter sequences (< 200 bp). Otherwise, use a model as complex as necessary. Compare results from more than one method.

    32. Distances in Trees

    Edges may have weights reflecting the number of mutations on the evolutionary path from one species to another, or a time estimate for the evolution of one species into another. In a tree T, we often compute $d_{ij}(T)$, the length of the path between leaves i and j.
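
    To make $d_{ij}(T)$ concrete, here is a minimal sketch that sums edge weights along the unique path between two vertices. The tree below (leaves i, j, k, l and internal vertices u, v) and its edge weights are hypothetical and chosen only for illustration.

```python
def build_tree(edges):
    """Build an undirected adjacency map {vertex: {neighbor: weight}}."""
    tree = {}
    for a, b, w in edges:
        tree.setdefault(a, {})[b] = w
        tree.setdefault(b, {})[a] = w
    return tree


def path_length(tree, start, goal):
    """Sum of edge weights on the unique path from start to goal in a tree."""
    stack = [(start, None, 0.0)]  # (vertex, vertex we came from, distance so far)
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in tree[node].items():
            if nxt != parent:  # never walk back along the edge just used
                stack.append((nxt, node, dist + w))
    raise ValueError("goal is not reachable from start")


# Hypothetical weighted tree: i and j hang off u, k and l hang off v.
EDGES = [("i", "u", 11), ("j", "u", 2), ("u", "v", 4), ("k", "v", 6), ("l", "v", 7)]
TREE = build_tree(EDGES)

if __name__ == "__main__":
    for a, b in [("i", "j"), ("j", "k"), ("i", "l")]:
        print(a, b, path_length(TREE, a, b))
```

    Because a tree has exactly one path between any two vertices, the depth-first walk needs no visited set beyond remembering the previous vertex.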

    33. Distance in Trees: an Example

    $d_{1,4} = 12 + 13 + 14 + 17 + 12 = 68$

    34. Fitting Distance Matrix

    Given n species, we can compute the n x n distance matrix $D_{ij}$. The evolution of these genes is described by a tree that we don’t know. We need an algorithm to construct a tree that best fits the distance matrix $D_{ij}$.

    35. Reconstructing a 3 Leaved Tree

    Tree reconstruction for any 3x3 matrix is straightforward. We have 3 leaves i, j, k and a center vertex c. Observe: $d_{ic} + d_{jc} = D_{ij}$, $d_{ic} + d_{kc} = D_{ik}$, $d_{jc} + d_{kc} = D_{jk}$.
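
    Solving these three equations gives $d_{ic} = (D_{ij} + D_{ik} - D_{jk})/2$, and symmetrically for the other two edges. A minimal sketch, with an illustrative function name and made-up input distances:

```python
def three_leaf_tree(d_ij: float, d_ik: float, d_jk: float) -> tuple[float, float, float]:
    """Return the edge lengths (d_ic, d_jc, d_kc) of the star tree on leaves i, j, k."""
    d_ic = (d_ij + d_ik - d_jk) / 2
    d_jc = (d_ij + d_jk - d_ik) / 2
    d_kc = (d_ik + d_jk - d_ij) / 2
    return d_ic, d_jc, d_kc


if __name__ == "__main__":
    # Made-up distances for illustration.
    d_ic, d_jc, d_kc = three_leaf_tree(13, 21, 12)
    print(d_ic, d_jc, d_kc)
    # Each pairwise distance is recovered as the sum of two edges.
    print(d_ic + d_jc, d_ic + d_kc, d_jc + d_kc)
```

    The edge lengths are non-negative exactly when the three distances satisfy the triangle inequality.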

    36. Reconstructing a 3 Leaved Tree

    37. Trees with > 3 Leaves

    An unrooted binary tree with n leaves has 2n-3 edges. This means fitting a given tree to a distance matrix D requires solving a system of “n choose 2” equations with 2n-3 variables. This is not always possible for n > 3.

    38. Additive Distance Matrices

    Matrix D is ADDITIVE if there exists a tree T with $d_{ij}(T) = D_{ij}$ for all i, j, and NON-ADDITIVE otherwise.

    39. Distance Based Phylogeny Problem

    Goal: reconstruct an evolutionary tree from a distance matrix. Input: n x n distance matrix $D_{ij}$. Output: weighted tree T with n leaves fitting D. If D is additive, this problem has a solution and there is a simple algorithm to solve it.

    40. Using Neighboring Leaves to Construct the Tree

    Find neighboring leaves i and j with parent k. Remove the rows and columns of i and j. Add a new row and column corresponding to k, where the distance from k to any other leaf m is computed as $D_{km} = (D_{im} + D_{jm} - D_{ij})/2$. Compress i and j into k, and iterate the algorithm for the rest of the tree.
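
    One compression step can be sketched as follows. The matrix is stored as a list of lists, the function name `merge_neighbors` is illustrative, and the example matrix is a made-up additive matrix in which leaves i and j are known to be neighbors. How to find such a pair is a separate question.

```python
def merge_neighbors(names, D, i, j, parent):
    """Replace neighboring leaves i and j (given as indices) by their parent.

    Returns the new name list and distance matrix; the parent becomes the
    last row and column, using D_km = (D_im + D_jm - D_ij) / 2.
    """
    keep = [m for m in range(len(names)) if m not in (i, j)]
    new_names = [names[m] for m in keep] + [parent]
    # Copy the sub-matrix of the leaves that are left untouched.
    new_D = [[D[a][b] for b in keep] for a in keep]
    # Distance from the new parent to every remaining leaf m.
    to_parent = [(D[i][m] + D[j][m] - D[i][j]) / 2 for m in keep]
    for row, d in zip(new_D, to_parent):
        row.append(d)
    new_D.append(to_parent + [0.0])
    return new_names, new_D


# Made-up additive matrix over leaves i, j, k, l (i and j are neighbors).
NAMES = ["i", "j", "k", "l"]
D = [
    [0, 13, 21, 22],
    [13, 0, 12, 13],
    [21, 12, 0, 13],
    [22, 13, 13, 0],
]

if __name__ == "__main__":
    names, smaller = merge_neighbors(NAMES, D, 0, 1, "u")
    print(names)
    for row in smaller:
        print(row)
```

    Each step shrinks the matrix by one row and column, so n-3 steps reduce any input to the three-leaf case.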

    41. Finding Neighboring Leaves

    To find neighboring leaves, can we simply select a pair of closest leaves?

    42. Finding Neighboring Leaves

    Finding neighboring leaves by simply selecting a pair of closest leaves is WRONG.

    43. Finding Neighboring Leaves

    Closest leaves aren’t necessarily neighbors: i and j are neighbors, but ($d_{ij} = 13$) > ($d_{jk} = 12$). Finding a pair of neighboring leaves is a nontrivial problem!

    44. Neighbor Joining Algorithm

    In 1987 Naruya Saitou and Masatoshi Nei developed the neighbor joining algorithm for phylogenetic tree reconstruction. It finds a pair of leaves that are close to each other but far from other leaves, and so implicitly finds a pair of neighboring leaves. Advantages: it works well for additive matrices and also for non-additive ones, and it does not rely on the flawed molecular clock assumption.

    45. Constructing additive trees: The neighbor joining algorithm

    Let i, j be neighboring leaves in a tree, let k be their parent, and let m be any other vertex. The formula $D_{km} = (D_{im} + D_{jm} - D_{ij})/2$ shows that we can compute the distances of k to all other leaves. This suggests the following method to construct a tree from a distance matrix: find neighboring leaves i, j in the tree; replace i, j by their parent k and recursively construct a tree T for the smaller set; add i, j as children of k in T.

    46. Neighbor Finding

    How can we find from distances alone a pair of nodes which are neighboring leaves? Closest nodes aren’t necessarily neighboring leaves. Next we show one way to find neighbors from distances.

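    47. Neighbor Finding: Saitou & Nei algorithm

    Theorem (Saitou & Nei): Assume all edge weights are positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree. (Here D(i,j) denotes the neighbor joining criterion, not the raw distance between i and j; the slide's definitions are missing from this transcript.)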
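
    The slide's own definitions are not available here, so the sketch below assumes the usual textbook form of the Saitou and Nei criterion, $Q(i,j) = (n-2)\,d(i,j) - \sum_m d(i,m) - \sum_m d(j,m)$, and picks the pair that minimizes it. The function names and the example matrix are illustrative; the matrix is a made-up additive one in which the closest pair of leaves is not a pair of neighbors.

```python
from itertools import combinations


def nj_criterion(D):
    """Return {(i, j): Q(i, j)} for all pairs i < j (assumed textbook form of Q)."""
    n = len(D)
    row_sums = [sum(row) for row in D]
    return {
        (i, j): (n - 2) * D[i][j] - row_sums[i] - row_sums[j]
        for i, j in combinations(range(n), 2)
    }


def pick_neighbors(D):
    """Pair of leaf indices with minimal Q; ties go to the first pair found."""
    Q = nj_criterion(D)
    return min(Q, key=Q.get)


def closest_pair(D):
    """Pair of leaf indices with the smallest raw distance (the naive choice)."""
    return min(combinations(range(len(D)), 2), key=lambda p: D[p[0]][p[1]])


# Leaves 0..3 stand for i, j, k, l; the true neighbor pairs are (0, 1) and (2, 3).
D_EXAMPLE = [
    [0, 13, 21, 22],
    [13, 0, 12, 13],
    [21, 12, 0, 13],
    [22, 13, 13, 0],
]

if __name__ == "__main__":
    print("closest pair:", closest_pair(D_EXAMPLE))
    print("neighbor joining pick:", pick_neighbors(D_EXAMPLE))
```

    In this example the raw minimum selects leaves 1 and 2, which sit on opposite sides of the middle edge, while the criterion selects a true neighbor pair.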

    48. Complexity of Neighbor Joining Algorithm

    Naive implementation. Initialization: $\Theta(L^2)$ to compute d(r,i) and C(i,j) for all $i, j \in L$. Each iteration: $O(L^2)$ to find the maximal C(i,j), and $O(L)$ to compute $\{C(m,k) : m \in L\}$ for the new node k. Total of $O(L^3)$. [Figure: nodes m, k, r and the value C(m,k).]

    49. Complexity of Neighbor Joining Algorithm

    Using a heap to store the C(i,j)’s. Input: distance matrix D = d(i,j), and an arbitrary object r. Initialization: $\Theta(L^2)$ to compute and heapify the C(i,j)’s in a heap H. Each iteration: $O(\log L)$ to find and delete the maximal C(i,j) from H; $O(L)$ to add the values {d(k,m)} to D, for all objects m; $O(L)$ to delete {d(m,i), d(m,j)} from D (for all m); $O(L \log L)$ to delete {C(i,m), C(j,m)} from H and add C(k,m) to H, for all objects m. Total of $O(L^2 \log L)$. (Implementation details are omitted.)

    50. Neighbor Joining Algorithm

    Applicable to matrices which are not additive. Known to work well in practice. The algorithm and its variants are the most widely used distance-based algorithms today.

    51. The Four Point Condition

    Compute: 1. $D_{ij} + D_{kl}$, 2. $D_{ik} + D_{jl}$, 3. $D_{il} + D_{jk}$. [Figure: the three pairings 1, 2, 3 of the four leaves.] Sums 2 and 3 represent the same number: the length of all edges plus the middle edge (it is counted twice). Sum 1 represents a smaller number: the length of all edges minus the middle edge.

    52. The Four Point Condition: Theorem

    The four point condition for the quartet i, j, k, l is satisfied if two of these sums are the same, with the third sum smaller than these first two (or equal to them, when the middle edge has length zero). Theorem: An n x n matrix D is additive if and only if the four point condition holds for every quartet $1 \le i, j, k, l \le n$.
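
    The theorem suggests a direct additivity test, sketched below under the assumption that D is symmetric with a zero diagonal. Sorting the three sums reduces the condition to "the two largest sums are equal". The function names, the tolerance parameter and the example matrices are illustrative.

```python
from itertools import combinations


def four_point_holds(D, i, j, k, l, tol=1e-9):
    """True if, of the three pairwise sums, the two largest are equal (within tol)."""
    sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    # After sorting, sums[0] is the smallest, so only the top two need comparing.
    return abs(sums[2] - sums[1]) <= tol


def is_additive(D, tol=1e-9):
    """Check the four point condition on every quartet of distinct indices."""
    return all(
        four_point_holds(D, *quartet, tol=tol)
        for quartet in combinations(range(len(D)), 4)
    )


# Made-up additive matrix, and a copy with one distance perturbed.
ADDITIVE = [
    [0, 13, 21, 22],
    [13, 0, 12, 13],
    [21, 12, 0, 13],
    [22, 13, 13, 0],
]
PERTURBED = [row[:] for row in ADDITIVE]
PERTURBED[0][2] = PERTURBED[2][0] = 25

if __name__ == "__main__":
    print(is_additive(ADDITIVE), is_additive(PERTURBED))
```

    The check visits all "n choose 4" quartets, so it is practical only for moderate n; it answers whether a fitting tree exists but does not build one.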

    53. Least Squares Distance Phylogeny Problem

    If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best. Squared error: $\sum_{i,j} (d_{ij}(T) - D_{ij})^2$. The squared error is a measure of the quality of the fit between the distance matrix and the tree: we want to minimize it. Least Squares Distance Phylogeny Problem: finding the best approximation tree T for a non-additive matrix D (NP-hard).
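
    Evaluating the squared error for one candidate tree is easy; the hard part is the search over trees. The sketch below assumes the tree-induced distances $d_{ij}(T)$ have already been tabulated as a matrix, and it sums over unordered pairs i < j (summing over ordered pairs would simply double the value). The function name and the matrices are illustrative.

```python
def squared_error(tree_dist, D):
    """Sum over pairs i < j of (d_ij(T) - D_ij)^2."""
    n = len(D)
    if len(tree_dist) != n:
        raise ValueError("matrices must have the same size")
    return sum(
        (tree_dist[i][j] - D[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


# Made-up example: distances induced by a candidate tree vs. observed distances.
TREE_DIST = [
    [0, 13, 21, 22],
    [13, 0, 12, 13],
    [21, 12, 0, 13],
    [22, 13, 13, 0],
]
OBSERVED = [row[:] for row in TREE_DIST]
OBSERVED[0][2] = OBSERVED[2][0] = 25  # one observed distance disagrees by 4

if __name__ == "__main__":
    print(squared_error(TREE_DIST, TREE_DIST))
    print(squared_error(TREE_DIST, OBSERVED))
```

    An error of zero means the tree realizes D exactly, which is possible only when D is additive.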
