CSCE555 Bioinformatics

CSCE555 Bioinformatics. Lecture 13 Phylogenetics II Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555. HAPPY CHINESE NEW YEAR. University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu. Outline.

Presentation Transcript


  1. CSCE555 Bioinformatics Lecture 13 Phylogenetics II Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 HAPPY CHINESE NEW YEAR University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu.

  2. Outline • Review For Exams • Data for Phylogenetic Tree inference • Classification of Tree inference approaches • Neighbor-joining algorithm • Parsimony-based tree reconstruction • Least Square Best-fit reconstruction

  3. Midterm, Midterm • How to review: read slides and textbooks, especially CG book. • Format of problems: examples • Brief questions: what is the difference between global alignment and local alignment? • Calculation: build an HMM for a multiple sequence alignment • Definition: blasting, Motif, ORF

  4. Covered Topics • Understand: concepts, algorithm ideas, tools • Sequencing/blasting • Gene finding • Alignment algorithms and applications • DNA motif search • HMM profiles • Gene prediction algorithms • Promoter predictions • Comparative genomics • ……

  5. Phylogenetic Reconstruction • There are essentially two types of data for phylogenetic tree estimation: • Distance data, usually stored in a distance matrix, e.g. DNA×DNA hybridisation data, morphometric differences, immunological data, pairwise genetic distances • Character data, usually stored in a character array, e.g. multiple sequence alignment of DNA sequences, morphological characters.

     Character array (taxa as rows, characters 1-10 as columns):

     Taxon   1  2  3  4  5  6  7  8  9  10
     A       1  0  0  0  1  1  0  1  1  1
     B       0  1  0  0  1  1  1  1  1  1
     C       0  0  1  0  0  0  1  1  1  1
     D       0  0  0  1  0  0  0  0  0  1
     E       0  0  0  0  0  0  0  0  0  0

     [Figure: distance matrix over taxa A-E]

  6. Phylogenetic Reconstruction • Given the huge number of possible trees even for small data sets, we have two options: • Build one according to some clustering algorithm • Assign a “goodness of fit” criterion (an objective function) and find the tree(s) which optimise(s) this criterion

  7. Phylogenetic Reconstruction: tree building method by type of data (CS369 2007)

     Tree building method    Type of data: Distances    Type of data: Nucleotide sites
     Clustering algorithm    UPGMA, Neighbor-Joining    -
     Optimality criterion    Minimum Evolution          Maximum Parsimony, Maximum Likelihood

  8. Phylogenetic Methods • Many different procedures exist. Three of the most popular: • Neighbor-joining: minimizes distance between nearest neighbors • Maximum parsimony: minimizes total evolutionary change • Maximum likelihood: maximizes likelihood of observed data

  9. Distance based tree Construction • Given a set of species (leaves in a supposed tree), and distances between them, construct a phylogeny which best “fits” the distances.

     Orc:    ACAGTGACGCCCCAAACGT
     Elf:    ACAGTGACGCTACAAACGT
     Dwarf:  CCTGTGACGTAACAAACGA
     Hobbit: CCTGTGACGTAGCAAACGA
     Human:  CCTGTGACGTAGCAAACGA

     [Figure: tree with leaves Orc, Elf, Dwarf, Hobbit, Human]
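
As a rough illustration of how such sequences turn into a distance matrix, the sketch below counts mismatched positions (Hamming distance) between each pair. This is a simplifying assumption: the slides speak of edit distance in general, and Hamming distance only applies here because the five sequences are already aligned and of equal length. The names `hamming` and `distance_matrix` are hypothetical helpers, not part of the lecture.

```python
from itertools import combinations

# Sequences copied from the slide.
seqs = {
    "Orc":    "ACAGTGACGCCCCAAACGT",
    "Elf":    "ACAGTGACGCTACAAACGT",
    "Dwarf":  "CCTGTGACGTAACAAACGA",
    "Hobbit": "CCTGTGACGTAGCAAACGA",
    "Human":  "CCTGTGACGTAGCAAACGA",
}

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(seqs):
    """Symmetric dict-of-dicts of pairwise Hamming distances."""
    names = list(seqs)
    D = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        D[a][b] = D[b][a] = hamming(seqs[a], seqs[b])
    return D

D = distance_matrix(seqs)
for name in seqs:
    print(f"{name:7s}", [D[name][other] for other in seqs])
```

Under this measure Hobbit and Human are identical (distance 0) and Orc and Elf differ at two positions, which agrees with the grouping one would expect from reading the sequences.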

  10. Distance Matrix • Given n species, we can compute the n x n distance matrix Dij • Dij may be defined as the edit distance between a gene in species i and species j, where the gene of interest is sequenced for all n species. • Dij can also be any other feature-based distance

  11. Distances in Trees • Edges may have weights reflecting: • Number of mutations on evolutionary path from one species to another • Time estimate for evolution of one species into another • In a tree T, we often compute dij(T) - the length of a path between leaves i and j

  13. Distance in Trees: an Example • d1,4 = 12 + 13 + 14 + 17 + 12 = 68 [Figure: weighted tree with leaves i and j marked]

  14. Fitting Distance Matrix • Given n species, we can compute the n x n distance matrix Dij • Evolution of these genes is described by a tree that we don’t know. • We need an algorithm to construct a tree that best fits the distance matrix Dij

  15. Reconstructing a 3 Leaved Tree • Tree reconstruction for any 3x3 matrix is straightforward • We have 3 leaves i, j, k and a center vertex c • Observe: dic + djc = Dij; dic + dkc = Dik; djc + dkc = Djk

  16. Reconstructing a 3 Leaved Tree • Add the first two equations, dic + djc = Dij and dic + dkc = Dik: 2dic + djc + dkc = Dij + Dik • Substitute djc + dkc = Djk: 2dic + Djk = Dij + Dik • Hence dic = (Dij + Dik - Djk)/2 • Similarly, djc = (Dij + Djk - Dik)/2 and dkc = (Dki + Dkj - Dij)/2
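
The three closed-form expressions above can be checked numerically. The following is a minimal sketch; the function name `three_leaf_edges` and the input distances 5, 7, 8 are made up for illustration.

```python
def three_leaf_edges(D_ij, D_ik, D_jk):
    """Edge lengths from leaves i, j, k to the center vertex c."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc

# Hypothetical distances: D_ij = 5, D_ik = 7, D_jk = 8.
edges = three_leaf_edges(5, 7, 8)
d_ic, d_jc, d_kc = edges
print(edges)

# The reconstructed edges reproduce the three input distances.
print(d_ic + d_jc, d_ic + d_kc, d_jc + d_kc)
```

Here the edges come out as 2, 3 and 5, and summing them pairwise gives back 5, 7 and 8.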

  17. Trees with > 3 Leaves • A tree with n leaves has 2n-3 edges • This means fitting a given tree to a distance matrix D requires solving a system of “n choose 2” equations with 2n-3 variables • For n > 3 there are more equations than variables (e.g. 6 equations and 5 variables for n = 4), so the system does not always have a solution

  18. Additive Distance Matrices • Matrix D is ADDITIVE if there exists a tree T with dij(T) = Dij for all pairs of leaves i, j • D is NON-ADDITIVE otherwise
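
One way to test additivity without building a tree is the four-point condition, a standard characterization that is not stated on the slides: for every four leaves, of the three sums Dij + Dkl, Dik + Djl, Dil + Djk, the two largest must be equal. The sketch below assumes D is already symmetric with a zero diagonal and satisfies the triangle inequality; the function name and both example matrices are hypothetical.

```python
from itertools import combinations

def is_additive(D, tol=1e-9):
    """Four-point condition check on a symmetric distance matrix (list of lists)."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]])
        # The two largest of the three sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Distances read off a hypothetical tree ((0,1),(2,3)) with leaf edges
# 1, 4, 1, 4 and an internal edge of length 1.
additive = [[0, 5, 3, 6],
            [5, 0, 6, 9],
            [3, 6, 0, 5],
            [6, 9, 5, 0]]

# Same matrix with one entry perturbed (9 -> 7).
perturbed = [[0, 5, 3, 6],
             [5, 0, 6, 7],
             [3, 6, 0, 5],
             [6, 7, 5, 0]]

result_additive = is_additive(additive)
result_perturbed = is_additive(perturbed)
print(result_additive, result_perturbed)
```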

  19. Distance Based Phylogeny Problem • Goal: Reconstruct an evolutionary tree from a distance matrix • Input: n x n distance matrix Dij • Output: weighted tree T with n leaves fitting D • If D is additive, this problem has a solution and there is a simple algorithm to solve it

  20. Using Neighboring Leaves to Construct the Tree • Find neighboring leaves i and j with parent k • Remove the rows and columns of i and j • Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as: Dkm = (Dim + Djm - Dij)/2 • Compress i and j into k, iterate algorithm for rest of tree

  21. Finding Neighboring Leaves • To find neighboring leaves we simply select a pair of closest leaves.

  22. Finding Neighboring Leaves • To find neighboring leaves we simply select a pair of closest leaves. • WRONG

  23. Finding Neighboring Leaves • Closest leaves aren’t necessarily neighbors • i and j are neighbors, but (dij = 13) > (djk = 12) • Finding a pair of neighboring leaves is a nontrivial problem!

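  24. Neighbor Finding: Saitou & Nei algorithm (1987) • Definitions: D(i,j) = dij - (ri + rj) is the corrected distance used by neighbor-joining, where ri is the sum of distances from i to all leaves divided by |L| - 2 • Theorem (Saitou & Nei): Assume all edge weights are positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.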

  25. Neighbor Joining Algorithm • In 1987 Naruya Saitou and Masatoshi Nei developed a neighbor joining algorithm for phylogenetic tree reconstruction • Finds a pair of leaves that are close to each other but far from other leaves: implicitly finds a pair of neighboring leaves • Advantages: works well for additive matrices and also for non-additive matrices; it does not rely on the flawed molecular clock assumption

  26. Neighbor-joining • Guaranteed to produce the correct tree if distance is additive • May produce a good tree even when distance is not additive • Step 1: Finding neighboring leaves. Define Dij = dij - (ri + rj), where ri = (1/(|L| - 2)) ∑k dik, with the sum taken over all leaves k in L [Figure: example tree with leaves 1, 2, 3, 4 and edge lengths 0.1, 0.1, 0.1, 0.4, 0.4]
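
To see why the correction matters, the sketch below computes ri and Dij for a small made-up matrix. The distances are read off a hypothetical tree ((0,1),(2,3)) with leaf edges 1, 4, 1, 4 and an internal edge 1 (integers chosen so the arithmetic is exact); neither the tree nor the helper name `corrected_distances` comes from the lecture.

```python
from itertools import combinations

def corrected_distances(d):
    """Return (r, D) with r_i = sum_k d_ik / (|L| - 2) and D_ij = d_ij - (r_i + r_j)."""
    n = len(d)
    r = [sum(d[i]) / (n - 2) for i in range(n)]
    D = {(i, j): d[i][j] - (r[i] + r[j]) for i, j in combinations(range(n), 2)}
    return r, D

# Raw distances of the hypothetical tree; leaves 0 and 1 are neighbors,
# as are leaves 2 and 3.
d = [[0, 5, 3, 6],
     [5, 0, 6, 9],
     [3, 6, 0, 5],
     [6, 9, 5, 0]]

r, D = corrected_distances(d)

# Pair with the smallest raw distance versus smallest corrected distance.
closest_raw = min(combinations(range(4), 2), key=lambda p: d[p[0]][p[1]])
closest_corrected = min(D, key=D.get)
print("r =", r)
print("closest by raw distance:      ", closest_raw)
print("closest by corrected distance:", closest_corrected)
```

In this example the raw minimum picks leaves 0 and 2, which are not neighbors, while the corrected distance is minimal for the true neighbor pairs (0, 1) and (2, 3).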

  27. Algorithm: Neighbor-joining • Initialization: Define T to be the set of leaf nodes, one per sequence. Let L = T • Iteration: • Pick i, j s.t. Dij is minimal • Define a new node k, and set dkm = ½ (dim + djm - dij) for all m ∈ L • Add k to T, with edges of lengths dik = ½ (dij + ri - rj) and djk = dij - dik • Remove i, j from L • Add k to L • Termination: When L consists of two nodes i and j, add the remaining edge between them, of length dij
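
The iteration above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the lecture's code: the input is a dict-of-dicts of pairwise distances, internal nodes get made-up names `node0`, `node1`, ..., ties are broken by taking the first minimal pair, and the example matrix comes from a hypothetical tree ((A,B),(C,D)) with leaf edges 1, 4, 1, 4 and an internal edge 1.

```python
def neighbor_joining(d, names):
    """Return a list of (node, node, length) edges of the neighbor-joining tree."""
    d = {a: dict(row) for a, row in d.items()}  # work on a copy
    L = list(names)
    edges = []
    next_id = 0
    while len(L) > 2:
        # r_i = sum of distances from i to the other active nodes / (|L| - 2)
        r = {i: sum(d[i][m] for m in L if m != i) / (len(L) - 2) for i in L}
        # Pick the pair minimizing the corrected distance d_ij - (r_i + r_j).
        pairs = [(a, b) for x, a in enumerate(L) for b in L[x + 1:]]
        i, j = min(pairs, key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        k = f"node{next_id}"
        next_id += 1
        # Edge lengths from i and j to the new internal node k.
        d_ik = 0.5 * (d[i][j] + r[i] - r[j])
        d_jk = d[i][j] - d_ik
        edges += [(i, k, d_ik), (j, k, d_jk)]
        # Distances from k to every remaining node m.
        d[k] = {k: 0.0}
        for m in L:
            if m not in (i, j):
                d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        L = [m for m in L if m not in (i, j)] + [k]
    i, j = L
    edges.append((i, j, d[i][j]))  # termination: join the last two nodes
    return edges

d = {
    "A": {"A": 0, "B": 5, "C": 3, "D": 6},
    "B": {"A": 5, "B": 0, "C": 6, "D": 9},
    "C": {"A": 3, "B": 6, "C": 0, "D": 5},
    "D": {"A": 6, "B": 9, "C": 5, "D": 0},
}
edges = neighbor_joining(d, ["A", "B", "C", "D"])
for edge in edges:
    print(edge)
```

Because the example matrix is additive, the recovered edge lengths match the hypothetical tree exactly: A and B hang off one internal node with lengths 1 and 4, C and D off the other with lengths 1 and 4, and the internal edge has length 1.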

  28. Rooting a tree, and definition of outgroup • Neighbor-joining produces an unrooted tree • How do we root a tree between N species using neighbor-joining? • An outgroup is a species that we know to be more distantly related to all remaining species than they are to one another • Example: Human, mouse, rat, pig, dog, chicken, whale. Which one is an outgroup? • The outgroup can act as a root: the root is placed on the branch that joins the outgroup to the rest of the tree [Figure: unrooted tree with leaves 1, 2, 3, 4]

  29. Neighbor Joining Algorithm: Widely Used • Applicable to matrices which are not additive • Known to work well in practice • The algorithm and its variants are the most widely used distance-based algorithms today.

  30. Maximum Parsimony Method for Tree Inference • A character-based method • Input: h sequences (one per species), all of length k. • Goal: Find a tree with the input sequences at its leaves, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized. • Two sub-problems: • Find the parsimony cost of a given tree (easy) • Search through all tree topologies (hard)

  31. Example • Input: four nucleotide sequences: AAG, AAA, GGA, AGA taken from four species. • By the parsimony principle, we seek a tree that has a minimum total number of substitutions of symbols between species and their originator in the phylogenetic tree. • Here is one possible tree. [Figure: tree with leaves GGA, AGA, AAG, AAA, internal nodes all labeled AAA, and edge costs 2, 1, 1] Total #substitutions = 4
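
The "easy" sub-problem, scoring one fixed tree, can be sketched with Fitch's algorithm. The slides do not name a method, so Fitch's algorithm is an assumption here, as are the nested-tuple tree encoding and the two topologies below (they are illustrative and not claimed to be the tree drawn on the slide).

```python
def fitch_site(node, site):
    """Return (candidate state set, substitution count) for one alignment column."""
    if isinstance(node, str):          # leaf: the observed nucleotide
        return {node[site]}, 0
    left, right = node
    left_set, left_cost = fitch_site(left, site)
    right_set, right_cost = fitch_site(right, site)
    common = left_set & right_set
    if common:                         # children agree: no extra substitution
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_cost(tree, length):
    """Minimum number of substitutions over all sites for a fixed topology."""
    return sum(fitch_site(tree, s)[1] for s in range(length))

# Two hypothetical topologies over the four sequences from the slide.
tree_a = (("AAG", "AAA"), ("GGA", "AGA"))
tree_b = (("AAG", "GGA"), ("AAA", "AGA"))

cost_a = parsimony_cost(tree_a, 3)
cost_b = parsimony_cost(tree_b, 3)
print(cost_a, cost_b)
```

For these inputs the first grouping needs 3 substitutions and the second needs 4, so the choice of topology changes the score, which is why the search over topologies is the hard part.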

  32. Least Squares Distance Phylogeny Problem • If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best: Squared Error: ∑i,j (dij(T) - Dij)^2 • Squared Error is a measure of the quality of the fit between distance matrix and the tree: we want to minimize it. • Least Squares Distance Phylogeny Problem: finding the best approximation tree T for a non-additive matrix D (NP-hard).
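
A minimal sketch of the squared-error objective follows. It assumes the tree's path lengths dij(T) have already been computed into a matrix and sums over unordered pairs i < j (summing over ordered pairs would simply double the value); the matrices are made up for illustration.

```python
def squared_error(tree_dist, D):
    """Sum over leaf pairs i < j of (d_ij(T) - D_ij)^2."""
    n = len(D)
    return sum((tree_dist[i][j] - D[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

# Path lengths in a hypothetical candidate tree T.
tree_dist = [[0, 5, 3, 6],
             [5, 0, 6, 9],
             [3, 6, 0, 5],
             [6, 9, 5, 0]]

# Hypothetical observed, non-additive matrix: differs from T in one entry.
D = [[0, 5, 3, 6],
     [5, 0, 6, 7],
     [3, 6, 0, 5],
     [6, 7, 5, 0]]

error = squared_error(tree_dist, D)
print(error)
```

The only mismatching pair contributes (9 - 7)^2, so the error for this candidate tree is 4.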

  33. Search through tree topologies: Branch and Bound • Observation: adding an edge to an existing tree can only increase the parsimony cost (it never decreases it) • Enumerate all unrooted trees with at most N leaves: [i3][i5][i7]……[i2N-5], where each ik can take values from 0 (no edge) to k • At each point keep C = smallest cost so far for a complete tree • Start B&B with tree [1][0][0]……[0] • Whenever cost of current tree T is > C, then: • T is not optimal • Any tree with more edges containing T is not optimal, so skip them: increment by 1 the rightmost nonzero counter
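
The counters [i3][i5]…[i2N-5] reflect how fast the search space grows: a leaf can be attached to any of the k edges of the current tree, so the number of complete unrooted binary trees on N leaves is 3 · 5 · … · (2N-5), a standard counting result. The short sketch below, with a hypothetical function name, makes the growth concrete.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary tree topologies on n labeled leaves (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Leaf number 4 can attach to 3 edges, leaf 5 to 5 edges, ...,
    # leaf n to 2n - 5 edges.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

counts = {n: num_unrooted_trees(n) for n in (3, 4, 5, 10)}
for n, c in counts.items():
    print(n, c)
```

Ten leaves already give about two million topologies, which is why pruning with the bound C matters.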

  34. Comparison of Methods

  35. Summary • Category of phylogenetic inference algorithms • Neighbor-joining algorithm

  36. Acknowledgement • Anonymous authors
