15-853:Algorithms in the Real World



  1. 15-853: Algorithms in the Real World
     • Computational Biology IV
     • Phylogenetic Trees

  2. Phylogenetics
     • The study of genetic connections and relationships among species.
     • Classically was based on physical or morphological features (e.g. size, eye-color, hoof-type, …).
     • Now is based on DNA and protein sequencing.
     • Goal is to find the “most likely” evolutionary connection among species or individuals and possibly the time at which they diverged.
     • E.g., mitochondrial DNA has been used to trace humans back to a single female ancestor from Africa (“African Eve”).

  3. Phylogenetic Trees
     • Phylogenetic relationships are typically represented as a tree.
       [Figure: example tree with leaves dog, cat, lynx]
     • Typically leaves are current species and internal nodes represent hypothetical evolutionary ancestors.
     • Edge lengths can indicate evolutionary or genetic distance.
     • Trees can be rooted or not. In this lecture we will assume rooted binary trees.

  4. Perfect Genetic Trees
     • The “molecular clock theory” (Zuckerkandl and Pauling, 1962) assumes that there is an evolutionary “clock” that determines the rate of “accepted” mutations. The distance (weight) on edges then represents time on the clock.
     • A perfect or ultrametric tree is one in which the time from the root (a common ancestor) to all leaves (current species) is equal.
       [Figure: example ultrametric tree with weighted edges]
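
The ultrametric condition can be checked mechanically. The following is a minimal sketch, assuming a hypothetical tree representation that is not part of the lecture: a leaf is a species name (a string), and an internal node is a list of (child, edge weight) pairs. The example trees and their weights are made up for illustration.

```python
def leaf_depths(node, depth=0.0):
    """Return the root-to-leaf distance of every leaf below `node`."""
    if isinstance(node, str):              # a leaf is just a species name
        return [depth]
    depths = []
    for child, weight in node:             # internal node: list of (child, edge weight)
        depths.extend(leaf_depths(child, depth + weight))
    return depths


def is_ultrametric(tree, tol=1e-9):
    """True if every leaf is (numerically) equally far from the root."""
    depths = leaf_depths(tree)
    return max(depths) - min(depths) <= tol


# Hypothetical weights: cat and lynx split 2 clock units ago, dog 5 units ago.
clock_tree = [([("cat", 2.0), ("lynx", 2.0)], 3.0), ("dog", 5.0)]
# Same topology, but the lynx branch is too short for a perfect clock.
skewed_tree = [([("cat", 2.0), ("lynx", 1.0)], 3.0), ("dog", 5.0)]
```

Here `is_ultrametric(clock_tree)` holds because every leaf sits at distance 5 from the root, while `skewed_tree` fails the check.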

  5. Scoring/Costing a Tree
     • Three main models:
       • Parsimony
       • Distance Matrix
       • Maximum likelihood
     • Can give different results, and there are different opinions on what is best, or even whether a tree is adequate at all.
     • For all three models, the general problem is NP-hard and in the worst case can require enumerating all trees of size n (this is super-exponential in n).
     • Phylogeny Software
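
To give a feel for "super-exponential", here is a small sketch that counts rooted binary tree topologies on n labeled leaves using the standard closed form (2n-3)!! = 3 · 5 · … · (2n-3). The function name is hypothetical, and the slide itself does not state which counting convention it has in mind.

```python
def num_rooted_binary_trees(n):
    """Number of rooted binary tree topologies on n labeled leaves: (2n-3)!!."""
    if n < 1:
        raise ValueError("need at least one leaf")
    count = 1
    for factor in range(3, 2 * n - 2, 2):   # odd factors 3, 5, ..., 2n-3
        count *= factor
    return count
```

Under this count there are 3 trees on 3 leaves, 15 on 4 leaves, and already 34,459,425 on 10 leaves.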

  6. Parsimony
     • Cost = # of changes along each edge summed across all edges.
       [Figure: example tree over the sequences CT, AG, and AT with edge costs 1, 0, 1, 0; total cost = 2]
     • Need to choose:
       • Topology of the tree
       • Alignment of the sequences
       • Assignment of the internal nodes
     • Small parsimony: The topology and alignment are given
     • Large parsimony: The full problem
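
A minimal sketch of the cost definition, assuming the sequences are already aligned and every node (including internal ones) has been assigned a sequence. The node names and the arrangement of CT, AG, and AT in the tree are assumptions chosen to reproduce a total cost of 2; the original figure is not available here.

```python
def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def parsimony_cost(edges, labels):
    """Sum the number of changes along each (parent, child) edge."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical labeling: root CT, one internal node AT, leaves CT, AG, AT.
labels = {"root": "CT", "inner": "AT", "x": "CT", "y": "AG", "z": "AT"}
edges = [("root", "x"), ("root", "inner"), ("inner", "y"), ("inner", "z")]
total = parsimony_cost(edges, labels)   # edge costs 0 + 1 + 1 + 0
```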

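  7. Small Parsimony
     • Observation: each character position can be processed separately, since the costs are additive.
     • Fitch-Hartigan Algorithm:
       S = the character set
       C(v,x) = best cost for the subtree rooted at v, assuming v is assigned the character x ∈ S
       C(v) = $\min_{x \in S} C(v,x)$
       Internal nodes: $C(v,x) = \sum_{w \in \mathrm{children}(v)} \min\bigl(C(w,x),\; C(w) + 1\bigr)$
       Leaves: $C(v,x) = 0$ if v is labeled x, and $\infty$ otherwise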

  8. Dynamic programming
     • Go up the tree calculating C(v,x) and C(v).
     • Trace back down the tree assigning one of the x to each node.
     • Time: with k = |S| and m = number of characters in each sequence, O(nk) per character and O(nmk) total time.
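
The upward pass can be sketched as follows. This assumes unit cost per character change, the recurrence C(v,x) = sum over children w of min(C(w,x), C(w) + 1) with C(w) = min over x of C(w,x), and a hypothetical tree encoding in which a leaf is an aligned sequence (a string) and an internal node is a tuple of subtrees. It computes only the optimal cost; the trace-back that assigns characters to internal nodes is omitted for brevity.

```python
INF = float("inf")


def column_cost(node, pos, alphabet):
    """Return {x: C(v, x)} for character position `pos` at `node`."""
    if isinstance(node, str):                      # leaf: only its own character is free
        return {x: (0 if node[pos] == x else INF) for x in alphabet}
    total = {x: 0 for x in alphabet}
    for child in node:
        c = column_cost(child, pos, alphabet)
        best = min(c.values())                     # C(w): best cost over all characters
        for x in alphabet:
            total[x] += min(c[x], best + 1)        # keep x, or pay 1 to change
    return total


def first_leaf(node):
    """Follow first children down to a leaf (used to read the sequence length)."""
    while not isinstance(node, str):
        node = node[0]
    return node


def small_parsimony_cost(tree, alphabet="ACGT"):
    """Minimum number of changes, summed over independent character positions."""
    m = len(first_leaf(tree))
    return sum(min(column_cost(tree, pos, alphabet).values()) for pos in range(m))
```

For example, `small_parsimony_cost(("CT", ("AG", "AT")))` evaluates to 2: one change is needed in each of the two positions. Each node does O(k) work per position, which matches the O(nmk) bound.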

  9. Large Parsimony
     • Solution 1: Branch and Bound (exact solution)
     • Each node of the search tree adds a new leaf in all possible positions.
       [Figure: search tree in which a new leaf D is attached at each possible position of a tree on A, B, C]
     • Algorithm works OK if initial estimate is very good and pruning works well.
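
The branching step of such a search can be sketched like this. It only enumerates topologies by attaching each new leaf at every possible position; the bounding step (pruning a partial tree whose parsimony cost already exceeds the best known solution) is not shown. Trees are nested 2-tuples with strings as leaves, and all names are hypothetical.

```python
def insertions(tree, leaf):
    """Yield every rooted binary tree obtained by attaching `leaf` to `tree`.

    The leaf can be attached above the root of `tree` or, recursively,
    above any node inside its left or right subtree.
    """
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)


def all_trees(leaves):
    """Enumerate all rooted binary topologies by adding one leaf at a time."""
    trees = [leaves[0]]
    for leaf in leaves[1:]:
        trees = [t for old in trees for t in insertions(old, leaf)]
    return trees
```

With leaves A, B, C this produces 3 trees, and adding D to each gives 15, which is why good pruning matters so much.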

  10. Large Parsimony
     • Solution 2: Local Search: Start with a good guess and use local search to find a local optimum.
       [Figure: alternative arrangements of the leaves A, B, C, D reachable by a local move]
     • Can hill-climb, or use simulated annealing.

  11. Trees based on Evolutionary Distances
     • Assume a distance metric D_ij between sequences that models the evolutionary distance between i and j (i.e., time on an evolutionary clock).
     • Problem: Find a phylogenetic tree with edge weights that “best” matches these distances.

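  12. The Distance
     • Edit distance does not properly model evolutionary change when the distance is large and the alphabet is small.
     • Jukes-Cantor method: for a mutation rate a (the rate of change to each of the three other bases) and a single DNA location, the probability that the location shows the same base after time t is
       $P_{\text{same}}(t) = \tfrac{1}{4}\left(1 + 3e^{-4at}\right)$
       which gives the distance (expected number of mutations per location):
       $D = 3at = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}f\right)$
       where f is the fraction of locations that have mutated.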
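
A short sketch of the correction, assuming the standard Jukes-Cantor form d = -(3/4) ln(1 - 4f/3), where f is the observed fraction of differing positions between two aligned sequences. The function names are hypothetical.

```python
import math


def mismatch_fraction(s, t):
    """Fraction f of aligned positions at which the two sequences differ."""
    if len(s) != len(t) or not s:
        raise ValueError("sequences must be non-empty and aligned")
    return sum(a != b for a, b in zip(s, t)) / len(s)


def jukes_cantor_distance(f):
    """Estimated substitutions per position given observed mismatch fraction f."""
    if not 0 <= f < 0.75:
        raise ValueError("the correction is undefined for f >= 3/4")
    return -0.75 * math.log(1 - 4 * f / 3)
```

For small f the corrected distance is close to f itself, but it grows faster as f approaches 3/4, where two random DNA sequences would be indistinguishable from related ones. For instance, f = 0.3 gives a distance of about 0.383.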

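  13. The Cost
     • D(i,j) = sum of weights on the path from i to j in the phylogenetic tree.
       [Figure: example tree with leaves A, B, C and edge weights 2, 1, 2, 1]
     • Cavalli-Sforza/Edwards cost metric: $\sum_{i<j} \left(D(i,j) - D_{ij}\right)^2$
     • Fitch/Margoliash cost metric: $\sum_{i<j} \left(D(i,j) - D_{ij}\right)^2 / D_{ij}^2$
     • Need to determine both the tree and the edge weights.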
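
Both cost metrics can be evaluated for a candidate tree as sketched below. This assumes the usual definitions: the Cavalli-Sforza/Edwards cost as the unweighted sum of squared errors between tree path lengths and measured distances, and the Fitch/Margoliash cost as the same sum with each term divided by the squared measured distance. The tree encoding (a symmetric adjacency dictionary), the node names, and the numbers are hypothetical.

```python
def tree_distance(adj, src, dst, parent=None):
    """Sum of edge weights on the unique path src -> dst (None if no path)."""
    if src == dst:
        return 0.0
    for nxt, weight in adj[src].items():
        if nxt != parent:                          # never walk back up the same edge
            rest = tree_distance(adj, nxt, dst, src)
            if rest is not None:
                return weight + rest
    return None


def tree_costs(adj, measured):
    """Return (least-squares cost, distance-weighted least-squares cost)."""
    plain = weighted = 0.0
    for (i, j), d_ij in measured.items():
        err = tree_distance(adj, i, j) - d_ij
        plain += err ** 2
        weighted += (err / d_ij) ** 2
    return plain, weighted


# Hypothetical tree: root r, internal node u, leaves A, B, C.
adj = {
    "r": {"C": 2, "u": 1},
    "u": {"r": 1, "A": 1, "B": 1},
    "A": {"u": 1}, "B": {"u": 1}, "C": {"r": 2},
}
measured = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 5}
plain, weighted = tree_costs(adj, measured)
```

In this example the tree reproduces the A-B and A-C distances exactly and misses B-C by 1, so the unweighted cost is 1 and the weighted cost is (1/5)² = 0.04.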

  14. Finding the optimal
     • In general the problem is NP-hard.
     • If there is a solution with zero cost, the matrix defines an “additive metric space”.
     • In this case there is an O(n²) algorithm for the problem.
     • Otherwise heuristics based on clustering are used.
       • e.g. UPGMA (Unweighted Pair Group Method with Arithmetic-mean)
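
Whether a distance matrix is additive can be tested with the four-point condition: for every four items, the two largest of the three pairwise sums D_ij + D_kl, D_ik + D_jl, D_il + D_jk must be equal. The sketch below is a brute-force O(n⁴) check, not the O(n²) tree-construction algorithm mentioned above, and it assumes D is already a symmetric metric given as a list of lists.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Four-point condition on a symmetric distance matrix D (list of lists)."""
    n = len(D)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if sums[2] - sums[1] > tol:                # the two largest sums must tie
            return False
    return True


# Hypothetical distances from a tree ((A,B),(C,D)) with all five edges of weight 1.
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
# Same matrix with the A-C distance inflated, which no tree can realize exactly.
perturbed = [
    [0, 2, 5, 3],
    [2, 0, 3, 3],
    [5, 3, 0, 2],
    [3, 3, 2, 0],
]
```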

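  15. UPGMA Clustering
     • Initially each sequence is its own cluster.
     • Repeat:
       • Find two clusters i and j with minimum D_ij.
       • Join them into a new cluster, and a new phylogenetic tree with D_ij/2 as the weight of the two root branches, e.g.,
         [Figure: clusters A and B joined first, then joined with C, with the root branches labeled D_{AB,C}/2]
       • For each other cluster k, $D_{(ij),k} = \frac{|C_i|\,D_{ik} + |C_j|\,D_{jk}}{|C_i| + |C_j|}$, where |C_i| is the number of sequences in cluster i.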
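
The loop above can be sketched as follows, assuming the standard UPGMA update in which the distance from a merged cluster to any other cluster is the size-weighted average of the two old distances. The sketch records the height D_ij/2 of each new root rather than branch weights, so a branch weight is the parent height minus the child height (for two single-sequence clusters that is exactly D_ij/2). The data layout and names are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a tree from pairwise distances.

    names: list of leaf labels.
    dist:  {frozenset({a, b}): distance} for every pair of leaves.
    Returns a nested tuple (left, right, height); leaves are plain labels.
    """
    clusters = {name: (name, 1) for name in names}     # id -> (subtree, size)
    d = dict(dist)
    while len(clusters) > 1:
        # Closest pair of current clusters.
        i, j = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((i, j))] / 2
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        merged = (i, j)                                # id of the new cluster
        for k in clusters:                             # size-weighted average distance
            d[frozenset((merged, k))] = (
                n_i * d[frozenset((i, k))] + n_j * d[frozenset((j, k))]
            ) / (n_i + n_j)
        clusters[merged] = ((tree_i, tree_j, height), n_i + n_j)
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical distances between three sequences.
example = {frozenset("AB"): 2.0, frozenset("AC"): 4.0, frozenset("BC"): 6.0}
result = upgma(["A", "B", "C"], example)
```

In the example, A and B are joined first at height 1; their averaged distance to C is (4 + 6)/2 = 5, so the final root sits at height 2.5.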

  16. Maximum Likelihood
     • Problem: find a tree, an internal labeling, and a set of edge weights (representing evolutionary time) such that
       $P_T = \prod_{(x,y) \in \mathrm{edges}(T)} P_{x \to y}(t_{xy})$
       is maximized.
     • The probability $P_{x \to y}(t_{xy})$ is the probability that x will mutate to y in time $t_{xy}$.
     • $P_T$ is the likelihood (probability) of the given tree.
       [Figure: tree with root a, internal node b, and leaves A, B, C; edges labeled t_ab, t_aC, t_bA, t_bB]
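
As an illustration, the likelihood of a fully labeled tree can be computed as a product over edges. The slide does not say which mutation model supplies the per-edge probabilities; this sketch assumes Jukes-Cantor transition probabilities for a single character, ignores any prior on the root label, and uses hypothetical node names, labels, and times.

```python
import math


def jc_prob(x, y, t, rate=1.0):
    """Assumed Jukes-Cantor probability that base x is observed as y after time t."""
    decay = math.exp(-4 * rate * t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def tree_likelihood(edges, labels, rate=1.0):
    """Product of per-edge mutation probabilities for a fully labeled tree.

    edges:  list of (parent, child, time) triples.
    labels: node -> single character.
    """
    likelihood = 1.0
    for parent, child, t in edges:
        likelihood *= jc_prob(labels[parent], labels[child], t, rate)
    return likelihood


# Hypothetical tree: root a with children b and C; b with children A and B.
edges = [("a", "b", 0.1), ("a", "C", 0.3), ("b", "A", 0.2), ("b", "B", 0.2)]
labels = {"a": "G", "b": "G", "A": "G", "B": "T", "C": "G"}
likelihood = tree_likelihood(edges, labels)
```

Maximizing this quantity over labelings, times, and topologies is the actual problem; the sketch only evaluates one candidate.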

  17. Maximum Likelihood
     • Finding internal labeling given times and tree is easy using dynamic programming.
     • In practice finding the times is not hard.
     • Finding the tree is hard.
