
Phylogenetic Analysis


Presentation Transcript


  1. Phylogenetic Analysis

  2. Review of Linux • ls • cd • mkdir • less • cp • mv • cat • pwd • >

  3. Perl • Variables • $DNA="A"; • @DATA=('A', 'B'); • %TABLE=(A=>'A', N=>'[AC]',); • Statements • print • length • open • close • substr • push • pop • shift • unshift

  4. #!/usr/bin/perl -w
     $word = 'MNIDDKL';
     if ($word eq 'QSTVSGE') {
         print "QSTVSGE\n";
     } elsif ($word eq 'MRQQDMISHDEL') {
         print "MRQQDMISHDEL\n";
     } elsif ($word eq 'MNIDDKL') {
         print "MNIDDKL-the magic word!\n";
     } else {
         print "Is \"$word\" a peptide?\n";
     }
     exit;

  5. $x = 10;
     $y = -20;
     if ($x <= 10) { print "1st true\n"; }
     if ($x > 10) { print "2nd true\n"; }
     if ($x <= 10 || $y > -21) { print "3rd true\n"; }
     if ($x > 5 && $y < 0) { print "4th true\n"; }
     if (($x > 5 && $y < 0) || $y > 5) { print "5th true\n"; }

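  6. # Count C and G bases with a while loop ($DNA is assumed to hold a sequence)
     $count_of_CG = 0;
     $position = 0;
     while ( $position < length $DNA ) {
         $base = substr($DNA, $position, 1);
         if ( $base eq 'C' or $base eq 'G' ) {
             ++$count_of_CG;
         }
         $position++;
     }
     # The same count written as a for loop (counter reset so the total is not doubled)
     $count_of_CG = 0;
     for ( $position = 0 ; $position < length $DNA ; ++$position ) {
         $base = substr($DNA, $position, 1);
         if ( $base eq 'C' or $base eq 'G' ) {
             ++$count_of_CG;
         }
     }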

  7. The Most Common Sequence Formats

  8. Converting Formats • Don’t re-compute your MSA if it is not in the right format • Convert your file using one of the online conversion tools • The 3 most popular reformatting utilities: • Fmtseq The most complete • READSEQ Very popular and robust • SeqCheck Can clean FASTA sequences

  9. Editing your MSA • If your MSA looks bad . . . • Don’t torture the online server • Edit the MSA yourself locally • Never, ever, ever (ever) use a standard word processor • Always use a dedicated MSA editor • The most popular online tool is Jalview • You can get it at www.jalview.org

  10. MSA => LOGO Graph • A LOGO graph summarizes an MSA • Tall letters indicate highly conserved positions • Short letters indicate poorly conserved positions • LOGO graphs are ideal for identifying conserved patterns • weblogo.berkeley.edu/

  11. Molecular Evolution and Phylogenetic Reconstruction

  12. Evolutionary Tree of Bears and Raccoons

  13. Human Evolutionary Tree (cont’d) http://www.mun.ca/biology/scarr/Out_of_Africa2.htm

  14. Human Migration Out of Africa 1. Yorubans 2. Western Pygmies 3. Eastern Pygmies 4. Hadza 5. !Kung http://www.becominghuman.org

  15. Reading Your Tree • There’s a lot of vocabulary in a tree • Nodes correspond to common ancestors • The root is the oldest ancestor • Often artificial • Only meaningful with a good outgroup • Trees can be un-rooted • Branch lengths are only meaningful when the tree is scaled • Cladograms are usually unscaled (they show branching order only) • Phenograms and phylograms are usually scaled

  16. Rooted and Unrooted Trees • In the unrooted tree the position of the root (“oldest ancestor”) is unknown. Otherwise, they are like rooted trees

  17. Type of Trees (Cladogram)

  18. Type of Trees (Phylogram)

  19. Orthology and Paralogy (direct, or vertical, homology and collateral, or parallel, homology) • Orthologous genes • Separated by speciation • Often have the same function • Paralogous genes • Separated by duplications • Can have different functions • In the graph: • A is paralogous with B • A1 is orthologous with A2

  20. Which Sequences ? • Orthologous sequences • Produce a species tree • Show how the considered species have diverged • Paralogous sequences • Produce a gene tree • Show the evolution of a protein family

  21. Building the Right MSA • Your MSA should have as few gaps as possible. Most of the time you should remove columns with gaps. • Some variability but not too much! • Some conservation but not too much!

  22. Building the Right Tree • There are three types of tree-reconstruction methods • Distance-based methods • Statistical methods • Parsimony methods • Statistical methods are the most accurate • Maximum likelihood methods • Bayesian methods • Statistical methods take more time • Limited to small datasets

  23. Distance in Trees: an Example • d1,4 = 12 + 13 + 14 + 17 + 12 = 68

  24. Compute a Distance Matrix • Evolutionary distance: number of substitutions per 100 amino acids (for proteins) or nucleotides (for DNA)
      Comparing the two end sequences directly (3 observed changes):
        A C T G T A G G A A T C G C
        A A T G A A A G A A T C G C
      Following the change through an intermediate sequence (6 actual changes):
        A C T G T A G G A A T C G C
        A C T G C A G G A A T A G C
        A A T G A A A G A A T C G C
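The gap between observed and actual changes on this slide can be checked with a short Python sketch. The function names are hypothetical, and the "per 100 sites" figure is the plain uncorrected proportion, not a model-corrected distance; the three sequences are the ones shown on the slide.

```python
def observed_changes(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def changes_per_100(seq_a: str, seq_b: str) -> float:
    """Observed changes scaled to 100 sites (uncorrected distance)."""
    return 100.0 * observed_changes(seq_a, seq_b) / len(seq_a)


# Sequences from the slide, written without spaces.
ancestor = "ACTGTAGGAATCGC"
intermediate = "ACTGCAGGAATAGC"
descendant = "AATGAAAGAATCGC"

# Comparing only the two end sequences hides repeated changes at one site.
direct = observed_changes(ancestor, descendant)
# Summing the changes along the path through the intermediate recovers them.
stepwise = (observed_changes(ancestor, intermediate)
            + observed_changes(intermediate, descendant))

print(direct, stepwise)  # 3 6
```

Positions 5 and 12 each change twice along the path, which is why a direct comparison undercounts.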

  25. Edit Distance vs Tree Distance • d1,4 = 12 + 13 + 14 + 17 + 12 = 68 • D1,4 may be smaller than 68, as some changes may not be observed

  26. Fitting Distance Matrix • Given n species, we can compute the n x n distance matrix Dij • Evolution of these genes is described by a tree that we don’t know. • We need an algorithm to construct a tree that best fits the distance matrix Dij

  27. Reconstructing a 3 Leaved Tree • Tree reconstruction for any 3x3 matrix is straightforward • We have 3 leaves i, j, k and a center vertex c • Observe: • dic + djc = Dij • dic + dkc = Dik • djc + dkc = Djk

  28. Reconstructing a 3 Leaved Tree (cont’d) • Add the first two equations: (dic + djc) + (dic + dkc) = Dij + Dik • 2dic + djc + dkc = Dij + Dik • Substitute djc + dkc = Djk: 2dic + Djk = Dij + Dik • dic = (Dij + Dik - Djk)/2 • Similarly, djc = (Dij + Djk - Dik)/2 • dkc = (Dki + Dkj - Dij)/2
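The three closed-form branch lengths can be written as a minimal Python sketch. The function name and the input distances (5, 7, 8) are hypothetical values chosen for illustration, not taken from the slides.

```python
def three_leaf_branches(d_ij: float, d_ik: float, d_jk: float):
    """Branch lengths from leaves i, j, k to the center vertex c."""
    d_ic = (d_ij + d_ik - d_jk) / 2
    d_jc = (d_ij + d_jk - d_ik) / 2
    d_kc = (d_ik + d_jk - d_ij) / 2
    return d_ic, d_jc, d_kc


# Hypothetical pairwise distances for three leaves.
d_ic, d_jc, d_kc = three_leaf_branches(d_ij=5, d_ik=7, d_jk=8)
print(d_ic, d_jc, d_kc)  # 2.0 3.0 5.0

# The branch lengths reproduce the input distances.
assert d_ic + d_jc == 5 and d_ic + d_kc == 7 and d_jc + d_kc == 8
```

A negative branch length from this formula would signal that the three distances violate the triangle inequality.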

  29. Additive Distance Matrices • Matrix D is ADDITIVE if there exists a tree T with dij(T) = Dij • NON-ADDITIVE otherwise

  30. The Four Point Condition (cont’d) • Compute: 1. Dij + Dkl, 2. Dik + Djl, 3. Dil + Djk • Sums 2 and 3 represent the same number: the length of all edges + the middle edge (it is counted twice) • Sum 1 represents a smaller number: the length of all edges - the middle edge

  31. The Four Point Condition: Theorem • The four point condition for the quartet i,j,k,l is satisfied if two of these sums are the same, with the third sum smaller than these first two • Theorem : An n x n matrix D is additive if and only if the four point condition holds for every quartet 1 ≤ i,j,k,l ≤ n
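The theorem suggests a brute-force additivity test, sketched here in Python. Some assumptions apply: D is a symmetric list of lists with a zero diagonal, only quartets of distinct leaves are checked, and the smallest sum is allowed to equal the other two (a middle edge of length zero). The function names, the tolerance and the example matrix are hypothetical.

```python
from itertools import combinations


def four_point_holds(D, i, j, k, l, tol=1e-9):
    """True if the two largest of the three pairwise sums are equal."""
    sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    # After sorting, sums[0] <= sums[1] <= sums[2], so only the top two are compared.
    return abs(sums[2] - sums[1]) <= tol


def is_additive(D):
    """Check the four point condition for every quartet of distinct leaves."""
    return all(four_point_holds(D, *quartet)
               for quartet in combinations(range(len(D)), 4))


# Distances read off a hypothetical tree: leaves 0 and 1 hang from one
# internal node (branches 2 and 3), leaves 2 and 3 from another
# (branches 4 and 5), and the middle edge has length 1.
additive = [
    [0, 5, 7, 8],
    [5, 0, 8, 9],
    [7, 8, 0, 9],
    [8, 9, 9, 0],
]

# Stretching one entry breaks the condition.
perturbed = [row[:] for row in additive]
perturbed[2][3] = perturbed[3][2] = 12

print(is_additive(additive), is_additive(perturbed))  # True False
```

Checking all quartets costs on the order of n^4 operations, so this is only practical for small matrices.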

  32. Distance Based Phylogeny Problem • Goal: Reconstruct an evolutionary tree from a distance matrix • Input: n x n distance matrix Dij • Output: weighted tree T with n leaves fitting D • If D is additive, this problem has a solution and there is a simple algorithm to solve it

  33. Using Neighboring Leaves to Construct the Tree • Find neighboring leaves i and j with parent k • Remove the rows and columns of i and j • Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as: Dkm = (Dim + Djm - Dij)/2 • Compress i and j into k, iterate algorithm for rest of tree
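One compression step can be sketched in Python as follows. The dict-of-dicts layout, the function name `merge_neighbors` and the four-leaf example matrix are assumptions made for illustration; the update formula is the one on the slide.

```python
def merge_neighbors(D, i, j, k):
    """Return a new distance table in which leaves i and j are replaced by parent k."""
    others = [m for m in D if m not in (i, j)]
    # Keep the distances among the untouched leaves.
    reduced = {m: {n: D[m][n] for n in others} for m in others}
    reduced[k] = {k: 0}
    for m in others:
        # Dkm = (Dim + Djm - Dij) / 2
        d_km = (D[i][m] + D[j][m] - D[i][j]) / 2
        reduced[k][m] = d_km
        reduced[m][k] = d_km
    return reduced


# Hypothetical additive matrix: a and b are neighbors (branches 2 and 3),
# c and d are neighbors (branches 4 and 5), middle edge of length 1.
D = {
    "a": {"a": 0, "b": 5, "c": 7, "d": 8},
    "b": {"a": 5, "b": 0, "c": 8, "d": 9},
    "c": {"a": 7, "b": 8, "c": 0, "d": 9},
    "d": {"a": 8, "b": 9, "c": 9, "d": 0},
}

reduced = merge_neighbors(D, "a", "b", "k")
print(reduced["k"]["c"], reduced["k"]["d"])  # 5.0 6.0
```

The new distances 5 and 6 equal the middle edge plus the branches to c and d, as expected for an additive matrix.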

  34. Finding Neighboring Leaves • To find neighboring leaves we simply select a pair of closest leaves.

  35. Finding Neighboring Leaves • To find neighboring leaves we simply select a pair of closest leaves. • WRONG

  36. Finding Neighboring Leaves • Closest leaves aren’t necessarily neighbors • i and j are neighbors, but (dij = 13) > (djk = 12) • Finding a pair of neighboring leaves is a nontrivial problem!

  37. Neighbor Joining Algorithm • In 1987 Naruya Saitou and Masatoshi Nei developed a neighbor joining algorithm for phylogenetic tree reconstruction • Finds a pair of leaves that are close to each other but far from other leaves: implicitly finds a pair of neighboring leaves • Advantages: works well for additive and other non-additive matrices, it does not have the flawed molecular clock assumption

  38. Overview • Based on the current distance matrix, calculate the matrix Q (defined later). • Find the pair of taxa i, j for which Qij has its lowest value. Add a new node to the tree, joining these taxa to the rest of the tree. • Calculate the distance from each of the taxa in the pair to this new node. • Calculate the distance from each of the taxa outside of this pair to the new node. • Start the algorithm again, replacing the pair of joined neighbors with the new node and using the distances calculated in the previous step.
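The five steps can be put together as a compact Python sketch. It is one possible reading of the overview, not a reference implementation. It assumes that r(i) is the sum of distances from taxon i to all other taxa, that ties in Q are broken by taking the first pair found, and that new nodes get the hypothetical labels U0, U1, and so on. The input matrix is a made-up additive example.

```python
def neighbor_joining(D):
    """Return (node, node, branch_length) edges of an unrooted tree.

    D is a dict of dicts of symmetric distances keyed by taxon label.
    """
    D = {a: dict(row) for a, row in D.items()}  # work on a copy
    edges = []
    next_id = 0
    while len(D) > 2:
        n = len(D)
        taxa = list(D)
        r = {a: sum(D[a][b] for b in taxa if b != a) for a in taxa}
        # Steps 1-2: compute Q for every pair and keep the lowest value.
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = taxa[x], taxa[y]
                q = (n - 2) * D[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"U{next_id}"
        next_id += 1
        # Step 3: distances from the joined pair to the new node.
        d_au = D[a][b] / 2 + (r[a] - r[b]) / (2 * (n - 2))
        d_bu = D[a][b] - d_au
        edges += [(a, u, d_au), (b, u, d_bu)]
        # Step 4: distances from every other taxon to the new node.
        others = [m for m in taxa if m not in (a, b)]
        to_u = {m: (D[a][m] + D[b][m] - D[a][b]) / 2 for m in others}
        # Step 5: replace the pair by the new node and repeat.
        D = {m: {k: D[m][k] for k in others if k != m} for m in others}
        for m in others:
            D[m][u] = to_u[m]
        D[u] = dict(to_u)
    a, b = list(D)
    edges.append((a, b, D[a][b]))
    return edges


# Hypothetical additive matrix (a and b are neighbors, c and d are neighbors).
example = {
    "a": {"b": 5, "c": 7, "d": 8},
    "b": {"a": 5, "c": 8, "d": 9},
    "c": {"a": 7, "b": 8, "d": 9},
    "d": {"a": 8, "b": 9, "c": 9},
}
edges = neighbor_joining(example)
for edge in edges:
    print(edge)
```

On this input the sketch recovers the branches 2, 3, 4 and 5 to the leaves and a middle edge of length 1.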

  39. Basic Algorithm

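  40. [Figure: distance matrix D and the corresponding matrix Q]

  41. [Figure: distance matrix D and the corresponding matrix Q]

  42. [Figure: distance matrix D and the corresponding matrix Q]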

  43. Another Example

  44. Q(ij)=(N-2)d(ij) - [r(i) + r(j)]
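The Q formula can be sketched in Python under two assumptions. First, r(i) is taken to be the sum of row i of the distance matrix. Second, the six-taxon matrix below is an assumed input: the matrix image did not survive in this transcript, so the values were chosen to agree with the results quoted on slides 46 and 47.

```python
labels = ["A", "B", "C", "D", "E", "F"]

# Assumed distance matrix (not recoverable from the transcript itself).
d = [
    [0, 5, 4, 7, 6, 8],
    [5, 0, 7, 10, 9, 11],
    [4, 7, 0, 7, 6, 8],
    [7, 10, 7, 0, 5, 9],
    [6, 9, 6, 5, 0, 8],
    [8, 11, 8, 9, 8, 0],
]


def q_matrix(d):
    """Return the row sums r and Q(ij) = (N-2) d(ij) - [r(i) + r(j)]."""
    n = len(d)
    r = [sum(row) for row in d]
    q = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = (n - 2) * d[i][j] - (r[i] + r[j])
    return r, q


r, q = q_matrix(d)

# Pick the pair with the lowest Q; min() keeps the first pair on ties.
pairs = [(i, j) for i in range(len(d)) for j in range(i + 1, len(d))]
i, j = min(pairs, key=lambda p: q[p[0]][p[1]])
print(labels[i], labels[j], q[i][j])  # A B -52
```

With these assumed values the pairs (A, B) and (D, E) tie at -52; taking the first pair selects A and B.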

  45. [Figure: matrix Q]

  46. d(AU) = d(AB)/2 + [r(A) - r(B)] / [2(N-2)] = 1 • d(BU) = d(AB) - d(AU) = 4 • Tree (so far)

  47. New Matrix • d(CU) = [d(AC) + d(BC) - d(AB)] / 2 = 3 • d(DU) = [d(AD) + d(BD) - d(AB)] / 2 = 6 • d(EU) = [d(AE) + d(BE) - d(AB)] / 2 = 5 • d(FU) = [d(AF) + d(BF) - d(AB)] / 2 = 7
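The arithmetic on slides 46 and 47 can be reproduced with a few lines of Python. The two rows of input distances are assumptions, since the original matrix image is missing from this transcript; they were chosen because they give the results shown on the slides. N = 6 taxa and r(i) as a row sum are also assumed.

```python
# Assumed distances from A and from B to every other taxon.
row_a = {"B": 5, "C": 4, "D": 7, "E": 6, "F": 8}
row_b = {"A": 5, "C": 7, "D": 10, "E": 9, "F": 11}
n = 6  # number of taxa before joining A and B

r_a = sum(row_a.values())  # r(A)
r_b = sum(row_b.values())  # r(B)
d_ab = row_a["B"]

# Slide 46: branch lengths from A and B to the new node U.
d_au = d_ab / 2 + (r_a - r_b) / (2 * (n - 2))
d_bu = d_ab - d_au

# Slide 47: distance from each remaining taxon to U.
to_u = {m: (row_a[m] + row_b[m] - d_ab) / 2 for m in row_a if m != "B"}

print(d_au, d_bu)  # 1.0 4.0
print(to_u)        # {'C': 3.0, 'D': 6.0, 'E': 5.0, 'F': 7.0}
```

The values in `to_u` form the row and column for U in the new, smaller matrix.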
