
Phylogenetic analysis




  1. Phylogenetic analysis: a brief introduction in 2 x 4 hours. brigitte.boeckmann@isb-sib.ch

  2. What you can learn today • Understand trees • Different types of gene relationships • The difference between a cladogram and a phylogram • Phylogenetic analysis methods • Steps performed during a phylogenetic analysis • Search strategies for tree topologies • Measures for tree robustness • Gene relationships and function prediction

  3. Outline • Introduction to phylogenetic analysis • Application: protein function prediction • Databases, servers and software • TP5

  4. Introduction Phylogeny is the study of evolutionary relationships. Phylogenetic analysis is the means of inferring evolutionary relationships. [Figure labels: ancestral genome; genome species 1; genome species 2; HGT (horizontal gene transfer); polymorphisms, CNV; gene duplication, gene loss, gene fusion, gene fission, exon shuffling, retroposition, mobile elements, de novo gene origination]

  5. Trees [Figure: two trees, each with leaves A to G; labelled parts: end nodes, internal nodes, branches, roots]

  6. Phylogenetic trees • Cladogram • Phylogram: the branch length represents the number of character changes • Molecular clock

  7. Phylogenetic trees
     A phylogenetic tree is a model of the evolutionary relationships between operational taxonomic units (OTUs), based on homologous characters. But not all trees are phylogenetic trees:
     • Dendrogram: general term for a branching diagram
     • Cladogram: branching diagram without branch length estimates
     • Phylogram or phylogenetic tree: branching diagram with branch length estimates
     Please note: guide trees produced during multiple sequence alignment have no phylogenetic meaning. These dendrograms are based on distances derived from pairwise alignments; they are used to determine the order in which sequences are aligned during the construction of the MSA.

  8. Rooted and unrooted trees Outgroup

  9. How many distinct trees?
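The transcript preserves only the question on this slide. As a hedged sketch of the standard counting result for strictly bifurcating trees with labelled leaves: n OTUs admit (2n-5)!! unrooted topologies (for n >= 3) and (2n-3)!! rooted topologies (for n >= 2). The function names below are illustrative and not taken from the presentation.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 or 2 (1 if k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_otus: int) -> int:
    """Number of distinct unrooted bifurcating topologies for n labelled OTUs."""
    if n_otus < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 OTUs")
    return double_factorial(2 * n_otus - 5)


def count_rooted_trees(n_otus: int) -> int:
    """Number of distinct rooted bifurcating topologies for n labelled OTUs."""
    if n_otus < 2:
        raise ValueError("a rooted bifurcating tree needs at least 2 OTUs")
    return double_factorial(2 * n_otus - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

For the five species used later in the exercise this gives 15 unrooted and 105 rooted topologies, which is why exhaustive evaluation quickly becomes impractical as taxa are added.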

  10. Resolved (bifurcating) and unresolved (multifurcating) trees [Figure: two trees, each with leaves A to G]

  11. Speciation and gene duplication [Figure: tree diagrams with nodes marked "Gene duplication"; leaf labels include A1, A2, B1, B2, C, C1, C2, D, E, F]

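  12. Relationships within homologs [Figure: gene tree descending from an ancestral gene, with a Drosophila gene and a gene duplication that gives rise to two groups: frog, human and mouse gene 1, and frog, human and mouse gene 2. The genes within each group are labelled orthologs, the two groups are labelled paralogs, and all genes together are labelled homologs.]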

  13. Relationships between orthologs and paralogs [Figure: the gene tree of slide 12 (ancestral gene, gene duplication, Drosophila gene), with frog, human and mouse gene 1 labelled "Orthologs (Group 1)" and frog, human and mouse gene 2 labelled "Orthologs (Group 2)". Further labels: "Co-orthologs of the Drosophila gene", "Inparalogs of Group 2", "Outparalogs of Group 1".]

  14. Gene trees versus species trees …

  15. Gene relationships
     • Homologs = Genes of common origin
     • Orthologs = 1. Genes resulting from a speciation event; 2. Genes originating from an ancestral gene in the last common ancestor of the compared genomes
     • Co-orthologs = Orthologs that have undergone lineage-specific gene duplications subsequent to a particular speciation event
     • Paralogs = Genes resulting from gene duplication
     • Inparalogs = Paralogs resulting from lineage-specific duplication(s) subsequent to a particular speciation event
     • Outparalogs = Paralogs resulting from gene duplication(s) preceding a particular speciation event
     • One-to-one (1:1) orthologs = Orthologs with no (known) lineage-specific gene duplications subsequent to a particular speciation event
     • One-to-many (1:n) orthologs = Orthologs of which at least one, and at most all but one, has undergone lineage-specific gene duplication subsequent to a particular speciation event
     • Many-to-many (n:n) orthologs = Orthologs which have undergone lineage-specific gene duplications subsequent to a particular speciation event
     • Pseudo-orthologs = Paralogs with lineage-specific gene loss of orthologs
     • Xenologs = Orthologs derived by horizontal gene transfer from another lineage
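The two basic definitions above (orthologs arise from speciation, paralogs from duplication) can be applied mechanically once the internal nodes of a gene tree are labelled: the event at the last common ancestor of two genes decides their relationship. The following is a minimal sketch under assumptions: the tree topology is our reading of the figure on slide 12, the event labels are supplied by hand, and all names (`GENE_TREE`, `lca_event`, `relationship`) are hypothetical.

```python
# Internal nodes are (event, left_child, right_child); leaves are gene names.
# Assumed topology, read off slide 12; not data supplied by the author.
GENE_TREE = (
    "speciation",
    "Drosophila gene",
    (
        "duplication",
        ("speciation", "Frog gene 1",
         ("speciation", "Human gene 1", "Mouse gene 1")),
        ("speciation", "Frog gene 2",
         ("speciation", "Human gene 2", "Mouse gene 2")),
    ),
)


def leaves(node):
    """Return the set of gene names found below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def lca_event(node, gene_a, gene_b):
    """Return the event label at the last common ancestor of two genes."""
    if isinstance(node, str):
        raise ValueError("the two genes must be distinct leaves of the tree")
    event, left, right = node
    for child in (left, right):
        if {gene_a, gene_b} <= leaves(child):
            # Both genes sit in the same subtree: their ancestor is deeper.
            return lca_event(child, gene_a, gene_b)
    if not {gene_a, gene_b} <= leaves(node):
        raise ValueError("both genes must be present in the tree")
    return event


def relationship(gene_a, gene_b, tree=GENE_TREE):
    """Classify a gene pair as orthologs or paralogs."""
    event = lca_event(tree, gene_a, gene_b)
    return "orthologs" if event == "speciation" else "paralogs"


if __name__ == "__main__":
    print(relationship("Human gene 1", "Mouse gene 1"))
    print(relationship("Human gene 1", "Human gene 2"))
```

Under this labelling the Drosophila gene is classified as an ortholog of every vertebrate gene, which corresponds to the co-ortholog case in the list above.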

  16. Sequence data of actin-related protein 2 >Species A - RecName: Full=Actin-related protein 2; MDSQGRKVVV CDNGTGFVKC GYAGSNFPEH IFPALVGRPI IRSTTKVGNI EIKDLMVGDE ASELRSMLEV NYPMENGIVR NWDDMKHLWD YTFGPEKLNI DTRNCKILLT EPPMNPTKNR EKIVEVMFET YQFSGVYVAI QAVLTLYAQG LLTGVVVDSG DGVTHICPVY EGFSLPHLTR RLDIAGRDIT RYLIKLLLLR GYAFNHSADF ETVRMIKEKL CYVGYNIEQE QKLALETTVL VESYTLPDGR IIKVGGERFE APEALFQPHL INVEGVGVAE LLFNTIQAAD IDTRSEFYKH IVLSGGSTMY PGLPSRLERE LKQLYLERVL KGDVEKLSKF KIRIEDPPRR KHMVFLGGAV LADIMKDKDN FWMTRQEYQE KGVRVLEKLG VTVR >Species B - RecName: Full=Actin-related protein 2; MDSQGRKVVV CDNGTGFVKC GYAGSNFPEH IFPALVGRPI IRSTTKVGNI EIKDLMVGDE ASELRSMLEV NYPMENGIVR NWDDMKHLWD YTFGPEKLNI DTRNCKILLT EPPMNPTKNR EKIVEVMFET YQFSGVYVAI QAVLTLYAQG LLTGVVVDSG DGVTHICPVY EGFSLPHLTR RLDIAGRDIT RYLIKLLLLR GYAFNHSADF ETVRMIKEKL CYVGYNIEQE QKLALETTVL VESYTLPDGR IIKVGGERFE APEALFQPHL INVEGVGVAE LLFNTIQAAD IDTRSEFYKH IVLSGGSTMY PGLPSRLERE LKQLYLERVL KGDVEKLSKF KIRIEDPPRR KHMVFLGGAV LADIMKDKDN FWMTRQEYQE KGVRVLEKLG VTVR …. Phylogenetic analysis – an approach I Species are: Caenorhabditis briggsae Drosophila melanogaster Homo sapiens Mus musculus Schizosaccharomyces pombe

  17. ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE ARP2_C MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_E MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE *:* :* ******** *** *** . **::****::*: . *::::**:***:* ARP2_A AEAVRSLLQVKYPMENGIIRDFEEMNQLWDYTF-FEKLKIDPRGRKILLTEPPMNPVANR ARP2_B CSQLRQMLDINYPMDNGIVRNWDDMAHVWDHTFGPEKLDIDPKECKLLLTEPPLNPNSNR ARP2_C ASQLRSLLEVSYPMENGVVRNWDDMCHVWDYTFGPKKMDIDPTNTKILLTEPPMNPTKNR ARP2_D ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR ARP2_E ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR .. :*.:*::.***:**::*::::* ::**:** :*:.**. *:******:** ** ARP2_A EKMCETMFERYGFGGVYVAIQAVLSLYAQGLSSGVVVDSGDGVTHIVPVYESVVLNHLVG ARP2_B EKMFQVMFEQYGFNSIYVAVQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFALHHLTR ARP2_C EKMIEVMFEKYGFDSAYIAIQAVLTLYAQGLISGVVIDSGDGVTHICPVYEEFALPHLTR ARP2_D EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR ARP2_E EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR **: :.*** * *.. *:*:****:****** :***:********* **** . * **. ARP2_A RLDVAGRDATRYLISLLLRKGYAFNRTADFETVREMKEKLCYVSYDLELDHKLSEETTVL ARP2_B RLDIAGRDITKYLIKLLLQRGYNFNHSADFETVRQMKEKLCYIAYDVEQEERLALETTVL ARP2_C RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRIMKEKLCYIGYDIEMEQRLALETTVL ARP2_D RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL ARP2_E RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL ***:**** *.***.*** .** **.:******* :******:.*::* : .*: ***** ARP2_A MRNYTLPDGRVIKVGSERYECPECLFQPHLVGSEQPGLSEFIFDTIQAADVDIRKYLYRA ARP2_B SQQYTLPDGRVIRLGGERFEAPEILFQPHLINVEKAGLSELLFGCIQASDIDTRLDFYKH ARP2_C VESYTLPDGRVIKVGGERFEAPEALFQPHLINVEGPGIAELAFNTIQAADIDIRPELYKH ARP2_D VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH ARP2_E VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH .*******:*.:*.**:*.** ******:. * *::*: *. ***:*:* * :*. 
ARP2_A IVLSGGSSMYAGLPSRLEKEIKQLWFERVLHGDPARLPNFKVKIEDAPRRRHAVFIGGAV ARP2_B IVLSGGTTMYPGLPSRLEKELKQLYLDRVLHGNTDAFQKFKIRIEAPPSRKHMVFLGGAV ARP2_C IVLSGGSTMYPGLPSRLEREIKQLYLERVLKNDTEKLAKFKIRIEDPPRRKDMVFIGGAV ARP2_D IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV ARP2_E IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV ******::**.*******.*:***:::***:.: : :**:.** .* *. **:**** ARP2_A LADIMAQND-HMWVSKAEWEEYGV-RALDKLGPRTT ARP2_B LANLMKDRDQDFWVSKKEYEEGGIARCMAKLGIKA- ARP2_C LAEVTKDRD-GFWMSKQEYQEQGL-KVLQKLQKISH ARP2_D LADIMKDKD-NFWMTRQEYQEKGV-RVLEKLGVTVR ARP2_E LADIMKDKD-NFWMTRQEYQEKGV-RVLEKLGVTVR **:: :.* :*::. *::* *: . : ** Species are: Caenorhabditis briggsae Drosophila melanogaster Homo sapiens Mus musculus Schizosaccharomyces pombe Which sequence is likely to correspond to which species?

  18. ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE

  19. Distance matrix Species are: Caenorhabditis briggsae Drosophila melanogaster Homo sapiens Mus musculus Schizosaccharomyces pombe

  20. Expected species tree for … Caenorhabditis briggsae Drosophila melanogaster Homo sapiens Mus musculus Schizosaccharomyces pombe

  21. Phylogenetic analysis • Data selection • Data comparison • Selection of a data model • Selection of an evolutionary model • Tree-building • Tree evaluation

  22. What data types can be used to infer phylogenies? • Morphological characters • Physiological characters • Gene order • Sequence data (nucleotide sequences, amino acid sequences) • Mixed characters • …

  23. Data selection To be considered: • Input data must be homologous! • Taxonomic range and taxonomic distribution (balance, avoid long branches) • Content of phylogenetic information • Number of character states • Size of the dataset • etc.

  24. Phylogenetic analysis • Data selection • Data comparison • Selection of a data model • Selection of an evolutionary model • Tree-building • Tree evaluation

  25. Data comparison To be considered: • Prediction of characters that are derived from a common ancestor • Choose a suitable alignment method • Highly diverged sequences • Domain/family predictions • Structures

  26. Alignment • Pairwise alignment versus MSA • MSA methods: ClustalW (very fast), Muscle (very fast), MAFFT (fast), Probcons, T-coffee, … • When to use which method and why?

  27. Phylogenetic analysis • Data selection • Data comparison • Selection of a data model • Selection of an evolutionary model • Tree-building • Tree evaluation

  28. Characters to be selected for the analysis (selection of a data model) To be considered: • Each position in the alignment should be homologous! • Missing data (in some OTUs) • Number of characters • etc.

  29. Selection of a data model Common methods: • Gap removal • GBLOCKS
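Gap removal, the simpler of the two methods named here, can be sketched as dropping every alignment column in which at least one sequence has a gap. This is a minimal illustration only: the function name is hypothetical, and GBLOCKS applies additional criteria that are not reproduced here. The example sequences are the first ten alignment columns of ARP2_A and ARP2_B from slide 17.

```python
def remove_gap_columns(alignment: dict[str, str], gap_chars: str = "-") -> dict[str, str]:
    """Return a copy of the alignment without any column that contains a gap."""
    if not alignment:
        return {}
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_columns = lengths.pop()
    # Keep only the columns where every sequence has a residue.
    keep = [
        i for i in range(n_columns)
        if all(seq[i] not in gap_chars for seq in alignment.values())
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


if __name__ == "__main__":
    block = {"ARP2_A": "MESAP---IV", "ARP2_B": "MDSQGRKVIV"}
    print(remove_gap_columns(block))
```

Complete gap removal is conservative: it guarantees that every retained position is present in all OTUs, at the cost of discarding characters when only a few sequences carry an insertion or deletion.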

  30. Phylogenetic analysis • Data selection • Data comparison • Selection of a data model • Selection of an evolutionary model • Tree-building • Tree evaluation

  31. Evolutionary models • Phylogenetic tree-building presumes particular evolutionary models. • The model chosen influences the outcome of the analysis and should be considered in the interpretation of the analysis results.

  32. Evolutionary models Which aspects are to be considered? … … … … etc

  33. Evolutionary models Which aspects are to be considered? Frequencies of aa exchange … … … etc

  34. http://www.russell.embl-heidelberg.de/aas/other_images/lb3.gif

  35. Frequencies of aa exchange: substitution matrices
     Empirically derived from alignment datasets:
     • PAM (Dayhoff, 1968)
     • JTT (Jones, Taylor, Thornton, 1992)
     • Gonnet et al. (1992)
     • WAG (Whelan, Goldman, 2001)
     • mtREV (Adachi, Hasegawa, 1996; specific for mitochondrial data)
     Estimated rate matrix -> series of replacement probability matrices (e.g. PAM1 … PAM250)
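The step from one matrix to a series of replacement probability matrices can be illustrated by repeated multiplication: if PAM1 holds the replacement probabilities for one unit of divergence, PAM250 is PAM1 raised to the 250th power. The sketch below uses an invented 3-state matrix purely for illustration; a real PAM1 matrix has 20 x 20 entries estimated from alignment data, and none of the numbers here come from Dayhoff's tables.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in matrix]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy one-step matrix (assumption): each row gives the probability that a
# residue stays the same (diagonal) or is replaced by another residue.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

if __name__ == "__main__":
    for steps in (1, 50, 250):
        print(steps, [round(x, 3) for x in mat_pow(TOY_PAM1, steps)[0]])
```

Each row of the powered matrix still sums to 1, and the diagonal shrinks as the number of steps grows, reflecting the increasing chance that a position has been replaced at larger evolutionary distances.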

  36. Evolutionary models Which aspects are to be considered? Frequencies of aa exchange Change of aa frequencies during evolution … … etc Why?

  37. GC content

  38. Evolutionary models Which aspects are to be considered? • Frequencies of aa exchange • Change of aa frequencies during evolution • GC content: differs between species (20-72%); differs within a genome (isochores); biased recombination-associated DNA repair; temperature
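GC content itself is simple to measure, which makes it a convenient first check for compositional differences between the sequences in a dataset. The helper below is an illustrative sketch (the function name is hypothetical); it assumes DNA input and ignores any character other than A, C, G and T, such as N or gaps.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the unambiguous bases A, C, G, T."""
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc_count = sum(1 for base in bases if base in "GC")
    return gc_count / len(bases)


if __name__ == "__main__":
    print(gc_content("ATGCGCGTTANN"))
```

Computing this value per sequence, or in windows along a genome, shows whether base composition is roughly homogeneous, which most standard substitution models assume.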

  39. Evolutionary models Which aspects are to be considered? • Frequencies of aa exchange • Change of aa frequencies during evolution • An exchangeability matrix can be built for a particular dataset • JTT + F

  40. Evolutionary models Which aspects are to be considered? • Frequencies of aa exchange • Change of aa frequencies during evolution • Between-site rate variation, or among-site substitution rate heterogeneity

  41. Alignment

  42. Evolutionary models Which aspects are to be considered? • Frequencies of aa exchange • Change of aa frequencies during evolution • Between-site rate variation, or among-site substitution rate heterogeneity: variation in substitution rates among different positions; mostly discrete gamma model

  43. The gamma distribution is a continuous probability density function
     • Alpha (shape) parameter
     • Scaling factor
     • Infinitely large alpha value: the rate is the same for all sites
     • alpha = 1: extensive rate variation
     • alpha < 1: many sites are (nearly) invariable
     [Figure: probability density plotted against relative evolutionary rate] http://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Gamma_distribution_pdf.png
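To see how alpha shapes the distribution of relative rates, the density can be evaluated directly. The sketch assumes the parameterization commonly used for among-site rate variation, a gamma distribution with shape alpha and mean 1 (scale 1/alpha), so that the variance of the relative rate is 1/alpha. The function names are illustrative.

```python
import math


def gamma_rate_density(rate: float, alpha: float) -> float:
    """Density at `rate` of a gamma distribution with shape alpha and mean 1."""
    if rate <= 0 or alpha <= 0:
        raise ValueError("rate and alpha must be positive")
    # Work in log space so that large alpha values do not overflow.
    log_density = (
        alpha * math.log(alpha)
        + (alpha - 1.0) * math.log(rate)
        - alpha * rate
        - math.lgamma(alpha)
    )
    return math.exp(log_density)


def rate_variance(alpha: float) -> float:
    """Variance of the relative rate under the mean-1 gamma distribution."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / alpha


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 5.0, 50.0):
        densities = [round(gamma_rate_density(r, alpha), 4) for r in (0.01, 0.5, 1.0, 2.0)]
        print(alpha, rate_variance(alpha), densities)
```

With small alpha most of the density sits near a rate of zero, matching the "many invariable sites" case on the slide, while a large alpha concentrates the density around a rate of 1, approaching equal rates for all sites.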

  44. Evolutionary models Which aspects are to be considered? • Frequencies of aa exchange • Change of aa frequencies during evolution • Between-site rate variation, or among-site substitution rate heterogeneity: variation in substitution rates among different positions; mostly discrete gamma model; select the number of categories (4/8)

  45. Evolutionary models Which aspects are to be considered? • Frequencies of aa exchange • Change of aa frequencies during evolution • Between-site rate variation, or among-site substitution rate heterogeneity • Presence of invariable sites

  46. Evolutionary models Notation, e.g. • JTT • JTT + F • JTT + F + gamma (4) • JTT + F + gamma (8) + I (under discussion) • JTT + F + I It is not always the most complex model that produces the best result. The more complex the model, the more complex the explanation of the results.

  47. Evolutionary models: statistical selection of best-fit models of evolution
     • ProtTest
     • AIC (Akaike Information Criterion): a simple relationship between the likelihood and the number of parameters, used to estimate the distance of a model from the truth
     • BIC (Bayesian Information Criterion): includes a penalty for the number of parameters to avoid overfitting of the selected model
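Both criteria have short closed forms: AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of free parameters, ln L the maximized log-likelihood, and n the sample size; the model with the lowest score is preferred. The sketch below is not ProtTest's implementation. The log-likelihoods are invented for illustration, n is assumed to be an alignment length of 390 sites, and the parameter counts assume 7 branch lengths for a five-taxon unrooted tree, plus 19 free amino acid frequencies for +F and 1 shape parameter for +G.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln L (lower is better)."""
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: model name -> (log-likelihood, number of free parameters).
HYPOTHETICAL_FITS = {
    "JTT": (-5230.4, 7),
    "JTT+F": (-5215.9, 26),
    "JTT+F+G": (-5100.2, 27),
}


def best_model(fits, criterion):
    """Return the model name with the lowest criterion value."""
    return min(fits, key=lambda name: criterion(*fits[name]))


if __name__ == "__main__":
    n_sites = 390  # assumed alignment length
    print("AIC:", best_model(HYPOTHETICAL_FITS, aic))
    print("BIC:", best_model(HYPOTHETICAL_FITS, lambda lnl, k: bic(lnl, k, n_sites)))
```

In these invented numbers JTT+F scores worse than plain JTT under both criteria, because its 19 extra parameters buy only a small gain in likelihood; this is the sense in which a more complex model is not automatically the better one.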

  48. Phylogenetic analysis • Data selection • Data comparison • Selection of a data model • Selection of an evolutionary model • Tree-building • Tree evaluation

  49. Tree-building methods • Distance (matrix) methods: calculate distances for all pairs of taxa based on the sequence alignment, then construct a phylogenetic tree based on the distance matrix • Character-based (sequence) methods: construct a phylogenetic tree directly from the sequence alignment

  50. Step 1: Compute distances Simple measure for the extent of sequence divergence: the p distance, $\hat{p} = n_d / n$, where $\hat{p}$ = proportion of differing sites (p distance), $n_d$ = number of aa differences, $n$ = number of aa used
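The p distance translates directly into code: count the compared positions and the positions that differ. The sketch below makes one assumption the slide does not state, namely that positions with a gap in either sequence are skipped (pairwise deletion), so that n counts only the residues actually compared. Function names are illustrative; the example sequences are the first ten alignment columns of ARP2_A, ARP2_B and ARP2_D from slide 17.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of differing residues among the compared alignment columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for res_a, res_b in zip(seq_a, seq_b):
        if res_a == gap or res_b == gap:
            continue  # assumption: columns with a gap are not counted
        compared += 1
        if res_a != res_b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differences / compared


def distance_matrix(alignment: dict[str, str]) -> dict[str, dict[str, float]]:
    """Symmetric matrix of p distances for all pairs of aligned sequences."""
    matrix = {name: {name: 0.0} for name in alignment}
    for name_a, name_b in combinations(alignment, 2):
        d = p_distance(alignment[name_a], alignment[name_b])
        matrix[name_a][name_b] = d
        matrix[name_b][name_a] = d
    return matrix


if __name__ == "__main__":
    block = {
        "ARP2_A": "MESAP---IV",
        "ARP2_B": "MDSQGRKVIV",
        "ARP2_D": "MDSQGRKVVV",
    }
    for name, row in distance_matrix(block).items():
        print(name, {other: round(d, 3) for other, d in row.items()})
```

The p distance underestimates the true amount of change once sequences are diverged enough for multiple substitutions to hit the same site, which is where the evolutionary models discussed earlier come in.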
