Phylogenetics workshop: Protein sequence phylogeny (week 2) Darren Soanes
Species trees • Interpretation of trees • Taxon sampling • Tools • Lateral (horizontal) gene transfer • Fast evolving genes
Using DNA sequence to construct trees • Figure: tree with sequences TGCTATT, TGCTTTT, TGCTTTT • Legend: TGCTTTT = sequence change due to mutation; TGCTATT = ancestral DNA sequence
Reversals can confuse phylogenies • Figure: tree with sequences TGCTATT, TGCTTTT, TGCTTTT, TGCTTTT, TGCTATT, TGCTATT; one branch is labelled "reversal" • Legend: TGCTTTT = sequence change; TGCTATT = ancestral DNA sequence
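A minimal Python sketch of why a reversal hides evolutionary change, using the sequences from the slide. The function name `hamming` is an illustrative choice: it counts the positions at which two sequences differ, so a lineage that mutated and then reverted looks identical to the ancestor.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


ancestral = "TGCTATT"
mutated = "TGCTTTT"    # A -> T at position 5
reverted = "TGCTATT"   # T -> A again at the same position (a reversal)

print(hamming(ancestral, mutated))   # 1 observed difference
print(hamming(ancestral, reverted))  # 0 observed differences, although 2 mutations occurred
```

The reverted lineage has undergone two mutations but shows none, which is why it can be wrongly grouped with unmutated lineages.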
To minimise the effect of reversals • Use DNA sequences that are evolving slowly (mutations happen rarely). • Use long stretches of DNA. • Align sequences and use only the parts of the alignment that show a high degree of conservation. • rDNA sequences (genes that encode ribosomal RNA) are often used.
Using protein sequences to create species trees • Advantages • Protein sequences evolve more slowly than DNA sequences (many DNA mutations are neutral: they do not change the amino acid sequence) • Reversals are less common than in DNA • Method • Single-copy protein-encoding genes are identified • Protein sequences are joined together to create one multi-protein sequence for each species • Sequences are aligned • Disadvantage: sequenced genomes are needed
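The joining step can be sketched in Python as below. This is an illustration under stated assumptions, not the workshop's pipeline: it assumes each protein has already been aligned separately (a common variant of the order given on the slide), and the species names, sequences and the function name `concatenate_proteins` are all hypothetical.

```python
def concatenate_proteins(alignments, species):
    """Join per-protein aligned sequences into one long sequence per species.

    alignments: list of dicts, one per protein, mapping species -> aligned sequence.
    species: names that must be present in every alignment (single-copy genes).
    """
    joined = {name: "" for name in species}
    for alignment in alignments:
        lengths = {len(alignment[name]) for name in species}
        if len(lengths) != 1:
            raise ValueError("aligned sequences for one protein must be equal length")
        for name in species:
            joined[name] += alignment[name]
    return joined


# Hypothetical toy alignments for two single-copy proteins in three species.
protein_1 = {"sp_A": "MKT-L", "sp_B": "MKTAL", "sp_C": "MRTAL"}
protein_2 = {"sp_A": "GHW", "sp_B": "GHW", "sp_C": "G-W"}

supermatrix = concatenate_proteins([protein_1, protein_2], ["sp_A", "sp_B", "sp_C"])
print(supermatrix["sp_A"])  # MKT-LGHW
```

Adding more proteins lengthens every row of the joined alignment, which gives the tree-building step more informative columns to work with.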
Fungal species trees: more proteins = better resolution • Figure: two species trees, one built from 30 proteins and one from 60 proteins • Tree labels: oomycete (not fungi), microsporidia, plant, zygomycete, basidiomycetes, yeasts, ascomycetes, filamentous ascomycetes
Clades A clade consists of an ancestor organism and all its descendants.
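The definition can be checked mechanically. The Python sketch below assumes a hypothetical toy tree stored as nested `(name, children)` tuples; the taxa A, B and C and the function names are illustrative only. A set of taxa forms a clade if some node in the tree has exactly that set as its descendant tips.

```python
def leaves(node):
    """Return the set of tip names below (and including) a node."""
    name, children = node
    if not children:
        return {name}
    tips = set()
    for child in children:
        tips |= leaves(child)
    return tips


def is_clade(node, taxa):
    """True if some node in the tree has exactly `taxa` as its descendant tips."""
    taxa = set(taxa)
    if leaves(node) == taxa:
        return True
    return any(is_clade(child, taxa) for child in node[1])


# Hypothetical tree ((A, B), C): A and B share an ancestor not shared with C.
tree = ("root", [("ancestor_AB", [("A", []), ("B", [])]), ("C", [])])

print(is_clade(tree, {"A", "B"}))  # True: an ancestor plus all of its descendants
print(is_clade(tree, {"B", "C"}))  # False: their common ancestor also gave rise to A
```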
Gene trees • The evolutionary history of genes can be represented as phylogenetic trees based on alignment of protein sequences. • Gene duplication and loss can be inferred from phylogenetic trees. • Protein sequences evolve more slowly than DNA sequences (due to redundancy in the genetic code).
Gene duplication Gene duplication due to unequal crossing over during meiosis can create gene families. Sequence and function of different members of a gene family can diverge.
Sequence homology (1) Genes are said to be homologous if they share a common evolutionary ancestor. Orthologues are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologues retain the same function in the course of evolution (e.g. myoglobin in mammals).
Sequence homology (2) Paralogous genes are related by duplication within a genome. Paralogues often evolve new functions, even if these are related to the original one. In-paralogues are paralogues that were duplicated after a speciation event and are therefore in the same species. Out-paralogues are paralogues that were duplicated before a speciation event; they are not necessarily in the same species.
Paralogues • A, B and C are different species • α and β are different paralogues of the same gene • Figure labels: out-paralogues, in-paralogues
TOR gene duplication events in fungi TOR: protein kinase, subunit of a complex that regulates cell growth in response to nutrient availability and cellular stresses
Taxon sampling methods • BLAST: easiest, though subjective • Occurrence of a Pfam (protein family) motif • Clustering, e.g. • INPARANOID http://inparanoid.sbc.su.se/cgi-bin/index.cgi • orthoMCL http://www.orthomcl.org/cgi-bin/OrthoMclWeb.cgi
Minimum bootstrap • A bootstrap value of 70% is thought to be broadly similar to a P-value of 0.05 • The minimum bootstrap value used depends on the study • To improve bootstrap support • Remove poorly aligned sequences if possible (poor alignment can be due to mis-annotation of genomes) • Change taxon sampling
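For context, a bootstrap value is the percentage of resampled alignments whose tree still contains a given clade. The Python sketch below shows only the resampling step and the final percentage; the tree-building step in between is left out, and the toy alignment, the flag list and the function names are hypothetical.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_cols = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def support_percent(clade_found):
    """Bootstrap support: percentage of replicates in which the clade was recovered."""
    return 100.0 * sum(clade_found) / len(clade_found)


# Hypothetical toy protein alignment (all rows the same length).
toy_alignment = {"sp_A": "MKTALGHW", "sp_B": "MKTALGHW", "sp_C": "MRTSLG-W"}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicate = bootstrap_replicate(toy_alignment, rng)
print(replicate)

# In a real analysis a tree is built from each replicate and the clade of
# interest is looked up in it; these flags are made up for illustration.
print(support_percent([True, True, False, True]))  # 75.0
```

Because whole columns are resampled, each replicate keeps the residues of every species in register; only the mix of alignment positions changes.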
Lateral gene transfer (purine-cytosine permease) • Figure labels: oomycete, fungi
Eukaryotic Tree of Life • Figure labels: Phytophthora sojae, Aspergillus oryzae
Genes that evolve quickly (1) • Synonymous substitution: change in DNA sequence that does not affect the amino acid sequence, often in the third position of a codon, e.g. CCG (Pro)→CCA (Pro). • Non-synonymous substitution: change in DNA sequence that does affect the amino acid sequence, often in the first or second position of a codon, e.g. CCG (Pro)→CAG (Gln).
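The two examples on this slide (CCG to CCA and CCG to CAG) can be checked with a short Python sketch. It builds the standard genetic code from a compact 64-letter string and compares the amino acids before and after a change; the function names are illustrative, and the sketch assumes the standard code applies to the organism in question.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_before: str, codon_after: str) -> str:
    """Label a codon change as synonymous or non-synonymous."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    return "synonymous" if before == after else "non-synonymous"


print(classify_substitution("CCG", "CCA"))  # synonymous (Pro -> Pro)
print(classify_substitution("CCG", "CAG"))  # non-synonymous (Pro -> Gln)
```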
Genes that evolve quickly (2) • For a given protein-encoding gene (comparison between orthologues in more than one species) • dN = number of non-synonymous substitutions (per non-synonymous site) • dS = number of synonymous substitutions (per synonymous site) • We can calculate the ratio dN/dS. • For most genes this is < 1 • For genes under evolutionary pressure to change protein sequence (diversify), dN/dS > 1
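A rough Python sketch of the ratio and its interpretation follows. It assumes dN and dS are expressed per site (changes divided by the number of non-synonymous or synonymous sites), which is how they are usually computed, and it applies no correction for multiple mutations at the same site. The counts are made-up numbers, and the function names are illustrative.

```python
def dn_ds(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Return dN/dS from raw counts, using simple per-site proportions."""
    dn = nonsyn_changes / nonsyn_sites
    ds = syn_changes / syn_sites
    if ds == 0:
        raise ValueError("dS is zero, so the ratio is undefined")
    return dn / ds


def interpret(omega):
    """Rough reading of a dN/dS ratio."""
    if omega > 1:
        return "diversifying selection (pressure to change the protein)"
    if omega < 1:
        return "purifying selection (protein changes mostly removed)"
    return "neutral"


# Hypothetical counts for one pair of orthologues.
omega = dn_ds(nonsyn_changes=5, nonsyn_sites=700, syn_changes=20, syn_sites=300)
print(round(omega, 3), interpret(omega))  # 0.107 purifying selection (...)
```

In practice a statistical test is used to decide whether a ratio really differs from 1, rather than a bare comparison like this one.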
Genes that evolve quickly (3) • CodeML (part of the PAML package) will calculate dN/dS for a set of orthologues from different (closely related) species. • Human vs chimpanzee: rapidly evolving genes are involved in immunity, reproduction and olfaction (smell). • Genes with very low dN/dS (under purifying selection) are involved in metabolism, intracellular signalling and nerve / brain function.