
Genome Trees and the Nature of Genome Evolution

Berend Snel, Martijn A. Huynen and Bas E. Dutilh. Presented by Audrey Noël. Genome Trees and the Nature of Genome Evolution. Introduction - Multiple alignments. Most existing approaches for phylogenetic inference use multiple alignment



Presentation Transcript


  1. Genome Trees and the Nature of Genome Evolution • Berend Snel, Martijn A. Huynen and Bas E. Dutilh • Presented by Audrey Noël

  2. Introduction - Multiple alignments • Most existing approaches for phylogenetic inference use multiple alignment • They assume some evolutionary model and are computationally expensive • They become misleading in the presence of gene rearrangements, inversions, transpositions and translocations • They do not apply directly to complete genomes, where events such as rearrangements make traditional full-length alignments impossible • They are therefore insufficient for phylogenies built from complete genomes

  3. Introduction • Archaeal organisms appear to be close to Eukarya when the protein synthesis machinery is considered, but close to Bacteria when metabolic genes are compared. These differences reflect problems in phylogenetic reconstruction due to • Horizontal transfer • Unequal rates of nucleotide substitution • Gene displacement • Scientists today approach the classification of organisms differently than they did just a few decades ago • Molecular technologies such as PCR and sequencing allow more precise genetic observations • The availability of multiple complete genome sequences requires the development of new phylogenetic approaches

  4. Introduction • How can genomic information be used to obtain useful information concerning genome evolution? • Complete genome trees are less affected than phylogenies based on single genes by • Horizontal gene transfer • Paralogy • Highly variable rates of gene evolution • Misalignment

  5. Lateral gene transfer, also called horizontal gene transfer (HGT) • Refers to the transfer of genes or genetic material directly from one individual to another by processes similar to infection • Implies that genes can be transferred between distant species that would never interbreed in nature • Leads to gene phylogenies that are inconsistent with the species phylogeny

  6. Horizontal gene transfer • Problems: • If a plant acquires a gene from an archaeon, a tree built from that gene will place the plant close to Archaea rather than with other plants • Produces complex trees with criss-crossing branches rather than fan-shaped trees • HGT accounts for a minority of the anomalous phylogenetic events observed in fully sequenced genomes • Gene loss and gene duplication pose more frequent challenges to genome phylogeny • For archaeal and bacterial genomes, <15% of phylogenetically troublesome events are due to HGT

  7. Definitions • Phylogeny: the origin and evolution of a set of organisms. It uses evolutionary distance as the main criterion for taxonomy • Phylogenetic/genome tree: a graphical representation showing the evolutionary relationships among taxonomic units. Taxonomic units: species, populations, individuals or genes • Branches of the tree are connected at ancestral taxonomic units (nodes) • Living units are the ends of the branches • Branch length represents the number of changes that have occurred

  8. Rooted vs Unrooted • Rooted tree: • Directed tree with a unique node corresponding to the most recent common ancestor of all the entities at the leaves of the tree • Unrooted tree: • Tree derived from a rooted phylogenetic tree by omitting the root • It stands for the whole set of rooted trees that can be obtained by placing the root on any one of its branches

  9. Definitions • Dichotomous tree: each internal node has exactly two descendants • Polytomous tree: at least one internal node has three or more descendants

  10. Taxon • (Figure: diagram with taxa labelled A to F) • Taxon: group with common attributes • Monophyletic taxon: includes all the evolutionary descendants of the taxon's common ancestor and only those descendants • Ex: mammals, birds, insects • Paraphyletic taxon: includes descendants of a single common ancestor, but not all of them • Ex: fish, invertebrates • Polyphyletic taxon: descended from more than one ancestor • Ex: marine mammals, bipedal mammals, flying vertebrates, algae

  11. Definitions • Homologs: similar sequences that have been derived from a common ancestor sequence • Orthologs: similar sequences in 2 different organisms that have arisen due to a speciation event • Paralogs: similar sequences within a single organism that have arisen due to a gene duplication event

  12. 5 classes of genome trees based on different aspects of the genome • Alignment-free trees • Gene content trees • Chromosomal gene order trees • Average sequence similarity trees • Phylogenomic trees

  13. Alignment-free trees • Based on statistical properties of the genome • Two categories of methods are used: • Statistics of word frequency (DNA string) • Shared information

  14. DNA string • Does not rely on homology • Count the frequency of oligopeptide strings of a fixed length in the collection of protein sequences • The counts are combined into a word-frequency vector and the distance is defined in a Cartesian space • Angle between 2 vectors = distance between 2 genomes • Trees are constructed using standard distance-based algorithms
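
As a rough illustration of the word-frequency idea on this slide, the Python sketch below counts overlapping strings of length k in a set of protein sequences and takes the angle between two count vectors as the distance. The names `kmer_vector` and `angle_distance` and the toy sequences are hypothetical, and the sketch compares raw counts only; published methods may normalise or correct the counts before comparing them.

```python
import math
from collections import Counter


def kmer_vector(proteins, k):
    """Count every overlapping string of length k across a list of protein sequences."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def angle_distance(u, v):
    """Angle in radians between two sparse count vectors (0 = same direction)."""
    dot = sum(c * v[w] for w, c in u.items() if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cannot compare an empty word-frequency vector")
    # Clamp to [-1, 1] so rounding error never pushes acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


# Toy proteomes (made up for illustration only).
genome_a = ["MKVLAAGIV", "MSTNPKPQR"]
genome_b = ["MKVLSAGIV", "MSTNPRPQR"]
print(angle_distance(kmer_vector(genome_a, 3), kmer_vector(genome_b, 3)))
```

The resulting pairwise angles can be collected into a distance matrix and handed to any standard distance-based tree builder; the only choice left to the user is k.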

  15. DNA string : advantages • Early analyses of biological sequences compared G+C content or amino acid composition • Extending single-letter counting to longer strings increases the resolving power of the analysis • No choice of genes has to be made (no ambiguity) and no multiple alignment of sequences is needed • The only parameter is the length of the oligopeptides; there are no other free parameters

  16. DNA string: disadvantages • Placement problems related to small genome size • But good results when applied to small chloroplast genomes alone • The approach needs more justification and further study • It should be tested by including new complete genomes, especially those of Eukaryotes

  17. Shared information • Based on algorithmic compression (Lempel-Ziv complexity) • Identifies the regularities in a given DNA sequence • These regularities may have biological implications • Distance between 2 genomes a and b = length of the shortest computer program that outputs a when given b as input
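
The shortest-program length on this slide cannot be computed exactly, so in practice it is approximated with a real compressor. The sketch below shows one common approximation, the normalized compression distance, using Python's `zlib` (a DEFLATE compressor built on LZ77) as a stand-in. It is not necessarily the measure used in the work these slides summarise, and the names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Number of bytes zlib needs for the data at its highest compression level."""
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two sequences.

    If b shares many regularities with a, compressing them together costs
    little more than compressing one alone, so the value is close to 0.
    Unrelated sequences give values close to 1.
    """
    xa, xb = a.encode("ascii"), b.encode("ascii")
    size_a, size_b = compressed_size(xa), compressed_size(xb)
    size_ab = compressed_size(xa + xb)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)
```

Note that zlib only looks back over a 32 KB window, so this stand-in is suitable for short toy sequences; whole genomes would need a compressor designed for long DNA strings.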

  18. LZ : advantages • Able to perform comparisons at the whole-genome level, where multiple alignment methods fail • Uses all the information contained in the sequences and requires no human intervention • Unequal sequence lengths are not problematic

  19. LZ : disadvantage • LZ compression replaces the repeated patterns it detects with references to a dictionary • The larger the dictionary, the more bits are needed for each reference

  20. Alignment-free trees : applications • Phylogeny of the Eutherian (placental mammal) orders built from complete, unaligned mitochondrial genomes • The result is consistent with the commonly accepted phylogeny • 109 organisms: 16 Archaea, 87 Bacteria, and 6 Eukarya • The result is an unrooted tree that agrees with the biologists' "tree of life"

  21. Genome trees based on shared gene content • Distance represents the fraction of orthologous genes shared between genomes • Use distance algorithms to construct the tree • neighbor joining • minimum evolution • Works when there are few horizontal transfer events, or when the events occur mainly between closely related species

  22. Gene content : Genome size effect • Problem: a large genome can share more genes with other large genomes than with its more closely related but smaller cousins • There are 2 ways to correct this effect • Divide the number of shared genes by the number of genes in the smaller genome, which is the maximum number of genes the two genomes can share • Leave out the small genomes
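
The first correction on this slide can be sketched in a few lines of Python. The sketch assumes each genome is already given as a set of orthologous-group identifiers (the identifiers and the name `gene_content_distance` are hypothetical) and turns the normalised shared fraction into a distance by subtracting it from 1, which is an assumption made here for illustration.

```python
def gene_content_distance(genes_a, genes_b):
    """1 minus the shared genes divided by the size of the smaller genome.

    The smaller genome sets the maximum number of genes the pair can share,
    so a small genome fully contained in a large one gets distance 0.
    """
    a, b = set(genes_a), set(genes_b)
    if not a or not b:
        raise ValueError("both genomes need at least one gene")
    shared = len(a & b)
    return 1.0 - shared / min(len(a), len(b))


# Toy example: a small genome whose genes all occur in a larger relative.
small = {"og1", "og2", "og3", "og4"}
large = {"og1", "og2", "og3", "og4", "og5", "og6", "og7", "og8"}
print(gene_content_distance(small, large))
```

Dividing by the smaller genome rather than by the union keeps a reduced genome close to its large relatives, which is the effect the slide is trying to correct for.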

  23. Gene content : applications • Divides 174 taxa into Archaea, Bacteria, and Eukarya • Sorts most of the major groups within these superkingdoms • Not every organism appears exactly at its classical phylogenetic position in these trees • Applied to 11 complete genomes of free-living microorganisms • Additional phylogenetic relationships appear to be resolved • Applied to clusters of orthologous groups data to construct a tree of herpesviruses • The tree agrees well with those based on other methods • The tree is robust when tested by bootstrap analysis

  24. Genome trees based on shared gene content : disadvantages • Factors that indirectly affect the position of an organism • Genome size • Loss or acquisition of genes • Including small genomes, which may have undergone massive gene loss, may distort the genome tree because it limits the proportion of genes they can share by common ancestry with other genomes

  25. Trees based on gene order • Based on the position of genes in the chromosome or chromosomes that compose the genome of the analyzed species • Estimate evolutionary distance from the number of rearrangements necessary to transform one genome into another
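
Counting the exact number of rearrangements between two genomes is hard in general, so a simpler proxy is often easier to reason about. The sketch below counts breakpoints: pairs of genes that are neighbours in one genome but not in the other. It is a stand-in chosen for illustration, not necessarily the distance used in the work summarised here, and it assumes one linear chromosome, unsigned genes, and each gene occurring once. The names `adjacencies` and `breakpoint_distance` are hypothetical.

```python
def adjacencies(order):
    """Set of unordered neighbouring gene pairs along a linear chromosome."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def breakpoint_distance(order_a, order_b):
    """Number of adjacencies in genome A that are broken in genome B.

    Only genes present in both genomes are compared, so differences in
    gene content do not count as rearrangements.
    """
    shared = set(order_a) & set(order_b)
    a = [gene for gene in order_a if gene in shared]
    b = [gene for gene in order_b if gene in shared]
    return len(adjacencies(a) - adjacencies(b))


# Toy example: genome B carries an inversion of the segment g2..g4.
genome_a = ["g1", "g2", "g3", "g4", "g5"]
genome_b = ["g1", "g4", "g3", "g2", "g5"]
print(breakpoint_distance(genome_a, genome_b))
```

A single inversion breaks at most two adjacencies, so the breakpoint count gives a cheap lower-bound style signal of how many rearrangements separate two genomes.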

  26. Trees based on gene order : disadvantages • Gene order is well conserved between closely related species, in both prokaryotes and eukaryotes, but much less so between distant ones • Because transcription in prokaryotes is organised in operons, some genes must stay together, so gene order is more conserved in prokaryotes than in eukaryotes

  27. Gene order : application • 11 genomes of species belonging to the lactic acid bacteria • The tree does not provide much additional information about relationships among bacterial taxa compared with more traditional alignment-based methods • The study can provide other kinds of information, such as which genes are shared, when genes were lost in evolutionary history, and whether HGT is present • BUT the lack of gene order conservation across species makes this approach less suitable for comparing distantly related organisms • (Image caption: by converting malic acid into lactic acid, Oenococcus oeni plays a critical role in de-acidifying wine)

  28. Differences between gene content and gene order • Although both correlate with evolutionary distance, gene order evolves faster • E. coli and H. influenzae (a Gram-negative bacterium) share 78% of their genes, while their gene order is only 36% conserved • The gene order tree showed some improbable higher-order affiliations, reflecting a lack of resolution at longer evolutionary distances where too many gene rearrangements have occurred, whereas the gene content tree behaved normally at these distances • (Image: Haemophilus influenzae)

  29. Genome trees based on average sequence similarity • Run BLAST comparisons between the DNA sequences of each pair of complete genomes • BLAST (Basic Local Alignment Search Tool) • Program that takes a sequence as input and finds all similar sequences in a database • Build a similarity matrix in which each cell holds the BLAST score (a measure of similarity) between 2 genomes • The matrix is passed to the neighbor joining method to build a tree: the 2 species with the best score are joined first, and so on • Differs from the other methods because it ignores any knowledge of orthology
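
The "join the best-scoring pair, then repeat" step described on this slide can be sketched as follows. For brevity the sketch uses plain average-linkage clustering rather than a full neighbor joining implementation, so it is a simplified stand-in. The similarity scores are made-up numbers standing in for average BLAST scores, and the names `average_similarity` and `join_by_similarity` are hypothetical.

```python
from itertools import combinations


def average_similarity(members_x, members_y, score):
    """Mean pairwise similarity between the genomes of two clusters."""
    total = sum(score[frozenset((x, y))] for x in members_x for y in members_y)
    return total / (len(members_x) * len(members_y))


def join_by_similarity(names, score):
    """Repeatedly merge the two most similar clusters; return a nested tuple tree."""
    # Each cluster is (tree, members); a tree is a genome name or a pair of trees.
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_similarity(
                clusters[ij[0]][1], clusters[ij[1]][1], score
            ),
        )
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical average similarity scores between four genomes.
scores = {
    frozenset(("A", "B")): 90, frozenset(("C", "D")): 80,
    frozenset(("A", "C")): 20, frozenset(("A", "D")): 25,
    frozenset(("B", "C")): 30, frozenset(("B", "D")): 15,
}
print(join_by_similarity(["A", "B", "C", "D"], scores))
```

Unlike neighbor joining, this simplification assumes roughly equal evolutionary rates in all lineages, so it should be read only as an illustration of how a similarity matrix drives the joining order.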

  30. Genome trees based on average sequence similarity : advantages • Straightforward to implement • Intermediate between gene content approach and sequence based approach

  31. Genome trees based on average sequence similarity : disadvantages • They compare homologous genes rather than strictly orthologous genes, which introduces noise • A filter should be applied to reduce the impact of nonorthologous homologs • Researchers are reluctant to adopt the method because it appears to combine the problems of trees based on gene content with those of trees based on sequence

  32. Genome trees based on average sequence similarity : applications • Construct trees for completely sequenced bacterial and archaeal genomes. The resulting tree supports: • The separation of bacteria and archaea • Some terminal bifurcations within the bacterial and archaeal domains

  33. Genome trees based on gene trees • Supertrees • Concatenated sequences

  34. Supertrees • Currently the only phylogenetic method that can build complete phylogenies of very large clades (hundreds of species)

  35. Conclusion • Reliable phylogenies help to understand: • The sequence of evolutionary events that generated present-day diversity • The mechanisms of evolution as well as the history of organisms • Whole-genome data already support useful applications, but the approaches still need to be improved

  36. QUESTIONS ?


  40. Comparison between gene order and alignment-free trees • Gene order • Time-consuming because it requires gene identification • Compares genomes using only part of the genome information • Alignment-free • Uses all of the genome information
