Large scale proteome comparisons Genome trees Fredj Tekaia Institut Pasteur firstname.lastname@example.org
Complete genomes • 1387 projects • 261 published (01-03-05) • 654 prokaryotes • 472 eukaryotes [Figure: tree of life, labelled with the numbers 207, 21 and 33] http://www.genomesonline.org/
[Figure: cumulative number of available completely sequenced genomes per year, 1995 to 03-2005] Completely sequenced genomes spanning the three domains of life are accumulating at a rapid rate. List and references: GOLD
Genome sequencing projects There are several web-based resources that document the progress of completely sequenced genomes and their reference publication, including: GOLD Genomes Online Database http://wit.integratedgenomics.com/GOLD/ GNN Genome News Network http://www.genomenewsnetwork.org/index.php
Resources for genomes There are two main resources for genomes: EBI European Bioinformatics Institute http://www.ebi.ac.uk/genomes/ NCBI National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov There are also many other resources from sequencing institutions: Sanger The Wellcome Trust Sanger Institute http://www.sanger.ac.uk/ TIGR The Institute for Genomic Research http://www.tigr.org Genolevures http://cbi.labri.fr/Genolevures/index.php
Definitions
Genome: the genome of a cell is the complete set of DNA it contains. The genome size is the total number of its DNA bases.
Gene: a particular DNA sequence located at a specific position on a chromosome that codes for a specific function.
Protein: a sequence of amino acids ordered according to the DNA sequence of the gene that encodes it.
Proteome: the set of proteins in an organism.
Genomics: the exhaustive study of genomes: genetic material, genes, their functions, their organization...
Chronology of completely sequenced genomes
• 1977: first viral genome, bacteriophage φX174 (5386 base pairs; encoding 11 genes), sequenced by Sanger et al.
• 1981: human mitochondrial genome; 16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA)
• 1986: chloroplast genome; 156,000 base pairs (most are 120 kb to 200 kb)
• 1995: first genome of a free-living organism, the bacterium Haemophilus influenzae, by TIGR; 1830 kb, 1713 genes.
• 1996: first archaeal genome, Methanococcus jannaschii DSM 2661, by TIGR; 1664 kb, 1773 genes.
• 1997: first eukaryotic genome, Saccharomyces cerevisiae S288C; international collaboration; 16 chromosomes; 12,057 kb, ~6000 genes.
• 1998: first multicellular organism, the nematode Caenorhabditis elegans; 97 Mb; ~19,000 genes.
• 2000: fruit fly Drosophila melanogaster (137 Mb; ~13,000 genes)
• 2000: first plant genome, Arabidopsis thaliana (115,428 kb; 22,670 genes)
• 2001: draft sequence of the human genome (x Mb; ~28000 genes)
• 2002: Plasmodium falciparum (22.9 Mb; 5334 genes)
• 2002: mouse genome (x Mb; ~28000 genes)
• 2004: draft genome of the fish Tetraodon nigroviridis (x Mb; ~28000 genes)
How big are genome sizes? Viral genomes: 1 kb to 350 kb (Mimivirus: 1.2 Mb) Bacterial genomes: 0.5 Mb to 13 Mb; Eukaryotic genomes: 8 Mb to 670 Gb; DOGS: http://www.cbs.dtu.dk/databases/DOGS/abbr_table.bysize.txt
Comparative genomics Analyses of the genetic material of different species help us understand the similarities and differences between genomes, their evolution and the evolution of their genes. • Intra-genomic comparisons reveal the degree of duplication (genome regions; genes) and gene organization,... • Inter-genomic comparisons reveal the degree of similarity between genomes and the degree of conservation between genes; • Both help us understand gene and genome evolution
Evolutionary processes include: Phylogeny* (ancestor, species genome); Expansion* (genesis, duplication, HGT); Exchange* (HGT); Deletion* (loss); and selection
Gene duplications are traditionally considered to be a major evolutionary source of new protein functions. Understanding how duplications happened, and how important this evolutionary process has been, is a key goal of genome analysis > Some examples
S. cerevisiae genome Kellis et al. Nature, 2004 Colours reveal Duplications
Duplication Speciation Deletion Actual content of the 2 copies Reconstruction of the ancestral organization Kellis et al. Nature, 2004
Nature Reviews Genetics 3, 827-837 (2002); SPLITTING PAIRS: THE DIVERGING FATES OF DUPLICATED GENES
Original version Actual version Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.
Genome duplication. a, Distribution of Ks values of duplicated genes in Tetraodon (left) and Takifugu (right) genomes. Duplicated genes broadly belong to two categories, depending on whether their Ks value is below or above 0.35 substitutions per site since the divergence between the two puffer fish (arrows). b, Global distribution of ancient duplicated genes (Ks > 0.35) in the Tetraodon genome. The 21 Tetraodon chromosomes are represented in a circle in numerical order and each line joins duplicated genes at their respective position on a given pair of chromosomes. Jaillon et al. Nature 431, 946-957. 2004.
Inter-genomic comparisons • Compositional comparisons between species (nuc and aa compositions); • Gene, protein conservation between species (rate of conservation); • Orthologs; families of orthologs; • Specific and non-specific genes; • Genes exclusively conserved in one or in a subset of species (or in domains); • Gene Dictionary; • Gene conservation profiles; • Genome tree construction; • Genome multiple alignments.
Methodology [Figure: matrix T with rows 1 ... i ... n, columns 1 ... j ... p, entries kij > 0 and supplementary elements (sup); factorial axes F1 ... Fp] Correspondence Analysis, Classification • orthogonal system; • use of Euclidean distance;
[Figure labels: Growth t°; GC%; amino acids Glu, Arg, Lys, Gln; r = 0.83, p < 1e-4] Tekaia, F., Yeramian, E. and Dujon, B. (2002) Gene 297, pp. 51-60.
Growth t° GC% 2005
Proteome comparisons: Methodology
Species specific comparisons: blastp, PAM250, SEG filter
NP: new proteome; P1 ... Pn: proteome 1 ... proteome n
Result files for NP and P1: • bestp1np • allp1np • segmatchp1np • bestnpp1 • allnpp1 • segmatchnpp1
Result files for NP and Pn: • bestnppn • allnppn • segmatchnppn • bestpnnp • allpnnp • segmatchpnnp
SPECSO
bestnppi: np1 size pij e-value1 HS/IS/NS
allnppi: np1 size pij e-value1 HS/IS/NS; np1 size pik e-value HS/IS/NS
100 species: E: 28, A: 19, B: 53
• Paralogs • Orthologs
The expected number of HSPs with score at least S is given by $E = K m n e^{-\lambda S}$, where m and n are the sequence and database lengths, and K and λ are statistical parameters of the scoring system.
Homolog - Paralog - Ortholog
[Figure: gene tree from O through A and B to A1, B1 (Species-1, S1) and A2, B2 (Species-2, S2); panels a and b; sequence analysis]
Homologs: A1, B1, A2, B2
Paralogs: A1 vs B1 and A2 vs B2
Orthologs: A1 vs A2 and B1 vs B2
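The E-value formula above can be evaluated directly. The sketch below is only an illustration: the function names are hypothetical, and the values of K and λ used in the example are placeholders, not the parameters of the PAM250 comparisons described here (BLAST reports the actual values for each scoring system).

```python
import math

def expected_hsps(score, m, n, K, lam):
    """Expected number of HSPs with score >= `score`: E = K*m*n*exp(-lam*score).

    m: query sequence length, n: database length,
    K, lam: statistical parameters of the scoring system.
    """
    return K * m * n * math.exp(-lam * score)

def bit_score(score, K, lam):
    """Normalised (bit) score S' = (lam*S - ln K) / ln 2, so that E = m*n*2**(-S')."""
    return (lam * score - math.log(K)) / math.log(2)

# Illustrative placeholder parameters (assumption, not taken from the slides).
K_DEMO, LAM_DEMO = 0.04, 0.27
e_value = expected_hsps(score=100, m=300, n=5_000_000, K=K_DEMO, lam=LAM_DEMO)
print(f"E = {e_value:.3g}")
```

Because E grows linearly with m and n, the same raw score is less significant against a larger database, which is why E-values rather than raw scores are compared across proteomes of different sizes.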
Example Comparing S. cerevisiae (SC) genome with C. elegans (CE) genome
- Paralogs - multiple matches - Partitions/clustering
SC/CE CE/SC Orthologs
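One common operational way to derive orthologs from the two directions of comparison (SC/CE and CE/SC) is the reciprocal best hit criterion. The slides do not state the exact rule used, so the sketch below is an assumption; the gene identifiers and the (query, subject, e-value) tuple layout are hypothetical.

```python
def best_hits(hits):
    """Map each query to its best subject (lowest e-value).

    hits: iterable of (query, subject, evalue) tuples.
    """
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(hits_ab, hits_ba):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(hits_ab)
    best_ba = best_hits(hits_ba)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical toy data: SC proteins against CE, and CE proteins against SC.
sc_vs_ce = [("sc1", "ce1", 1e-50), ("sc1", "ce2", 1e-20), ("sc2", "ce2", 1e-30)]
ce_vs_sc = [("ce1", "sc1", 1e-48), ("ce2", "sc1", 1e-25), ("ce2", "sc2", 1e-10)]
print(reciprocal_best_hits(sc_vs_ce, ce_vs_sc))
```

In this toy case sc2 points to ce2, but ce2 points back to sc1, so only the sc1/ce1 pair is retained.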
Partitions/MCL Clustering
A set of genes defines a "partition" if and only if a) each member of the set has at least one significant match with another member of the set; b) no member of the set has significant matches with members not included in the set; c) the set is minimal.
MCL: Markov Cluster algorithm. Stijn van Dongen: A cluster algorithm for graphs. http://micans.org/mcl/
Each gene is identified by its partition and its MCL cluster (figure labels: P7.1, P7.1.C4.1, P7.1.C3.1, P4.1, P4.2.C3.1)
Markov Cluster (MCL) algorithm http://micans.org/mcl/ • Traditionally, most methods deal with similarity relationships in a pairwise manner, while graph theory allows classification of proteins into families based on a global treatment of all relationships in similarity space simultaneously. • Similarities between proteins are arranged in a matrix that represents a connection graph. • Nodes of the graph represent proteins, and edges represent the sequence similarity that connects such proteins. • A weight is assigned to each edge by taking -log10(E-value) obtained by a BLAST comparison.
• These weights are transformed into probabilities associated with a transition from one protein to another within this graph. • This matrix is passed through iterative rounds of matrix multiplication and matrix inflation until there is little or no net change in the matrix. The final matrix is then interpreted as a protein family clustering. • The inflation value parameter of the MCL algorithm is used to control the granularity of these clusters.
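Conditions a) to c) describe the connected components of the graph whose edges are significant matches, leaving out genes that match nothing but themselves. A minimal sketch under that reading follows; the function name and gene identifiers are hypothetical.

```python
from collections import defaultdict

def partitions(matches):
    """Group genes into partitions (connected components of the match graph).

    matches: iterable of (gene_a, gene_b) pairs with a significant match.
    Self-matches are ignored, so a gene with no other match forms no partition.
    """
    graph = defaultdict(set)
    for a, b in matches:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)

    seen, parts = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            gene = stack.pop()
            if gene not in component:
                component.add(gene)
                stack.extend(graph[gene] - component)
        seen |= component
        parts.append(sorted(component))
    # Largest partitions first, then alphabetical, for stable output.
    return sorted(parts, key=lambda p: (-len(p), p))

demo = [("g1", "g2"), ("g2", "g3"), ("g4", "g5"), ("g6", "g6")]
print(partitions(demo))
```

Note that g1 and g3 end up in the same partition without matching each other directly; this is why a further MCL clustering step inside each partition is useful.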
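The rounds of multiplication (expansion) and inflation described above can be sketched in a few lines. This is a simplified illustration, not the reference implementation from micans.org: the self-loop weighting, the convergence tolerance and the way clusters are read from the final matrix are assumptions made for this example.

```python
def normalize_columns(M):
    """Scale every column to sum to 1 (transition probabilities)."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] for j in range(n)] for i in range(n)]

def expand(M):
    """Expansion step: matrix multiplication M x M."""
    n = len(M)
    return [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(M, r):
    """Inflation step: raise each entry to the power r, then renormalise."""
    return normalize_columns([[x ** r for x in row] for row in M])

def mcl(weights, inflation=2.0, max_iter=200, tol=1e-9):
    """Cluster nodes of a symmetric weight matrix, e.g. -log10(E-value)."""
    n = len(weights)
    M = [[float(x) for x in row] for row in weights]
    for i in range(n):
        M[i][i] = max(max(M[i]), 1.0)  # self-loops (assumed weighting)
    M = normalize_columns(M)
    for _ in range(max_iter):
        new = inflate(expand(M), inflation)
        change = max(abs(new[i][j] - M[i][j])
                     for i in range(n) for j in range(n))
        M = new
        if change < tol:  # little or no net change: stop
            break
    # Nodes whose columns point to the same attractor rows share a cluster.
    clusters = {}
    for j in range(n):
        attractors = tuple(i for i in range(n) if M[i][j] > 1e-6)
        clusters.setdefault(attractors, []).append(j)
    return sorted(clusters.values())

# Two tightly linked triples joined by one weak edge (toy weights).
W = [[0, 50, 50, 0, 0, 0],
     [50, 0, 50, 0, 0, 0],
     [50, 50, 0, 1, 0, 0],
     [0, 0, 1, 0, 50, 50],
     [0, 0, 0, 50, 0, 50],
     [0, 0, 0, 50, 50, 0]]
print(mcl(W))
```

Raising the `inflation` argument strengthens strong transitions at the expense of weak ones more aggressively, which generally yields smaller, tighter clusters.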
blastp proteome specific comparisons all protein significant hits Adapted from Enright et al. NAR 2002.
Example of Partition/MCL clustering P6 19 Total number of distinct ORFs= 6 --------------------
Example of Partition/MCL clustering P6 22 Total number of distinct ORFs= 6 --------------------
Gene Dictionary Table : 541880 predicted proteins x 100 species
Protein conservation profiles (phylogenetic profiles)
E A B
S1..............I.............I................Sn
G1,1 100000000000000000000000000000000000000000000000
G2,1 111111111111111111111111111111111111111111111111
G3,1 111111111111111111111111111111111111111111111111
.......................................................
Gn1,1 000001110001000000000000000000000000000000000000
G1,2 000000000000000000010100000000000000000000000000
G2,2 000000000000000000000000000000000111000011100011
........................................................
Gn2,2 111111110011111111111111011101110101111111111111
........................................................
G1,n 011110100000000000000000001000000000000000000000
G2,n 011111100000000000000000000000000000000000000000
G3,n 011111100011111111100011011011110100111111101111
........................................................
Gnp,n 100110000000000000000000000000000000000000000000
Table : 541880 predicted proteins x 100 species
Ancestral weight matrix (rows i, columns j). Wii: weight of ancestral duplication; Wij: weight of ancestral conservation of i in j; nsi: nonspecific genes in species i. [Figure: matrix with diagonal entries Wii and Wjj, off-diagonal entry Wij, and margins nsi, nsj]
Ancestral duplication
        A      B      E
mean=   52.1   30.0   38.4
std=    17.8   11.7   11.2
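Conservation profiles of this kind are easy to handle as 0/1 vectors, one position per species. The sketch below uses hypothetical helper names and a toy table of four genes over six species (not the 541880 x 100 table); it shows three simple queries: genes conserved in a single species only, the distance between two gene profiles, and the number of genes shared by two species.

```python
def parse_profiles(lines):
    """Parse 'gene_id bitstring' lines into {gene_id: [0/1, ...]}."""
    profiles = {}
    for line in lines:
        gene, bits = line.split()
        profiles[gene] = [int(b) for b in bits]
    return profiles

def species_specific(profiles):
    """Genes with a match in exactly one species (here: only their own)."""
    return sorted(g for g, p in profiles.items() if sum(p) == 1)

def jaccard_distance(p, q):
    """1 - |shared species| / |species where either gene is conserved|."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return 1.0 - both / either if either else 0.0

def shared_genes(profiles, s, t):
    """Number of genes conserved in both species s and t (0-based columns)."""
    return sum(1 for p in profiles.values() if p[s] and p[t])

toy = parse_profiles(["G1 100000", "G2 111111", "G3 011100", "G4 011110"])
print(species_specific(toy))
print(jaccard_distance(toy["G3"], toy["G4"]))
print(shared_genes(toy, 1, 2))
```

Counts such as `shared_genes` for every pair of species give one possible starting point for a species-by-species similarity table; the weighting actually used for the genome trees in these slides is the ancestral weight matrix, which this sketch does not reproduce.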