Bacterial genome annotation in the AGC group

Meeting on Cenibacterium arsenoxidans annotation - 14/04/05 Bacterial genome annotation in the AGC group Claudine Médigue Atelier de Génomique Comparative GENOSCOPE/CNRS UMR “Structure et évolution des génomes” Dir. Jean Weissenbach

What is genome annotation?
• Annotation: a note, added by way of comment or explanation.
• Typical genome annotation questions: What genes does this genome contain? What is their location? What proteins do they encode? How are they regulated? In what interactions and in what pathways do the protein products participate?

What is genome annotation? Three annotation levels (L. Stein, 2001)
• Syntactic/structural annotation (static view of the genome; EMBL; detection by content): location of genes (both protein-coding genes and RNA genes), location of regulatory signals, location of other regions (such as repeats, etc.)
• Functional annotation (SWISSPROT): biological function of the genes; Operators family
• Process (or relational) annotation (dynamic view of the genome; experimental results): how genomic objects are linked to build functional modules responsible for specific tasks in the cell, such as metabolic networks, regulatory processes, molecular assembly, …

Structural annotation tools
• Oriloc: cumulative GC skew to predict the replication origin and terminus
• tRNA-scan: tRNA gene prediction (G. Fichant et al.)
• findrRNA: rRNA gene finding
• AMIGene: CDS prediction in bacterial genomes
• ProFED: Procaryotic Frameshift Error Detection
• AFC/Kmean: statistical analysis (e.g. codon or oligonucleotide usage)
• AMIMat: CDS prediction in bacterial genomes
• Petrin: rho-independent terminator prediction (C. Term et al.)
• Spat: pattern finding such as RBS, promoters, … (A. Viari et al.)
• Nosferatu: close or distant DNA repeats (E. Rocha et al.)
Legend: from different authors / from the AGC group
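The cumulative GC skew mentioned for Oriloc can be illustrated with a minimal sketch. This is not Oriloc's implementation: the function names, the non-overlapping windows and the default window size are assumptions made for illustration only.

```python
def cumulative_gc_skew(seq, window=1000):
    """Cumulative GC skew, one value per non-overlapping window.

    The skew of a window is (G - C) / (G + C); windows without G or C
    contribute 0. The window size is an illustrative assumption.
    """
    seq = seq.upper()
    total = 0.0
    curve = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        if g + c:
            total += (g - c) / (g + c)
        curve.append(total)
    return curve


def skew_extrema(curve):
    """Window indices of the minimum and maximum of the cumulative curve."""
    lowest = min(range(len(curve)), key=curve.__getitem__)
    highest = max(range(len(curve)), key=curve.__getitem__)
    return lowest, highest
```

In many bacterial chromosomes the two extrema of such a curve fall near the replication origin and terminus, which is why they are used as candidate positions; the result is a hint to be checked, not a proof.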

Gene finding process
GTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCGTGATTTTGGATTCA...GTCGTTTAACAACGTCG
• ORF (Open Reading Frame): Stop A D N N S T Q E T A M T V I T D S V V Stop
• RBS and start candidates within the ORF
• Potential coding region: Stop M T V I T D S V V Stop
=> An ORF more than 300 nt in length is probably not a random ORF.
Coding probability? => We used a statistical property of coding regions: the composition in oligonucleotides of length k differs between coding and non-coding regions.
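The ORF step of the gene finding process can be sketched as follows. This is a simplified illustration, not the AGC tools' code: the function names are hypothetical, ORFs are taken as stop-to-stop stretches (the first ORF of each frame simply starts at the frame start), a trailing stretch without a stop codon is ignored, and the reverse strand is handled by running the same scan on the reverse complement.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_strand(seq, min_len=300):
    """Yield (frame, start, end) for ORFs longer than min_len nt.

    Coordinates are 0-based, end-exclusive, and exclude the stop codons.
    Only the given strand is scanned.
    """
    seq = seq.upper()
    for frame in range(3):
        orf_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                if pos - orf_start > min_len:
                    yield frame, orf_start, pos
                orf_start = pos + 3
```

A usage example: `list(orfs_in_strand(genome))` for the forward strand and `list(orfs_in_strand(reverse_complement(genome)))` for the reverse strand, where `genome` is any DNA string.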

Ribosome binding sites (RBS) Start codon http://cwx.prenhall.com/horton/medialib/media_portfolio/ RBS-finder (TIGR)

Gene finding: methods based on Markov models
• Statistical model: the probability that a nucleotide (A, C, G, T) is at position i depends only on the type of the k preceding nucleotides: P(X | X1...Xk), the transition probabilities.
• Practical use: a learning step produces the gene models; gene finding then searches for stop/start codon patterns (RBS) + chaining constraints.
• Programs: GeneMark (Borodovsky), Glimmer (Salzberg)
[Figure: coding probability (Pcodant) computed in a window w for the six reading frames (+1, +2, +3, -1, -2, -3; phases 1 to 3), with start and stop positions]
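The transition probabilities P(X | X1...Xk) can be sketched with a small order-k Markov model. This is a deliberately simplified, homogeneous model: programs such as GeneMark use phase-dependent models, and the function names, the pseudocount and the log-ratio score below are assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_markov(seqs, k=2, pseudo=1.0):
    """Estimate P(nucleotide | k preceding nucleotides) from training sequences.

    Returns a function prob(context, nt). A pseudocount keeps unseen
    transitions from getting probability zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def prob(context, nt):
        row = counts.get(context, {})
        total = sum(row.values()) + 4 * pseudo
        return (row.get(nt, 0.0) + pseudo) / total

    return prob


def log_likelihood(seq, prob, k=2):
    """Sum of log transition probabilities along seq."""
    seq = seq.upper()
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))


def coding_score(seq, coding, noncoding, k=2):
    """Log-likelihood ratio: positive values favour the coding model."""
    return log_likelihood(seq, coding, k) - log_likelihood(seq, noncoding, k)
```

Training one model on coding sequences and one on non-coding sequences, then scoring a window with `coding_score`, gives a crude analogue of the coding-probability curves shown on the slides.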

How are reference models built in the learning step?
COMPLETE GENOME -> extraction of the longest ORFs (500 to 1000 bp)
• Glimmer ("Glimmer-learn"): set of sequences: coding. The matrix of transition probabilities is built by assimilation (coding versus coding).
• GeneMark ("Make-mat"): set of sequences: coding + non coding. The matrix of transition probabilities is built by discrimination (coding versus non coding).
=> Gene model (matrix) which reflects the codon usage of the coding regions

Example of gene prediction: the reference matrix used by the gene finding methods is very important!
[Figure: coding-probability curves by reading frame (+3, +2, +1, -1); panels labeled E. coli, E. coli gene model, C. jejuni]

Several existing problems
[Figure: coding-probability curves by reading frame (+3, +2, +1, -1, -2, -3) computed with the Acinetobacter « native » gene model]
• The matrix used does not fit the codon usage of the genes found in this part of the sequence. Horizontal transfer?
• Start codon assignment (non-ATG / alternative starts)
• Small gene detection
• « Atypical » genes: heterogeneity in genomic sequences
Tools:
• AMIMat: building one or more gene models
• AMIGene (S. Bocs), Annotation of MIcrobial Genes: gene prediction using Markov models (such as GeneMark), with a heuristic for the selection of the most probable CDSs

AMIGene and gene models …
• Building a gene model from the user's sequence (> 10 kb)
• Using gene models computed for a set of genomes (about 80)
http://www.genoscope.cns.fr/agc/tools/amigene

Gene model construction: AMIMat strategy (S. Cruveiller presentation)

Functional annotation
« FUNCTION »?
• biochemical role
• physiological role
• mechanism
How?
• by sequence similarity (databank screening)
• by context (neighborhood): « syntenies », metabolism, …
• experimental (reporter gene; differential expression...)

Functional annotation tools
• BlastP: similarity searches in protein databanks and alignments; also used for ortholog and paralog identification
• InterProScan: searching for functional domains in the Prosite, PFAM and PRODOM databanks
• Cognitor: finding similarities in the Clusters of Orthologous Groups (COG classification)
• PRIAM: finding similarities with enzymatic profiles (enzymatic classification)
• Pathway Tools (BioCyc; P. Karp): metabolic pathway reconstruction
• Syntonizer: synteny group detection
• SignalP / TMHMM: signal peptide and transmembrane helix predictions
• AutoFAssign: automatic functional assignment
See also: D. Vallenet presentation, L. Labarre presentation
Legend: from different authors / from the AGC group

Similarity search: protein databanks
Translated CDSs (= proteome) + BlastP / FastA against SWISSALL: for each compared peptide sequence, a list of the most “similar” databank proteins (= blast hits).
• The presumed biological function is transferred by similarity (identity > 50% over 80% of the sequence length).
• Annotations such as ‘putative kinase’ are propagated to other proteins that resemble the first one less and less. => What is the similarity threshold above which two proteins can have the same function?
• Sequence similarity versus similarity in structure or function
• Incomplete/wrong databank annotations => propagation of annotation errors
• “Orphans”

Similarity search: protein motif databanks
Objective: take the modularity of proteins into account.
Translated CDSs (= proteome) + “ad hoc” program + protein domain databank: for each peptide sequence, the characteristics of the most probable protein motifs.
• Domains are recorded as “profiles”
• As many search programs as databanks (different formats) -> PROSITE, BLOCKS, PRINTS, PFAM, etc.
• Complements the BlastP results => avoids a single annotation in the case of modular proteins.

Exploring neighborhoods: characterization of orthologs
• Comparison of the proteomes of two genomes A and B.
• Each protein of Gi is aligned with all the proteins of Gj (dynamic programming).
• A pair of orthologs satisfies the one-to-one BHB relation (« Bidirectional Best Hits »).
Relations between Genome A and Genome B: 1-1 « Bidirectional Best Hits »; n-1 « Best Hits »; no hit: orphan gene.
Examples (number of genes/CDSs and percentage of them in a BHB pair):
• E. coli (4174 genes, 36.0%) / B. subtilis (4098 genes, 35.0%): BHB = 1503
• S. aureus (2593 genes, 59.8%) / B. subtilis (4098 genes, 37.9%): BHB = 1552
• E. coli (4174 genes, 57.5%) / Y. pestis (4017 genes, 59.8%): BHB = 2402
• Y. pestis (4017 genes, 87.6%) / Y. pseudotuberculosis (4347 genes, 80.9%): BHB = 3518
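The bidirectional best hit relation can be written down in a few lines. The sketch below assumes the alignment scores are already available as dictionaries keyed by (query, subject); the function names and the data layout are illustrative, not those of the AGC pipeline.

```python
def best_hits(scores):
    """Best-scoring subject for each query.

    scores maps (query, subject) to an alignment score; higher is better.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)
```

Proteins that have a best hit without reciprocity fall into the n-1 « Best Hits » case, and proteins absent from `best_hits` altogether correspond to orphan genes.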

Groups of Orthologous Genes = COG (Koonin) http://www.ncbi.nlm.nih.gov/COG/
One COG = a set of proteins expected to derive from a common ancestral protein.
Principle:
• pairwise comparisons of the proteomes of 70 bacterial genomes
• clustering of orthologous genes (BBH): each cluster forms a particular functional class

PkGDB: Procaryotic Genome DataBase
Objective: ‘clean’, consistent annotation data as the basis for comparative genomics methodologies
• Relational DBMS (MySQL)
• Complete genomes (NCBI RefSeq)
• Integration into PkGDB: data homogeneity, handling of ‘frameshifts’

Process for integrating public data into PkGDB
• Databank files -> PkGDB, Databank_Annotation: data from the databanks
• Compare_Annotation: set of ‘valid’ CDSs; ‘valid’ databank CDSs (1)
• Building the pre-matrices (transition probabilities / Markov model)
• Coding probability curves
• Correction/verification of ‘problem’ CDSs
• Annotation of pseudogenes
• All CDSs: CDS set (1) + CDSs whose boundaries have been corrected automatically OR are to be corrected manually

Example of corrections: annotation of pseudogenes
gene 622524..624571
  /gene="kdpB"
  /locus_tag="S0610"
  /note="frameshift"
  /pseudo
  /db_xref="GeneID:1077039"
gene 624580..625152
  /gene="kdpC"
  /locus_tag="S0611"
CDS 624580..625152
  /gene="kdpC"
  /locus_tag="S0611"
  /function="enzyme; Transport of small molecules: Cations"
  /codon_start=1
  /transl_table=11
  /product="potassium-transporting ATPase"
gene 625145..627825
  /gene="kdpD"
  /locus_tag="S0612"
  /note="frameshift"
  /pseudo
gene 627822..628507
  /gene="kdpE"
  /locus_tag="S0613"
  /note="frameshift"
  /pseudo
gene 629197..631394
  /gene="speF"
  /locus_tag="S0614"
  /note="frameshift"
  /pseudo
…
Error type = ‘No3multiple’
[Figure: map of kdpB, kdpC, kdpD, kdpE and speF; legend: ‘complex’ CDS (type cCDS), ‘fragment’ CDSs (type fCDS)]
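The ‘No3multiple’ error type can be illustrated by checking feature lengths. The sketch assumes that the label means "feature length is not a multiple of 3", which is an interpretation of the slide rather than a documented PkGDB definition; the function names are hypothetical and only simple locations (`start..end`, optionally wrapped in `complement(...)`) are handled.

```python
import re


def feature_length(location):
    """Length in nt of a simple feature location such as '622524..624571'.

    Coordinates are 1-based and inclusive; join(...) locations are not handled.
    """
    match = re.search(r"(\d+)\.\.(\d+)", location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    return end - start + 1


def is_no3multiple(location):
    """True when the feature length is not a multiple of 3 (assumed meaning)."""
    return feature_length(location) % 3 != 0
```

On the example above, the kdpC CDS (624580..625152, 573 nt) passes the check, while the features flagged as frameshifts, such as kdpB (622524..624571, 2048 nt), do not.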

Process for integrating public data into PkGDB
• Databank files -> PkGDB, Databank_Annotation: data from the databanks
• Compare_Annotation: set of ‘valid’ CDSs; ‘valid’ databank CDSs (1)
• Building the pre-matrices (transition probabilities / Markov model)
• Coding probability curves
• Correction/verification of ‘problem’ CDSs
• Annotation of pseudogenes
• All CDSs: CDS set (1) + CDSs whose boundaries have been corrected automatically OR are to be corrected manually
• Corrected/validated CDSs (2)
• AMIMat: construction of the gene models
• Compare_Annotation: databank annotations, status = ‘Checked’

PkGDB: Procaryotic Genome DataBase
Objective: ‘clean’, consistent annotation data as the basis for comparative genomics methodologies
• Relational DBMS (MySQL)
• Complete genomes (NCBI RefSeq)
• Integration into PkGDB: data homogeneity, handling of ‘frameshifts’
• Syntactic re-annotation: completion/correction of the data

MICheck: (syntactic) re-annotation of bacterial genomes
http://www.genoscope.cns.fr/agc/tools/micheck
Objective: quickly check whether the annotations recorded in the sequence databanks for a given genome are complete.
• Input: EMBL or GenBank file (nucleotide sequence + annotations) + gene model(s)
• Computation of the mean coding probability
• COMPARISON of annotated genes and predicted CDSs, based on the position of the stop codons
• Output: CDSs unique to AMIGene, CDSs unique to the databanks, common CDSs
Cruveiller et al. (2005) MICheck: A Web tool to fast check annotations of bacterial genomes. Nucleic Acids Research (under revision)
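The comparison step based on stop codon positions can be sketched as a set operation. This is an illustration under stated assumptions, not MICheck's code: a CDS is represented as a hypothetical `(start, end, strand)` tuple with 1-based inclusive coordinates, and two CDSs are considered the same gene when they share strand and stop position, whatever their start.

```python
def stop_key(cds):
    """Key a CDS by strand and stop-codon end.

    On the '+' strand the stop lies at 'end'; on the '-' strand at 'start'.
    """
    start, end, strand = cds
    return (strand, end if strand == "+" else start)


def compare_annotations(databank, predicted):
    """Split CDSs into common, databank-only and prediction-only sets."""
    db = {stop_key(c): c for c in databank}
    pr = {stop_key(c): c for c in predicted}
    return {
        "common": sorted(db.keys() & pr.keys()),
        "unique_databank": sorted(db[k] for k in db.keys() - pr.keys()),
        "unique_predicted": sorted(pr[k] for k in pr.keys() - db.keys()),
    }
```

Keying on the stop codon makes the comparison tolerant of start-codon disagreements, which are a separate correction problem.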

Bacterial genome re-annotation projects
• TIGR's CMR (Comprehensive Microbial Resource) database: «Primary annotation»: original annotations + «TIGR annotation»: automatic annotations. Additional genes (available for consultation only).
• Public sequence databanks
• NCBI (GenBank): RefSeq (Reference Sequence) project. Reviewed RefSeq: automatic annotations + manual curation by NCBI experts. Provisional RefSeq: automatic annotations only. Genes added/removed. Provisional RefSeq: original annotations.

MICheck results on A. pernix (status Reviewed RefSeq)
Original GenBank (BA000002) versus RefSeq file (NC_000854)
[Figure: map with APE1077, APE1087a, APE1088a, APE1089, APE1097 and rplX]
[Table: CDSs unique to the databanks, CDSs unique to AMIGene, common CDSs. BA000002: 18, 1565, 941. NC_000854: 35, 1569, 186.]

MICheck results on O. iheyensis (status Reviewed RefSeq)
Original GenBank (BA000028):
gene complement(2047445..2047618)
  /gene="OB2021"
CDS complement(2047445..2047618)
  /gene="OB2021"
  /product="hypothetical protein"
gene 2047725..2048765
  /gene="OB2022"
CDS 2047725..2048765
  /gene="OB2022"
  /EC_number="3.5.1.28"
  /product="N-acetylmuramoyl-L-alanine amidase (partial)"
  /translation="MKLTTLISTIL…"
gene complement(2048799..2049245)
  /gene="OB2023"
CDS complement(2048799..2049245)
  /gene="OB2023"
RefSeq file (NC_004193):
gene complement(2047445..2047618)
  /locus_tag="OB2021"
  /db_xref="GeneID:1018510"
CDS complement(2047445..2047618)
  /locus_tag="OB2021"
  /product="hypothetical protein"
misc_feature 2047725..2048765
  /note="similar to N-acetylmuramoyl-L-alanine amidase"
gene complement(2048799..2049245)
  /locus_tag="OB2023"
  /db_xref="GeneID:1018512"
CDS complement(2048799..2049245)
  /locus_tag="OB2023"
  /note="CDS_ID OB2023
[Table: CDSs unique to the databanks, CDSs unique to AMIGene, common CDSs. BA000028: 2, 3406, 18. NC_004193: 14, 3392, 18.]

Bacterial genome re-annotation projects
• TIGR's CMR (Comprehensive Microbial Resource) database: «Primary annotation»: original annotations + «TIGR annotation»: automatic annotations. Additional genes (available for consultation only).
• Public sequence databanks
• NCBI (GenBank): RefSeq (Reference Sequence) project. Reviewed RefSeq: automatic annotations + manual curation by NCBI experts. Provisional RefSeq: automatic annotations only. Genes added/removed. Provisional RefSeq: original annotations.
• EBI (EMBL): Genome Reviews project. Fewer genes.
  • Enrichment/correction of the original functional annotations (UniProt data, Genome Ontology, InterPro, etc.)
  • Standardization/homogenization of the original annotations
  • Detection and removal of ‘erroneous’ annotations (Xanthippe system)

MICheck results on S. oneidensis (status Reviewed RefSeq)
Original GenBank (AE005176) versus Genome Review file (AE005176_GR)
[Table: CDSs unique to the databanks, CDSs unique to AMIGene, common CDSs. AE005176: 4114, 20, 216. AE005176_GR: 150, 4144, 0.]

Original annotation file and EMBL (GR) file
Original GenBank (AE005176):
gene 3266258..3268062
  /gene="dctB"
  /locus_tag="SO3137"
  /note="This region contains an authentic frame shift and is not the result of a sequencing artifact; C4-dicarboxylate transport sensor protein, authentic frameshift"
gene 3268059..3269438
  /gene="dctD"
  /locus_tag="SO3138"
CDS 3268059..3269438
  /gene="dctD"
  /locus_tag="SO3138"
  /note="similar to GB:X14046, SP:P11049, and PID:29794; identified by sequence similarity; putative"
  /codon_start=1
  /transl_table=11
  /product="C4-dicarboxylate transport transcriptional regulatory protein"
gene complement(3269514..3272585)
  /locus_tag="SO3139"
  /note="This region contains an authentic frame shift and is not the result of a sequencing artifact; conserved hypothetical protein; identified by Glimmer2; putative"
gene complement(3273023..3273601)
  /locus_tag="SO3140"
CDS complement(3273023..3273601)
  /locus_tag="SO3140"
  /note="identified by match to PFAM protein family HMM PF00265"
  /codon_start=1
  /transl_table=11
  /protein_id="AAN56142.1"
  /product="thymidine kinase"
gene 3274138..3276066
  /locus_tag="SO3141"
  /note="This region contains a gene with one or more premature stops or frameshifts, and is not the result of a sequencing artifact; cytochrome c, degenerate; similar to GP:3628769; identified by sequence similarity; putative"
…
Genome Review file (AE005176_GR):
FT CDS 3264761..3266158
FT   /codon_start=1
FT   /gene="dctM {UniProt/TrEMBL:Q8ECK2}"
FT   /locus_tag="SO3136 {UniProt/TrEMBL:Q8ECK2}"
FT   /product="C4-dicarboxylate transport protein …
FT CDS 3268059..3269438
FT   /codon_start=1
FT   /gene="dctD {UniProt/TrEMBL:Q8ECK1}"
FT   /locus_tag="SO3138 {UniProt/TrEMBL:Q8ECK1}"
FT   /product="C4-dicarboxylate transport transcriptional regulatory protein {UniProt/TrEMBL:Q8ECK1}"
FT CDS complement(3273023..3273601)
FT   /codon_start=1
FT   /gene="tdk {UniProt/Swiss-Prot:Q8ECK0}"
FT   /locus_tag="SO3140 {UniProt/SwissProt:Q8ECK0}"
FT   /product="Thymidine kinase {UniProt/Swiss-Prot:Q8ECK0}"
FT   /EC_number="2.7.1.21 {UniProt/Swiss-Prot:Q8…}"
FT   /function="ATP binding {GO:0005524}"
FT   /function="thymidine kinase activity {GO:0004797}"
FT   /biological_process="DNA metabolism {GO:0006259}"
FT CDS 3276288..3278438
FT   /codon_start=1
FT   /gene="dcp-1 {UniProt/TrEMBL:Q8ECJ9}"
FT   /locus_tag="SO3142 {UniProt/TrEMBL:Q8ECJ9}"
FT   /product="Peptidyl-dipeptidase Dcp"
FT   /function="metalloendopeptidase activity {GO:0004222}"
FT   /biological_process="proteolysis and peptidolysis {GO:0006508}"
[Callouts: the slide repeats the three frameshift /note entries of the original file (SO3137, SO3139, SO3141) next to the Genome Review listing.]

PkGDB: Procaryotic Genome DataBase
Objective: ‘clean’, consistent annotation data as the basis for comparative genomics methodologies
• Relational DBMS (MySQL)
• Complete genomes (NCBI RefSeq)
• Integration into PkGDB: data homogeneity, handling of ‘frameshifts’
• Syntactic re-annotation: completion/correction of the data
• New genomes (annotation projects)
• Analysis results:
  • Intrinsic: genes, signals, repeats, …
  • Extrinsic: Blast, InterPro, COG, syntenies …

General strategy for bacterial genome annotation -1-
• Sequencing
• Automatic gene prediction: prediction of coding regions, promoters, terminators, RNAs
• Functional annotation (automatic): similarity searches, assignments to protein families, sequence features, …; suggestion of function, classification
• Manual annotation: validation of automatic annotations, additional database and literature searches, contextual analysis, gene fusions, protein interactions, phylogenetic profiles
• Integration into other analysis platforms
• Re-annotation: validation and update of previous annotations; expression data, knock-out phenotypes, etc.
Biological databases

General strategy for bacterial genome annotation -2-
• Sequencing: lab work + bioinformatics
• Automatic gene prediction: bioinformatics (AUTOMATION needed)
• Functional annotation (automatic): bioinformatics, biological databases
• Manual annotation: manual effort (VISUALIZATION needed)
• Integration into other analysis platforms: bioinformatics
• Re-annotation: lab work + bioinformatics

General schema of the MaGe system
• Data sources: public databanks, specialized databases, «private» sequences
• PkGDB (MySQL DB)
• Databases for annotation and re-annotation projects: Bacillus Scope, Yersinia Scope, ColiScope, FrankiaDB, AcinetoDB, HaloplanktisDB
• Analysis tools: tRNAscan-SE, Blast, PRIAM, COGnitor, InterProScan, TMHMM
• «AutoFunc»: automatic functional assignment combining multiple evidence and synteny results
• GRAPHICAL ANNOTATION INTERFACE (Web server connected to the database): validation and completion of the automatic annotation; (re)annotation using synteny results

Automatic functional assignment module (AutoFunc) -1-
Reference genomes: E. coli and Acinetobacter ADP1
• DA = « Definitive_Annotation »: IF identity > 40% AND alignment on 80% of the protein lengths, OR identity > 30% AND alignment on 80% of the protein lengths AND SYNTENY
  /product: description of the best hit: DA_SWALL, OR the one of Monica R. (EcoGene database) IF one E. coli protein is similar to the annotated gene: DA_COLI
  /gene: gene name and synonyms from the EcoGene database IF one E. coli protein is similar to the annotated gene
  /function: functional classification (E. coli)
  /EC_number: PRIAM EC number(s)
  /label: CDS name (very different from gene name!) = CENARnumber
• PM = Partial_Match: IF identity > 40% AND partial alignment (> 80% of the databank protein length)
  /product: description of the best hit: PM_SWALL, OR the one of Monica R. (EcoGene database) IF one E. coli protein is similar to the annotated gene: PM_COLI + (partial match)

Automatic functional assignment module (AutoFunc) -2-
• FO = Fragment_Of: IF identity > 40% AND partial alignment (> 80% of the query protein length)
  /product: description of the best hit: PM_SWALL, OR the one of Monica R. (EcoGene database) IF one E. coli protein is similar to the annotated gene: PM_COLI + (partial)
• PA = Putative_Annotation: IF 30% < identity < 40% AND alignment on 80% of the protein lengths
  /product: Putative/Probable (?) + description of the best hit PA_SWALL, OR the one of E. coli PA_COLI
• HP = Hypothetical_Protein: IF identity < 30%: no significant databank similarity
  /product: Hypothetical protein / Orphan protein?
• /note: summary of the 3 SWALL best hits
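The AutoFunc decision rules of the two slides above can be condensed into one function. This is a sketch of the rules as stated, not the module's code: the function name and parameters are hypothetical, "alignment on 80% of the protein lengths" is read as a coverage of at least 0.8 on both proteins, identity exactly equal to 40% is placed in PA, and every combination the slides leave unspecified falls back to HP.

```python
def autofunc_class(identity, query_cov, databank_cov, synteny=False, full=0.8):
    """Return 'DA', 'PM', 'FO', 'PA' or 'HP' for a query and its best hit.

    identity is in percent; query_cov and databank_cov are the fractions of
    the query and databank protein lengths covered by the alignment.
    """
    full_length = query_cov >= full and databank_cov >= full
    if full_length and (identity > 40 or (identity > 30 and synteny)):
        return "DA"  # Definitive_Annotation
    if identity > 40 and databank_cov > full:
        return "PM"  # Partial_Match: the databank protein is mostly covered
    if identity > 40 and query_cov > full:
        return "FO"  # Fragment_Of: the query protein is mostly covered
    if full_length and 30 < identity <= 40:
        return "PA"  # Putative_Annotation
    return "HP"      # Hypothetical_Protein (assumed fallback)
```

For example, `autofunc_class(35, 0.9, 0.9)` gives a putative annotation, and the same hit is promoted to a definitive annotation when it is supported by synteny.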

Definitive annotation: example 2.1.1: DNA replication

Definitive annotation, partial match: example
Ratios of alignment lengths, with Lmatch (length of the match), Lprot1 (length of protein 1) and Lprot2 (length of protein 2):
minL = Lmatch / min(Lprot1, Lprot2)
maxL = Lmatch / max(Lprot1, Lprot2)
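The two ratios are straightforward to compute; the short sketch below only restates the formulas, with a hypothetical function name and made-up lengths in the usage note.

```python
def alignment_ratios(l_match, l_prot1, l_prot2):
    """Return (minL, maxL) as defined on the slide.

    minL relates the match length to the shorter protein,
    maxL relates it to the longer one.
    """
    min_l = l_match / min(l_prot1, l_prot2)
    max_l = l_match / max(l_prot1, l_prot2)
    return min_l, max_l
```

With a match of 200 residues between proteins of 250 and 500 residues, minL is 0.8 and maxL is 0.4: the shorter protein is almost fully covered while the longer one is not, which is the kind of situation the partial cases describe.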

MaGe view of CENAR0426

Definitive annotation, partial: example

MaGe view of CENAR0361
Probable sequence error -> the beginning of the gene is missing (set CENAR361 to CheckSeq)

« Partial » and « partial match »: other cases
[Figure: MaGe map showing CENAR3149, 3150, 3151, CENAR3153 and CENAR3156, with « partial match » and « partial » labels and the genes mdoH, mdoH, mdoG]
• CENAR3149/3950: « CheckSeq »
• CENAR3153/56: adjust the start codon
