Annotation of the Laccaria genome

Jan WUYTS
INRA Tree-Microbe Interactions Unit, INRA, 54280 Champenoux, France
VIB Department of Plant Systems Biology, Ghent University, Technologiepark 927, 9052 Gent, Belgium
jan.wuyts@psb.ugent.be

Presentation overview
• Assembly
• EuGène
• Latest annotation
• Clustering of genes
• Ks distribution
• Duplicated segments
• Gbrowse database

Assembly 20050315
• Map reads to the assembly
  • BLASTn, E <= 1e-50
  • >= 97% sequence identity
• Calculate coverage for the top 15 scaffolds
• Identify reads that map to more than one location
  • discard them
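The read filter described on this slide can be sketched as follows. This is a minimal illustration, not the pipeline that was actually used: it assumes the BLASTn hits are available in the standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), and the function name `uniquely_mapped_reads` is hypothetical. A read with several passing hits is treated as a multihit even if the hits lie on the same scaffold.

```python
from collections import defaultdict


def uniquely_mapped_reads(blast_rows, max_evalue=1e-50, min_identity=97.0):
    """Return {read: (scaffold, start, end)} for reads with exactly one passing hit.

    blast_rows: iterable of tab-separated BLAST tabular lines (assumed format).
    """
    hits = defaultdict(list)
    for row in blast_rows:
        f = row.rstrip("\n").split("\t")
        read, scaffold = f[0], f[1]
        identity = float(f[2])
        sstart, send = int(f[8]), int(f[9])
        evalue = float(f[10])
        # Thresholds from the slide: E <= 1e-50 and >= 97% identity.
        if evalue <= max_evalue and identity >= min_identity:
            hits[read].append((scaffold, min(sstart, send), max(sstart, send)))
    # Reads that map to more than one location are discarded.
    return {read: locs[0] for read, locs in hits.items() if len(locs) == 1}
```

Per-scaffold coverage can then be computed from the surviving placements only, which is what separates the two coverage plots (with and without multihits).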

assembly coverage

assembly coverage (multihit removed)

GC% (plot; axis ticks at 50% and 100%)

EuGene
• Developed by INRA (Toulouse, France) in cooperation with our group
• Coding potential search
• Intrinsic approaches: SplicePredictor, NetGene2, NetStart
• Extrinsic approaches: similarity searches (BLASTn, BLASTx, tBLASTx) against Swissprot, Cryptococcus, Coprinus, cDNA & EST
• Output: predicted genes (structural annotation)

EuGene

Graphical output of EuGene

Collecting Data: Interpolated Markov Model
• Actual genes (full or partial)
• Introns
• Intergenic regions where possible

>gene1
atggctaggatagctctcgatagtcgat...
>gene2
atggtccgcttcgctatgctagatcggat...
>gene3
cgattagctgagctcttttctcgatcgtagct...
>intron1
gtagctcgctgctcgag
>intron2
gtagctcgataaaatcgctggggctcgctgag
>intron3
gtagctgttttttcgctagctgatcgtttag
>intergenic1
acgctgctgctcgggctcgctcgatcgatcccaaaatatcgctagatctagatcta...
>intergenic2
gctcgatgagagatcgcgctcgctatataaatatcgcgatcgat...
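To illustrate why separate training sets for genes, introns and intergenic regions are collected, here is a small sketch of a sequence model trained per class. It is a deliberate simplification: a fixed-order Markov model with add-one smoothing, whereas an interpolated Markov model blends several orders. The order, the smoothing and the function names are assumptions for illustration and do not reflect how EuGene is implemented.

```python
import math
from collections import Counter


def train_markov(seqs, order=2, alphabet="acgt"):
    """Train a fixed-order Markov model; return a log-probability function."""
    ctx_counts, kmer_counts = Counter(), Counter()
    for s in seqs:
        s = s.lower()
        for i in range(order, len(s)):
            ctx, nxt = s[i - order:i], s[i]
            # Skip positions containing ambiguous bases such as 'n'.
            if nxt in alphabet and all(c in alphabet for c in ctx):
                ctx_counts[ctx] += 1
                kmer_counts[ctx + nxt] += 1

    def log_prob(ctx, nxt):
        # Add-one smoothing so unseen k-mers keep a non-zero probability.
        return math.log((kmer_counts[ctx + nxt] + 1) /
                        (ctx_counts[ctx] + len(alphabet)))

    return log_prob


def score(seq, log_prob, order=2):
    """Log-likelihood of seq under a trained model."""
    seq = seq.lower()
    return sum(log_prob(seq[i - order:i], seq[i])
               for i in range(order, len(seq)))
```

A region can then be labelled by comparing its score under the coding, intron and intergenic models; the class with the highest log-likelihood is the most plausible under this simplified scheme.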

Collecting Data: Splice Machine
actual GT donors: ...accgtgtGTgctttgt... ...cggtcgtGTccgaat... ...
actual acceptors: ...acttgtatAGgctgggt... ...cggtcgtAGaggaatc... ...
actual GC donors: ...actggatGCgcgtgca... ...ttgtcgtttGCaggaatc... ...
pseudo GT donors: ...tttcgtgtGTgctttgt... ...cgaacgtGTccaat... ...
pseudo acceptors: ...aattgtatAGgcccggt... ...aatacgtAGaggaatc... ...
pseudo GC donors: ...acccgatGCaacgtca... ...atgtcgggGCagggatc... ...
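The split into actual and pseudo sites can be sketched as below: every occurrence of the dinucleotide is a candidate, and it counts as actual only when its position is a known splice site. This is an illustrative sketch under assumptions; the function name `collect_sites`, the 0-based `true_positions` set and the flank length are hypothetical and are not taken from the Splice Machine software.

```python
def collect_sites(seq, dinucleotide, true_positions, flank=7):
    """Split all occurrences of a dinucleotide into actual and pseudo sites.

    true_positions: set of 0-based start positions of annotated sites (assumed).
    Returns two lists of context windows with the dinucleotide in upper case.
    """
    seq = seq.lower()
    di = dinucleotide.lower()
    actual, pseudo = [], []
    # Start at 'flank' so every window has a full left context.
    start = seq.find(di, flank)
    while start != -1 and start + 2 + flank <= len(seq):
        window = (seq[start - flank:start] + di.upper()
                  + seq[start + 2:start + 2 + flank])
        (actual if start in true_positions else pseudo).append(window)
        start = seq.find(di, start + 1)
    return actual, pseudo
```

The two lists correspond to the positive and negative training examples shown on the slide, one pair of lists per signal type (GT donor, GC donor, acceptor).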

Predicting Genes
Each signal on the sequence is scored using the SVM models.
• GT donors: gt (ac on the reverse strand)
• GC donors: gc (gc on the reverse strand)
• Acceptors: ag (ct on the reverse strand)
...acgcgcgatagctgatggtcttttctcgcgagatctagagaggacacacatacatgatctagatcttaaa...
Example scores: 0.1 0.254 0.36 0.9 0.11 ...

Latest EuGène annotation
• 23164 genes (18678 complete)
• 9956 covered by ESTs over at least 100 bp
• 8929 with a match in Swissprot, 11772 in UniProt
• 9232 with a match in Cryptococcus (4502 reciprocal best hits)
• 12932 with a match in Phanerochaete (5515 reciprocal best hits)
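The reciprocal best hit counts above rest on a simple criterion: gene a in Laccaria and gene b in the other genome are each other's top-scoring hit. A minimal sketch, assuming the hits have already been parsed into (query, subject, bitscore) tuples; the function names and the use of bitscore as the ranking key are assumptions, not a description of the pipeline that produced these numbers.

```python
def best_hits(hits):
    """hits: iterable of (query, subject, bitscore). Return {query: best subject}."""
    best = {}
    for query, subject, bitscore in hits:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward, reverse = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

A gene can have a match without being a reciprocal best hit, which is why the reciprocal counts are much lower than the plain match counts.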

23000 ?!?
• 1176 match Class 1 TE, 1000 Class 2
• ~1500 tandem repeats
• genes split by gaps in assembly (?)
• genes split by annotation mistake (?)
• false positives
• most manually annotated genes look _very_ similar to the EuGene annotation

peptide length

coding density

Predicted introns
73014 + 3622 = 76636 total

Clustering predicted genes
blastclust
• 1357 clusters
• max cluster: 21 genes
• 19831 single genes
• tight clusters, too strict, mostly very small proteins
Li-Rost single linkage clustering
• 2410 (2347) clusters
• max cluster: 224 genes
• 12194 (12008) single genes
• top clusters too relaxed
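The "too relaxed" behaviour of single linkage comes from chaining: two genes end up in the same cluster as soon as a path of pairwise similarities connects them. A generic sketch using union-find, assuming the similar pairs have already been selected by some threshold; it does not implement the Li-Rost similarity criterion itself, and the function name is hypothetical.

```python
def single_linkage(genes, similar_pairs):
    """Group genes into clusters connected by any chain of similar pairs."""
    parent = {g: g for g in genes}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))
```

With pairs (a, b) and (b, c), genes a and c share a cluster even though they were never directly similar, which is how very large top clusters can arise.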

Li & Rost top clusters
• (224) Kinesin light chain (KLC)
• (156) ??
• (126) ??
• (124) ??
• (119) Myosin heavy chain related
• (115) Putative AC transposase

Ks distribution
• Ks: synonymous substitutions per synonymous site
• Synonymous sites are “free” to mutate, so they are expected to follow the molecular clock hypothesis
• Protein alignment -> codon alignment
• Ks gives an indication of the age of divergence
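The "protein alignment -> codon alignment" step can be sketched as threading each coding sequence back onto its gapped protein sequence, so that every residue becomes its codon and every gap becomes three gap characters. This is a minimal sketch; the function name is hypothetical, and it assumes the CDS is in frame and may carry a trailing stop codon, which is ignored.

```python
def codon_alignment(aligned_protein, cds, gap="-"):
    """Thread a CDS onto its aligned protein sequence.

    aligned_protein: protein sequence with gap characters from the alignment.
    cds: the unaligned coding sequence of the same gene, in frame.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(codons) < n_residues:
        raise ValueError("CDS is shorter than the protein it should encode")
    out, k = [], 0
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)   # one protein gap spans three nucleotides
        else:
            out.append(codons[k])
            k += 1
    return "".join(out)
```

Applying this to both members of a gene pair yields two codon-aligned sequences of equal length, the input that Ks estimation methods expect.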

Ks distribution Laccaria

Ks distribution Phanerochaete

Duplicated regions
• i-ADHoRe (automatic detection of homologous regions) by Cedric, Klaas, Yvan
• Reduce chromosomes (scaffolds) to strings of genes (no tandem duplicates)
• Map homologous genes (anchor points)
• Find statistically significant regions of colinearity
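The reduction of a scaffold to a string of genes without tandem duplicates can be illustrated as follows. This is a simplified sketch that only collapses directly adjacent genes of the same family; it is not the i-ADHoRe implementation, and the function name and the `family_of` mapping are assumptions.

```python
def collapse_tandems(gene_order, family_of):
    """Collapse runs of adjacent genes from the same family into one entry.

    gene_order: gene identifiers in their order along a scaffold.
    family_of: dict mapping gene -> family id; unlisted genes form their own family.
    Returns a list of (representative gene, family) tuples.
    """
    collapsed = []
    for gene in gene_order:
        family = family_of.get(gene, gene)
        if collapsed and collapsed[-1][1] == family:
            continue  # tandem duplicate of the previous gene, keep the first only
        collapsed.append((gene, family))
    return collapsed
```

Collapsing tandems first prevents a local array of duplicates from being counted as several anchor points when colinear regions are scored.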

Simillion et al. 2004 Genome Research

Results i-ADHoRe
• 52 multiplicons of 5 to 11 anchor points
• 4.6% of the Laccaria genome duplicated
• no colinearity with the Cryptococcus genome

Age of duplicated blocks

scaffold        median Ks    scaffold
scaffold_13     1.53         scaffold_39
scaffold_1      0.44         scaffold_1
scaffold_11     0.39         scaffold_30
scaffold_31     0.33         scaffold_6
scaffold_115    0.02         scaffold_39

Gbrowse database http://bioinformatics.psb.ugent.be/genomes/browse/gbrowse/laccaria

Thiamin pyrophosphate riboswitch
• Feature in the 5’-UTR of the mRNA
• The tertiary structure of the mRNA has affinity for the ligand (thiamine)
• Binding induces a conformational change => regulation of translation
• Reported during the previous workshop; the corresponding gene is now annotated

Lbscf0025g00410 Lbscf0025g00400
