Annotation of Anopheline Genomes at VectorBase

Dan Lawson, VectorBase & The Anopheles Genomes Cluster Consortium, EMBL-EBI.

Presentation Transcript


  1. Annotation of Anopheline Genomes at VectorBase • Dan Lawson, • VectorBase & The Anopheles Genomes Cluster Consortium • EMBL-EBI

  2. Anopheline species in this study: Current status • Genome sequencing: 9 of 16 species assembled and annotated • RNAseq: 10 of 12 species sequenced • Isolate re-sequencing: 12 of 12 species sequenced

  3. Genome annotation • First-pass genome annotation is almost always based on “automatic” computational approaches • ab initio • Similarity based: transcript (ESTs, RNAseq) and protein (nr protein database)

  6. Genome annotation - building a pipeline • Genome assembly • Map Repeats • Map Peptides • Map Transcripts • Genefinding • nc-RNAs • Protein-coding genes • Functional annotation • Submission to archival databases (Release)

  7. Automatic annotation strategies • ab initio • similarity

  8. Genome annotation: resources • ab initio predictions using SNAP and Augustus • Mixed whole animal RNAseq datasets generated using Illumina sequencing • Assembled using Trinity (Broad Institute) • Many dipteran proteomes (including 4 mosquitoes & D. melanogaster) • All arthropod/metazoan proteomes

  9. MAKER annotation with RNAseq and reference proteomes • Aim: gene prediction aggregation for the masses • Used for a number of arthropod genome projects • Touted as the default pipeline for many more (part of the GMOD toolkit) • Overview: • ab initio gene predictions from SNAP, Augustus & FGENESH • Final gene models from MAKER • Similarity alignments from both EXONERATE and BLAST • Repeats from RepeatFinder & RepeatMasker • Additional data sets integrated via GFF3 files (RNA-Seq) • Uses MPI for parallelization over a compute farm • Summary: • Iterative runs give acceptable reference gene sets • Used for Heliconius, Glossina, sandflies and the first tranche of the 16 Anophelines

  10. Current VectorBase annotation pipeline • MAKER based automatic annotation • includes SNAP training and ab initio • RNAseq based transcript similarity prediction • Taxonomically constrained peptide similarity prediction • 2 rounds of prediction refinement; the final round includes all peptide similarity • Community annotation phase • Capture gene structure changes • Metadata associated with locus (symbol, description, citation) • Submission to INSDC, propagation to UniProt • Presentation through VectorBase • Timeline labels: Start, 1.0 set (automatic), 1.1 set (published)

  11. Projection from a reference annotation

  12. Gene prediction based on projection from reference annotation • Local alignment of An. gambiae CDS to the assemblies provides a platform for improving gene predictions • Example locus: Rps7 (AGAP008916) • Potential for transcript-based assembly improvement via seqedits of the genome sequence

  13. Annotation: Preliminary genesets • 10,738 - 13,162 predictions • no ncRNAs yet predicted

  14. Preliminary comparative analysis • OrthoMCL runs including 17 species • An. gambiae PEST: 12,810 protein-coding genes • Figure labels: An. darlingi, Glossina morsitans, Lutzomyia longipalpis, Phlebotomus papatasi

  20. Preliminary comparative analysis • OrthoMCL runs including 17 species • No. of clusters containing all 13 mosquitoes: 4961 (≃ 39%) • No. of clusters containing all 11 Anophelines: 5463 (≃ 43%) • No. of clusters containing 10 Anophelines (minus darlingi): 6606 (≃ 52%) • No. of clusters containing 9 Anophelines (minus darlingi & christyi): 7477 (≃ 58%) • No. of clusters containing representatives of the gambiae complex (ar/ga/qu): 9089 (≃ 71%) • No. of clusters containing 8 Anophelines (minus darlingi & christyi) but not gambiae: 600 • Figure labels: An. darlingi, Glossina morsitans, Lutzomyia longipalpis, Phlebotomus papatasi
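
The cluster counts on this slide amount to a set-containment test over OrthoMCL groups. The following is a minimal sketch, assuming the usual OrthoMCL groups layout (`group_id: taxon|protein taxon|protein ...`). The taxon codes and the three example groups are invented for illustration and are not the consortium's data. The last function checks the slide's percentages against the 12,810 An. gambiae PEST protein-coding genes as the denominator; that denominator is an inference from the numbers, not something the slides state.

```python
def parse_groups(lines):
    """Map each OrthoMCL group ID to the set of taxon codes it contains."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        group_id, members = line.split(":", 1)
        # Each member looks like "taxon|protein_id"; keep only the taxon code.
        groups[group_id.strip()] = {m.split("|", 1)[0] for m in members.split()}
    return groups


def count_clusters_with_all(groups, required_taxa):
    """Count groups that contain at least one protein from every required taxon."""
    required = set(required_taxa)
    return sum(1 for taxa in groups.values() if required <= taxa)


def percent_of(count, total):
    """Percentage rounded to the nearest integer, as quoted on the slide."""
    return round(100 * count / total)


# Hypothetical groups file content (taxon codes are made up).
example = [
    "OG_1: agam|P1 aara|P2 aqua|P3 adar|P4",
    "OG_2: agam|P5 aara|P6 aqua|P7",
    "OG_3: agam|P8 adar|P9",
]
groups = parse_groups(example)
n_complex = count_clusters_with_all(groups, ["agam", "aara", "aqua"])
n_with_darlingi = count_clusters_with_all(groups, ["agam", "adar"])
```

With the slide's own figures, `percent_of(4961, 12810)` gives 39 and `percent_of(9089, 12810)` gives 71, which matches the quoted values.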

  21. All genomes deserve a home • Genome browser • Similarity searches: BLAST/BLAT • Query tools: simple keyword, complex queries • Downloads • Figure labels: Browser, Compara, Query tool, Similarity searches, Downloads

  22. VectorBase • The long-term home for these genomes is VectorBase • NIAID-funded Bioinformatics Resource Center focused on arthropod vectors of human pathogens • Ensembl genome browser • Similarity searches • File downloads

  23. Anopheles Genomes Cluster wiki site

  24. Thematic analysis groups & community annotation • Community-led annotation of the genomes using the Community Annotation Portal (CAP)

  25. Community annotation decision tree

  29. Community annotation workflow • Identify gene • Load FASTA and GFF3 into ARTEMIS or APOLLO • Modify model • Submit to CAP
      GFF3 evidence shown at the "Identify gene" step:
      scf7180000638805 ptn2genome ptn_match 52 605 892 + . ID=xxxx;Name=tr|Q3UIQ2|
      scf7180000638805 ptn2genome ptn_match 78 205 960 + . ID=xxxx2;Name=tr|Q3TIU7|
      scf7180000638805 ptn2genome ptn_match 52 305 696 + . ID=xxxx3;Name=sp|Q91VD9|
      scf7180000638805 ptn2genome ptn_match 78 205 950 + . ID=xxxx2;Name=tr|Q3VIU732|
      FASTA:
      >MY SUPERCONTIG
      ATATATGCGTTGAGCTGCGTTACGTTCGGGATGCGTTAGGCTTGTGAGCTGGATCGGTCCTGCCTGCGTCGATATAAACGACCT…
      GFF3 shown at the "Modify model" step:
      scf7180000638805 ptn2genome ptn_match 52 605 892 + . ID=xxxx;Name=tr|Q3UIQ2|
      scf7180000638805 ptn2genome ptn_match 78 205 960 + . ID=xxxx2;Name=tr|Q3TIU7|
      scf7180000638805 ptn2genome ptn_match 78 205 950 + . ID=xxxx2;Name=tr|Q3VIU732|
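
The evidence lines on this slide follow the nine-column GFF3 layout (seqid, source, type, start, end, score, strand, phase, attributes). The following is a minimal sketch of reading one such line, assuming tab-separated columns as the GFF3 format requires (the slide text shows spaces only because of how it was extracted). `Gff3Feature` and `parse_gff3_line` are hypothetical names introduced here and are not part of CAP, Artemis or Apollo.

```python
from typing import NamedTuple


class Gff3Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line):
    """Split one tab-separated GFF3 feature line into typed fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    attributes = {}
    # Column 9 is a ';'-separated list of key=value pairs.
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return Gff3Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                       cols[5], cols[6], cols[7], attributes)


# First evidence line from the slide, rebuilt with tab separators.
line = "\t".join(["scf7180000638805", "ptn2genome", "ptn_match",
                  "52", "605", "892", "+", ".", "ID=xxxx;Name=tr|Q3UIQ2|"])
feature = parse_gff3_line(line)
span = feature.end - feature.start + 1  # GFF3 coordinates are 1-based, inclusive
```

A check like this can catch malformed lines (wrong column count, non-numeric coordinates) before a model is submitted, although the slides do not say which checks CAP itself applies.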

  30. CAP reporting • An email report is sent back to the submitter to show status • If successful, the model is stored in a local database and then presented to the genome browser via DAS • Failed submissions come with (some) information as to why they failed; submitters then need to correct these errors and re-submit

  31. CAP submissions displayed in the genome browser • Similarity track for supporting evidence (from previous updates)

  32. Genome annotation metrics • Metrics for quality of a gene set are far from standardised but... • Simple statistics (length, number of exons, intron size) • Level of support from transcript data (how many genes have overlapping EST/RNAseq) • Junction data (confirmation of introns) • Comparison to public datasets (UniProt) • Protein domains (InterPro) • Comparative analysis - orthologs/paralogs
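
The "simple statistics" and "level of support" items on this slide can be illustrated with a short sketch. It assumes each gene model is given as a list of exon coordinates (1-based, inclusive) and that transcript support is available as a set of supported gene IDs. The function names, coordinates and IDs are hypothetical and do not come from the VectorBase pipeline.

```python
def gene_model_stats(exons):
    """Exon count, spliced length and intron sizes for one transcript.

    exons: list of (start, end) tuples, 1-based inclusive, on one strand.
    """
    ordered = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in ordered]
    # An intron is the gap between the end of one exon and the start of the next.
    intron_lengths = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return {
        "exon_count": len(ordered),
        "transcript_length": sum(exon_lengths),
        "intron_lengths": intron_lengths,
    }


def fraction_supported(gene_ids, supported_ids):
    """Fraction of genes with overlapping transcript (EST/RNAseq) evidence."""
    gene_ids = list(gene_ids)
    if not gene_ids:
        return 0.0
    supported = set(supported_ids)
    return sum(1 for g in gene_ids if g in supported) / len(gene_ids)


# Made-up three-exon model and support table.
stats = gene_model_stats([(300, 449), (100, 199), (600, 650)])
support = fraction_supported(["g1", "g2", "g3", "g4"], {"g1", "g3", "gX"})
```

Summarising these values across a whole gene set (for example, the distribution of exon counts or intron sizes) gives the kind of simple comparison between annotation runs that the slide describes.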

  33. Still to do... • Primary annotation: still 7 genomes outstanding from the Broad Institute - de novo repeat finding and MAKER annotation • Analysis: • Whole genome alignments (and the 12 Drosophilid analysis pipelines from the Kellis group - Rob Waterhouse) • Data presentation (Trinity clusters, correlation with legacy Hittinger clusters, velvet assembled 37 bp reads) • Variation (SNP calls) from each of the 16 species • Other genomes: • New version of the An. darlingi genome (Osvaldo Marinotti, recently published in NAR) • New version of the Indian strain of An. stephensi (Jake Tu)

  34. Acknowledgements • EMBL-EBI: Daniel Lawson, Gareth Maslen, Mikkel Christensen, Nick Langridge, Derek Wilson, Gautier Koscielny, Karyn Megy, Martin Hammond, Daniel Hughes, Ewan Birney, Paul Kersey • Imperial College: Fotis Kafatos, Bob MacCallum, George Christophides, Seth Redmond, Timo Tiirikka • Notre Dame: Frank Collins, Greg Madey, Rob Bruggner, Nate Konopinski, EO Stinson, Scott Emrich, Andrew Sheehan, Rory Carmichael, Dave Cieslak, Dave Campbell, Ryan Butler, Katie Cybulski, Neil Lobo, Gloria Calderon, Greg Davis • New Mexico: Maggie Werner-Washburne, Phil Baker • Harvard: Bill Gelbart, Susan Russo, Dave Emmert, Pinglei Zhou, Lynn Crosby, Kathy Campbell • IMBB: Kitsos Louis, Pantelis Topalis, Emmanuel Dialynas, Vicky Dritsou • Sequencers: TIGR/JCVI, WashU, Broad Institute, Baylor College • Ensembl Genomes: Michael Nuhn • Dan Neafsey, Brian Haas • Nora Besansky, Michael Fontaine • Rob Waterhouse • Paul Howell

  35. Contact lawson@ebi.ac.uk or info@vectorbase.org

  36. Anopheles Genomes Cluster Consortium • Steering committee • Community liaisons
