Discovery and Annotation of Transposable Elements on VectorBase http://www.vectorbase.org
Ryan C. Kennedy1,2, Maria F. Unger1,3, Scott Christley4, Jenica L. Abrudan1,3, Neil F. Lobo1,3, Greg Madey1,2, Frank H. Collins1,2,3
1Eck Institute for Global Health, University of Notre Dame
2Department of Computer Science and Engineering, University of Notre Dame
3Department of Biological Sciences, University of Notre Dame
4Department of Mathematics & Department of Computer Science, University of California, Irvine
Although transposable elements (TEs) were discovered over 50 years ago, their robust discovery in newly sequenced genomes remains a difficult problem. The discovery process is confounded by several issues, including numerous TE types with different structural characteristics, sequence degradation, multiple insertions within existing elements, and co-option by the organism’s regulatory system.
We have developed a pipeline employing a homology-based approach, complemented with de novo and structure-based approaches, to discover and annotate TEs in invertebrate genomes. Once fully automated, our pipeline will be integrated with VectorBase, an NIAID Bioinformatics Resource Center for invertebrate vectors of human pathogens, to produce a first-pass discovery and annotation of TEs for newly sequenced genomes. VectorBase currently hosts five organisms, with more on the way, and provides the Ensembl genome browser, computational tools, and other data specific to the study of invertebrate vectors.
The annotation component of our pipeline includes enhancements to the Ensembl genome browser, elevating the importance of TEs by displaying genomic location, structural details, alignments with consensus TEs, and homology with other organisms. VectorBase has developed a community annotation system whereby the research community can upload annotation corrections to genes for curation and broad dissemination; we plan to extend this to TEs. We hope this will provide an invaluable resource for researchers studying the biology of TEs and their genomic impact.
TEs are difficult to characterize thoroughly because of their complex and varying structure (or lack thereof). Most current TE discovery techniques fall into one of three categories: homology-based, structure-based, and de novo. Popular tools exist within each category, yet most are not automated or easily accessible to all researchers. We have developed a semi-automated discovery pipeline that uses a homology-based approach complemented with de novo and structure-based components. Our pipeline relies on several well-known technologies, including BLAST, Perl (and BioPerl), and DNASTAR SeqMan II. It also requires a library of representative TEs, which we obtain from Repbase, TEfam, and the literature.
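To illustrate the homology-based step, the sketch below filters BLAST hits of a TE library against a genome. It is a minimal Python example, not the pipeline's own Perl code. It assumes the hits are in BLAST's standard 12-column tabular output, and the thresholds, sequence names, and function names (`parse_tabular`, `filter_hits`) are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str       # ID of the library TE used as the query
    subject: str     # genome scaffold or contig that was hit
    identity: float  # percent identity of the alignment
    length: int      # alignment length in bases
    s_start: int     # leftmost subject coordinate (1-based)
    s_end: int       # rightmost subject coordinate (1-based)
    evalue: float


def parse_tabular(text):
    """Parse standard 12-column BLAST tabular output into Hit records."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        # Minus-strand hits report sstart > send, so normalise the order.
        a, b = int(f[8]), int(f[9])
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        min(a, b), max(a, b), float(f[10])))
    return hits


def filter_hits(hits, min_identity=80.0, min_length=100, max_evalue=1e-10):
    """Keep hits that pass all three thresholds (illustrative values only)."""
    return [h for h in hits
            if h.identity >= min_identity
            and h.length >= min_length
            and h.evalue <= max_evalue]


# Hypothetical example rows: qseqid, sseqid, pident, length, mismatch,
# gapopen, qstart, qend, sstart, send, evalue, bitscore.
ROWS = [
    ("TE_mariner_1", "scaffold_1", "92.5", "850", "40", "3",
     "1", "850", "10500", "11349", "1e-150", "1200"),
    ("TE_mariner_1", "scaffold_2", "71.0", "300", "80", "5",
     "1", "300", "2300", "2001", "1e-20", "150"),   # low identity, minus strand
    ("TE_gypsy_3", "scaffold_1", "88.0", "60", "7", "0",
     "10", "69", "500", "559", "1e-5", "70"),       # too short
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

if __name__ == "__main__":
    for hit in filter_hits(parse_tabular(SAMPLE)):
        print(hit.query, hit.subject, hit.s_start, hit.s_end)
```

Hits that survive such a filter would be candidate TE copies for later assembly and curation. Suitable thresholds depend on the TE family and the divergence expected in the genome under study.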
We aim to provide an automatic and easy-to-use method, integrated with VectorBase, to identify and annotate TEs in invertebrate genomes.
Community TE Annotation
Community annotation of TEs is not yet fully implemented on VectorBase. When it is, it will follow the same general steps as gene annotation, and TEs will be shown within the genome browser. Current work has produced a means to store consensus TEs in the same Chado database schema as genes and to provide a structural display of TEs. Existing online TE repositories generally lack both this structural display and the user-feedback system that VectorBase employs. Additionally, BLAST will be used to show the coverage of TEs within a genome. Figure 1 shows the information flow for TE community annotation on VectorBase.
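One way to turn BLAST hits into a coverage figure is to merge overlapping hit intervals on a sequence and divide the merged length by the sequence length. The Python sketch below shows that calculation under stated assumptions: coordinates are 1-based and inclusive, and the function names (`merge_intervals`, `covered_fraction`) and example values are hypothetical, not part of VectorBase.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals.

    Coordinates are assumed to be 1-based and inclusive, with start <= end.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def covered_fraction(intervals, sequence_length):
    """Fraction of a sequence covered by at least one interval."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    covered = sum(end - start + 1 for start, end in merge_intervals(intervals))
    return covered / sequence_length


if __name__ == "__main__":
    # Hypothetical TE hits on a 1,000-base scaffold.
    te_hits = [(1, 100), (50, 150), (300, 400)]
    print(merge_intervals(te_hits))
    print(covered_fraction(te_hits, 1000))
```

Merging first avoids counting bases twice where hits from different library TEs overlap the same region.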
Figure 2. Simplified visual diagram of homology-based discovery pipeline.
D. Lawson, et al. VectorBase: a data resource for invertebrate vector genomics. Nucleic Acids Research, 37:D583-D587, 2009.
Figure 1. Information flow diagram for TE annotation.
The VectorBase project is funded by the US National Institute of Allergy and Infectious Diseases (NIAID), contract HHSN266200400039C.