2nd International Barcode of Life Conference, 18 September 2007
The BARCODE Data Standard: Enabling Molecular Diagnostics for Biodiversity
Robert Hanner, PhD
Database Working Group Chair, CBOL
Global Campaign Coordinator, FISH-BOL
Associate Director, Canadian Barcode of Life Network
Biodiversity Institute of Ontario, University of Guelph, Canada
The Infrastructure of Taxonomy
• Collections and databases of specimens
• Codes of Taxonomic Nomenclature
• Compilations of taxonomic names
• Data repositories (characters, gene sequences, images, trees)
• Monographs
• Floristic and faunistic surveys/inventories
• Revisions
• The (undigitized) Taxonomic Literature
International Nucleotide Sequence Database Collaboration http://www.insdc.org/
Roles of INSD
• An archival database/repository for nucleotide sequences
• A common access interface through which users reach the output of multiple projects (Project A, Project B, Project C)
• Assignment of a unique identifier (an accession number) to a sequence
• Standardization of data structure, including data items and values
New tools for taxonomy: DNA Barcoding
The ability to compare genotype information across a huge range of organisms is a powerful tool.
Validation demonstrates that a procedure is robust, reliable and reproducible.
PCR amplification and DNA sequencing:
• Are robust methods that produce successful results a high percentage of the time.
• Are reliable methods that produce accurate results.
• Are reproducible methods that produce similar results each time a sample is tested.
Manual Assembly Subjective interpretation?
“Only [27%] of papers had a legitimate specimens examined section, with museum numbers for each voucher, and names of the museums where the specimens used in the study could be examined”
Couplets Consisting of: “Species Name - DNA Sequence”
• Basis of a “look-up table” enabling molecular diagnostic applications
• However, both elements are assertions
• Underlying specimens and associated raw sequence data are not typically available for secondary inspection
Problem Areas TRANSPARENCY AND TRACEABILITY • Genetic Data Quality • Specimen Data Quality • Taxonomy • Information Access
First International Barcode of Life Conference: Feb 5-8, 2005
Rationale for Defining “BARCODE” keyword in GenBank
Provides the community with reference records with verifiable and retrievable data. These records are:
• Associated with retrievable voucher specimens (liberally defined: tissue, DNA, etc.)
• Linked to on-line metadata
• Held to an agreed-upon standard of taxonomic identification
• Held to an assured level of data completeness
• Based on an agreed-upon gene region
• Recommended for use in identifying unknowns
The Barcode Data Standard
Establishing a new data standard for “BARCODE” keyword records in DDBJ/EMBL/GenBank:
• Minimum 500 bp, <1% ambiguous base calls
• Double-stranded sequence
• Trace files and associated quality scores
• Primers used to generate sequence
• Linkages to:
• A morphological voucher specimen
• Structured reference to collections
• Geospatial reference information
• Valid species name
• Who performed the identification
• Literature citations
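The two numeric thresholds above (minimum 500 bp, <1% ambiguous base calls) can be sketched as a simple check. This is a minimal illustration only, not the validation performed by GenBank or BOLD: the function name `check_barcode_sequence` and its parameters are hypothetical, and it assumes that any character other than A, C, G or T counts as an ambiguous call.

```python
def check_barcode_sequence(seq, min_length=500, max_ambiguous_fraction=0.01):
    """Check a sequence against the length and ambiguity thresholds on the slide.

    Assumption (not from the standard itself): every character other than
    A, C, G or T is counted as an ambiguous base call.
    """
    seq = seq.upper()
    length = len(seq)
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    # An empty sequence is treated as fully ambiguous so that it fails both checks.
    fraction = ambiguous / length if length else 1.0
    return {
        "length": length,
        "ambiguous": ambiguous,
        "length_ok": length >= min_length,
        # Strictly less than 1%, matching the "<1%" wording.
        "ambiguity_ok": fraction < max_ambiguous_fraction,
    }


if __name__ == "__main__":
    example = "ACGT" * 150  # hypothetical 600 bp sequence, no ambiguous calls
    print(check_barcode_sequence(example))
```

The other items in the standard (trace files, primers, voucher linkages) concern metadata rather than the sequence string and are outside the scope of this sketch.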
Features, Qualifiers and Values The Feature table is updated based on discussions at the International Collaborators meeting of INSDC
NCBI Barcode Submission Tool in Beta Test Phase Since 2005, better software, more sequences, better links to museum vouchers…
Triplet structure for specimen identifiers
/specimen_voucher=“<institution-code>|<collection-code>|<specimen-id>”
<institution-code> - abbreviation of the archiving institution
<collection-code> - collection within the institution (possibly null) (*)
<specimen-id> - specimen identifier within the collection
The above approach is used in DarwinCore/GBIF and parallels the Life Science Identifier (LSID), which is an Object Management Group (OMG) standard.
(*) museums, herbaria, culture collections, stock centers, germplasm repositories (seed banks), frozen tissue banks, zoos/aquaria/botanical gardens, DNA banks, personal collections, e-voucher archives
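As a rough illustration of the triplet structure, the sketch below splits a voucher string into its three parts. The names `SpecimenVoucher` and `parse_specimen_voucher` are hypothetical, the "|" separator is taken from the slide and exposed as a parameter in case a different delimiter is in use, and the example value is invented.

```python
from typing import NamedTuple, Optional


class SpecimenVoucher(NamedTuple):
    institution_code: str
    collection_code: Optional[str]  # may be null, as noted on the slide
    specimen_id: str


def parse_specimen_voucher(value, sep="|"):
    """Split '<institution-code>|<collection-code>|<specimen-id>' into parts.

    Assumption: the separator never occurs inside any of the three fields.
    """
    parts = [part.strip() for part in value.split(sep)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields separated by {sep!r}, got {len(parts)}")
    institution, collection, specimen = parts
    if not institution or not specimen:
        raise ValueError("institution code and specimen id must be non-empty")
    # An empty middle field is mapped to None to represent a null collection code.
    return SpecimenVoucher(institution, collection or None, specimen)


if __name__ == "__main__":
    # Hypothetical voucher string, for illustration only.
    print(parse_specimen_voucher("XYZ|Fish|12345"))
```

Keeping the collection code optional mirrors the "(possibly null)" note, so a string such as `XYZ||12345` still parses while a two-field string is rejected.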
Summary
• INSDC is an archival genetic database in the public domain
• BOLD is a public/private workbench for assembling BARCODE-compliant projects & supports the organization of barcode campaigns
• BOLD and GenBank continue to develop routines for synchronization and interoperability
• As of this Meeting, the BARCODE Data Standard is Ready for Full Implementation!
Acknowledgments: • All Participants of the CBOL Database Work Group • Scott Federhen, NCBI • Donal Hobern, GBIF • Scott Miller, Smithsonian Institution • David Schindel, CBOL • Sujeevan Ratnasingham, Biodiversity Institute of Ontario