Darwin Core Archives

Darwin Core Archives. Checklist Archives Checklist Extensions Archive Tools Checklist Bank. Markus Döring & David Remsen, GBIF 2010. Checklist Scope. Darwin Core. Ratified in 2009 Significant additions/refinements Ongoing process Set of terms http://rs.tdwg.org/dwc/terms/index.htm

Presentation Transcript


  1. Darwin Core Archives Checklist Archives Checklist Extensions Archive Tools Checklist Bank Markus Döring & David Remsen, GBIF 2010

  2. Checklist Scope

  3. Darwin Core • Ratified in 2009 • Significant additions/refinements • Ongoing process • Set of terms • http://rs.tdwg.org/dwc/terms/index.htm • Not tied to technology • Use Text Guidelines for DwC-A • http://rs.tdwg.org/dwc/terms/guides/text/index.htm

  4. Darwin Core Archives for interoperability • Simplicity • Complete datasets, compressed • Allow for rich dataset metadata • Single CSV w/ header minimal requirement • Flexible • 1:many extensions • Schema descriptor meta.xml • Property mapping to column or global value • GNA exchange format • Standard extensions • Taxonomic core conventions • Controlled vocabularies
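A minimal sketch of the meta.xml descriptor mentioned above, showing the two kinds of property mapping: a term bound to a column index, and a term bound to a global value. The element and attribute names follow our reading of the DwC Text Guidelines and should be checked against that guide; the function name `build_meta_xml`, the file name `taxa.txt`, and the sample terms are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"
DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"


def build_meta_xml(core_file, columns, defaults=None, metadata="eml.xml"):
    """Build a meta.xml string for a tab-delimited Taxon core file.

    columns  -- term names in column order; column 0 is assumed to be the record id
    defaults -- optional {term: value} pairs applied to every record (global values)
    """
    archive = ET.Element("archive", {"xmlns": DWC_TEXT_NS, "metadata": metadata})
    core = ET.SubElement(archive, "core", {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": "\\t",   # written literally as \t in the XML
        "linesTerminatedBy": "\\n",
        "ignoreHeaderLines": "1",      # the data file carries one header row
        "rowType": DWC_TERMS + "Taxon",
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = core_file
    ET.SubElement(core, "id", {"index": "0"})
    # Property mapped to a column
    for index, term in enumerate(columns):
        ET.SubElement(core, "field", {"index": str(index), "term": DWC_TERMS + term})
    # Property mapped to a global value
    for term, value in (defaults or {}).items():
        ET.SubElement(core, "field", {"default": value, "term": DWC_TERMS + term})
    return ET.tostring(archive, encoding="unicode")


meta = build_meta_xml(
    "taxa.txt",
    ["taxonID", "scientificName", "parentNameUsageID"],
    defaults={"nomenclaturalCode": "ICBN"},
)
```

The resulting string would be saved as meta.xml next to the data file and zipped together with it.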

  5. Best Practices • Include dataset metadata file or URL • inside <archive metadata="..."> • GBIF recognises eml file • For simplicity a Dublin Core xml file does it • Data file format • UTF8 • tab or csv files • header row • NULL as empty string, not “\N” or “NULL”
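The data file recommendations above can be sketched with Python's standard csv module: tab-delimited, one header row, and empty strings instead of NULL placeholders. The helper name `render_tab_file` and the sample rows are assumptions for illustration only.

```python
import csv
import io

NULL_PLACEHOLDERS = {"\\N", "NULL"}  # typical database export markers for NULL


def render_tab_file(header, rows):
    """Return tab-delimited text with a header row and NULLs as empty strings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow(
            ["" if value is None or value in NULL_PLACEHOLDERS else value for value in row]
        )
    return buffer.getvalue()


text = render_tab_file(
    ["taxonID", "scientificName", "parentNameUsageID"],
    [["1", "Plantae", None], ["2", "Rubus", "1"], ["3", "Rubus silvaticus", "NULL"]],
)
```

When writing to disk, opening the file with `open(path, "w", encoding="utf-8", newline="")` keeps the output in UTF-8.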

  6. Dwc:Taxon – Identifier • Relational data, Record ID the primary key that other id terms relate to • = TaxonID for checklist archives • = OccurrenceID for occurrence archives • TaxonConceptID • Asserting that taxa have a shared concept • ScientificNameID • Link out to some optional name identifier, GUID really • Identifiers are plain strings, can be any format • Literal terms, e.g. parentNameUsage • All Dwc ID terms have such a literal friend • Redundant if id terms are used • to be avoided for relations, e.g. homonyms

  7. Dwc:Taxon - Classification • Classification only for accepted taxa, not synonyms • parentNameUsageID • Allows for arbitrary ranks and levels • Beware infinite loops • Root with parentID=NULL or parentID=recordID • Denormalised (prefer the use of parentNameUsageID) • Kingdom,Phylum,Class,Order,Family,Genus,Subgenus • No explicit records required for higher taxa • TaxonRank • String, but recommended vocabulary http://rs.gbif.org/vocabulary/gbif/rank.xml • Examples http://code.google.com/p/gbif-ecat/wiki/publishingClassifications
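To illustrate the "beware infinite loops" warning, here is a minimal sketch that walks each record's parentNameUsageID chain and reports records caught in a cycle. It treats an empty parent or a parent equal to the record's own id as a root, as described above. The function name and the dict-based input are assumptions; a parent id that is absent from the file simply ends the walk here.

```python
def find_classification_loops(parents):
    """Return the set of taxonIDs whose parent chain never reaches a root.

    parents -- dict mapping taxonID -> parentNameUsageID ('' or None for roots)
    """
    looping = set()
    for taxon_id in parents:
        seen = set()
        current = taxon_id
        while True:
            if current in seen:
                looping.add(taxon_id)      # revisited an id: the chain is circular
                break
            seen.add(current)
            parent = parents.get(current)
            if not parent or parent == current:
                break                      # root: NULL parent or self reference
            if parent not in parents:
                break                      # dangling parent, not a loop
            current = parent
    return looping


parents = {
    "1": "",     # root with empty parent
    "2": "1",
    "3": "2",
    "4": "5",    # 4 and 5 point at each other
    "5": "4",
    "6": "4",    # hangs below the loop, so it never reaches a root either
    "7": "7",    # root pointing at itself
}
loops = find_classification_loops(parents)
```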

  8. Dwc:Taxon - Synonyms • Synonyms are records in core file • But classification should be ignored • acceptedNameUsageID • Synonyms point to the accepted/valid name usage • Accepted names have NULL or point to themselves • pro parte synonyms concatenate with | symbol all accepted IDs • taxonomicStatus • Accepted, (hetero-/homotypic) synonym, misapplied • See http://rs.gbif.org/vocabulary/gbif/taxonomic_status.xml • nameAccordingTo • sec. / sensu part of taxon concepts
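A short sketch of how a consumer might interpret acceptedNameUsageID under the conventions above: an empty value or a self reference marks an accepted name, and pro parte synonyms list several accepted ids separated by "|". The function names and the dict-shaped record are assumptions for illustration.

```python
def accepted_ids(record):
    """Return the list of accepted taxonIDs a record resolves to."""
    raw = (record.get("acceptedNameUsageID") or "").strip()
    if not raw:
        return [record["taxonID"]]          # NULL: the record is itself accepted
    return [part.strip() for part in raw.split("|") if part.strip()]


def is_synonym(record):
    """A record is a synonym when it resolves to something other than itself."""
    return accepted_ids(record) != [record["taxonID"]]


accepted = {"taxonID": "10", "acceptedNameUsageID": ""}
self_ref = {"taxonID": "11", "acceptedNameUsageID": "11"}
synonym = {"taxonID": "12", "acceptedNameUsageID": "10"}
pro_parte = {"taxonID": "13", "acceptedNameUsageID": "10|11"}
```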

  9. Dwc:Taxon – Nomenclature • scientificName • full name with authorship • genus, subgenus, specificEpithet, verbatimTaxonRank, infraspecificEpithet, scientificNameAuthorship • namePublishedIn • nomenclaturalStatus • nomenclaturalCode • http://rs.gbif.org/vocabulary/gbif/nomenclatural_code.xml • originalNameUsageID • Basionym, Pointer to usage that first established the name

  10. Darwin Core Extensions

  11. Dwc Extensions - Basics • One to many relation, schema descriptor meta.xml • id column required to join extensions • rowType specifies the class of records / extension • Property mapping to column or global value • List of allowed properties with • Definition, examples, further link • Mandate Vocabulary • Basic data types: string, integer, decimal, boolean, date, dateTime • Centrally hosted at http://rs.gbif.org • Staging environment • Production is manually moderated, but open to community
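The one-to-many relation between core and extension records can be sketched as a simple join on the id column. This example assumes tab-delimited text with one header row and the id in the first column of both files; the function names and the sample vernacular data are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_rows(text):
    """Parse tab-delimited text and drop the header row."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)
    return list(reader)


def join_star(core_rows, extension_rows):
    """Pair every core row with the extension rows that share its id (column 0)."""
    by_id = defaultdict(list)
    for row in extension_rows:
        by_id[row[0]].append(row[1:])
    return [(row, by_id.get(row[0], [])) for row in core_rows]


core_text = "taxonID\tscientificName\n1\tRubus silvaticus\n2\tVertebrata\n"
vernacular_text = "taxonID\tvernacularName\tlanguage\n1\tWald-Brombeere\tde\n1\twoodland bramble\ten\n"

star = join_star(read_rows(core_text), read_rows(vernacular_text))
```

Core records without extension rows keep an empty list, so the core file remains usable on its own.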

  12. Dwc:Taxon Extensions • Frozen soon for GNA “Simple Exchange Format” http://rs.gbif.org/extension/gbif/1.0/ • Vernaculars • Distribution • Bibliography • Alternative ids & links. Webpage, LSID, DOI, JSON, etc • Candidates for further extensions • species info • images • nomenclatural acts & name relations • concept relations • type specimen

  13. Darwin Core Tools Publishing support

  14. DwC-A Reader Java library • Provides iterators across star schema • Dwc terms and GNA extension terms as enumerations

  15. Validator Status: Under Evaluation http://tools.gbif.org/dwca-validator/

  16. Integrated Publishing Toolkit • Compose EML Metadata • Connect to database • Upload Data • Transform to DWCA • Publish via GBIF http://ipt.gbif.org Status: Stable release – end 2010

  17. Guidelines and Best Practices • DB Admin skills • Database export • No tools required • Successful pilots • Ireland • NBN UK • Norway • Avian Knowledge network • IPNI • IRMNG Status: Drafts for November campaign (see roadmap)

  18. Authoring Descriptor XML Status: Ready for Review Metafile http://tools.gbif.org/dwca-assistant/

  19. Excel Spreadsheet Templates Status: Ready for Review/Testing

  20. Spreadsheet Processor Status: Ready for Review http://tools.gbif.org/spreadsheet-processor/

  21. Checklist Bank Indexing checklists

  22. GBIF Checklist Bank • Rich index to checklists and their content • All of Dwc Taxon and GNA Simple Format extensions: Vernacular names, Identifier & Links, Distribution, References • ~35 million name usages, 90 datasets + 8500 derived from occurrence index • Checklists • DwC-A created by • Publisher • Adapters (CoL, ITIS, NCBI, USDA, GRIN, TreeOfLife) • manual Transformation, static • No versioning • 4 main types: taxonomic, nomenclatural, occurrences, thematic

  23. Name Usages • Checklists are made up of name usages: a plain name string with optionally: • Classification • Taxonomic status, e.g. synonym, misapplied name • Original name, i.e. basionym • According to, i.e. taxon concept • Nomenclatural status • Original publication

  24. Lexical Grouping • Name strings are parsed and grouped • Correct & incorrect spellings • Homonyms in several groups • Semiautomatic process largely based on canonical, year and higher classification • Allows for • Fuzzy matching • Checklist crosswalk • Rubus silvaticus • Rubus sylvaticus • Rubus silvaticum • Rubus silvaticus Weihe & Nees • Vertebrata [animal subphylum] • Vertebrate • Vertebrata Cuvier, 1812 • Vertebrata [algae genus] • Vertebrata Gray • Vertebrata S.F. Gray, 1821 • Gerardia paupercula var. borealis (Pennell) Deam • Gerardia paupercula (Gray) Britt. var. borealis (Pennell) Deam • Gerardia paupercula (A.Gray) Britton var. borealis (Pennell) Deam • Gerardia paupercula borealis • Gerardia paupercula borealis (Pennell) Deam
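As a rough illustration of grouping by canonical name, the sketch below reduces each name string to its genus plus lowercase epithets and groups strings that share the result. This is a crude heuristic of our own, not the Checklist Bank parser: it ignores year and higher classification, does not separate homonyms, does not merge spelling variants, and would misread lowercase author particles such as "ex" or "de".

```python
from collections import defaultdict


def crude_canonical(name):
    """Keep the first token plus every purely lowercase alphabetic token.

    Rank markers ("var."), bracketed or capitalised authors, "&" and years
    are dropped because they are not purely lowercase letters.
    """
    tokens = name.split()
    if not tokens:
        return ""
    epithets = [t for t in tokens[1:] if t.isalpha() and t.islower()]
    return " ".join([tokens[0]] + epithets)


def group_names(names):
    """Group name strings by their crude canonical form."""
    groups = defaultdict(list)
    for name in names:
        groups[crude_canonical(name)].append(name)
    return dict(groups)


names = [
    "Rubus silvaticus",
    "Rubus silvaticus Weihe & Nees",
    "Vertebrata Cuvier, 1812",
    "Gerardia paupercula var. borealis (Pennell) Deam",
    "Gerardia paupercula (A.Gray) Britton var. borealis (Pennell) Deam",
    "Gerardia paupercula borealis",
]
groups = group_names(names)
```

Bringing "Rubus sylvaticus" or "Rubus silvaticum" into the same group would need an additional fuzzy matching step on top of this.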

  25. Nomenclatural Grouping • Grouping homotypic names • Original name relation • Homotypic synonyms • Not yet available

  26. Checklist Bank Portal • Preliminary until new GBIF portal complete • Browse & Search • Statistics • Links to source pages • Flickr Images

  27. Checklist Bank Webservices • Common API to all resources • RESTful JSON services • search names, usages, checklists • navigate classification • http://ecat-dev.gbif.org/api/clb

  28. Importing Darwin Core • Highly relational data • Challenges faced • Syntactically damaged sources • Wrong mappings, charsets, non escaped line breaks or field delimiters • Data Quality • Broken referential integrity • Non names, e.g. “Unallocated Family” • No standard vocabularies for ranks, status, etc • Name strings have several publishing options • ScientificName, Authorship, Genus + epithets + rank • Classification has several publishing options • Normalised (parentUsage / parentUsageID) or flat via Linnean Ranks
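One of the data quality problems listed above, broken referential integrity, can be detected before import with a check along these lines. It is a sketch under stated assumptions: records are dicts keyed by term name, only three ID terms are checked, and only acceptedNameUsageID is split on "|" (for pro parte synonyms). The function name is hypothetical.

```python
ID_TERMS = ("parentNameUsageID", "acceptedNameUsageID", "originalNameUsageID")


def dangling_references(records):
    """Return (taxonID, term, target) for every id term pointing at a missing record."""
    known = {record["taxonID"] for record in records}
    problems = []
    for record in records:
        for term in ID_TERMS:
            raw = (record.get(term) or "").strip()
            if not raw:
                continue                     # NULL published as empty string
            targets = raw.split("|") if term == "acceptedNameUsageID" else [raw]
            for target in targets:
                target = target.strip()
                if target and target not in known:
                    problems.append((record["taxonID"], term, target))
    return problems


records = [
    {"taxonID": "1", "parentNameUsageID": ""},
    {"taxonID": "2", "parentNameUsageID": "1"},
    {"taxonID": "3", "parentNameUsageID": "99"},
    {"taxonID": "4", "acceptedNameUsageID": "2|77"},
]
problems = dangling_references(records)
```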

  29. GBIF Nub • Synthetic “union taxonomy”, checklist #1 • Lexical group = nub name usage • Classification based on prioritized checklists • Align to 8 CoL kingdoms • Fixed accepted ranks: • Linnean + subfamily, subgenus, section, subspecies, variety, form • Other ranks become “Intermediate rank” synonyms • Homotypic synonyms only • Work in progress!

  30. Personal Name Lists • User accounts with personal name lists • Name string + kingdom/nom code • Add classifications, status, distribution, vernaculars, etc from one or more indexed checklists • Also on the fly via webservices • but only for already indexed name strings • In development …
