GLOBAL BIODIVERSITY

GLOBAL BIODIVERSITY INFORMATION FACILITY. Challenges operating a global biodiversity portal. Tim Robertson, Information Systems Architect, September 2010. www.gbif.org.

Presentation Transcript


  1. GLOBAL BIODIVERSITY INFORMATION FACILITY Challenges operating a global biodiversity Portal Tim Robertson Information Systems Architect September 2010 www.gbif.org

  2. About GBIF • An operational network • Connecting hundreds of institutions • Thousands of data sources • Free and open access to information • Achieved through globally recognised standards • Not “standards”, “interoperability” (Dr. Michael J. Ackerman)

  3. The Data Portal Status: Live since 2007 http://data.gbif.org • Provides services • Search (real time) • Browse (taxonomic, geographic, by publisher etc) • Pre-processed reports • Visualisations • Various export capabilities • Means to access the original source of data • An index of content available through the GBIF Network

  4. Registry component Provides the information to determine the participating institutions in GBIF and the technical end-points to access their datasets, along with contact information.

  5. Registry component • Previously implemented using an open industry business registry known as UDDI • 2-tier model of “data publisher having several datasets” • The GBIF network is more complicated than this. • Datasets are shared or published through multiple channels. Results in complex attribution chains.

  6. Registry component Status: Prototype available http://gbrds.gbif.org • Developed a graph based model to handle this information • Challenge is now to open the management of content • Wikipedia style open access curation? • Facebook style request / confirmation? • Complex rules for editing permissions?
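The slides do not describe the graph model itself, so the following is only a minimal sketch of why a graph fits better than the 2-tier "publisher has datasets" model. The entity names, relation names, and the `Registry` class are all hypothetical illustrations, not the GBRDS schema.

```python
from collections import defaultdict


class Registry:
    """Toy graph registry: entities are nodes, typed relations are edges."""

    def __init__(self):
        # entity -> list of (relation, target) pairs
        self.edges = defaultdict(list)

    def relate(self, source, relation, target):
        self.edges[source].append((relation, target))

    def attribution_chains(self, start):
        """Return every path from `start` to an entity with no further relations."""
        chains = []

        def walk(node, path):
            # Skip targets already on the path so cycles cannot loop forever.
            targets = [t for _, t in self.edges.get(node, []) if t not in path]
            if not targets:
                chains.append(path)
                return
            for target in targets:
                walk(target, path + [target])

        walk(start, [start])
        return chains


# Hypothetical example: one dataset reachable through two channels.
registry = Registry()
registry.relate("dataset:birds", "owned_by", "institution:museum-a")
registry.relate("dataset:birds", "published_via", "network:regional-node")
registry.relate("network:regional-node", "endorsed_by", "institution:national-node")

chains = registry.attribution_chains("dataset:birds")
for chain in chains:
    print(" -> ".join(chain))
```

A 2-tier model could record only the first of these two chains; the graph keeps both, which is what makes attribution for multiply-published datasets tractable.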

  7. Registry component Status: Prototype available http://gbrds.gbif.org

  8. Metadata catalogue Status: Under construction • Data portal currently provides • Contact information • Basic attribution • Limited means to • Understand the nature of dataset creation • Assess fitness-for-use of data • Discover undigitised content or content in non-standard forms

  9. Metadata catalogue Status: Under construction • Recent work focusing on: • Accommodate existing metadata standards (ISO, FGDC, EML, DIF, DC, NCD etc) • Limit use of “lossy” transformations • Support OAI-PMH protocols for harvesting • Provide OAI-PMH services for wider participation • Developing a GBIF metadata profile • Based on the EML 2.1.0 profile • http://rs.gbif.org/schema/eml-gbif-profile/ • Prototype basic and structured search
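As a rough illustration of what OAI-PMH harvesting involves on the client side, the sketch below builds `ListRecords` request URLs. The verb and the `metadataPrefix`, `from`, and `resumptionToken` arguments come from the OAI-PMH 2.0 specification; the base URL and the function name are assumptions made for the example, and no GBIF endpoint is implied.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    if resumption_token:
        # Per the protocol, resumptionToken is an exclusive argument:
        # it must not be combined with metadataPrefix or from.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Selective harvesting: only records changed since this date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hypothetical repository address, used only for illustration.
BASE = "https://example.org/oai"

first_page = list_records_url(BASE, from_date="2010-09-01")
next_page = list_records_url(BASE, resumption_token="abc123")
print(first_page)
print(next_page)
```

A harvester would fetch the first URL, parse the XML response, and keep requesting with each returned resumption token until none is supplied.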

  10. ChecklistBank Status: Prototype available http://ecat-dev.gbif.org • Unified access to multiple checklists • Taxonomic, nomenclatural, thematic • Provide dictionaries to help improve services for parsing and name finding • Name based services • Treatment of names • Classification services for names • Vernacular names • Identifiers used by sources of checklists • E.g. Catalogue of Life LSIDs • Lexical and nomenclatural grouping

  11. ChecklistBank http://ecat-dev.gbif.org Status: Prototype available http://ecat-dev.gbif.org

  12. Annotating content Annotation Interest Group • Correcting mistakes • Aligning to standardised vocabularies • Completing missing terms (e.g. reverse georeferencing) • Complementing with additional information • Invasive indicator • Protected area identifier • etc etc. • Lesson learnt: Calculate once, store along with record
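The "calculate once, store along with record" lesson can be sketched as follows. This is a toy under stated assumptions: the bounding boxes, region codes, and record layout are invented for illustration, and a real reverse georeferencing service would be far more involved.

```python
# Invented bounding boxes: (min_lat, max_lat, min_lon, max_lon).
REGIONS = {
    "REGION-A": (0.0, 10.0, 0.0, 10.0),
    "REGION-B": (10.0, 20.0, 0.0, 10.0),
}

lookups = 0  # counts how often the expensive step actually runs


def reverse_georeference(lat, lon):
    """Stand-in for an expensive lookup from coordinates to a region code."""
    global lookups
    lookups += 1
    for code, (min_lat, max_lat, min_lon, max_lon) in REGIONS.items():
        if min_lat <= lat < max_lat and min_lon <= lon < max_lon:
            return code
    return None


def annotate(record):
    """Compute the region once and keep the result on the record itself."""
    annotations = record.setdefault("annotations", {})
    if "region" not in annotations:
        annotations["region"] = reverse_georeference(record["lat"], record["lon"])
    return record


record = {"id": "occ-1", "lat": 12.5, "lon": 3.0}
annotate(record)
annotate(record)  # second call reuses the stored value
print(record["annotations"]["region"], lookups)
```

Storing the derived value with the record means later searches, exports, and reports read it directly instead of repeating the calculation.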

  13. Annotating content Annotation Interest Group • Not all annotations are of interest to the data holder • Are all annotations from a trustworthy source? • Challenge is to design an infrastructure that supports • Widespread quality control • Brokerage of annotations for reuse • Investigate open access to help foster innovation and research

  14. Performance • An index should be: • Fast in operation • Relevant • Provide means to search that suit the users • Accurate • Reflect changes in the network quickly • “Changes made by the data holder should be reflected in index within 1 month”

  15. Performance • Transfer stage: • No robust mechanism to follow changes • dwc:dateLastModified not often usable • TAPIR / DiGIR / BioCASe • Inefficient transfer for full dataset harvesting • No mechanism to inform of deletion of records • Need to do a complete dataset harvest each time • Darwin Core text guidelines are one means of simplifying this • One month saw a 13 million record increase, against the usual 1-2 million, due to Darwin Core Archives
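When the source offers no change or deletion notifications, one common workaround is to compare each complete harvest against what the index already holds. The sketch below is an assumption about how that could be done, not a description of the GBIF harvester; the record layout and function names are hypothetical.

```python
import hashlib
import json


def fingerprint(record):
    """Stable hash of a record's content, independent of key order."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def diff_harvest(indexed, harvested):
    """Compare a full harvest with the index.

    indexed:   dict of record id -> fingerprint stored at the last harvest
    harvested: dict of record id -> record from the current full harvest
    Returns (new, changed, deleted) lists of record ids.
    """
    new = sorted(i for i in harvested if i not in indexed)
    deleted = sorted(i for i in indexed if i not in harvested)
    changed = sorted(
        i for i in harvested
        if i in indexed and fingerprint(harvested[i]) != indexed[i]
    )
    return new, changed, deleted


# State remembered from the previous harvest.
previous = {
    "r1": {"name": "Puma concolor", "country": "AR"},
    "r2": {"name": "Panthera leo", "country": "KE"},
    "r3": {"name": "Ursus arctos", "country": "FI"},
}
indexed = {rid: fingerprint(rec) for rid, rec in previous.items()}

# Current harvest: r2 edited, r3 gone, r4 added.
harvested = {
    "r1": {"name": "Puma concolor", "country": "AR"},
    "r2": {"name": "Panthera leo", "country": "TZ"},
    "r4": {"name": "Lynx lynx", "country": "SE"},
}

new, changed, deleted = diff_harvest(indexed, harvested)
print(new, changed, deleted)
```

This still requires transferring the whole dataset, which is the cost the slide points at; it only limits how much of the index must be rewritten afterwards.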

  16. Challenge: performance • Post-harvesting stage: • Clearly parallelisation is key… and the database becomes a bottleneck
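A small sketch of the pattern the slide hints at: per-record work is spread across workers, while writes are funnelled into a single batched insert so the database is touched once rather than once per record. The table name, the sample records, and the `normalise` step are assumptions for illustration; SQLite and a thread pool stand in for whatever the real pipeline uses.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def normalise(record):
    """Per-record processing that can run independently of other records."""
    record_id, name = record
    # Collapse whitespace and standardise capitalisation.
    return record_id, " ".join(name.split()).capitalize()


raw = [(1, "puma   concolor"), (2, "PANTHERA leo"), (3, " ursus arctos ")]

# Parallel stage: map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    processed = list(pool.map(normalise, raw))

# Serial stage: one batched write instead of one statement per worker.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrence (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO occurrence VALUES (?, ?)", processed)
conn.commit()

rows = conn.execute("SELECT id, name FROM occurrence ORDER BY id").fetchall()
print(rows)
```

Adding workers speeds up only the first stage; the write stage stays serial, which is one way to see why the database ends up as the limiting factor.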

  17. Challenge: consistency • The more “batch” processing one does, the higher the risk of inconsistencies • Aim for eventual consistency? • Can be mitigated • through careful data process planning • through clear explanations to users of when a “view” was last produced

  18. Roadmap 2011 • Consolidate existing work • Unified data entry to (Data API) • Institutions, collections, occurrences, names… • Rich metadata where available • Multiple indexes to the content • Marine, botany, invasive etc • Service offerings (Service API) • Registration services • Name services • Mapping services • Annotation services