
DataONE Cyberinfrastructure Overview

DataONE Cyberinfrastructure Overview. USGS Workshop April, 2012. Increasing Economic Impacts. Data deluge. Sensors, sensor networks, and remote sensing gather observations; Data management and stewardship. Photo courtesy of www.carboafrica.net. The long tail of orphan data.

maryw


Presentation Transcript


  1. DataONE Cyberinfrastructure Overview. USGS Workshop, April 2012

  2. Increasing Economic Impacts

  3. Data deluge: Sensors, sensor networks, and remote sensing gather observations; data management and stewardship. Photo courtesy of www.carboafrica.net

  4. The long tail of orphan data: “Most of the bytes are at the high end, but most of the datasets are at the low end” (Jim Gray). [Figure: volume plotted against rank frequency of datatype, labeling specialized repositories (e.g. GenBank, PDB) and orphan data (B. Heidorn).]

  5. Data entropy [Figure: information content plotted against time, with labels: time of publication, specific details, general details, retirement or career change, accident, death.] (Michener et al. 1997)

  6. DataONE vision and approach: Enable new science and knowledge creation through easy access to data about life on earth and the environment that sustains it, plus access to key tools. • Build on existing cyberinfrastructure • Create new cyberinfrastructure • Support communities of practice

  7. DataONE Architecture

  8. Data Access Platform • DataONE offers a platform that bridges across heterogeneous existing and new repositories to provide consistent, reliable access to diverse data • Operates core services necessary to maintain platform consistency • Existing tools and techniques modified to work with a single common data layer instead of a multitude

  9. Enabling Functionality Fundamentals of the core cyberinfrastructure: • Identifiers • Preservation • Identity • Discovery

  10. DataONE Cyberinfrastructure: Three major components for a flexible, scalable, sustainable network • Member Nodes: diverse institutions; serve local community; provide resources for managing their data; retain copies of data • Coordinating Nodes: retain complete metadata catalog; indexing for search; network-wide services; ensure content availability (preservation); replication services • Investigator Toolkit

  11. Three Major Components • Investigator Toolkit: Client Libraries (Command Line, Java, Python); Web Interface; Analysis, Visualization; Data Management • Member Nodes: Service Interfaces (Tier 1 – Read only, Public; Tier 2 – Read only, Auth-z; Tier 3 – Read, Write; Tier 4 – Replication target); Service Bridge; Data Repository • Coordinating Nodes: Service Interfaces (Resolution, Discovery, Replication, Registration, Identifiers, Catalog, Preservation, Monitor, Auth-z, Identity); Object Store; Index

  12. 1. Coordinating Nodes • Object tracking and replica management • High availability • Performance • Scalable architecture • Java on Tomcat, Hazelcast, SOLR • Leveraging Metacat [Diagram: Coordinating Node service interfaces (Resolution, Discovery, Replication, Registration, Identifiers, Catalog, Preservation, Monitor, Auth-z, Identity), Object Store, and Index, alongside the Investigator Toolkit and Member Nodes.]
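
A minimal Python sketch of what "object tracking and replica management" could look like, assuming a simple in-memory map from identifier to the nodes holding a copy. The class name, method names, and node ids are hypothetical and are not taken from the DataONE code base; the identifier "erd.362.1" is borrowed from the R client slide.

```python
from collections import defaultdict


class ReplicaRegistry:
    """Toy in-memory stand-in for Coordinating Node object tracking (hypothetical)."""

    def __init__(self):
        # identifier -> set of member node ids that hold a copy
        self._locations = defaultdict(set)

    def register(self, pid, node_id):
        """Record that node_id holds a copy of the object named pid."""
        self._locations[pid].add(node_id)

    def resolve(self, pid):
        """Return the sorted node ids holding pid, or raise KeyError if unknown."""
        if pid not in self._locations:
            raise KeyError(f"unknown identifier: {pid}")
        return sorted(self._locations[pid])

    def needs_replication(self, pid, desired_copies):
        """True when fewer than desired_copies nodes hold the object."""
        return len(self._locations.get(pid, ())) < desired_copies


registry = ReplicaRegistry()
registry.register("erd.362.1", "mn-a")  # "mn-a" is a made-up node id
```

A production service would persist this state and keep it highly available across nodes; the sketch only shows the bookkeeping idea.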

  13. 2. Member Nodes • Data storage • Data access • Access control • Replication • Metadata quality • Primary user interaction [Diagram: Member Node service interfaces (Tier 1 – Read only, Public; Tier 2 – Read only, Auth-z; Tier 3 – Read, Write; Tier 4 – Replication target), Service Bridge, and Data Repository, alongside the Investigator Toolkit and Coordinating Nodes.]

  14. Member Node Functional Tiers • Tier 1: Read only, public content: ping(), getLogRecords(), getCapabilities(), get(), getSystemMetadata(), getChecksum(), listObjects(), synchronizationFailed() • Tier 2: Read only, with access control: isAuthorized(), systemMetadataChanged() • Tier 3: Read/Write using client tools: create(), update(), delete() • Tier 4: Able to operate as a replication target: replicate(), getReplica() • http://mule1.dataone.org/ArchitectureDocs-current/apis/MN_APIs.html
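
The tier list above can be read as a lookup table. The sketch below, in Python, derives the highest tier a node satisfies from the set of methods it implements. The method names are copied from the slide; the function name and the assumption that tiers are cumulative (a Tier 3 node also implements Tiers 1 and 2) belong to this sketch, not to the DataONE specification.

```python
# Method names per tier, copied from the slide above.
TIER_METHODS = {
    1: {"ping", "getLogRecords", "getCapabilities", "get",
        "getSystemMetadata", "getChecksum", "listObjects",
        "synchronizationFailed"},
    2: {"isAuthorized", "systemMetadataChanged"},
    3: {"create", "update", "delete"},
    4: {"replicate", "getReplica"},
}


def highest_tier(supported):
    """Return the highest tier fully covered by `supported`, or 0 if none.

    Assumes tiers are cumulative: the walk stops at the first tier
    whose methods are not all present.
    """
    supported = set(supported)
    tier = 0
    for level in sorted(TIER_METHODS):
        if not TIER_METHODS[level] <= supported:
            break
        tier = level
    return tier
```

For example, a node that implements every Tier 1 and Tier 3 method but lacks isAuthorized() would be reported as Tier 1 under this cumulative reading.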

  15. Diverse Member Nodes

  16. The DataONE Federation

  17. The DataONE Federation

  18. 3. The Investigator Toolkit • Developer, end-user tools • Creation, search, retrieval, management • Plugins, extensions for analysis tools [Diagram: Investigator Toolkit (Client Libraries: Command Line, Java, Python; Web Interface; Analysis, Visualization; Data Management; Kepler) alongside Member Nodes and Coordinating Nodes.]

  19. Investigator Toolkit Activities: DMP-Tool, Kepler

  20. User Interactions

  21. Libraries and CLI • Client libraries available in Java and Python (+bash): low-level direct interaction with service endpoints; higher level abstraction (e.g. data packages) • Command Line Client (CLI): interact with DataONE platform from command line; one-shot or command shell operation; intended for developers or “technical” users

  22. Using the DataONE R Client
      # Initialize client object
      d1 <- D1Client()
      # Resolve, download, and convert data
      dataPackage <- getD1Object(d1, "erd.362.1")
      erd.train.locs <- asDataFrame(dataPackage, 1)
      # Store model results on Member Node
      d1object <- createD1Object(d1, dataId, doc_char, format, mn_nodeid)
      d1object$create()
      d1object$setPublicAccess()

  23. ONE Mercury • Data discovery tool • Enables search and retrieval of content indexed by DataONE • Primary web based user interface for DataONE • Operates on each Coordinating Node • Same SOLR / Lucene index is utilized by other client tools

  24. ONE Mercury Architecture • Single portal • Numerous search capabilities • Search sharing functions (RSS, Web Services) • Metadata has link to data, which reside at Data Center [Diagram: metadata extraction from Data Centers / Member Nodes into an internal metadata index stored at the Coordinating Node. Labels shown: Catalog for Earth Observations; BDP NBII; EML LTER; EML NCEAS; FGDC ORNL DAAC; EML OBFS; FGDC IAI; DIF LP DAAC; FGDC LBA; EML I-LTER; EML TERN; EML SAEON.]

  25. Others • ONEDrive • Excel add-in • Morpho metadata editor • Workflow tools

  26. Technical activities • DataONE Architecture • USGS data contributions through Member Node(s) • USGS data access through DataONE • Investigator Toolkit access, plug-ins • Data Management Planning Tool • EzID DOI Service • USGS technology leveraging (e.g. Clearinghouse) • USGS metadata leveraging (e.g. training, tools)

  27. Investigator Toolkit Development • Work in more robust metadata options for USGS CDI Tools • Implement FuseFS & Dokan (network file system) to enable DataONE Toolkit interaction with loaded datasets • Contribute ScienceBase components as plugins to DataONE • Expose ArcGIS to DataONE APIs for data access, contributions [Diagram labels: Investigator Toolkit, ScienceBase]

  28. DataONE Drive – Windows implementation • Concept: a virtual drive integrated into the OS that lets scientists access and deposit data • DataONE has developed a Mac/Linux implementation • USGS is working on a Windows implementation

  29. DataONE Drive – Windows implementation

  30. Internals

  31. Component Communications • HTTPS • Representational State Transfer (REST) end points • XML encoded messages • Message structures defined by schema [Diagram: Investigator Tools, Member Nodes (MN), and Coordinating Nodes (CN) exchanging messages.]
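
As a rough illustration of the pattern on this slide (a REST endpoint addressed over HTTPS, returning an XML-encoded message), the Python sketch below builds an endpoint URL and parses a small XML payload with the standard library. The base URL, the /object/<pid> path layout, and the XML element names are all assumptions made for this example; the real paths and message structures are defined by the DataONE architecture documents and schema.

```python
from urllib.parse import quote
import xml.etree.ElementTree as ET

# Hypothetical Member Node base URL, for illustration only.
BASE_URL = "https://mn.example.org/mn/v1"


def object_url(base_url, pid):
    """Build a REST URL for an object; the /object/<pid> layout is assumed.

    The identifier is percent-encoded so characters such as ':' and '/'
    cannot be mistaken for URL structure.
    """
    return f"{base_url.rstrip('/')}/object/{quote(pid, safe='')}"


# Hypothetical XML message; element names are illustrative, not the DataONE schema.
SAMPLE_MESSAGE = """<objectInfo>
  <identifier>erd.362.1</identifier>
  <size>1024</size>
</objectInfo>"""


def parse_object_info(xml_text):
    """Parse the illustrative XML message into a plain dict."""
    root = ET.fromstring(xml_text)
    return {
        "identifier": root.findtext("identifier"),
        "size": int(root.findtext("size")),
    }
```

No network call is made here; a client would fetch the URL over HTTPS and pass the response body to the parser.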

  32. Application Programming Interfaces • Coordinating Node: Core, Read, Authorization, Identity, Replication, Register • Member Node: Core, Read, Authorization, Storage, Replication

  33. Data Model [Diagram: a Package containing ScienceMetadata (XML documents: ISO19115, EML, FGDC, …), Data (any data object), and a ResourceMap (OAI-ORE RDF). Each object is paired one-to-one with a SystemMetadata record; a package may hold n science metadata and n data objects.]
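
One way to read the data model diagram is as a handful of record types. The Python sketch below assumes every object (data, science metadata, or resource map) carries exactly one system metadata record, and that a package groups one resource map with any number of science metadata and data objects. All class and field names, and the identifiers in the usage lines, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SystemMetadata:
    identifier: str   # assumed field: the object's identifier
    format_id: str    # assumed field: e.g. a label for EML or OAI-ORE


@dataclass
class D1Object:
    sysmeta: SystemMetadata   # exactly one system metadata record per object
    content: bytes            # the object's bytes (data, XML, or RDF)


@dataclass
class Package:
    resource_map: D1Object
    science_metadata: List[D1Object] = field(default_factory=list)
    data: List[D1Object] = field(default_factory=list)

    def identifiers(self):
        """List member identifiers: resource map, then metadata, then data."""
        members = [self.resource_map, *self.science_metadata, *self.data]
        return [m.sysmeta.identifier for m in members]


# Usage with made-up identifiers and format labels.
rmap = D1Object(SystemMetadata("pkg.1", "oai-ore"), b"<rdf/>")
meta = D1Object(SystemMetadata("meta.1", "eml"), b"<eml/>")
table = D1Object(SystemMetadata("data.1", "csv"), b"a,b\n1,2\n")
package = Package(rmap, [meta], [table])
```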

  34. Access Control [Diagram of the DataONE Platform connecting Investigator Tools, Member Nodes (MN), and Coordinating Nodes (CN). Labels shown: Retrieve, Replication, Create, Manage, Synchronization, Discover, Indexing, Authentication, Identity, CN Replication, Identity Providers, CILogon.]

  35. Schedule

  36. Targets for Initial Public Release • Operational core infrastructure • Three coordinating nodes: ORC, UCSB, UNM • Eight member nodes: KNB, SANParks, Avian Knowledge Network, Dryad, ORNL DAAC, Merritt, USGS, PISCO, LTER • Essential investigator toolkit components: search interface (ONE Mercury); ONE R-plugin; developer tools in Python and Java • Design and component documentation

  37. Schedule

  38. DataONE Team and Sponsors • Ewa Deelman • Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Mark Servilla • Peter Honeyman • Dave Vieglais • Suzie Allard, Carol Tenopir, Maribeth Manoff, Robert Waltz, Bruce Wilson • Jeff Horsburgh • John Cobb, Bob Cook, Giri Palanisamy, Line Pouchard • Bertram Ludaescher • Robert Sandusky • Patricia Cruse, John Kunze • Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly • Peter Buneman • Chad Berkley, Stephanie Hampton, Matt Jones • David DeRoure • Paul Allen, Rick Bonney, Steve Kelling • Carole Goble • Ryan Scherle, Todd Vision • Donald Hobern • Randy Butler • Cliff Duke • Leon Levy Foundation

  39. Resources • Architecture Docs: http://mule1.dataone.org/ArchitectureDocs-current • Operations Docs: http://mule1.dataone.org/OperationDocs/ • Component Docs: Distributed with component • Source Code Repository: https://repository.dataone.org/ • Community: DUG (DataONE Users Group), Developers, CCIT, Working Groups
