
Designing Flexible Workflow for Upstream Participation of the Scientific Data Community

Robert R. Downs and Robert S. Chen, NASA Socioeconomic Data and Applications Center (SEDAC), Center for International Earth Science Information Network (CIESIN), The Earth Institute, Columbia University


Presentation Transcript


  1. Designing Flexible Workflow for Upstream Participation of the Scientific Data Community
     Robert R. Downs and Robert S. Chen
     NASA Socioeconomic Data and Applications Center (SEDAC)
     Center for International Earth Science Information Network (CIESIN)
     The Earth Institute, Columbia University
     Prepared for presentation to the IASSIST 2010 Meeting, June 3, 2010, Cornell University, Ithaca, NY

  2. Scientific Data are at Risk if not Archived
     • Replication, comparison, and new and future uses of existing data require scientific data stewardship
     • Data must be identifiable, discoverable, accessible, usable, and recoverable
     • Data preservation requires preparation
     • Datasets need to be complete, documented, and described, and must contain permissions for their use
     • Stewardship of data often decreases after completion of the project that produced the data
     • Some data are neglected if not archived soon after creation

  3. Saving Scientific Data For Use By Others
     • Scientific data repositories can provide capabilities to submit data for archiving
     • Scientist or team member submits data online
     • A data submission system could assist data producers in preparing and describing their data for archiving
     • Data preparation prior to project completion
     • Capabilities for data submission must balance the need for comprehensive information about the data with the practicalities of what data producers are willing and able to provide
     • Easy tools to deposit and describe data

  4. Designing a Data Submission System
     • Identify Trusted Repository Requirements for Submission
     • Categorize Submission Services
     • Define Functions for Submission Services
     • Create Workflow for Data Submission and Review
     • Model Scientific Data Submission and Workflow
     • Review of Successful Submissions
     • Recommendations for Submission Services

  5. Identifying Requirements for Submission System
     • Reviewed requirements for trustworthy archives and digital repositories in relevant standards and documents
     • Consultative Committee for Space Data Systems (CCSDS) (2002) Reference Model for an Open Archival Information System (OAIS). Adopted as ISO 14721:2003
     • CCSDS (2004) Producer-Archive Interface Methodology Abstract Standard. Adopted as ISO 20652:2006
     • CCSDS. Audit and Certification of Trustworthy Digital Repositories: Draft Recommended Practice. 652.0-R-1 Red Book, Issue 1. (July 2009)
     • Initially utilized TRAC document
     • Online Computer Library Center (OCLC) and Center for Research Libraries (CRL) (2007) Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC), Version 1.0
     • Identified and categorized pre-ingest requirements from TRAC
     • Requirements relevant to submission and workflow prior to ingest
     • Identified pre-ingest requirements from 652.0-R-1 (Draft ISO Standard)
     • Related and additional submission and pre-ingest workflow requirements

  6. Communication Requirements Identified From TRAC Document*
     • A3.5 Repository has policies and procedures to ensure that feedback from producers and users is sought and addressed over time.
     • A3.7 Repository commits to transparency and accountability in all actions supporting the operation and management of the repository, especially those that affect the preservation of digital content over time.
     • B1.4 Repository’s ingest process verifies each submitted object (i.e., SIP) for completeness and correctness as specified in B1.2.
     • B1.6 Repository provides producer/depositor with appropriate responses at predefined points during the ingest processes.
     • B1.7 Repository can demonstrate when preservation responsibility is formally accepted for the contents of the submitted data objects (i.e., SIPs).
     • B1.8 Repository has contemporaneous records of actions and administration processes that are relevant to preservation (Ingest: content acquisition).
     *Source: Online Computer Library Center (OCLC) and Center for Research Libraries (CRL). (2007). Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC), Version 1.0. OCLC & CRL. February 2007. Available: http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf

  7. Authentication Requirements Identified From Draft Recommended Practice (CCSDS 652.0-R-1)*
     • 3.3.4 The repository shall commit to transparency and accountability in all actions supporting the operation and management of the repository that affect the preservation of digital content over time.
     • 4.1.4 The repository shall have mechanisms to appropriately verify the identity of the Producer of all materials.
     • 4.6.2 The repository shall follow policies and procedures that enable the dissemination of digital objects that are traceable to the originals, with evidence supporting their authenticity.
     *Source: Consultative Committee for Space Data Systems (CCSDS) Audit and Certification of Trustworthy Digital Repositories: Draft Recommended Practice. Red Book, Issue 1. 652.0-R-1 (July 2009). Available: http://public.ccsds.org/sites/cwe/rids/Lists/CCSDS%206520R1/NASAUSOverview.aspx

  8. Digital Repository Services for Web-Based Data Submission
     • Authentication: Verify identity of data producer or representative for each submission session
     • Data Deposit: Gather and deposit data and documentation
     • Data Description: Describe data for preservation, discovery, and use
     • Submission Agreement: Establish agreement between the producer and repository
     • Communication: Confirm submission, request information if needed, and notify upon ingest
     • Review and Approval: Review submission information package and approve for ingest
     • Transformations: Transform descriptive information and actions into metadata standards for ingest
     Source: Downs & Chen (2009) Earth and Space Science Informatics Workshop. http://essi.gsfc.nasa.gov/pdf/Downs.pdf

  9. Workflow for Web-Based Submission of Scientific Data
     • Secure authenticated login by authorized data producer or representative
       • Multiple sessions may be needed to assemble submission information
     • Deposit and describe data and documentation files
       • Automate and encourage descriptions for each file
     • Describe scientific data set
       • Encourage unique title and offer selectable choices when possible
     • Grant permissions for data set
       • Offer choices based on data type, organization, and collection
     • Submit data set
       • Provide capabilities to review and modify entire package before submission
     • Notify submitter and archivist that submission was completed
       • Email notifications include contact information for subsequent communication
     • Review submission for completeness and correctness
       • Apply appraisal criteria for collection to which data set was submitted
       • Contact producer regarding questions or need for additional information
     • Approve data set for ingest to digital repository
       • Notify submitter that submission has been approved for ingest into digital repository
     • Transform descriptions and actions into metadata for ingest to digital repository
       • Descriptive information is converted into XML metadata and ingested into digital repository
     Source: Downs & Chen (2009) Earth and Space Science Informatics Workshop. http://essi.gsfc.nasa.gov/pdf/Downs.pdf
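     The workflow steps on this slide could be tracked as a small state machine. The following is only a minimal sketch: the stage names, the allowed transitions, and the `Submission` class are hypothetical and paraphrase the slide; they are not taken from the SEDAC system.

```python
from enum import Enum


class Stage(Enum):
    """Hypothetical stage names paraphrasing the workflow steps on the slide."""
    DRAFT = "draft"            # producer deposits and describes files, possibly over several sessions
    SUBMITTED = "submitted"    # package submitted; submitter and archivist are notified
    IN_REVIEW = "in_review"    # archivist reviews for completeness and correctness
    NEEDS_INFO = "needs_info"  # archivist has contacted the producer with questions
    APPROVED = "approved"      # approved for ingest; submitter is notified
    INGESTED = "ingested"      # descriptions transformed to metadata and ingested


# Allowed transitions (an assumption; the slide does not define them formally).
TRANSITIONS = {
    Stage.DRAFT: {Stage.SUBMITTED},
    Stage.SUBMITTED: {Stage.IN_REVIEW},
    Stage.IN_REVIEW: {Stage.NEEDS_INFO, Stage.APPROVED},
    Stage.NEEDS_INFO: {Stage.IN_REVIEW},
    Stage.APPROVED: {Stage.INGESTED},
    Stage.INGESTED: set(),
}


class Submission:
    """Tracks one data set through the hypothetical stages above."""

    def __init__(self, title):
        self.title = title
        self.stage = Stage.DRAFT
        self.history = [Stage.DRAFT]  # record of every stage reached, in order

    def advance(self, new_stage):
        """Move to new_stage, or raise ValueError if the transition is not allowed."""
        if new_stage not in TRANSITIONS[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.name} to {new_stage.name}"
            )
        self.stage = new_stage
        self.history.append(new_stage)
```

     Keeping the stage history alongside the current stage is one simple way to support the record-keeping and notification points listed in the requirements; a real system would also store timestamps and the identity of whoever acted.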

  10. Model for Web-Based Data Submission and Workflow
      [Diagram] Actors: Data Producer, Data Reviewer
      • Authentication: Login for One or More Sessions
      • Data Deposit: Provide Files and Descriptions
      • Data Description: Describe Data Set
      • Submission Agreement: Grant Intellectual Property Rights
      • Communication: Notifications and Requests
      • Review and Approval: Appraise and Approve Submission Information Package
      • Transformation: Transform Values to XML Metadata
      • Ingest: Archival Information Package in Digital Repository
      Derived from Downs & Chen (2009) Earth and Space Science Informatics Workshop. http://essi.gsfc.nasa.gov/pdf/Downs.pdf
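      The transformation step (values to XML metadata) might look something like the sketch below. It uses Python's standard `xml.etree.ElementTree`; the root element `dataset` and the field names are hypothetical placeholders, not a real metadata schema and not the one used by the authors.

```python
import xml.etree.ElementTree as ET


def to_xml_metadata(description):
    """Convert a flat dict of descriptive values into an XML string.

    Keys become element names (they must be valid XML names); list or
    tuple values produce one element per item. The element names are
    hypothetical placeholders, not a real metadata standard.
    """
    root = ET.Element("dataset")
    for field, value in description.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            ET.SubElement(root, field).text = str(item)
    return ET.tostring(root, encoding="unicode")


record = to_xml_metadata({
    "title": "Example Data Set",
    "keyword": ["oceans", "environment"],
})
```

      In practice the target would be an established metadata standard, with the mapping from form fields to elements defined by the repository.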

  11. Review of Successful Data Submissions
      • Resources Reviewed:
        • Legacy Data Submission Process
        • Forms Used in Legacy Submission Process
        • Descriptions of Submitted Data
        • Data Collections
        • Cyberinfrastructure and physical facilities
        • Initial Prototype of Submission System

  12. Support for Successful Submission
      • Affordances identified to address challenges for online submission of data:
        • Enable Timely Preparation of Submissions
        • Facilitate Authentication of Submitter
        • Elicit Information to Contact Submitter
        • Invite Complete Documentation
        • Foster Composition of Data Descriptions
        • Provide Choices to Describe Data
        • Request Non-Restrictive Permissions

  13. Enable Timely Preparation of Submissions
      • Challenge: Data submitted before creation or a long time after creation can be incorrect or incomplete
        • Previous asynchronous capabilities enabled assembly of submissions locally prior to submission.
        • Submissions prior to completion can result in an addendum to replace missing or incorrect files.
        • Submissions long after completion can result in delays for scheduling dissemination.
      • Recommendation: Encourage producers to submit data at the time when it has been created by enabling multiple sessions for producers to prepare and submit data.

  14. Facilitate Authentication of Submitter
      • Challenge: Identification of the data submitter is needed to ensure that the data producer is being represented
        • Previous physical and email submission capabilities enabled verification of the identity of the data provider.
        • Submissions received from non-authorized individuals might not contain the correct or complete data.
        • The data producer or their representative can provide rights for archiving and using the data.
      • Recommendation: Establish capabilities and procedures to allow data producers and their representatives to receive a username and password that can be used to log in to the data submission system when submitting data.

  15. Elicit Information to Contact Submitter
      • Challenge: Submitters need to be contacted to resolve issues with submission.
      • Recommendation: Request or generate the complete name and email address of the individual who submits the data.
        • Automatically populate contact information fields upon login and request verification.
        • Online form to request contact information: complete name and email address
        • Obtain additional contact information: institution, mailing address, telephone number

  16. Invite Complete Documentation
      • Challenge: Data require documentation to facilitate understanding about the data and their applicability
        • Data must be understood by those not familiar with the study.
      • Recommendation: Request submission of documents describing the data, their creation, and measures used.
        • Methodology document (who, why, what, where, when, and how the data were obtained)
        • Variable definitions and specification (location) of values (codebook)
        • Descriptions of instruments, measures, and units of measurement
        • Explanations of caveats, assumptions, additions, corrections

  17. Foster Composition of Data Descriptions: Title
      • Challenge: The relevance of a data set cannot always be determined from the title.
      • Recommendation: Provide guidance for describing the data within the title to enable discovery and to differentiate it from other data.
      • Considerations for inclusion within title:
        • Purpose: Characteristic measured
        • Measure: Instrument
        • Location: Geographical aspects measured or political (country, state, county, city, etc.)
        • Temporal Aspects: Date or range of dates when data were collected or measured
        • Version: Sequential version identifier or date of release
      • Examples:
        • Indicators of Coastal Water Quality: Change in Chlorophyll-a Concentration 1998-2007, Alaska-Argentina
        • National Footprint Accounts, 2006 Edition, Footprint and Biocapacity by major land type by nation, 2003
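      A submission form could assemble a candidate title from the elements listed on this slide and let the producer edit the result. The helper below is a hypothetical illustration only; the function name, the ordering of elements, and the comma separator are assumptions, not guidance from the presentation.

```python
def compose_title(purpose, measure=None, location=None, temporal=None, version=None):
    """Join the title elements from the slide, skipping any left blank.

    Only the purpose (the characteristic measured) is required here;
    that requirement and the element order are assumptions for this sketch.
    """
    if not purpose or not purpose.strip():
        raise ValueError("a title needs at least the characteristic measured")
    parts = [purpose, measure, location, temporal, version]
    return ", ".join(p.strip() for p in parts if p and p.strip())


# Example values are invented for illustration.
title = compose_title(
    "Population Density",
    location="Global",
    temporal="2000",
    version="Version 1",
)
```

      The generated string is meant as a starting point; the producer would still review it for uniqueness against other titles in the collection.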

  18. Provide Choices to Describe Data
      • Challenge: Identifying terms to describe data can be time consuming
      • Recommendation: Provide choices from groups of controlled vocabularies to describe data
      • Examples of terminology for consideration:
        • ISO 19115:2003 Geographic Information – Metadata Topic Categories
        • Semantic Web for Earth and Environmental Terminology (SWEET). See http://sweet.jpl.nasa.gov/ontology/
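      Offering choices from a controlled vocabulary also makes it straightforward to check what was selected. The sketch below checks terms against the ISO 19115:2003 topic category codes as they are commonly listed; the list should be verified against the standard before use, and the function name is hypothetical.

```python
# Topic category codes as commonly listed for ISO 19115:2003
# (MD_TopicCategoryCode). Verify against the standard before relying on it.
ISO_19115_TOPIC_CATEGORIES = frozenset({
    "farming", "biota", "boundaries", "climatologyMeteorologyAtmosphere",
    "economy", "elevation", "environment", "geoscientificInformation",
    "health", "imageryBaseMapsEarthCover", "intelligenceMilitary",
    "inlandWaters", "location", "oceans", "planningCadastre", "society",
    "structure", "transportation", "utilitiesCommunication",
})


def validate_topics(selected):
    """Split selected terms into (accepted, rejected), preserving input order."""
    accepted = [t for t in selected if t in ISO_19115_TOPIC_CATEGORIES]
    rejected = [t for t in selected if t not in ISO_19115_TOPIC_CATEGORIES]
    return accepted, rejected


accepted, rejected = validate_topics(["oceans", "weather", "society"])
```

      With a selectable list in the form, rejected terms should be rare; a check like this is mainly useful for descriptions that arrive by other routes.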

  19. Selecting Terms from Controlled Vocabulary: ISO 19115 Topic Categories
      Source: Downs & Chen (2009) Earth and Space Science Informatics Workshop. http://essi.gsfc.nasa.gov/pdf/Downs.pdf

  20. Request Non-Restrictive Permissions
      • Challenge: Intellectual property rights must be obtained to enable the use of data by the archive and by others.
        • Unknown rights to data can restrict data stewardship and use
        • Limiting the rights to data can prevent some uses of the data
      • Recommendation: Avoid legal terms in request for data producer to grant rights, with limited restrictions, if possible.
        • Simple form with choices to be clicked, based on affiliation of submitter and type of resource
        • Creative Commons License (Attribution) http://creativecommons.org/
        • Additional data sharing options http://sciencecommons.org/
        • Public Domain (created by Government employee)

  21. Summary: Capabilities for Upstream Submission and Workflow
      • Requirements are applicable to social science and natural science data and to interdisciplinary data
      • Potential risk when engaging data producers early
        • not knowing which data are important to preserve (but capturing more information should improve selection and appraisal)
      • Benefits of obtaining data through robust workflow prior to the end of the project that collects the data
        • higher quality metadata, including provenance information
        • reduced risk of not getting minimum metadata (e.g., when authors move on to other projects)
        • lower costs overall (data are submitted when ready)
        • ability to follow up with producers

  22. References
      • Consultative Committee for Space Data Systems (2004) Producer-Archive Interface Methodology Abstract Standard. (CCSDS 651.0-B-1). Also: Space data and information transfer systems – Producer-archive interface – Methodology abstract standard (ISO 20652:2006). Available: http://public.ccsds.org/publications/archive/651x0b1.pdf
      • Consultative Committee for Space Data Systems (2002) Reference Model for an Open Archival Information System (OAIS). Also: Space data and information transfer systems – Open archival information system – Reference model (ISO 14721:2003). Available: http://public.ccsds.org/publications/archive/650x0b1.pdf
      • Consultative Committee for Space Data Systems (CCSDS) Audit and Certification of Trustworthy Digital Repositories: Draft Recommended Practice. Red Book, Issue 1. 652.0-R-1 (July 2009). Available: http://public.ccsds.org/sites/cwe/rids/Lists/CCSDS%206520R1/NASAUSOverview.aspx
      • Downs RR, Chen RS (2009) Designing Submission Services for a Trustworthy Digital Repository of Interdisciplinary Scientific Data. Earth and Space Science Informatics Workshop: Developing the Next Generation of Earth and Space Science Informatics. August 3-5, 2009. University of Maryland, Baltimore County. Available: http://essi.gsfc.nasa.gov/pdf/Downs.pdf
      • Downs RR, Chen RS (2010) Designing Submission and Workflow Services for Preserving Interdisciplinary Scientific Data. Earth Science Informatics. Available: http://dx.doi.org/10.1007/s12145-010-0051-6
      • Nestor Working Group, Trusted Repositories - Certification (2006) Catalogue of Criteria for Trusted Digital Repositories, Version 1. Available: http://edoc.hu-berlin.de/series/nestor-materialien/8en/PDF/8en.pdf
      • Online Computer Library Center (OCLC) and Center for Research Libraries (CRL) (2007) Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC), Version 1.0. OCLC & CRL. February 2007. Available: http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf
      • The Digital Curation Centre (DCC) and Digital Preservation Europe (DPE) (2007) Digital Repository Audit Method Based on Risk Assessment (DRAMBORA). Available: http://www.repositoryaudit.eu/
