PAWN: A Novel Ingestion Workflow Technology for Scientific Data


  1. PAWN: A Novel Ingestion Workflow Technology for Scientific Data Mike Smorul, Joseph JaJa, Yang Wang, Mike McGann, and Fritz McCall

  2. Overall Principles • Distributed, secure ingestion • Use of web/grid technologies – platform independent • Minimal client-side requirements • Ease of integration with data grid systems. • Designed to satisfy data integrity requirements of scientific collections and digital preservation

  3. Producer

  4. Producer • Provides data to a data grid based on a prior agreement. • Consists of a management/metadata server and an ingestion client. • Provides initial arrangement, context, and metadata.

  5. Data Grid - receiving

  6. Data Grid – receiving • Receives data from a Producer • Validates bitstreams and metadata, and sends acknowledgement to Producer. • Arranges into collections and specifies optional publishing and preservation policy. • Publishes bitstreams into data grid.

  7. Data Grid – Long-term Stewardship • Implemented using grid technologies. • Uses the existing prototype NARA/UMD/SDSC site. • Automated replication and integrity checking. • Enforces access control and preservation policy

  8. Ingestion Workflow • Negotiate Submission Agreement. • Workflow Initialization and Submission Information Packet (SIP) creation. • Transfer of SIPs to Data Grid site. • Validation of SIP transfer • Organization of data into collections and transfer into Data Grid.
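The five workflow steps above can be sketched as an ordered set of stages. This is only an illustrative sketch: the `Stage` names and the `next_stage` helper are hypothetical and are not taken from PAWN itself.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Hypothetical names for the five ingestion steps listed on the slide."""
    NEGOTIATE_AGREEMENT = 1
    INITIALIZE_AND_CREATE_SIP = 2
    TRANSFER_SIP = 3
    VALIDATE_SIP = 4
    ORGANIZE_AND_PUBLISH = 5


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the last one."""
    members = list(Stage)  # Enum members iterate in definition order
    idx = members.index(current)
    return members[idx + 1] if idx + 1 < len(members) else None
```

A tracker built this way can refuse to transfer a SIP that has not been created, or to publish one that has not been validated.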

  9. Submission Agreement • Create a machine-actionable set of rules describing items. • Final Submission Agreement is composed of: • METS document for application defaults • METS Constraint document to limit the METS form to submission parameters

  10. METS Overview • Provides a framework for linking the structural organization of objects with metadata. • Using XML namespaces, metadata from various XML schemas can be attached to objects • e.g., Dublin Core, FGDC, etc. • Extensible for more complex metadata • http://www.loc.gov/standards/mets/
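To illustrate how namespaced metadata attaches to a structural div, here is a minimal sketch that builds a small METS tree with one Dublin Core title. The METS and Dublin Core namespace URIs are the standard ones; the function name `build_mets` and the ID value `DMD1` are assumptions made for this example, and the output is not claimed to be a complete, schema-valid METS document.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_mets(title: str, div_label: str) -> ET.Element:
    """Build a minimal METS tree: one Dublin Core record linked to one div."""
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("dc", DC_NS)

    mets = ET.Element(f"{{{METS_NS}}}mets")

    # Descriptive metadata section wrapping a Dublin Core title.
    dmd = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", ID="DMD1")
    wrap = ET.SubElement(dmd, f"{{{METS_NS}}}mdWrap", MDTYPE="DC")
    xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")
    ET.SubElement(xml_data, f"{{{DC_NS}}}title").text = title

    # Structural map: the div points back at the metadata through DMDID.
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    ET.SubElement(struct_map, f"{{{METS_NS}}}div", LABEL=div_label, DMDID="DMD1")
    return mets
```

Calling `ET.tostring(build_mets("Sample", "Case Files"), encoding="unicode")` serializes the tree for inspection.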

  11. Sample METS Document

  12. Why METS Constraints? • METS doesn’t provide a way to create machine-interpretable rules describing a collection • e.g., allow only TIFF files in certain structural areas • METS profiles allow for developer-interpretable rules, not machine-interpretable ones

  13. METS Constraints • Allows structural, metadata, and file constraints. • Structural Constraints: • Restrict child divs and restrict pointers to divs, files, and other METS documents • File Constraints: • Restrict files by MIME type or validation tests • Metadata Constraints: • Restrict allowed metadata schemas.

  14. METS Constraints - Template

      <?xml version="1.0" encoding="UTF-8"?>
      <mets …. >
        <!-- validation test section, referenced in the constraints document -->
        <amdSec>
          <techMD ID="xmltest">
            <mdWrap MDTYPE="OTHER">
              <xmlData>
                <val:validation NAME="xmltext" DESCRIPTION="Test for valid xml documents" MIMETYPE="text/xml">
                  <val:valgrp required="true">
                    <val:valtest name="gif" required="true">
                      <val:description>generic gif test for any file</val:description>
                    </val:valtest>
                  </val:valgrp>
                </val:validation>
              </xmlData>
            </mdWrap>
          </techMD>
        </amdSec>
        <!-- base div structure to use for all clients -->
        <structMap>
          <div ID="ID1" LABEL="Research &amp; Development Records">
            <div ID="ID1.1" LABEL="Research &amp; Development Project Records">
              <div ID="ID1.1.1" LABEL="R&amp;D Project Case Files"/>
              <div ID="ID1.1.2" LABEL="R&amp;D Record Series"/>
            </div>
          </div>
        </structMap>
      </mets>

  15. METS Constraints - Rules

      <?xml version="1.0" encoding="UTF-8"?>
      <metsconstraint …>
        <filegrp ID="FILE1" NAME="Text Document">
          <!-- Files can be identified either by MIMETYPE, or TESTID in skeleton METS document or both -->
          <file NAME="html document" MIMETYPE="text/html"/>
          <file TESTID="xmltext" NAME="xml document" MIMETYPE="text/xml"/>
        </filegrp>
        <!-- Apply rules to predefined div's and link to required file/metadata tests above -->
        <divrule DIVID="ID1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
        <divrule DIVID="ID1.1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
        <divrule DIVID="ID1.1.1" RESTRICTMPTR="true">
          <filetype FILEGROUPID="FILE1"/>
        </divrule>
        <divrule DIVID="ID1.1.2" RESTRICTMPTR="true"/>
      </metsconstraint>
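A rules document like the one above could be read by a validator roughly as follows. This is a sketch under stated assumptions, not PAWN's implementation: it assumes that a `divrule` containing `<filetype FILEGROUPID="…"/>` permits exactly the MIME types listed in that `filegrp`, and that a `divrule` with no `filetype` child permits no files. The sample string omits the elided namespace attributes, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Set

# Trimmed, namespace-free copy of the rules on the slide (an assumption).
RULES = """<metsconstraint>
  <filegrp ID="FILE1" NAME="Text Document">
    <file NAME="html document" MIMETYPE="text/html"/>
    <file TESTID="xmltext" NAME="xml document" MIMETYPE="text/xml"/>
  </filegrp>
  <divrule DIVID="ID1.1" RESTRICTDIV="true"/>
  <divrule DIVID="ID1.1.1" RESTRICTMPTR="true">
    <filetype FILEGROUPID="FILE1"/>
  </divrule>
</metsconstraint>"""


def allowed_mime_types(rules_xml: str) -> Dict[str, Set[str]]:
    """Map each DIVID to the MIME types its divrule permits."""
    root = ET.fromstring(rules_xml)
    # Collect the MIME types declared by every file group.
    groups = {
        grp.get("ID"): {f.get("MIMETYPE") for f in grp.findall("file") if f.get("MIMETYPE")}
        for grp in root.findall("filegrp")
    }
    allowed: Dict[str, Set[str]] = {}
    for rule in root.findall("divrule"):
        types: Set[str] = set()
        for file_type in rule.findall("filetype"):
            types |= groups.get(file_type.get("FILEGROUPID"), set())
        allowed[rule.get("DIVID")] = types
    return allowed


def is_allowed(allowed: Dict[str, Set[str]], div_id: str, mime_type: str) -> bool:
    """Check one file's MIME type against the rule for its div."""
    return mime_type in allowed.get(div_id, set())
```

With the sample rules, an HTML file is accepted under `ID1.1.1` and a TIFF file is rejected there.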

  16. Ingestion Workflow • Negotiate Submission Agreement. • Workflow Initialization and Submission Information Packet creation. • Transfer of SIPs to Data Grid site. • Validation of SIP transfer • Organization of data into collections and transfer into Data Grid.

  17. Initialize Ingestion workflow • Instantiate Producer management server to track registered objects • Establish a working trust relationship with the Data Grid • Issue clients.

  18. Create SIP • Each client registers objects stored locally with the producer management server • Register file types, validation tests, etc. • Client follows rules in Submission Agreement • Producer-wide agents can arrange registered objects to give a broader context
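As a rough sketch of what a client might record when it registers a local object, the example below captures a file's name, size, guessed MIME type, and digest. The `RegisteredObject` fields, the use of SHA-256, and the `register_object` helper are assumptions for illustration; the slides do not specify the actual registration format.

```python
import hashlib
import mimetypes
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RegisteredObject:
    """Hypothetical registration record for one locally stored file."""
    name: str
    size: int
    mime_type: str
    sha256: str


def register_object(path: Path) -> RegisteredObject:
    """Describe a local file so it can later be tracked and validated."""
    data = path.read_bytes()
    mime_type, _encoding = mimetypes.guess_type(path.name)
    return RegisteredObject(
        name=path.name,
        size=len(data),
        # Fall back to a generic type when the extension is unknown.
        mime_type=mime_type or "application/octet-stream",
        sha256=hashlib.sha256(data).hexdigest(),
    )
```

In a real deployment the record would be sent to the producer management server; here it is simply returned.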

  19. SIP Example • The submission packet is designed to contain a self-describing, self-validating set of metadata

  20. Client Interface

  21. Ingestion Workflow • Negotiate Submission Agreement. • Workflow Initialization and Submission Information Packet creation. • Transfer of SIPs to Data Grid site. • Validation of SIP transfer • Organization of data into collections and transfer into Data Grid.

  22. Transfer SIP to Data Grid • Retrieve previously registered SIP from producer management server • Authenticate to data grid • Update tracking information with new location of files in data grid • Data Grid acknowledges transfer completion to producer management server

  23. Ingestion Workflow • Negotiate Submission Agreement. • Workflow Initialization and Submission Information Packet creation. • Transfer of SIPs to Data Grid site. • Validation of SIP transfer • Organization of data into collections and transfer into Data Grid.

  24. Validation of SIP transfer • Check incoming SIP against constraints documents. • Ensure object integrity by verifying checksums/cryptographic digests • Validate bitstreams against necessary tests • Record validation results
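The integrity check on the receiving side can be sketched as a streamed digest comparison. The choice of SHA-256 as the default and the function names are assumptions; the slides do not say which digest algorithm PAWN uses.

```python
import hashlib
import hmac
from typing import BinaryIO


def stream_digest(stream: BinaryIO, algorithm: str = "sha256",
                  chunk_size: int = 65536) -> str:
    """Hash a stream in fixed-size chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(stream: BinaryIO, expected_hex: str,
                    algorithm: str = "sha256") -> bool:
    """Return True when the received bytes match the digest recorded by the producer."""
    actual = stream_digest(stream, algorithm)
    return hmac.compare_digest(actual, expected_hex.lower())
```

The boolean result is the kind of outcome that would be stored when validation results are recorded.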

  25. Ingestion Workflow • Negotiate Submission Agreement. • Workflow Initialization and Submission Information Packet creation. • Transfer of SIPs to Data Grid site. • Validation of SIP transfer • Organization of data into collections and transfer into Data Grid.

  26. Final transfer to Data Grid • Transfer objects to Data Grid • Update tracking information with new location in Data Grid • Transfer log of data activity into data grid • Return accept/reject messages to producer metadata server

  27. Component Overview

  28. Producer Components • Database to track registered objects • Certificate Authority management • Web service for receiving side security callback • Management server supplies web service interfaces to ingestion clients and management operations. • Clients are designed to be standalone, with security certificates issued by producer

  29. Receiving Components • Receiving servers validate connecting clients and validate SIPs • Validation Services are simple webservice calls. • Abstract I/O layer into data grid.

  30. Recap • Implemented using web technologies • Architecture independent • XML based metadata • METS based SIPs • Add-on constraints describing Submission Agreement • Target release dates: • Beta: April • Release: June/July

  31. More Information • ADAPT website • http://www.umiacs.umd.edu/research/adapt • Papers • Scalable, Reliable Marshalling and Organization of Distributed Large Scale Data Onto Enterprise Storage Environments • PAWN: Producer - Archive Workflow Network in Support of Digital Preservation
