SOAPI: a flexible toolkit for implementing ingest and preservation workflows


Presentation Transcript


  1. SOAPI: a flexible toolkit for implementing ingest and preservation workflows Mark Hedges Centre for e-Research, King’s College London Arts and Humanities Data Service

  2. Background • Arts & Humanities Data Service • Activities included management and preservation of research outputs from UK researchers in arts and humanities • Centre for e-Research, King’s College London (CeRch) • Activities will include management and preservation of research outputs from KCL researchers in all disciplines • Among other things …

  3. Context • Ingestion and preservation of complex material into a digital repository (Fedora-based) • Unpredictable structures • Many formats • Formalised but manual procedures • Not scalable • Functional limitations (e.g. preservation metadata, provenance)

  4. Schematic ingest process (simplified)

  5. Requirements • Handles complex/compound objects • Distributed architecture • Scalable • Automated processing and user input • Able to integrate specialised third-party tools (e.g. format conversion) • Preservation metadata management • Audit trail/provenance metadata

  6. Approach • Workflow management tool to create and execute workflows (jBPM) • Generic interfaces defining common preservation and ingest actions • Implementations of these interfaces encapsulating units of functionality • Generic interfaces to wrap third-party tools • Web service (SOAP & REST) and local implementations
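The approach on this slide (a generic interface per preservation action, with interchangeable implementations behind it) might look roughly like the Java sketch below. All type names are hypothetical and are not taken from SOAPI; the extension lookup is a toy stand-in for a real identification tool, and a SOAP or REST client could implement the same interface.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical generic interface for one preservation action (not SOAPI's actual API). */
interface FormatIdentifier {
    /** Returns a format label for the given file name, or empty if the format is unknown. */
    Optional<String> identify(String fileName);
}

/** Local implementation: a toy extension lookup standing in for a wrapped third-party tool. */
final class ExtensionFormatIdentifier implements FormatIdentifier {
    private static final Map<String, String> KNOWN = Map.of(
            "tif", "image/tiff",
            "pdf", "application/pdf",
            "xml", "application/xml");

    @Override
    public Optional<String> identify(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension to look up
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(KNOWN.get(ext));
    }
}

/** Tries several implementations in order and returns the first answer. */
final class FirstMatchFormatIdentifier implements FormatIdentifier {
    private final List<FormatIdentifier> delegates;

    FirstMatchFormatIdentifier(List<FormatIdentifier> delegates) {
        this.delegates = List.copyOf(delegates);
    }

    @Override
    public Optional<String> identify(String fileName) {
        for (FormatIdentifier delegate : delegates) {
            Optional<String> result = delegate.identify(fileName);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}

final class FormatIdentifierDemo {
    public static void main(String[] args) {
        FormatIdentifier identifier =
                new FirstMatchFormatIdentifier(List.of(new ExtensionFormatIdentifier()));
        System.out.println(identifier.identify("scan.TIF"));
    }
}
```

Because callers depend only on the interface, a workflow step does not need to know whether a local library or a remote web service does the work.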

  7. jBPM • Chain together automated actions and user tasks to form a workflow or “Business Process” • Open source, flexible, extensible workflow management system • Bridges gap between users/developers by giving them a common language • Packaged as a J2EE application - can run on any J2EE application server such as JBoss.
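The idea of chaining automated actions and user tasks can be illustrated with a small stand-alone Java sketch. It deliberately does not use the jBPM API: the `WorkflowStep`, `AutomatedAction`, `UserTask` and `SimpleWorkflow` types are hypothetical, and the step names are invented for illustration. It shows only the core behaviour: automated steps run straight through, and the workflow pauses at a user task until the required input is present.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical workflow step; a stand-in for a workflow node, not the jBPM API. */
interface WorkflowStep {
    String name();

    /** Returns true if the step completed, false if it is still waiting for a user. */
    boolean run(Map<String, Object> context);
}

/** A step that always completes by running some code against the shared context. */
final class AutomatedAction implements WorkflowStep {
    private final String name;
    private final Consumer<Map<String, Object>> body;

    AutomatedAction(String name, Consumer<Map<String, Object>> body) {
        this.name = name;
        this.body = body;
    }

    @Override public String name() { return name; }

    @Override public boolean run(Map<String, Object> context) {
        body.accept(context);
        return true;
    }
}

/** A step that completes only once a user has supplied the required context entry. */
final class UserTask implements WorkflowStep {
    private final String name;
    private final String requiredKey;

    UserTask(String name, String requiredKey) {
        this.name = name;
        this.requiredKey = requiredKey;
    }

    @Override public String name() { return name; }

    @Override public boolean run(Map<String, Object> context) {
        return context.containsKey(requiredKey);
    }
}

/** Runs steps in order and remembers where it paused. */
final class SimpleWorkflow {
    private final List<WorkflowStep> steps;
    private final List<String> log = new ArrayList<>();
    private int next = 0;

    SimpleWorkflow(List<WorkflowStep> steps) {
        this.steps = List.copyOf(steps);
    }

    /** Continues from the current step; returns true once every step has completed. */
    boolean resume(Map<String, Object> context) {
        while (next < steps.size()) {
            WorkflowStep step = steps.get(next);
            if (!step.run(context)) {
                log.add("waiting:" + step.name());
                return false;
            }
            log.add("done:" + step.name());
            next++;
        }
        return true;
    }

    List<String> log() { return List.copyOf(log); }
}
```

A caller would invoke `resume` once to run up to the first user task, record the user's decision in the context map, and invoke `resume` again to finish. The log doubles as a very simple audit trail of which steps ran and where the workflow waited.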

  8. jBPM (design view)

  9. jBPM (XML view) A jPDL (XML) fragment defining (part of) a workflow

  10. jBPM (Nodes and Action Handlers)

  11. jBPM (execution view)

  12. Architecture (1)

  13. Architecture (2)

  14. Interfaces • Local (Java), SOAP and REST options • Coarse-grained, e.g.: • Create file characterisation • Identify file format • Migrate file format • Normalise file format • Check file integrity • …
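As an illustration of one coarse-grained action from this list, "Check file integrity" could be expressed as a single-method Java interface with a local implementation. This is a minimal sketch under assumptions: the slides do not say how integrity is checked, so the use of a SHA-256 digest and the names `IntegrityChecker` and `Sha256IntegrityChecker` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical coarse-grained interface: one call per preservation action. */
interface IntegrityChecker {
    /** Returns true if the file's digest matches the expected hexadecimal digest. */
    boolean checkIntegrity(Path file, String expectedHexDigest) throws IOException;
}

/** Local implementation using the JDK's SHA-256 digest (an assumed choice of algorithm). */
final class Sha256IntegrityChecker implements IntegrityChecker {

    /** Computes the lowercase hexadecimal SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    @Override
    public boolean checkIntegrity(Path file, String expectedHexDigest) throws IOException {
        // Reads the whole file into memory; acceptable for a sketch, not for very large files.
        return sha256Hex(Files.readAllBytes(file)).equalsIgnoreCase(expectedHexDigest);
    }
}

final class IntegrityCheckerDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("soapi-demo", ".txt");
        Files.write(file, "hello".getBytes(java.nio.charset.StandardCharsets.UTF_8));
        String recorded = Sha256IntegrityChecker.sha256Hex(Files.readAllBytes(file));
        System.out.println(new Sha256IntegrityChecker().checkIntegrity(file, recorded));
        Files.delete(file);
    }
}
```

Keeping the interface this coarse means a SOAP or REST implementation needs only one request per action, which suits the distributed architecture described in the requirements.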

  15. Service implementations • Configure use of particular implementations, e.g. • Format validation: JHOVE and others • Format identification: JHOVE, DROID, XENA • Format conversion: various • Metadata capture: PREMIS
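Selecting a particular implementation by configuration could be sketched as a small registry keyed by action and tool name. Everything here is hypothetical: the `ServiceRegistry` and `PreservationService` types, the property key `format-identification`, and the stub services, which do not call JHOVE or DROID and only mark where such a wrapper would be plugged in.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Supplier;

/** Hypothetical marker for any configured service implementation. */
interface PreservationService {
    String describe();
}

/** Stub implementation; a real one would wrap a tool or call a web service. */
final class StubService implements PreservationService {
    private final String description;

    StubService(String description) {
        this.description = description;
    }

    @Override public String describe() { return description; }
}

/** Maps (action, tool) pairs to factories and resolves them from configuration. */
final class ServiceRegistry {
    private final Map<String, Supplier<PreservationService>> factories = new HashMap<>();

    void register(String action, String tool, Supplier<PreservationService> factory) {
        factories.put(key(action, tool), factory);
    }

    /** Looks up which tool the configuration names for the action and creates it. */
    PreservationService resolve(String action, Properties config) {
        String tool = config.getProperty(action);
        if (tool == null) {
            throw new IllegalArgumentException("No implementation configured for " + action);
        }
        Supplier<PreservationService> factory = factories.get(key(action, tool));
        if (factory == null) {
            throw new IllegalArgumentException("Unknown implementation " + tool + " for " + action);
        }
        return factory.get();
    }

    private static String key(String action, String tool) {
        return action + "/" + tool;
    }
}

final class ServiceRegistryDemo {
    public static void main(String[] args) {
        ServiceRegistry registry = new ServiceRegistry();
        registry.register("format-identification", "JHOVE", () -> new StubService("JHOVE stub"));
        registry.register("format-identification", "DROID", () -> new StubService("DROID stub"));

        Properties config = new Properties();
        config.setProperty("format-identification", "DROID"); // assumed configuration key
        System.out.println(registry.resolve("format-identification", config).describe());
    }
}
```

Swapping tools then becomes a configuration change rather than a code change, which is the point of keeping implementations behind generic interfaces.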

  16. Workflow inputs & outputs

  17. Re-use example – SHERPA DP 2 Project Objectives: • Investigate methods for the provision of distributed preservation services and alternative methods of content-service provider interaction. • Provide archiving for varied software repositories and web resources • Perform curatorial activities for diverse types of content, ranging from simple objects to highly structured research data. Website: http://www.sherpadp.org.uk Contact: stephen.grace@kcl.ac.uk; gareth.knight@kcl.ac.uk

  18. Re-use example – SHERPA DP 2 Content providers supported: • Repositories: Fedora, CDS Invenio, DSpace, EPrints, DigiTool • Websites: large dynamic sites, static sites. Automated ingest methods: • OAI-PMH: METS, MPEG21-DIDL, MarcXML, Dublin Core and other metadata formats supported. • SWORD: an Atom application profile. Content types supported: • Wide variety of content types: image collections, static and dynamic web sites, datasets and other types of research data.

  19. Issues • Lack of suitable tools in some areas – expensive, outputs unreliable • Preserving content – what do we actually want to preserve? • Significant properties – soft concept, hard to quantify (InSPECT) • Problems with jBPM

  20. Further work • Make code more robust and fill in gaps • Integrate task screens with other identity management systems (e.g. Shibboleth federation) • Incorporate content model-specific processing • Incorporate disseminators • Integrate service registry for selecting services to invoke • Resource discovery metadata generation

  21. Questions Contact: mark.hedges@kcl.ac.uk
