EuropeanaLocal Knowledge Sharing Workshop Metadata Harvesting Julie Verleyen Scientific Coordinator, Europeana Office The Hague, 13 & 14 January 2009
Table of Contents • Harvesting in Europeana: workflow and requirements • Best-practices • Recommendations • Common issues • Tools / Software • Resources • Documentation
Harvesting in Europeana • Determine collections to be contributed • Questionnaire
Harvesting in Europeana • Obtain OAI-PMH repository parameters: • Absolute minimum (enough for fully implemented, tested and documented OAI repositories) • Server base URL • Very useful to have: • Mapping between described collection(s) and OAI-PMH set(s) • Prefix of metadata format to use preferably for Europeana (if not described in ListMetadataFormats response): ex: oai_dc, mods, tel, ese
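As an illustration of how these parameters fit together, here is a minimal sketch that turns a base URL, a metadata prefix and an optional set into an OAI-PMH request URL. The base URL `http://example.org/oai` and the set name `maps` are placeholders, and `build_request` is a hypothetical helper, not part of any Europeana tool.

```python
from urllib.parse import urlencode


def build_request(base_url, verb, metadata_prefix=None, set_spec=None):
    """Return an OAI-PMH request URL for a verb and its optional arguments."""
    params = {"verb": verb}
    if metadata_prefix:
        params["metadataPrefix"] = metadata_prefix
    if set_spec:
        params["set"] = set_spec
    # urlencode keeps the insertion order of the dictionary.
    return base_url + "?" + urlencode(params)


# Placeholder repository: base URL and set name are assumptions for the example.
url = build_request("http://example.org/oai", "ListRecords", "oai_dc", "maps")
print(url)
```

With only the base URL known, the same helper can issue `Identify`, `ListSets` and `ListMetadataFormats` requests to discover the remaining parameters.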
Harvesting in Europeana • Configuration of harvester • Full harvest with ListRecords request • Records collected in XML files ≤ 10MB • Harvest stored in SVN
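The 10 MB file limit could be enforced along the following lines. This is only a sketch of one possible batching rule, not the logic of the actual TEL/Europeana harvester: `chunk_records` is a hypothetical helper, it measures only the serialized records (not any wrapper element written around them), and a single record larger than the limit still gets a file of its own.

```python
def chunk_records(records, max_bytes=10 * 1024 * 1024):
    """Group serialized records into batches whose UTF-8 size stays within max_bytes."""
    batches, current, size = [], [], 0
    for record in records:
        record_size = len(record.encode("utf-8"))
        # Start a new batch when adding this record would exceed the limit.
        if current and size + record_size > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(record)
        size += record_size
    if current:
        batches.append(current)
    return batches


# Tiny limit used here only to make the behaviour visible.
batches = chunk_records(["aaaa", "bbbb", "cc"], max_bytes=8)
print(batches)
```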
Best-practices: implementation • Compliance with the OAI-PMH 2.0 protocol specification http://www.openarchives.org/OAI/openarchivesprotocol.html • Follow the OAI-PMH v2 implementation guidelines for repository implementers http://www.openarchives.org/OAI/2.0/guidelines-repository.htm • Full functional tests!
Best-practices: OAI validation OAI validation = Your OAI repository correctly implements the OAI-PMH! Correct response to all OAI-PMH requests: with arguments, under the various error conditions, and with every OAI response valid against its XML schema, ...
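One small part of such testing is checking that the repository reports error conditions in the way the protocol defines, through `error` elements with a `code` attribute. The sketch below extracts those codes from a response; the sample response is hand-written for illustration and `oai_errors` is a hypothetical helper. It does not replace schema validation or the official OAI validator.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hand-written sample of what a repository should return for an illegal verb.
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2009-01-13T10:00:00Z</responseDate>
  <request>http://example.org/oai</request>
  <error code="badVerb">Illegal OAI verb</error>
</OAI-PMH>"""


def oai_errors(xml_text):
    """Return (code, message) pairs for every error element in a response."""
    root = ET.fromstring(xml_text)
    return [(e.get("code"), (e.text or "").strip())
            for e in root.findall(OAI_NS + "error")]


errors = oai_errors(SAMPLE_ERROR)
print(errors)
```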
Recommended approach to OAI validation • Follow the Open Archives Initiative protocol testing • Validate your server using the validator supplied by the OAI: http://www.openarchives.org/data/registerasprovider.html • To validate without registering, tick the checkbox "only validate and do not register (you may then register later)."
http://www.openarchives.org/data/registerasprovider.html => bottom of the page
Issues and recommendations: sets • Set = "an optional construct for grouping items for the purpose of selective harvesting."
Number of obstacles related to sets: • Interpreting how a repository has organized sets and determining which sets to harvest • Issue: setName is not human-readable and/or no setDescription is provided. • Issue: Large number of sets to sort through. • Knowing when there are records that belong to no sets • Issue: Items that belong to no sets are included in the OAI repository. • Knowing when there are empty sets • Issue: Data provider exposes sets with no records.
Number of obstacles related to sets: • Understanding relationships between sets • Issue: Relationships between sets are not expressed. • There is a mechanism to express relationships between hierarchical sets • But no mechanism to express relationships between overlapping sets! • The only way to know: harvest the identifiers or the records, whose headers list the sets each record belongs to
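A harvester can recover set membership from the `setSpec` elements in each record header. The sketch below does this for a hand-written `ListIdentifiers` response; the identifiers and set names are invented for the example and `set_membership` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample: one record in two sets, one in one set, one in none.
SAMPLE_HEADERS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2009-01-10</datestamp>
      <setSpec>maps</setSpec>
      <setSpec>maps:antique</setSpec>
    </header>
    <header>
      <identifier>oai:example.org:2</identifier>
      <datestamp>2009-01-11</datestamp>
      <setSpec>photos</setSpec>
    </header>
    <header>
      <identifier>oai:example.org:3</identifier>
      <datestamp>2009-01-12</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def set_membership(xml_text):
    """Map each record identifier to the list of setSpecs found in its header."""
    root = ET.fromstring(xml_text)
    membership = {}
    for header in root.iterfind(".//oai:header", NS):
        identifier = header.findtext("oai:identifier", namespaces=NS)
        membership[identifier] = [s.text for s in header.findall("oai:setSpec", NS)]
    return membership


membership = set_membership(SAMPLE_HEADERS)
# Records that would be missed by a harvest driven purely by sets.
without_sets = [i for i, sets in membership.items() if not sets]
print(membership)
print(without_sets)
```

Comparing the lists across records shows which sets overlap, and the records with an empty list are the ones that belong to no set.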
Number of obstacles related to sets: • Knowing how many records there are within a set before harvesting • Issue: The number of records within a set is not stated, although it can be expressed via the completeListSize attribute of a resumptionToken or within the set description. • Knowing when a set structure has been substantially changed • Issue: Changes in a set structure have not been communicated
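When a repository does supply `completeListSize`, a harvester can read it from the first response of a list request. A minimal sketch, using a hand-written response fragment; `complete_list_size` is a hypothetical helper, and the attribute is optional in the protocol, so the function returns `None` when it is absent.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample of the end of a ListIdentifiers response.
SAMPLE_TOKEN = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <resumptionToken completeListSize="1250" cursor="0">abc123</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""


def complete_list_size(xml_text):
    """Return the announced size of the complete list, or None if not provided."""
    token = ET.fromstring(xml_text).find(".//oai:resumptionToken", NS)
    if token is None:
        return None
    size = token.get("completeListSize")
    return int(size) if size is not None else None


size = complete_list_size(SAMPLE_TOKEN)
print(size)
```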
Sets: recommendations • No single best practice for the organization of sets. • Realistically: data providers organize sets in the way that best meets the needs of their primary service provider and can easily be done within their own internal workflows. • Useful to organize the metadata items into sets according to the collections of resources they represent. • The concept of a collection varies and is not completely clear in Europeana. • Useful for the harvester to understand what data providers mean by a collection
Basic requirements • Repository implementation following OAI-PMH v2.0 + tested • Inform the person responsible for harvesting at Europeana of any repository changes / maintenance • No regular harvesting schedule determined yet • "SLA" between data providers and harvesters
Common issues • Unavailability / unreliability of repository server • Implementation of OAI-PMH v2 incomplete • resumptionToken not supported • Only ListIdentifiers • XML syntax errors • Character encoding errors • Short lifetime of resumptionToken
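Several of these issues surface in the paging loop of a full harvest, which depends on the repository returning a usable resumptionToken with each incomplete list. The sketch below shows that loop in its simplest form. It is not the TEL/Europeana harvester's code: `harvest_pages` is a hypothetical helper, and the HTTP call is passed in as a `fetch` function (here replaced by canned responses) so the example runs without a network.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_pages(fetch, metadata_prefix, set_spec=None):
    """Yield each ListRecords response, following resumptionTokens to the end."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        xml_text = fetch(params)
        yield xml_text
        token = ET.fromstring(xml_text).find(".//" + OAI + "resumptionToken")
        # A missing or empty token means the list is complete.
        if token is None or not (token.text or "").strip():
            break
        # resumptionToken is an exclusive argument: send it with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


# Canned responses standing in for a repository (assumption for the example).
WRAP = '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>%s</ListRecords></OAI-PMH>'
PAGES = {
    None: WRAP % "<resumptionToken>page2</resumptionToken>",
    "page2": WRAP % "<resumptionToken/>",
}
calls = []


def fake_fetch(params):
    calls.append(dict(params))
    return PAGES[params.get("resumptionToken")]


pages = list(harvest_pages(fake_fetch, "oai_dc"))
print(len(pages), calls)
```

A real `fetch` would also need to handle server unavailability and expired tokens, for example by retrying or restarting the harvest.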
Tools / Software TEL/Europeana OAI-PMH Harvester – Offline documentation • Harvester • Java standalone application with GUI • Multiple harvesting jobs • Resuming unfinished jobs • Logging • No scheduling, No configuration interface
Tools / Software REPOX - http://repox.ist.utl.pt/ • Repository + Harvester • Java standalone application with web GUI • Multiple harvesting jobs, Scheduler • Statistics • Management of XML metadata repository • Versioning and identification of records • Different metadata format • User interface to create metadata crosswalks: Schema mapper
Tools / Software OAIcat from OCLC - http://www.oclc.org/research/software/oai/cat.htm • Framework conforming to the OAI-PMH v2.0 • Repository + Harvesting • Java web application • Scheduling, logging • Limited scalability (~2M records)
Tools / Software (TELplus D2.1) Other implementations in different languages to plug into a Library Management System: • PHP: OAIbiblio data provider implementation of the OAI-PMH, version 2.0. This toolkit can be easily customized to communicate with an already existing, multi-table MySQL database • PERL: Celestial OAI aggregator/cache application that imports OAI metadata from version 1.0, 1.1 and 2.0 OAI-compliant repositories, and re-exposes that metadata through either an aggregated or per-repository OAI-compliant 2.0 interface. Celestial requires oai-perl v2, MySQL, Perl 5.6.x and a CGI-capable web server • Ruby: ruby-oai includes a client library, a server/provider library and an interactive harvesting shell • Python: pyoai package enables high-level access to an OAI-PMH Metadata Repository and also implements a framework for quickly creating OAI-PMH compliant servers
Tools / Software • ESE XML validation schemas developed by partners
Resources • The Open Archives Initiative Protocol for Metadata Harvesting v2.0 http://www.openarchives.org/OAI/openarchivesprotocol.html • TELplus D2.1, “OAI-PMH implementation and tools guidelines”, 21 pages • Protocol overview and description of main concepts • OAI-PMH implementation in libraries • References
Resources • Wiki "Best Practices for OAI Data Provider Implementations and Shareable Metadata": Excellent source of guidelines, tutorials, recommendations, implementation software and tools, references, etc. http://webservices.itcs.umich.edu/mediawiki/oaibp/index.php/Main_Page
Documentation in Europeana context • Requirements: • Europeana OAI-PMH Harvesting • Europeana OAI-PMH Repositories • ESE XML validation schema • Europeana OAI-PMH data providers registry & forum/mailing list • Local systems • OAI-PMH repository solution • Contact
Thank you. Questions? Remarks? Julie.Verleyen@kb.nl