
The OAI Protocol for Metadata Harvesting. Andy Powell ([email protected]), UKOLN, University of Bath. IVOA Registry Meeting, London, March 2003. Contents: a brief history of OAI; 10 technical things you should know about the OAI-PMH.


The OAI Protocol for Metadata Harvesting

Andy Powell

[email protected]

UKOLN, University of Bath

IVOA Registry Meeting, London

March 2003


Contents

  • a brief history of OAI

  • 10 technical things you should know about the OAI-PMH

OAI roots

  • the roots of OAI lie in the development of eprint archives…

    • arXiv, CogPrints, NACA (NASA), RePEc, NDLTD, NCSTRL

  • each offered Web interface for deposit of articles and for end-user searches

  • difficult for end-users to work across archives without having to learn multiple different interfaces

  • recognised need for single search interface to all archives

    • Universal Pre-print Service (UPS)

Searching vs. harvesting

  • two possible approaches to building a single search interface to multiple eprint archives…

    • cross-searching multiple archives based on protocol like Z39.50

    • harvesting metadata into one or more ‘central’ services – bulk move data to the user-interface

  • US digital library experience in this area indicated that cross-searching not preferred approach

    • distributed searching of N nodes viable, but only for small values of N

Searching vs. harvesting

[Figure: diagram of the two approaches, each labelled "search service"]

Harvesting requirements

  • in order that harvesting approach can work there need to be agreements about…

    • transport protocols – HTTP vs. FTP vs. …

    • metadata formats – DC vs. MARC vs. …

    • quality assurance – mandatory elements, mechanisms for naming of people, subjects, etc., handling duplicated records, best-practice

    • intellectual property and usage rights – who can do what with the records

  • work in this area resulted in the “Santa Fe Convention”

Development of OAI-PMH

  • two-year metamorphosis through various names

    • Santa Fe Convention, OAI-PMH versions 1.0, 1.1…

    • OAI Protocol for Metadata Harvesting 2.0

  • development steered by international technical committee

  • inter-version stability helped developer confidence

  • move from focus on eprints to more generic protocol

    • move from OAI-specific metadata schema to mandatory support for DC

Bluffer’s guide to OAI

  • OAI-PMH is a low-cost mechanism for harvesting metadata records

    • from ‘data providers’ to ‘service providers’

  • allows ‘service provider’ to say ‘give me some or all of your metadata records’

    • where ‘some’ is based on date-stamps, sets, metadata formats

  • not limited to repositories of eprints

    • images, museum artefacts, learning objects, …

  • based on HTTP and XML

    • simple, Web-friendly, autonomous

    • fast, flexible deployment
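
As a minimal sketch of the "give me some or all of your metadata records" idea, the snippet below builds a selective ListRecords request URL from a date-stamp, a set and a metadata format. The base URL `http://repository.example.org/oai`, the set name `physics` and the helper name `build_list_records_url` are illustrative assumptions, not taken from the slides.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, until_date=None, set_spec=None):
    """Build a ListRecords request URL for a selective harvest.

    base_url is a hypothetical repository endpoint; the optional
    arguments narrow 'all records' down to 'some records'.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # records changed on or after this date
    if until_date:
        params["until"] = until_date    # records changed on or before this date
    if set_spec:
        params["set"] = set_spec        # restrict to one set of the repository
    return base_url + "?" + urlencode(params)


# Hypothetical example: everything in the 'physics' set changed since 1 Jan 2003.
url = build_list_records_url("http://repository.example.org/oai",
                             from_date="2003-01-01", set_spec="physics")
print(url)
```

Because the request is an ordinary HTTP GET with query parameters, a harvester needs nothing more than an HTTP client and an XML parser.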

Bluffer’s guide to OAI

  • OAI-PMH is not a search protocol

    • but its use can underpin search services based on Z39.50 or SRW or SOAP or…

  • OAI-PMH carries only metadata

    • content (e.g. full-text or image) made available separately – typically at URL in metadata

  • mandates simple DC as record format

    • but extensible to any XML format – IMS, ONIX, MARC, METS, etc.

  • extensible framework for metadata about

    • repository, resources, ‘items’, sets

    • can include rights metadata

Bluffer’s guide to OAI

  • metadata and ‘content’ often made freely available – but not a requirement

    • OAI-PMH can be used between closed groups

    • or, can make metadata available but restrict access to content in some way

  • underlying HTTP protocol provides

    • access control – e.g. HTTP BASIC

    • compression mechanisms (for improving performance of harvesters)

    • could, in theory, also provide encryption if required
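
The sketch below illustrates how a harvester might lean on plain HTTP for the first two points: it asks for a gzip-compressed response and sends HTTP BASIC credentials. It only constructs the request and decodes a body, so nothing is sent over the network. The URL, the user name and password, and the helper names `build_harvest_request` and `decode_body` are assumptions made for illustration.

```python
import base64
import gzip
import urllib.request


def build_harvest_request(url, username=None, password=None):
    """Prepare (but do not send) an HTTP request for an OAI-PMH harvest."""
    request = urllib.request.Request(url)
    # Ask the repository to compress the response if it is able to.
    request.add_header("Accept-Encoding", "gzip")
    if username is not None:
        # HTTP BASIC: base64 of "user:password" in the Authorization header.
        credentials = f"{username}:{password}".encode("utf-8")
        token = base64.b64encode(credentials).decode("ascii")
        request.add_header("Authorization", "Basic " + token)
    return request


def decode_body(body, content_encoding):
    """Undo gzip compression if the response declared it."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body


# Hypothetical endpoint and credentials, for illustration only.
req = build_harvest_request("http://repository.example.org/oai?verb=Identify",
                            "harvester", "secret")
```

A real harvester would pass `req` to `urllib.request.urlopen` and call `decode_body` with the response's `Content-Encoding` header.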

Resources, items and records

[Figure labels: all available metadata about David; item = identifier; Dublin Core]

Protocol requests

  • six different request types

    • Identify

    • ListMetadataFormats

    • ListSets

    • ListIdentifiers

    • ListRecords

    • GetRecord

  • harvester need not use all types

  • repository must implement all types

  • required and optional arguments

    • on request types
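
To make "required and optional arguments" concrete, here is a rough sketch of how a repository might check an incoming request against the six request types. The argument table reflects the OAI-PMH 2.0 specification as best understood here; `resumptionToken` handling is deliberately left out to keep the sketch short, and the names `VERB_ARGUMENTS` and `check_request` are hypothetical.

```python
# verb -> (required arguments, optional arguments)
# Simplification: the exclusive resumptionToken argument is not modelled.
VERB_ARGUMENTS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}


def check_request(verb, arguments):
    """Return 'ok', or the OAI-PMH error code a repository would report."""
    if verb not in VERB_ARGUMENTS:
        return "badVerb"
    required, optional = VERB_ARGUMENTS[verb]
    supplied = set(arguments)
    missing = required - supplied
    unexpected = supplied - required - optional
    if missing or unexpected:
        return "badArgument"
    return "ok"


print(check_request("ListRecords", {"metadataPrefix": "oai_dc", "from": "2003-01-01"}))
print(check_request("GetRecord", {"identifier": "oai:example.org:1234"}))
```

The second call is missing `metadataPrefix`, so it is rejected; the identifier shown is an invented example.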

Record structure

  • metadata about a resource in a particular XML format

    • header (mandatory)

      • identifier (1)

      • datestamp (1)

      • setSpec elements (*)

      • status attribute for deleted item (?)

    • metadata (mandatory)

      • XML encoded metadata within root tag which provides namespace and schema

      • repositories must support Dublin Core

    • about (optional)

      • rights statements

      • provenance statements
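
The record structure above can be sketched by parsing a small hand-written record with the Python standard library. The sample record, its identifier `oai:example.org:1234` and the function name `parse_record` are invented for illustration; only the element names and the OAI-PMH 2.0 namespace come from the protocol.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample record; identifier, date and sets are made up.
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:1234</identifier>
    <datestamp>2003-03-01</datestamp>
    <setSpec>physics</setSpec>
    <setSpec>physics:astro</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example eprint</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
"""


def parse_record(record_xml):
    """Pull the header fields out of one OAI-PMH record."""
    record = ET.fromstring(record_xml.strip())
    header = record.find("oai:header", OAI_NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),  # exactly one
        "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),    # exactly one
        "sets": [s.text for s in header.findall("oai:setSpec", OAI_NS)],     # zero or more
        "deleted": header.get("status") == "deleted",                        # optional attribute
        "has_metadata": record.find("oai:metadata", OAI_NS) is not None,
        "has_about": record.find("oai:about", OAI_NS) is not None,           # optional part
    }


info = parse_record(SAMPLE_RECORD)
print(info)
```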

Dublin Core

  • OAI-PMH mandates use of simple DC as lowest common denominator

  • agreed XML schema – ‘oai_dc’

    • simple DC – 15 metadata properties

    • all DC properties optional and repeatable
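
As an illustration of "optional and repeatable", the sketch below serialises a dictionary of simple DC properties into an `oai_dc` container. Any of the 15 properties may be absent, and any may appear several times. The function name `to_oai_dc` and the sample values are assumptions, and the `xsi:schemaLocation` attribute a real repository would emit is omitted for brevity.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# The 15 properties of simple Dublin Core.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_oai_dc(properties):
    """Serialise {property: [values, ...]} as an oai_dc XML string."""
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, values in properties.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a simple DC property: {name}")
        for value in values:  # repeatable: one element per value
            ET.SubElement(root, f"{{{DC}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Made-up metadata: two creators, no date, no rights statement.
xml_text = to_oai_dc({
    "title": ["An example eprint"],
    "creator": ["Author, A.", "Author, B."],
})
print(xml_text)
```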

OAI demonstration

  • repository explorer demo

OAI and Google

[Figure: DP9 gateway. OAI gateway makes harvested … available to …]


Implementing OAI

  • OAI protocol is relatively simple

  • implementation and deployment tends to be very fast

  • lots of available toolkits

    • Java, Perl, PHP, etc.

  • complete tools also available

    • e.g. tools that sit in front of existing databases

  • see ‘tools’ area on the OAI Web site…

Creative Commons

  • CC is “devoted to expanding the range of creative work available for others to build upon and share”

  • provides ‘standard’ licences for content

    • attribution

    • noncommercial

    • no derivative works

    • share alike

  • mechanisms for indicating licence on Web pages

  • need similar mechanism in OAI