
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

Alon Kadury


Content

  • Reminders

  • History

  • OAI overview

  • Technical introduction

  • Conclusions

  • Demonstrations

  • Resources


Definition- A Digital Library is a:

1. Collection of digital objects

2. Collection of knowledge structures

3. Collection of library services

4. Domain/Focus/Topic

5. Quality Control

6. Preservation/Persistence


Types of DLs

  • Single Digital Library (SDL)

    • also Stand-alone, Self-contained

  • Federated Digital Library (FDL)

    • also confederated, distributed

  • Harvested Digital Library (HDL)


Single Digital Library (SDL)

  • A regular DL

  • Self-contained material:

    • purchased

    • scanned/digitized

  • Usually localized


Federated Digital Library (FDL)

  • Contains many autonomous libraries

  • Usually heterogeneous repositories

  • Connected via network

  • Forms a virtual distributed library

  • Transparent user interface

  • The major problem is interoperability.


Harvested Digital Library (HDL)

  • Does not contain data, just metadata

  • Objects harvested into summaries

  • Regular DL characteristics:

    • fine granularity

    • rich library services

    • high quality control

    • annotated


History

  • As the Web evolved, the number of Web sites and search engines increased. A similar process happened with e-prints and digital libraries.

  • The growth in the number of DLs led to the development of the OAI-PMH protocol, as we're about to see.


History - Problems

The development of e-prints and digital libraries led to several problems, such as:

  • Many user interfaces - Each DL offered a Web interface for deposit of articles and for end-user searches. The result: it was difficult for end users to work across archives without having to learn multiple different interfaces.


History - Problems

  • Different query syntaxes - The result: it was difficult for the user to keep track of the search syntax of each SDL, and difficult to create an FDL that could query many SDLs.

  • Many metadata formats - SDL metadata could be kept in any format the SDL wanted. The result: hard times for the FDLs, which had to know the formats of each SDL they were harvesting.


History – Possible solutions

  • The problems led researchers to recognise the need for a single search interface to all archives - the Universal Pre-print Service (UPS).

  • Two possible approaches to building the UPS were considered:


History – Solution 1

Cross-searching multiple archives: In this approach a client sends requests to several servers and then combines the data. The client and server work with a known and agreed protocol (for example Z39.50). However, studies showed that this is not the preferred approach for distributed searching across large numbers of nodes, mainly due to problems such as knowing which collections to search and performance issues.


History – Solution 2

Harvesting metadata into a ‘Central Server’: This approach harvests the metadata and stores it in a central server, on which searches are made.

  • The idea was demonstrated in a convention held at Santa Fe NM, October 21-22, 1999.

  • UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/

    More reading: http://www.dlib.org/dlib/february00/02contents.html


OAI overview- definitions

Let's start with a few definitions:

  • Interoperability

  • Open Archives Initiative (OAI)

  • Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)


OAI overview- definitions

  • What is Interoperability?

  • Interoperability refers to the ability of two or more systems to interact with one another and exchange data according to a prescribed method in order to achieve predictable results.


OAI overview- definitions

  • In order to exchange data we need to agree on things like:

    • requests format

    • results format

    • transport protocols (HTTP vs FTP vs….)

    • Metadata formats (DC vs MARC vs…)

    • Usage rights (who can do what with the records)

  • We need someone to organize it and “set the rules”.


OAI overview- definitions

  • Who will organize it?

  • Open Archives Initiative - “The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.” (http://www.openarchives.org/organization/index.html)


OAI overview- definitions

  • What will the interoperability standards be called?

    Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)


OAI overview- Key players

  • When talking about OAI-PMH we see three main players:

    • Data Providers

    • Service Providers

    • The protocol (OAI-PMH)


OAI overview- Data Provider

  • Data Provider:

    • Handles deposit/publishing of resources in archive.

    • Exposes metadata about resources in archive (using the OAI-PMH protocol/interface).

    • Data Providers may support any metadata format, but must support the metadata format Dublin Core (DC).

    • Offer free access to the archives (at least the metadata).

    • A network-accessible server that is able to process OAI-PMH requests correctly is often called a Repository.


OAI overview - Service Provider

  • Service Provider:

    • Harvests metadata from data providers and uses it to offer a single user interface across all harvested metadata.

    • May enrich metadata.

    • Offers (value-added) services on the basis of the metadata.

    • A client application issuing OAI-PMH requests is often referred to as a Harvester.


OAI overview - Providers

[Diagram: a Service Provider with a native end-user interface and a native harvesting interface, connected to Data Providers that each have an input interface and a native harvesting interface. A native end-user interface on the Data Provider is optional (e.g., RePEc).]

OAI overview - Providers

[Diagram: Data providers (with an input interface) connected to Service providers by harvesting based on OAI-PMH.]


OAI overview - Model

[Diagram: layered model, Layer 4 down to Layer 1. Components from top to bottom: Web interfaces; Service Provider - FDL/HDL; OAI-PMH; three SDLs; Web.]


Technical introduction

Since the days of the Santa Fe convention the protocol has gone through several versions.

Version 2.0 is the latest and is considered stable. The technical introduction refers to this version.


Tech’- protocol versions

             Santa Fe convention    OAI-PMH v.1.0/1.1          OAI-PMH v.2.0
nature       experimental           experimental               stable
verbs        Dienst                 OAI-PMH                    OAI-PMH
requests     HTTP GET/POST          HTTP GET/POST              HTTP GET/POST
responses    XML                    XML                        XML
transport    HTTP                   HTTP                       HTTP
metadata     OAMS                   unqualified Dublin Core    unqualified Dublin Core
about        eprints                document like objects      resources
model        metadata harvesting    metadata harvesting        metadata harvesting


Tech’- request & response

The requests of the protocol are HTTP based.

The response contents of the protocol are XML based.

Question: why?

Answer:

  • A simple protocol based on existing standards allows rapid development and effortless implementation.

  • Systems can be deployed in a variety of configurations.

  • Low-barrier interoperability specification.

  • Internet/firewall friendly.


Tech’- request & response

[Diagram: the Harvester (Service Provider) sends requests (based on HTTP) to the Repository (Data Provider), which returns metadata (encoded in XML). Other labels in the figure: Metadata, Metadata (Documents), „Service”.]

There are six request types which are called verbs.

The request type and additional information are passed as parameters using HTTP POST or GET methods.
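As a rough illustration of how the verb and its arguments travel as HTTP GET parameters, here is a minimal Python sketch. The function name `build_request_url` and the base URL are assumptions made for this example (the base URL follows the style of the slide examples); they are not part of the protocol.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs named in this presentation.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}


def build_request_url(base_url, verb, **params):
    """Return a GET URL carrying the verb and its arguments as query parameters."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # Put the verb first, then every argument that was actually supplied.
    query = {"verb": verb}
    query.update({k: v for k, v in params.items() if v is not None})
    return f"{base_url}?{urlencode(query)}"


if __name__ == "__main__":
    # Hypothetical base URL, in the style of the slide examples.
    print(build_request_url("http://archive.org/oai-script", "ListRecords",
                            metadataPrefix="oai_dc", set="biology"))
```

Because `from` is a reserved word in Python, that argument has to be passed as `**{"from": "2002-12-01"}` rather than as a plain keyword.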


Let's see a demonstration of how we can create an FDL, and then we will look at what happens behind the scenes.

Demo


Tech’– more definition

[Diagram: a Harvester (Service Provider) sends requests to the Repositories of several Data Providers (e-prints, Images, OPAC, Museum, Archive) and receives responses.]

Requests: Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord

Responses: General information, Metadata formats, Set structure, Record identifier, Metadata


Tech’–Request Types

  • Six different request types

    • Identify

    • ListMetadataFormats

    • ListSets

    • ListIdentifiers

    • ListRecords

    • GetRecord

  • Harvester does not have to use all types.

  • Repository must implement all request types fully (all required and optional arguments for each of the requests).


Tech’- Request Type: Identify

function: retrieve description and general information about an archive.

example: archive.org/oai-script?verb=Identify

parameters: none

errors / exceptions: badArgument, e.g. archive.org/oai-script?verb=Identify&set=biology
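In OAI-PMH 2.0, conditions such as badArgument are reported inside the XML response as `error` elements carrying a `code` attribute. The following Python sketch shows how a harvester might detect them. The sample response text and the helper name `find_errors` are invented for illustration.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Made-up response, shaped like the reply to ...?verb=Identify&set=biology
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-12-01T12:00:00Z</responseDate>
  <request>http://archive.org/oai-script</request>
  <error code="badArgument">Identify takes no arguments</error>
</OAI-PMH>"""


def find_errors(xml_text):
    """Return a list of (code, message) pairs for each error element in a response."""
    root = ET.fromstring(xml_text)
    return [(e.get("code"), (e.text or "").strip())
            for e in root.findall(OAI + "error")]


if __name__ == "__main__":
    for code, message in find_errors(SAMPLE_ERROR):
        print(code, "-", message)
```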


Tech’- Request Type: Identify

Response format


Tech’- Request Type: Identify

Response in XML format http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=Identify
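To give an idea of what a harvester does with an Identify response, here is a small Python sketch that reads the general information out of the XML. The sample response is invented (a real repository may return more fields, and repeated fields such as several admin addresses would need a list), and `parse_identify` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample; field names follow the OAI-PMH 2.0 Identify response.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-12-01T12:00:00Z</responseDate>
  <request verb="Identify">http://archive.org/oai-script</request>
  <Identify>
    <repositoryName>Example Archive</repositoryName>
    <baseURL>http://archive.org/oai-script</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>admin@example.org</adminEmail>
    <earliestDatestamp>1999-01-01</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Return the children of the Identify element as a {name: text} dict."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    if identify is None:
        raise ValueError("no Identify element in response")
    # Tags look like "{namespace}name"; keep only the local name.
    return {child.tag.split("}", 1)[1]: child.text for child in identify}


if __name__ == "__main__":
    info = parse_identify(SAMPLE_IDENTIFY)
    print(info["repositoryName"], info["protocolVersion"])
```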


Tech’- Request Type: ListMetadataFormats

function: retrieve available metadata formats from archive. Remember that each archive must implement at least DC.

example: archive.org/oai-script?verb=ListMetadataFormats

parameters: identifier (optional)

errors / exceptions: badArgument; idDoesNotExist, e.g. archive.org/oai-script?verb=ListMetadataFormats&identifier=really-wrong-identifier; noMetadataFormats


Tech’- Request Type: ListMetadataFormats

Response in XML format http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=ListMetadataFormats


Tech’- Request Type: ListSets

  • Q: What are Sets? A: Sets are a logical partitioning of a repository.

  • Q: Why use sets? A: Sets are intended to enable selective harvesting.

  • Data providers don’t have to define sets.

  • Sets are not strictly hierarchical.


Tech’- Request Type: ListSets

function: retrieve set structure of a repository

example: archive.org/oai-script?verb=ListSets

parameters: resumptionToken (exclusive)

errors / exceptions: badArgument; badResumptionToken, e.g. archive.org/oai-script?verb=ListSets&resumptionToken=any-wrong-token; noSetHierarchy


Tech’- Request Type: ListSets

Response in XML format http://theses.lub.lu.se/oai-service/xerxes/?verb=ListSets


Tech’- Request Type: ListIdentifiers

function: abbreviated form of ListRecords, retrieving only headers

example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01

parameters: from (optional), until (optional), metadataPrefix (required), set (optional), resumptionToken (exclusive)

errors / exceptions: badArgument, e.g. …&from=2002-12-01-13:45:00; badResumptionToken; cannotDisseminateFormat; noRecordsMatch; noSetHierarchy
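The badArgument example above uses a malformed `from` value. OAI-PMH 2.0 datestamps are UTC values written either as YYYY-MM-DD or as YYYY-MM-DDThh:mm:ssZ, so a repository could check the `from` and `until` arguments roughly as in this Python sketch. `is_valid_datestamp` is a name chosen for this example, and a real repository would also check the value against the granularity it advertises.

```python
import re
from datetime import datetime

# Shape check plus calendar check for the two OAI-PMH 2.0 granularities.
_FORMATS = (
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "%Y-%m-%d"),
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"), "%Y-%m-%dT%H:%M:%SZ"),
)


def is_valid_datestamp(value):
    """Return True if value is a well-formed day or seconds granularity datestamp."""
    for shape, fmt in _FORMATS:
        if shape.fullmatch(value):
            try:
                datetime.strptime(value, fmt)  # rejects e.g. month 13
                return True
            except ValueError:
                return False
    return False


if __name__ == "__main__":
    for candidate in ("2002-12-01", "2002-12-01T13:45:00Z", "2002-12-01-13:45:00"):
        verdict = "ok" if is_valid_datestamp(candidate) else "badArgument"
        print(candidate, "->", verdict)
```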


Tech’- Request Type: ListIdentifiers

Response in XML format http://theses.lub.lu.se/oai-service/xerxes/?verb=ListIdentifiers&metadataPrefix=oai_dc


Tech’- Request Type: ListRecords

function: harvest records from a repository

example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology

parameters: from (optional), until (optional), metadataPrefix (required), set (optional), resumptionToken (exclusive)

errors / exceptions: badArgument; badResumptionToken; cannotDisseminateFormat; noRecordsMatch; noSetHierarchy


Tech’- Request Type: GetRecord

function: retrieve individual metadata record from a repository

example: archive.org/oai-script?verb=GetRecord&identifier=oai:HUBerlin.de:3000218&metadataPrefix=oai_dc

parameters: identifier (required), metadataPrefix (required)

errors / exceptions: badArgument; cannotDisseminateFormat; idDoesNotExist
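The parameter lists on the six request-type slides can be collected into one table and checked mechanically. The Python sketch below is only one possible way a repository might do this. The table layout and the function name `check_arguments` are assumptions for illustration, and the `badVerb` code comes from the OAI-PMH 2.0 specification rather than from these slides.

```python
# verb -> (required arguments, optional arguments, accepts resumptionToken)
ARGUMENT_RULES = {
    "Identify": (set(), set(), False),
    "ListMetadataFormats": (set(), {"identifier"}, False),
    "ListSets": (set(), set(), True),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}, True),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}, True),
    "GetRecord": ({"identifier", "metadataPrefix"}, set(), False),
}


def check_arguments(verb, args):
    """Return None if the arguments are acceptable, else an OAI-PMH error code."""
    if verb not in ARGUMENT_RULES:
        return "badVerb"
    required, optional, accepts_token = ARGUMENT_RULES[verb]
    names = set(args)
    if "resumptionToken" in names:
        # "Exclusive": the token must be the only argument besides the verb.
        if accepts_token and names == {"resumptionToken"}:
            return None
        return "badArgument"
    if not required <= names:           # a required argument is missing
        return "badArgument"
    if names - required - optional:     # an unknown argument was supplied
        return "badArgument"
    return None


if __name__ == "__main__":
    print(check_arguments("Identify", {"set": "biology"}))
    print(check_arguments("ListRecords", {"metadataPrefix": "oai_dc"}))
```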


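Tech’- Records, items & DC, or setting the record straight

[Diagram: a resource is represented by an item (item = identifier) that holds all available metadata about the resource (in the figure, "David"). The item is disseminated as records, one per metadata format: Dublin Core metadata, MARC metadata, SPECTRUM metadata.]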


Tech’- Records, items & DC

A record consists of:

  • Header (mandatory)

    • identifier (1)

    • datestamp (1)

    • setSpec elements (*)

    • status attribute for deleted item (?)

  • Metadata (mandatory)

    • XML encoded metadata with root tag, namespace

    • repositories must support Dublin Core

  • About (optional)

    • rights statements

    • provenance statements
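The record structure listed above can be made concrete with a short Python sketch that pulls the header fields and the Dublin Core metadata out of a single record. The sample record is invented (the identifier is borrowed from the GetRecord slide; title and creator are placeholders), and `parse_record` is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
DC_PREFIX = "{http://purl.org/dc/elements/1.1/}"

# Invented sample record in the oai_dc format.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:HUBerlin.de:3000218</identifier>
    <datestamp>2002-12-01</datestamp>
    <setSpec>biology</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example title</dc:title>
      <dc:creator>Example, Author</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_record(xml_text):
    """Split one record into its header fields and its Dublin Core elements."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    dc = {}
    metadata = record.find("oai:metadata", NS)
    if metadata is not None:
        for el in metadata.iter():
            if el.tag.startswith(DC_PREFIX):
                # DC elements may repeat, so collect values in lists.
                dc.setdefault(el.tag[len(DC_PREFIX):], []).append(el.text)
    return {
        "identifier": header.findtext("oai:identifier", None, NS),
        "datestamp": header.findtext("oai:datestamp", None, NS),
        "sets": [s.text for s in header.findall("oai:setSpec", NS)],
        "deleted": header.get("status") == "deleted",
        "dc": dc,
    }


if __name__ == "__main__":
    print(parse_record(SAMPLE_RECORD))
```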


Tech’- Records, items & DC

  • OAI-PMH supports dissemination of multiple metadata formats from a repository.

  • Properties of metadata formats:

    • id string to specify the format (metadataPrefix)

    • metadata schema URL (XML schema to test validity)

    • XML namespace URI (global identifier for metadata format)

  • Repositories must be able to disseminate unqualified DC.

  • Arbitrary metadata formats can be defined and transported via the OAI-PMH.

  • Returned metadata must comply with XML namespace specification.
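The three properties of a metadata format appear in a ListMetadataFormats response as `metadataPrefix`, `schema` and `metadataNamespace`. As a rough sketch, a harvester could read them like this in Python. The sample response is hand-written for illustration, and the helper names are assumptions.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample listing only the mandatory oai_dc format.
SAMPLE_FORMATS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

FIELDS = ("metadataPrefix", "schema", "metadataNamespace")


def parse_metadata_formats(xml_text):
    """Return one dict per metadataFormat element with its three properties."""
    root = ET.fromstring(xml_text)
    return [
        {name: fmt.findtext(f"oai:{name}", None, NS) for name in FIELDS}
        for fmt in root.iterfind(".//oai:metadataFormat", NS)
    ]


def supports_unqualified_dc(formats):
    """True if the repository lists the oai_dc prefix among its formats."""
    return any(f["metadataPrefix"] == "oai_dc" for f in formats)


if __name__ == "__main__":
    formats = parse_metadata_formats(SAMPLE_FORMATS)
    print(formats[0]["schema"], supports_unqualified_dc(formats))
```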


Tech’- Records, items & DC

As mentioned before the minimum standard is unqualified Dublin Core (http://dublincore.org/).

  • Dublin Core Metadata Element Set contains 15 elements.

  • All elements are optional.

  • All elements may be repeated.

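    The Dublin Core Metadata Element Set: title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, rights.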


Tech’- Records, items & DC

Response in XML format http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=GetRecord&identifier=oai:CiteSeerPSU:1&metadataPrefix=oai_dc


Tech’- Flow control

  • Some of the requests can generate a very long response (for example, think about asking CiteSeer or the Library of Congress to list ALL their records using the ListRecords verb).

  • To avoid long responses that would overload the server, a flow control mechanism was added to the protocol.

  • Splitting long responses into shorter ones is solely the server's responsibility; the client has no control over the length of the responses.


Tech’- Flow control

  • The flow control mechanism is referred to as the “resumption token”. The server splits the long response into shorter ones and attaches a token at the end of each response, which the client passes in its next request to get the next part.


Tech’- Flow control

Exchange between the Harvester (Service Provider) and the Repository (Data Provider):

  • Harvester: “want to have all your records” - archive.org/oai?verb=ListRecords&metadataPrefix=oai_dc

  • Repository: “have 267, but give you only 100” - 100 records + resumptionToken “anyID1”

  • Harvester: “want more of this” - archive.org/oai?resumptionToken=anyID1

  • Repository: “have 267, give you another 100” - 100 records + resumptionToken “anyID2”

  • Harvester: “want more of this” - archive.org/oai?resumptionToken=anyID2

  • Repository: “have 267, give you my last 67” - 67 records + resumptionToken “”
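The resumption-token exchange can be sketched as a simple loop on the harvester side. This Python example is a minimal sketch under stated assumptions: the function names and the injectable `fetch` parameter are invented for illustration, and error handling is omitted. Note that in OAI-PMH 2.0 a follow-up request carries the verb together with the resumptionToken, so the sketch sends both.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetch(url):
    """Default fetcher: return the raw response body for a URL."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", fetch=http_fetch):
    """Yield every record element, following resumption tokens until exhausted."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(params)}"))
        yield from root.iter(OAI + "record")
        token = root.find(f".//{OAI}resumptionToken")
        # A missing or empty token marks the last part of the list.
        if token is None or not (token.text or "").strip():
            break
        # The token is exclusive: it replaces all other arguments.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


if __name__ == "__main__":
    # Hypothetical repository URL in the style of the slides; needs network access.
    for record in harvest("http://archive.org/oai"):
        print(record.findtext(f"{OAI}header/{OAI}identifier"))
```

Passing a custom `fetch` callable makes it possible to try the loop against canned responses without contacting a real repository.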


Conclusions and future use

  • We saw that the increasing number of digital libraries caused the different DL types some problems:

    • FDLs and HDLs had to overcome obstacles such as different metadata formats and different query formats in order to federate or harvest data from SDLs.

    • Users had to learn the different user interface each SDL offered.


Conclusions and future use

  • Looking at OAI-PMH, it seems that putting the protocol to use will eliminate those problems. Service providers can reduce the number of different user interfaces the user needs to handle, and federating or harvesting becomes much easier with a common standard. However…


Conclusions and future use

  • When the protocol is put to use in a digital library environment, the lack of strict rules may cause new problems or make the old ones reappear in another form.

  • Let's take CiteSeer for example. It contains 723140 records and its metadata size is around 1GB. If one wanted to harvest CiteSeer efficiently for records dealing with a specific topic, how could it be done?


Conclusions and future use

  • Since searching within the metadata is done on the harvester side, the harvester cannot ask CiteSeer for only the records dealing with "network computation", for example.

  • Remember the sets? Could they be used to harvest only part of the information instead of handling a gigabyte of data?

  • The answer is no, since CiteSeer contains only one set.
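Because the protocol offers no topic query, the selection has to happen after the harvest. The Python sketch below illustrates such harvester-side filtering. It assumes each harvested record has already been reduced to a dict mapping Dublin Core element names to lists of strings; that representation and both function names are assumptions made for this example.

```python
# DC elements that are searched for the topic phrase in this sketch.
SEARCHED_ELEMENTS = ("title", "subject", "description")


def matches_topic(dc_fields, phrase):
    """True if the phrase occurs (case-insensitively) in any searched element."""
    phrase = phrase.lower()
    for name in SEARCHED_ELEMENTS:
        for value in dc_fields.get(name, []):
            if value and phrase in value.lower():
                return True
    return False


def filter_records(records, phrase):
    """Keep only the harvested records that mention the phrase."""
    return [r for r in records if matches_topic(r, phrase)]


if __name__ == "__main__":
    # Placeholder records, not real CiteSeer data.
    harvested = [
        {"title": ["Network computation in practice"]},
        {"title": ["An unrelated paper"], "subject": ["databases"]},
    ]
    print(len(filter_records(harvested, "network computation")))
```

The whole metadata set still has to be transferred before this filter can run, which is the inefficiency described above.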


Conclusions and future use

  • DC might also be too low a barrier, which leads more and more SDLs to support not only DC but also their own metadata formats (CiteSeer, for example, supports two formats).

    Nevertheless, OAI-PMH is increasingly becoming a standard in digital libraries and is making a large contribution to DLs. From the looks of it, it's here to stay.


What's next

  • Riddle –

    • Improving harvesting and creation of HDLs.

    • Composition of HDLs.


What's next

[Diagram: layered model, Layer 5 down to Layer 1. Components from top to bottom: Web interfaces; CHDL; HDL; OAI-PMH; three SDLs; Web.]


Demonstration

  • Independent queries.

  • Repositories explorer: http://re.cs.uct.ac.za/

  • OAIster (FDL): http://oaister.umdl.umich.edu/o/oaister/

  • Scirus (FDL): http://www.scirus.com/srsapp/

  • Riddle demo: http://riddle.dynalias.com:20055/riddle.html


Resources

  • OAI official site: http://www.openarchives.org/

  • Protocol specification: http://www.openarchives.org/OAI/openarchivesprotocol.html

  • General mailing list: http://www.openarchives.org/mailman/listinfo/OAI-general/

  • Implementers mailing list: http://www.openarchives.org/mailman/listinfo/OAI-implementers/

  • Presentation on which this presentation was based: http://www.oaforum.org/otherfiles/lisb_tutorial.ppt

  • Z39.50: http://www.loc.gov/z3950/agency/


Questions


The end

