Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

Alon Kadury


Content

  • Reminders

  • History

  • OAI overview

  • Technical introduction

  • Conclusions

  • Demonstrations

  • Resources


Definition- A Digital Library is a:

1. Collection of digital objects

2. Collection of knowledge structures

3. Collection of library services

4. Domain/Focus/Topic

5. Quality Control

6. Preservation/Persistence


Types of DLs

  • Single Digital Library (SDL)

    • also Stand-alone, Self-contained

  • Federated Digital Library (FDL)

    • also confederated, distributed

  • Harvested Digital Library (HDL)


Single Digital Library (SDL)

  • A regular DL

  • Self-contained material:

    • purchased

    • scanned/digitized

  • Usually localized


Federated Digital Library (FDL)

  • Contains many autonomous libraries

  • Usually heterogeneous repositories

  • Connected via network

  • Forms a virtual distributed library

  • Transparent user interface

  • The major problem is interoperability.


Harvested Digital Library (HDL)

  • Does not contain data, just metadata

  • Objects harvested into summaries

  • Regular DL characteristics:

    • fine granularity

    • rich library services

    • high quality control

    • annotated


History

  • As the Web evolved, the number of Web sites and search engines increased. A similar process happened with e-prints and digital libraries.

  • The growth in the number of DLs led to the development of OAI-PMH, as we are about to see.


History - Problems

The development of e-prints and digital libraries led to several problems, such as:

  • Many user interfaces - Each DL offered its own Web interface for depositing articles and for end-user searches. The result: end users found it difficult to work across archives without learning multiple different interfaces.


History - Problems

  • Different query syntaxes - The result: users found it difficult to keep track of the search syntax of each SDL, and it was difficult to create an FDL that could query many SDLs.

  • Many metadata formats - An SDL could keep its metadata in any format it wanted. The result: a hard time for the FDLs, which had to know the format of each SDL they were harvesting.


History – Possible solutions

  • These problems led researchers to recognise the need for a single search interface to all archives: the Universal Pre-print Service (UPS).

  • Two possible approaches to building the UPS were considered:


History – Solution 1

Cross-searching multiple archives: in this approach a client sends requests to several servers and then combines the results. The client and the servers work with a known and agreed protocol (for example Z39.50). However, studies showed that this is not the preferred approach for distributed searching across large numbers of nodes, mainly because of performance issues and the difficulty of knowing which collections to search.


History – Solution 2

Harvesting metadata into a ‘Central Server’: this approach harvests the metadata and stores it on a central server, on which searches are then run.

  • The idea was demonstrated at a convention held in Santa Fe, NM, on October 21-22, 1999.

  • UPS was soon renamed the Open Archives Initiative (OAI): http://www.openarchives.org/

    More reading: http://www.dlib.org/dlib/february00/02contents.html


OAI overview- definitions

Let's start with a few definitions:

  • Interoperability

  • Open Archives Initiative (OAI)

  • Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)


OAI overview- definitions

  • What is Interoperability?

  • Interoperability refers to the ability of two or more systems to interact with one another and exchange data according to a prescribed method in order to achieve predictable results.


OAI overview- definitions

  • In order to exchange data we need to agree on things like:

    • request format

    • result format

    • transport protocols (HTTP vs. FTP vs. ...)

    • metadata formats (DC vs. MARC vs. ...)

    • Usage rights (who can do what with the records)

  • We need someone to organize it and “set the rules”.


OAI overview- definitions

  • Who will organize it?

  • The Open Archives Initiative: “The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.” (http://www.openarchives.org/organization/index.html)


OAI overview- definitions

  • What will the interoperability standard be called?

    Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)


OAI overview- Key players

  • When talking about OAI-PMH we see three main players:

    • Data Providers

    • Service Providers

    • The protocol (OAI-PMH)


OAI overview- Data Provider

  • Data Provider:

    • Handles the deposit/publishing of resources in an archive.

    • Exposes metadata about the resources in the archive (through the OAI-PMH interface).

    • May support any metadata format, but must support Dublin Core (DC).

    • Offers free access to the archive (at least to the metadata).

    • A network-accessible server that can process OAI-PMH requests correctly is often called a Repository.


OAI overview - Service Provider

  • Service Provider:

    • Harvests metadata from data providers and uses it to offer a single user interface across all harvested metadata.

    • May enrich the metadata.

    • Offers (value-added) services on the basis of the metadata.

    • A client application that issues OAI-PMH requests is often referred to as a Harvester.


OAI overview - Providers

[Figure: a Service Provider, with its native end-user interface, harvests from two Data Providers through their native harvesting interfaces. Each Data Provider has an input interface; a native end-user interface on the Data Provider is optional (e.g., RePEc).]


OAI overview - Providers

[Figure: Data providers and Service providers, linked by harvesting based on OAI-PMH.]


OAI overview - Model

  • Layer 4: Web interfaces

  • Layer 3: Service Provider - FDL/HDL

  • OAI-PMH (the harvesting interface between Layers 3 and 2)

  • Layer 2: SDL, SDL, SDL

  • Layer 1: Web


Technical introduction

Since the days of the Santa Fe convention the protocol has gone through several versions.

Version 2.0 is the latest and is considered stable. The technical introduction refers to this version.


Tech’- protocol versions

             Santa Fe convention   OAI-PMH v.1.0/1.1         OAI-PMH v.2.0
nature       experimental          experimental              stable
verbs        Dienst                OAI-PMH                   OAI-PMH
requests     HTTP GET/POST         HTTP GET/POST             HTTP GET/POST
responses    XML                   XML                       XML
transport    HTTP                  HTTP                      HTTP
metadata     OAMS                  unqualified Dublin Core   unqualified Dublin Core
about        eprints               document-like objects     resources
model        metadata harvesting   metadata harvesting       metadata harvesting


Tech’- request & response

The requests of the protocol are HTTP based.

The response contents of the protocol are XML based.

Question: why?

Answer:

  • A simple protocol based on existing standards allows rapid development and effortless implementation.

  • Systems can be deployed in a variety of configurations.

  • Low-barrier interoperability specification.

  • Internet/firewall friendly.


Tech’- request & response

[Figure: the Harvester (Service Provider) sends HTTP-based requests to the Repository (Data Provider), which answers with metadata, encoded in XML, about its documents. The Service Provider builds its service on that metadata.]

There are six request types which are called verbs.

The request type and additional information are passed as parameters using HTTP POST or GET methods.
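
As a rough illustration of how the verb and its arguments travel as GET parameters, here is a minimal Python sketch. The helper name `build_request_url` is hypothetical, and the base URL is simply the illustrative one used in the slides; nothing here contacts a real repository.

```python
from urllib.parse import urlencode


def build_request_url(base_url, verb, **arguments):
    """Return an OAI-PMH GET request URL for the given verb and arguments."""
    # The request type always travels as the "verb" parameter.
    params = {"verb": verb}
    # Arguments left as None are skipped, so optional ones can be omitted.
    params.update({name: value for name, value in arguments.items()
                   if value is not None})
    return base_url + "?" + urlencode(params)


# Hypothetical base URL, taken from the examples in the slides.
url = build_request_url(
    "http://archive.org/oai-script",
    "ListRecords",
    metadataPrefix="oai_dc",
    set="biology",
)
print(url)
```

Because `from` is a reserved word in Python, that argument has to be passed as `**{"from": "2002-12-01"}` in this sketch.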


Let's see a demonstration of how an FDL can be created, and then look behind the scenes.

Demo


Tech’– more definition

[Figure: the Harvester of a Service Provider sends requests to the Repositories of several Data Providers (e-prints, images, OPAC, museum, archive) and receives their responses.]

Requests:

  • Identify

  • ListMetadataFormats

  • ListSets

  • ListIdentifiers

  • ListRecords

  • GetRecord

Responses:

  • General information

  • Metadata formats

  • Set structure

  • Record identifier

  • Metadata


Tech’–Request Types

  • Six different request types

    • Identify

    • ListMetadataFormats

    • ListSets

    • ListIdentifiers

    • ListRecords

    • GetRecord

  • Harvester does not have to use all types.

  • Repository must implement all request types fully (all required and optional arguments for each of the requests).


Tech’- Request Type: Identify

function: retrieve the description of, and general information about, an archive.

example: archive.org/oai-script?verb=Identify

parameters: none

errors / exceptions: badArgument, e.g. archive.org/oai-script?verb=Identify&set=biology


Tech’- Request Type: Identify

Response in XML format: http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=Identify
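
To give a feel for what a harvester does with such a response, the sketch below parses an Identify response with Python's standard library. The sample document is invented for illustration (repository name, e-mail address and dates are assumptions, not the output of the repository linked above); only the element names and the namespace follow the OAI-PMH 2.0 response layout.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response; the values are placeholders.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-12-01T12:00:00Z</responseDate>
  <request verb="Identify">http://archive.org/oai-script</request>
  <Identify>
    <repositoryName>Example Archive</repositoryName>
    <baseURL>http://archive.org/oai-script</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>admin@example.org</adminEmail>
    <earliestDatestamp>1999-01-01</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Return the children of <Identify> as a {local name: text} dict."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    # Tags look like "{namespace}name"; keep only the local name.
    # Repeated elements (several adminEmail entries, say) would overwrite
    # each other here, which is acceptable for a sketch.
    return {child.tag.split("}", 1)[1]: child.text for child in identify}


info = parse_identify(SAMPLE_IDENTIFY)
print(info["repositoryName"], info["protocolVersion"])
```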


Tech’- Request Type: ListMetadataFormats

function: retrieve the available metadata formats from an archive. Remember that each archive must implement at least DC.

example: archive.org/oai-script?verb=ListMetadataFormats

parameters: identifier (optional)

errors / exceptions:

  • badArgument

  • idDoesNotExist, e.g. archive.org/oai-script?verb=ListMetadataFormats&identifier=really-wrong-identifier

  • noMetadataFormats
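
In OAI-PMH 2.0 these errors come back inside the XML response as `error` elements carrying a `code` attribute. The following sketch shows one way a harvester might turn them into exceptions; the `OAIError` class, the helper name and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


class OAIError(Exception):
    """Raised when a response carries an OAI-PMH <error> element."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


# Invented sample response for the wrong-identifier request above.
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-12-01T12:00:00Z</responseDate>
  <request verb="ListMetadataFormats"
           identifier="really-wrong-identifier">http://archive.org/oai-script</request>
  <error code="idDoesNotExist">No matching identifier</error>
</OAI-PMH>"""


def raise_for_oai_error(xml_text):
    """Return the parsed root element, or raise OAIError on the first error."""
    root = ET.fromstring(xml_text)
    error = root.find("oai:error", OAI_NS)
    if error is not None:
        raise OAIError(error.get("code"), (error.text or "").strip())
    return root


try:
    raise_for_oai_error(SAMPLE_ERROR)
except OAIError as exc:
    print("repository reported:", exc)
```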


Tech’- Request Type: ListMetadataFormats

Response in XML format http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=ListMetadataFormats


Tech’- Request Type: ListSets

  • Q: What are sets?
    A: Sets are a logical partitioning of a repository.

  • Q: Why use sets?
    A: Sets are meant to enable selective harvesting.

  • Data providers don’t have to define sets.

  • Sets are not strictly hierarchical.


Tech’- Request Type: ListSets

function: retrieve the set structure of a repository

example: archive.org/oai-script?verb=ListSets

parameters: resumptionToken (exclusive)

errors / exceptions:

  • badArgument

  • badResumptionToken, e.g. archive.org/oai-script?verb=ListSets&resumptionToken=any-wrong-token

  • noSetHierarchy


Tech’- Request Type: ListSets

Response in XML format http://theses.lub.lu.se/oai-service/xerxes/?verb=ListSets


Tech’- Request Type: ListIdentifiers

function: abbreviated form of ListRecords, retrieving only headers

example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01

parameters:

  • from (optional)

  • until (optional)

  • metadataPrefix (required)

  • set (optional)

  • resumptionToken (exclusive)

errors / exceptions:

  • badArgument, e.g. …&from=2002-12-01-13:45:00

  • badResumptionToken

  • cannotDisseminateFormat

  • noRecordsMatch

  • noSetHierarchy
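
The badArgument example above fails because of the shape of the date. OAI-PMH 2.0 datestamps are either `YYYY-MM-DD` or `YYYY-MM-DDThh:mm:ssZ`. A small check along those lines might look like the sketch below; the function name is hypothetical, and whether a given repository accepts the finer granularity is something its Identify response declares, which this sketch does not check.

```python
import re
from datetime import datetime

# Day granularity, optionally followed by a UTC time of day.
_DATESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?")


def is_valid_datestamp(value):
    """True if value has one of the two OAI-PMH datestamp shapes."""
    match = _DATESTAMP.fullmatch(value)
    if match is None:
        return False
    fmt = "%Y-%m-%dT%H:%M:%SZ" if match.group(1) else "%Y-%m-%d"
    try:
        # strptime rejects impossible dates such as month 13.
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


for candidate in ("2002-12-01", "2002-12-01T13:45:00Z", "2002-12-01-13:45:00"):
    print(candidate, is_valid_datestamp(candidate))
```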


Tech’- Request Type: ListIdentifiers

Response in XML format http://theses.lub.lu.se/oai-service/xerxes/?verb=ListIdentifiers&metadataPrefix=oai_dc


Tech’- Request Type: ListRecords

function: harvest records from a repository

example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology

parameters:

  • from (optional)

  • until (optional)

  • metadataPrefix (required)

  • set (optional)

  • resumptionToken (exclusive)

errors / exceptions:

  • badArgument

  • badResumptionToken

  • cannotDisseminateFormat

  • noRecordsMatch

  • noSetHierarchy


Tech’- Request Type: GetRecord

function: retrieve an individual metadata record from a repository

example: archive.org/oai-script?verb=GetRecord&identifier=oai:HUBerlin.de:3000218&metadataPrefix=oai_dc

parameters:

  • identifier (required)

  • metadataPrefix (required)

errors / exceptions:

  • badArgument

  • cannotDisseminateFormat

  • idDoesNotExist
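
The required, optional and exclusive parameters of the six verbs can be collected in one table. The sketch below is one possible way to check a request against it, not the way any particular repository does it; the names `VERB_ARGUMENTS` and `check_arguments` are hypothetical, and the `badVerb` code comes from the protocol specification rather than from these slides.

```python
LIST_ARGS = {"required": {"metadataPrefix"}, "optional": {"from", "until", "set"}}

VERB_ARGUMENTS = {
    "Identify": {"required": set(), "optional": set()},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}},
    "ListSets": {"required": set(), "optional": set()},
    "ListIdentifiers": LIST_ARGS,
    "ListRecords": LIST_ARGS,
    "GetRecord": {"required": {"identifier", "metadataPrefix"}, "optional": set()},
}

# Verbs whose responses may be split and resumed with a token.
RESUMABLE = {"ListSets", "ListIdentifiers", "ListRecords"}


def check_arguments(verb, arguments):
    """Return None if the arguments are legal for the verb, else an error code."""
    spec = VERB_ARGUMENTS.get(verb)
    if spec is None:
        return "badVerb"
    names = set(arguments)
    if "resumptionToken" in names:
        # Exclusive: the token must be the only argument besides the verb.
        if verb in RESUMABLE and names == {"resumptionToken"}:
            return None
        return "badArgument"
    if not spec["required"] <= names:
        return "badArgument"  # a required argument is missing
    if names - spec["required"] - spec["optional"]:
        return "badArgument"  # an argument the verb does not accept
    return None


print(check_arguments("Identify", {"set": "biology"}))
print(check_arguments("GetRecord", {"identifier": "oai:HUBerlin.de:3000218",
                                    "metadataPrefix": "oai_dc"}))
```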


Tech’- Records, items & DC, or setting the record straight

[Figure: a resource ("David"); an item, addressed by its identifier, holding all available metadata about the resource; and records, each carrying the item's metadata in one format (Dublin Core, MARC, SPECTRUM).]


Tech’- Records, items & DC

A record consists of:

  • Header (mandatory)

    • identifier (1)

    • datestamp (1)

    • setSpec elements (*)

    • status attribute for deleted item (?)

  • Metadata (mandatory)

    • XML encoded metadata with root tag, namespace

    • repositories must support Dublin Core

  • About (optional)

    • rights statements

    • provenance statements
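
The record structure listed above maps directly onto XML. As a sketch, the code below reads the header fields of a single record; the sample record and the helper name `parse_header` are invented for the example (the identifier is the one used in the GetRecord slide).

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Invented sample record with a header and a Dublin Core metadata part.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:HUBerlin.de:3000218</identifier>
    <datestamp>2002-12-01</datestamp>
    <setSpec>biology</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example title</dc:title>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_header(record):
    """Return the header of a <record> element as a dict."""
    header = record.find(OAI + "header")
    return {
        "identifier": header.findtext(OAI + "identifier"),   # exactly one
        "datestamp": header.findtext(OAI + "datestamp"),     # exactly one
        "setSpecs": [e.text for e in header.findall(OAI + "setSpec")],  # zero or more
        "deleted": header.get("status") == "deleted",        # optional attribute
    }


record = ET.fromstring(SAMPLE_RECORD)
header = parse_header(record)
has_metadata = record.find(OAI + "metadata") is not None
has_about = record.find(OAI + "about") is not None
print(header, has_metadata, has_about)
```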


Tech’- Records, items & DC

  • OAI-PMH supports dissemination of multiple metadata formats from a repository.

  • Properties of metadata formats:

    • id string to specify the format (metadataPrefix)

    • metadata schema URL (XML schema to test validity)

    • XML namespace URI (global identifier for metadata format)

  • Repositories must be able to disseminate unqualified DC.

  • Arbitrary metadata formats can be defined and transported via the OAI-PMH.

  • Returned metadata must comply with XML namespace specification.


Tech’- Records, items & DC

As mentioned before, the minimum standard is unqualified Dublin Core (http://dublincore.org/).

  • Dublin Core Metadata Element Set contains 15 elements.

  • All elements are optional.

  • All elements may be repeated.

    The Dublin Core Metadata Element Set:
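
The fifteen element names are listed in the sketch below, together with a small helper that collects the values found in an `oai_dc` metadata block. Since every element is optional and repeatable, each value is kept in a list. The sample XML and the helper name `dc_to_dict` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of the (unqualified) Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Invented sample metadata block.
SAMPLE_DC = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example title</dc:title>
  <dc:creator>First Author</dc:creator>
  <dc:creator>Second Author</dc:creator>
</oai_dc:dc>"""


def dc_to_dict(xml_text):
    """Map each DC element present in the block to the list of its values."""
    root = ET.fromstring(xml_text)
    result = {}
    for name in DC_ELEMENTS:
        values = [e.text for e in root.findall("{%s}%s" % (DC_NS, name))]
        if values:  # elements are optional, so absent ones are skipped
            result[name] = values
    return result


print(dc_to_dict(SAMPLE_DC))
```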


Tech’- Records, items & DC

Response in XML format http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=GetRecord&identifier=oai:CiteSeerPSU:1&metadataPrefix=oai_dc


Tech’- Flow control

  • Some requests can generate a very long response (think, for example, of asking CiteSeer or the Library of Congress to list ALL their records using the ListRecords verb).

  • To avoid long responses that would overload the server, a flow control mechanism was added to the protocol.

  • Splitting long responses into shorter ones is solely the server's responsibility; the client has no control over the length of the responses.


Tech’- Flow control

  • The flow control mechanism is referred to as the “resumption token”: the server splits a long response into shorter ones and attaches to the end of each a token, which the client passes in its next request to get the next part.


Tech’- Flow control

Example exchange between the Harvester (Service Provider) and the Repository (Data Provider):

  • Harvester: “want to have all your records”
    archive.org/oai?verb=ListRecords&metadataPrefix=oai_dc

  • Repository: “have 267, but give you only 100”
    100 records + resumptionToken “anyID1”

  • Harvester: “want more of this”
    archive.org/oai?verb=ListRecords&resumptionToken=anyID1

  • Repository: “have 267, give you another 100”
    100 records + resumptionToken “anyID2”

  • Harvester: “want more of this”
    archive.org/oai?verb=ListRecords&resumptionToken=anyID2

  • Repository: “have 267, give you my last 67”
    67 records + resumptionToken “” (an empty token: the list is complete)
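
The exchange can be sketched as a loop on the harvester side. To keep the example self-contained, the repository is replaced by an in-memory stand-in that serves 267 records in pages of 100; its token format (a plain offset) is an assumption of this sketch, since real repositories choose their own opaque tokens.

```python
def make_fake_repository(total=267, page_size=100):
    """Return a stand-in for a repository's ListRecords handler."""
    records = [f"record-{i}" for i in range(1, total + 1)]

    def list_records(resumption_token=None):
        # In this sketch the token is simply the offset of the next record.
        start = int(resumption_token) if resumption_token else 0
        end = min(start + page_size, total)
        # An empty token tells the harvester that the list is complete.
        next_token = str(end) if end < total else ""
        return records[start:end], next_token

    return list_records


def harvest_all(list_records):
    """Keep requesting pages until the repository returns an empty token."""
    harvested = []
    token = None
    while True:
        page, token = list_records(token)
        harvested.extend(page)
        if not token:
            break
    return harvested


repository = make_fake_repository()
all_records = harvest_all(repository)
print(len(all_records))
```

Note that the harvester never chooses the page size; it only follows the tokens it is handed, which matches the point that response length is the server's decision.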


Conclusions and future use

  • We saw that the increasing number of digital libraries caused problems for the different DL types:

    • FDLs and HDLs had to overcome obstacles, such as differing metadata formats and query formats, in order to federate or harvest data from SDLs.

    • Users had to learn the different user interface that each SDL offered.


Conclusions and future use

  • At first glance, putting OAI-PMH to use seems to eliminate those problems. Service providers can reduce the number of different user interfaces the user needs to handle, and federating or harvesting becomes much easier with a common standard. However…


Conclusions and future use

  • When the protocol is put to use in a digital library environment, the lack of strict rules may cause new problems or make the old ones reappear in another form.

  • Let's take CiteSeer as an example. It contains 723,140 records and its metadata is around 1 GB in size. If one wanted to harvest CiteSeer efficiently for records dealing with a specific topic, how could it be done?


Conclusions and future use

  • Since searching within the metadata is done on the harvester side, the harvester cannot ask CiteSeer to give it only the records dealing with "network computation", for example.

  • Remember the sets? Could they be used to harvest only part of the information instead of handling a gigabyte of data?

  • The answer is no, since CiteSeer contains only one set.
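
In practice this means the harvester fetches everything and filters afterwards. A minimal sketch of that harvester-side step, assuming records have already been harvested and converted into plain dictionaries (the record layout, identifiers and function names are invented for the example):

```python
def matches_topic(record, topic):
    """True if the topic appears in the record's title or subjects (case-insensitive)."""
    haystack = " ".join([record.get("title", "")] + record.get("subject", []))
    return topic.lower() in haystack.lower()


def filter_harvested(records, topic):
    """Keep only the harvested records that mention the topic."""
    return [record for record in records if matches_topic(record, topic)]


# Invented records standing in for already-harvested Dublin Core metadata.
harvested = [
    {"identifier": "oai:example:1", "title": "Network computation on clusters",
     "subject": ["networks"]},
    {"identifier": "oai:example:2", "title": "Protein folding",
     "subject": ["biology"]},
    {"identifier": "oai:example:3", "title": "Scheduling",
     "subject": ["Network Computation"]},
]

selected = filter_harvested(harvested, "network computation")
print([record["identifier"] for record in selected])
```

The filtering is cheap; the cost the slides point to is that all the records must be transferred before it can run.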


Conclusions and future use

  • DC might also be too low a barrier, which leads more and more SDLs not only to support DC but also to create their own metadata formats (CiteSeer, for example, supports two formats).

    Nevertheless, OAI-PMH is increasingly becoming a standard in digital libraries, is making a large contribution to DLs and, from the looks of it, is here to stay.


What's next

  • Riddle –

    • Improving harvesting and creation of HDLs.

    • Composition of HDLs.


What's next

  • Layer 5: Web interfaces

  • Layer 4: CHDL

  • Layer 3: HDL

  • OAI-PMH (the harvesting interface between Layers 3 and 2)

  • Layer 2: SDL, SDL, SDL

  • Layer 1: Web


Demonstration

  • Independent queries.

  • Repositories explorer: http://re.cs.uct.ac.za/

  • OAIster (FDL): http://oaister.umdl.umich.edu/o/oaister/

  • Scirus (FDL): http://www.scirus.com/srsapp/

  • Riddle demo: http://riddle.dynalias.com:20055/riddle.html


Resources

  • OAI – official site: http://www.openarchives.org/

  • Protocol specification: http://www.openarchives.org/OAI/openarchivesprotocol.html

  • General mailing list: http://www.openarchives.org/mailman/listinfo/OAI-general/

  • Implementers mailing list: http://www.openarchives.org/mailman/listinfo/OAI-implementers/

  • Presentation on which this presentation was based: http://www.oaforum.org/otherfiles/lisb_tutorial.ppt

  • Z39.50: http://www.loc.gov/z3950/agency/


Questions


The end