
mod_oai: Metadata Harvesting for Everyone

Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango ({mln,aelango}@cs.odu.edu, {herbertv,liu_x}@lanl.gov). DLF 2004 Fall Forum, Baltimore MD, October 25-27, 2004. mod_oai is sponsored by the Andrew Mellon Foundation.


Presentation Transcript


  1. mod_oai: Metadata Harvesting for Everyone Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango {mln,aelango}@cs.odu.edu {herbertv,liu_x}@lanl.gov DLF 2004 Fall Forum Baltimore MD October 25-27, 2004 mod_oai is sponsored by the Andrew Mellon Foundation

  2. Outline • mod_oai • crawling vs. harvesting • complex objects & OAI-PMH • how mod_oai works • scenarios • demos • More information • http://www.modoai.org/ • http://www.openarchives.org/

  3. Inefficient Web Crawlers • crawler asks www.getty.edu: what documents have been modified since 2003-11-15? • doc1; last mod 2003-03-12 • doc2; last mod 2002-07-19 • doc3; last mod 2003-11-29 • doc4; last mod 2002-10-03 • … • doc100; last mod 2003-09-113 • robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG

  4. A More Efficient Way… • with OAI-PMH, the harvester asks www.getty.edu: what documents have been modified since 2003-11-15? • doc1; last mod 2003-03-12 • doc2; last mod 2002-07-19 • doc3; last mod 2003-11-29 • doc4; last mod 2002-10-03 • … • doc100; last mod 2003-09-113
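The contrast between slides 3 and 4 can be sketched in a few lines of Python. This is a toy model, not mod_oai code: the dictionary holds only the four documents whose dates are fully legible on the slide, and the function names `crawl` and `harvest` are hypothetical. The point is the request count: a crawler checks every document, while a datestamp-aware harvester asks one selective question.

```python
from datetime import date

# Last-modified dates for the documents listed on the slide.
LAST_MODIFIED = {
    "doc1": date(2003, 3, 12),
    "doc2": date(2002, 7, 19),
    "doc3": date(2003, 11, 29),
    "doc4": date(2002, 10, 3),
}

def crawl(since):
    """Crawler-style: one request per document, filtering on the client."""
    requests = 0
    changed = []
    for name, modified in LAST_MODIFIED.items():
        requests += 1  # every document costs a request
        if modified >= since:
            changed.append(name)
    return changed, requests

def harvest(since):
    """Harvester-style: one selective request; the server does the filtering."""
    changed = [name for name, modified in LAST_MODIFIED.items() if modified >= since]
    return changed, 1

print(crawl(date(2003, 11, 15)))
print(harvest(date(2003, 11, 15)))
```

Both calls report the same changed document, but the crawler spends one request per document while the harvester spends a single request regardless of collection size.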

  5. mod_oai • Goal: integrate OAI-PMH functionality into the web server itself… • mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server • written in C • respects values in .htaccess, httpd.conf • Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets) • www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=video:mpeg
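A selective request like the one on slide 5 is an ordinary HTTP GET whose query string carries the OAI-PMH `verb` plus its arguments. The helper below is a minimal sketch of how a harvester might assemble such a URL; `build_oai_request` is a hypothetical name, and the base URL is the slide's placeholder host, not a live server.

```python
from urllib.parse import urlencode

def build_oai_request(base_url, verb, arguments=None):
    """Return an OAI-PMH request URL for `verb` with optional arguments.

    `arguments` is a dict because `from` is a reserved word in Python
    and cannot be passed as a keyword argument.
    """
    params = {"verb": verb}
    params.update(arguments or {})
    # Keep ':' unescaped so set names stay readable in the URL.
    return base_url + "?" + urlencode(params, safe=":")

url = build_oai_request(
    "http://www.foo.edu/modoai",
    "ListIdentifiers",
    {"metadataPrefix": "oai_dc", "from": "2004-09-15", "set": "video:mpeg"},
)
print(url)
```

Changing `from`, adding `until`, or switching `set` narrows the harvest without any change on the server side.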

  6. OAI-PMH data model • [Figure: a resource is exposed as an item; the OAI-PMH identifier is the entry point to all records pertaining to the resource. Records shown: Dublin Core metadata, MARCXML metadata, MPEG-21 DIDL, METS. Labels: metadata pertaining to the resource; modeled representation of the resource (more expressive model); simple model; complex model.]

  7. OAI-PMH and complex models • OAI-PMH record == modeled representation of the resource • Can be selectively harvested via OAI-PMH ~ datestamp, set • Resource can be: • simple object (1 file) • compound object (multiple files) • OAI-PMH records can contain: • Typical metadata • Actual resource(s) • By-Value – base64 encoded • By-Reference – http address of resource • both • Identifiers of metadata and resource(s), unambiguously mapped to the identified data • A variety of secondary information
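The by-value and by-reference options on slide 7 differ only in what the record carries: the encoded bytes of the resource, or an address from which to fetch it. The sketch below illustrates that choice with plain Python dictionaries; it does not produce MPEG-21 DIDL XML, and the function and key names are hypothetical.

```python
import base64

def by_value(content: bytes) -> dict:
    """Embed the resource itself, base64 encoded so it is safe inside XML."""
    return {"encoding": "base64", "data": base64.b64encode(content).decode("ascii")}

def by_reference(url: str) -> dict:
    """Carry only the http address; the harvester fetches the bytes later."""
    return {"ref": url}

def recover(representation: dict) -> bytes:
    """Decode a by-value representation back into the original bytes."""
    return base64.b64decode(representation["data"])

embedded = by_value(b"hello")
pointer = by_reference("http://www.foo.edu/doc3.html")
print(embedded["data"], pointer["ref"])
```

By-value makes a record self-contained at the cost of size (base64 adds roughly a third), while by-reference keeps records small but requires a second request per resource.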

  8. Complex Objects & OAI-PMH • LANL Repository • OAI-PMH as a Repository Access Protocol to access metadata and content represented as DIDLs • APS/LANL/LoC Mirroring • OAI-PMH transfer of APS content represented in application neutral format (DIDLs) • LANL DSpace Plug-in • Exposes MPEG-21 DIDL documents through built-in DSpace OAI-PMH infrastructure

  9. How mod_oai works • Install on an Apache 2.0 server • compile & edit httpd.conf http://www.foo.edu/ now has an OAI-PMH baseURL of: http://www.foo.edu/modoai

  10. OAI-PMH characteristics: Typical Repository

  11. OAI-PMH Data Model in mod_oai • resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf • item: OAI Identifier == URL of Resource • Set membership == MIME type • records: Modeled Representations (DC, HTTP, DIDL) • Dublin Core metadata • HTTP headers • DIDL: base64 or urls + HTTP headers

  12. OAI-PMH characteristics: mod_oai

  13. OAI-PMH Concepts

  14. http_header

  15. Use Cases • Regular Web Crawling • use ListIdentifiers to discover URLs • add new URLs to the list of URLs to be crawled • Harvesting Resources w/ OAI-PMH • use ListRecords to extract the entire resource as an MPEG-21 DIDL AIP

  16. Regular Crawling: ListIdentifiers harvester issues a ListIdentifiers, finds the updates, and does HTTP GETs on just the updates
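The crawling scenario on slide 16 comes down to reading identifiers out of a ListIdentifiers response and then issuing ordinary HTTP GETs for them, which works because mod_oai uses the resource URL as the OAI identifier. The sketch below covers only the parsing step. The sample response is hand-written for illustration (it was not captured from a mod_oai server), and `extract_identifiers` is a hypothetical name; the namespace is the standard OAI-PMH 2.0 one.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0 namespace, in ElementTree's {uri}tag notation.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hand-written sample response, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/doc3.html</identifier>
      <datestamp>2003-11-29</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

def extract_identifiers(xml_text):
    """Return the identifier of every header in a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    return [header.findtext(OAI_NS + "identifier")
            for header in root.iter(OAI_NS + "header")]

urls_to_crawl = extract_identifiers(SAMPLE)
print(urls_to_crawl)
```

A crawler would append the returned URLs to its existing fetch queue, so the rest of its pipeline stays unchanged.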

  17. Resource Harvesting: ListRecords harvester issues a ListRecords, and gets the updates in DIDLs (http headers + by-value or by-ref resources)

  18. Demo • Repository Explorer • http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai • http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai?archive=http://whiskey.cs.odu.edu/modoai • Direct URLs • http://whiskey.cs.odu.edu/modoai?verb=Identify • http://whiskey.cs.odu.edu/modoai?verb=ListMetadataFormats • http://whiskey.cs.odu.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc • http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=http_header • http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=oai_didl

  19. Datestamps and Etags • L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004 http://www.netarchive.dk/website/publications/Etags-2004.pdf • Procedure • 16 harvests over 1 month of 465,374 .dk domains • 5,543,470 possible downloads • 5,182,034 successful downloads • 599,143 changes • Datestamp and Etag Example

  20. Errors in Datestamps and Etags Indicating Change • 40.1% of pages without Etags • 0.07% of pages without Datestamps • L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004 http://www.netarchive.dk/website/publications/Etags-2004.pdf

  21. mod_oai… • is: • a simple way to more efficiently harvest web pages • a possible impact on robots.txt • fully OAI-PMH compliant • works with existing harvesters • is not: • yet suitable for dynamic files • a replacement for DSpace, Fedora, eprints.org, or other digital libraries / repositories / cms
