
A New Model for Web Resource Harvesting

A New Model for Web Resource Harvesting. Michael Nelson Computer Science Department Old Dominion University Herbert Van de Sompel Digital Library Research & Prototyping Team Research Library, Los Alamos National Laboratory. Her.


Presentation Transcript


  1. A New Model for Web Resource Harvesting • Michael Nelson, Computer Science Department, Old Dominion University • Herbert Van de Sompel, Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory • This work supported in part by the Andrew Mellon Foundation & Library of Congress

  2. Outline (0) The Problem (1) mod_oai (2) Future Research

  3. WWW and DL: Separated at Birth (1994 to today) • WWW: The Good: XML, BitTorrent, Web Services; The Bad: RSS; The Ugly: Semantic Web • DL: The Good: OAIS, DOI, OAI-PMH; The Bad: Dublin Core; The Ugly: SRU/W • The problem is not that the WWW doesn’t work; it clearly does. The problem is that our expectations have been lowered.

  4. Web Robots • what is this file? • what are its relationships to other files? • how often does it change? • what documents have been modified since 2003-11-15? • www.getty.edu: doc1, last mod 2003-03-12; doc2, last mod 2002-07-19; …; doc100, last mod 2003-09-11 • robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG

  5. A More Efficient Way • what documents have been modified since 2003-11-15? • www.getty.edu with mod_oai: doc1, last mod 2003-03-12; doc2, last mod 2002-07-19; …; doc100, last mod 2003-09-11 • complex object: <co> <metadata/> <link/> <link/> <change/> … </co>

  6. Outline (0) The Problem (1) mod_oai (2) Future Research

  7. mod_oai approach • Goal: integrate OAI-PMH functionality into the web server itself… • mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server • written in C • respects values in .htaccess, httpd.conf • compile mod_oai on http://www.foo.edu/ • baseURL is now http://www.foo.edu/modoai • Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets) • http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg
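The request on this slide is an ordinary OAI-PMH query string appended to the baseURL. As a minimal sketch, assuming the hypothetical host `www.foo.edu` from the slide, a harvester could assemble it as follows; the helper name `build_oai_request` is invented for illustration and is not part of mod_oai.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb, args):
    """Return an OAI-PMH request URL for `verb` with the given arguments.

    `args` is a plain dict because "from" is a reserved word in Python
    and cannot be passed as a keyword argument directly.
    """
    query = {"verb": verb}
    query.update(args)
    # Keep ":" unescaped so set names such as mime:video:mpeg stay readable.
    return base_url + "?" + urlencode(query, safe=":")


url = build_oai_request(
    "http://www.foo.edu/modoai",      # baseURL from the slide (example host)
    "ListIdentifiers",
    {
        "metadataPrefix": "oai_dc",
        "from": "2004-09-15",         # only resources changed since this date
        "set": "mime:video:mpeg",     # only resources of this MIME type
    },
)
print(url)
```

The `from` and `set` arguments are what give the harvest its selective, date-bounded semantics; dropping them would request identifiers for every resource.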

  8. OAI-PMH data model in mod_oai • resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf • item: OAI-PMH identifier = entry point to all records pertaining to the resource • records: metadata pertaining to the resource (HTTP header metadata, Dublin Core metadata, MPEG-21 DIDL) • OAI-PMH sets: MIME type

  9. OAI-PMH concepts : typical repository

  10. OAI-PMH concepts : mod_oai

  11. Resource Discovery: ListIdentifiers • harvester issues a ListIdentifiers request • finds URLs of updated resources • does HTTP GETs for the updated resources only • can get URLs of resources with specified MIME types
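The discovery step can be sketched as parsing a ListIdentifiers response and keeping only the identifiers worth fetching. The sample response below is hand-written for illustration, not captured mod_oai output; it assumes, as the data model slide suggests, that the OAI-PMH identifier is the resource URL. The function name `updated_urls` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response elements.
OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hand-written sample response (illustrative data only).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/lectures/intro.mpg</identifier>
      <datestamp>2004-09-20</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
    <header status="deleted">
      <identifier>http://www.foo.edu/lectures/old.mpg</identifier>
      <datestamp>2004-09-18</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def updated_urls(response_xml):
    """Return identifiers of resources that changed and still exist."""
    root = ET.fromstring(response_xml)
    urls = []
    for header in root.iter(OAI + "header"):
        # OAI-PMH marks removed items with status="deleted"; skip those.
        if header.get("status") == "deleted":
            continue
        urls.append(header.findtext(OAI + "identifier"))
    return urls


print(updated_urls(SAMPLE_RESPONSE))
```

A harvester would then issue one HTTP GET per returned URL instead of re-crawling the whole site.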

  12. Preservation: ListRecords • harvester issues a ListRecords request • gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference) • can get resources with specified MIME types

  13. performance of mod_oai and wget on www.cs.odu.edu

  14. Readings • Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Terry L. Harrison, Nathan McFarland. mod_oai: An Apache Module for Metadata Harvesting. http://arxiv.org/abs/cs.DL/0503069

  15. Outline (0) The Problem (1) mod_oai (2) Future Research

  16. Issues and Future Work • For a given server, there is a set of URLs, U, and a set of files, F • Apache maps U → F • mod_oai maps F → U • Neither function is 1-1 nor onto • We can easily check if a single u maps to F, but given F we cannot (easily) generate U • Short-term issues: • dynamic files • exporting unprocessed server-side files would be a security hole • IndexIgnore • httpd will “hide” valid URLs • File permissions • httpd will advertise files it cannot read • Long-term issues: • Alias, Location • files can be covered up by the httpd • UserDir • interactions between the httpd and the filesystem

  17. IndexIgnore & File Permissions

  18. Alias: Covering Up Files
      httpd.conf:
        Alias /A /usr/local/web/htdocs/B
        Alias /B /usr/local/web/htdocs/A
      The files “A” and “B” will be different from the URLs http://server/A and http://server/B
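The effect of the two Alias lines can be illustrated with a toy model of URL-to-file resolution. This is a deliberately simplified sketch, not Apache's actual mod_alias logic: it assumes a document root of `/usr/local/web/htdocs` and plain prefix matching, and both function names are hypothetical.

```python
DOCUMENT_ROOT = "/usr/local/web/htdocs"  # assumed document root

# The two Alias directives from the slide, as (URL prefix, file path) pairs.
ALIASES = [
    ("/A", "/usr/local/web/htdocs/B"),
    ("/B", "/usr/local/web/htdocs/A"),
]


def url_path_to_file(url_path):
    """Map a URL path to a file path (the U -> F direction)."""
    for prefix, target in ALIASES:
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    return DOCUMENT_ROOT + url_path


def naive_file_to_url_path(file_path):
    """Guess a URL path from a file path by stripping the document root.

    This is what a walk over the filesystem would produce (F -> U).
    """
    return file_path[len(DOCUMENT_ROOT):]


file_a = DOCUMENT_ROOT + "/A"
guessed = naive_file_to_url_path(file_a)   # "/A"
served = url_path_to_file(guessed)         # actually the file "B"
print(guessed, "->", served)
```

Walking the filesystem yields the URL `/A` for the file `A`, but requesting `/A` returns the file `B`, so a file-derived URL list advertises the wrong content.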

  19. UserDir: “Just in Time” mounting of directories
      whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
      liu_x/ mln/
      whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
      /home/tharriso/
      whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
      liu_x/ mln/ tharriso/
      whiskey.cs.odu.edu:/ftp/WWW/conf%

  20. Looking Further Down the Road for mod_oai • “Reverse” the method of URL discovery • cannot look to the files; • listen to incoming requests and build a list of valid URLs • could be seeded with files at start • also the method for handling server processed files / URLs • Plug-ins for descriptive metadata • DC tags in HTML • MS Office formats, PDF • Tags from JPEG, TIFF, MP3, etc. • Additional metadata in the DIDL • technical metadata from JHOVE • estimated change rate • cf. Cho & Garcia-Molina, ACM TOIT 28(4) • http log access as separate metadata formats • cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)
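The "reverse" discovery idea, building the URL list from observed requests rather than from the filesystem, can be sketched offline against access-log lines. This is only an illustration of the idea, not how mod_oai implements or plans to implement it; the log lines are made up, and the sketch assumes Common Log Format and treats a 200 response as evidence of a valid URL.

```python
import re

# Request method, path and status code from a Common Log Format line.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3}) ')

# Made-up log lines for illustration.
SAMPLE_LOG = [
    '10.0.0.1 - - [15/Sep/2004:10:00:00 -0400] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.2 - - [15/Sep/2004:10:00:05 -0400] "GET /missing.html HTTP/1.1" 404 209',
    '10.0.0.3 - - [15/Sep/2004:10:00:09 -0400] "GET /~mln/pubs.html HTTP/1.1" 200 5120',
]


def valid_urls(log_lines, seed=()):
    """Return the sorted set of URL paths known to be valid.

    `seed` holds paths found by an initial filesystem walk; successfully
    served requests add paths the walk could not see (e.g. UserDir).
    """
    urls = set(seed)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group(2) == "200":
            urls.add(match.group(1))
    return sorted(urls)


print(valid_urls(SAMPLE_LOG, seed=["/index.html"]))
```

Because the list is driven by what the server actually answered, it sidesteps the Alias and UserDir mismatches between files and URLs, at the cost of only knowing about URLs someone has requested.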

  21. Expanding OAI-PMH / Complex Object Access • OAI-PMH / CO access for: • blogs • message boards • native file systems • e.g. Mac OS X “Spotlight” • More aggressive use of OAI-PMH / CO for preservation • recently funded NSF DIGARCH program • use for preservation: • Usenet • Email • Multicasting

  22. OAI-PMH + Complex Objects: A New Model for Web Resource Harvesting • Better web harvesting can be achieved through: • OAI-PMH: structured access to updates • Complex object formats: modeled representation of digital objects • Use cases: • Preservation (ListRecords) • Web crawling (ListIdentifiers) • mod_oai: reference implementation • Better performance than wget • static files only; dynamic files in the future • not a replacement for DSpace, Fedora, eprints.org, etc. • More info: • http://www.modoai.org/ • http://whiskey.cs.odu.edu/
