A New Model for Web Resource Harvesting
Sponsored Links
This presentation is the property of its rightful owner.
1 / 45

A New Model for Web Resource Harvesting PowerPoint PPT Presentation


  • 97 Views
  • Uploaded on
  • Presentation posted in: General

A New Model for Web Resource Harvesting. Michael L. Nelson Old Dominion University joint work with:. Her. Herbert Van de Sompel Xiaoming Liu. Carl Lagoze Simeon Warner. Terry Harrison. This work supported in part by the Andrew Mellon Foundation & Library of Congress.

Download Presentation

A New Model for Web Resource Harvesting

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


A New Model for Web Resource Harvesting

Michael L. Nelson

Old Dominion University

joint work with:

Her

Herbert Van de Sompel

Xiaoming Liu

Carl Lagoze

Simeon Warner

Terry Harrison

This work supported in part by the Andrew Mellon Foundation & Library of Congress


My Research Interests

  • Digital Objects and Repositories

    • Interaction between them: roles, responsibilities, architecture, scalability

    • Bumper sticker: Free the Object from the Tyranny of the Repository

  • Digital Preservation

    • Shared infrastructure models, automation, large-scale best effort strategies

    • Bumper sticker: We Need Fewer Heroes

  • User / System Co-Evolution

    • Discerning intent from access large-scale patterns

    • Bumper sticker: Freedom From Choice


Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research


WWW and DL: Separated at Birth

The Good: XML, BitTorrent, Web Services

The Bad: RSS

The Ugly: Semantic Web

WWW

WWW

DL

DL

The Good: OAIS, DOI, OAI-PMH

The Bad: Dublin Core

The Ugly: SRU/W

Today

1994

The problem is not that the WWW doesn’t work; it clearly does.

The problem is that our expectations have been lowered.

more on DL/WWW, from the NSF Post-DL Workshop:

http://www.sis.pitt.edu/~dlwkshop/paper_sompel.html

http://www.sis.pitt.edu/~dlwkshop/paper_lagoze.html


what is this file?

what are its relationships to other files?

how often does it change?

Web Robots

what documents have been

modified since 2003-11-15 ?

www.getty.edu

doc1; last mod

2003-03-12

doc2; last mod

2002-07-19

doc100; last mod

2003-09-11

robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG


<co>

<metadata/>

<link/>

<link/>

<change/>

</co>

A More Efficient Way

what documents have been

modified since 2003-11-15 ?

www.getty.edu

with mod_oai

doc1; last mod

2003-03-12

doc2; last mod

2002-07-19

doc100; last mod

2003-09-11


Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research


A Very Brief OAI-PMH History

  • Universal Preprint Service

    • A cross-archive DL that that provides services on a collection of metadata harvested from multiple archives

      • not distributed searching

    • Demonstrated at Santa Fe NM, October 21-22, 1999

      • D-Lib Magazine, 6(2) 2000 (2 articles)

        • http://www.dlib.org/dlib/february00/02contents.html

    • UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/

  • The OAI has authored, among other things, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

    • in use around the world, 600+ known instances

      • registration not required; many unknown instances

        • http://gita.grainger.uiuc.edu/registry/

        • http://celestial.eprints.org/cgi-bin/status

    • used by Google ca. late 2004

      • http://www.nla.gov.au/digicoll/oai/


Data Providers /

Repositories

Service Providers /

Harvesters

“A repository is a network accessible server that can process the 6 OAI-PMH requests …

A repository is managed by a data provider to expose metadata to harvesters.” 

“A harvester is a client application that issues OAI-PMH requests.  A harvester is operated by a service provider as a means of collecting metadata from repositories.”


Aggregators

  • aggregators allow for:

    • scalability for OAI-PMH

    • load balancing

    • community building

    • discovery

data providers

(repositories)

service providers

(harvesters)

aggregator


resource

OAI-PMH sets

OAI-PMH identifier

item

OAI-PMHidentifier

metadataPrefix

datestamp

Dublin Core

metadata

MARCXML

metadata

records

OAI-PMH data model

entry point to all records pertaining to the resource

metadata pertaining

to the resource


Overview of OAI-PMH Verbs

metadata

about the

repository

harvesting

verbs

most verbs take arguments: dates, sets, ids, metadata formats

and resumption token (for flow control)


Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research


Resource Harvesting: Use cases

  • Discovery: use content itself in the creation of services

    • search engines that make full-text searchable

    • citation indexing systems that extract references from the full-text content

    • browsing interfaces that include thumbnail versions of high-quality images from cultural heritage collections

  • Preservation:

    • periodically transfer digital content from a data repository to one or more trusted digital repositories

    • trusted digital repositories need a mechanism to automatically synchronize with the originating data repository

Ideas first presented in Van de Sompel, Nelson, Lagoze & Warner, http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html


Existing OAI-PMH based approaches

Typical scenario:

  • An OAI-PMH harvester harvests Dublin Core records from the OAI-PMH repository.

  • The harvester analyzes each Dublin Core record, extracting dc.identifier information in order to determine the network location of the described resource.

  • A separate process, out-of-band from the OAI-PMH, collects the described resource from its network location.


Existing OAI-PMH based approaches : Issue 1

  • Locating the resource based on information provided in dc.identifier

    • dc.identifier used to convey a variety of identifier: (simultaneously) URL DOI, bibliographic citation, … Not expressive enough to distinguish between identifier, locator.

      • Several dereferencing attempts required

    • URI provided in dc.identifier is commonly that of a bibliographic “splash page”

      • How to know it is a bibliographic “splash page”, not the resource?

      • If it is a bibliographic “splash page”, where is the resource?


Existing OAI-PMH based approaches : Issue 2

  • Using the OAI-PMH datestamp of the Dublin Core record to trigger incremental harvesting:

    • Datestamp of DC record does not necessarily change when resource changes


Existing OAI-PMH based approaches : Conventions

  • Cannot really address issue 2 (datestamps) with metadata conventions

  • Issue 1 (identifier & locator of the resource) is currently addressed with a range of conventions

    • First dc.identifier is locator of the resource

      • what if the resource is not digital?

    • Use of dc.format and/or dc.relation to convey locator


splash page

locator of resource

Existing OAI-PMH based approaches : Conventions


splash page

locator of resource

Existing OAI-PMH based approaches : Conventions


splash page

splash page

locator of resource

Existing OAI-PMH based approaches : Conventions


Existing OAI-PMH based approaches : Other attempts

  • dc.identifier leads to splash page & splash page contains special purpose XHTML link to resource(s)

    • What if there is no splash page?

    • How does a harvester recognize this situation?

  • OA-X: protocol extension

    • OK in local context

    • Strategic problem to generalize

    • How to consolidate with OAI-PMH data model

  • Qualified Dublin Core

    • Could bring expressiveness to distinguish between locator & identifier

    • But what about the datestamp issue?


Proposed OAI-PMH based approach

  • Use metadata formats that were specifically created for representation of digital objects:

    • Complex Object Formats as OAI-PMH metadata formats

    • MPEG-21 DIDL, METS, ..


resource

item

Dublin Core

metadata

MARCXML

metadata

MPEG-21

DIDL

METS

records

OAI-PMH data model

OAI-PMH identifier

= entry point to all records pertaining to the resource

metadata pertaining

to the resource

simple

more expressive

highly

expressive

highly

expressive


Complex Object Formats : characteristics

  • Representation of a digital object by means of a wrapper XML document.

  • Represented resource can be:

    • simple digital object (consisting of a single datastream)

    • compound digital object (consisting of multiple datastreams)

  • Unambiguous approach to convey identifiers of the digital object and its constituent datastreams.

  • Include datastream:

    • By-Value: embedding of base64-encoded datastream

    • By-Reference: embedding network location of the datastream

    • not mutually exclusive; equivalent

  • Include a variety of secondary information

    • By-Value

    • By-Reference

    • Descriptive metadata, rights information, technical metadata, …


Complex Object Formats & OAI-PMH

  • Resource represented via XML wrapper => OAI-PMH <metadata>

  • Uniform solution for simple & compound objects

  • Unambiguous expression of locator of datastream

  • Disambiguation between locators & identifiers

  • OAI-PMH datestamp changes whenever the resource (datastreams & secondary information) changes

  • OAI-PMH semantics apply: “about” containers, set membership


OAI-PMH based approach using Complex Object Format

Typical scenario:

  • An OAI-PMH harvester checks for support of a locally understood complex object format using the ListMetadataFormats verb

  • The harvester harvests the complex object metadata. Semantics of the OAI-PMH datestamp guarantee that new and modified resources are detected.

  • A parser at the end of the harvesting application analyzes each harvested complex object record:

    • The parser extracts the bitstreams that were delivered By-Value.

    • The parser extracts the unambiguous references to the network location of bitstreams delivered By-Reference.

  • A separate process, out-of-band from the OAI-PMH, collects the bitstreams delivered By-Reference from the extracted network locations.


Complex Object Formats & OAI-PMH : issues

  • Which Complex Object Format(s)

  • How to Profile Complex Object Format(s) for OAI-PMH Harvesting

  • Large records

  • Making resources re-harvestable

  • Because the resource is represented as <metadata>, can rights pertaining to the resource be expressed according to the “rights for metadata” OAI-rights guideline?

  • Tools:

    • Software library to write compliant complex objects

    • Integration of this library with repository systems (Fedora, DSpace, eprints.org, ….)


Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research


mod_oai approach

  • Goal: integrate OAI-PMH functionality into the web server itself…

  • mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server

    • written in C

    • respects values in .htaccess, httpd.conf

  • compile mod_oai on http://www.foo.edu/

  • baseURL is now http://www.foo.edu/modoai

    • Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)

      • http://www.foo.edu/modoai?

        verb=ListIdentifiers &

        metdataPrefix=oai_dc &

        from=2004-09-15 &

        set=mime:video:mpeg


resource

OAI-PMH sets

MIME type

item

HTTP header

metadata

Dublin Core

metadata

MPEG-21

DIDL

records

OAI-PMH data model in mod_oai

http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf

OAI-PMH identifier

= entry point to all records pertaining to the resource

metadata pertaining

to the resource


OAI-PMH concepts : typical repository


OAI-PMH concepts : mod_oai


Resource Discovery: ListIdentifiers

harvester

  • issues a ListIdentifiers,

  • finds URLs of updated resources

  • does HTTP GETs updates only

  • can get URLs of resources with specified MIME types


Preservation: ListRecords

harvester

  • issues a ListRecords,

  • Gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference)

  • can get resources with specified MIME types


performance of mod_oai and wget

on www.cs.odu.edu

for more detail: “mod_oai: An Apache Module for Metadata Harvesting “

http://arxiv.org/abs/cs.DL/0503069


Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research


Issues and Future Work

  • For a given server, there are a set of URLs, U, and a set of files F

    • Apache maps U F

    • mod_oai maps F U

  • Neither function is 1-1 nor onto

    • We can easily check if a single u maps to F, but given F we cannot (easily) generate U

  • Short-term issues:

    • dynamic files

      • exporting unprocessed server-side files would be a security hole

    • IndexIgnore

      • httpd will “hide” valid URLs

    • File permissions

      • httpd will advertise files it cannot read

  • Long-term issues

    • Alias, Location

      • files can be covered up by the httpd

    • UserDir

      • interactions between the httpd and the filesystem


IndexIgnore & File Permissions


Alias: Covering Up Files

httpd.conf:

Alias /A /usr/local/web/htdocs/B

Alias /B /usr/local/web/htdocs/A

the files “A” and “B” will be different from the URLs

http://server/A

http://server/B


UserDir: “Just in Time” mounting of directories

whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home

liu_x/ mln/

whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso

/home/tharriso/

whiskey.cs.odu.edu:/ftp/WWW/conf % ls /home

liu_x/ mln/ tharriso/

whiskey.cs.odu.edu:/ftp/WWW/conf %


Looking Further Down the Road for mod_oai

  • “Reverse” the method of URL discovery

    • cannot look to the files;

    • listen to incoming requests and build a list of valid URLs

      • could be seeded with files at start

      • also the method for handling server processed files / URLs

  • Plug-ins for descriptive metadata

    • DC tags in HTML

    • MS Office formats, PDF

    • Tags from JPEG, TIFF, MP3, etc.

  • Additional metadata in the DIDL

    • technical metadata from JHOVE

    • estimated change rate

      • cf. Cho & Garcia-Molina, ACM TOIT 28(4)

  • http log access as separate metadata formats

    • cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)


Expanding OAI-PMH / Complex Object Access

  • OAI-PMH / CO access for:

    • blogs

    • message boards

    • native file systems

      • e.g. Mac OS X “Spotlight”

  • More aggressive use of OAI-PMH / CO for preservation

    • recently funded NSF DIGARCH program

    • use for preservation:

      • Usenet

      • Email

      • Multicasting


OAI-PMH + Complex Objects:A New Model for Web Resource Harvesting

  • Better web harvesting can be achieved through:

    • OAI-PMH: structured access to updates

    • Complex object formats: modeled representation of digital objects

  • Use cases:

    • Preservation (ListRecords)

    • Web crawling (ListIdentifiers)

  • mod_oai: reference implementation

    • Better performance than wget

    • static files only; dynamic files in the future

    • not a replacement for DSpace, Fedora, eprints.org, etc.

  • More info:

    • http://www.modoai.org/

    • http://whiskey.cs.odu.edu/


  • Login