A New Model for Web Resource Harvesting
This presentation is the property of its rightful owner.
Sponsored Links
1 / 45

A New Model for Web Resource Harvesting PowerPoint PPT Presentation


  • 90 Views
  • Uploaded on
  • Presentation posted in: General

A New Model for Web Resource Harvesting. Michael L. Nelson Old Dominion University joint work with:. Her. Herbert Van de Sompel Xiaoming Liu. Carl Lagoze Simeon Warner. Terry Harrison. This work supported in part by the Andrew Mellon Foundation & Library of Congress.

Download Presentation

A New Model for Web Resource Harvesting

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


A new model for web resource harvesting

A New Model for Web Resource Harvesting

Michael L. Nelson

Old Dominion University

joint work with:

Her

Herbert Van de Sompel

Xiaoming Liu

Carl Lagoze

Simeon Warner

Terry Harrison

This work supported in part by the Andrew Mellon Foundation & Library of Congress


My research interests

My Research Interests

  • Digital Objects and Repositories

    • Interaction between them: roles, responsibilities, architecture, scalability

    • Bumper sticker: Free the Object from the Tyranny of the Repository

  • Digital Preservation

    • Shared infrastructure models, automation, large-scale best effort strategies

    • Bumper sticker: We Need Fewer Heroes

  • User / System Co-Evolution

    • Discerning intent from access large-scale patterns

    • Bumper sticker: Freedom From Choice


Outline

Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research


Www and dl separated at birth

WWW and DL: Separated at Birth

The Good: XML, BitTorrent, Web Services

The Bad: RSS

The Ugly: Semantic Web

WWW

WWW

DL

DL

The Good: OAIS, DOI, OAI-PMH

The Bad: Dublin Core

The Ugly: SRU/W

Today

1994

The problem is not that the WWW doesn’t work; it clearly does.

The problem is that our expectations have been lowered.

more on DL/WWW, from the NSF Post-DL Workshop:

http://www.sis.pitt.edu/~dlwkshop/paper_sompel.html

http://www.sis.pitt.edu/~dlwkshop/paper_lagoze.html


A new model for web resource harvesting

what is this file?

what are its relationships to other files?

how often does it change?

Web Robots

what documents have been

modified since 2003-11-15 ?

www.getty.edu

doc1; last mod

2003-03-12

doc2; last mod

2002-07-19

doc100; last mod

2003-09-11

robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG


A new model for web resource harvesting

<co>

<metadata/>

<link/>

<link/>

<change/>

</co>

A More Efficient Way

what documents have been

modified since 2003-11-15 ?

www.getty.edu

with mod_oai

doc1; last mod

2003-03-12

doc2; last mod

2002-07-19

doc100; last mod

2003-09-11


Outline1

Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research


A very brief oai pmh history

A Very Brief OAI-PMH History

  • Universal Preprint Service

    • A cross-archive DL that that provides services on a collection of metadata harvested from multiple archives

      • not distributed searching

    • Demonstrated at Santa Fe NM, October 21-22, 1999

      • D-Lib Magazine, 6(2) 2000 (2 articles)

        • http://www.dlib.org/dlib/february00/02contents.html

    • UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/

  • The OAI has authored, among other things, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

    • in use around the world, 600+ known instances

      • registration not required; many unknown instances

        • http://gita.grainger.uiuc.edu/registry/

        • http://celestial.eprints.org/cgi-bin/status

    • used by Google ca. late 2004

      • http://www.nla.gov.au/digicoll/oai/


A new model for web resource harvesting

Data Providers /

Repositories

Service Providers /

Harvesters

“A repository is a network accessible server that can process the 6 OAI-PMH requests …

A repository is managed by a data provider to expose metadata to harvesters.” 

“A harvester is a client application that issues OAI-PMH requests.  A harvester is operated by a service provider as a means of collecting metadata from repositories.”


Aggregators

Aggregators

  • aggregators allow for:

    • scalability for OAI-PMH

    • load balancing

    • community building

    • discovery

data providers

(repositories)

service providers

(harvesters)

aggregator


Oai pmh data model

resource

OAI-PMH sets

OAI-PMH identifier

item

OAI-PMHidentifier

metadataPrefix

datestamp

Dublin Core

metadata

MARCXML

metadata

records

OAI-PMH data model

entry point to all records pertaining to the resource

metadata pertaining

to the resource


Overview of oai pmh verbs

Overview of OAI-PMH Verbs

metadata

about the

repository

harvesting

verbs

most verbs take arguments: dates, sets, ids, metadata formats

and resumption token (for flow control)


Outline2

Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research


Resource harvesting use cases

Resource Harvesting: Use cases

  • Discovery: use content itself in the creation of services

    • search engines that make full-text searchable

    • citation indexing systems that extract references from the full-text content

    • browsing interfaces that include thumbnail versions of high-quality images from cultural heritage collections

  • Preservation:

    • periodically transfer digital content from a data repository to one or more trusted digital repositories

    • trusted digital repositories need a mechanism to automatically synchronize with the originating data repository

Ideas first presented in Van de Sompel, Nelson, Lagoze & Warner, http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html


Existing oai pmh based approaches

Existing OAI-PMH based approaches

Typical scenario:

  • An OAI-PMH harvester harvests Dublin Core records from the OAI-PMH repository.

  • The harvester analyzes each Dublin Core record, extracting dc.identifier information in order to determine the network location of the described resource.

  • A separate process, out-of-band from the OAI-PMH, collects the described resource from its network location.


Existing oai pmh based approaches issue 1

Existing OAI-PMH based approaches : Issue 1

  • Locating the resource based on information provided in dc.identifier

    • dc.identifier used to convey a variety of identifier: (simultaneously) URL DOI, bibliographic citation, … Not expressive enough to distinguish between identifier, locator.

      • Several dereferencing attempts required

    • URI provided in dc.identifier is commonly that of a bibliographic “splash page”

      • How to know it is a bibliographic “splash page”, not the resource?

      • If it is a bibliographic “splash page”, where is the resource?


Existing oai pmh based approaches issue 2

Existing OAI-PMH based approaches : Issue 2

  • Using the OAI-PMH datestamp of the Dublin Core record to trigger incremental harvesting:

    • Datestamp of DC record does not necessarily change when resource changes


Existing oai pmh based approaches conventions

Existing OAI-PMH based approaches : Conventions

  • Cannot really address issue 2 (datestamps) with metadata conventions

  • Issue 1 (identifier & locator of the resource) is currently addressed with a range of conventions

    • First dc.identifier is locator of the resource

      • what if the resource is not digital?

    • Use of dc.format and/or dc.relation to convey locator


Existing oai pmh based approaches conventions1

splash page

locator of resource

Existing OAI-PMH based approaches : Conventions


Existing oai pmh based approaches conventions2

splash page

locator of resource

Existing OAI-PMH based approaches : Conventions


Existing oai pmh based approaches conventions3

splash page

splash page

locator of resource

Existing OAI-PMH based approaches : Conventions


Existing oai pmh based approaches other attempts

Existing OAI-PMH based approaches : Other attempts

  • dc.identifier leads to splash page & splash page contains special purpose XHTML link to resource(s)

    • What if there is no splash page?

    • How does a harvester recognize this situation?

  • OA-X: protocol extension

    • OK in local context

    • Strategic problem to generalize

    • How to consolidate with OAI-PMH data model

  • Qualified Dublin Core

    • Could bring expressiveness to distinguish between locator & identifier

    • But what about the datestamp issue?


Proposed oai pmh based approach

Proposed OAI-PMH based approach

  • Use metadata formats that were specifically created for representation of digital objects:

    • Complex Object Formats as OAI-PMH metadata formats

    • MPEG-21 DIDL, METS, ..


Oai pmh data model1

resource

item

Dublin Core

metadata

MARCXML

metadata

MPEG-21

DIDL

METS

records

OAI-PMH data model

OAI-PMH identifier

= entry point to all records pertaining to the resource

metadata pertaining

to the resource

simple

more expressive

highly

expressive

highly

expressive


Complex object formats characteristics

Complex Object Formats : characteristics

  • Representation of a digital object by means of a wrapper XML document.

  • Represented resource can be:

    • simple digital object (consisting of a single datastream)

    • compound digital object (consisting of multiple datastreams)

  • Unambiguous approach to convey identifiers of the digital object and its constituent datastreams.

  • Include datastream:

    • By-Value: embedding of base64-encoded datastream

    • By-Reference: embedding network location of the datastream

    • not mutually exclusive; equivalent

  • Include a variety of secondary information

    • By-Value

    • By-Reference

    • Descriptive metadata, rights information, technical metadata, …


Complex object formats oai pmh

Complex Object Formats & OAI-PMH

  • Resource represented via XML wrapper => OAI-PMH <metadata>

  • Uniform solution for simple & compound objects

  • Unambiguous expression of locator of datastream

  • Disambiguation between locators & identifiers

  • OAI-PMH datestamp changes whenever the resource (datastreams & secondary information) changes

  • OAI-PMH semantics apply: “about” containers, set membership


Oai pmh based approach using complex object format

OAI-PMH based approach using Complex Object Format

Typical scenario:

  • An OAI-PMH harvester checks for support of a locally understood complex object format using the ListMetadataFormats verb

  • The harvester harvests the complex object metadata. Semantics of the OAI-PMH datestamp guarantee that new and modified resources are detected.

  • A parser at the end of the harvesting application analyzes each harvested complex object record:

    • The parser extracts the bitstreams that were delivered By-Value.

    • The parser extracts the unambiguous references to the network location of bitstreams delivered By-Reference.

  • A separate process, out-of-band from the OAI-PMH, collects the bitstreams delivered By-Reference from the extracted network locations.


Complex object formats oai pmh issues

Complex Object Formats & OAI-PMH : issues

  • Which Complex Object Format(s)

  • How to Profile Complex Object Format(s) for OAI-PMH Harvesting

  • Large records

  • Making resources re-harvestable

  • Because the resource is represented as <metadata>, can rights pertaining to the resource be expressed according to the “rights for metadata” OAI-rights guideline?

  • Tools:

    • Software library to write compliant complex objects

    • Integration of this library with repository systems (Fedora, DSpace, eprints.org, ….)


Outline3

Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research


Mod oai approach

mod_oai approach

  • Goal: integrate OAI-PMH functionality into the web server itself…

  • mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server

    • written in C

    • respects values in .htaccess, httpd.conf

  • compile mod_oai on http://www.foo.edu/

  • baseURL is now http://www.foo.edu/modoai

    • Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)

      • http://www.foo.edu/modoai?

        verb=ListIdentifiers &

        metdataPrefix=oai_dc &

        from=2004-09-15 &

        set=mime:video:mpeg


Oai pmh data model in mod oai

resource

OAI-PMH sets

MIME type

item

HTTP header

metadata

Dublin Core

metadata

MPEG-21

DIDL

records

OAI-PMH data model in mod_oai

http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf

OAI-PMH identifier

= entry point to all records pertaining to the resource

metadata pertaining

to the resource


Oai pmh concepts typical repository

OAI-PMH concepts : typical repository


Oai pmh concepts mod oai

OAI-PMH concepts : mod_oai


Resource discovery listidentifiers

Resource Discovery: ListIdentifiers

harvester

  • issues a ListIdentifiers,

  • finds URLs of updated resources

  • does HTTP GETs updates only

  • can get URLs of resources with specified MIME types


Preservation listrecords

Preservation: ListRecords

harvester

  • issues a ListRecords,

  • Gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference)

  • can get resources with specified MIME types


A new model for web resource harvesting

performance of mod_oai and wget

on www.cs.odu.edu

for more detail: “mod_oai: An Apache Module for Metadata Harvesting “

http://arxiv.org/abs/cs.DL/0503069


Outline4

Outline

(0) The Problem

(1) OAI-PMH Mechanics

(2) OAI-PMH for Resource Harvesting

(3) mod_oai

(4) Future Research


Issues and future work

Issues and Future Work

  • For a given server, there are a set of URLs, U, and a set of files F

    • Apache maps U F

    • mod_oai maps F U

  • Neither function is 1-1 nor onto

    • We can easily check if a single u maps to F, but given F we cannot (easily) generate U

  • Short-term issues:

    • dynamic files

      • exporting unprocessed server-side files would be a security hole

    • IndexIgnore

      • httpd will “hide” valid URLs

    • File permissions

      • httpd will advertise files it cannot read

  • Long-term issues

    • Alias, Location

      • files can be covered up by the httpd

    • UserDir

      • interactions between the httpd and the filesystem


Indexignore file permissions

IndexIgnore & File Permissions


Alias covering up files

Alias: Covering Up Files

httpd.conf:

Alias /A /usr/local/web/htdocs/B

Alias /B /usr/local/web/htdocs/A

the files “A” and “B” will be different from the URLs

http://server/A

http://server/B


Userdir just in time mounting of directories

UserDir: “Just in Time” mounting of directories

whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home

liu_x/ mln/

whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso

/home/tharriso/

whiskey.cs.odu.edu:/ftp/WWW/conf % ls /home

liu_x/ mln/ tharriso/

whiskey.cs.odu.edu:/ftp/WWW/conf %


Looking further down the road for mod oai

Looking Further Down the Road for mod_oai

  • “Reverse” the method of URL discovery

    • cannot look to the files;

    • listen to incoming requests and build a list of valid URLs

      • could be seeded with files at start

      • also the method for handling server processed files / URLs

  • Plug-ins for descriptive metadata

    • DC tags in HTML

    • MS Office formats, PDF

    • Tags from JPEG, TIFF, MP3, etc.

  • Additional metadata in the DIDL

    • technical metadata from JHOVE

    • estimated change rate

      • cf. Cho & Garcia-Molina, ACM TOIT 28(4)

  • http log access as separate metadata formats

    • cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)


Expanding oai pmh complex object access

Expanding OAI-PMH / Complex Object Access

  • OAI-PMH / CO access for:

    • blogs

    • message boards

    • native file systems

      • e.g. Mac OS X “Spotlight”

  • More aggressive use of OAI-PMH / CO for preservation

    • recently funded NSF DIGARCH program

    • use for preservation:

      • Usenet

      • Email

      • Multicasting


Oai pmh complex objects a new model for web resource harvesting

OAI-PMH + Complex Objects:A New Model for Web Resource Harvesting

  • Better web harvesting can be achieved through:

    • OAI-PMH: structured access to updates

    • Complex object formats: modeled representation of digital objects

  • Use cases:

    • Preservation (ListRecords)

    • Web crawling (ListIdentifiers)

  • mod_oai: reference implementation

    • Better performance than wget

    • static files only; dynamic files in the future

    • not a replacement for DSpace, Fedora, eprints.org, etc.

  • More info:

    • http://www.modoai.org/

    • http://whiskey.cs.odu.edu/


  • Login