CS 502

CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel [email protected] Lecture 5 A research perspective on Digital Libraries. DL Ancestry. URLs to some of these DLs. ADS: http://adswww.harvard.edu/

CS 502 Computing Methods for Digital Libraries
Cornell University – Computer Science
Herbert Van de Sompel

Lecture 5 A research perspective on Digital Libraries

URLs to some of these DLs

ADS: http://adswww.harvard.edu/

NCSTRL: http://www.ncstrl.org

UCSTRI: http://www.cs.indiana.edu:800/cstr/cover.html

arXiv: http://arXiv.org

LTRS: http://techreports.larc.nasa.gov/ltrs/

NTRS: http://techreports.larc.nasa.gov/cgi-bin/NTRS

DL Architectural Review

Assumptions made in this perspective

  • things start with TCP/IP connectivity

  • distribute full content (reports, software, etc.)

    • not only metadata

DL Architecture History: approach 1

1. Build special client and server (generally using Motif/X11, Tcl/Tk, etc.), and use TCP/IP as the transport protocol only

  • pros: rich functionality

  • cons: high development cost, client distribution problem

  • observation: many of these projects spent more time building the interfaces, protocols, searching, etc. than populating their DL!

DL Architecture History: approach 2

2. use standard protocols built upon TCP/IP: SMTP, FTP, Gopher, WAIS, HTTP, etc.

  • con: less functionality (restricted by protocol)

  • pros: less development cost, uses commonly available clients

  • observation: this approach is now the most common

  • The ones listed on slide 2 fit into this category

Early TCP/IP DLs

  • a very old one: IETF: http://www.ietf.org/

  • Internet RFCs

  • Very first TCP/IP DL?

Early TCP/IP DLs

  • Netlib

    • http://www.netlib.org/

    • begun in 1985, distributing mathematical software via e-mail (SMTP)

    • other access methods and protocols added (ftp, X11 client, http)

Netlib 1995

Netlib 2001

Los Alamos arXiv

  • Physics pre-print server

    • http://xxx.lanl.gov/ == http://arXiv.org

    • begun in 1991 as an e-mail service to exchange TeX source of pre-prints in high energy physics

    • FTP and HTTP access added shortly afterwards

    • Now THE communication channel in Physics

    • Paul Ginsparg

Characteristics of early TCP/IP, non-HTTP DLs

  • Useful

    • could get the “thing” that you were looking for

  • Constrained by transport protocol

    • SMTP, FTP, etc. interface inherently “clunky”

    • Higher level services such as searching, sophisticated browsing, etc. difficult to implement

  • Small scale

    • would the same systems work well if the holdings went from 100’s or 1000’s to millions?

Characteristics of early TCP/IP, HTTP DLs

  • Initial HTTP implementations / conversions pretty much provided incremental steps in DL improvement

    • a “nice” ftp interface, maybe with better searching and browsing

    • but the nature of the DLs changed little

      • LTRS is an example of an HTTP DL that is really FTP + searching (WAIS) + browsing

      • http://techreports.larc.nasa.gov/ltrs/

      • Also check out user interface of http://arXiv.org

Early TCP/IP, HTTP DLs

  • But HTTP is a very general transport protocol, and it is possible to build even higher-level protocols on top of it

  • Combine this with the expressive HTTP client (web browser), and there is a lot of potential

  • Dienst

    • (http://www.ncstrl.org/Dienst/htdocs/Info/protocol4.html)

    • builds an actual DL protocol on top of HTTP

      • 1994 -- the first to do so?

  • Open Archives Initiative: metadata harvesting protocol on top of HTTP
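
Layering a DL protocol on top of HTTP can be pictured as encoding protocol-level requests in ordinary HTTP GET URLs. The following is a minimal sketch, not the Dienst or OAI implementation: the base URL is a made-up placeholder, and the verb and argument names (`ListRecords`, `metadataPrefix`, `oai_dc`) are taken from the published OAI-PMH 2.0 specification, which may differ from the protocol version current at the time of this lecture.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "http://repository.example.org/oai"


def build_request(verb, **arguments):
    """Encode a protocol-level request as an ordinary HTTP GET URL."""
    query = {"verb": verb, **arguments}
    return BASE_URL + "?" + urlencode(query)


# Ask the (hypothetical) repository for its records as Dublin Core metadata.
url = build_request("ListRecords", metadataPrefix="oai_dc")
print(url)
```

Any HTTP client, including a web browser, can issue such a request, which is the "commonly available clients" advantage of approach 2.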

Sophistication increases, tracks meet

[Figure: two tracks that meet, the library automation track and the research track (LTRS, e-print, Netlib, etc.; ftp / gopher)]


A Framework for Distributed Digital Object Services

Kahn/Wilensky Framework [Kahn 1995]

  • 1995

  • A high level document

  • Almost a definition of key concepts, terminologies, … for next generation DLs

  • Foundation for a research discipline?

  • Not detailed enough to be a real architecture.

  • Architecture is independent of the type of data stored in the DL

KWF: key terms

  • digital object (do)

    • A do is a data structure that contains

      • Digital data; data is typed (cf MIME)

      • Persistent Key Metadata; especially handle

      • Other metadata (for instance Terms and Conditions)

  • handle

    • a handle is a unique, persistent name for a do

  • repository

    • The place where do’s live

    • Has unique global name

  • Repository Access Protocol (RAP)

    • To deposit/access do’s in repositories
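
The key terms above map naturally onto a few small data structures. This is a sketch under stated assumptions: KWF is a conceptual framework and prescribes no concrete classes, so every class name, field name, handle string, and repository name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class KeyMetadata:
    """Persistent key metadata; the handle is the part KWF singles out."""
    handle: str  # unique, persistent name for the digital object


@dataclass
class DigitalObject:
    """A do: typed data, key metadata, and optional other metadata."""
    key_metadata: KeyMetadata
    data: bytes
    data_type: str  # type label, loosely analogous to a MIME type
    other_metadata: dict = field(default_factory=dict)  # e.g. terms and conditions


@dataclass
class Repository:
    """The place where do's live; identified by a globally unique name."""
    name: str
    holdings: dict = field(default_factory=dict)  # handle -> DigitalObject


report = DigitalObject(
    key_metadata=KeyMetadata(handle="hdl:example/report-1"),  # made-up handle
    data=b"report bytes",
    data_type="application/pdf",
    other_metadata={"terms_and_conditions": "read-only"},
)
repo = Repository(name="repo.example.org")  # made-up repository name
repo.holdings[report.key_metadata.handle] = report
```

`KeyMetadata` is declared frozen so that the handle cannot be reassigned after creation, mirroring the requirement that a handle is persistent.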

KWF: flow

[Flow diagram; several node labels did not survive extraction and are marked "…" below]

  • … makes a digital object, which consists of …

    • Key-Metadata

    • handle

  • The handle comes from a handle …

  • The digital object can go in a …, at which point the do becomes a stored do

    • Properties record per do

      • Key metadata: handle

      • Other metadata: Terms and conditions

    • Transaction record per do

  • … which registers the do’s handle with a handle server, at which point the do becomes a registered do

  • … Accesses/Deposits the do in repositories by means of the Repository Access Protocol

  • What the client receives as a result of an access to a do is a dissemination.

Digital objects

  • do = data + key-metadata

    • data is typed; core types include:

      • bit-sequence / set-of-bit-sequences

      • digital-object / set-of-digital-objects

      • handle / set-of-handles

    • other types can be defined, and registered with a global type registry

      • definition and registration left undefined

      • ~ similar to MIME

    • key-metadata includes handle

    • possibly other metadata (left undefined in KWF)

Digital objects

  • Composite do’s:

    • a do with data of type digital-object

    • non-composite do’s are elemental do’s

    • composite do’s can, for instance, be used to collect similar works together

      • a composite do then contains a do for each work of Shakespeare...
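
The composite/elemental distinction can be sketched as a recursive structure. This is an illustration only: the class, the helper functions, and the handles are hypothetical, and the type labels are borrowed from the KWF core type names listed earlier.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class DigitalObject:
    handle: str
    data_type: str  # one of the KWF core type names
    data: Union[bytes, "DigitalObject", List["DigitalObject"]]


def is_composite(do: DigitalObject) -> bool:
    """A composite do carries data of type digital-object (or a set of them)."""
    return do.data_type in ("digital-object", "set-of-digital-objects")


def elemental_handles(do: DigitalObject) -> List[str]:
    """Collect the handles of all elemental do's reachable from `do`."""
    if not is_composite(do):
        return [do.handle]
    members = do.data if isinstance(do.data, list) else [do.data]
    handles = []
    for member in members:
        handles.extend(elemental_handles(member))
    return handles


# Made-up handles: two works collected under one composite do.
hamlet = DigitalObject("hdl:example/hamlet", "bit-sequence", b"To be, or not to be")
macbeth = DigitalObject("hdl:example/macbeth", "bit-sequence", b"When shall we three")
works = DigitalObject("hdl:example/shakespeare", "set-of-digital-objects", [hamlet, macbeth])

print(is_composite(works), elemental_handles(works))
```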

Changing digital objects

  • Mutable do’s can be changed once placed in a repository

    • key-metadata cannot be changed

    • the do’s handle never changes!

  • Immutable do’s cannot be changed once placed in a repository

    • however, they can be deleted
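
These mutability rules are easy to state as repository behaviour. A minimal sketch, assuming a hypothetical in-memory repository; the method names and handles are illustrative and not part of KWF.

```python
class Repository:
    """Toy repository that enforces the mutability rules described above."""

    def __init__(self):
        self._objects = {}  # handle -> {"data": bytes, "mutable": bool}

    def deposit(self, handle, data, mutable):
        self._objects[handle] = {"data": data, "mutable": mutable}

    def replace_data(self, handle, new_data):
        """Change the data of a mutable do; the handle itself is never touched."""
        entry = self._objects[handle]
        if not entry["mutable"]:
            raise PermissionError(f"{handle} is immutable")
        entry["data"] = new_data

    def delete(self, handle):
        """Deletion is allowed for mutable and immutable do's alike."""
        del self._objects[handle]

    def data(self, handle):
        return self._objects[handle]["data"]


repo = Repository()
repo.deposit("hdl:example/draft", b"v1", mutable=True)
repo.deposit("hdl:example/final", b"v1", mutable=False)

repo.replace_data("hdl:example/draft", b"v2")      # allowed: the do is mutable
try:
    repo.replace_data("hdl:example/final", b"v2")  # rejected: the do is immutable
except PermissionError as err:
    print(err)
repo.delete("hdl:example/final")                   # allowed even though immutable
```

There is deliberately no method for changing a handle: updates replace the data stored under an existing handle and nothing else.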

Handles

  • Guest lecture by Professor Arms 02/19

Repositories

  • A network accessible storage system in which digital objects may be stored for possible subsequent access or retrieval

  • A stored do is a do that resides in a repository

  • A registered do is a do that the repository has registered with a handle server

    • storing and registering can be the same or different processes

Repositories

  • A repository keeps a properties record for each do

    • contains key-metadata and any other metadata the repository chooses to keep

  • A do may have a transaction record associated with it in a repository
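
The difference between a stored and a registered do, and the two kinds of per-object records, can be sketched as follows. All class names, method names, record layouts, and handles here are assumptions made for illustration; KWF leaves the contents of these records largely open.

```python
from datetime import datetime, timezone


class HandleServer:
    """Maps handles to the repository that holds the object."""

    def __init__(self):
        self.locations = {}  # handle -> repository name

    def register(self, handle, repository_name):
        self.locations[handle] = repository_name


class Repository:
    def __init__(self, name):
        self.name = name
        self.objects = {}       # handle -> data
        self.properties = {}    # handle -> properties record
        self.transactions = {}  # handle -> list of (timestamp, event)

    def _log(self, handle, event):
        stamp = datetime.now(timezone.utc).isoformat()
        self.transactions.setdefault(handle, []).append((stamp, event))

    def store(self, handle, data, **other_metadata):
        """After this call the do is a stored do."""
        self.objects[handle] = data
        self.properties[handle] = {"handle": handle, "registered": False, **other_metadata}
        self._log(handle, "stored")

    def register(self, handle, handle_server):
        """After this call the do is also a registered do."""
        if handle not in self.objects:
            raise KeyError(f"{handle} is not stored in {self.name}")
        handle_server.register(handle, self.name)
        self.properties[handle]["registered"] = True
        self._log(handle, "registered")

    def is_registered(self, handle):
        return self.properties.get(handle, {}).get("registered", False)


server = HandleServer()
repo = Repository("repo.example.org")
repo.store("hdl:example/tr-1", b"report bytes", terms="read-only")
stored_only = repo.is_registered("hdl:example/tr-1")  # stored, not yet registered
repo.register("hdl:example/tr-1", server)
```

Storing and registering are kept as separate calls here; a repository that treats them as one process could simply call `register` from inside `store`.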

Repository Access Protocol

  • “Protocol” may be misleading; it’s really just the concept for a protocol

  • RAP is designed to be simple; higher level services should come from other protocols

  • KWF defines 3 basic operation classes:

    • ACCESS_DO [metadata; key-metadata, digital object]

      • A dissemination of a do is the result of a request to access a do

    • DEPOSIT_DO [metadata; key-metadata, digital object]


      • this is a means to tell the world about other ways (protocols) to access do’s in the repository.
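
Since RAP is a concept rather than a wire protocol, the following is only a toy dispatcher, assuming a hypothetical in-memory repository. It models just the two operation classes named above, ACCESS_DO and DEPOSIT_DO, and it does not reproduce the argument lists shown on the slide.

```python
from dataclasses import dataclass


@dataclass
class Dissemination:
    """What a client receives as the result of an ACCESS_DO request."""
    handle: str
    data: bytes


class Repository:
    def __init__(self):
        self._store = {}  # handle -> data

    def handle_request(self, operation, handle, data=None):
        """Dispatch one RAP-style request (hypothetical signature)."""
        if operation == "DEPOSIT_DO":
            if data is None:
                raise ValueError("DEPOSIT_DO needs the object's data")
            self._store[handle] = data
            return None
        if operation == "ACCESS_DO":
            return Dissemination(handle, self._store[handle])
        raise ValueError(f"unsupported operation: {operation}")


repo = Repository()
repo.handle_request("DEPOSIT_DO", "hdl:example/tr-1", b"report bytes")
result = repo.handle_request("ACCESS_DO", "hdl:example/tr-1")
print(result)
```

Keeping the dispatcher this small reflects the design goal stated above: RAP stays simple, and searching, browsing, and similar services belong to other protocols.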

Terms and Conditions

  • TC are attached to:

    • each do

    • each dissemination

    • each repository

  • TC are a precondition for any operation on the above

  • Repositories responsible for enforcing TC

Terms and Conditions

[Figure 1 from 95 TR-1593; surviving labels: digital object, terms and conditions (three occurrences)]

Digital Objects: Terms and Conditions

  • Set by originator and/or repository

  • Can be arbitrarily complex, but generally consist of:

    • permissions: read, write, etc.

    • authentication - person, group, etc.

    • payment

    • 3rd party intervention (possibly in support of the above)
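
Terms and conditions acting as a precondition for an operation can be sketched as a single check that the repository runs before serving a request. This is a simplified illustration: the classes, field names, and group names are hypothetical, and third-party intervention is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class TermsAndConditions:
    permissions: set = field(default_factory=lambda: {"read"})
    allowed_groups: set = field(default_factory=set)  # empty set means anyone
    fee: float = 0.0                                  # payment required per operation


@dataclass
class Request:
    operation: str      # e.g. "read" or "write"
    user_groups: set    # groups the authenticated requester belongs to
    payment: float = 0.0


def permitted(tc: TermsAndConditions, request: Request) -> bool:
    """Return True only if every term is satisfied (the precondition)."""
    if request.operation not in tc.permissions:
        return False
    if tc.allowed_groups and not (tc.allowed_groups & request.user_groups):
        return False
    return request.payment >= tc.fee


tc = TermsAndConditions(permissions={"read"}, allowed_groups={"cornell"})
ok = permitted(tc, Request("read", {"cornell"}))
denied_write = permitted(tc, Request("write", {"cornell"}))
denied_group = permitted(tc, Request("read", {"guest"}))
print(ok, denied_write, denied_group)
```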

Readings

  • Kahn, R. & Wilensky, R. 1995. A Framework for Distributed Digital Object Services. http://WWW.CNRI.Reston.VA.US/home/cstr/arch/k-w.html

  • Arms, W.Y. 1995. Key Concepts in the Architecture of the Digital Library. In: D-Lib Magazine. http://www.dlib.org/dlib/July95/07arms.html

  • VanHeyningen, M. 1994. The Unified Computer Science Technical Report Index: Lessons in indexing diverse resources. http://www.cs.indiana.edu/ucstri/paper/paper.html