
University Library Experience CDL Case Study

University Library Experience CDL Case Study. 30 June 2005. John Kunze, California Digital Library. A university library with no books, students, or faculty. Central services for 10 campus libraries.



Presentation Transcript


  1. University Library Experience CDL Case Study 30 June 2005 John Kunze, California Digital Library

  2. California Digital Library • A university library with no books, students, or faculty • Central services for 10 campus libraries • Content hosting: electronic texts, web-based material, datasets, finding aids • Linked: California museums & archives • Plus a Digital Preservation Program

  3. What’s digital preservation? • Safeguarding electronic information • Viability (intact bit streams) • Renderability (by machines) • Understandability (by humans) • There’s no preservation if we don’t know what it’s called • CDL core need for persistent identifiers

  4. What’s a persistent identifier? • An identifier that is valid for long enough • “valid” and “enough” are service/user dependent • What’s an identifier? It’s an association between a string and a thing. It follows that: • An id is not a string of data (good) • An id is a matter of opinion, not fact; there will be at least one other provider, in series if not in parallel, or your objects die with you (inconvenient) • Same thing, two strings; or same string, two things • Often: same string, different metadata • Often: same string, parallel things diverging over time due to different preservation practices (e.g., migrations)

  5. Accepting some disorder • Long term preservation won’t happen unless objects can change residence and diverge • Campus snapshot to CDL; subsequent snapshots • Publisher to dim CDL archive; later CDL to SS? • Better if object lives in several places at once • Eventually, Producer loses control of copies • Multiple opinions and practices will flourish • Static, id-based persistence claims soon irrelevant • “urn:…”, “hdl:…”, etc. reflect hopes of people long gone • Not pretty, but the alternative (loss) is worse

  6. Agreeing to disagree • What we say, but shouldn’t (not loudly): • Don’t re-assign a persistent id to something else • Or don’t replace a persistent object with another • What we do: • Knowingly replace our persistent objects (typos, drafts, format conversions, home page redesign) • Honestly provide a real kind of persistence, but with very different replacement policies • Won’t have one way within CDL, let alone without

  7. Diverse persistence practice • How dissimilar must two objects be before they get different ids? • CDL’s home-grown Digital Preservation Repository (open source) is self-service: • Lets the Submitter decide • Makes preservation a joint responsibility • Requirement: need to be able to tell users what flavor of permanence is in effect

  8. CDL Persistent Ids Must… • Identify, whether or not the object is at hand • It may not be convenient, helpful, or permitted for you to inspect the object itself -- metadata needed • Convey different flavors of permanence • Lead to access (if authorized) • Not strictly an “identification” problem, but it is the “404 not found” that we need to fix • Be valid for some longish period • Be carried on, in, or with the object

  9. How to choose an id scheme • All CDL requirements are purely about service • Candidate schemes: URL, PURL, URN, ARK, Handle, DOI, MD5, GUID, ISxx, … • CDL gets no direct service help from any scheme; no scheme or syntax confers persistence of any kind • We then ask: which schemes are lowest cost and lowest risk?

  10. Myths to fight against • Harmful Fallacy 1. A URL is a location, and is therefore inherently unstable. (ridiculous) • Harmful Fallacy 2. Explicit server/resolver names make URLs inherently unstable. • So “loc.gov” is less stable than “handle.net” and the implicit global resolvers that it depends on? • Harmful Fallacy 3. HTTP-based resolvers will not scale for persistent access. (google) • Harmful Fallacy 4. URLs are the problem. • “Cool URLs don’t break” -- Tim Berners-Lee

  11. Impersistence - big factors • Bankruptcy - no successor found • Loss of funding - no successor found • Loss of political support • War, social upheaval, natural disaster • Scheme impact: zero

  12. Impersistence - lesser factors • Deliberately or accidentally, objects are: • Removed • Replaced • Moved without setting up a redirect • Everyone has an indirection mechanism, though most don’t use it • Scheme impact: zero

  13. Impersistence - small factors • Your org likes persistent ids in principle, but: • It does not know that vanilla web servers trivially support 500,000 redirect directives • It lacks the expertise or staff to maintain a web server, a two-column database table, and a nightly server config file report writer • Scheme impact: zero
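
The "two-column database table" plus "nightly server config file report writer" on this slide can be sketched in a few lines. This is only an illustration, not CDL's tooling: the table contents, the `example.org` URLs, and the function names are hypothetical, and the output assumes an Apache-style `Redirect` directive.

```python
import csv
import io

# Hypothetical two-column table (old path, current URL), inlined as CSV.
# In practice this would be read from a database or a file.
TABLE = """\
/ark:/12345/x1,https://new.example.org/objects/x1
/ark:/12345/x2,https://new.example.org/objects/x2
"""


def load_table(text):
    """Parse two-column CSV text into a list of (old_path, new_url) pairs."""
    return [tuple(row) for row in csv.reader(io.StringIO(text)) if row]


def render_redirects(rows):
    """Render one Apache-style 'Redirect permanent' line per table row."""
    lines = []
    for old_path, new_url in rows:
        old_path, new_url = old_path.strip(), new_url.strip()
        if not old_path.startswith("/"):
            raise ValueError(f"old path must start with '/': {old_path!r}")
        lines.append(f"Redirect permanent {old_path} {new_url}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A nightly job would write this text to a config file that the
    # web server includes, then reload the server.
    print(render_redirects(load_table(TABLE)), end="")
```

When an object moves, only its row in the table changes; the published identifier stays the same.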

  14. Scheme costs and risks • Every modern service must support indefinitely, and find or be given replacements for, at least: • Web server, web browser, and DNS • In addition, URN, Handle, and DOI resolution need a global proxy or a plugin for every access • ARK could use a plugin, but doesn’t need one • Handle and DOI also require: • You to maintain an extra local server • The community to maintain a set of global servers • For the CDL: • Handle and DOI come with highest risk • ARK comes with lowest risk

  15. Persistence - indirect factors • CDL’s persistence requirements call for an id scheme (not service) connecting users to: • metadata • whether and what kind of persistence • sub-object and variant inferences • core ids on proxy failure (gracefully) • Scheme impact: ARK provides these • A scheme is not a service (DOI is not CrossRef) • When choosing a scheme, we wanted to remain independent of extra external service providers
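
One way to read "core ids on proxy failure" is that the hostname in front of an ARK is replaceable, so a client can recover the core id from a URL and retry it elsewhere. The sketch below assumes the general `ark:/<authority number>/<name>` shape with a numeric authority number. The regular expression is a simplification rather than a full ARK parser, and the hostnames and function names are made up for illustration.

```python
import re

# Simplified pattern (an assumption, not the full ARK grammar):
# "ark:/" + digits + "/" + everything up to a query, fragment, or space.
ARK_PATTERN = re.compile(r"ark:/([0-9]+)/([^?#\s]+)")


def core_ark(identifier):
    """Return the host-independent 'ark:/...' part of a URL, or None."""
    match = ARK_PATTERN.search(identifier)
    if match is None:
        return None
    authority, name = match.groups()
    return f"ark:/{authority}/{name}"


def rehost(identifier, resolver_base):
    """Rebuild an actionable URL for the same ARK at a different resolver."""
    ark = core_ark(identifier)
    if ark is None:
        raise ValueError(f"no ARK found in {identifier!r}")
    return f"{resolver_base.rstrip('/')}/{ark}"


if __name__ == "__main__":
    old = "https://resolver-a.example.org/ark:/12345/x5wf6789"
    print(core_ark(old))
    print(rehost(old, "https://resolver-b.example.org/"))
```

Because the core id survives the loss of any one host, a failed proxy degrades access without destroying the name.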

  16. Our Stuff vs Their Stuff • Persistence can be split into • the Our Stuff Problem • the Their Stuff Problem • It makes no sense for CDL to assign persistent ids to Their Stuff • Their Stuff can be hugely important to our users, but we don’t control it and cannot vouch for it • Where we can afford it, we track them with PURLs • CDL does assign persistent ids to Our Stuff

  17. Distribution of Id Assignment • Objects ingested in flows from other libraries per submission agreements • Each object has an ARK after ingest • Either it has it already • Or we give it one upon entry • Campuses can mint their own ARKs or rely on our minting service • Their own campus ARK namespace is theirs to divide up as they wish
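
The ingest rule on this slide (keep an ARK the object already carries, otherwise give it one on entry) can be sketched as follows. This is not CDL's minting service: the authority number `12345`, the vowel-free alphabet, the name length, and all function names are assumptions chosen for illustration.

```python
import secrets

# Assumed alphabet: digits plus consonants, so minted names are unlikely
# to spell words. The choice is illustrative only.
ALPHABET = "0123456789bcdfghjkmnpqrstvwxz"


def mint_ark(authority, taken, length=8):
    """Mint a random opaque ARK not already present in the 'taken' set."""
    while True:
        name = "".join(secrets.choice(ALPHABET) for _ in range(length))
        ark = f"ark:/{authority}/{name}"
        if ark not in taken:
            taken.add(ark)
            return ark


def ingest(obj, authority, taken):
    """Return a copy of obj that is guaranteed to carry an ARK."""
    ark = obj.get("ark")
    if ark is None:
        ark = mint_ark(authority, taken)  # we give it one upon entry
    else:
        taken.add(ark)  # it arrived with one: keep it unchanged
    return {**obj, "ark": ark}


if __name__ == "__main__":
    taken = set()
    campus_obj = {"title": "Campus snapshot", "ark": "ark:/12345/campus01"}
    new_obj = {"title": "Unnamed submission"}
    print(ingest(campus_obj, "12345", taken)["ark"])
    print(ingest(new_obj, "12345", taken)["ark"])
```

A campus that mints its own ARKs would simply submit objects with the `ark` field already set.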

  18. Opaque ids with semantic extensions • CDL dilemma: • opaque ids are needed for names that age and travel well • Semantically laden ids are helpful in providing many id services • Hybrid: • opaque ids are used to name abstract preservation objects • Semantic and sometimes transient extensions address components inside of objects (the set of components evolves over time anyway)
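
The hybrid on this slide can be illustrated by splitting an identifier into an opaque core and a semantic extension. The sketch assumes, following ARK syntax conventions, that `/` introduces components and `.` introduces variants after the opaque name. The example identifier and the function name are hypothetical.

```python
import re

# Assumed shape: "ark:/<authority>/<opaque name><extension>", where the
# extension (if any) begins at the first "/" or "." after the name.
ARK_PARTS = re.compile(r"^(ark:/[^/]+/[^/.]+)(.*)$")


def split_ark(ark):
    """Split an ARK into (opaque core, semantic extension)."""
    match = ARK_PARTS.match(ark)
    if match is None:
        raise ValueError(f"not an ARK: {ark!r}")
    core, extension = match.groups()
    return core, extension


if __name__ == "__main__":
    # The core names the abstract preservation object; the extension
    # addresses a component or variant inside it and may change over time.
    print(split_ark("ark:/12345/x5wf6789/c3/s4.pdf"))
    print(split_ark("ark:/12345/x5wf6789"))
```

Only the core needs to age and travel well; extensions can be regenerated as the set of components evolves.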
