Identifiers and types
1 / 37

Identifiers and Types - PowerPoint PPT Presentation

  • Uploaded on

Identifiers and Types. CS431 – Architecture of Web Information Systems Carl Lagoze – Cornell University – Feb. 7 2005. Identity Change Persistence. Paradox: reality contains things that persist and change over time Heraclitus and Plato: can you step into the same river twice?

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Identifiers and Types' - kalea

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Identifiers and types

Identifiers and Types

CS431 – Architecture of Web Information Systems

Carl Lagoze – Cornell University – Feb. 7 2005

Identity change persistence
Identity Change Persistence

  • Paradox: reality contains things that persist and change over time

    • Heraclitus and Plato: can you step into the same river twice?

    • Ship of Theseus: over the years, the Athenians replaced each plank in the original ship of Theseus as it decayed, thereby keeping it in good repair. Eventually, there was not a single plank left of the original ship. So, did the Athenians still have one and the same ship that used to belong to Theseus


  • Provide a key or handle linking abstract concepts to physical or perceptible entities

  • Provide us with a necessary figment of persistence

  • They are perhaps the one essential and common form of metadata

  • Why bother?

    • Finding things

    • Referring to things (Citations)

    • Asserting ownership over things

I have lots of identifiers
I have lots of identifiers

  • Carl Jay Lagoze, Dad, Hey you

  • 123-456-7890 (SSN)

  • 1234-5678-1234-1234 (Visa Card)

  • FZBMLH (US Airways locator on January 18 flight to San Diego)

Identifier issues
Identifier Issues

  • Object granularity

  • Identifier Context

    • Object atomicity

    • Part/whole relationships

  • Location independence

  • Global uniqueness

  • Persistent across time

  • Human vs. machine generation

  • Machine resolution

  • Administration (centralized vs. decentralized)

  • Intrinsic semantics

  • Type specificity

Two common pre digital identifiers
Two common pre-digital identifiers

  • ISBN (International Standard Book Number)

    • Uniquely identifies every monograph (book)

    • One ISBN for each format

      • HP & SS hardback 0590353403

      • HP & SS softcover 059035342X

    • Number is semantically meaningful (components)

    • International administration (>150 countries)

  • ISSN (International Standard Serial Number)

    • Uniquely identifies every serial (not issue or volume)

    • Semantically meaningless

    • International administration

Uri universal resource identifier
URI: Universal Resource Identifier

  • Generic syntax for identifiers of resources

  • Defined by RFC 2396

  • Syntax: <scheme>://<authority><path>?<query>

    • Scheme

      • Defines semantics of remainder of URI

      • ftp, gopher, http, mailto, news, telnet

    • Authority

      • Authority governing namespace for remainder of URI

      • Typically Internet-based server

    • Path

      • Identification of data within scope of authority

    • Query

      • String of information to be interpreted by authority

Why is rfc 2396 so big
Why is RFC 2396 so big?

  • Character encodings

  • Partial and relative URIs

Url universal resource locator
URL: Universal Resource Locator

  • String representation of the location for a resource that is available via the Internet

  • Use URI syntax

  • Scheme has function of defining the access (protocol) method. Used by client to determine the protocol to “speak”.

    • - open socket to on port 80 and issue a GET for index.html

    • - open socket to on port 21, open ftp session, issue ftp get for index.html….

Url issues
URL Issues

  • Persistence

  • Location dependence

  • Valid only at the item level

    • What about works, expressions, manifestations

  • Multiple resolution

    • “get the one that is cheapest, most reliable, most recent, most appropriate for my hardware, etc.”

  • Non-digital resources?

  • Sub-parts

Urc uniform resource characteristic catalog
URC – Uniform Resource Characteristic (Catalog)

  • Failed but interesting effort

    • Multiple resolution

    • Describe resource by its characteristics

  • Provide adequate bundled information about a resource (metadata) to create identification block for any given resource (including locations)

    • Exactly what are the common set of characteristics for describing different types of resources?

    • Where are these characteristics stored?

Robust hyperlinks
Robust Hyperlinks

  • Characteristic of document (metadata) is computed automatically via fingerprint of its content.

  • “Lexical” signatures: The top n words of a document chosen for rarity, subject to heuristic filters to aid robustness.

    • “a TF-IDF-like” measure

    • Five or so words are sufficient

  • Can be used to locate document (via search engine) after it is moved

Robust hyperlinks why does this work
Robust Hyperlinks – Why does this work?

  • Number of terms on Web is reportedly close to 10,000,000.

  • If terms were distributed independently, the probability of 5 even moderately common terms occurring in more than one document is very small.

    • In fact, picking 3 terms restricted to those occurring in 100,000 documents works pretty well.

    • Many documents contain very infrequently used words.

    • There is lots of room for independence to be off, and to play with term selection for robustness, etc..


Urn universal resource name
URN – Universal Resource Name

  • “globally unique, persistent names”

  • Independence from location and location methods

    <URN> ::= "urn:" <NID> ":" <NSS>

    • NID: namespace identifier

    • NSS: namespace-specific string

    • examples:

      • urn:ISSN:1234-5678

      • urn:isbn:9044107642

      • urn:doi:10.1000/140

Why isn t dns sufficient parenthetical comment
Why isn’t DNS sufficient (parenthetical comment)

  • Issue of semantic vs. non-semantic names

  • Changing ownership

  • Hierarchical legacy of DNS is sometimes inappropriate

Handles names for internet resources
Handles: Names for Internet Resources

  • Naming system for location-independent, persistent names

  • One name, multiple resolutions


The resource named by a Handle can be:

• A library item

• A collection of library items

• A catalog record

• A computer

• An e-mail address

• A public key for encryption

• etc., etc., etc. ....

Syntax of handles
Syntax of Handles





10.1234/1995.;9 (date-time stamp)

cornell.cs/cstr-94.45 (mnemonic name)

loc/a43v-8940cgr (random string)

Example of a handle and its data used to identify two locations
Example of a Handle and its DataUsed to Identify Two Locations

Data type

Handle data






Use of handles in a digital library
Use of Handles in a Digital Library




Search System

Handle System

Replication for performance and reliability
Replication for Performance and Reliability

Example: the Global Handle System

Los Angeles, CA

Washington, DC

Proxies to resolve handles
Proxies to Resolve Handles

A Web browser can resolve Handles via a proxy. For example, the following URL can be used to resolve the Handle loc.ndlp.amrlp/3a16616:

Proxy resolution
Proxy Resolution

URL to Proxy






Handle System





Doi digital object identifier
DOI – Digital Object Identifier

  • Technology and social infrastructure for naming

  • Established by publishers for persistent naming of entities (articles, journals, conference proceedings)

  • Cognizant of FRBR elements

  • Underlying technology is handle system

    • “persistent” names

      • Persistence is fortified by social underpinnings

      • Rules for establishing registration agencies

    • Multiple resolution

    • Registration/mechanism has metadata associated with it

  • doi:10.1000/186

Oclc s persistent url purl
OCLC's Persistent URL (PURL)

• A PURL is a URL

-> Is fully compatible with today's Internet browsers

-> Users need no special software

• Has some of the desirable features of URNs

• Lacks some desirable features of URNs

-> Resolves only to a URL

-> Does not support multiple resolution

• Developed by OCLC

• Software openly available

Purl syntax


resolver address


PURL Syntax

• A PURL is a URL.

• PURL resolvers use standard http redirects to return the actual URL.

Purl namespaces
PURL Namespaces

A PURL provides a local (not-global namespace)

is different from

Oclc purl resolution
OCLC PURL Resolution













Making links context sensitive
Making links context sensitive

  • Why?

    • “Appropriate item” differs for each user

    • Licensing locality

    • Some users may want a choice (abstract, full text, etc.)

  • Conceptualize link as service rather than object targeted.

  • OpenURL

    • Transports metadata about the work to…

    • A localized service that interprets the metadata and provides contextualized choices to the user.

Identifiers and types














link source




OpenURL linking

transportation of

metadata & identifiers





resolution of

metadata & identifiers into services

provision of OpenURL

Identifiers and types

OpenURL 0.1 syntax


  • id=doi:10.111/12345&

  • genre=article&

  • aulast=Weibel&aufirst=Stu&ISSN=35345353

  • &year=2001&volume=14&issue=3&spage=44&

  • pid=2829393&

  • sid=OCLC:Inspec

Why haven t urns caught on beyond certain communities
Why haven’t URNs caught on beyond certain communities?

  • Complexity of systems

  • One size does not fit all - special purpose URN schemes have been successful, e.g., PubMed ID, Astrophysics BibCode

  • No guarantee of persistence – longevity is an organizational not technical issue

  • Requires well-regulated administrative systems

  • Absence of “killer” applications – although reference linking is emerging

Types not all data and content is the same
Types: Not all data and content is the same

  • Format or Genre

    • How you sense it

    • What you can do with it

    • E.G. – audio, video, map, book

  • Type

    • What you need to process it

    • What is its bit layout

  • Compression or encoding

Multipurpose internet mail extensions
Multipurpose Internet Mail Extensions

  • RFC 822 – define textualformat of email messages

  • RFC 2045-2049 – Extend textual email to allow

    • Character sets other than US-ASCII

    • Extensible set of non-ASCII types for message bodies

    • Definition of multi-part mail (attachments)

Mime types
MIME Types

  • Two part type hierarchy

    • Top level type

      • text

      • audio

      • video

      • image

      • application

      • multipart

    • Examples

      • text/plain image/gif application/postscript

  • Extensions are handled by IANA

Mime in http content negotiation
MIME in HTTP (Content Negotiation)

  • Accept in request-header

    • Accept: text/plain; q=0.5, text/html, text/x-dvi; q=0.8, text/xml

      • text/plain and text/xml are preferred, then text/x-dvi, then text/html

  • Content-Type in response-header

    • Content-Type: text/html

Mime is too limited
MIME is too limited

  • Two-level type depth is simplistic

  • Multi-media documents

  • “Documents” that have many types or views

    • FEDORA