Identifiers and Types



CS431 – Architecture of Web Information Systems

Carl Lagoze – Cornell University – Feb. 7 2005

Identity Change Persistence
  • Paradox: reality contains things that persist and change over time
    • Heraclitus and Plato: can you step into the same river twice?
    • Ship of Theseus: over the years, the Athenians replaced each plank of the original ship of Theseus as it decayed, thereby keeping it in good repair. Eventually, not a single plank of the original ship was left. So, did the Athenians still have one and the same ship that used to belong to Theseus?
  • Identifiers provide a key or handle linking abstract concepts to physical or perceptible entities
  • They provide us with a necessary figment of persistence
  • They are perhaps the one essential and common form of metadata
  • Why bother?
    • Finding things
    • Referring to things (Citations)
    • Asserting ownership over things
I have lots of identifiers
  • Carl Jay Lagoze, Dad, Hey you
  • 123-456-7890 (SSN)
  • 1234-5678-1234-1234 (Visa Card)
  • FZBMLH (US Airways locator on January 18 flight to San Diego)
Identifier Issues
  • Object granularity
  • Identifier Context
    • Object atomicity
    • Part/whole relationships
  • Location independence
  • Global uniqueness
  • Persistent across time
  • Human vs. machine generation
  • Machine resolution
  • Administration (centralized vs. decentralized)
  • Intrinsic semantics
  • Type specificity
Two common pre-digital identifiers
  • ISBN (International Standard Book Number)
    • Uniquely identifies every monograph (book)
    • One ISBN for each format
      • Harry Potter and the Sorcerer's Stone, hardback: 0590353403
      • Harry Potter and the Sorcerer's Stone, softcover: 059035342X
    • Number is semantically meaningful (components)
    • International administration (>150 countries)
  • ISSN (International Standard Serial Number)
    • Uniquely identifies every serial (not issue or volume)
    • Semantically meaningless
    • International administration
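
One of the components of an ISBN is its final character, a check digit. The sketch below validates the 10-character form used on this slide: the characters are weighted 10 down to 1, and the weighted sum must be divisible by 11, with "X" standing for 10 in the last position. The function name is made up for illustration.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Return True if the ISBN-10 check digit agrees with the first nine digits."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" means 10 and is only legal as the check digit
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        # Weights run from 10 (first character) down to 1 (check digit).
        total += (10 - position) * value
    return total % 11 == 0


for isbn in ("0590353403", "059035342X"):
    print(isbn, isbn10_is_valid(isbn))
```

Both ISBNs from the slide pass the check; changing any single digit makes it fail.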
URI: Uniform Resource Identifier
  • Generic syntax for identifiers of resources
  • Defined by RFC 2396
  • Syntax: <scheme>://<authority><path>?<query>
    • Scheme
      • Defines semantics of remainder of URI
      • ftp, gopher, http, mailto, news, telnet
    • Authority
      • Authority governing namespace for remainder of URI
      • Typically Internet-based server
    • Path
      • Identification of data within scope of authority
    • Query
      • String of information to be interpreted by authority
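
A minimal sketch of the four components, using Python's standard urllib.parse module. The URI itself is hypothetical and chosen only for illustration; the helper name uri_components is also made up.

```python
from urllib.parse import urlsplit


def uri_components(uri: str) -> dict:
    """Split a URI into the four components named on the slide."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # urllib calls the authority "netloc"
        "path": parts.path,
        "query": parts.query,
    }


# Hypothetical URI, not taken from the slides.
example = "http://www.example.org/courses/cs431/index.html?term=spring"
for name, value in uri_components(example).items():
    print(f"{name:10} {value}")
```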
Why is RFC 2396 so big?
  • Character encodings
  • Partial and relative URIs
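
Both topics can be illustrated briefly with Python's standard urllib.parse. This is a sketch only: the base URI is the abstract one used in the RFC's own resolution examples, and the file name being escaped is made up.

```python
from urllib.parse import quote, unquote, urljoin

# Relative references are resolved against a base URI.
base = "http://a/b/c/d;p?q"
resolved = {ref: urljoin(base, ref) for ref in ("g", "../g", "/g")}
for ref, absolute in resolved.items():
    print(f"{ref:5} -> {absolute}")

# Characters outside the allowed set are escaped as %XX octets.
escaped = quote("annual report.html")
print(escaped)
print(unquote(escaped))
```

A sibling reference ("g"), a parent reference ("../g"), and an absolute path ("/g") each resolve to a different absolute URI, which gives a sense of how many cases the specification has to cover.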
URL: Uniform Resource Locator
  • String representation of the location of a resource that is available via the Internet
  • Uses URI syntax
  • The scheme defines the access method (protocol); the client uses it to determine which protocol to “speak”
    • http: open a socket to the host on port 80 and issue a GET for index.html
    • ftp: open a socket to the host on port 21, open an FTP session, and issue an FTP get for index.html
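
The role of the scheme can be sketched as a lookup that tells the client which port to contact. The function name, the host names, and the two-entry port table are assumptions made for illustration; no network connection is opened.

```python
from urllib.parse import urlsplit

# Default ports for the two protocols mentioned on the slide.
DEFAULT_PORTS = {"http": 80, "ftp": 21}


def access_plan(url: str):
    """Return (protocol, host, port, path) that a client would use for this URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        raise ValueError(f"no protocol handler for scheme {scheme!r}")
    # An explicit port in the URL overrides the scheme's default.
    port = parts.port or DEFAULT_PORTS[scheme]
    return scheme, parts.hostname, port, parts.path or "/"


# Hypothetical URLs.
print(access_plan("http://www.example.org/index.html"))
print(access_plan("ftp://ftp.example.org/index.html"))
```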
URL Issues
  • Persistence
  • Location dependence
  • Valid only at the item level
    • What about works, expressions, manifestations?
  • Multiple resolution
    • “get the one that is cheapest, most reliable, most recent, most appropriate for my hardware, etc.”
  • Non-digital resources?
  • Sub-parts
URC – Uniform Resource Characteristic (Catalog)
  • Failed but interesting effort
    • Multiple resolution
    • Describe resource by its characteristics
  • Provide adequate bundled information about a resource (metadata) to create an identification block for any given resource (including locations)
    • Exactly what is the common set of characteristics for describing different types of resources?
    • Where are these characteristics stored?
Robust Hyperlinks
  • Characteristic of document (metadata) is computed automatically via fingerprint of its content.
  • “Lexical” signatures: The top n words of a document chosen for rarity, subject to heuristic filters to aid robustness.
    • A “TF-IDF-like” measure
    • Five or so words are sufficient
  • Can be used to locate document (via search engine) after it is moved
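
A rough sketch of how such a signature might be computed. The scoring here is plain term frequency times the log of inverse document frequency, which is only one possible "TF-IDF-like" measure, and the heuristic filters mentioned above are left out. The document text, the document-frequency table, and the corpus size are all invented for illustration.

```python
import math
import re
from collections import Counter


def lexical_signature(text, doc_freq, n_docs, size=5):
    """Pick the `size` terms of `text` that are frequent here but rare in the corpus."""
    term_freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(term):
        # Terms missing from the table are treated as occurring in one document.
        df = doc_freq.get(term, 1)
        return term_freq[term] * math.log(n_docs / df)

    # Sort by descending score; break ties alphabetically for a stable result.
    return sorted(term_freq, key=lambda term: (-score(term), term))[:size]


# Invented corpus statistics: number of documents containing each term.
doc_freq = {"the": 900, "a": 850, "to": 800, "system": 120,
            "location": 90, "resolves": 15, "handle": 10}
text = "The handle system resolves the handle to a location"
print(lexical_signature(text, doc_freq, n_docs=1000, size=3))
```

The resulting words could then be submitted as a search-engine query to find the document again after it has moved.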
Robust Hyperlinks – Why does this work?
  • Number of terms on Web is reportedly close to 10,000,000.
  • If terms were distributed independently, the probability of 5 even moderately common terms occurring in more than one document is very small.
    • In fact, picking 3 terms restricted to those occurring in 100,000 documents works pretty well.
    • Many documents contain very infrequently used words.
    • There is lots of room for independence to be off, and to play with term selection for robustness, etc.
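
The independence argument can be checked with a little arithmetic. Under independence, the chance that a random document contains every signature term is the product of the individual term probabilities. The size of the Web used below (one billion documents) is an assumption made for this sketch and does not come from the slides.

```python
import math


def expected_matching_documents(n_docs, doc_freqs):
    """Expected number of documents containing all terms, assuming independence."""
    probability_all_terms = math.prod(df / n_docs for df in doc_freqs)
    return n_docs * probability_all_terms


# Assumption: a Web of one billion documents; three terms, each in 100,000 documents.
web_size = 1_000_000_000
chance_matches = expected_matching_documents(web_size, [100_000] * 3)
print(chance_matches)
```

With these numbers the expected count of chance matches is about one in a thousand, so a query on the three terms should return essentially only the intended document.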
URN – Uniform Resource Name
  • “globally unique, persistent names”
  • Independence from location and location methods

<URN> ::= "urn:" <NID> ":" <NSS>

    • NID: namespace identifier
    • NSS: namespace-specific string
    • examples:
      • urn:ISSN:1234-5678
      • urn:isbn:9044107642
      • urn:doi:10.1000/140
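
The grammar above can be sketched as a small parser. The function name is made up; the namespace identifier is lowercased because it is case-insensitive, while the namespace-specific string is returned unchanged.

```python
def parse_urn(urn: str):
    """Split a URN into (NID, NSS) following "urn:" <NID> ":" <NSS>."""
    pieces = urn.split(":", 2)  # the NSS may itself contain colons
    if len(pieces) != 3 or pieces[0].lower() != "urn" or not pieces[1] or not pieces[2]:
        raise ValueError(f"not a URN: {urn!r}")
    _, nid, nss = pieces
    return nid.lower(), nss


for urn in ("urn:ISSN:1234-5678", "urn:isbn:9044107642", "urn:doi:10.1000/140"):
    print(parse_urn(urn))
```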
Why isn’t DNS sufficient? (parenthetical comment)
  • Issue of semantic vs. non-semantic names
  • Changing ownership
  • Hierarchical legacy of DNS is sometimes inappropriate
Handles: Names for Internet Resources
  • Naming system for location-independent, persistent names
  • One name, multiple resolutions

The resource named by a Handle can be:
  • A library item
  • A collection of library items
  • A catalog record
  • A computer
  • An e-mail address
  • A public key for encryption
  • etc.

Syntax of Handles
  • 10.1234/1995.;9 (date-time stamp)
  • cornell.cs/cstr-94.45 (mnemonic name)
  • loc/a43v-8940cgr (random string)
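
In each example the part before the first slash is the naming authority and the part after it is a name that only has to be unique within that authority. A minimal sketch of that split, with a made-up function name:

```python
def split_handle(handle: str):
    """Split a handle into (naming authority, local name) at the first slash."""
    authority, slash, local_name = handle.partition("/")
    if not slash or not authority or not local_name:
        raise ValueError(f"not a handle: {handle!r}")
    return authority, local_name


for handle in ("cornell.cs/cstr-94.45", "loc/a43v-8940cgr"):
    print(split_handle(handle))
```

Because the local name is opaque to the system, it can be a mnemonic, a timestamp, or a random string, as the examples show.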

Example of a Handle and its Data Used to Identify Two Locations
[Table not preserved; surviving column headings: “Data type”, “Handle data”]
Use of Handles in a Digital Library
[Diagram not preserved; surviving labels: Search System, Handle System]

Replication for Performance and Reliability
Example: the Global Handle System
[Diagram not preserved; surviving site labels: Los Angeles, CA; Washington, DC]

Proxies to Resolve Handles

A Web browser can resolve Handles via a proxy. For example, the following URL can be used to resolve the Handle loc.ndlp.amrlp/3a16616:
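
The idea can be sketched as simple URL construction: the handle is appended to the address of a proxy server, and the browser then follows an ordinary HTTP link. The proxy address below (proxy.example.org) is a made-up placeholder, not the address from the original slide, and the function name is hypothetical.

```python
from urllib.parse import quote


def handle_proxy_url(handle: str, proxy_base: str) -> str:
    """Build a browser-resolvable URL by appending a handle to a proxy's address."""
    # Keep "/" unescaped so the naming authority and local name stay visible.
    return proxy_base.rstrip("/") + "/" + quote(handle, safe="/")


# Hypothetical proxy address; the handle is the one named above.
print(handle_proxy_url("loc.ndlp.amrlp/3a16616", "http://proxy.example.org/"))
```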

Proxy Resolution
[Diagram not preserved; surviving labels: URL to Proxy, Handle System]
DOI – Digital Object Identifier
  • Technology and social infrastructure for naming
  • Established by publishers for persistent naming of entities (articles, journals, conference proceedings)
  • Cognizant of FRBR elements
  • Underlying technology is handle system
    • “persistent” names
      • Persistence is fortified by social underpinnings
      • Rules for establishing registration agencies
    • Multiple resolution
    • Registration mechanism has metadata associated with it
  • Example: doi:10.1000/186
OCLC's Persistent URL (PURL)
  • A PURL is a URL
    • Is fully compatible with today's Internet browsers
    • Users need no special software
  • Has some of the desirable features of URNs
  • Lacks some desirable features of URNs
    • Resolves only to a URL
    • Does not support multiple resolution
  • Developed by OCLC
  • Software openly available

PURL Syntax
[Diagram not preserved; surviving label: resolver address]
  • A PURL is a URL.
  • PURL resolvers use standard HTTP redirects to return the actual URL.
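
A resolver of this kind can be sketched as a lookup table plus a redirect response. Everything concrete here is an assumption for illustration: the table contents, the function name, and the choice of status code 302 are not taken from the slides or from OCLC's software.

```python
# Hypothetical resolver table: persistent name -> current location.
PURL_TABLE = {
    "/net/example/report": "http://www.example.org/2005/report.html",
}


def resolve_purl(path, table=PURL_TABLE):
    """Return (status code, headers) as an HTTP redirect to the current URL."""
    target = table.get(path)
    if target is None:
        return 404, {}
    # The client follows the Location header; only the table changes when
    # the resource moves, so the PURL itself stays the same.
    return 302, {"Location": target}


print(resolve_purl("/net/example/report"))
print(resolve_purl("/net/example/missing"))
```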

PURL Namespaces
A PURL provides a local (not global) namespace
[Example not preserved: one address “is different from” another]

OCLC PURL Resolution
[Diagram not preserved]
Making links context sensitive
  • Why?
    • “Appropriate item” differs for each user
    • Licensing locality
    • Some users may want a choice (abstract, full text, etc.)
  • Conceptualize the link as a service rather than as the object targeted.
  • OpenURL
    • Transports metadata about the work to…
    • A localized service that interprets the metadata and provides contextualized choices to the user.
[Diagram: OpenURL linking. Surviving labels: link source; provision of OpenURL; transportation of metadata & identifiers; resolution of metadata & identifiers into services]
OpenURL 0.1 syntax

  • id=doi:10.111/12345&
  • genre=article&
  • aulast=Weibel&aufirst=Stu&ISSN=35345353
  • &year=2001&volume=14&issue=3&spage=44&
  • pid=2829393&
  • sid=OCLC:Inspec
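
The lines above form a single query string of key=value pairs joined by "&". A minimal sketch of how a resolver might unpack it with Python's standard urllib.parse (the variable names are made up, and what a real resolver does with the fields is not shown):

```python
from urllib.parse import parse_qs

# The query string from the slide, joined into one line.
query = (
    "id=doi:10.111/12345&genre=article&aulast=Weibel&aufirst=Stu&ISSN=35345353"
    "&year=2001&volume=14&issue=3&spage=44&pid=2829393&sid=OCLC:Inspec"
)

# parse_qs returns a list per key; each key occurs once here.
fields = {key: values[0] for key, values in parse_qs(query).items()}
for key, value in fields.items():
    print(f"{key:8} {value}")
```

The local service can then use fields such as genre, ISSN, volume, and spage to offer the user an appropriate copy or service.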
Why haven’t URNs caught on beyond certain communities?
  • Complexity of systems
  • One size does not fit all - special purpose URN schemes have been successful, e.g., PubMed ID, Astrophysics BibCode
  • No guarantee of persistence – longevity is an organizational not technical issue
  • Requires well-regulated administrative systems
  • Absence of “killer” applications – although reference linking is emerging
Types: Not all data and content is the same
  • Format or Genre
    • How you sense it
    • What you can do with it
    • E.g., audio, video, map, book
  • Type
    • What you need to process it
    • What is its bit layout
  • Compression or encoding
Multipurpose Internet Mail Extensions
  • RFC 822 – defines the textual format of email messages
  • RFC 2045-2049 – extend textual email to allow
    • Character sets other than US-ASCII
    • Extensible set of non-ASCII types for message bodies
    • Definition of multi-part mail (attachments)
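
All three extensions show up in a short sketch built with Python's standard email package. The addresses, subject, body text, and attachment bytes are invented for illustration.

```python
from email.message import EmailMessage


def build_message() -> EmailMessage:
    """Build a multi-part message with a non-ASCII text body and one attachment."""
    message = EmailMessage()
    message["From"] = "sender@example.org"  # hypothetical addresses
    message["To"] = "reader@example.org"
    message["Subject"] = "Slides"
    # A body containing characters outside US-ASCII.
    message.set_content("Grüße aus Ithaca")
    # A non-text body part; adding it turns the message into a multipart.
    message.add_attachment(
        b"%!PS-Adobe-3.0\n",
        maintype="application",
        subtype="postscript",
        filename="slides.ps",
    )
    return message


message = build_message()
print(message.get_content_type())
for part in message.iter_parts():
    print(part.get_content_type(), part.get_content_charset())
```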
MIME Types
  • Two-part type hierarchy
    • Top level type
      • text
      • audio
      • video
      • image
      • application
      • multipart
    • Examples
      • text/plain, image/gif, application/postscript
  • Extensions are handled by IANA
MIME in HTTP (Content Negotiation)
  • Accept in request-header
    • Accept: text/plain; q=0.5, text/html, text/x-dvi; q=0.8, text/xml
      • text/html and text/xml are preferred (no q value means q=1), then text/x-dvi (q=0.8), then text/plain (q=0.5)
  • Content-Type in response-header
    • Content-Type: text/html
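
How a server might rank the types in an Accept header can be sketched as follows. This is a simplified reading of the header that ignores wildcards and other parameters; the function name is made up. A type with no q value counts as q=1, and ties keep the order in which the client listed them.

```python
def rank_accept(header_value: str):
    """Return the media types of an Accept header value, most preferred first."""
    ranked = []
    for position, item in enumerate(header_value.split(",")):
        media_type, *params = [piece.strip() for piece in item.split(";")]
        quality = 1.0  # default when no q parameter is given
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                quality = float(value)
        # Sort by descending quality, then by original position.
        ranked.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


accept = "text/plain; q=0.5, text/html, text/x-dvi; q=0.8, text/xml"
print(rank_accept(accept))
```

For the header on this slide the ranking is text/html, text/xml, text/x-dvi, text/plain.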
MIME is too limited
  • Two-level type depth is simplistic
  • Multi-media documents
  • “Documents” that have many types or views
    • FEDORA