Identifiers and types
This presentation is the property of its rightful owner.
Sponsored Links
1 / 37

Identifiers and Types PowerPoint PPT Presentation


  • 47 Views
  • Uploaded on
  • Presentation posted in: General

Identifiers and Types. CS431 – Architecture of Web Information Systems Carl Lagoze – Cornell University – Feb. 7 2005. Identity Change Persistence. Paradox: reality contains things that persist and change over time Heraclitus and Plato: can you step into the same river twice?

Download Presentation

Identifiers and Types

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Identifiers and types

Identifiers and Types

CS431 – Architecture of Web Information Systems

Carl Lagoze – Cornell University – Feb. 7 2005


Identity change persistence

Identity Change Persistence

  • Paradox: reality contains things that persist and change over time

    • Heraclitus and Plato: can you step into the same river twice?

    • Ship of Theseus: over the years, the Athenians replaced each plank in the original ship of Theseus as it decayed, thereby keeping it in good repair. Eventually, there was not a single plank left of the original ship. So, did the Athenians still have one and the same ship that used to belong to Theseus


Identity change persistence1

Identity Change Persistence


Identifiers

Identifiers

  • Provide a key or handle linking abstract concepts to physical or perceptible entities

  • Provide us with a necessary figment of persistence

  • They are perhaps the one essential and common form of metadata

  • Why bother?

    • Finding things

    • Referring to things (Citations)

    • Asserting ownership over things


I have lots of identifiers

I have lots of identifiers

  • Carl Jay Lagoze, Dad, Hey you

  • 123-456-7890 (SSN)

  • 1234-5678-1234-1234 (Visa Card)

  • FZBMLH (US Airways locator on January 18 flight to San Diego)


Identifier issues

Identifier Issues

  • Object granularity

  • Identifier Context

    • Object atomicity

    • Part/whole relationships

  • Location independence

  • Global uniqueness

  • Persistent across time

  • Human vs. machine generation

  • Machine resolution

  • Administration (centralized vs. decentralized)

  • Intrinsic semantics

  • Type specificity


Two common pre digital identifiers

Two common pre-digital identifiers

  • ISBN (International Standard Book Number)

    • Uniquely identifies every monograph (book)

    • One ISBN for each format

      • HP & SS hardback 0590353403

      • HP & SS softcover 059035342X

    • Number is semantically meaningful (components)

    • International administration (>150 countries)

  • ISSN (International Standard Serial Number)

    • Uniquely identifies every serial (not issue or volume)

    • Semantically meaningless

    • International administration


Uri universal resource identifier

URI: Universal Resource Identifier

  • Generic syntax for identifiers of resources

  • Defined by RFC 2396

  • Syntax: <scheme>://<authority><path>?<query>

    • Scheme

      • Defines semantics of remainder of URI

      • ftp, gopher, http, mailto, news, telnet

    • Authority

      • Authority governing namespace for remainder of URI

      • Typically Internet-based server

    • Path

      • Identification of data within scope of authority

    • Query

      • String of information to be interpreted by authority


Why is rfc 2396 so big

Why is RFC 2396 so big?

  • Character encodings

  • Partial and relative URIs


Url universal resource locator

URL: Universal Resource Locator

  • String representation of the location for a resource that is available via the Internet

  • Use URI syntax

  • Scheme has function of defining the access (protocol) method. Used by client to determine the protocol to “speak”.

    • http://an.org/index.html - open socket to an.org on port 80 and issue a GET for index.html

    • ftp://an.org/index.html - open socket to an.org on port 21, open ftp session, issue ftp get for index.html….


Url issues

URL Issues

  • Persistence

  • Location dependence

  • Valid only at the item level

    • What about works, expressions, manifestations

  • Multiple resolution

    • “get the one that is cheapest, most reliable, most recent, most appropriate for my hardware, etc.”

  • Non-digital resources?

  • Sub-parts


Urc uniform resource characteristic catalog

URC – Uniform Resource Characteristic (Catalog)

  • Failed but interesting effort

    • Multiple resolution

    • Describe resource by its characteristics

  • Provide adequate bundled information about a resource (metadata) to create identification block for any given resource (including locations)

    • Exactly what are the common set of characteristics for describing different types of resources?

    • Where are these characteristics stored?


Robust hyperlinks

Robust Hyperlinks

  • Characteristic of document (metadata) is computed automatically via fingerprint of its content.

  • “Lexical” signatures: The top n words of a document chosen for rarity, subject to heuristic filters to aid robustness.

    • “a TF-IDF-like” measure

    • Five or so words are sufficient

  • Can be used to locate document (via search engine) after it is moved


Robust hyperlinks why does this work

Robust Hyperlinks – Why does this work?

  • Number of terms on Web is reportedly close to 10,000,000.

  • If terms were distributed independently, the probability of 5 even moderately common terms occurring in more than one document is very small.

    • In fact, picking 3 terms restricted to those occurring in 100,000 documents works pretty well.

    • Many documents contain very infrequently used words.

    • There is lots of room for independence to be off, and to play with term selection for robustness, etc..

  • http://www.cs.berkeley.edu/~phelps/Robust/


Urn universal resource name

URN – Universal Resource Name

  • “globally unique, persistent names”

  • Independence from location and location methods

    <URN> ::= "urn:" <NID> ":" <NSS>

    • NID: namespace identifier

    • NSS: namespace-specific string

    • examples:

      • urn:ISSN:1234-5678

      • urn:isbn:9044107642

      • urn:doi:10.1000/140


Why isn t dns sufficient parenthetical comment

Why isn’t DNS sufficient (parenthetical comment)

  • Issue of semantic vs. non-semantic names

  • Changing ownership

  • Hierarchical legacy of DNS is sometimes inappropriate


Handles names for internet resources

Handles: Names for Internet Resources

  • Naming system for location-independent, persistent names

  • One name, multiple resolutions

  • http://www.handle.net

The resource named by a Handle can be:

• A library item

• A collection of library items

• A catalog record

• A computer

• An e-mail address

• A public key for encryption

• etc., etc., etc. ....


Syntax of handles

Syntax of Handles

<naming_authority>/<locally_unique_string>

or

hdl:<naming_authority>/<locally_unique_string>

Examples

10.1234/1995.02.12.16.42.21;9 (date-time stamp)

cornell.cs/cstr-94.45 (mnemonic name)

loc/a43v-8940cgr(random string)


Example of a handle and its data used to identify two locations

Example of a Handle and its DataUsed to Identify Two Locations

Data type

Handle data

Handle

loc.ndlp.amrlp/123456

URL

http://www.loc.gov/.....

RAP

loc/repository-1r4589


Use of handles in a digital library

Use of Handles in a Digital Library

Repository

User

interface

Search System

Handle System


Replication for performance and reliability

Replication for Performance and Reliability

Example: the Global Handle System

Los Angeles, CA

Washington, DC


Proxies to resolve handles

Proxies to Resolve Handles

A Web browser can resolve Handles via a proxy. For example, the following URL can be used to resolve the Handle loc.ndlp.amrlp/3a16616:


Proxy resolution

Proxy Resolution

URL to Proxy

WWW

browser

Proxy

server

URL

hdl.handle.net

Handle System

URL

HTTP

server

Resource


Doi digital object identifier

DOI – Digital Object Identifier

  • Technology and social infrastructure for naming

  • Established by publishers for persistent naming of entities (articles, journals, conference proceedings)

  • Cognizant of FRBR elements

  • Underlying technology is handle system

    • “persistent” names

      • Persistence is fortified by social underpinnings

      • Rules for establishing registration agencies

    • Multiple resolution

    • Registration/mechanism has metadata associated with it

  • doi:10.1000/186


Oclc s persistent url purl

OCLC's Persistent URL (PURL)

•A PURL is a URL

-> Is fully compatible with today's Internet browsers

-> Users need no special software

•Has some of the desirable features of URNs

•Lacks some desirable features of URNs

-> Resolves only to a URL

-> Does not support multiple resolution

•Developed by OCLC

•Software openly available

http://www.purl.org


Purl syntax

http://purl.oclc.org/keith/home

protocol

resolver address

name

PURL Syntax

•A PURL is a URL.

•PURL resolvers use standard http redirects to return the actual URL.


Purl namespaces

PURL Namespaces

A PURL provides a local (not-global namespace)

http://purl.oclc.org/keith/home

is different from

http://purl.stanford.edu/keith/home


Oclc purl resolution

OCLC PURL Resolution

WWW

browser

PURL

PURL

server

PURL

database

URL

URL

HTTP

server

Resource


Making links context sensitive

Making links context sensitive

  • Why?

    • “Appropriate item” differs for each user

    • Licensing locality

    • Some users may want a choice (abstract, full text, etc.)

  • Conceptualize link as service rather than object targeted.

  • OpenURL

    • Transports metadata about the work to…

    • A localized service that interprets the metadata and provides contextualized choices to the user.


Identifiers and types

link

link

link

link

link

destination

link

destination

link

destination

link

destination

OpenURL

link source

linking

server

OpenURL

OpenURL linking

transportation of

metadata & identifiers

user-specific

.

reference

context-sensitive

resolution of

metadata & identifiers into services

provision of OpenURL


Identifiers and types

OpenURL 0.1 syntax

  • http://www.mysrv.org/menu?

  • id=doi:10.111/12345&

  • genre=article&

  • aulast=Weibel&aufirst=Stu&ISSN=35345353

  • &year=2001&volume=14&issue=3&spage=44&

  • pid=2829393&

  • sid=OCLC:Inspec


Why haven t urns caught on beyond certain communities

Why haven’t URNs caught on beyond certain communities?

  • Complexity of systems

  • One size does not fit all - special purpose URN schemes have been successful, e.g., PubMed ID, Astrophysics BibCode

  • No guarantee of persistence – longevity is an organizational not technical issue

  • Requires well-regulated administrative systems

  • Absence of “killer” applications – although reference linking is emerging


Types not all data and content is the same

Types: Not all data and content is the same

  • Format or Genre

    • How you sense it

    • What you can do with it

    • E.G. – audio, video, map, book

  • Type

    • What you need to process it

    • What is its bit layout

  • Compression or encoding


Multipurpose internet mail extensions

Multipurpose Internet Mail Extensions

  • RFC 822 – define textualformat of email messages

  • RFC 2045-2049 – Extend textual email to allow

    • Character sets other than US-ASCII

    • Extensible set of non-ASCII types for message bodies

    • Definition of multi-part mail (attachments)


Mime types

MIME Types

  • Two part type hierarchy

    • Top level type

      • text

      • audio

      • video

      • image

      • application

      • multipart

    • Examples

      • text/plain image/gif application/postscript

  • Extensions are handled by IANA


Mime in http content negotiation

MIME in HTTP (Content Negotiation)

  • Accept in request-header

    • Accept: text/plain; q=0.5, text/html, text/x-dvi; q=0.8, text/xml

      • text/plain and text/xml are preferred, then text/x-dvi, then text/html

  • Content-Type in response-header

    • Content-Type: text/html


Mime is too limited

MIME is too limited

  • Two-level type depth is simplistic

  • Multi-media documents

  • “Documents” that have many types or views

    • FEDORA


  • Login