DCC–Persistent Identifiers for Representation Information

This article discusses the use of persistent identifiers for representation information in the context of digital curation and preservation. It explores different possible persistent identifier systems and provides conclusions on their implementation.

Presentation Transcript


  1. Digital Curation Centre: a centre of expertise in data curation and preservation • DCC–Persistent Identifiers for Representation Information • D Giaretta • http://www.dcc.ac.uk • http://dev.dcc.ac.uk

  2. Outline • DCC Development work • Beginning with OAIS Reference Model • Motivation for use of Persistent IDs • Simple case! • Discussion of some possible Persistent ID systems • Conclusions

  3. OAIS Reminder • OAIS is a standard about the long-term preservation of information • An Information Object is made up of a Data Object plus its accompanying Representation Information (RepInfo)

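  4. Information Objects [diagram] • An Information Object is a Data Object interpreted using Representation Information (1+) • Representation Information is itself interpreted using further Representation Information • A Data Object is either a Physical Object or a Digital Object • A Digital Object is made up of 1+ Bit Sequences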

  5. Representation Information • The Data Object is “interpreted using” the RepInfo • The Reference Model is designed to ensure that an OAIS is NOT set the impossible task of having to provide ALL possible RepInfo immediately • Hence: • Take account of the Designated Community and its associated Knowledge Base

  6. Representation Information • The information that maps a Data Object into more meaningful concepts. An example is the ASCII definition that describes how a sequence of bits (i.e., a Data Object) is mapped into a symbol.
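The ASCII example can be sketched in a few lines of Python. This is only an illustration: `bits_to_text` is a hypothetical helper name, and the bit string is made up for the demonstration. The point is that the bits alone carry no meaning until the ASCII table (the RepInfo) is applied.

```python
def bits_to_text(bits: str) -> str:
    """Interpret a string of '0'/'1' characters as a sequence of 8-bit ASCII codes."""
    if len(bits) % 8 != 0:
        raise ValueError("bit sequence length must be a multiple of 8")
    octets = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    # The ASCII table plays the role of the Representation Information here:
    # it maps each 8-bit group (Data Object) to a symbol.
    return bytes(int(octet, 2) for octet in octets).decode("ascii")


data_object = "010001000100001101000011"  # illustrative bit sequence
text = bits_to_text(data_object)          # "DCC"
```

The same bits read as a 24-bit integer or as pixel values would give a different Information Object, which is why the RepInfo has to travel with (or be findable from) the data.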

  7. Representation Information • The Representation Information accompanying a physical object, like a moon rock, may give additional meaning • It typically is a result of some analysis of the physically observable attributes of the rock • The Representation Information accompanying a digital object, or sequence of bits, is used to provide additional meaning. • It typically maps the bits into commonly recognized data types such as character, integer, and real and into groups of these data types. • It associates these with higher level meanings which can have complex inter-relationships that are also described

  8. Recursive Nature of Representation Information • Structure Information • Semantic Information • Other Representation Information
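One way to picture the recursion: each piece of RepInfo may need further RepInfo, and the chain stops where the Designated Community's Knowledge Base already covers it. The sketch below assumes the dependencies are available as a simple mapping; the function name and all the item names are hypothetical and chosen only for illustration.

```python
def required_repinfo(item, depends_on, knowledge_base):
    """Collect every RepInfo item needed to interpret `item`.

    depends_on: dict mapping an item to the RepInfo items it is interpreted using.
    knowledge_base: set of items the Designated Community already understands;
    recursion stops there.
    """
    needed = []
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, []):
            if dep in knowledge_base or dep in seen:
                continue  # already understood, or already collected
            seen.add(dep)
            needed.append(dep)
            stack.append(dep)  # the RepInfo may need RepInfo of its own
    return needed


# Hypothetical Representation Net for an astronomical data file.
depends_on = {
    "survey.fits": ["FITS standard", "FITS dictionary"],
    "FITS standard": ["ASCII", "PDF specification"],
    "FITS dictionary": ["XML"],
}
knowledge_base = {"ASCII", "XML"}
net = required_repinfo("survey.fits", depends_on, knowledge_base)
```

If the Knowledge Base shrinks over time (say, XML is no longer generally understood), rerunning the same traversal with the smaller set shows which extra RepInfo now has to be supplied.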

  9. Examples (cont) • “504b0304140000000800f696….” • “This is a ZIP file which contains Word files, each of which contains an encoded message which needs the key ‘!D$G^AJU*KI’ to decode it using encryption method SHA7”
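The hexadecimal string on this slide begins with the ZIP local file header signature (`50 4b 03 04`, i.e. "PK" followed by bytes 3 and 4). A minimal Python sketch of recovering that much structure from the visible bytes follows; `describe_zip_prefix` is a hypothetical helper, and only the part of the string shown on the slide is used.

```python
import struct


def describe_zip_prefix(hex_prefix: str) -> dict:
    """Decode the first fields of a ZIP local file header given as a hex string."""
    raw = bytes.fromhex(hex_prefix)
    if len(raw) < 10:
        raise ValueError("need at least 10 bytes to read the header fields")
    # Little-endian layout: 4-byte signature, then three unsigned 16-bit fields.
    signature, version, flags, method = struct.unpack("<4sHHH", raw[:10])
    return {
        "is_zip": signature == b"PK\x03\x04",
        "version_needed": version,
        "flags": flags,
        "compression_method": method,  # 8 denotes DEFLATE in the ZIP format
    }


info = describe_zip_prefix("504b0304140000000800f696")
```

This only establishes the outermost structure. Everything else in the slide's description (the Word files inside, the encoded message, the key) is further RepInfo that cannot be derived from the bytes themselves.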

  10. Examples (cont) • LaTeX file containing an EPS (Encapsulated PostScript) version of an image • Web page containing Java Applet generating random numbers • SWISS-PROT data • Foreign Language emails

  11. Further RepInfo Classification

  12. Why classify? • “This is a Word file” • “This is a ZIP file which contains Word files” • “This is a ZIP file which contains Word files, each of which contains an encoded message which needs the key ‘!D$G^AJU*KI’ to decode it using encryption method SHA7” • “This is a ZIP file which contains Word files, each of which contains an encoded message which needs the key ‘!D$G^AJU*KI’ to decode it using encryption method SHA8” • To avoid repetition • To facilitate automation

  13. Structure – including Formats • Distinguish • formats which are used mainly for rendering – to be followed by human inspection, and • formats used for automated processing • Distinguish: • Things with unknown structure – needs software • proprietary software e.g. MS Word • Open Source software e.g. CDF • Things with known structure • ASCII file, FITS file etc • Document the format • Use description language if possible e.g. EAST • The EAST tools are themselves Representation Information which in due course will have to be fully defined – the closure of their Representation Nets will be the EAST standard • Higher level definitions should include useful scientific objects and humanities objects

  14. Layered Model from OAIS

  15. Semantics • Meaning/ Relationships • Hard problem • Probably start with Data Dictionaries • Add RDF etc

  16. Time Dependent Information • Many, perhaps most, datasets change over time and the state at each particular moment in time may be important. It may be useful to break the issue into separate parts. • at each moment in time we could, in principle, take a snapshot and store it. That snapshot has its associated Representation Net. • efficient storage of a series of snapshots may lead one to store differences or include time tags in the data • Additional Representation Information would be needed which describes how to get to a particular time's snapshot from the efficiently encoded version. • Also applies to ANNOTATION – who said what about which and when did they say it
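As a rough illustration of the "store differences" option, the Python sketch below rebuilds the snapshot for a given time from a base state plus time-tagged changes. The encoding (a dict of changes per timestamp, with `None` meaning deletion) is an assumption made for this example; describing such an encoding is exactly the additional RepInfo the slide refers to.

```python
def snapshot_at(base, deltas, when):
    """Rebuild the dataset state at time `when`.

    base: dict holding the state before any delta is applied.
    deltas: list of (timestamp, changes) pairs sorted by timestamp, where
    `changes` maps a key to its new value, or to None for a deletion.
    """
    state = dict(base)  # copy so the stored base stays untouched
    for timestamp, changes in deltas:
        if timestamp > when:
            break  # later changes are not part of this snapshot
        for key, value in changes.items():
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
    return state


base = {"status": "draft"}
deltas = [
    (1, {"status": "reviewed"}),
    (2, {"owner": "archive"}),
    (3, {"status": None}),
]
at_two = snapshot_at(base, deltas, 2)
```

Annotations could be handled the same way, with each delta also recording who made the statement and about which item.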

  17. Actions and Processes (Behaviour) • Some information has, as an integral part of its content, an implicit or explicit process associated with it • An example of this is a database or other time dependent or reactive system such as a Neural Net. • Emulations • Universal Virtual Computer (UVC) • A very well specified VM e.g. JVM

  18. Is saying “it’s XML” enough?
      <?xml version='1.0'?>
      <VOTABLE version="1.1"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.1 http://www.ivoa.net/xml/VOTable/v1.1"
       xmlns="http://www.ivoa.net/xml/VOTable/v1.1">
      <!--
       ! VOTable written by uk.ac.starlink.votable.VOTableWriter
       !-->
      <RESOURCE>
      <TABLE name="6dfgs_E7_subset" nrows="875">
      <PARAM arraysize="*" datatype="char" name="Original Source" value="http://www-wfau.roe.ac.uk/6dFGS/6dfgs_E7.fld.gz">
      <DESCRIPTION>URL of data file used to create this table.</DESCRIPTION>
      </PARAM>
      <PARAM arraysize="*" datatype="char" name="Credits" value="Column explanations provided by Mike Read (ROE) from 6dfGS project."/>
      <PARAM arraysize="*" datatype="char" name="Conversion" value="Converted from 6dfgs_E7.fld.gz by Mark Taylor (Starlink) using STIL."/>
      <PARAM arraysize="*" datatype="char" name="Comment" value="Cut down 6dfGS dataset for TOPCAT demo usage."/>
      <FIELD arraysize="15" datatype="char" name="TARGET">
      <DESCRIPTION>Target name</DESCRIPTION>
      </FIELD>
      <FIELD arraysize="11" datatype="char" name="RA" unit="HMS">
      <DESCRIPTION>Right Ascension J2000</DESCRIPTION>
      </FIELD>
      <FIELD arraysize="11" datatype="char" name="DEC" unit="DMS">
      <DATA>
      <FITS>
      <STREAM encoding='base64'>
      U0lNUExFICA9ICAgICAgICAgICAgICAgICAgICBUIC8gU3RhbmRhcmQgRklUUyBm
      b3JtYXQgICAgICAgICAgICAgICAgICAgICAgICAgICBCSVRQSVggID0gICAgICAg
      ICAgICAgICAgICAgIDggLyBDaGFyYWN0ZXIgZGF0YSAgICAgICAgICAgICAgICAg
      ICAgICAgICAgICAgICAgIE5BWElTICAgPSAgICAgICAgICAgICAgICAgICAgMCAv
      IE5vIGltYWdlLCBqdXN0IGV4dGVuc2lvbnMgICAgICAgICAgICAgICAgICAgICAg
      • Or here • NO!

  19. Why not embed in the object? • Do we have to repeat things each time? • Does every archive have to do everything? • What happens when the Designated Community Knowledge Base changes?

  20. Registries • A place to register something • A place to look something up (find something)

  21. Examples • http://www.loc.gov/film/nfr2004.html • http://hul.harvard.edu/gdfr/ • http://sunsite.berkeley.edu/~rbeaubie/metsimpl/ • http://metadata.net/registries.html • http://uddi.microsoft.com/default.aspx

  22. Simplest cases: • Data object has an identifier pointing to Representation Information (RepInfo) • Services: Given an identifier return associated contents of Repository • Writer of RepInfo needs to be able to find related stuff (i.e. has someone already done the work?) • Services: must be able to SEARCH registry in various ways • Updater of RepInfo – someone/something needs to be able to add, extend (add RepInfo for the RepInfo), correct
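The three roles above (resolve an identifier, search for existing work, update or extend an entry) can be sketched as a tiny in-memory registry. This is a minimal sketch under stated assumptions: the class names, fields and the substring search are all hypothetical and do not describe the DCC registry's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class RepInfoEntry:
    identifier: str
    description: str
    content: str
    # Identifiers of further RepInfo needed to understand this RepInfo.
    repinfo_ids: list = field(default_factory=list)


class Registry:
    def __init__(self):
        self._entries = {}

    def register(self, entry: RepInfoEntry) -> None:
        """A place to register something."""
        if entry.identifier in self._entries:
            raise KeyError(f"identifier already registered: {entry.identifier}")
        self._entries[entry.identifier] = entry

    def lookup(self, identifier: str) -> RepInfoEntry:
        """Given an identifier, return the associated contents."""
        return self._entries[identifier]

    def search(self, text: str) -> list:
        """Let a RepInfo writer check whether someone has already done the work."""
        needle = text.lower()
        return [e for e in self._entries.values() if needle in e.description.lower()]

    def extend(self, identifier: str, repinfo_id: str) -> None:
        """Add RepInfo for the RepInfo."""
        self.lookup(identifier).repinfo_ids.append(repinfo_id)


registry = Registry()
registry.register(RepInfoEntry("ri-1", "FITS file structure", "description of FITS"))
registry.register(RepInfoEntry("ri-2", "ASCII character encoding", "the ASCII table"))
registry.extend("ri-1", "ri-2")
```

A real registry would also need the access control questions raised later in the talk (who may update an entry in the long term), which this sketch leaves out.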

  23. High Level Conceptual View • The Digital Object could have RepInfo packed with it • Example of use of Representation Information Labelling

  24. Possible ways to attach ID • DOI metadata • SRB attribute • METS/XFDU attribute • Object-based Storage Devices (OSD) attribute • NB local caching is possible • Simple buy-in

  25. Example Label:

  26. Persistent ID – Digital Objects • Persistent Identifiers of Persistent Objects • Uniqueness (over time) • “Actionable” i.e. actually allows one to get something • Bootstrap step • Sequence of resolutions • Terminal step

  27. Uniqueness • Hierarchy of name spaces • In each namespace: • Unique (how many?) • Final namespace: e.g. • Unique (probably out of a larger number) • Repository assigned: e.g. Sequential, Hashed etc • Repository or Depositor assigned or Distributed system: e.g. UUID based
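The assignment strategies named for the final namespace can be illustrated with the Python standard library. The function names, the zero-padded counter, the choice of SHA-256 and the `/` used to join namespaces are all assumptions made for this sketch, not something the slide prescribes.

```python
import hashlib
import itertools
import uuid

_counter = itertools.count(1)


def sequential_id() -> str:
    """Repository assigned: unique only within the repository that holds the counter."""
    return f"{next(_counter):08d}"


def hashed_id(content: bytes) -> str:
    """Repository assigned: derived from the content, so identical content gets the same ID."""
    return hashlib.sha256(content).hexdigest()


def distributed_id() -> str:
    """Depositor or distributed assignment: random UUID, no central coordination needed."""
    return str(uuid.uuid4())


def qualified(namespaces, local_id: str) -> str:
    """Prefix a local ID with its hierarchy of namespaces."""
    return "/".join([*namespaces, local_id])


example = qualified(["registry", "repository-a"], hashed_id(b"some RepInfo"))
```

Sequential and hashed IDs are only unique relative to their enclosing namespaces, which is why the hierarchy matters. A UUID is "probably" unique out of a much larger space without any hierarchy at all.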

  28. Resolvability • BsXsYs(Z)T • B: Bootstrap step • s: Separators – may be different • X, Y: Sequence of intermediate resolver steps • Z (implicit) : terminal resolver service • T: terminal token
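The pattern BsXsYs(Z)T can be read as "split on a known sequence of separators". The Python sketch below does that for an ARK-style string; the function name and the returned field names are hypothetical, and the caller is assumed to know which separators the identifier scheme uses, since they may differ from step to step.

```python
def split_identifier(identifier: str, separators) -> dict:
    """Split an identifier into bootstrap step, resolver steps and terminal token.

    separators: the separator strings in the order they occur in the identifier.
    """
    if not separators:
        raise ValueError("at least one separator is required")
    parts = []
    rest = identifier
    for sep in separators:
        head, found, rest = rest.partition(sep)
        if not found:
            raise ValueError(f"separator {sep!r} not found")
        parts.append(head)
    return {
        "bootstrap": parts[0],         # B
        "resolver_steps": parts[1:],   # X, Y, ...
        "terminal_token": rest,        # T, handed to the terminal resolver (Z)
    }


parsed = split_identifier(
    "ark:/64269/e1fe9271-cd48-4418-a63e-b112ebf792c7", [":/", "/"]
)
```

The terminal resolver Z does not appear in the string. It is implied by the last resolver step, which is why the slide shows it in parentheses.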

  29. Persistence Requirements • External to Repository: • Bootstrap step • Each resolver step • Within the control of a Repository • Terminal resolver • The Digital Object

  30. Bootstrap step • Fixed root • ISO based • ISO/IEC 6523-1:1998 (rolodex?) • ISO/IEC 8824-1:1998 • DOI • Handle • PURL • Mutable root • ARK • [http://NMAH/]ark:/NAAN/Name • URN • LSID

  31. Two Forms of ISO Highest-Level Identifiers (from ISO 8824) • 1. iso(1) standard(0) • 2. iso(1) identified-organization(3) • Form 1 • Requirements on the standard, if any, are not currently known • Can ‘standard’ simply define procedures for ID assignments? • Must ‘standard’ explicitly give all identifiers to be used? • Form 2 • “Identified Organization” is to be identified using ISO 6523 • ISO 6523-1,-2 (1998) extensively revised from 1984 version

  32. ISO 6523-1 (1998) • Rules for ICD registration, and usage of 3 additional fields • ICD identifies organization registration system, 1-4 characters (e.g., ICD ‘112’ is system for registering top level standards organizations) • Organization Identifier (OI), up to 35 characters (e.g., ‘4’ assigned to CCSDS) • Optional Organization Part Identifier (OPI), up to 35 characters identifying sub-org., services, or entity (e.g., ‘1’ could be assigned to CCSDS CA Agent) • Optional OPI Source (OPIS), 1 character, identifies ‘who’ assigned the OPI (e.g., ‘1’ says ‘identified organization (CCSDS) assigned the OPI’) • Interpretation of identification ‘string’ under 6523 requires full knowledge of context of usage • Fields can be in any order • No syntax specified
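The four fields and their length limits, as listed on this slide, can be captured in a small Python data class. This is a sketch based only on the limits quoted above, not on the text of the standard; the class name is hypothetical, and no string form is produced because, as the slide notes, no syntax or field order is specified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Iso6523Id:
    icd: str                    # identifies the organization registration system
    oi: str                     # Organization Identifier
    opi: Optional[str] = None   # optional Organization Part Identifier
    opis: Optional[str] = None  # optional OPI Source: who assigned the OPI

    def __post_init__(self):
        # Length limits as quoted on the slide.
        if not 1 <= len(self.icd) <= 4:
            raise ValueError("ICD must be 1-4 characters")
        if not 1 <= len(self.oi) <= 35:
            raise ValueError("OI must be 1-35 characters")
        if self.opi is not None and not 1 <= len(self.opi) <= 35:
            raise ValueError("OPI must be 1-35 characters")
        if self.opis is not None and len(self.opis) != 1:
            raise ValueError("OPIS must be exactly 1 character")


# Values taken from the slide's examples: ICD '112', OI '4' (CCSDS), OPI '1', OPIS '1'.
ccsds_agent = Iso6523Id(icd="112", oi="4", opi="1", opis="1")
```

Keeping the fields separate, rather than concatenating them, avoids committing to a syntax that the standard itself leaves to the context of usage.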

  33. Implications for Registered Identifier Usage • All identifiers are ambiguous without context of usage • No string is globally unique • Need syntax specification including meaning of included fields • In most contexts of usage, full iso string not needed • Sending and receiving parties understand context • May need to broaden context of usage in some cases • Can employ full string • Map into new identifier string syntax and semantics - not automatic

  34. Investigation Status • ICD 112 has been obtained by ISO for identification of standards developing organizations • ICD 112 is under control of ISO JTC1/SC32 (in 2000 the contact was)

  35. Potential ID Construction (abstract level) • Using ISO/ICDs: ID = 1 3 0112 4 “X” • “x” distinguishes among CCSDS defined domains (TBD) • Maintained by CCSDS Secretariat • Using ISO/ISO Standards: ID = 1 0 “X” • “x” is number of ISO standard • [Diagram labels: ICD, OI, OPI, OPIS; “P2 CA ADID services” 1, NSSD 0233 (Panel 2); “P3 SLE services” 1, 5 2…. (Panel 3); “13764”, NSSD 0233 (Panel 2); “P3 Top Level SLE Standard”, 5 2…. (Panel 3)]

  36. ISO 8824-1 Naming Tree

  37. Persistent ID - roles • Who/What (role?) can update a Registry entry in the long term? • Who/What (role?) can access a Registry entry? • Authorisation? • Encryption keys?

  38. What can be relied on? • Organisational/ Procedural/ Sociological issues are important • What can be relied on? • Organisations? • Internet? DNS? • Nothing

  39. Example <reference> <identifier> <value>e1fe9271-cd48-4418-a63e-b112ebf792c7</value> <resolver resolverType="ark">http://foobar.zaf.org/ark:/64269/</resolver> <resolver resolverType="doi">10.123456/</resolver> </identifier> <description>For example – something registered with both ARK and DOI</description> </reference>
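A reference like the one above could be consumed as follows. The Python sketch parses the slide's XML with the standard library and pairs each resolver with the identifier value. Joining the resolver text and the value by simple concatenation is an assumption made for illustration; the slide does not say how the two are combined.

```python
import xml.etree.ElementTree as ET

REFERENCE_XML = """
<reference>
  <identifier>
    <value>e1fe9271-cd48-4418-a63e-b112ebf792c7</value>
    <resolver resolverType="ark">http://foobar.zaf.org/ark:/64269/</resolver>
    <resolver resolverType="doi">10.123456/</resolver>
  </identifier>
  <description>For example, something registered with both ARK and DOI</description>
</reference>
"""


def candidate_identifiers(xml_text: str) -> dict:
    """Map each resolver type to resolver prefix + identifier value."""
    root = ET.fromstring(xml_text)
    identifier = root.find("identifier")
    value = identifier.findtext("value").strip()
    return {
        resolver.get("resolverType"): resolver.text.strip() + value
        for resolver in identifier.findall("resolver")
    }


candidates = candidate_identifiers(REFERENCE_XML)
```

Listing several resolvers for one value gives a client a fallback: if one identifier system stops working, another route to the same object may still resolve.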

  40. Conclusions • There is a need for Persistent Identifiers for persistent objects • There are many systems – some may be more believable than others • None can actually be trusted in the really long term
