File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files

Xiaoming Liu (1), Luda Balakireva (1), Patrick Hochstenbach (2) and Herbert Van de Sompel (1)

(1) Digital Library Research & Prototyping Team

Research Library, Los Alamos National Laboratory

(2) University Library

Ghent University

liu_x@lanl.gov , ludab@lanl.gov , patrick.hochstenbach@ugent.be , herbertv@lanl.gov

Disclaimer
  • The term Digital Object (DO) will be used as in Kahn/Wilensky:
    • Compound object
    • Multiple datastreams of different mime types
    • Secondary information pertaining to object and datastreams
    • Identifiers for object (and datastreams)
  • This is ~ OAIS Content Information
XML-based representation of DOs
  • Growing interest in XML-based representation of DOs in Digital Library architectures:
    • Platform-independence,
    • Industry-support
    • Longevity, potential migration paths
    • Processing tools, validation capabilities
  • XML-based Compound Object formats:
    • ISO/IEC 21000-2 MPEG-21 DID & DIDL
    • METS
    • IMS/CP
    • CCSDS XFDU
  • Typical functionality:
    • By-Value (base64) and/or By-Reference provision of constituent datastreams
    • By-Value and/or By-Reference provision of secondary information
    • Provision of identifiers
Storing XML-based representations of DOs
  • Existing approaches:
    • storage of the XML-representations as individual files in a file system:
      • Poor access performance
      • Poor backup performance
    • storage of the XML-representations in (SQL, XML, object) databases
      • Long term? Data are dependent on the underlying system
    • storage of the XML-representations by concatenating many such documents into a single file such as tar or zip
      • Not XML aware, hence, no use of off-the-shelf XML tools
      • Increasing storage space (base64-encoding of the constituent datastreams)
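The storage cost of By-Value (base64) embedding mentioned above can be estimated with a few lines of Python. This is only an illustrative sketch: the 300 KB random payload is a stand-in chosen here, not a datastream from the talk.

```python
import base64
import os


def base64_overhead(payload: bytes) -> float:
    """Return encoded_size / raw_size for a datastream embedded By-Value."""
    encoded = base64.b64encode(payload)
    return len(encoded) / len(payload)


# Hypothetical 300 KB binary datastream (random bytes, illustration only).
datastream = os.urandom(300 * 1024)
ratio = base64_overhead(datastream)
print(f"base64 size ratio: {ratio:.3f}")
```

Base64 maps every 3 input bytes to 4 output characters, so a By-Value datastream grows by roughly one third before any XML markup is added.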
aDORe XMLtape/ARCfile solution
  • Part of LANL aDORe repository effort:
    • Standards-based, modular repository architecture
      • Distributed architecture
      • Protocol-based interactions between modules
      • Usable to create interoperable federations of heterogeneous repositories
    • Actual implementation of the architecture at LANL
    • Components of aDORe software will be released
  • Inspired by Internet Archive ARC file approach:
    • File-based mechanism to store datastreams resulting from Web-crawling
    • Concatenation of multiple datastreams into a single file
    • Metadata as separators between datastreams
    • But not OK to store XML-based representations of DOs:
      • Metadata capabilities very limited & crawling related
      • Lose power of XML processing tools
aDORe XMLtape/ARCfile solution
  • Two interconnected file-based storage mechanisms:
    • XMLtapes: File storage of XML-based representations of Digital Objects
    • ARCfiles: File storage of constituent datastreams of Digital Objects
  • The ARC files are interconnected with one or more XMLtapes during the ingestion process
  • A protocol-based access mechanism is introduced:
    • XMLtape is exposed as an autonomous OAI-PMH repository
    • ARCfile is exposed as an OpenURL Resolver
  • Write once - Read many:
    • Files remain stable
    • Protocol-based access mechanism remains stable
    • Indexing mechanisms can change as technologies evolve
  • Storage approach is independent from the compound object format used to represent DOs as XML
    • aDORe uses MPEG-21 DIDL
ISO/IEC 21000-2: MPEG-21 DID & DIDL

[Figure: relationships among Digital Item, Digital Item Declaration, DIDL document, MPEG-21 Abstract Model and MPEG-21 DIDL. Edge labels: "has declaration", "based on", "has XML serialization".]

Representing DOs using MPEG-21 DID

[Figure: a Digital Object represented as a package; sample DIDL document.]

aDORe XMLtape
  • An XML file that concatenates the XML-based representations of multiple DOs
  • Structure is defined by an XML Schema
    • http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtape.xsd
    • tape-level administrative section:
      • Open-ended content
      • Plug-in for processing-related information, indication of related ARCfiles:
        • http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtapeBasics.xsd
    • concatenation of records, each of which consists of:
      • record-level administrative section
        • identifier and datestamp of the contained record
        • other record-level administrative information
      • a record (can be from any XML Namespace). DIDL in case of aDORe:
        • http://purl.lanl.gov/aDORe/schemas/2005-08/DIDL.xsd
  • An XMLtape is a valid and well-formed XML file
  • Independent from chosen XML-based Compound Object Format
aDORe XMLtape

<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    ...
  </ta:tapeAdmin>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
      <ta:date>2005-03-29T04:31:22Z</ta:date>
      <ta:recordAdmin>
        ...
      </ta:recordAdmin>
    </ta:tapeRecordAdmin>
    <ta:record>
      <didl:DIDL>...</didl:DIDL>
    </ta:record>
  </ta:tapeRecord>
</ta:tape>

sample XMLtape
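Because an XMLtape is well-formed XML, off-the-shelf parsers can walk it record by record. The sketch below uses Python's standard `xml.etree.ElementTree.iterparse`; it is not the aDORe implementation. The `ta` namespace comes from the sample above, while the `urn:example:didl` namespace and the empty `tapeAdmin` are placeholders assumed here so the test document is well-formed.

```python
import io
import xml.etree.ElementTree as ET

TA = "http://library.lanl.gov/2005-08/aDORe/XMLtape/"  # namespace from the sample


def iter_tape_records(source):
    """Yield (identifier, date, record_element) for each ta:tapeRecord."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == f"{{{TA}}}tapeRecord":
            admin = elem.find(f"{{{TA}}}tapeRecordAdmin")
            identifier = admin.findtext(f"{{{TA}}}identifier")
            date = admin.findtext(f"{{{TA}}}date")
            record = elem.find(f"{{{TA}}}record")
            yield identifier, date, record
            elem.clear()  # drop the processed record to keep memory flat


# Placeholder tape: "urn:example:didl" is an assumed namespace, not the real one.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/"
         xmlns:didl="urn:example:didl">
  <ta:tapeAdmin/>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
      <ta:date>2005-03-29T04:31:22Z</ta:date>
    </ta:tapeRecordAdmin>
    <ta:record><didl:DIDL/></ta:record>
  </ta:tapeRecord>
</ta:tape>"""

records = [(i, d) for i, d, _rec in iter_tape_records(io.BytesIO(SAMPLE))]
print(records)
```

Streaming with `iterparse` avoids loading a multi-gigabyte tape into memory at once.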

aDORe XMLtape index

[Figure: an XMLtape shown as a sequence of records, with an index whose entries hold each record's identifier and datestamp of ingestion, taken from <ta:tapeRecordAdmin>.]

Indexing:

  • Can be achieved with a variety of technologies
  • Current implementation: Berkeley DB Java Edition
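One way to picture such an index is a map from record identifier to datestamp and byte range in the tape file. The sketch below keeps that map in a plain Python dictionary instead of Berkeley DB Java Edition, and it locates records with regular expressions that assume the `ta` prefix is written exactly as in the sample tape; both choices are simplifications made here for illustration.

```python
import re

# Assumes the literal "ta:" prefix; a robust indexer would use an XML parser.
RECORD_RE = re.compile(rb"<ta:tapeRecord>.*?</ta:tapeRecord>", re.DOTALL)
ID_RE = re.compile(rb"<ta:identifier>(.*?)</ta:identifier>", re.DOTALL)
DATE_RE = re.compile(rb"<ta:date>(.*?)</ta:date>", re.DOTALL)


def build_index(tape: bytes) -> dict:
    """Map identifier -> (datestamp, byte offset, byte length) of its record."""
    index = {}
    for match in RECORD_RE.finditer(tape):
        chunk = match.group(0)
        identifier = ID_RE.search(chunk).group(1).strip().decode("utf-8")
        date = DATE_RE.search(chunk).group(1).strip().decode("utf-8")
        index[identifier] = (date, match.start(), len(chunk))
    return index


def fetch(tape: bytes, index: dict, identifier: str) -> bytes:
    """Return the raw bytes of one tapeRecord using only the index."""
    _date, offset, length = index[identifier]
    return tape[offset:offset + length]


# Tiny hypothetical tape with two records ("id-1", "id-2" are made-up identifiers).
TAPE = (
    b'<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">'
    b"<ta:tapeRecord><ta:tapeRecordAdmin>"
    b"<ta:identifier>id-1</ta:identifier><ta:date>2005-03-29T04:31:22Z</ta:date>"
    b"</ta:tapeRecordAdmin><ta:record><r/></ta:record></ta:tapeRecord>"
    b"<ta:tapeRecord><ta:tapeRecordAdmin>"
    b"<ta:identifier>id-2</ta:identifier><ta:date>2005-03-30T10:00:00Z</ta:date>"
    b"</ta:tapeRecordAdmin><ta:record><r/></ta:record></ta:tapeRecord>"
    b"</ta:tape>"
)

index = build_index(TAPE)
print(fetch(TAPE, index, "id-2"))
```

Since the tape is written once and never modified, the offsets stay valid and the index can be discarded and rebuilt with a different technology at any time.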

aDORe XMLtape as OAI-PMH repository

[Figure: an OAI-PMH request is answered from the XMLtape through its index; the response carries a DIDL document.]

  • OAI-PMH identifier = identifier from <ta:tapeRecordAdmin>
  • OAI-PMH datestamp = datetime from <ta:tapeRecordAdmin>
  • OAI-PMH response = content of <ta:record>
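From a client's point of view, an XMLtape exposed this way is queried like any other OAI-PMH repository. The sketch below builds standard `GetRecord` and `ListRecords` request URLs with Python's `urllib.parse`; the base URL `http://example.org/xmltape/oai` and the `didl` metadata prefix are assumptions made for illustration, not values from the talk.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://example.org/xmltape/oai"  # hypothetical XMLtape endpoint
METADATA_PREFIX = "didl"                     # hypothetical metadataPrefix


def get_record_url(identifier: str) -> str:
    """Request one record by the identifier stored in ta:tapeRecordAdmin."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": METADATA_PREFIX,
    })
    return f"{BASE_URL}?{query}"


def list_records_url(from_date=None, until_date=None) -> str:
    """Harvest records selectively by datestamp (from/until are optional)."""
    params = {"verb": "ListRecords", "metadataPrefix": METADATA_PREFIX}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return f"{BASE_URL}?{urlencode(params)}"


url = get_record_url("oai:aps.org:PhysRevA.71.040101")
print(url)
print(list_records_url(from_date="2005-03-29"))
print(parse_qs(urlsplit(url).query))
```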

Internet Archive ARCfile
  • Concatenation of binary files
  • Designed and used by the Internet Archive (Wayback machine)
    • > 400 TB web data
  • Under revision by the International Internet Preservation Consortium (IIPC): WARC file format
    • Input from LANL to facilitate non-Web-crawling use case
  • The ARC file format is structured as follows:
    • file header that provides administrative information about the ARC file itself
    • a sequence of document records, consisting of:
      • a header line containing some, mainly crawl-related, metadata.
        • URI of the crawled document
        • timestamp of acquisition of the data
        • size of the data block
      • a response to a protocol request such as an HTTP GET
Internet Archive ARC file

filedesc://IA-001102.arc 0 19960923142103 text/plain 76
1 0 Alexa Internet
URL IP-address Archive-date Content-type Archive-length

http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30

<HTML>
Hello World!!!
</HTML>

sample ARC file
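The record header line in the sample follows the field order declared in the file header (URL IP-address Archive-date Content-type Archive-length). A minimal Python sketch for splitting such a line is shown below; it assumes single-space separators and a content type without embedded spaces, and the names `ArcHeader`, `parse_arc_header` and `acquired_at` are invented here.

```python
from datetime import datetime
from typing import NamedTuple


class ArcHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: str   # 14-digit timestamp, YYYYMMDDhhmmss
    content_type: str
    archive_length: int  # size of the data block in bytes


def parse_arc_header(line: str) -> ArcHeader:
    """Split one ARC record header line into its five fields."""
    url, ip_address, archive_date, content_type, length = line.strip().split(" ")
    return ArcHeader(url, ip_address, archive_date, content_type, int(length))


def acquired_at(header: ArcHeader) -> datetime:
    """Convert the archive date field into a datetime object."""
    return datetime.strptime(header.archive_date, "%Y%m%d%H%M%S")


header = parse_arc_header(
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202"
)
print(header)
print(acquired_at(header))
```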

Internet Archive ARC file in aDORe

filedesc://singletape.arc 0.0.0.0 20050922142103 text/plain 76
1 0 Internet Archive
URL IP-address Archive-date Content-type Archive-length

info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a 0.0.0.0 20050907221344 application/pdf 415025
%PDF-1.3
%âãÏÓ
290 0 obj
<<
/Linearized 1
/O 295
/H [ 3642 1057 ]
/L 415025

sample aDORe ARC file

Internet Archive ARC file index

[Figure: an ARC file shown as a sequence of datastreams, with an ARC index whose entries are keyed by the URL field of each record header (URL IP-address Archive-date Content-type Archive-length).]

Indexing:

  • Can be achieved with a variety of technologies
  • Current implementation in aDORe: Heritrix toolkit
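The idea behind such an index can be sketched in plain Python: walk the file header by header, use the length field to jump over each data block, and remember where every datastream starts. This is an illustration only and does not use the Heritrix toolkit; it assumes uncompressed ARC records, each followed by a single newline, and the helper `make_record` and the `info:example/ds/...` identifiers are invented for the demonstration.

```python
def build_arc_index(arc: bytes) -> dict:
    """Map each record URL to (offset of its data block, length in bytes)."""
    index = {}
    pos = 0
    while pos < len(arc):
        if arc[pos:pos + 1] == b"\n":  # blank separator between records
            pos += 1
            continue
        eol = arc.index(b"\n", pos)
        fields = arc[pos:eol].decode("utf-8").split(" ")
        url, length = fields[0], int(fields[-1])
        data_start = eol + 1
        index[url] = (data_start, length)
        pos = data_start + length  # skip the data block without reading it
    return index


def make_record(url: str, content_type: str, payload: bytes) -> bytes:
    """Hypothetical helper: serialize one record the way the samples show."""
    header = f"{url} 0.0.0.0 20050907221344 {content_type} {len(payload)}\n"
    return header.encode("utf-8") + payload + b"\n"


VERSION_BLOCK = (
    b"1 0 Internet Archive\n"
    b"URL IP-address Archive-date Content-type Archive-length\n"
)
PDF = b"%PDF-1.3\nfake body\n"
TXT = b"hello datastream"
arc = (
    make_record("filedesc://singletape.arc", "text/plain", VERSION_BLOCK)
    + make_record("info:example/ds/1", "application/pdf", PDF)
    + make_record("info:example/ds/2", "text/plain", TXT)
)

arc_index = build_arc_index(arc)
start, length = arc_index["info:example/ds/2"]
print(arc[start:start + length])
```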

ARC file as OpenURL Resolver

[Figure: an OpenURL request is resolved through the index to a datastream stored in the ARC file.]

  • Referent Identifier = datastream identifier = URL from ARC record header
  • Resolver Identifier = identifier of ARC file

Associating an XMLtape with ARC Files (1)
  • A Digital Object is represented using an XML-based Complex Object format (e.g. MPEG-21 DID)
  • The resulting package (e.g. DIDL document) is stored in an XMLtape
  • Constituent datastreams of the Digital Object are provided By-Reference:
    • Using the ref attribute of the Resource element in MPEG-21 DID
    • The value of the network location of the constituent datastream is compliant with the NISO OpenURL Framework:

baseURL(ARCfile OpenURL Resolver)?

url_ver = Z39.88-2004 &

rft_id = Datastream Identifier &

res_id = ARCfile identifier
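The pattern above can be assembled and taken apart with Python's standard `urllib.parse`. In this sketch the resolver base URL and the two identifiers are copied from the DIDL extract in this presentation; the function names are invented here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

RESOLVER = "http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver"


def datastream_openurl(base_url: str, datastream_id: str, arcfile_id: str) -> str:
    """Build the By-Reference location of a datastream held in an ARC file."""
    query = urlencode({
        "url_ver": "Z39.88-2004",
        "rft_id": datastream_id,  # Referent Identifier: the datastream
        "res_id": arcfile_id,     # Resolver Identifier: the ARC file
    })
    return f"{base_url}?{query}"


def parse_datastream_openurl(url: str):
    """Recover (datastream_id, arcfile_id) from such an OpenURL."""
    params = parse_qs(urlsplit(url).query)
    return params["rft_id"][0], params["res_id"][0]


openurl = datastream_openurl(
    RESOLVER,
    "info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b",
    "info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2",
)
print(openurl)
print(parse_datastream_openurl(openurl))
```

Note that `urlencode` percent-encodes the colons and slashes inside the `info:` identifiers; `parse_qs` reverses this on the resolver side.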

Associating an XMLtape with ARC Files (1)

<?xml version="1.0" encoding="UTF-8"?>
<didl:DIDL>
  ……
  <didl:Component id="uuid-ddec9dbb-90e5-4b8a-93f3-dd1c8b781547">
    <didl:Descriptor>
      <didl:Statement mimeType="application/xml; charset=utf-8">
        <dii:Identifier … >
          info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b
        </dii:Identifier>
      </didl:Statement>
    </didl:Descriptor>
    <didl:Resource mimeType="application/pdf"
      ref="http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver?
        url_ver=Z39.88-2004
        &amp;res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2
        &amp;rft_id=info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b"/>
  </didl:Component>
  ……
</didl:DIDL>

Extract from DIDL
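A consumer of such a DIDL document can collect the By-Reference locations and recover the datastream and ARC file identifiers from them. The sketch below matches elements by local name so it does not depend on the DIDL namespace; the `urn:example:didl` namespace in the test document is a placeholder assumed here, and the ref is written on one line.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit


def local_name(tag: str) -> str:
    """Strip the '{namespace}' part that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def resource_refs(didl_xml: str):
    """Return (mimeType, ref) for every Resource element that has a ref."""
    root = ET.fromstring(didl_xml)
    return [
        (el.get("mimeType"), el.get("ref"))
        for el in root.iter()
        if local_name(el.tag) == "Resource" and el.get("ref")
    ]


# Placeholder document; "urn:example:didl" is not the real DIDL namespace.
DIDL = (
    '<didl:DIDL xmlns:didl="urn:example:didl"><didl:Item><didl:Component>'
    '<didl:Resource mimeType="application/pdf" '
    'ref="http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver'
    "?url_ver=Z39.88-2004"
    "&amp;res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2"
    '&amp;rft_id=info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b"/>'
    "</didl:Component></didl:Item></didl:DIDL>"
)

refs = resource_refs(DIDL)
for mime_type, ref in refs:
    params = parse_qs(urlsplit(ref).query)
    print(mime_type, params["rft_id"][0], params["res_id"][0])
```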

Associating an XMLtape with ARC Files (2)
  • An XMLtape is associated with its corresponding ARCfiles through a plug-in for the XMLtape-level administrative section.
Associating an XMLtape with ARC Files (2)

<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
      <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
      <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
      <tb:processSoftware>gov.lanl.xmltape.SingleTapeWriter</tb:processSoftware>
      <tb:processTime>2005-09-07T22:13:39Z</tb:processTime>
    </tb:XMLtapeBasics>
  </ta:tapeAdmin>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
    ...
</ta:tape>

XMLtape header
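Reading this association back out of a tape takes only a namespace-aware lookup. The sketch below uses Python's standard ElementTree with the two namespaces shown in the header; the test document keeps only the administrative section and omits the records.

```python
import xml.etree.ElementTree as ET

NS = {
    "ta": "http://library.lanl.gov/2005-08/aDORe/XMLtape/",
    "tb": "http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/",
}


def tape_associations(tape_xml: str):
    """Return (XMLtape identifier, list of associated ARC file identifiers)."""
    root = ET.fromstring(tape_xml)
    basics = root.find("ta:tapeAdmin/tb:XMLtapeBasics", NS)
    tape_id = basics.findtext("tb:XMLtapeId", namespaces=NS)
    arc_ids = [el.text for el in basics.findall("tb:ARCfileId", NS)]
    return tape_id, arc_ids


HEADER_ONLY_TAPE = """
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
      <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
      <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
    </tb:XMLtapeBasics>
  </ta:tapeAdmin>
</ta:tape>
"""

tape_id, arc_ids = tape_associations(HEADER_ONLY_TAPE.strip())
print(tape_id, arc_ids)
```

`findall` is used for `tb:ARCfileId` so that a tape associated with more than one ARC file would return every identifier.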

[Figure: overall resolution flow between an agent, an Identifier Locator, XMLtapes and ARC files. Labels: the Identifier Locator indexes DIDLDocument-ids (or content-ids) and datastream-ids with their creation datetime and returns a list of (baseURL, DIDLDocument-id); the XMLtape index maps a DIDLDocument-id to a DIDL document; each ref in that document is an OpenURL; the ARC file index maps a datastream-id to a datastream.]

Implementation
  • XMLtapes:
    • Berkeley DB Java Edition
    • OCLC OAICat
  • ARCfiles:
    • Heritrix
    • OCLC OpenURL software
  • XMLtape Registry
    • MySQL db
    • OCLC OAICat
  • ARCfile Registry:
    • MySQL db
    • OCLC OAICat
Performance indicators
  • System:
    • Model: Dell 2650 2U rack-mount server
    • CPU: dual 2.8 GHz Intel Xeon processors
    • RAM: 5 GB
    • Disks: 10k RPM SCSI disks
  • XMLtape:
    • 1786 MB, 201872 DIDL records
    • download 100 consecutive DIDL records (787 KB) => 0.18 second
    • download static file of same size => 0.09 second
  • ARCfile:
    • 272 MB,  4910 files
    • download a sample PDF file (312 KB) => 0.24 second
    • download static file of same size => 0.036 second
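The figures above can be put side by side with a little arithmetic. The short Python calculation below only restates the reported numbers; treating 1 MB as 1024 KB is an assumption made here.

```python
# Reported timings in seconds (from the list above).
xmltape_s, static_xml_s = 0.18, 0.09
arcfile_s, static_pdf_s = 0.24, 0.036

# Slowdown of protocol-based access relative to serving a static file.
xmltape_ratio = xmltape_s / static_xml_s
arcfile_ratio = arcfile_s / static_pdf_s

# Average DIDL record size in the 1786 MB tape (assumes 1 MB = 1024 KB).
avg_record_kb = 1786 * 1024 / 201872

# Effective throughput when downloading 100 records (787 KB) from the tape.
xmltape_kb_per_s = 787 / xmltape_s

print(f"XMLtape vs static file: {xmltape_ratio:.1f}x")
print(f"ARCfile vs static file: {arcfile_ratio:.1f}x")
print(f"average DIDL record: {avg_record_kb:.2f} KB")
print(f"XMLtape throughput: {xmltape_kb_per_s:.0f} KB/s")
```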
Software
  • Software - ARC files:
    • Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. http://crawler.archive.org/
    • NetArchive.dk: a project that plans for the preservation of Denmark's cultural heritage on the Internet for future generations. http://www.netarchive.dk/
    • Many other tools: http://archive-access.sourceforge.net
  • XMLtapes:
    • Perl tool, XML::Tape (LANL & Ghent University), http://search.cpan.org/~hochsten/XML-Tape/
  • Combined aDORe XMLtape/ARCfile environment:
    • Java tool (LANL), soon to be released on SourceForge
Conclusion
  • The file-based approach is inherently simple and reduces dependency on database systems.
  • Because the indexes are separate from the files, the files can be retained unchanged over time while the indexes are rebuilt with other techniques as technologies evolve.
  • Protocol-based access introduces another layer of abstraction, which increases flexibility as technologies evolve.
  • The XMLtape approach is inspired by the ARC file format, but provides several additional attractive features:
    • Off-the-shelf XML tools can be used to parse/validate an XMLtape
    • All DO metadata can be stored in XML-based compound object format

Presentation available via http://public.lanl.gov/herbertv/

Install TSCC codec for avi movies