Collection-based Persistent Archives

Reagan W. Moore

Associate Director, Data Intensive Computing

San Diego Supercomputer Center

moore@sdsc.edu

http://www.npaci.edu/DICE

Topics
  • Lessons learned building a prototype Persistent Archive
    • Information model
    • Hierarchical levels of information
    • Interoperability mechanisms
  • Application to workshop topics
    • Ingestion methodology
    • Data set identification
    • Certification of archives
Persistent Archive Goals
  • Provide a collection-based archive
    • Data set relevance is organized by the collection
  • Provide information model to describe the context for the data collection
    • Enough information is needed to be able to dynamically create the collection from archived information
  • Decouple collection creation from digital object archiving
  • Provide accessioning system to turn data sets into digital objects
    • Accessioning is independent of the final collection
NARA Persistent Archive Prototype
  • Demonstrate the ability to ingest, archive, recreate, query, and present a digital object from a 1-million-record e-mail collection (RFC 1036 message format)
    • 2.5 GB of data
    • 6 required fields
    • 13 optional fields
    • User defined fields (over 1000)
  • Determine information model needed for persistent archive
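The ingestion step for such a collection can be sketched as follows. This is a minimal illustration, not the prototype's code: the header names are those defined by RFC 1036 (6 required, 13 optional), while the XML element names (`digital-object`, `field`, `body`) and the sample message are assumptions made for the example.

```python
from email.parser import Parser
import xml.etree.ElementTree as ET

# Header names as defined in RFC 1036; everything else here is hypothetical.
REQUIRED = ["From", "Date", "Newsgroups", "Subject", "Message-ID", "Path"]
OPTIONAL = ["Followup-To", "Expires", "Reply-To", "Sender", "References",
            "Control", "Distribution", "Keywords", "Summary", "Approved",
            "Lines", "Xref", "Organization"]
REQUIRED_LC = {h.lower() for h in REQUIRED}
OPTIONAL_LC = {h.lower() for h in OPTIONAL}


def to_digital_object(raw_message: str) -> ET.Element:
    """Wrap one message in attribute tags so it becomes a digital object."""
    msg = Parser().parsestr(raw_message)
    missing = [h for h in REQUIRED if msg[h] is None]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    obj = ET.Element("digital-object")
    for name, value in msg.items():
        if name.lower() in REQUIRED_LC:
            kind = "required"
        elif name.lower() in OPTIONAL_LC:
            kind = "optional"
        else:
            kind = "user-defined"  # any header outside the two fixed lists
        field = ET.SubElement(obj, "field", name=name, kind=kind)
        field.text = value
    body = ET.SubElement(obj, "body")
    body.text = msg.get_payload()
    return obj


# Hypothetical sample record.
SAMPLE = (
    "From: alice@example.org\n"
    "Date: Fri, 19 Nov 1982 16:14:55 GMT\n"
    "Newsgroups: net.test\n"
    "Subject: hello\n"
    "Message-ID: <1@example.org>\n"
    "Path: example!alice\n"
    "Organization: Example Org\n"
    "X-Project: demo\n"
    "\n"
    "Body text.\n"
)

if __name__ == "__main__":
    print(ET.tostring(to_digital_object(SAMPLE), encoding="unicode"))
```

Classifying every unknown header as user-defined is one way to accommodate the open-ended set of user fields without fixing the schema in advance.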
Key Concepts Learned
  • Information model
    • Semi-structured representation of information - XML
    • Infrastructure independent representation of information context - XML DTD
    • Differentiation between information context for digital objects, collection, and presentation
      • DTD for objects
      • DTD for collection
      • XSL style sheets for presentation
    • Instantiation software for creating the collection from the information model
      • XML databases now appearing
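The idea of instantiation software can be sketched with a small example that builds a queryable collection from an information model plus tagged objects. It is an assumption-laden illustration: the model is a plain dictionary rather than an XML DTD, the store is an in-memory SQLite database, and all names (`COLLECTION_MODEL`, `instantiate_collection`, the sample objects) are hypothetical.

```python
import sqlite3

# Hypothetical collection information model: a name plus the attributes to index.
COLLECTION_MODEL = {
    "name": "email_collection",
    "attributes": ["subject", "sender", "date"],
}

# Hypothetical tagged digital objects, as they might come out of accessioning.
OBJECTS = [
    {"object_id": "obj-1",
     "tags": {"subject": "hello", "sender": "alice@example.org", "date": "1982-11-19"}},
    {"object_id": "obj-2",
     "tags": {"subject": "re: hello", "sender": "bob@example.org", "date": "1982-11-20"}},
]


def instantiate_collection(model, objects):
    """Create the collection's table from the model, then load the objects."""
    conn = sqlite3.connect(":memory:")
    table = model["name"]
    columns = ", ".join(f'"{a}" TEXT' for a in model["attributes"])
    conn.execute(f'CREATE TABLE "{table}" (object_id TEXT PRIMARY KEY, {columns})')
    placeholders = ", ".join("?" for _ in range(len(model["attributes"]) + 1))
    for obj in objects:
        row = [obj["object_id"]] + [obj["tags"].get(a) for a in model["attributes"]]
        conn.execute(f'INSERT INTO "{table}" VALUES ({placeholders})', row)
    conn.commit()
    return conn


conn = instantiate_collection(COLLECTION_MODEL, OBJECTS)

if __name__ == "__main__":
    query = "SELECT object_id FROM email_collection WHERE sender = ?"
    print(conn.execute(query, ("alice@example.org",)).fetchall())
```

Because the table definition is derived from the model, the same model could be replayed against a different database technology later.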
Hierarchy of Information Contexts
  • Digital object context
    • Meta-data to define the structure of the object
    • When publishing a digital object, must also publish the context of the object
  • Use collections to organize objects
    • Meta-data to define the structure of the collection
    • When publishing a collection, must also publish the information needed to organize the collection.
  • Use presentation context to control access
    • Meta-data to define structure of presentation
Key Concepts Learned
  • Digital object encapsulation
    • Minimize the number of times a digital object must be touched
    • Once archived, a digital object should only be retrieved when requested by a user
  • Implies meta-data stored with digital objects should only describe the objects
  • Collection and presentation meta-data should be stored separately
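A minimal sketch of this separation, under the assumption that the archive and the collection catalog are simple in-memory structures (the class and attribute names are hypothetical): the collection is reorganized using catalog meta-data only, so no archived object is touched until a user asks for one.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Holds digital objects together with object-level meta-data only."""
    objects: dict = field(default_factory=dict)
    retrievals: int = 0  # number of times an archived object was touched

    def put(self, object_id, data, object_metadata):
        self.objects[object_id] = (data, dict(object_metadata))

    def get(self, object_id):
        self.retrievals += 1
        return self.objects[object_id][0]


@dataclass
class CollectionCatalog:
    """Collection meta-data, stored separately from the archive."""
    attributes: dict = field(default_factory=dict)  # object_id -> attribute dict

    def register(self, object_id, attrs):
        self.attributes[object_id] = dict(attrs)

    def reorganize(self, sort_key):
        # Reordering reads catalog data only; the archive is not consulted.
        return sorted(self.attributes, key=lambda oid: self.attributes[oid][sort_key])


archive = Archive()
catalog = CollectionCatalog()
archive.put("obj-1", b"first message", {"format": "text"})
archive.put("obj-2", b"second message", {"format": "text"})
catalog.register("obj-1", {"subject": "zebra"})
catalog.register("obj-2", {"subject": "apple"})
order = catalog.reorganize("subject")
```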
Persistent Archive Requirements
  • Distributed environment to ensure separable components
    • Accession workbench
    • Archive
    • Presentation platform
  • Data handling mechanisms for interoperability as basis for system evolution
    • No tightly coupled systems
    • Unique names are only used by the data handling system
    • Use of containers to aggregate digital objects for storage
    • Implies a hierarchical naming scheme
      • Collection / container / digital object
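The hierarchical naming scheme can be illustrated with a small sketch. It assumes a slash-separated logical name and an internal identifier generated with `uuid`; both choices, and the class name `DataHandlingSystem`, are hypothetical and only show that users work with logical names while unique names stay inside the data handling system.

```python
import uuid
from typing import NamedTuple


class LogicalName(NamedTuple):
    collection: str
    container: str
    digital_object: str


def parse_logical_name(path: str) -> LogicalName:
    """Split 'collection/container/object' into its three levels."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected collection/container/object, got {path!r}")
    return LogicalName(*parts)


class DataHandlingSystem:
    """Maps logical names to unique IDs that are never exposed as names."""

    def __init__(self):
        self._ids = {}

    def register(self, path: str) -> str:
        name = parse_logical_name(path)
        if name not in self._ids:
            self._ids[name] = uuid.uuid4().hex
        return self._ids[name]

    def resolve(self, path: str) -> str:
        return self._ids[parse_logical_name(path)]
```

A caller would register `"email/container-0001/msg-42"` once and afterwards refer to the object by that logical name only.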
Electronic Records Archive (ERA)

[Architecture diagram; only its labels survive in this transcript. Stages: Transfer, Accession, Archives, Reference. Inputs and media: FTP, tape, CD, disk. Components: Accessioning Workbench (snap-in), Reference Workbench (snap-in), Media Handlers, Unwrapper, Wrapper, Catalog, Metadata Repository, Records Repository, Arrangement, ARC, Query & Reference Tools, Retrieve Records, Presentation, Order Fulfillment, Internet, Intranet. Record types: text, image, photo, video, audio, geographical information system, compound records, web, database. Further labels: record, metadata wrapper.]

Federation of Data Collections into Digital Libraries

[Diagram; only its labels survive in this transcript. Collections and libraries shown: CEED / ESA, NASA Catalog, DPOSS Sky Survey, REINAS, 2MASS Sky Survey, U Md Archive, ADL, NS Dig Lib, Elib - Flora, Wash. Brain Image, UCLA Brain Image, MSU Brain Image, UCSD Neuroscience, ESS Dig Lib, MS Dig Lib, Wash U Genome, U H Mol Trajectory, Protein Data Bank, UC, Calif Finding Aids, NARA Persistent Archive, AMICO Image Library, UMDL Social Science, U Wisc. Video Lib., Pacific Rim DL.]

Conclusions
  • Ingestion
    • Infrastructure independent representation for digital objects
    • Infrastructure independent representation for information model
      • Turn data sets into digital objects by adding attribute tags
    • Aggregate digital objects in containers for storage
Conclusions
  • Data set identification
    • Unique names only required by data handling system
      • Attribute based access through collection
    • Hierarchical naming
      • Collection / Container / Digital object
      • Finding Aid for collection / Data handling system ID for container / Unique ID for object
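Attribute-based access over the three naming levels might look like the sketch below. The catalog rows, attribute names, and identifiers are invented for illustration; the point is only that a user queries by attributes and receives the finding aid, container ID, and object ID without ever supplying a unique name.

```python
# Hypothetical catalog: each row pairs attributes with a three-level name
# (finding aid for the collection, container ID, unique object ID).
CATALOG = [
    {"attrs": {"sender": "alice@example.org", "year": 1982},
     "name": ("email-finding-aid", "container-0001", "obj-17")},
    {"attrs": {"sender": "bob@example.org", "year": 1982},
     "name": ("email-finding-aid", "container-0001", "obj-18")},
    {"attrs": {"sender": "alice@example.org", "year": 1983},
     "name": ("email-finding-aid", "container-0002", "obj-51")},
]


def find(catalog, **criteria):
    """Return the three-level names of all objects matching every criterion."""
    return [
        row["name"]
        for row in catalog
        if all(row["attrs"].get(key) == value for key, value in criteria.items())
    ]


matches = find(CATALOG, sender="alice@example.org", year=1983)
```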
Conclusion
  • Certification of persistent archive
    • Demonstrate that the archive can provide an infrastructure-independent representation for
      • Finding aids for locating collections
      • Information model for building collection
      • Data handling system container IDs for storage access
      • Digital object attribute tags
    • Demonstrate that the information models can be used to create finding aids, collections, and access interfaces on new technology
    • Demonstrate that any component of the architecture can be migrated independently
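One way to picture a certification check is a round-trip test: write the information model to an infrastructure-independent form, read it back as if on a new platform, and confirm nothing was lost. The sketch below assumes a very small model (a name and typed attributes) and hypothetical element names; a real certification would cover far more than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical information model for a collection.
MODEL = {
    "name": "email_collection",
    "attributes": [
        {"name": "subject", "type": "string"},
        {"name": "date", "type": "date"},
    ],
}


def model_to_xml(model: dict) -> str:
    """Serialize the model to XML text (element names are assumptions)."""
    root = ET.Element("collection-model", name=model["name"])
    for attr in model["attributes"]:
        ET.SubElement(root, "attribute", name=attr["name"], type=attr["type"])
    return ET.tostring(root, encoding="unicode")


def xml_to_model(text: str) -> dict:
    """Rebuild the model from its XML text, as a new platform would."""
    root = ET.fromstring(text)
    return {
        "name": root.get("name"),
        "attributes": [
            {"name": a.get("name"), "type": a.get("type")}
            for a in root.findall("attribute")
        ],
    }


def certify_round_trip(model: dict) -> bool:
    """True if the model survives serialization and re-instantiation unchanged."""
    return xml_to_model(model_to_xml(model)) == model
```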
Further Information

http://www.npaci.edu/DICE

Context Based Objects
  • For data to be useful, the context must be defined
    • Data format - binary/integer representation
    • Physical meaning - units
    • Structure - geometry
    • Relevance - feature annotation
    • Semantics - data dictionary for attributes
  • Context is preserved as meta-data attributes
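The need for context can be made concrete with a short sketch: the same bytes decode to meaningful values only when the format and structure meta-data are known. The attribute names (`byte_order`, `type`, `units`, `shape`) and the sample values are assumptions chosen for the example.

```python
import struct

# Hypothetical context meta-data preserved alongside the raw data.
CONTEXT = {
    "byte_order": "big",   # data format
    "type": "float32",     # data format
    "units": "kelvin",     # physical meaning
    "shape": (2, 2),       # structure
}

# Raw bytes as they might sit in the archive.
RAW = struct.pack(">4f", 271.5, 272.0, 273.25, 274.0)


def decode(raw: bytes, context: dict):
    """Interpret raw bytes as a 2-D grid using only the context attributes."""
    order = {"big": ">", "little": "<"}[context["byte_order"]]
    code = {"float32": "f", "int32": "i"}[context["type"]]
    rows, cols = context["shape"]
    flat = struct.unpack(f"{order}{rows * cols}{code}", raw)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


grid = decode(RAW, CONTEXT)
```

Decoding the same bytes with the wrong byte order or type yields different numbers, which is why the context has to travel with the data.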
Information Models for Organization of Data

Digital Object Attributes

Collection Attributes

Presentation Attributes

Information Models for Access to Data

Presentation of data from multiple digital libraries

Collections from federated databases

Digital object Attributes

Common Information Model
  • eXtensible Markup Language (XML)
    • Use tags to define semantic context for components of the data set
  • Document Type Definition (DTD)
    • Provides semi-structured representation for organizing tags that can be applied to groups of digital objects
  • Development of standards for tags
    • Digital sky, Protein Data Bank, Neuroscience brain images
    • California Digital Library - Art Museum Image Consortium
Information Management Hierarchy
  • Presentation / Information Discovery / Analysis
    • Visualization - Shastra, 3D visualization tools
    • Presentation information model - XSL style sheet
  • Collection organization
    • Meta-data catalog - MCAT
    • Collection information model - XML DTD
  • Data handling
    • Storage Resource Broker - SRB
  • Storage
    • Archival storage system - HPSS
    • Digital object model - XML DTD
Open Grid Architecture to Encourage Interoperability

[Diagram; only its labels survive in this transcript. Components: Application, Data Model Management, Remote Procedure Execution, Information Discovery, Data Handling Systems, Dynamic Info Discovery, Storage System Description, Storage Resources.]

Technology Sources
  • Archive Community
    • IEEE Mass Storage Systems Technical Committee
    • Scalable storage systems
  • Digital Library Community
    • NSF Digital Library Initiative, Phase II
    • Information management mediation - XML
  • Supercomputer Community
    • Scalable analysis platforms
  • Grid Forum
    • Data handling systems for interoperability
  • Archivist Community / Library Community
    • Management policies and standards
Technology Sources

[Diagram; only its labels survive in this transcript. Components: Application, Data Model Management, Remote Procedure Execution, Information Discovery, Data Handling Systems, Dynamic Info Discovery, Storage System Description, Storage Resources. Technology sources marked on the diagram: Digital Library, Computational Grid.]

Information Management Architecture
  • Digital library community technologies
    • Distributed information resources
      • Digital library interoperability protocols - SDLIP
      • Mediation of information using XML - MIX
  • Grid Forum technologies
    • Support for distributed services / procedures
    • Inter-realm authentication
      • GSI Grid Security Infrastructure
    • Data handling system
      • Storage Resource Broker, Meta-data Catalog
Grid Forum Data Access Architecture

[Diagram; only its labels survive in this transcript, and the association of some technology lists with components is not recoverable.]
  • Application (+ authentication, + authorization)
  • API that provides “glue” to underlying data handling systems (security, scheduling, QoS, access protocol, data format/model, adaptivity, info discovery, location control)
  • Data Model Management
  • Remote Procedure Execution
  • Information Discovery
  • Data Handling Systems
  • Dynamic Info Discovery
  • Storage System Description
  • API that provides “glue” to underlying storage, QoS, etc. [GASS, IBP, SRB]
  • Storage Resources
  • Technologies named on the diagram
    • Armada, D’agents, FEL, ADR
    • GRAM, SRB, Java, CORBA
    • LDAP, Database, Flat file, Object database
    • Condor, GASS, NILE, SRB, I-2 caching, ADR
    • DPSS, DFS, NFS
    • HPSS, ADSM, DMF, Unitree, NASstore, DB2, Oracle, Informix, Sybase, O2, ObjectStore, Objectivity
    • GloPerf, Netlogger, NWS
    • DTD, ADR, object class

Data Handling System Capabilities
  • SDSC Storage Resource Broker
    • Protocol transparency
      • Common API for access to remote data resources
      • Explicit drivers for each type of storage system
    • Name transparency
      • Attribute based access to data
    • Location transparency
      • Distribution of collection across multiple physical resources
    • Time transparency
      • Minimization of latency for data access
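Protocol transparency, a common API with an explicit driver per storage system, can be sketched as below. This is not the SRB interface: the class names (`StorageDriver`, `MemoryDriver`, `LocalFileDriver`, `Broker`) and the way logical names map to physical paths are assumptions made for illustration.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Common interface; one concrete driver per type of storage system."""

    @abstractmethod
    def write(self, physical_path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, physical_path: str) -> bytes: ...


class MemoryDriver(StorageDriver):
    def __init__(self):
        self._blobs = {}

    def write(self, physical_path, data):
        self._blobs[physical_path] = data

    def read(self, physical_path):
        return self._blobs[physical_path]


class LocalFileDriver(StorageDriver):
    def __init__(self, root):
        self.root = root

    def write(self, physical_path, data):
        with open(os.path.join(self.root, physical_path), "wb") as fh:
            fh.write(data)

    def read(self, physical_path):
        with open(os.path.join(self.root, physical_path), "rb") as fh:
            return fh.read()


class Broker:
    """Callers use logical names; the broker remembers resource and path."""

    def __init__(self, drivers):
        self.drivers = drivers
        self.locations = {}  # logical name -> (resource, physical path)

    def put(self, logical_name, resource, data):
        physical = logical_name.replace("/", "_")
        self.drivers[resource].write(physical, data)
        self.locations[logical_name] = (resource, physical)

    def get(self, logical_name):
        resource, physical = self.locations[logical_name]
        return self.drivers[resource].read(physical)


broker = Broker({"memory": MemoryDriver(),
                 "disk": LocalFileDriver(tempfile.mkdtemp())})
broker.put("email/container-0001/msg-1", "memory", b"in memory")
broker.put("email/container-0001/msg-2", "disk", b"on disk")
```

The caller's `get` call is identical for both objects, which is the location and protocol transparency the slide describes.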
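SDSC Storage Resource Broker & Meta-data Catalog

[Diagram; only its labels survive in this transcript. Components: Application, SRB, MCAT, Dublin Core. Meta-data categories: Resource, User, Application Meta-data. Storage identifiers: File SID, DBLobj SID, Obj SID. Storage systems: Unix, DB2, Oracle, ADSM, HPSS.]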

Time Transparency
  • How to minimize latency
    • Prefetch data to local high performance disk, so that all accesses can be done at high speed from local resources
  • How to maximize access rate
    • Composite or aggregate data into a single data set to avoid multiple accesses
    • Stream data at high rates using parallel I/O, amortizing the access latency by the volume of data that is delivered.
  • How to avoid congestion
    • Replicate data across multiple servers
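The benefit of aggregation can be checked with simple arithmetic: the effective rate of one access is the data size divided by the access latency plus the transfer time. The latency and bandwidth figures below are assumed values for illustration, not measurements from the talk.

```python
def effective_rate(size_mb: float, latency_s: float, bandwidth_mb_s: float) -> float:
    """Delivered MB/s for one access: size / (latency + size / bandwidth)."""
    return size_mb / (latency_s + size_mb / bandwidth_mb_s)


# Assumed figures for an archive with slow first-byte access.
LATENCY_S = 20.0         # seconds before the first byte arrives (assumption)
BANDWIDTH_MB_S = 10.0    # sustained transfer rate in MB/s (assumption)

# One small 1 MB file fetched on its own: latency dominates.
single = effective_rate(1.0, LATENCY_S, BANDWIDTH_MB_S)

# 1000 such files aggregated into one 1000 MB data set: latency is amortized.
aggregated = effective_rate(1000.0, LATENCY_S, BANDWIDTH_MB_S)

if __name__ == "__main__":
    print(f"single file: {single:.3f} MB/s, aggregated: {aggregated:.2f} MB/s")
```

With these assumed numbers, the single access delivers roughly 0.05 MB/s while the aggregated access delivers about 8.3 MB/s, close to the raw bandwidth.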
SRB Containers - Managing Archive Latency

  • Create container in a logical storage resource containing at least one “cacheable” resource
  • Create objects in containers
  • “Cache” daemon will move filled containers to archive
  • Synch and purge APIs

[Diagram; only its labels survive in this transcript. Components: SRB client, SRB Server, UNIX, HPSS, container, cached containers, Distributed Storage Resources.]
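The container workflow on this slide can be sketched as a small in-memory model. It is not the SRB API: the class `ContainerCache`, its method names, and the byte-count capacity rule are assumptions that only mirror the listed steps (create objects in a cached container, copy filled containers to the archive, then sync or purge).

```python
class ContainerCache:
    """Toy model of containers filled on a cache and moved to an archive."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.cache = {}      # container_id -> list of (object_id, data)
        self.archive = {}    # container_id -> list of (object_id, data)
        self._current = None
        self._next_id = 0

    def _size(self, container_id):
        return sum(len(data) for _, data in self.cache[container_id])

    def create_object(self, object_id: str, data: bytes) -> str:
        """Add an object to the open container, opening one if needed."""
        if self._current is None:
            self._current = f"container-{self._next_id:04d}"
            self._next_id += 1
            self.cache[self._current] = []
        container_id = self._current
        self.cache[container_id].append((object_id, data))
        if self._size(container_id) >= self.capacity:
            # Stand-in for the cache daemon: a filled container goes to archive.
            self.sync(container_id)
            self._current = None
        return container_id

    def sync(self, container_id: str) -> None:
        """Copy the cached container to archival storage."""
        self.archive[container_id] = list(self.cache[container_id])

    def purge(self, container_id: str) -> None:
        """Drop the cached copy, but only once an archived copy exists."""
        if container_id not in self.archive:
            raise RuntimeError(f"{container_id} has not been synced to archive")
        del self.cache[container_id]
        if self._current == container_id:
            self._current = None


store = ContainerCache(capacity_bytes=10)
first = store.create_object("a", b"12345")
second = store.create_object("b", b"67890")   # fills container-0000
third = store.create_object("c", b"x")        # opens container-0001
```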

Generality of Information Infrastructure
  • Same information model needed to manage
    • Federation in space
      • Metacomputing environment
      • Interoperable services for digital libraries
    • Migration over time
      • Collection creation and update
      • Persistent archive
  • Same storage systems needed to support
    • Supercomputer center data
    • Discipline specific data collections
    • Digital library collections
Art Museum Image Consortium
  • Demonstrated
    • Support for heterogeneous digital objects
    • Automated conversion of meta-data to XML DTD
    • Validation of meta-data
    • XSL style sheet for presenting information
National Partnership for Advanced Computational Infrastructure
  • Facilitate the conduct of science through development of knowledge resources
    • Publish - Data collection infrastructure
    • Info discovery - Digital Library infrastructure
    • Data access - Data handling infrastructure
  • Apply to federal, state, and university projects
    • NSF / DOE / NASA / USPTO / NARA / Census Bureau
    • California Digital Library
    • UCSD - Pacific Rim Digital Library Alliance
Publishing Scientific Data

[Diagram; only its labels survive in this transcript. Layers: Data Storage (Archival Storage), Collection Building, Information Management, Digital Library. Applications: Digital Sky, Neuroscience, Protein Data Bank, Molecular Structures, Earth Systems Science. Libraries: CDL, UCB - Elib, UCSB - ADL, Stanford - SDLIP, U Michigan - UMDL.]

NPACI is a National Partnership of Partnerships
  • 46 institutions
  • 20 states
  • 4 countries
  • 5 national labs
  • Many projects (new and old)
  • Vendors and industry
  • Government agencies

National Partnership for Advanced Computational Infrastructure
  • Provide Teraflops / Petabyte capable systems for use by national academic community
    • Current systems at the San Diego Supercomputer Center
      • 250 Gflops peak computation rate
        • IBM SP, CRAY T3E
      • 250 Terabyte archive capacity, with 100 TB currently stored
        • High Performance Storage System
    • By end of year
      • 1 TFlop peak computation rate
        • IBM SP
      • 500 Terabyte archive capacity
Challenges
  • Facilitate access to high-end resources
    • Support data intensive computing
  • Facilitate access to distributed data resources
    • Support information discovery
  • Minimize complexity of user interfaces
    • Provide unifying data access system
  • Requires information management infrastructure