
Stanford Archival Repository Project

Brian Cooper, Arturo Crespo, Hector Garcia-Molina. Department of Computer Science, Stanford University.


Presentation Transcript


  1. Stanford Archival Repository Project • Brian Cooper, Arturo Crespo, Hector Garcia-Molina • Department of Computer Science, Stanford University

  2. Data does not live forever • Much data is stored digitally (perhaps exclusively) • Text • Multimedia (images, sound, etc.) • Scientific data • But digital storage is currently unreliable • Magnetic tapes decay, break or lose magnetism • Disks crash • Buildings burn down • Users delete data (accidentally or maliciously)

  3. Data does not live forever • Digital information already lost: • Early NASA records • U.S. Census Information • Toxic waste records • Decay time for common media: • Magnetic Tapes: 10-20 years • CD-ROM: 5-50 years • Hard Drive: 3-5 years

  4. Digital archiving • Digital archivists need: • A reliable system to store digital data for long periods without losing it • Convenient tools to add new data and manage data already archived • Methods for finding the “best” configuration • Most reliable • Most cost effective • Etc.

  5. Archival Repository Project • Goal: Reliably archive digital information for long periods of time (decades or centuries) • Focus on “preserving bits” • Preserving meaning: future work • Strategy • Replicate objects • Automatically detect and correct errors • Our project • Stanford Archival Vault (SAV) – reliably archives data • InfoMonitor – automatically adds newly created data to the archive • ArchSim – a simulation tool to model archives

  6. Architecture (diagram; labels: Users, Filesystem, InfoMonitor, SAV Archive, Archived data, Local archive, Remote archive, Internet)

  7. SAV architecture (diagram; labels: User Interface, Data Creation/Import, Upper Layers (application/user level), Reliability Layer, Object Store, “Core” SAV components, Remote SAV Sites)

  8. Write-once repository • Deletions/modifications disallowed • Any object deleted or modified must have been corrupted, and is replaced • Challenges • Constructing structures of objects • Object references constrained to point from new to old objects • Representing modifications • Archive new version of objects = version chain • Finding objects • Indexes
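
The write-once rule and the version chain can be illustrated with a small sketch. It is not SAV's implementation: the class name, the method names, the use of SHA-256 as the object signature, and the way a new version embeds the signature of the previous one are all assumptions made for illustration.

```python
import hashlib


class WriteOnceStore:
    """Toy write-once object store keyed by content signature (hypothetical API)."""

    def __init__(self):
        self._objects = {}  # signature -> bytes

    @staticmethod
    def signature(data: bytes) -> str:
        # Assumption: SHA-256 stands in for whatever signature SAV uses.
        return hashlib.sha256(data).hexdigest()

    def put(self, data: bytes) -> str:
        sig = self.signature(data)
        # Storing identical content twice is a no-op; nothing is overwritten.
        self._objects.setdefault(sig, data)
        return sig

    def get(self, sig: str) -> bytes:
        return self._objects[sig]

    def is_intact(self, sig: str) -> bool:
        # In a write-once store, a missing or altered object is a corrupted one.
        data = self._objects.get(sig)
        return data is not None and self.signature(data) == sig

    def repair(self, sig: str, replica: "WriteOnceStore") -> bool:
        # Replace a corrupted object with a verified copy from another store.
        if self.is_intact(sig):
            return True
        if replica.is_intact(sig):
            self._objects[sig] = replica.get(sig)
            return True
        return False


def put_version(store: WriteOnceStore, payload: bytes, previous_sig: str = "") -> str:
    # A new version embeds the signature of the version it supersedes, so
    # references only ever point from newer objects to older ones.
    return store.put(previous_sig.encode("ascii") + b"\n" + payload)
```

Because an object's name is derived from its content, an older object cannot refer to a newer one without changing its own signature, which is one way to read the "new to old" constraint on the slide.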

  9. Write-once repository: Indexes • Key to performance • Locate an object quickly by its signature, solve the “Who points to me?” problem, etc. • Disposable indexes • Can be rebuilt at any time from SAV objects • “Bookmarks” used to find collections of objects using an indexed name
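
A minimal sketch of what "disposable" means here: both indexes are derived purely from the stored objects, so they can be thrown away and recomputed. The record layout (a `name` field and a `refs` list per object) and the function name are hypothetical, not SAV's format.

```python
from collections import defaultdict


def rebuild_indexes(objects):
    """Rebuild both indexes from scratch by scanning every stored object.

    `objects` maps signature -> {"name": str or None, "refs": [signatures]}.
    This record layout is an assumption made for the example.
    """
    referrers = defaultdict(set)  # signature -> signatures that point to it
    bookmarks = {}                # well-known name -> signature
    for sig, obj in objects.items():
        for target in obj["refs"]:
            referrers[target].add(sig)
        if obj.get("name"):
            bookmarks[obj["name"]] = sig
    return dict(referrers), bookmarks
```

The `referrers` map answers the "Who points to me?" question, and `bookmarks` maps a well-known name to the object that anchors a collection.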

  10. Write-once repository: Indexes (diagram; labels: SAV, Bookmark (with well-known name))

  11. Replication: Site networks • Sites form “replication agreements” • Agree to replicate data • Specify data to replicate in agreement • May be a subset of all of the data in the archive • Periodically connect and compare data, looking for errors • (Diagram: strongly connected vs. weakly connected site networks)
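
The periodic "connect and compare" step might look roughly like the sketch below, for two sites and one agreement. Each site is modeled as a plain dictionary from signature to bytes, the agreement as a set of signatures, and SHA-256 as the signature function; all of these are assumptions, and the slides do not describe the actual protocol.

```python
import hashlib


def sig_of(data: bytes) -> str:
    # Assumption: SHA-256 stands in for the real object signature.
    return hashlib.sha256(data).hexdigest()


def reconcile(site_a: dict, site_b: dict, agreed: set) -> list:
    """Compare two sites over the agreed signatures and repair what can be repaired.

    Returns the signatures that neither site holds intact.
    """
    unrecoverable = []
    for sig in sorted(agreed):
        a_ok = sig in site_a and sig_of(site_a[sig]) == sig
        b_ok = sig in site_b and sig_of(site_b[sig]) == sig
        if a_ok and not b_ok:
            site_b[sig] = site_a[sig]   # B is missing or corrupted: copy from A
        elif b_ok and not a_ok:
            site_a[sig] = site_b[sig]   # A is missing or corrupted: copy from B
        elif not a_ok and not b_ok:
            unrecoverable.append(sig)   # no intact copy left at either site
    return unrecoverable
```

Restricting the loop to `agreed` reflects the point that an agreement may cover only a subset of the archive.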

  12. Replication: Data sets • SAV replicates different data sets separately • E.g., web pages under agreement A, Usenet articles under agreement B • “Replication sets” should grow without human intervention • Traverse link structure to find objects in set
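
Finding the members of a replication set by traversing the link structure can be sketched as a breadth-first search. The `links` mapping and the function name are hypothetical, and the slides do not say how the starting object is chosen.

```python
from collections import deque


def replication_set(start, links):
    """Return every object reachable from `start` by following references.

    `links` maps an object id to the ids it points to (hypothetical layout).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in links.get(current, ()):
            if target not in seen:  # skip objects already visited
                seen.add(target)
                queue.append(target)
    return seen
```

Because membership is computed rather than listed by hand, newly archived objects that become reachable from the start join the set on the next traversal, which matches the goal of growing without human intervention.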

  13. Replication: Data sets (diagram; labels: SAV, Start traversal)

  14. User interface

  15. User interface

  16. Object store performance

  17. Reliability layer performance

  18. The InfoMonitor • Goal • Create a convenient, transparent mechanism for getting data from existing stores into the archive • Architecture (diagram; labels: Users, Filesystem, InfoMonitor, SAV Archive)

  19. Detecting new data • Must find and archive new data • Filesystem will not signal data writes • Users should not have to explicitly “check-in” data • Scanning • Quick scan: detect changes using timestamps • Slow scan: detect changes using file contents • Filtering • Automatically decide what to archive • Use filtering rules
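
The two scan modes and the filtering step can be sketched as follows. This is an illustration only: the function names, the glob patterns used as filtering rules, and the use of SHA-256 for content comparison are assumptions, not the InfoMonitor's actual behavior.

```python
import fnmatch
import hashlib
import os


def wanted(path, include=("*.txt", "*.ps"), exclude=("*.tmp", "*~")):
    # Filtering rule: the glob patterns are made-up examples.
    name = os.path.basename(path)
    if any(fnmatch.fnmatch(name, pat) for pat in exclude):
        return False
    return any(fnmatch.fnmatch(name, pat) for pat in include)


def quick_scan(root, last_mtimes):
    """Return files whose modification time differs from the recorded one."""
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not wanted(path):
                continue
            mtime = os.stat(path).st_mtime_ns
            if last_mtimes.get(path) != mtime:
                last_mtimes[path] = mtime
                changed.append(path)
    return sorted(changed)


def slow_scan(root, last_digests):
    """Return files whose contents differ from the recorded digest."""
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not wanted(path):
                continue
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            if last_digests.get(path) != digest:
                last_digests[path] = digest
                changed.append(path)
    return sorted(changed)
```

The quick scan only calls `stat` on each file, so it is cheap but trusts timestamps; the slow scan reads every file and catches changes that leave the timestamp untouched.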

  20. User interface

  21. User interface

  22. InfoMonitor performance

  23. Designing Archival Repositories • Designer needs to answer questions like: • What is the minimum number of copies of a document needed to ensure its preservation? • Which is more cost-efficient: storing the information on one expensive disk with a low failure rate, or on two inexpensive disks with high failure rates? • Are two sites enough to guarantee preservation? • How often should we scan the repositories for errors? • What is the MTTF (mean time to failure) of this design?
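
For intuition on the disk question, here is a back-of-the-envelope calculation under deliberately simple assumptions that are not ArchSim's model: copies fail independently, lifetimes are exponentially distributed, and failed copies are never detected or repaired. Under those assumptions the expected time until all n copies are lost is the single-copy MTTF multiplied by 1 + 1/2 + ... + 1/n. The function name and the example figures are made up.

```python
def mttf_without_repair(copies: int, mttf_single: float) -> float:
    """Expected time until all `copies` independent replicas have failed.

    Assumes exponential lifetimes and no repair, which gives
    mttf_single * (1 + 1/2 + ... + 1/copies).
    """
    if copies < 1:
        raise ValueError("need at least one copy")
    return mttf_single * sum(1.0 / k for k in range(1, copies + 1))


# Hypothetical figures: one good disk (MTTF 10 years) versus
# two cheap disks (MTTF 4 years each).
one_good_disk = mttf_without_repair(1, 10.0)    # 10.0 years
two_cheap_disks = mttf_without_repair(2, 4.0)   # 6.0 years
```

With these made-up numbers the single good disk comes out ahead, but only because nothing is ever repaired. Once errors are detected and corrected by periodic scanning, the comparison can change, and there is no equally simple formula, which is the kind of question a simulator is suited to.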

  24. Contributions • A comprehensive model for an archival repository • A powerful simulation tool, ArchSim, for evaluating archival repositories and the available strategies • A detailed case study of a hypothetical TR repository operated between Stanford and MIT

  25. How important is having good disks?

  26. Preventive maintenance

  27. Current and future work • New models for replication agreements and “data trading” • Archiving the World Wide Web • Modeling cost • Managing “meaning” • Security • Alternative object naming schemes • Other “upper layers,” e.g. user access, metadata, etc.

  28. Conclusion • Digital librarians need tools to preserve data • Our project addresses this need • Reliable storage: SAV • Convenient access: InfoMonitor • Finding the best configuration: ArchSim • More work must be done to refine these models • More automation • More flexibility • Answer a wider range of design questions

  29. For more information • http://www-db.stanford.edu/archivalrep • Brian Cooper: cooperb@db.stanford.edu • Arturo Crespo: crespo@db.stanford.edu • Hector Garcia-Molina: hector@db.stanford.edu
