Plan for Archiving & Preservation of Data

GRAD 521, Research Data Management

Winter 2014 – Lecture 15

Amanda L. Whitmire, Asst. Professor

Logistics

Heads up/reminder on the final: data management plan

Survey responses: thank you!

Today’s lesson
  • Basic archival processes: data selection, format migration, checksums, auditing, etc.
  • Address the need for conversion to standard formats needed for re-use
  • Options for a long-term sustainable preservation strategy/policy for your data
  • Costs & timelines for data storage, management tools and services
Archive-stage actions
  • Data selection or appraisal
  • Format selection
  • Perform checksums
  • Select archive location
  • Periodic file- and bit-level audits
1. Data appraisal
  • “… the process of distinguishing records of continuing value from those of no further value so that the latter may be eliminated.”
  • The National Archives (UK)
Appraisal criteria

Relevance to mission

Historical value

Uniqueness

Potential for redistribution

Non-replicability

Economic case

Full documentation

For a full discussion of the appraisal process, see this guide: Whyte, A. & Wilson, A. (2010). "How to Appraise and Select Research Data for Curation". DCC How-to Guides. Edinburgh: Digital Curation Centre. http://www.dcc.ac.uk/resources/how-guides

2. Format selection

Ideal: non-proprietary or open formats

For more info. on data formats: http://guides.library.oregonstate.edu/data-management-types-formats
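
As a rough illustration of screening a data collection for files that may need conversion before archiving, here is a minimal sketch. The extension lists and the function names are assumptions made for this example, not part of the lecture, and a file extension is only a hint about the real format.

```python
from pathlib import Path

# Illustrative, incomplete lists (assumptions); adjust to your discipline.
OPEN_EXTENSIONS = {".csv", ".txt", ".xml", ".json", ".png"}
PROPRIETARY_EXTENSIONS = {".doc", ".xls", ".ppt", ".sav"}


def classify_format(file_name):
    """Return 'open', 'proprietary', or 'unknown' based on the file extension."""
    suffix = Path(file_name).suffix.lower()
    if suffix in OPEN_EXTENSIONS:
        return "open"
    if suffix in PROPRIETARY_EXTENSIONS:
        return "proprietary"
    return "unknown"


def files_to_review(file_names):
    """List the files whose format is not on the open list."""
    return [name for name in file_names if classify_format(name) != "open"]


if __name__ == "__main__":
    # Hypothetical file names.
    print(files_to_review(["Datamanagement.doc", "results.csv", "README"]))
```

Files flagged this way are only candidates for review; whether a conversion loses information still has to be checked by hand.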

3. Checksums

Checksums provide a way to:

ensure the integrity of your data

create a comprehensive list of your files

Data integrity

What is an MD5 checksum?

is like a fingerprint of a file

used to verify whether two files are identical

Each time you run a checksum:

a number string for each file is created

even if 1 byte of data has been altered or corrupted that string will change

if the checksums match, the data has not been altered
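
The fingerprint idea can be tried without any special tool. The sketch below uses Python's standard `hashlib` module; the function name `md5_of_file` and the sample file name are assumptions made for this example. It writes a small file, removes one byte, and shows that the checksum no longer matches.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        sample = Path(folder) / "sample.txt"  # hypothetical file name
        sample.write_bytes(b"Research data management.")
        before = md5_of_file(sample)

        # Alter a single byte: drop the final period.
        sample.write_bytes(b"Research data management")
        after = md5_of_file(sample)

        print(before)
        print(after)
        print("identical" if before == after else "file has changed")
```

MD5 is adequate for spotting accidental corruption; it is not considered safe against deliberate tampering.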

Checksums

Here is an example data collection:

Folder: C:\ … \datamanagementstuff

Checksums

Here is an MS Word document in that folder:

FastSum
  • FastSum is a free MD5 checksum tool for Windows available at http://www.fastsum.com/
  • 1. Download and install the trial version
  • 2. Run the Program
Creating a checksum

The wizard has created a list of ‘Checksum\State’ in FastSum

It has also created a text file in the \datamanagementstuff folder

Creating a checksum

Open up the text file and this is what you find:

*a checksum string and a list of file names*
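
A text file of this kind (one checksum per file name) can also be produced with a short script. This is a minimal sketch: the function names and the "digest, two spaces, relative path" layout are assumptions chosen for the example and are not FastSum's own file format.

```python
import hashlib
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of one file (reads the whole file at once)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to `folder`) to its MD5 digest."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): file_md5(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def write_manifest(manifest, out_path):
    """Write one 'digest  relative/path' line per file."""
    lines = [f"{digest}  {name}" for name, digest in sorted(manifest.items())]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

If the manifest is saved inside the folder being scanned, the next run will list the manifest itself, so it may be simpler to store it next to the folder.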

Using a checksum
  • In this example:
  • Reopened the Word document from earlier
  • Deleted a period, saved, and closed the document
  • When you run the checksum wizard again, the value for the ‘Datamanagement.doc’ file should change
Comparing checksums
  • Before …

… After

Comparing checksums
  • Notice how the values for Datamanagement.doc have changed:
    • 0CA9E83E612447E793D4758BF7A5244D
    • 91BAE7EC0C642D967585D01DD6AA4096
  • Values for the other files stay the same
  • Values stay the same across machines unless a file has changed
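
The before/after comparison can be automated once each run is stored as a mapping of file name to checksum. The sketch below is one way to do it; the function name is an assumption, and the second file with its checksum is made up for illustration. The two Datamanagement.doc values are the ones listed above, with their before/after order assumed.

```python
def compare_manifests(before, after):
    """Compare two {file name: checksum} mappings.

    Returns a dict with sorted lists of changed, added, and removed file names.
    """
    shared = before.keys() & after.keys()
    return {
        "changed": sorted(n for n in shared if before[n] != after[n]),
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
    }


if __name__ == "__main__":
    before = {
        "Datamanagement.doc": "0CA9E83E612447E793D4758BF7A5244D",
        "notes.txt": "D41D8CD98F00B204E9800998ECF8427E",  # hypothetical file
    }
    after = {
        "Datamanagement.doc": "91BAE7EC0C642D967585D01DD6AA4096",
        "notes.txt": "D41D8CD98F00B204E9800998ECF8427E",
    }
    print(compare_manifests(before, after))
    # {'changed': ['Datamanagement.doc'], 'added': [], 'removed': []}
```
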
Creating a file list

FastSum has created a list of all the files in the folder it was pointed toward:

4. Select archive location
  • Considerations
    • Costs
    • Size of dataset
    • Public vs. private access
    • Length of preservation
    • Hands-on vs. hands-off
    • Security of platform

Locations

Individual

Department/College

University-wide

Discipline-specific

3rd-party

Archive vs. sharing mechanism

Data in Real Life

Images courtesy of Heather Henkel

  • A design firm was handling their own backups. The system was working fine and the backup software was reporting that the data was successfully backed up.
Data in Real Life

CC Image courtesy of angielauw on Flickr

  • The administrator checked the backups immediately after they were done and confirmed they were good.
Data in Real Life

After a computer virus erased most of their files, they went back to their backups. Unfortunately they found that the backups were all blank and all of the data was gone. Only after some investigation did they discover that the computer tapes (which contained the backups) were placed against a wall that had an elevator on the other side of it. When the elevator went past, the magnets inside erased all of the tapes.

Take home message: had they checked their backups again, they probably would have noticed this issue before there was an emergency & complete loss of files.
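
Re-checking a backup can be a scripted, repeatable audit. The sketch below assumes a mapping of relative file paths to the MD5 checksums recorded when the backup was made; the function name and the problem labels are assumptions made for this example. A file that has been blanked or corrupted shows up as a checksum mismatch, and a file that has disappeared shows up as missing.

```python
import hashlib
from pathlib import Path


def audit_backup(backup_dir, manifest):
    """Re-hash every file listed in `manifest` and report problems.

    `manifest` maps relative file paths to the MD5 digests recorded when the
    backup was made. Returns a list of (relative path, problem) tuples.
    """
    root = Path(backup_dir)
    problems = []
    for name, expected in sorted(manifest.items()):
        target = root / name
        if not target.is_file():
            problems.append((name, "missing"))
            continue
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        # Compare case-insensitively: tools differ in how they print hex digits.
        if actual.lower() != expected.lower():
            problems.append((name, "checksum mismatch"))
    return problems
```

An empty result means every listed file is present and unchanged; it says nothing about files that were never listed.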

Preservation strategy
  • Create an archive backup policy that clearly identifies:
    • roles
    • responsibilities
    • where the data is backed up
    • how often the files are backed up
    • how to access the files
    • recommended file formats to be used &
    • policies for migrating data to assure data are not lost due to media degradation or changing formats or programs
  • Review your backup policy & plan periodically to ensure it is still valid and applicable
    • Update contacts, if appropriate
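
One way to keep such a policy reviewable is to store it as a small structured record. The sketch below is only an illustration: the class name, field names, and the one-year default review interval are assumptions, not requirements from the lecture.

```python
from dataclasses import dataclass, fields
from datetime import date, timedelta


@dataclass
class BackupPolicy:
    """One record per archive; all field names are hypothetical."""
    roles: dict                   # person -> role
    responsibilities: dict        # role -> duties
    backup_locations: list        # where the data is backed up
    backup_interval_days: int     # how often the files are backed up
    access_instructions: str      # how to access the files
    recommended_formats: list     # file formats to be used
    migration_policy: str         # how data move to new media or formats
    last_reviewed: date
    review_interval_days: int = 365  # assumed: review once a year

    def unanswered(self):
        """Names of policy items that are still empty."""
        empty = ("", [], {}, None)
        return [f.name for f in fields(self) if getattr(self, f.name) in empty]

    def review_due(self, today):
        """True when the periodic review is due or overdue."""
        elapsed = today - self.last_reviewed
        return elapsed >= timedelta(days=self.review_interval_days)
```

Calling `unanswered()` before archiving gives a quick list of policy questions nobody has answered yet.
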
Best Practices
  • Minimize or remove reliance on users to perform manual backups (if possible)
    • Implement standardized and automatic backups
    • If possible, put experts in charge of this task (computer staff) as they are more likely to keep up-to-date regarding software updates, hardware issues, best practices, etc.
  • Don’t assume backups are being performed for you
    • You don’t want to find out after the fact that no backups have been performed
    • If you are using third-party software (like Yahoo or Google Mail), what happens if they lose your files?
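
A standardized backup step can be as small as a script that copies a folder to a timestamped location and is run by a scheduler instead of by hand. This is a minimal sketch under stated assumptions: the function names and folder naming scheme are made up for the example, scheduling itself (for instance cron or Windows Task Scheduler) is left out, and a file count is only a coarse check compared with a checksum audit.

```python
import shutil
from datetime import datetime
from pathlib import Path


def make_backup(source_dir, backup_root, now=None):
    """Copy `source_dir` into a new timestamped folder under `backup_root`."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    destination = Path(backup_root) / f"{Path(source_dir).name}-{stamp}"
    # copytree creates the destination (and missing parents) and fails if it
    # already exists, so an earlier backup is never overwritten.
    shutil.copytree(source_dir, destination)
    return destination


def count_files(folder):
    """Number of regular files below `folder`."""
    return sum(1 for p in Path(folder).rglob("*") if p.is_file())


def backup_and_check(source_dir, backup_root):
    """Make a backup and confirm that the file counts agree."""
    destination = make_backup(source_dir, backup_root)
    return destination, count_files(destination) == count_files(source_dir)
```
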
A typical OSU researcher

> 55% produce 100 GB or less per project

Archive on your own
  • You buy & manage hardware, replication, backups and networking (if applicable, for offsite access)
  • OK for unrestricted, sensitive (FERPA), and protected data

Costs (100 GB dataset): ranges (but generally cheap)

Archive w/ department IT
  • 30-day backup/recovery window for files on personal or departmental storage
  • RAID protected, backed up online storage
  • Accessible (to you) remotely (via VPN)

Costs (100 GB dataset in COSINe)

($0/year * 4 GB) + ($60/100 GB/year) = $60/year (ongoing)

$300 for 5 years
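
The arithmetic can be checked with a few lines of code. The sketch below reads the figures as "first 4 GB free, then $60 per year for each 100 GB block"; that reading, the rounding up to whole blocks, and the function name are assumptions, since the slide does not say how partial blocks are billed.

```python
import math


def annual_cost_in_blocks(size_gb, free_gb=4, block_gb=100, price_per_block=60):
    """Yearly charge in dollars when storage is sold in fixed-size blocks."""
    billable_gb = max(size_gb - free_gb, 0)
    # Assumption: a partly used block is billed as a whole block.
    return math.ceil(billable_gb / block_gb) * price_per_block


if __name__ == "__main__":
    yearly = annual_cost_in_blocks(100)
    print(yearly)      # 60
    print(yearly * 5)  # 300, assuming size and price stay constant
```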

Archive @ OSU w/ CN
  • Storage is in 2 separate data centers & backups retained for 3 months
  • Accessible (to you) remotely (via VPN)
  • OK for unrestricted, sensitive (FERPA), and protected data

Costs (100 GB dataset)

($0/year * 5 GB) + ($4/GB/year * 95 GB) = $380/year (ongoing)

$1,900 for 5 years
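
The same calculation works for any dataset size, which helps when budgeting ahead. This is a minimal sketch: the function names are assumptions, the default values are the 5 GB free allowance and $4/GB/year rate from this slide, and the multi-year total assumes the dataset size and the rate do not change.

```python
def annual_cost(size_gb, free_gb=5, rate_per_gb_year=4):
    """Yearly charge in dollars: the first `free_gb` are free, the rest billed per GB."""
    billable_gb = max(size_gb - free_gb, 0)
    return billable_gb * rate_per_gb_year


def cost_over(years, size_gb, **pricing):
    """Total charge over `years`, assuming size and prices stay constant."""
    return years * annual_cost(size_gb, **pricing)


if __name__ == "__main__":
    print(annual_cost(100))   # 380
    print(cost_over(5, 100))  # 1900
```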

Archive in discipline-specific repository
  • Replicated, archive-quality storage
  • Data curation throughout ingest & archive period
  • Data in context with other datasets

Costs: ranges

Bottom line

No “one-size-fits-all” approach

Balance costs, storage quality, access, degree of involvement, security, longevity, etc.

Plan ahead so you can budget appropriately