

Challenges of Digital Preservation

MA / CS 109

April 22, 2011

Andrea Goethals

Manager of Digital Preservation & Repository Services

Harvard Library

“Digital Content”?
  • Digitized (born-analog)
  • Born-digital
    • Tweets
    • Web sites
    • Email
    • Documents
      • PDF
      • Word, OpenOffice …
      • Spreadsheets
    • Data sets
Digital content is not new
  • 1957: 1st digital image
  • 1969: ARPAnet
  • 1971: 1st email sent
  • 1972: 1st consumer-level video game
  • 1975: 1st digital camera

Russell Kirsch’s son (source: NIST)

But has only recently exploded
  • 1998: 1st Google index
    • 26 million pages
  • 2000: Google index
    • 1 billion pages
  • 2008: Google link processors
    • 1 trillion unique URIs
    • “… and the number of individual Web pages out there is growing by several billion pages per day” – from the official Google blog
The coming tsunami
  • 2010: estimated at 1.2 ZB (1 ZB is 1 billion TB)
    • DVDs stacked from Earth to the Moon and back
  • 2020: expected to reach 35 ZB, 44 times the 2009 level
    • DVDs stacked halfway to Mars

Source: 2010 IDC Digital Universe Study sponsored by EMC

Outpacing storage

Source: 2009 IDC Digital Universe Study sponsored by EMC

May be historically significant

Captured March 19, 2011 for a Japan Earthquake collection created by Virginia Tech, Internet Archive (http://www.archive-it.org/public/collection.html?id=2438)

May be a work of art

YouTube Play. A Biennial of Creative Video (Oct. 2010 -)

May be an important reference

Only available in digital form

Who cares?
  • Cultural heritage institutions
    • Libraries, archives
    • Museums, historical societies
    • Academic institutions
  • Governments
  • Entertainment, news and media industry
  • Scientific community
  • Funding bodies (NSF, NIH)
  • You?
Preservation historically
  • Archives and libraries have been preserving all kinds of analog material for centuries using:
    • Environmental control
    • Conservation treatments
  • Can store away until resources allow processing
    • Benign neglect approach works well
Analog content is fairly durable
  • Even damaged, may still be identifiable, readable, usable

Anatolian Cuneiform Tablet, circa 1850 BCE

In contrast, digital content is
  • Easily destroyed
  • Transient
  • Hidden
  • In need of more active attention – the benign neglect approach doesn’t work
Digital content is easily destroyed
  • Bad people
  • Hardware or software failures
  • Human mistakes
    • The slip of a finger can lead to catastrophic results
    • “Help! Accidental deletion. I accidentally deleted 62 images… can you please recover them from backups?”
Digital content is transient
  • Average lifespan of a Web site is between 44 and 100 days

Captured April 8, 2009

Visited October 13, 2010

Digital content is hidden
  • Which is corrupt?
  • Both. Regular use helps, but it’s not enough to detect corruption.
But is it usable???
  • It’s not enough to preserve the digital bits
    • AppleWorks?
    • WordStar?
    • Excel 1.0?
  • To use digital content we need software that can read the format
Reading formats

ffd8ffe000104a46494600010201
008300830000ffed0fb050686f74
6f73686f7020332e30003842494d
03e90a5072696e7420496e666f00
0000007800000000004800480000
000002f40240ffeeffee03060252
0347052803fc0002000000480048
0000000002d80228000100000064
000000010003030300000001270f
0001000100000000000000000000
0000600800190190000000000000
0000000000000000000000000000
0000000000000000000000003842
494d03ed0a5265736f6c7574696f
6e0000000010008313a3000200 ...


  • SOI
  • APP0 JFIF 1.2
  • APP13 IPTC
  • APP2 ICC
  • DQT
  • SOF0 183x512
  • DRI
  • DHT
  • SOS
  • ECS0
  • RST0
  • ECS1
  • RST1
  • ECS2
  • ...
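As a rough illustration of how software turns raw bytes into named segments like those listed above, the following Python sketch walks the marker structure at the start of a JPEG stream. The function name `list_segments` and the small marker table are illustrative assumptions, not something from the presentation. The sample is the first 20 bytes of the hex dump, and the sketch deliberately stops where the data runs out or entropy-coded data begins.

```python
import struct

# Small, illustrative subset of JPEG marker names (not a complete table).
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT",
    0xDA: "SOS", 0xD9: "EOI",
}

def list_segments(data: bytes):
    """Return (name, payload) pairs for the header segments of a JPEG stream."""
    segments = []
    pos = 0
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            break  # not positioned at a marker; stop rather than guess
        code = data[pos + 1]
        name = MARKER_NAMES.get(code, f"0x{code:02X}")
        pos += 2
        if code in (0xD8, 0xD9):  # SOI and EOI carry no length or payload
            segments.append((name, b""))
            continue
        if pos + 2 > len(data):
            break  # truncated before the length field
        # The 2-byte big-endian length counts itself plus the payload.
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        payload = data[pos + 2:pos + length]
        segments.append((name, payload))
        if code == 0xDA:
            break  # entropy-coded data follows SOS; this sketch stops here
        pos += length
    return segments

# First 20 bytes of the hex dump on the slide.
sample = bytes.fromhex("ffd8ffe000104a46494600010201008300830000")
for name, payload in list_segments(sample):
    if name == "APP0" and payload[:5] == b"JFIF\x00":
        print(name, f"JFIF {payload[5]}.{payload[6]}")
    else:
        print(name)
```

Without knowledge of the format (marker codes, length fields, the JFIF identifier) the same bytes are just an opaque string of hex digits.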


Access to information
  • Analog book – unmediated access
    • Layers: information, content, symbols, language, HW (paper)
  • Digital book – technology-mediated access
    • Layers: information, content, bits, formats, SW, HW

Formats are key to digital preservation
  • If the format of our content is unsupported by technology, we can’t access the content’s information!
  • Layers: information, content, bits, formats, SW, HW (digital content and its supporting technologies)

Dependent on fleeting technology
  • We are dependent on technology to interpret (render, play, etc.) digital content
  • No technology sticks around – it all ages and disappears
  • Eventually all digital content in its original format becomes unusable!
Format obsolescence
  • Kodak PhotoCD
    • Used by libraries in the 1990s and into the 2000s as a preservation format
    • Best decoders were from Kodak and are no longer supported
    • Very few software decoders remaining – soon images in this format will be unusable
    • Harvard’s Digital Repository Service has 7,243 of these
Two sub-problems
  • Keep the bits safe
  • Keep the information usable as technology changes
Safe bits
  • Infrastructure, policies, practices and professional staff to counter risks
    • High quality storage
    • Redundancy (multiple copies, multiple locations)
    • Media refreshing (replacing)
    • Security and access restrictions
    • Content recovery
    • Integrity monitoring (check for corruption)…
Integrity monitoring
  • Message digests – unique signatures for digital content
    • Fixed-size bit strings
      • 6326ec82b3200df4a87fc54356d2cb73
    • Calculated by cryptographic hash functions, e.g. MD5, SHA-1, …
  • Any changes to a file result in a changed message digest
  • Useful for detecting corruption
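A minimal sketch of digest-based integrity checking, using Python's standard `hashlib` module. The helper names `file_digest` and `is_intact` and the demo file are illustrative assumptions. SHA-256 is used as the default here only as one possible choice; passing `"md5"` or `"sha1"` selects the functions named on the slide where the platform allows them.

```python
import hashlib
import os
import tempfile

def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Return the hex message digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_intact(path: str, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """Compare a freshly computed digest with the one recorded earlier."""
    return file_digest(path, algorithm) == recorded_digest.lower()

# Demo: record a digest at "ingest", then simulate silent corruption.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "image.bin")  # hypothetical file
    with open(path, "wb") as f:
        f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 100)
    recorded = file_digest(path)
    print(is_intact(path, recorded))   # digest still matches

    with open(path, "r+b") as f:       # overwrite a single byte in place
        f.seek(50)
        f.write(b"\x01")
    print(is_intact(path, recorded))   # digest no longer matches
```

In practice the recorded digest would be stored with the repository's metadata and rechecked on a schedule, since a changed byte is otherwise invisible until someone opens the file.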
Usable information
  • People have to be able to find it
  • People must be able to manage it
  • Document what’s important (description, context, ownership, processing history)
  • Know what you are preserving (formats)…
A TIFF is a TIFF?
  • TIFF 4.0
  • TIFF 5.0
  • TIFF 6.0
  • TIFF 6.0 extension YCbCr (Class Y)
  • TIFF/IT (ISO 12639:2003)
  • TIFF/EP (ISO 12234-2:2001)
  • RichTIFF
  • EXIF 2.0
  • EXIF 2.1 (JEIDA-49-1998)
  • EXIF 2.2 (JEITA CP-3451)
  • GeoTIFF 1.0
  • TIFF-FX (RFC 2301)
  • Class F (RFC 2306)
  • RFC 1314
  • Canon RAW (.crw, .cr2, .tif)
  • Nikon RAW (.nef)
  • DNG (Adobe Digital Negative)
Identifying formats
  • Techniques: “magic numbers”, full parse
  • Few tools
    • Support limited number of formats
    • Accuracy varies
  • Some improvements
    • File Information Tool Set (FITS)
      • fits.googlecode.com
    • NARA-sponsored research
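The “magic numbers” technique can be sketched in a few lines of Python: compare the first bytes of a file against known signatures. The table below is a small illustrative subset and the function name `identify` is an assumption. Real identification tools rely on much larger signature registries.

```python
# Leading-byte signatures for a few well-known formats (illustrative subset).
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
]

def identify(data: bytes) -> str:
    """Guess a format from the leading bytes; 'unknown' if nothing matches."""
    for signature, name in MAGIC_NUMBERS:
        if data.startswith(signature):
            return name
    return "unknown"

print(identify(bytes.fromhex("ffd8ffe000104a464946")))  # start of the hex dump shown earlier
print(identify(b"II*\x00\x08\x00\x00\x00"))
print(identify(b"plain text"))
```

A signature match only narrows things down: many of the TIFF variants listed earlier share the same leading bytes, which is one reason a full parse is the more accurate (and more expensive) technique.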
Usable information
  • Make sure there’s technology to support the formats! (technology watch)
  • Preservation strategies
    • Technology preservation
    • Creation of viewing software
    • Emulation & variations:
      • Universal Virtual Machine
      • Universal Virtual Computer
    • Format normalization
    • Format migrations…
Key format migration considerations
  • What can’t be lost in the transformation? “Significant properties”
    • E.g. color, embedded metadata, resolution, ICC profiles, interaction, attachments, fonts, links
    • How important is each of these properties? – weighted criteria
  • To what format? “Preservable” formats
  • What else must be changed? E.g. links
  • How many versions to keep?
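One way to picture “weighted criteria” is a simple weighted sum over significant properties. In the Python sketch below, every weight, score and candidate name is hypothetical and chosen only for illustration. A real preservation plan would derive them from the collection's requirements and from testing actual conversions.

```python
# Hypothetical weights: how much each significant property matters (sum to 1.0).
WEIGHTS = {
    "color": 0.4,
    "embedded metadata": 0.3,
    "resolution": 0.2,
    "ICC profiles": 0.1,
}

# Hypothetical scores (0.0 to 1.0): how well each candidate target format
# preserves a property after migration.
CANDIDATES = {
    "format A": {"color": 1.0, "embedded metadata": 0.5, "resolution": 1.0, "ICC profiles": 1.0},
    "format B": {"color": 0.5, "embedded metadata": 1.0, "resolution": 1.0, "ICC profiles": 0.0},
}

def weighted_score(scores: dict, weights: dict) -> float:
    """Sum of weight * score; a property missing from scores counts as 0."""
    return sum(weight * scores.get(prop, 0.0) for prop, weight in weights.items())

# Rank candidates from best to worst overall preservation of what matters.
ranked = sorted(CANDIDATES, key=lambda name: weighted_score(CANDIDATES[name], WEIGHTS), reverse=True)
for name in ranked:
    print(name, f"{weighted_score(CANDIDATES[name], WEIGHTS):.2f}")
```

The weights make the trade-off explicit: a target format that loses a low-weight property can still outrank one that degrades a property the collection cannot afford to lose.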
Preservation lifecycle – a series of hand-offs
  • Create or acquire digital content
  • Ingest into a preservation repository
    • Continuous cycle of:
      • Monitoring
      • Planning
      • Intervention
    • Subject to collection management decisions
  • Transfer to next generation of the repository or to a different repository
Ongoing commitment
  • Requires a continual, proactive program
    • You can’t just start and stop
    • Time frames are MUCH shorter than for preservation of analog material
  • Requires ongoing investment in infrastructure and staff
Can’t do it alone
  • Digital preservation activities must be shared across institutions
  • Even collectively we don’t have adequate resources or understanding
Preservation community
  • Collaborative organizations (NDSA, IIPC, OPF)
  • Collaborative projects
  • Standards and best practices
  • Shared infrastructure and tools
    • Formats registry
    • Repository software
    • Preservation planning tools
    • Format tools