
Digital Archiving – A Workflow

K P Raghuraman

National Centre for Science Information

Indian Institute of Science, Bangalore

NAMASKARA

Acknowledgements

Organizers

Mr. Francis Jayakant

Mr. Filbert Minj

Friends who supported me in the effort

Internet

Archives and Publication Cell, IISc

Digital Archiving
  • What is a digital archive?
  • A documented information and storage system
  • Holds permanent, fixed data for a long time (?) in a structured and easily accessible way
  • Employs an information architecture configured to assure trustworthiness and long-term retention

Digital Archiving – Need

A practical task for keeping documents intact for future use

Improved access to information resources, preservation and dissemination as required

Any time and any place

Digital Archiving – Benefits
  • Digitisation contributes to
    • Conservation of physical resources
    • Effective sharing of information and a better flow of knowledge
    • Unlocking information that was previously difficult to access in paper form
    • Reduced wear and tear of originals through the use of digital surrogates, which can also be made more legible
    • Less need to handle the originals
    • Access control: access to information can be restricted even when it is offered remotely
    • A customizable user interface for a collaborative working environment
    • Faster answers to queries and questions
    • Cost savings on paper and time savings in finding information

Digital Archiving – Advantages
  • Improved searching mechanisms
    • Metadata search - Full text search - Boolean search
    • Support simultaneous searching in a standardised form, across a range of resource categories.
    • Information, rather than media, can be collated to support a query, regardless of the original source material type.
  • Space saving
    • The contents of about 3000 kg of paper could be stored on a single DVD
    • Data can be recombined for manipulation and compressed for various applications
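
The DVD figure above can be sanity-checked with some rough arithmetic. The sketch below assumes A4 sheets of 80 g/m2 paper and a single-layer 4.7 GB DVD; none of these values come from the slides, so treat the result as an order-of-magnitude estimate only.

```python
# Assumptions (not from the slides): A4 sheets of 80 g/m2 paper,
# and a single-layer DVD with a capacity of 4.7 GB.
A4_AREA_M2 = 0.210 * 0.297   # A4 is 210 mm x 297 mm
PAPER_GSM = 80               # grams per square metre
DVD_BYTES = 4.7e9            # single-layer DVD capacity in bytes


def sheets_in(paper_kg: float) -> int:
    """Number of A4 sheets in a given weight of paper."""
    grams_per_sheet = A4_AREA_M2 * PAPER_GSM
    return round(paper_kg * 1000 / grams_per_sheet)


def bytes_per_sheet(paper_kg: float) -> float:
    """Storage budget per sheet if every sheet must fit on one DVD."""
    return DVD_BYTES / sheets_in(paper_kg)


print(sheets_in(3000), round(bytes_per_sheet(3000)))
```

Under these assumptions 3000 kg is roughly 600,000 sheets, which leaves a budget of a little under 8 KB per sheet. That is tight but plausible for compressed black and white text pages, and far too small for uncompressed colour scans.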

Digital Archiving – Technology and Process
  • The digital record is a mirror image of the original analogue/paper-based file in terms of
    • Page layout and number of pages
    • Handwritten text, graphics and logos
    • Colour of the original document
  • These images are then rendered into the desired format (e.g. PDF) for archiving, printing and distribution
  • Creation of metadata, used for search and indexing
    • Additional metadata providing contextual information
      • Who uses the records
      • How will they be used
      • When will they be used
      • Access codes to prevent unauthorized access
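
As an illustration of the metadata described above, here is a minimal sketch of a record that carries both search fields and contextual fields. The class name, field names and the "public" access label are all hypothetical; the slides do not prescribe a schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ArchiveRecord:
    """Hypothetical metadata record for one digitised document."""
    title: str
    file_name: str
    page_count: int
    # Contextual metadata, following the questions on the slide.
    intended_users: List[str] = field(default_factory=list)
    intended_use: str = ""
    access_code: str = "public"   # hypothetical access label

    def may_be_read_by(self, user_codes: Set[str]) -> bool:
        """Readable if the record is public or the user holds its code."""
        return self.access_code == "public" or self.access_code in user_codes


record = ArchiveRecord(
    title="Council minutes",
    file_name="council-minutes_00001.pdf",
    page_count=12,
    intended_users=["historians"],
    intended_use="reference",
    access_code="staff",
)
print(record.may_be_read_by({"staff"}), record.may_be_read_by(set()))
```

Keeping the access code inside the record means the same metadata that drives search can also drive the access check.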

Digitisation

Crude definition

Scan

Save

Is it just scan and save?

Is there a workflow?

Are there guidelines for the whole process?

Digitisation

Definition

Converting written and printed information into electronic form

Creation of a computerised version of a printed analogue.

Contents

Text, image, audio or a combination of these (multimedia)

Objective of Digitisation
  • Create content of databases
    • Facilitate access
    • Preservation
    • Dissemination of information resources

Digitisation Process

Output

  • Electronic Document
    • Tagged Image File Format (TIFF)
    • Portable Document Format (PDF)
      • Useful for hosting information on the intranet
      • Platform independent
      • PDF readers are available as free downloads

Digitisation - Objects and Process

Image

Text

Audio

Video

Scanner captures images.

Software analyses the scanned images and produces text and images

Software converters convert raw audio and raw video to standard digital formats

Digitisation - Issues

Hardware

Computer

Scanner

Software

Communication software between PC and scanner – TWAIN compliant

Image processing – Photoshop, Macromedia Fireworks etc.

OCR (Optical Character Recognition) to convert scanned text material into editable text – ABBYY, OmniPage

Suitable Policy

Consistent quality threshold for scanned images.

Choosing appropriate image format – TIFF, JPEG etc.

Choosing an appropriate file name scheme.
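
A file name scheme is easiest to keep consistent when it is generated rather than typed. The sketch below shows one possible scheme; the pattern (collection slug, item number, page number) and the function name are assumptions made for illustration, not a scheme given in the slides.

```python
import re


def make_file_name(collection: str, item_no: int, page_no: int, ext: str = "tif") -> str:
    """Build a sortable file name from a hypothetical three-part scheme."""
    # Lower-case the collection name and replace anything that is not
    # a letter or digit with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", collection.lower()).strip("-")
    # Zero padding keeps alphabetical order equal to numeric order.
    return f"{slug}_{item_no:05d}_p{page_no:04d}.{ext}"


print(make_file_name("IISc Annual Reports", 12, 3))
```

Zero-padded numbers are the main design choice here: without them, page 10 would sort before page 2 in most file listings.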

Scanners

Flat bed scanners

Normal Desktop scanner

Sheet fed scanners

Same as a flatbed scanner, but here the document moves and the scan head stays fixed

Handheld scanner

Used to capture text – size of a pen.

Drum scanner

Used in publishing industries

Planetary Scanner

Scanning books

Types of Images

1-bit black and white – either black or white

Used for printed text or line graphics

Unsuitable for photographs and other continuous-tone images

8-bit grey scale – 256 grey scales

Black and white photographs

Non-color documents

8-bit color – 256 colors

low quality images

24-bit color – 16.8 million shades of color

Ideal archival quality images

For color photo printing
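
The numbers of shades quoted above follow directly from the bit depth: a pixel with n bits can take 2 to the power n values. A minimal sketch (the function name is ours):

```python
def colour_levels(bits_per_pixel: int) -> int:
    """Number of distinct values a pixel can take at a given bit depth."""
    return 2 ** bits_per_pixel


for bits in (1, 8, 24):
    print(bits, colour_levels(bits))
```

For 24 bits this gives 16,777,216, which is the "16.8 million" quoted for 24-bit color.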

Resolution

Measurement in dots per inch (dpi)

The higher the dpi, the larger the file size
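
To see how quickly file size grows with resolution, the uncompressed size can be estimated from the page size, the dpi and the bit depth. This is a simplified sketch: it ignores file headers and the row padding that real formats add, and the 8 x 10 inch page is just an example.

```python
def uncompressed_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Approximate size of the raw pixel data of a scan, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


for dpi in (300, 600):
    print(dpi, uncompressed_bytes(8, 10, dpi, 24))
```

Doubling the dpi doubles both the width and the height in pixels, so the file size grows by a factor of four.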

Image - Size

Image size is measured in pixels

Image size varies with the scanning resolution

Changing the pixel dimensions of an image is called resampling

On screen, image pixels are mapped onto the pixels of the display

At full size, one screen pixel shows one image pixel and can have any RGB value

800 x 600 pixels – 14” monitor

1024 x 768 pixels – 16” monitor
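
A scan usually has many more pixels than a monitor can show, so it has to be resampled for on-screen viewing. The sketch below only computes the target pixel dimensions; it assumes a 1024 x 768 screen by default and does not do the actual resampling of pixel data.

```python
from typing import Tuple


def fit_to_screen(width_px: int, height_px: int,
                  screen_w: int = 1024, screen_h: int = 768) -> Tuple[int, int]:
    """Pixel dimensions that fit the screen while keeping the aspect ratio."""
    # Use the smaller of the two scale factors and never enlarge the image.
    scale = min(screen_w / width_px, screen_h / height_px, 1.0)
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


# An 8 x 10 inch page scanned at 300 dpi is 2400 x 3000 pixels.
print(fit_to_screen(2400, 3000))
```

The image is scaled by the more restrictive of the two directions, so a tall page ends up limited by the screen height rather than the width.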

Image – File Formats

Some standard image formats

TIFF – Tagged Image File Format

JPEG – Joint Photographic Experts Group

DjVu – pronounced "déjà vu" (a free file format)

GIF – Graphics Interchange Format

PNG – Portable Network Graphics

TIFF

Multiple images and data in the same file

Tags in file header (information on size, compression)

Loss-less format, useful for archival images

Platform independent

Format useful for future modification – can be edited without compression loss

Disadvantage

File size is very large
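
The "tags in file header" can be read with a few lines of code. The sketch below follows the basic TIFF layout (a byte-order mark, the number 42, the offset of the first directory, then 12-byte tag entries) and reads only single SHORT and LONG values. The sample bytes are a hand-built fragment for illustration, not a complete valid TIFF file, and the function name is ours.

```python
import struct

TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 259: "Compression"}


def read_tiff_tags(data: bytes) -> dict:
    """Read single-value SHORT and LONG tags from the first directory."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])   # little or big endian
    if order is None:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i
        tag, field_type, n = struct.unpack(order + "HHI", data[start:start + 8])
        raw = data[start + 8:start + 12]
        if n == 1 and field_type == 3:      # SHORT, stored in the first 2 bytes
            (value,) = struct.unpack(order + "H", raw[:2])
        elif n == 1 and field_type == 4:    # LONG, uses all 4 bytes
            (value,) = struct.unpack(order + "I", raw)
        else:
            continue                        # other types are skipped in this sketch
        tags[TAG_NAMES.get(tag, tag)] = value
    return tags


sample = (
    struct.pack("<2sHI", b"II", 42, 8)          # byte order, magic number, directory offset
    + struct.pack("<H", 3)                      # three directory entries
    + struct.pack("<HHII", 256, 4, 1, 2400)     # ImageWidth as LONG
    + struct.pack("<HHII", 257, 4, 1, 3000)     # ImageLength as LONG
    + struct.pack("<HHIHH", 259, 3, 1, 5, 0)    # Compression as SHORT, 5 = LZW
)
print(read_tiff_tags(sample))
```

In practice an imaging library would be used for this; the point of the sketch is only that size and compression are recorded as plain tagged values that software can read without decoding the image.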

JPEG

Strongest format for web images and printing images

Superior quality can be produced

Variety of compression capability

Best method for online viewing

Disadvantage

Lossy compression format

GIF

Very old format

Lossless compression format

Less storage space

Strong candidate for graphic art and drawing.

Disadvantage

Limited to 256 colors.

DjVu

File format to save scanned images especially with text.

Advanced technology for image layer separation of text and images.

High quality readable images, stored in minimum space – useful for web.

Progressive loading – useful for web.

Format used by the Million Book Project

PNG

A new format

Created to improve on GIF format

Supports 24-bit color or greyscale

Provides for variety of transparency

Lossless data compression

Disadvantage

Being new, it is not supported by older software

File Formats

Audio

WAV

Microsoft and IBM audio file format.

Lossless storage method – large files.

MP3 – MPEG-1 Audio Layer 3

Popular digital audio encoding.

Lossy compression format so smaller files.

Still can produce good reproduction of original.

RealAudio – .ram

Variety of audio codecs, from low-bitrate to high-fidelity formats

Streaming audio format
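
The size difference between WAV and MP3 can be estimated from the data rates. The sketch below assumes CD-quality PCM audio (44.1 kHz, 16 bits, stereo) for the WAV file and a constant 128 kbps for the MP3 file; both are common settings chosen for illustration, not values from the slides, and file headers are ignored.

```python
def pcm_bytes(seconds: int, sample_rate: int = 44_100, bits: int = 16, channels: int = 2) -> int:
    """Size of uncompressed PCM audio data, as typically stored in a WAV file."""
    return seconds * sample_rate * bits * channels // 8


def bitrate_bytes(seconds: int, kbps: int = 128) -> int:
    """Size of audio encoded at a constant bit rate, such as MP3."""
    return seconds * kbps * 1000 // 8


minute_wav = pcm_bytes(60)
minute_mp3 = bitrate_bytes(60)
print(minute_wav, minute_mp3, round(minute_wav / minute_mp3, 1))
```

With these settings one minute of audio takes about 10.6 MB as WAV and under 1 MB as MP3, roughly an elevenfold difference.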

File Formats

Video

MPEG-21

Defines “Rights Expression Language” standard

Sharing digital rights/permissions/restrictions for content from content creator to consumer

XML based file system

Can communicate machine readable license information in a "ubiquitous, unambiguous and secure" manner.

The main objective of the MPEG-21 is to define the technology needed to support users to exchange, access, consume, trade or manipulate Digital Items in an efficient and transparent way.

OCR

Optical Character Recognition

Goal – Recreate text and other elements like tables and layout so as to edit in popular word-processors

Requirement – Scanner and text conversion software (OCR)

Technology – Examines patterns of dots and recognizes them and writes them as alphabetic characters and numbers

OCR - Process

The scanner or camera produces a TIFF image

The software cleans noise from the image and starts recognizing patterns

Recognized patterns are written out as letters and numbers

Unrecognized patterns are kept as images
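
The idea of recognizing "patterns of dots" can be shown with a toy example. The sketch below compares a small dot pattern against stored templates and accepts the closest one only if few dots differ; otherwise the pattern stays unrecognized. The 3 x 5 templates, the error threshold and the function names are all made up for illustration, and commercial OCR engines are far more sophisticated than this.

```python
from typing import Optional, Tuple

Glyph = Tuple[str, ...]

# Hypothetical 3 x 5 dot templates: '#' is ink, '.' is paper.
GLYPHS = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "O": ("###", "#.#", "#.#", "#.#", "###"),
}


def distance(a: Glyph, b: Glyph) -> int:
    """Number of dot positions where two patterns differ."""
    return sum(ca != cb for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))


def recognise(cell: Glyph, max_errors: int = 2) -> Optional[str]:
    """Return the closest character, or None if nothing is close enough."""
    best = min(GLYPHS, key=lambda ch: distance(cell, GLYPHS[ch]))
    return best if distance(cell, GLYPHS[best]) <= max_errors else None


noisy_l = ("#..", "#..", "#.#", "#..", "###")   # an 'L' with one stray dot
blot = ("#.#", ".#.", "#.#", ".#.", "#.#")      # matches no template well
print(recognise(noisy_l), recognise(blot))
```

A pattern that returns None here corresponds to the "unrecognized" case above, where the software would keep the region as an image instead of guessing a character.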

Widely used settings

24-bit color

600 dpi (300 or 400 dpi are popular for text)

TIFF Revision 6.0, uncompressed or with LZW compression

(PNG is currently becoming popular)

Photographs to be scanned at twice their size

B&W photographs in grey scale

Text can also use the above settings and can be stored as PDF or DjVu
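
LZW is acceptable for archival TIFF files because it is lossless: decompressing gives back exactly the original bytes. The sketch below is a textbook version of LZW that shows this round trip. It is not the exact variant used inside TIFF files, which adds details such as variable-width codes and a clear code.

```python
from typing import List


def lzw_compress(data: bytes) -> List[int]:
    """Textbook LZW: replace repeated byte sequences with table codes."""
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = len(table)     # learn the new sequence
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: List[int]) -> bytes:
    """Rebuild the original bytes by re-creating the same table."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):              # code defined by this very step
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)


row = b"\xff" * 40 + b"\x00" * 8 + b"\xff" * 40   # a mostly white scan line
codes = lzw_compress(row)
print(len(row), len(codes), lzw_decompress(codes) == row)
```

Scanned pages contain long runs of identical pixels, which is why a dictionary method like this shrinks them without discarding any information.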

Popular Practices Followed

Initially Preservation Masters are created.

Should be uncompressed to retain archival integrity

For long time storage purposes.

Compressed web files are then created as surrogates for the repository or the web site

Specific File Formats

OCR - Accuracy

Depends on

Color of paper

Characters should be reasonably well formed

The font should be one of the popular ones.

99% accuracy is achieved with

Bleached white paper

10pt character size

1.5 line spacing

Computer based printouts
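
The slides do not say how accuracy is measured. One common choice, assumed here, is character accuracy: one minus the number of character edits needed to turn the OCR output into the correct text, divided by the length of the correct text. A minimal sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(truth: str, ocr_text: str) -> float:
    """Share of the reference text that the OCR output got right."""
    if not truth:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1 - edit_distance(truth, ocr_text) / len(truth))


print(character_accuracy("digital archive", "digita1 archive"))
```

By this measure, 99% accuracy still means about one wrong character in every hundred, which on a dense page can add up to a few dozen errors.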

OCR - Issues

Dealing with archival material

Old text printed during the hand-press period

Gothic and exotic fonts used

Paper color is yellow

Characters are often broken and not well-formed due to age and environment factors

Best Practice

First scan and store as TIFF files

OCR TIFF files

Depending on the application and file size, convert them into PDF or any other format

Depending on the accuracy of the OCR, use either the TIFF images or the OCR copies to build the PDF

OCR – Software

ABBYY FineReader – very popular

OmniPage – high-end OCR tool

Readiris – a competitor to ABBYY FineReader and OmniPage

MODI – Microsoft Office Document Imaging (introduced with Office XP; exports to Word)

The software cleans up the image and saves it as a high-resolution TIFF image

Using OCR, it can be converted to editable text

Summary

Digitization is a process

Large number of analogue items like image, text, audio and video are captured into digital form

Understand the variables and tasks in the process

Methods of capturing images

Conversion process performed

Summary

Document the workflow

This will lead to a life history for each digitized item

Help Create Consistency and Reliability
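
One simple way to build such a life history is to record every step applied to an item as it happens. The sketch below is only an illustration; the class names, the item identifier and the step details are hypothetical, and a real log would normally also carry dates and operator names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    action: str   # for example "scan", "ocr" or "convert"
    detail: str   # settings or tools used for this step


@dataclass
class ItemHistory:
    """Hypothetical life history of one digitized item."""
    item_id: str
    steps: List[Step] = field(default_factory=list)

    def record(self, action: str, detail: str) -> "ItemHistory":
        """Append one step and return self so calls can be chained."""
        self.steps.append(Step(action, detail))
        return self

    def summary(self) -> str:
        return " -> ".join(step.action for step in self.steps)


history = ItemHistory("report_00012")
history.record("scan", "600 dpi, 24-bit, TIFF").record("ocr", "English").record("convert", "PDF")
print(history.summary())
```

Recording the settings with each step is what makes the process repeatable: a later operator can see exactly how an item was produced and apply the same treatment to the next one.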

New Definition

Is this the end of digitization?

Are we through with the work?

As in every other job, here too sustainability and maintenance are necessary

Long term maintenance

Technology is changing rapidly

Obstacles that may need to be overcome

Lack of awareness in general about how such resources may be exploited effectively for scholarly purposes

Lack of relevant IT skills and/or analytical methods

Lack of appropriate user support.

Strategies for preserving data

Preserving the data and the hardware and software platforms from which they are originally made accessible.

Refreshing data by copying them periodically onto new storage media.

Migrating data through changing technical regimes by rendering them into appropriate standard interchange formats.

Emulating the look and feel of the original data on successive generations of hardware and software platforms.
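
Refreshing only helps if the copy is identical to the original. The slides do not mention checksums; the sketch below assumes that comparing SHA-256 digests before and after the copy is an acceptable way to confirm this. The function names and the temporary files are for illustration only.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Digest of a file, read in chunks so large masters fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(source: Path, target: Path) -> str:
    """Copy a file to new storage and verify that the copy is identical."""
    expected = sha256_of(source)
    shutil.copyfile(source, target)
    if sha256_of(target) != expected:
        raise IOError("copy does not match the original")
    return expected


with tempfile.TemporaryDirectory() as folder:
    old = Path(folder) / "master_old.tif"
    new = Path(folder) / "master_new.tif"
    old.write_bytes(b"example image data")
    print(refresh_copy(old, new))
```

Storing the returned digest alongside the file would let the next refresh cycle check that nothing changed in the meantime.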

Points to ponder

Unlike paper, parchment and other traditional forms of recording medium, electronic systems and their data are not durable. Digital materials have very different preservation requirements to analogue materials, which may last for many decades through storage in optimal environmental conditions.

The other difficulty with electronic data and files is that they require the intervention of other systems to facilitate readability or usability.  This innate dependency makes the files themselves very fragile.  A problem in any of the supporting components can render the information useless. 

It is not enough to physically preserve the storage medium or present the bitstream.  Without the commensurate tools to decode and present the bitstream, a future user will be met with gibberish.

Digitization – Next Step

Will mean preservation of materials that are ‘born digital’.

Migration

Electronic data transferred from one data format to another.

Emulation

Attempts to use current and future technologies to emulate the tools and logic used when the records and files were originally created

Informative web sites

Irish Virtual Research Library and Archive - Project Workbook: http://www.ucd.ie/ivrla/workbook/wdigpreservation.html

The Arts and Humanities Data Service (AHDS) is a UK national service aiding the discovery, creation and preservation of digital resources in and for research, teaching and learning in the arts and humanities: http://ahds.ac.uk/about/publications/index.htm
