
A Standardized DigiTool Ingest Approach to Internet Archive Digitized Books


  1. A Standardized DigiTool Ingest Approach to Internet Archive Digitized Books
     Joseph Shubitowski (jshubitowski@getty.edu)
     IGeLU 2008, September 9, 2008

  2. Talking Points:
     • Scope / Background
     • Why?
     • Major hurdles
     • Manual / automated workflows
     • Outcomes
     • What can we share?
     • Results
     • Methodologies
     • Tools, etc.
     IGeLU Conference 2008, September 9, 2008

  3. Alfred P. Sloan Foundation
     • Getty Research Institute
       • Archaeology and antiquities
     • Boston Public Library
       • John Adams collection
     • Johns Hopkins
       • Anti-slavery materials
     • The Metropolitan Museum of Art
       • Museum Publications
     • Bancroft Library
       • Gold Rush and westward expansion

  4. Scope of the Digitization Project
     • 2,000,000 pages, or approx. 5,000 books
     • Self-evident collection
     • Public domain
       • pre-1923 for works published in U.S.
       • pre-1909 for works published outside of U.S.
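The public-domain cutoffs on this slide can be written as a small check. This is only a sketch of the rule as stated; the function name and its parameters are illustrative and do not come from the presentation.

```python
def in_digitization_scope(pub_year: int, published_in_us: bool) -> bool:
    """Apply the cutoffs from the slide: pre-1923 (U.S.) or pre-1909 (elsewhere)."""
    cutoff = 1923 if published_in_us else 1909
    # "pre-" is read here as strictly earlier than the cutoff year.
    return pub_year < cutoff


if __name__ == "__main__":
    # A 1915 imprint qualifies only if it was published in the U.S.
    print(in_digitization_scope(1915, published_in_us=True))
    print(in_digitization_scope(1915, published_in_us=False))
```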

  5. 1 Pod = 10 Scribe Stations
     Internet Archive Scribe Station

  6. Why Do It?
     • Internet Archive issues
       • Response/search time
       • Metadata only searching
       • No control
     • Full-text searching
     • Use in metasearch
     • More control!

  7. Major Hurdles
     • Getting the files
     • Disk space issues, for general storage and for DTL
     • What/how to process all the files
     • Abbyy OCR vs. ALTO OCR
     • Thumbnail generation
     • Handle configuration/synchronization

  8. Processed by GRI
     • Zipped or tar files:
       • *_orig_jp2
       • *_jp2
       • *_raw_jp2
     • high & low resolution PDFs
     • *abbyy.gz
     • *meta.xml
     • *marc.xml
     Workflow diagram labels:
     • Link to URLs
     • List of OCR’d books received from Internet Archive
     • URLs from Internet Archive
     • Download files from Internet Archive
     • Process downloaded files
     • Ready for DigiTool Ingest
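To make the "process downloaded files" step concrete, here is a minimal sketch that sorts downloaded file names by the patterns listed on the slide. The patterns are taken from the slide; the role labels, the function name, and the sample identifier `book` are assumptions made for illustration, not the scripts GRI actually used.

```python
import fnmatch

# Patterns from the slide, most specific first so that "*_jp2*" does not
# swallow the "_orig_jp2" and "_raw_jp2" archives. Role labels are illustrative.
FILE_PATTERNS = [
    ("*_orig_jp2*", "original_jp2"),
    ("*_raw_jp2*", "raw_jp2"),
    ("*_jp2*", "use_copy_jp2"),
    ("*.pdf", "pdf"),
    ("*abbyy.gz", "abbyy_ocr"),
    ("*meta.xml", "ia_metadata"),
    ("*marc.xml", "marc_record"),
]


def classify(filename):
    """Return the role of a downloaded file, or None if no pattern matches."""
    for pattern, role in FILE_PATTERNS:
        if fnmatch.fnmatchcase(filename, pattern):
            return role
    return None


if __name__ == "__main__":
    for name in ["book_orig_jp2.tar", "book_jp2.zip", "book_abbyy.gz", "book_marc.xml"]:
        print(name, "->", classify(name))
```

Files that match no pattern return `None`, so a caller can log and skip them rather than guess.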

  12. Disk Space Issues
     • Each digitized book = 500 MB to 1.5 GB of raw files
     • Further untarring and processing consume even more disk!
     • DTL scratch/processing space, permanent storage space, and Oracle tablespace (including full-text indexing) consume even more disk space
     • 3000 books in the queue will require 10-15 TB for this project alone.
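A quick back-of-the-envelope check of these figures: at 500 MB to 1.5 GB per book, 3000 books come to roughly 1.5 to 4.5 TB of raw files, so the 10-15 TB estimate implies that untarring, processing space, permanent storage, and Oracle tablespace multiply the raw footprint several times over. The sketch below works this through, assuming decimal units (1 TB = 1000 GB); the variable and function names are illustrative.

```python
BOOKS_IN_QUEUE = 3000
RAW_GB_PER_BOOK = (0.5, 1.5)   # per-book range from the slide
PROJECT_TB = (10, 15)          # total estimate from the slide


def raw_storage_tb(books, gb_per_book):
    """Raw download size in TB, assuming 1 TB = 1000 GB."""
    return books * gb_per_book / 1000


raw_low, raw_high = (raw_storage_tb(BOOKS_IN_QUEUE, gb) for gb in RAW_GB_PER_BOOK)

# Loosest bounds on the overhead factor implied by the slide's numbers.
overhead_min = PROJECT_TB[0] / raw_high
overhead_max = PROJECT_TB[1] / raw_low

if __name__ == "__main__":
    print(f"raw files: {raw_low:.1f} to {raw_high:.1f} TB")
    print(f"implied overhead factor: {overhead_min:.1f}x to {overhead_max:.1f}x")
```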

  13. DTL ingest package =
     • Archive = raw jpeg2000 (renamed to *.j2k)
     • View = use copy jpeg2000 (*.jp2)
     • Index = ALTO files
     • Thumbnail = appropriate thumb of title page for display of the complex object
     • PDF = high res PDF as additional manifestation
     • MARCXML record for IE level metadata
     • No TIF files from IA: everything is jpeg2000
     • Mapping file same for every ingest
     • CSV file is produced automatically
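The slide says the CSV file is produced automatically but does not show its layout. The following is a minimal sketch of how such a per-page file could be generated, pairing each page's archive (`*.j2k`), view (`*.jp2`), and index (ALTO) file. The column names, the `.xml` extension for ALTO files, and the page-stem naming are all hypothetical; the real DigiTool CSV and mapping-file formats should be taken from the DigiTool documentation.

```python
import csv
import io

# Hypothetical column names; the actual DigiTool CSV layout is not given on the slide.
FIELDNAMES = ["sequence", "archive", "view", "index"]


def build_manifest_rows(page_stems):
    """One row per page: archive (.j2k), view (.jp2), and index (ALTO .xml) file names."""
    return [
        {
            "sequence": seq,
            "archive": f"{stem}.j2k",
            "view": f"{stem}.jp2",
            "index": f"{stem}.xml",
        }
        for seq, stem in enumerate(page_stems, start=1)
    ]


def manifest_csv(page_stems):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(build_manifest_rows(page_stems))
    return buf.getvalue()


if __name__ == "__main__":
    print(manifest_csv(["book_0001", "book_0002"]))
```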

  14. Abbyy to ALTO
     • IA scanning produces one huge OCR file in Abbyy proprietary XML
     • Discussions with / proposal from CCS
     • Real need for an open-source approach
       • Abbyy XSD can morph in future
       • Desire to share
     • Contract with Ex Libris to produce tool
       • Java based
       • Includes jar and class files
       • Free to share and redistribute
     • Tool transforms a single ABBYY file into one ALTO XML file per page
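The core idea of the transform, one ABBYY document in and one XML document per page out, can be illustrated in a few lines. This is not the Ex Libris Java tool and it does not emit schema-valid ALTO: it assumes ABBYY elements named `page`, `line`, and `charParams`, ignores namespaces and coordinates, and borrows only a few ALTO element names (`alto`, `Layout`, `Page`, `TextLine`, `String`) to show the per-page split.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def split_abbyy_pages(abbyy_xml):
    """Return one simplified, ALTO-like XML string per ABBYY <page> element."""
    root = ET.fromstring(abbyy_xml)
    pages = []
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        alto = ET.Element("alto")
        layout = ET.SubElement(alto, "Layout")
        alto_page = ET.SubElement(layout, "Page", ID=f"P{len(pages) + 1}")
        for line in page.iter():
            if local_name(line.tag) != "line":
                continue
            # Assumed ABBYY layout: one <charParams> element per character.
            chars = [
                c.text or ""
                for c in line.iter()
                if local_name(c.tag) == "charParams"
            ]
            text_line = ET.SubElement(alto_page, "TextLine")
            # One String element per whitespace-separated word; no coordinates.
            for word in "".join(chars).split():
                ET.SubElement(text_line, "String", CONTENT=word)
        pages.append(ET.tostring(alto, encoding="unicode"))
    return pages
```

A full conversion would also carry over page dimensions and word bounding boxes, which is what makes the per-page files usable for highlighted full-text search.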

  15. Thumbnail Creation
     • Initial ingest flow created complex object thumbnail from first page of PDF manifestation
       • Boring!
       • Ghostscript/PDF/ImageMagick problems
     • Decision to go semi-manual with script/cgi that:
       • creates thumbnails for first 15 jpeg2000 page images
       • sends URL in email for each separate ingest
       • creates web page for page image viewing and thumbnail selection
       • adds chosen thumbnail to staging directory, cleans up, and sends confirmation email
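The first step of the semi-manual script, making candidate thumbnails from the first 15 page images, might look like the sketch below. It only builds the command lines and does not run them. Using ImageMagick's `convert` with `-thumbnail` is an assumption (the slide does not say which tool the script calls, and JPEG 2000 support depends on how ImageMagick was built), as are the 150-pixel width and the zero-padded file names that make a plain sort follow page order.

```python
from pathlib import Path


def thumbnail_commands(jp2_files, out_dir, count=15, width=150):
    """Build one ImageMagick command per candidate page, for the first `count` pages."""
    commands = []
    # Assumes zero-padded page numbers so that sorting by name gives page order.
    for src in sorted(jp2_files)[:count]:
        dest = Path(out_dir) / (Path(src).stem + ".jpg")
        commands.append(["convert", str(src), "-thumbnail", str(width), str(dest)])
    return commands


if __name__ == "__main__":
    pages = [f"book_{n:04d}.jp2" for n in range(1, 21)]
    for cmd in thumbnail_commands(pages, "thumbs"):
        print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run` once the tool choice is confirmed for the local installation.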

  17. Handle Generation
     • Setup per DTL docs
     • Firewall tweaks
     • Ingest flow tweaks
       • Handle for IE
       • Handles for all archive jpeg2000 images
     • DTL errors with mass publication of Handles
       • Fixed in SP21

  18. Ingest Summary
     • Get/process/stage files
     • Generate ALTO OCR files
     • Web CGI for thumbnail selection
     • Load.sh script moves all files to locations DTL expects
     • Activate saved Ingest Flow from DTL Web Ingest client
     • Wait.........
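The staging step performed by Load.sh can be pictured as a routing table from file type to target directory. The sketch below only plans the moves and touches no files. The directory names and the routing rules are hypothetical, since the slide does not list the locations DTL expects.

```python
from pathlib import PurePosixPath

# Hypothetical staging layout; the real DTL locations are not given on the slide.
STAGING_DIRS = {
    ".j2k": "archive",
    ".jp2": "view",
    ".xml": "index",
    ".jpg": "thumbnail",
    ".pdf": "pdf",
}


def plan_moves(filenames, staging_root):
    """Return (source, destination) pairs; unrecognized files are skipped."""
    root = PurePosixPath(staging_root)
    moves = []
    for name in filenames:
        path = PurePosixPath(name)
        if path.name.endswith("marc.xml"):
            subdir = "metadata"  # keep the MARCXML record apart from ALTO files
        else:
            subdir = STAGING_DIRS.get(path.suffix.lower())
        if subdir is not None:
            moves.append((name, str(root / subdir / path.name)))
    return moves


if __name__ == "__main__":
    for src, dest in plan_moves(["book_0001.j2k", "book_marc.xml", "notes.txt"], "/staging"):
        print(src, "->", dest)
```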

  19. Outstanding Issues
     • Ingest speed
       • Remedied somewhat in SP21
       • Digitized books are just darn big!
       • Low number of ingests per day
     • Handles
       • Manual publishing process
       • Need to populate Voyager bib record
     • METS viewer performance issues

  20. Success Factors!!
     • Code to share
       • Get/process/staging scripts
       • Abbyy/ALTO transform code
       • Web cgi thumbnail code
       • YMMV
     • Handles provide true persistent IDs
       • http://hdl.handle.net/10020/17473
     • Full-text multilingual searching
     • MetaLib QuickSet for metasearch of all local repositories
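A handle URL such as the one on this slide is the resolver address followed by a prefix and a suffix separated by a slash. A small sketch of building and parsing such URLs follows; the helper names are illustrative, and nothing here reflects how DigiTool assigns suffixes.

```python
from urllib.parse import urlsplit

RESOLVER = "http://hdl.handle.net"


def handle_url(prefix, suffix):
    """Build a resolvable URL from a handle prefix and suffix."""
    return f"{RESOLVER}/{prefix}/{suffix}"


def parse_handle(url):
    """Split a handle URL into (prefix, suffix); the suffix may itself contain slashes."""
    path = urlsplit(url).path.lstrip("/")
    prefix, _, suffix = path.partition("/")
    if not prefix or not suffix:
        raise ValueError(f"not a handle URL: {url!r}")
    return prefix, suffix


if __name__ == "__main__":
    print(parse_handle("http://hdl.handle.net/10020/17473"))
```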

  21. Demo and Thanks......
     • http://archives.getty.edu
     • jshubitowski@getty.edu
