
Human Rights Web Archive Portal – Technical Summary



Columbia University Libraries


HRWA Statistics, through July 31, 2012

  • ca. 500 web sites

  • 26 million pages / documents

    • HTML pages = 24.5 million

    • Document files (e.g., doc) = 0.5 million

    • PDFs = 0.5 million

    • XML = 100,000

    • Presentations (e.g., ppt) = ca. 1,800

    • Spreadsheets (e.g., xls) = ca. 700

  • ca. 65 languages
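
As a quick consistency check on the figures above, the per-type counts can be summed and turned into shares. This is only a sketch: the numbers are the rounded values from the slide, and the dictionary keys and the `breakdown` helper are names chosen here for illustration.

```python
# Approximate counts as listed on the slide (rounded in the source).
counts = {
    "HTML pages": 24_500_000,
    "Document files": 500_000,
    "PDFs": 500_000,
    "XML": 100_000,
    "Presentations": 1_800,
    "Spreadsheets": 700,
}


def breakdown(counts):
    """Return the total and each type's share of that total."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total, shares = breakdown(counts)
for name, share in shares.items():
    print(f"{name:15s} {share:7.3%}")
print(f"listed total: {total:,}")
```

The listed types add up to roughly 25.6 million, which is consistent with the "26 million pages / documents" headline figure, and HTML accounts for the large majority of the collection.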


HRWA Relevant Tech Terms

  • Archive-It – Internet Archive’s web archiving service

  • Solr (Lucene) – indexing tool

  • Blacklight – discovery interface for Solr

  • MySQL – used as an intermediate index db

  • WARC (Web ARChive format) – storage format for archived web content

  • Fedora – Columbia’s preservation repository
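
To make the WARC entry above more concrete, here is a minimal sketch of reading one record from an uncompressed WARC stream using only the Python standard library. It is not the HRWA pipeline: the `read_warc_record` function and the sample record are hypothetical, the sample omits header fields that real records carry, and real WARC files are typically gzip-compressed and better handled by a dedicated library.

```python
import io


def read_warc_record(stream):
    """Read one WARC record; return (version, headers, payload) or None at EOF."""
    version = stream.readline()
    while version in (b"\r\n", b"\n"):  # skip blank lines between records
        version = stream.readline()
    if not version:
        return None
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):  # blank line ends the header block
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the size of the record's content block in bytes.
    payload = stream.read(int(headers["Content-Length"]))
    return version.decode("ascii").strip(), headers, payload


# Hypothetical, simplified record for illustration only.
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

stream = io.BytesIO(sample)
version, headers, payload = read_warc_record(stream)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Each record carries its own headers (record type, target URI, length) followed by the captured content, which is what lets an indexer walk a large archive record by record.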


HRWA Challenges

Most challenging and innovative LDPD project to date.

  • Most data in single project (ca. 2 TB)

  • Largest indexes

  • Greatest number of servers for indexing / production

  • Most complex data (WARC / Web)

  • Most challenging end-user design requirements

  • Most uncharted in terms of users, possible uses, possible value-added features, scoping, etc.

  • Most cutting edge, most unanswered tech questions


HRWA More Information

  • CUL/IS Behind the Scenes page

  • CUL/IS Mellon Web Resources Wiki

  • Archive-It: Columbia’s Web Archive Collections

  • Columbia’s Human Rights Web Archive Portal

