
Providing Access to Archived Web Sites in the Library of Congress Web Archives (LCWA)

Tracy Meehleib

Library of Congress, NDMSO

Columbia University, June 30, 2010

Digital Library Seminar Series



Library of Congress Web Archives

EVENT-DRIVEN

  • September 11th, 2001

  • Winter Olympic Games 2002

  • U.S. Congresses 107th, 108th, 109th, etc.

  • U.S. Elections 2000, 2002, 2004, 2006, 2008, etc.

  • Iraq War 2003- (1 Phase cataloged)

  • Papal Transition 2005

  • Supreme Court Nominations 2005-2006

  • Crisis in Darfur, Sudan 2006

  • Egypt 2008

  • Single Sites (EUR, HSS, ST&B)

  • Indian Election 2009

  • Indonesian Election 2009

  • Philippine Election 2010

  • Sri Lanka Election 2010

  • Timor Leste 2010

  • Burma/Myanmar Election 2010

FORMAT/COLLECTION-DRIVEN

  • Organizational Sites corresponding to Papers/Archives collected by LC’s Manuscript Division

  • Sites corresponding to creators whose works are collected by/represented in LC’s Prints & Photographs Division

  • Legal Blawgs identified by the Law Division

COLLABORATIVE

  • End-of-Term Government/January 2009 (LC, IA, UNT, CDL, GPO)

  • Hurricane Katrina (LC, IA, individual contributors)

  • K-12 (LC and Internet Archive)



Iraq War, 2003 Web Archive



Crisis in Darfur, Sudan 2006 Web Archive



LC Manuscript Division Archive of Organizational Web Sites



Visual Image Web Archive



Legal Blawgs Web Archive



Egypt, 2008 Web Archive



Southeast Asian Elections, 2009-2010 Web Archives



Library of Congress Web Archives

Archive: number of sites

Election 2000: 800

Election 2002: 4000

Election 2004: 1945

Election 2006: 2098

Election 2008: 2000

107th Congress: 579

108th Congress: 583

109th Congress: 580

110th Congress: 580

September 11, 2001: 2300

Winter Olympics 2002: 62

Iraq War 2003- (multi-phase): 231-plus

Papal Transition 2005: 192

Crisis in Darfur, Sudan 2006: 218

Visual Image Web Sites: 17

Organizational Sites, Manuscript Division: 30

U.S. Supreme Court Nominations 2005-2006: 281

Legal Blawgs 2007- (multi-phase): 90-plus

Egypt 2008: 30

Indian General Elections: 58

Indonesian General Elections: 79

Single Sites (multi-phase): 925-plus



Organizational Structure

WEB ARCHIVING TEAM

In the Office of Strategic Initiatives (OSI). We are project managers and technical staff focused on capture, tools, and permissions.

CURATORS/RECOMMENDING OFFICERS

In Library Services, Congressional Research Service, and the Law Library. They pick the collections and which URLs to archive, and research whom to contact for permission.

INFORMATION TECHNOLOGY OFFICE and TECHNICAL ARCHITECTURE TEAM

Also in OSI. Supports Wayback and tools development, repository development, and data transfers. Contractors are also used in this area.

BIBLIOGRAPHIC ACCESS

MODS records are created in Library Services: staff of the Network Development & MARC Standards Office and of Acquisitions & Bibliographic Access do the cataloging.



Web Archives Processing Workflow

  • Create a MODS template for metadata extraction

  • Metadata extraction results in a preliminary MODS record for each archived site

  • Enhance record, reviewing & revising some values if needed (title, language, abstract, keywords) and adding some values (LCSH headings—subjects and sometimes names)

  • Register item-level handles

  • Load item-level MODS records onto server, index, generate item-level search/browse

  • Create a collection-level overview page for LCWA collection homepage

  • Create collection-level record in ILS and register collection-level handle



Why Provide Site-Level Access to these Sites?

  • Access limitations of searching W/ARC files by keyword and URL at the archive level

  • Increase access using controlled vocabularies (LCSH, TGM, etc.)

  • Leverage subject cataloging & language expertise to enhance subject access as economically as possible

  • Resources become integratable with other library resources at the item level

  • Better precision and recall searching within and across archives

  • Persistent IDs/handles allow for stable citations and digital scholarship at site-level

  • Records are “portable”—so we can leverage existing and new search/browse systems



How Do We Provide Site-Level Access to these Sites?

  • Boilerplate as much relevant archive-level and site-level metadata as is possible into the MODS template

  • Extract as much useful metadata as possible from the archived web sites' W/ARC files (using a Perl script or another method that grabs the metadata from meta tags in the W/ARC files): titles, dates, file types, abstracts, subject keywords, etc.

  • Leverage LC subject cataloging & language expertise and controlled vocabularies to add subject access
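The extraction step above can be sketched roughly as follows. This is a minimal Python illustration, not the Perl script mentioned in the slide: it assumes the HTML of a captured page has already been read out of a W/ARC record as a string, and the names `MetaTagExtractor` and `extract_site_metadata` are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collects the <title> text and the description/keywords <meta> values."""

    WANTED = {"description", "keywords"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = {k: (v or "") for k, v in attrs}
            name = attr.get("name", "").lower()
            # Keep only the first occurrence of each wanted meta tag.
            if name in self.WANTED and name not in self.fields:
                self.fields[name] = attr.get("content", "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            if "title" not in self.fields:
                # Collapse line breaks and runs of spaces inside the title.
                text = "".join(self._title_parts)
                self.fields["title"] = " ".join(text.split())

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_site_metadata(html_text):
    """Return a dict with any of: title, description, keywords."""
    parser = MetaTagExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.fields
```

The returned dictionary would then feed the title, abstract, and keyword slots of the MODS template; capture dates and MIME types come from the W/ARC record headers rather than from the HTML and are not covered here.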



Overview of MODS Record Data Elements

Title: Extracted from W/ARC file (HTML title tag). Cataloger uses it if viable, otherwise supplies one.

Alternative Title: Cataloger supplies if another useful and different title displays on the piece.

Name, Personal: Included for some archives; cataloger supplies when relevant.

Name, Corporate: Included for some archives; cataloger supplies when relevant.

Type of Resource: Boilerplate “text”.

Genre: Boilerplate “Web site”.

Origin Info: Extracted from W/ARC file; first/last dates captured, YYYYMMDD (iso8601).

Language: Boilerplate if known (iso639-2b code). Cataloger can supply additional languages.

Physical Description: Extracted from W/ARC file (MIME types, e.g., text/css, image/jpeg).

Abstract: Extracted from W/ARC file (META name=description content). Cataloger can edit/enhance.

Subject/Keywords: Extracted from W/ARC file (META name=keywords content). Cataloger can edit/enhance.

Subject/LCSH: Cataloger supplies.

Collection Title/PID: Boilerplate collection title and collection PID/handle.

Identifier: Boilerplate variant of handle, e.g., hdl:loc.natlib/mrva0000.0000.

Note: Extracted from W/ARC file; resolves to URL for active site.

Location/Usage: Boilerplate item-level PID/handle. PID is registered to resolve to the archived Web site URL.

Access Condition: Boilerplate rights/permissions info, imported from OSI records.

Record Info: Boilerplate record creation date and record identifier (handle suffix, e.g., mrva0000.0000).
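As a rough illustration of how boilerplate and extracted values might be merged into a preliminary record, here is a minimal Python sketch using the standard library. The function name `build_preliminary_mods` and the dictionary keys are assumptions made for this example; only a subset of the elements listed above is covered, and the handle patterns follow the identifier example on this slide.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("", MODS_NS)  # serialize MODS as the default namespace


def _add(parent, tag, text=None, **attrs):
    """Append a MODS-namespaced child element and return it."""
    child = ET.SubElement(parent, f"{{{MODS_NS}}}{tag}", attrs)
    child.text = text
    return child


def build_preliminary_mods(extracted, boilerplate, handle_suffix):
    """Merge extracted and boilerplate values into one preliminary record."""
    mods = ET.Element(f"{{{MODS_NS}}}mods", {"version": "3.2"})
    _add(_add(mods, "titleInfo"), "title", extracted["title"])
    _add(mods, "typeOfResource", boilerplate["type_of_resource"])
    _add(mods, "genre", boilerplate["genre"])
    origin = _add(mods, "originInfo")
    _add(origin, "dateCaptured", extracted["first_capture"],
         encoding="iso8601", point="start")
    _add(origin, "dateCaptured", extracted["last_capture"],
         encoding="iso8601", point="end")
    # Abstract and keywords exist only if the site supplied meta tags.
    if extracted.get("description"):
        _add(mods, "abstract", extracted["description"])
    if extracted.get("keywords"):
        _add(_add(mods, "subject", authority="keyword"), "topic",
             extracted["keywords"])
    _add(mods, "identifier", f"hdl:loc.natlib/{handle_suffix}")
    _add(mods, "note", extracted["live_url"], type="system details")
    _add(_add(mods, "location"), "url",
         f"http://hdl.loc.gov/loc.natlib/{handle_suffix}",
         usage="primary display")
    _add(mods, "accessCondition", boilerplate["access_condition"])
    return ET.tostring(mods, encoding="unicode")
```

A cataloger would then review this output and add LCSH subjects, names, and additional languages by hand, as described in the element list.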



Crisis in Darfur, Sudan 2006 Web Archive

Archive size: 218 sites

Harvest info: 1 phase, multiple captures

Frequency: Varies; weekly to monthly crawls for each site

Metadata: 1 collection-level MARC record, with collection-level PID; 218 item-level MODS records, with item-level PIDs

LCSH: 1 boilerplate LCSH heading; unlimited specific LCSH headings at site level, selected by the cataloger from a list of about 20 LCSH terms that relate to the content in the archive



Catalogers’ List for Darfur, 2006 Web Archive



Resource Page for an Archived Web Site, Darfur, 2006 Web Archive



Bilingual (eng/nor) Archived Web Site - Darfur, 2006 Web Archive



Preliminary MODS Record – Darfur, 2006 Web Archive



MODS Subject Heading List - Darfur, 2006 Web Archive



Completed MODS Record – Darfur, 2006 Web Archive

<mods xmlns="http://www.loc.gov/mods/v3" version="3.2">
  <titleInfo>
    <title>afrika.no: The Norwegian Council for Africa</title>
  </titleInfo>
  <typeOfResource>text</typeOfResource>
  <genre>Web site</genre>
  <originInfo>
    <dateCaptured encoding="iso8601" point="start">20060717</dateCaptured>
    <dateCaptured encoding="iso8601" point="end">20061120</dateCaptured>
  </originInfo>
  <language>
    <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
    <languageTerm authority="iso639-2b" type="code">nor</languageTerm>
  </language>
  <physicalDescription>
    <internetMediaType>application/download</internetMediaType>
    <internetMediaType>application/x-javascript</internetMediaType>
    <internetMediaType>image/bmp</internetMediaType>
    <internetMediaType>image/gif</internetMediaType>
    <internetMediaType>image/jpeg</internetMediaType>
    <internetMediaType>image/pjpeg</internetMediaType>
    <internetMediaType>text/css</internetMediaType>
    <internetMediaType>text/html</internetMediaType>
  </physicalDescription>
  <abstract>afrika.no - The Index on Africa and Africa News Update. Features news on and links to all countries in Africa. With sections on Culture, Development, Economy, Education, Environment, Health, Human Rights, News and Politics. By the Norwegian Council for Africa.</abstract>
  <subject authority="keyword">
    <topic>afrika, africa, culture, development, economy, education, environment, health, politics, travel</topic>
  </subject>
  <subject authority="lcsh">
    <geographic>Sudan</geographic>
    <topic>History</topic>
    <temporal>Darfur Conflict, 2003-</temporal>
  </subject>
  <subject authority="lcsh">
    <topic>International relief</topic>
  </subject>
  <subject authority="lcsh">
    <geographic>Sudan</geographic>
    <topic>Economic conditions</topic>
    <temporal>1983-</temporal>
  </subject>
  <relatedItem type="host">
    <titleInfo>
      <title>Crisis in Darfur, Sudan Web Archive, 2006</title>
    </titleInfo>
    <location>
      <url>http://hdl.loc.gov/loc.natlib/collnatlib.00000011</url>
    </location>
  </relatedItem>
  <identifier>hdl:loc.natlib/mrva0011.0037</identifier>
  <note type="system details">www.afrika.no/</note>
  <location>
    <url displayLabel="Archived site">http://loc.archive.org/darfur/2006*/www.afrika.no/</url>
  </location>
  <location>
    <url usage="primary display">http://hdl.loc.gov/loc.natlib/mrva0011.0037</url>
  </location>
  <accessCondition>Access restricted to on-site users at the Library of Congress.</accessCondition>
  <recordInfo>
    <recordCreationDate encoding="iso8601">20070516</recordCreationDate>
    <recordIdentifier source="dlc">mrva0011.0037</recordIdentifier>
  </recordInfo>
</mods>
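To show how the parsed LCSH subjects in a record like this could be read back for browse lists or indexing, here is a small Python sketch. The function name `lcsh_headings` is hypothetical, and joining subdivisions with a double hyphen is a display convention assumed for the example, not something the record itself prescribes.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/mods/v3"}


def lcsh_headings(mods_xml):
    """Return each LCSH subject as one string, subdivisions joined by '--'."""
    root = ET.fromstring(mods_xml)
    headings = []
    for subject in root.findall("m:subject[@authority='lcsh']", NS):
        # Child order (geographic, topic, temporal, ...) is kept as recorded.
        parts = [(child.text or "").strip() for child in subject]
        headings.append("--".join(p for p in parts if p))
    return headings
```

Run against the record above, this would yield three headings, with the keyword subject skipped because its authority is not `lcsh`.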



Displayed MODS Record - Darfur, 2006 Web Archive



Tag Cloud Generated from Archived Web Site Darfur, 2006 Web Archive



Library of Congress Web Archives Homepage



Collection Overview - Darfur, 2006 Web Archive



Search Page - Darfur, 2006 Web Archive



Browse Page - Darfur, 2006 Web Archive



MARC Collection-Level Record - Darfur, 2006 Web Archive



Google Search – Collection-level - Darfur, 2006 Web Archive



Google Search – Item-level - Darfur, 2006 Web Archive


LC Web Archives – Levels of Access

[Diagram. Labeled components: LC OPAC/ILS search; NutchWAX Lucene search interface; archive-level homepage & MODS records search/browse (107th Congress, 108th Congress, Election 2002, Election 2004, September 11, 2001, Olympics 2002, Iraq War 2003, Papal Transition 2005, Crisis in Darfur 2006, Egypt 2008, Legal Blawgs); NutchWAX indexes; MODS item-level records; W/ARC files (archived web sites); Internet search engines.]



Results - Pros

  • Archived resources are searchable and indexable along with other library collections and online resources—METS/MODS records are portable

  • Item-level and collection-level subject access and controlled vocabularies make these resources highly integratable at the item level and collection-level

  • Site-level access facilitates searching and browsing within and across web archives—ability to find, refind & cite resources

  • Good use and reuse of extracted and human-created metadata—friendly environment in which traditional MARC catalogers learn XML and MODS—project benefits from specialized subject cataloger expertise

  • Flexible and sustainable infrastructure for making web archives available for digital scholarship: stable/citable persistent IDs at the site level and the collection level



Results - Cons

  • Scalability—approach works well with archives of up to 2,000 sites, but hasn’t been tested w/much larger archives

  • Project investment is basically the same for each archive—whether it’s 100 sites or 2000 sites--project setup still requires template creation, metadata extraction, LCSH analysis at archive level, handle registration, etc.—so essentially the same amount of resources regardless of archive size

  • Proliferation of needed “sub-workflows” for “types of archives”, i.e., elections, congresses, OVOPS, etc.



NDMSO & ABA Accomplishments in 2009/2010

  • NDMSO implemented workflow management software (SmartSheet) to track the web archiving workflow and the tasks completed across divisions and teams (OSI, NDMSO, ABA)

  • NDMSO enhanced the LCWA/MODS metadata profile to better comply with DLF/Aquifer Guidelines v. 1.1 and updated records

  • NDMSO wrapped LCWA MODS in a METS wrapper and developed a program to generate a thumbnail image of the first capture to display with METS/MODS record

  • NDMSO got an LC site license for oXygen approved, so it is available on all LC workstations

  • NDMSO trained an ABA cataloger to take over oversight and management of MODS cataloging

  • ABA developed sub-workflows that function within the overall LCWA framework—i.e., for Elections, Congresses, OVOPS, Single Sites, etc.
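The METS wrapping mentioned in the list above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `wrap_mods_in_mets` and the `DMD1` identifier are hypothetical, and the output is only the descriptive-metadata portion. A complete METS document also requires a structural map, and the thumbnail generation is not covered.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("mods", MODS_NS)


def wrap_mods_in_mets(mods_xml, dmd_id="DMD1"):
    """Embed one MODS record in a METS descriptive metadata section."""
    mods_root = ET.fromstring(mods_xml)
    mets = ET.Element(f"{{{METS_NS}}}mets")
    dmd_sec = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", {"ID": dmd_id})
    # MDTYPE tells consumers which schema the wrapped metadata follows.
    md_wrap = ET.SubElement(dmd_sec, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "MODS"})
    xml_data = ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData")
    xml_data.append(mods_root)
    return ET.tostring(mets, encoding="unicode")
```

Keeping the MODS record intact inside `xmlData` means the same descriptive record can still be pulled out and reused on its own.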



Workflow Management for LCWA Implemented in 2009



LCWA Workflow Template in SmartSheet



LCWA metadata profile was improved in 2009 to comply with revised DLF/Aquifer Guidelines issued in March 2009 <http://www.loc.gov/standards/mdc/docs/>



Excel Spreadsheet for OVOPS Data Entry Indonesian General Elections, 2009 Web Archive



LCWA raw METS/MODS in XML DataStore Pilot



LCWA raw METS/MODS in XML DataStore Pilot



LCWA METS/MODS displayed in XML DataStore Pilot



Library of Congress XML DataStore Project

Starting Fall 2010

Unified Searching of LC Materials

PHASE I (Fall 2010)

  • LC Online Catalog (OPAC/ILS) – 17 million METS/MODS/MARCXML <http://catalog.loc.gov/>

  • LC Encoded Archival Description Finding Aids – EADs <http://www.loc.gov/rr/ead/>

  • LC Performing Arts Encyclopedia (PAE) – 50,000 METS/MODS <http://www.loc.gov/performingarts/>

PHASE II

  • LC American Memory Records – 13 million METS/MODS <http://memory.loc.gov/ammem/index.html>

  • LC Veterans History Project – METS/MODS <http://www.loc.gov/vets/>

  • LC Web Archives (LCWA) – 8,000+ native METS/MODS <http://www.loc.gov/lcwa/>

PHASE III

  • All other Library Services owned and managed metadata/content--including full-text XML

  • Tibetan Oral History Project - TEI

  • Newspaper Project – ALTO XML

  • Handbook of Latin American Studies – 160,000 article-level MARC

TYPES OF RECORDS to be included:

METS, MODS, MARCXML, EAD, TEI, KML, ALTO, etc.



Considerations & Challenges for 2010/2011

  • Update LCWA MODS profile to current version of MODS—newest version is MODS 3.4 – and MODS 4.0 is on the horizon

  • Load, store, search, browse, retrieve LCWA METS/MODS/(perhaps PREMIS) records in LC’s new “XML datastore” (a MarkLogic database) with all other LC records

  • Create a flexible MODS input/editing form/tool that would hide boilerplate and extracted metadata the cataloger does not need to see. We experimented with XMLSpy’s Authentic and with XForms, but both cost us flexibility in handling parsed subjects

  • Experiment with tag clouds for preliminary access to uncataloged archives, as a possible cataloging tool, and as an alternative display tool at the item and archive levels

  • Multilingual/Multiscript collections—extracting multilingual/multiscript metadata and providing search and browse in multiple foreign languages, with some that read right to left, such as Arabic—presents new challenges

  • Preservation issues—identify relevant PREMIS preservation metadata that can be extracted/boilerplated into the METS record/wrapper, and included along with the MODS descriptive metadata

  • Integrate access to collaborative web archives—access issues will be more complicated because different archiving institutions use different cataloging standards and conventions, PIDs, etc.

  • Integrate the NutchWAX/SOLR component to provide more comprehensive keyword access to W/ARC files—to complement existing collection and site-level access

  • Experiment with “preloading” some descriptive metadata into the WARC files, to be extracted later into the preliminary MODS record
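The tag cloud idea in the list above could be prototyped along these lines. This is a hedged Python sketch, not the tool used for the tag cloud slide shown earlier: the function name `tag_cloud_weights`, the tiny English stopword list, and the 1-to-5 size scale are all assumptions, and multilingual or right-to-left text would need language-specific handling.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "are", "was"}


def tag_cloud_weights(text, top_n=10, min_size=1, max_size=5):
    """Return (term, size) pairs for the most frequent terms in text."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    top = counts.most_common(top_n)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = (highest - lowest) or 1  # avoid dividing by zero when counts tie
    return [
        (term, min_size + round((count - lowest) * (max_size - min_size) / span))
        for term, count in top
    ]
```

Sizes are scaled linearly between the least and most frequent of the selected terms, which is enough for a preliminary browse display of an uncataloged archive.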



THAT’S ALL FOLKS


