Automating Metadata Extraction for Web Archives: Efficiency in Record Creation

Metadata Extraction & Web Archives: Automating the Record Creation Process
Abbie Grotke / abgr@loc.gov
Gina Jones / gjon@loc.gov
Library of Congress, Office of Strategic Initiatives, Web Capture Team

Library of Congress Web Archives
• Since 2000, 20+ thematic, event-based collections
• 100 TB+ of data collected
• 12,500+ URLs
http://www.loc.gov/lcwa

Web Archiving Tools
• Crawling: Heritrix, WARC
• Access: Wayback Machine, NutchWAX
International Internet Preservation Consortium: netpreserve.org

LC’s Web Archive Workflow
• Identify and select URLs (LS or LAW)
• Determine crawl strategy and create a seed list for crawling (OSI)
• Sites harvested by Internet Archive or in-house crawlers (OSI)
• Quality review (OSI and curators)
• Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction

Describing the Archives
• Collection-level MARC record in OPAC
• Item-level MODS records in LCWA
• One record per recommended URL for each distinct collection
• With so many thousands of URLs to process, how do we streamline the process?

XML MODS Template

Metadata Extraction
For each URL that will be cataloged:
• Get the archived web site metadata
• Combine it with the URL Nominations Database metadata
• For an election or campaign web site, also pull metadata from our candidate Access database (used to create subject terms)
• Using the XML template, add collection-level and record-level metadata
• Create a single file for delivery
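The steps above can be sketched in Python as follows. This is only an illustration, not the Library's actual tooling: the function names (`build_record`, `build_delivery_file`), the dictionary keys, and the way candidate data is turned into a subject term are all assumptions made for the example. The MODS element names (`titleInfo`, `abstract`, `location/url`, `accessCondition`, `originInfo/dateCaptured`, `subject/topic`) come from the MODS schema, but the choice of which source field maps to which element is a guess.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Qualify a tag name with the MODS namespace."""
    return f"{{{MODS_NS}}}{tag}"


def build_record(nomination, archived, candidate=None):
    """Combine the data sources for one URL into one MODS record.

    All dictionary keys here are hypothetical field names.
    """
    mods = ET.Element(q("mods"))

    # From the archived web site (first capture and Wayback index).
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = archived["title"]
    ET.SubElement(mods, q("abstract")).text = archived["abstract"]
    origin = ET.SubElement(mods, q("originInfo"))
    for point, key in (("start", "first_capture"), ("end", "last_capture")):
        ET.SubElement(origin, q("dateCaptured"), point=point).text = archived[key]

    # From the URL nominations data.
    location = ET.SubElement(mods, q("location"))
    ET.SubElement(location, q("url")).text = nomination["url"]
    ET.SubElement(mods, q("accessCondition")).text = nomination["access_rights"]

    # Subject terms; candidate data adds one more term (format is assumed).
    subjects = list(nomination["subject_terms"])
    if candidate is not None:
        subjects.append(
            f"{candidate['name']} ({candidate['party']}, {candidate['state']})"
        )
    for term in subjects:
        subject = ET.SubElement(mods, q("subject"))
        ET.SubElement(subject, q("topic")).text = term
    return mods


def build_delivery_file(records):
    """Wrap all records in a single modsCollection document."""
    collection = ET.Element(q("modsCollection"))
    collection.extend(records)
    return ET.tostring(collection, encoding="unicode")
```

Calling `build_record` once per cataloged URL and passing the resulting list to `build_delivery_file` yields one XML string that could be written out as the single delivery file.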

Data Sources for Metadata Extraction

URL Nominations Database
• URL
• Access Rights
• Language(s)
• Category
• Subject Terms

Election Candidate Metadata
• Name
• URL
• Party Affiliation
• State
• Race
• District (House)

Archived Web Site Metadata
• From 1st capture: Document Title, Keywords, Abstract, MIME Types
• From Wayback index: Capture Dates (First & Last)
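As a rough illustration of how these fields might be gathered, the sketch below pulls the title, keywords, and description out of a captured HTML page and derives first and last capture dates from a list of 14-digit Wayback-style timestamps (YYYYMMDDhhmmss). The class and function names are hypothetical, and treating the page's `<meta name="description">` as the abstract is an assumption, not something the slides state.

```python
from datetime import datetime
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <title> text and selected <meta> values from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_metadata(html):
    """Return title, keywords, and abstract for the first capture."""
    parser = MetaExtractor()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "keywords": parser.meta.get("keywords", ""),
        "abstract": parser.meta.get("description", ""),  # assumed mapping
    }


def capture_date_range(timestamps):
    """Return (first, last) capture dates as ISO strings."""
    dates = sorted(datetime.strptime(ts, "%Y%m%d%H%M%S") for ts in timestamps)
    return dates[0].date().isoformat(), dates[-1].date().isoformat()
```

Pages with no `<meta>` tags simply produce empty keyword and abstract strings, which a cataloger could then fill in by hand.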

Combined Data in Template
