This presentation discusses strategies for automating the metadata extraction process within the Library of Congress's web archiving efforts. Since 2000, over 100 TB of data from 12,500+ URLs has been collected, necessitating a more efficient workflow for record creation. Key tools and techniques, including the Heritrix crawler and XML MODS templates, are explored, focusing on streamlining processes for cataloging archived websites. Specific data sources and quality review practices in metadata extraction will also be addressed.
Metadata Extraction & Web Archives: Automating the Record Creation Process
Abbie Grotke / abgr@loc.gov
Gina Jones / gjon@loc.gov
Library of Congress, Office of Strategic Initiatives, Web Capture Team
Library of Congress Web Archives
• Since 2000, 20+ thematic, event-based collections
• 100 TB+ of data collected
• 12,500+ URLs
http://www.loc.gov/lcwa
Web Archiving Tools
• Crawling:
  • Heritrix
  • WARC
• Access:
  • Wayback Machine
  • NutchWAX
International Internet Preservation Consortium (netpreserve.org)
LC’s Web Archive Workflow
• Identify & select URLs (LS or LAW)
• Determine crawl strategy, create a seed list for crawling (OSI)
• Sites harvested by Internet Archive or in-house crawlers (OSI)
• Quality Review (OSI & curators)
• Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction
Describing the Archives
• Collection-level MARC record in OPAC
• Item-level MODS records in LCWA
• One record per recommended URL for each distinct collection
• With so many thousands of URLs to process, how do we streamline the process?
Metadata Extraction
• For each URL that will be cataloged:
• Get archived web site metadata
• Combine with URL Nominations Database metadata
• If elections/campaign web site, metadata also pulled from our candidate Access database (used to create subject terms)
• Using XML template, we add collection and record level metadata
• Create a single file for delivery
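The steps on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Web Capture Team's actual tooling: the function names (`merge_sources`, `build_mods`, `build_delivery_file`), the dictionary keys, the sample values, and the rule that turns a candidate's name into a subject term are all hypothetical. Only the MODS element names (`modsCollection`, `mods`, `titleInfo`, `location/url`, `subject/topic`, `relatedItem`) come from the MODS schema itself.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Qualify a tag name with the MODS namespace."""
    return "{%s}%s" % (MODS_NS, tag)


def merge_sources(archived, nomination, candidate=None):
    """Merge the per-URL metadata sources into one flat dict."""
    record = dict(archived)
    record.update(nomination)
    subjects = list(nomination.get("subject_terms", []))
    if candidate is not None:
        # Assumed rule: the candidate's name becomes an extra subject term.
        subjects.append(candidate["name"])
    record["subject_terms"] = subjects
    return record


def build_mods(record, collection_title):
    """Fill a minimal MODS record from the merged metadata."""
    mods = ET.Element(q("mods"))
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = record["title"]
    location = ET.SubElement(mods, q("location"))
    ET.SubElement(location, q("url")).text = record["url"]
    for term in record["subject_terms"]:
        subject = ET.SubElement(mods, q("subject"))
        ET.SubElement(subject, q("topic")).text = term
    # Collection-level metadata, shared by every record in the collection.
    host = ET.SubElement(mods, q("relatedItem"), {"type": "host"})
    host_title = ET.SubElement(host, q("titleInfo"))
    ET.SubElement(host_title, q("title")).text = collection_title
    return mods


def build_delivery_file(records, collection_title):
    """Wrap all records in one modsCollection and serialize it."""
    root = ET.Element(q("modsCollection"))
    for record in records:
        root.append(build_mods(record, collection_title))
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data for one cataloged URL.
merged = merge_sources(
    {"title": "Example Campaign Site"},
    {"url": "http://www.example.com/", "subject_terms": ["Elections"]},
    {"name": "Doe, Jane"},
)
xml_text = build_delivery_file([merged], "Example Election Web Archive")
```

Wrapping every record in a single `modsCollection` element is one way to satisfy the "single file for delivery" step; a real template would carry many more fields than this sketch does.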
URL Nominations Database
• URL
• Access Rights
• Language(s)
• Category
• Subject Terms
Election Candidate Metadata
• Name
• URL
• Party Affiliation
• State
• Race
• District (House)
Archived Web Site Metadata
From 1st capture:
• Document Title
• Keywords
• Abstract
• Mime Types
From Wayback index:
• Capture Dates (First & Last)
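As a rough illustration of how these fields might be pulled out, the Python sketch below reads the document title and the `keywords` and `description` meta tags from the HTML of a first capture, and takes the first and last capture dates from a list of 14-digit Wayback timestamps. It rests on several assumptions: that the abstract comes from the `description` meta tag, that the timestamps have already been read out of the index, and that the names `HeadMetadataParser`, `extract_capture_metadata`, and `capture_date_range` are hypothetical. MIME types are not covered here.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect <title> text and the keywords/description <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Meta names vary in case across sites, so normalize them.
            name = (attr.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_capture_metadata(html_text):
    """Return title, keywords, and abstract found in one HTML capture."""
    parser = HeadMetadataParser()
    parser.feed(html_text)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.meta.get("keywords", ""),
        # Assumption: the abstract is taken from the description meta tag.
        "abstract": parser.meta.get("description", ""),
    }


def capture_date_range(timestamps):
    """Return (first, last) from 14-digit YYYYMMDDhhmmss timestamps.

    Fixed-width timestamps sort chronologically as plain strings.
    """
    ordered = sorted(timestamps)
    return ordered[0], ordered[-1]


# Hypothetical sample capture and timestamps.
sample_html = """<html><head><title> Example Campaign Site </title>
<meta name="Keywords" content="election, campaign">
<meta name="description" content="Official campaign site.">
</head><body></body></html>"""
info = extract_capture_metadata(sample_html)
first, last = capture_date_range(
    ["20081104120000", "20080102093000", "20080615000000"]
)
```

Sites that omit these meta tags yield empty strings, which is one reason extracted records still benefit from quality review.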