Search and Access Technologies for Large Scale Web Archives

Joseph JaJa, Sangchul Song, and Mike Smorul
Institute for Advanced Computer Studies
Department of Electrical and Computer Engineering
University of Maryland
In collaboration with the Library of Congress and the Internet Archive

Web Archiving
• Web – Main publication/communication medium today, but it is an ephemeral medium.
• Web Archiving:
  • Capture, annotate, and store important web contents within their contextual and temporal characteristics;
  • Preserve to enable search and access in the long term;
  • Unprecedented scale and heterogeneity.
NDIIPP Partners Meeting

Goals
• Discovery of relevant contents based on unstructured queries involving temporal specifications
• Presentation of pertinent summary information in ranked order according to the temporal context
• Scalable search and access performance

Existing Access Methods
• Chronological Listing Based on URLs
  • Used by the Wayback Machine of the Internet Archive, arguably the leader in web archiving.
• Directory Organization
  • Typically for domain-specific contents, which are organized according to some hierarchical structure.
• Full Text Search
  • Similar to current web search engines (NutchWax/WERA)

Limitations of Current Technologies
• Chronological Listing
  • Users are expected to provide URLs.
• Hierarchical Listing
  • Not scalable. Users must explore hierarchical structures that may contain large numbers of entries.
• Full Text Search (NutchWax/WERA)
  • Ranking of returned results does not take temporal context into consideration.
  • Results are presented as a listing similar to current web search engines.
  • Lacks performance and scalability.

Issue #1: Scalability and Performance
• For any search time span, the ENTIRE history has to be examined. (Multiple distributed indices can be maintained instead. However, all the indices still need to be searched.)
[Figure: timeline with the search time span marked]

Example: Search All, and then Filter
Query: “Find web pages that contain ‘September 11th’ before 2001”
4 Million+ pages:
• September 11 attacks - Wikipedia, the free encyclopedia
  The September 11 attacks (often referred to as nine-eleven, written 9/11) were a series of coordinated suicide attacks by al-Qaeda upon the United States on …
  en.wikipedia.org/wiki/September_11,_2001_attacks
• September 11 Digital Archive
  Uses electronic media to collect, preserve, and present the history of the September 11, 2001 attacks in New York, Virginia, and Pennsylvania and the public …
  911digitalarchive.org/
• 9/11 Tributes, September 11 Tributes and Memorials to the Victims …
  Tributes of 9/11 - September 11th 9/11 memorials. For the Victims their Families and the many Heroes of September 11th. 2001. 9/11 World Trade Center, ...
  www.jontzen.com/tribute.htm - 132k
• National Commission on Terrorist Attacks Upon the United States
  Commission chartered to prepare a full and complete account of the circumstances surrounding the September 11, 2001 terrorist attacks, …
  www.9-11commission.gov/ - 8k
… and 4 million other pages pertaining to the September 11th Attack
Search all, and then Filter → Very inefficient!!
600+ pages:
• Ethiopian calendar - Wikipedia, the free encyclopedia
  Thus the first day of the Ethiopian year, 1 Mäskäräm, for years between 1901 and 2099 (inclusive), is usually September 11 (Gregorian), ...
  en.wikipedia.org/wiki/Ethiopian_calendar - 43k
• APOD: September 11, 1997 - Mars Global Surveyor: Aerobraking
  September 11, 1997 See Explanation. Clicking on the picture will download the highest resolution version available. Mars Global Surveyor: Aerobraking …
  apod.nasa.gov/apod/ap970911.html - 5k
… and only 630 other pages that are irrelevant to the September 11th Attack
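The "search all, and then filter" pattern in this example can be sketched as follows. This is a minimal illustration, not the implementation of any system named in the slides: the `INDEX` structure, the `search_then_filter` function, the `example.org` URLs, and all capture dates are hypothetical. It only shows that the number of postings scanned tracks the whole history, however narrow the time span is.

```python
from datetime import date

# Hypothetical single inverted index over the whole archive:
# term -> list of (url, capture_date) postings. All entries are invented.
INDEX = {
    "september 11th": [
        ("example.org/astronomy-1997", date(1997, 9, 11)),
        ("example.org/memorial", date(2002, 3, 1)),
        ("example.org/commission", date(2004, 7, 22)),
        ("example.org/tribute", date(2003, 9, 11)),
    ],
}

def search_then_filter(term, start, end):
    """Scan every posting for the term, then keep captures in [start, end)."""
    postings = INDEX.get(term, [])
    scanned = len(postings)  # work is proportional to the ENTIRE history
    hits = [url for url, captured in postings if start <= captured < end]
    return hits, scanned

hits, scanned = search_then_filter(
    "september 11th", date(1990, 1, 1), date(2001, 1, 1)
)
print(hits, scanned)
```

In this toy data one posting survives the filter, yet all four were scanned; the slide's figures (4 million+ examined, 600+ kept) describe the same imbalance at scale.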

Issue #2: Time-independent Ranking
• Regardless of the search time span, the current ranking schemes always consider the ENTIRE history.
• The meaning and popularity of a term change over time, so a ranking scheme should depend not only on the search terms but also on the search time span.
[Figure: timeline with the search time span marked]

Issue #3: Ineffective Search Result Delivery
• Search results are usually delivered as a list of URLs, sorted by relevance rank.
• No other grouping / sorting options are available.

Core Technologies Developed
• Ranking that depends on the time span specified by the user.
• Flexible and intuitive presentations of the returned results, ordered according to the user’s specification.
• First step toward scalable and efficient ‘full text + temporal’ search.

Scalable & Efficient Temporal Searches
[Figure: timeline divided into time-windows with boundaries t1, t2, t3, t4; the search time span overlaps two windows]
For a given search time span, only these two indices are involved.
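The idea of consulting only the indices whose time-window overlaps the search time span can be sketched as below. The window boundaries, the index names, and the `indices_for_span` function are assumptions made for illustration; the slides do not state the actual window size or how indices are named.

```python
from datetime import date

# Hypothetical time-windowed indices: (window_start, window_end, index_name),
# each covering the half-open interval [window_start, window_end).
WINDOWS = [
    (date(2003, 12, 1), date(2004, 4, 1), "index-0"),
    (date(2004, 4, 1), date(2004, 8, 1), "index-1"),
    (date(2004, 8, 1), date(2004, 12, 1), "index-2"),
    (date(2004, 12, 1), date(2005, 4, 1), "index-3"),
]

def indices_for_span(span_start, span_end):
    """Return names of indices whose window overlaps [span_start, span_end)."""
    return [
        name
        for start, end, name in WINDOWS
        if start < span_end and span_start < end  # interval-overlap test
    ]

selected = indices_for_span(date(2004, 3, 15), date(2004, 6, 1))
print(selected)
```

With these assumed windows, a span from mid-March to the end of May 2004 touches two of the four indices, so the other two are never opened.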

Index Distribution and Parallel Search
[Figure: search architecture. Labels: Web Interface (ADAPT Web Archive Search); Web Server; Request Broker; Result Aggregator; four Search Servers; Search Cluster]
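The Request Broker and Result Aggregator roles named on this slide suggest a fan-out and merge pattern, which can be sketched roughly as follows. This is an assumption about how such components typically cooperate, not a description of the ADAPT code: `SHARDS`, `search_shard`, `broker`, the scores, and the URLs are all hypothetical, and threads stand in for separate search servers.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-server index shards: term -> list of (score, url) hits.
SHARDS = [
    {"budget": [(0.9, "a.example/1"), (0.4, "a.example/2")]},
    {"budget": [(0.7, "b.example/1")]},
    {"budget": [(0.8, "c.example/1"), (0.1, "c.example/2")]},
    {},  # a server holding no hits for this term
]

def search_shard(shard, term):
    """Stand-in for one search server: return its local hits for the term."""
    return shard.get(term, [])

def broker(term, top_k=3):
    """Send the query to every shard in parallel, then merge the top_k hits."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term), SHARDS))
    all_hits = [hit for partial in partials for hit in partial]
    return heapq.nlargest(top_k, all_hits)  # highest scores first

top = broker("budget")
print(top)
```

Merging by score only makes sense if every server scores hits on a comparable scale, which is one reason the ranking statistics matter in a distributed setting.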

Time-dependent Ranking
[Figure: timeline divided into time-windows with boundaries t1, t2, t3, t4; the search time span is marked]
For a given search time span and terms, rankings depend on term popularity during this time span only (rather than the entire time span).
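One way to make a term's weight depend on the search time span is to compute its document frequency from captures inside that span only. The sketch below uses a smoothed inverse document frequency as a stand-in; the slides do not say which ranking formula the authors use, and `DOCS`, `idf_in_span`, and all dates and terms are invented for illustration.

```python
import math
from datetime import date

# Hypothetical captures: (url, capture_date, set of terms on the page).
DOCS = [
    ("a.example", date(2004, 1, 10), {"election", "budget"}),
    ("b.example", date(2004, 2, 10), {"budget"}),
    ("c.example", date(2004, 10, 5), {"election"}),
    ("d.example", date(2004, 10, 20), {"election", "debate"}),
    ("e.example", date(2004, 11, 1), {"election"}),
]

def idf_in_span(term, start, end):
    """Smoothed IDF computed from captures in [start, end) only."""
    in_span = [terms for _, captured, terms in DOCS if start <= captured < end]
    df = sum(1 for terms in in_span if term in terms)
    return math.log((1 + len(in_span)) / (1 + df)) + 1.0

early = idf_in_span("election", date(2004, 1, 1), date(2004, 3, 1))
late = idf_in_span("election", date(2004, 10, 1), date(2004, 12, 1))
whole = idf_in_span("election", date(2004, 1, 1), date(2005, 1, 1))
print(early, late, whole)
```

In this toy data the term is rarer early in the year than late in the year, so its weight differs between the two spans, and neither matches the weight computed over the entire history.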

Search Result Delivery
• Sorted by Relevance
• Grouped by Time
• Grouped by URL
• Sorted by Time
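The four delivery options can be illustrated with plain sorting and grouping over a list of hits. This is only a sketch: the hit tuple layout, the function names, and the choice to group time by calendar month are assumptions, not details taken from the slides.

```python
from collections import defaultdict
from datetime import date

# Hypothetical search hits: (url, capture_date, relevance_score).
HITS = [
    ("a.example/page", date(2004, 5, 1), 0.6),
    ("b.example/page", date(2004, 3, 1), 0.9),
    ("a.example/page", date(2004, 2, 1), 0.8),
]

def sorted_by_relevance(hits):
    """Highest relevance score first."""
    return sorted(hits, key=lambda hit: hit[2], reverse=True)

def sorted_by_time(hits):
    """Oldest capture first."""
    return sorted(hits, key=lambda hit: hit[1])

def grouped_by_url(hits):
    """Map each URL to its captures in chronological order."""
    groups = defaultdict(list)
    for hit in sorted_by_time(hits):
        groups[hit[0]].append(hit)
    return dict(groups)

def grouped_by_time(hits):
    """Map each (year, month) to its hits, most relevant first."""
    groups = defaultdict(list)
    for hit in sorted_by_relevance(hits):
        groups[(hit[1].year, hit[1].month)].append(hit)
    return dict(groups)

print(grouped_by_url(HITS))
```

Grouping by URL gathers all captures of one page into a single entry, which is what lets a result list show a page's history instead of repeating the same URL once per crawl.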

Collection Used
• Collaboration with the Library of Congress and the Internet Archive.
• US 108th Congress Web Archive:
  • 16 monthly crawls between December 2003 and March 2005.
  • Web sites of Representatives, Senators, Delegates, and Committees of the 108th US Congress (2003-2004).
  • Number of sites: 582
  • Number of records: 27 million
  • Total size: around 2 TB
  • Archived in the Library of Congress

[Figure: system architecture. Labels: Internet; Library of Congress; Internet Archive; ADAPT Web Archive Server; UMIACS; Search/Return Ranked URLs; Retrieve Web Documents; WARCs; Storage Containers; Inverted Indices; Processing/Indexing Cluster (Hadoop); Storage Cluster; Search Cluster]

Demo

Screen Shots
[Figure: annotated screenshots of the search interface. Labels: Ungroup; Group by Time; Sort by Time; Sort by Relevance; Follow Link; Collapse Results; Retrieve Page; Search Keywords; Time Span; Options]
May 21, 2009
