

Search and Access Technologies for Large Scale Web Archives

Joseph JaJa, Sangchul Song, and Mike Smorul

Institute for Advanced Computer Studies

Department of Electrical and Computer Engineering

University of Maryland

In Collaboration with the Library of Congress and the Internet Archive



Web Archiving

  • Web – Main publication/communication medium today, but it is an ephemeral medium.

  • Web Archiving:

    • Capture, annotate, and store important web contents within their contextual and temporal characteristics;

    • Preserve to enable search and access in the long term;

    • Unprecedented scale and heterogeneity.

NDIIPP Partners Meeting



Goals

  • Discovery of relevant contents based on unstructured queries involving temporal specifications

  • Presentation of pertinent summary information in ranked order according to the temporal context

  • Scalable search and access performance


Existing Access Methods

  • Chronological Listing Based on URLs

    • Used by the Wayback Machine of the Internet Archive, arguably the leader in web archiving.

  • Directory Organization

    • Typically for domain specific contents, which are organized according to some hierarchical structure.

  • Full Text Search

    • Similar to current web search engines (NutchWax/WERA)


Limitations of Current Technologies

  • Chronological Listing

    • Users are expected to provide URLs.

  • Hierarchical Listing

    • Not scalable. Users explore hierarchical structures, with possibly large numbers of entries.

  • Full Text Search (NutchWax/WERA)

    • Ranking of returned results does not take temporal context into consideration.

    • A listing similar to current web search engines.

    • Lacks performance and scalability.


Issue #1: Scalability and Performance

  • For any search time span, the ENTIRE history has to be examined. Multiple distributed indices can be maintained instead, but all of them still need to be searched.

[Figure: timeline of the archive's history with the search time span marked]


Example: Search All, and then Filter

“Find web pages that contain ‘September 11th’ before 2001”

Results of searching the entire archive (4 million+ pages):

  • September 11 attacks - Wikipedia, the free encyclopedia. The September 11 attacks (often referred to as nine-eleven, written 9/11) were a series of coordinated suicide attacks by al-Qaeda upon the United States on … en.wikipedia.org/wiki/September_11,_2001_attacks

  • September 11 Digital Archive. Uses electronic media to collect, preserve, and present the history of the September 11, 2001 attacks in New York, Virginia, and Pennsylvania and the public … 911digitalarchive.org/

  • 9/11 Tributes, September 11 Tributes and Memorials to the Victims … Tributes of 9/11 - September 11th 9/11 memorials. For the Victims their Families and the many Heroes of September 11th. 2001. 9/11 World Trade Center, ... www.jontzen.com/tribute.htm - 132k

  • National Commission on Terrorist Attacks Upon the United States. Commission chartered to prepare a full and complete account of the circumstances surrounding the September 11, 2001 terrorist attacks, … www.9-11commission.gov/ - 8k

  • … and 4 million other pages pertaining to the September 11th attacks …

Search all, and then filter: very inefficient!

Results remaining after the filter (600+ pages):

  • Ethiopian calendar - Wikipedia, the free encyclopedia. Thus the first day of the Ethiopian year, 1 Mäskäräm, for years between 1901 and 2099 (inclusive), is usually September 11 (Gregorian), ... en.wikipedia.org/wiki/Ethiopian_calendar - 43k

  • APOD: September 11, 1997 - Mars Global Surveyor: Aerobraking. September 11, 1997 See Explanation. Clicking on the picture will download the highest resolution version available. Mars Global Surveyor: Aerobraking … apod.nasa.gov/apod/ap970911.html - 5k

  • … and only 630 other pages that are irrelevant to the September 11th attacks



Issue #2: Time-independent Ranking

  • Regardless of the search time span, the current ranking schemes always consider the ENTIRE history.

  • The meaning and popularity of a term change over time, so a ranking scheme should depend not only on the search terms but also on the search time span.

[Figure: timeline of the archive's history with the search time span marked]


Issue #3: Ineffective Search Result Delivery

  • Search results are usually delivered as a list of URLs, sorted by the relevance ranks.

  • No other grouping / sorting options available.


Core Technologies Developed

  • Ranking that depends on the time span specified by the user.

  • Flexible and intuitive presentations of the returned results, ordered according to the user’s specification.

  • A first step toward scalable and efficient ‘full text + temporal’ search.


Scalable & Efficient Temporal Searches

[Figure: timeline divided into time-windows at t1, t2, t3, and t4, with the search time span marked]

For a given search time span, only these two indices are involved.
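The window selection described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `WindowIndex` class, the integer time units, and the half-open windows are hypothetical and not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowIndex:
    """One inverted index covering the half-open time window [start, end)."""
    name: str
    start: int  # hypothetical unit, e.g. months since the first crawl
    end: int


def indices_for_span(indices, span_start, span_end):
    """Return only the indices whose window overlaps [span_start, span_end)."""
    return [ix for ix in indices if ix.start < span_end and span_start < ix.end]


# Four consecutive time-windows (assumed boundaries, for illustration only).
windows = [
    WindowIndex("w0", 0, 4),
    WindowIndex("w1", 4, 8),
    WindowIndex("w2", 8, 12),
    WindowIndex("w3", 12, 16),
]

# A search time span that straddles two windows touches only two indices.
selected = indices_for_span(windows, 6, 11)
print([ix.name for ix in selected])
```

With one index per time-window, the cost of a query grows with the number of windows the span overlaps rather than with the length of the whole archive history.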


Index Distribution and Parallel Search

[Figure: Web Interface; ADAPT Web Archive Search Web Server (Request Broker, Result Aggregator); Search Cluster (four Search Servers)]
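As a rough sketch of the broker and aggregator roles named in the figure, the following fans a query out to several index shards in parallel and merges the partial results. The in-memory shard layout, the function names, and the precomputed scores are all assumptions made for illustration; the presentation does not describe the actual protocol.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def search_server(shard, term):
    """Stand-in for one search server: return (score, url) hits from its shard."""
    return [(score, url) for url, (text, score) in shard.items() if term in text]


def broker_search(shards, term, top_k=3):
    """Fan the query out to every shard in parallel, then merge the best hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_server(s, term), shards))
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(top_k, merged)  # highest score first


# Hypothetical shards: url -> (page text, precomputed relevance score).
shards = [
    {"a.gov/1": ("budget hearing", 0.9), "a.gov/2": ("press release", 0.4)},
    {"b.gov/1": ("budget vote", 0.7)},
]
print(broker_search(shards, "budget"))
```

In a real deployment each shard would sit on a separate search server and the broker would issue network requests; the thread pool here only stands in for that parallelism.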


Time-dependent Ranking

[Figure: timeline divided into time-windows at t1, t2, t3, and t4, with the search time span marked]

For a given search time span and terms, rankings depend on term popularity during this time span only (rather than the entire time span)
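One way to picture span-dependent term popularity is to compute a term weight from only the documents captured inside the span. The smoothed inverse document frequency below is an assumption chosen for illustration; the slides do not give the ranking formula, and the document records are made up.

```python
import math


def idf_in_span(docs, term, span_start, span_end):
    """Smoothed IDF computed only from documents captured inside the span."""
    in_span = [d for d in docs if span_start <= d["time"] < span_end]
    df = sum(1 for d in in_span if term in d["text"].split())
    return math.log((1 + len(in_span)) / (1 + df)) + 1.0


# Hypothetical captures: "tax" is rare early on and common later.
docs = [
    {"time": 1, "text": "energy bill"},
    {"time": 2, "text": "tax relief"},
    {"time": 3, "text": "energy policy"},
    {"time": 10, "text": "tax cut"},
    {"time": 11, "text": "tax reform"},
    {"time": 12, "text": "tax vote"},
]

early = idf_in_span(docs, "tax", 0, 5)    # term is rare in this span
late = idf_in_span(docs, "tax", 10, 13)   # term is common in this span
whole = idf_in_span(docs, "tax", 0, 13)   # time-independent weight
print(early, late, whole)
```

The same term receives a higher weight in the span where it is rare and a lower one where it is common, which a single weight computed over the entire history cannot express.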


Search Result Delivery

  • Sorted by Relevance

  • Grouped by Time

  • Grouped by URL

  • Sorted by Time
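The four delivery options listed above amount to sorting or grouping the same hit list in different ways. A minimal sketch follows, assuming each hit is a dictionary with hypothetical `url`, `captured`, and `score` fields; none of these names come from the presentation.

```python
from collections import defaultdict


def sort_by_relevance(hits):
    """Most relevant hit first."""
    return sorted(hits, key=lambda h: h["score"], reverse=True)


def sort_by_time(hits):
    """Oldest capture first; ISO-style dates sort correctly as strings."""
    return sorted(hits, key=lambda h: h["captured"])


def group_by(hits, field):
    """Group hits by one field, e.g. 'url' or 'captured'."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[field]].append(hit)
    return dict(groups)


# Hypothetical hits from monthly crawls.
hits = [
    {"url": "a.gov/tax", "captured": "2004-03", "score": 0.6},
    {"url": "b.gov/tax", "captured": "2004-01", "score": 0.9},
    {"url": "a.gov/tax", "captured": "2004-01", "score": 0.5},
]

print([h["url"] for h in sort_by_relevance(hits)])
print([h["captured"] for h in sort_by_time(hits)])
print({url: len(group) for url, group in group_by(hits, "url").items()})
```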


Collection Used

  • Collaboration with the Library of Congress and the Internet Archive.

  • US 108th Congress Web Archive:

    • 16 monthly crawls between December 2003 and March 2005.

    • Web sites of Representatives, Senators, Delegates, and Committees of the 108th US Congress (2003-2004).

    • Number of sites: 582

    • Number of records: 27 million

    • Total size around 2TB

  • Archived in the Library of Congress


[Figure: system architecture. Components shown: Internet, Library of Congress, Internet Archive, and the ADAPT Web Archive Server at UMIACS, with a Processing/Indexing Cluster (Hadoop), a Storage Cluster, and a Search Cluster. Labels shown: WARCs, Storage Containers, Inverted Indices, Search/Return Ranked URLs, Retrieve Web Documents.]



Demo


Screen Shots

[Screenshot callouts: Search Keywords, Time Span, Options, Ungroup, Group by Time, Sort by Time, Sort by Relevance, Follow Link, Collapse Results, Retrieve Page]

May 21, 2009
