
Search and Access Technologies for Large Scale Web Archives

Joseph JaJa, Sangchul Song, and Mike Smorul. Institute for Advanced Computer Studies, Department of Electrical and Computer Engineering, University of Maryland. In collaboration with the Library of Congress and the Internet Archive.


Presentation Transcript


  1. Search and Access Technologies for Large Scale Web Archives
     Joseph JaJa, Sangchul Song, and Mike Smorul
     Institute for Advanced Computer Studies, Department of Electrical and Computer Engineering, University of Maryland
     In collaboration with the Library of Congress and the Internet Archive

  2. Web Archiving
     • Web: the main publication/communication medium today, but an ephemeral one.
     • Web archiving:
       • Capture, annotate, and store important web contents within their contextual and temporal characteristics;
       • Preserve them to enable search and access in the long term;
       • Unprecedented scale and heterogeneity.
     NDIIPP Partners Meeting

  3. Goals
     • Discovery of relevant contents based on unstructured queries involving temporal specifications
     • Presentation of pertinent summary information in ranked order according to the temporal context
     • Scalable search and access performance

  4. Existing Access Methods
     • Chronological Listing Based on URLs
       • Used by the Wayback Machine of the Internet Archive, arguably the leader in web archiving.
     • Directory Organization
       • Typically for domain-specific contents, which are organized according to some hierarchical structure.
     • Full Text Search
       • Similar to current web search engines (NutchWax/WERA)

  5. Limitations of Current Technologies
     • Chronological Listing
       • Users are expected to provide URLs.
     • Hierarchical Listing
       • Not scalable. Users explore hierarchical structures, with possibly large numbers of entries.
     • Full Text Search (NutchWax/WERA)
       • Ranking of returned results does not take temporal context into consideration.
       • Results are presented as a listing similar to current web search engines.
       • Lacks performance and scalability.

  6. Issue #1: Scalability and Performance
     • For any search time span, the ENTIRE history has to be examined. (Multiple distributed indices can be maintained instead. However, all the indices still need to be searched.)
     [Figure: timeline with the search time span marked]

  7. Example: Search All, and then Filter
     Query: "Find web pages that contain 'September 11th' before 2001"
     4 Million+ pages:
     • September 11 attacks - Wikipedia, the free encyclopedia. The September 11 attacks (often referred to as nine-eleven, written 9/11) were a series of coordinated suicide attacks by al-Qaeda upon the United States on … en.wikipedia.org/wiki/September_11,_2001_attacks
     • September 11 Digital Archive. Uses electronic media to collect, preserve, and present the history of the September 11, 2001 attacks in New York, Virginia, and Pennsylvania and the public … 911digitalarchive.org/
     • 9/11 Tributes, September 11 Tributes and Memorials to the Victims … Tributes of 9/11 - September 11th 9/11 memorials. For the Victims their Families and the many Heroes of September 11th. 2001. 9/11 World Trade Center, ... www.jontzen.com/tribute.htm - 132k
     • National Commission on Terrorist Attacks Upon the United States. Commission chartered to prepare a full and complete account of the circumstances surrounding the September 11, 2001 terrorist attacks, … www.9-11commission.gov/ - 8k
     • … and 4 million other pages pertaining to the September 11th Attack …
     Search all, and then filter: very inefficient!
     600+ pages:
     • Ethiopian calendar - Wikipedia, the free encyclopedia. Thus the first day of the Ethiopian year, 1 Mäskäräm, for years between 1901 and 2099 (inclusive), is usually September 11 (Gregorian), ... en.wikipedia.org/wiki/Ethiopian_calendar - 43k
     • APOD: September 11, 1997 - Mars Global Surveyor: Aerobraking. September 11, 1997 See Explanation. Clicking on the picture will download the highest resolution version available. Mars Global Surveyor: Aerobraking … apod.nasa.gov/apod/ap970911.html - 5k
     • … and only 630 other pages that are irrelevant to the September 11th Attack
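The cost of the "search all, and then filter" approach in slides 6 and 7 can be illustrated with a minimal sketch. The posting list, the capture dates, and the function name `search_all_then_filter` are hypothetical and chosen only for illustration; they do not describe how NutchWax/WERA is implemented.

```python
from datetime import date

# Hypothetical posting list for one search term: (url, capture_date) pairs.
# The capture dates are invented for illustration only.
postings = [
    ("apod.nasa.gov/apod/ap970911.html", date(1997, 9, 11)),
    ("en.wikipedia.org/wiki/September_11,_2001_attacks", date(2004, 5, 1)),
    ("911digitalarchive.org/", date(2003, 3, 1)),
    ("www.9-11commission.gov/", date(2004, 7, 22)),
]

def search_all_then_filter(postings, end):
    """Scan every posting for the term, then keep those captured before `end`."""
    examined = 0
    hits = []
    for url, captured in postings:
        examined += 1              # every posting is touched, whatever the time span
        if captured < end:
            hits.append(url)
    return hits, examined

hits, examined = search_all_then_filter(postings, date(2001, 1, 1))
```

In this sketch the number of postings examined equals the length of the whole posting list even though only one entry falls before 2001, which is the inefficiency the slide points out.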

  8. Issue #2: Time-independent Ranking
     • Regardless of the search time span, the current ranking schemes always consider the ENTIRE history.
     • The meaning and popularity of a term change over time, so a ranking scheme should depend not only on the search terms but also on the search time span.
     [Figure: timeline with the search time span marked]

  9. Issue #3: Ineffective Search Result Delivery
     • Search results are usually delivered as a list of URLs, sorted by relevance rank.
     • No other grouping / sorting options are available.

  10. Core Technologies Developed
     • Ranking that depends on the time span specified by the user.
     • Flexible and intuitive presentations of the returned results, ordered according to the user's specification.
     • A first step toward scalable and efficient 'full text + temporal' search.

  11. Scalable & Efficient Temporal Searches
     [Figure: timeline divided into time windows at t1, t2, t3, t4, with the search time span marked]
     For a given search time span, only these two indices are involved.

  12. Index Distribution and Parallel Search
     [Diagram components: Web Interface (ADAPT Web Archive Search); Web Server; Request Broker; Result Aggregator; Search Cluster with four Search Servers]
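The request broker and result aggregator on slide 12 can be sketched as a fan-out to several search servers followed by a merge of their ranked partial results. This is an illustrative sketch only: each "search server" is modeled as an in-memory dictionary, and the names `search_shard` and `broker_search` are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def search_shard(shard, term):
    """One search server: return (score, url) pairs for `term`, best first."""
    matches = shard.get(term, {})
    return sorted(((score, url) for url, score in matches.items()), reverse=True)

def broker_search(shards, term, k):
    """Request broker: query every shard in parallel, then aggregate the top k."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term), shards))
    # Result aggregator: merge the already-sorted partial lists, best score first.
    merged = heapq.merge(*partials, reverse=True)
    return [(url, score) for score, url in islice(merged, k)]

# Hypothetical shards mapping term -> {url: relevance score}.
shards = [
    {"budget": {"a.gov": 0.9, "b.gov": 0.2}},
    {"budget": {"c.gov": 0.5}},
    {"tax": {"d.gov": 0.7}},
]
top = broker_search(shards, "budget", 2)
```

Because each shard returns its results already sorted, the aggregator can merge them lazily and stop after the first `k` entries instead of sorting the full combined list.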

  13. Time-dependent Ranking
     [Figure: timeline divided into time windows at t1, t2, t3, t4, with the search time span marked]
     For a given search time span and terms, rankings depend on term popularity during this time span only (rather than the entire time span).
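One way to make a ranking depend on the search time span, as slide 13 describes, is to compute term statistics only over documents captured within that span. The sketch below uses a smoothed inverse document frequency as a stand-in; the presentation does not state which ranking formula was used, so the formula, the function name `span_idf`, and the sample documents are all assumptions.

```python
import math

def span_idf(docs, term, start, end):
    """Smoothed IDF of `term`, counting only documents captured in [start, end)."""
    in_span = [d for d in docs if start <= d["t"] < end]
    df = sum(1 for d in in_span if term in d["terms"])
    return math.log((1 + len(in_span)) / (1 + df))

# Hypothetical documents: capture time `t` and the set of terms each contains.
docs = [
    {"t": 1, "terms": {"budget"}},
    {"t": 2, "terms": {"budget", "energy"}},
    {"t": 3, "terms": {"energy"}},
    {"t": 4, "terms": {"energy"}},
]

early = span_idf(docs, "energy", 1, 3)   # term is rare in the early span
late = span_idf(docs, "energy", 3, 5)    # term appears in every later document
whole = span_idf(docs, "energy", 1, 5)   # statistic over the entire history
```

In this toy data the term carries more weight in the early span than in the later one, while a statistic over the entire history falls in between and reflects neither period.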

  14. Search Result Delivery
     • Sorted by Relevance
     • Grouped by Time
     • Grouped by URL
     • Sorted by Time
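The four delivery options on slide 14 amount to sorting or grouping the same result set by different keys. A minimal sketch, assuming each result is a record with hypothetical `url`, `captured`, and `score` fields:

```python
from collections import defaultdict

# Hypothetical search results; field names and values are illustrative only.
results = [
    {"url": "a.gov/", "captured": "2004-01", "score": 0.4},
    {"url": "b.gov/", "captured": "2003-12", "score": 0.9},
    {"url": "a.gov/", "captured": "2003-12", "score": 0.7},
]

def sort_by(results, key, descending=False):
    """Return the results ordered by one field (stable for equal values)."""
    return sorted(results, key=lambda r: r[key], reverse=descending)

def group_by(results, key):
    """Return a dict mapping each value of `key` to the results that share it."""
    groups = defaultdict(list)
    for r in results:
        groups[r[key]].append(r)
    return dict(groups)

by_relevance = sort_by(results, "score", descending=True)
by_time = sort_by(results, "captured")
by_url = group_by(results, "url")
by_month = group_by(results, "captured")
```

Grouping by URL collects all captures of the same page, while grouping by time collects everything captured in the same crawl period; both can be combined with either sort order inside each group.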

  15. Collection Used
     • Collaboration with the Library of Congress and the Internet Archive.
     • US 108th Congress Web Archive:
       • 16 monthly crawls between December 2003 and March 2005.
       • Web sites of Representatives, Senators, Delegates, and Committees of the 108th US Congress (2003-2004).
       • Number of sites: 582
       • Number of records: 27 million
       • Total size: around 2 TB
       • Archived in the Library of Congress

  16. [Architecture diagram labels: Internet; Library of Congress; Internet Archive; ADAPT Web Archive Server (UMIACS); Search / Return Ranked URLs; Retrieve Web Documents; WARCs; Storage Containers; Inverted Indices; Processing/Indexing Cluster (Hadoop); Storage Cluster; Search Cluster]

  17. Demo

  18. Screen Shots
     [Screenshot callouts: Ungroup; Group by Time; Sort by Time; Sort by Relevance; Follow Link; Collapse Results; Retrieve Page; Search Keywords; Time Span; Options]
     May 21, 2009
