
INF 141: Information Retrieval

Discussion Session

Week 3 – Winter 2010

TA: Sara Javanmardi

Heritrix
  • Extensible, Web-Scale, Distributed
  • Internet Archive’s Crawler

Internet Archive
  • Dedicated to building and maintaining a free and openly accessible online digital library, including an archive of the Web.

Nutch
  • Apache’s Open Source Search Engine
  • Distributed
  • Tested with 100M Pages

WebSPHINX
  • 1998-2002
  • Single Machine
  • Lots of Problems (memory leaks, …)
  • Reported to be very slow

crawler4j
  • Single Machine
  • Should Easily Scale to 20M Pages
  • Very Fast
    • Crawled and processed the whole English Wikipedia in 10 hours.

Should be extremely fast; otherwise it would be a bottleneck.

Docid Server
  • Key-value pairs are stored in a B+-tree data structure.
  • Berkeley DB as the storage engine
Berkeley DB
  • Unlike traditional database systems such as MySQL, Berkeley DB comes as a jar file that is linked into the Java program and runs in the process space of the crawlers.
  • No need for inter-process communication or waiting for context switches between processes.
  • You can think of it as a large HashMap.
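The HashMap analogy can be made concrete with a small sketch. Berkeley DB itself persists a B+-tree on disk and works on byte-array entries; here an in-memory TreeMap stands in for it, and the class name KeyValueStore is invented for illustration:

```java
import java.util.TreeMap;

// Stand-in for Berkeley DB: an in-process, sorted key-value store.
// Berkeley DB persists a B+-tree on disk; a TreeMap keeps a sorted
// tree in memory, but the put/get contract is the same
// "large HashMap" the slide describes.
public class KeyValueStore {
    private final TreeMap<String, Integer> tree = new TreeMap<>();

    public void put(String url, int docID) {
        tree.put(url, docID);           // insert or overwrite
    }

    public Integer get(String url) {
        return tree.get(url);           // null if the URL was never stored
    }

    public static void main(String[] args) {
        KeyValueStore store = new KeyValueStore();
        store.put("http://www.ics.uci.edu/", 1);
        System.out.println(store.get("http://www.ics.uci.edu/")); // 1
        System.out.println(store.get("http://unknown.example/")); // null
    }
}
```

Because the store runs inside the crawler's own process, every lookup is a plain method call, which is what makes the no-IPC point above matter.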



Adding a URL to Frontier

public static synchronized int getDocID(String URL) {
    if there is any key-value pair for key = URL
        return value
    docID = lastDocID + 1
    put (URL, docID) in storage
    return -docID
}

If the returned docID is negative, the URL has not been seen before, so we add it to the frontier:

put (docID, URL) in URL-Queue
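The pseudocode above can be turned into a runnable sketch, with a HashMap standing in for the Berkeley DB store (class and field names are invented here, not crawler4j's actual internals):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of the docid server and the frontier hand-off.
public class DocidServer {
    private static final Map<String, Integer> storage = new HashMap<>();
    private static final Queue<String> urlQueue = new ArrayDeque<>(); // the frontier
    private static int lastDocID = 0;

    // Existing URL: return its (positive) docID.
    // New URL: assign the next docID, store it, and return it negated.
    public static synchronized int getDocID(String url) {
        Integer existing = storage.get(url);
        if (existing != null) {
            return existing;
        }
        int docID = ++lastDocID;
        storage.put(url, docID);
        return -docID;
    }

    // A negative docID means the URL is new, so it enters the frontier.
    public static synchronized void schedule(String url) {
        if (getDocID(url) < 0) {
            urlQueue.add(url);
        }
    }

    public static void main(String[] args) {
        schedule("http://www.uci.edu/");
        schedule("http://www.uci.edu/");      // duplicate: not enqueued again
        System.out.println(urlQueue.size());  // 1
    }
}
```

The sign trick lets one synchronized lookup both deduplicate the URL and report whether it still needs to be scheduled.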

Things to Know:
  • Crawler4j only handles duplicate detection at the level of URLs, not at the level of content.
  • The frontier can be implemented as a priority queue.
Why Priority Queue?
  • Politeness: do not hit a web server too frequently
  • Freshness: crawl some pages more often than others
    • E.g., pages (such as News sites) whose content changes often
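The politeness requirement can be sketched as a per-host timer: refuse a fetch when the same host was contacted too recently. The class name and the 200 ms delay below are invented for illustration (crawler4j exposes a configurable politeness delay in its crawl configuration):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal politeness check: allow a fetch only if at least
// POLITENESS_DELAY_MS have passed since the last hit on that host.
public class PolitenessGuard {
    private static final long POLITENESS_DELAY_MS = 200;
    private final Map<String, Long> lastHit = new HashMap<>();

    public synchronized boolean mayFetch(String host, long nowMs) {
        Long last = lastHit.get(host);
        if (last != null && nowMs - last < POLITENESS_DELAY_MS) {
            return false; // too soon: be polite, try another host first
        }
        lastHit.put(host, nowMs);
        return true;
    }

    public static void main(String[] args) {
        PolitenessGuard guard = new PolitenessGuard();
        System.out.println(guard.mayFetch("digg.com", 0));   // true
        System.out.println(guard.mayFetch("digg.com", 100)); // false: only 100 ms elapsed
        System.out.println(guard.mayFetch("digg.com", 300)); // true: delay satisfied
    }
}
```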
Assigning Priority
  • The prioritizer assigns each URL an integer priority between 1 and K
  • Heuristics for assigning priority
    • Refresh rate sampled from previous crawls
    • Application-specific (e.g., “crawl news sites more often”)
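Such a frontier can be sketched with Java's PriorityQueue, where a smaller integer in 1..K means "crawl sooner". The class name and the news-site heuristic below are invented for illustration:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class PrioritizedFrontier {
    // A URL together with its integer priority (1 = most urgent).
    static final class Entry {
        final String url;
        final int priority;
        Entry(String url, int priority) { this.url = url; this.priority = priority; }
    }

    private final PriorityQueue<Entry> queue =
            new PriorityQueue<>(Comparator.comparingInt((Entry e) -> e.priority));

    // Illustrative application-specific heuristic only:
    // pretend news hosts change often and deserve priority 1.
    static int assignPriority(String url) {
        return url.contains("news") ? 1 : 5;
    }

    public void add(String url) {
        queue.add(new Entry(url, assignPriority(url)));
    }

    // Returns the most urgent URL, or null when the frontier is empty.
    public String next() {
        Entry e = queue.poll();
        return e == null ? null : e.url;
    }

    public static void main(String[] args) {
        PrioritizedFrontier frontier = new PrioritizedFrontier();
        frontier.add("http://example.com/page");
        frontier.add("http://news.example.com/story");
        System.out.println(frontier.next()); // the news URL comes out first
    }
}
```

In a real crawler the priority would mix the sampled refresh rate with such application rules rather than a single substring test.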
Assignment 3 Programming Part
  • You can do Assignment 3 in groups of 1, 2, or 3.

Tiffany Siu

James Milewski, Matt Fritz

Kevin Boomhouwer

Azia Foster

James Rose, Sean Tsusaki, Jeff Gaskill

Tzu Yang Huang

Fiel Guhit, Sarah Lee

Rob Duncan, Ben Kahn, Dan Morgan

Qi Zhu (Chess), Zhuomin Wu

Lucy Luxiao, Melanie Sun, Norik Davtian

Alex Kaiser, Sam Kaufman

Nery Chapeton

Jason Gahagan

Melanie Cheung, Anthony Liu

Chad Curtis, Derek Lee, Rakesh Rajput

Andrew J. Santa Maria

Zack Pelz

Crawling One Digg Category

Initial Seeds

digg.com/robots.txt

User-agent: *

Disallow: /aboutpost

Disallow: /addfriends

Disallow: /addim

Disallow: /addlink

Disallow: /ajax

Disallow: /api

Disallow: /captcha

Disallow: /css/remote-skins/

Disallow: /deleteuserim

Disallow: /deleteuserlink

Disallow: /diginfull


User-agent: Referrer Karma/2.0

Disallow: /
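Each Disallow line above is a path prefix for the matching user-agent group. A minimal check (ignoring Allow lines, wildcards, and user-agent group selection, which real robots.txt parsers must handle) could look like:

```java
import java.util.List;

// Minimal robots.txt check: a URL path is blocked if it starts with
// any disallowed prefix from our user-agent's group.
public class RobotsCheck {
    private final List<String> disallowed;

    public RobotsCheck(List<String> disallowedPrefixes) {
        this.disallowed = disallowedPrefixes;
    }

    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A few prefixes taken from the digg.com listing above.
        RobotsCheck robots = new RobotsCheck(List.of("/api", "/ajax", "/captcha"));
        System.out.println(robots.isAllowed("/api/stories"));     // false
        System.out.println(robots.isAllowed("/news/technology")); // true
    }
}
```

A polite crawler runs every candidate URL through such a check before it ever enters the frontier.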

Things To Do
  • Download the jar file and import it.
  • Download the dependency libraries and import them.
  • Download and complete the source code to crawl
Extra Credit Question

1) Extract story id from digg page:

<div class="news-body" id="18765384">

2) Send to API:
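Step 1 can be sketched with a regular expression over the fetched page source; the pattern below assumes the id attribute follows the news-body class exactly as shown above, and the class name is invented for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StoryIdExtractor {
    // Matches the numeric id attribute of the news-body div,
    // tolerating missing whitespace between the two attributes.
    private static final Pattern STORY_ID =
            Pattern.compile("class=\"news-body\"\\s*id=\"(\\d+)\"");

    // Returns the story id, or null if the page contains no match.
    public static String extract(String html) {
        Matcher m = STORY_ID.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div class=\"news-body\" id=\"18765384\">";
        System.out.println(extract(html)); // 18765384
    }
}
```

The extracted id is what step 2 would pass to the Digg API call.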