INF 141: Information Retrieval

Discussion Session

Week 3 – Winter 2010

TA: Sara Javanmardi

Heritrix
  • Extensible, Web-Scale, Distributed
  • Internet Archive’s Crawler
Internet Archive
  • dedicated to building and maintaining a free and openly accessible online digital library, including an archive of the Web.

http://www.archive.org/

Nutch
  • Apache’s Open Source Search Engine
  • Distributed
  • Tested with 100M Pages
WebSphinx
  • 1998-2002
  • Single Machine
  • Lots of Problems (Memory leaks, …)
  • Reported to be very slow
Crawler4j
  • Single Machine
  • Should Easily Scale to 20M Pages
  • Very Fast
    • Crawled and Processed the whole English Wikipedia in 10 hours.
Architecture

(Architecture diagram; annotation on one component: "Should be extremely fast, otherwise it would be a bottleneck.")

Docid Server
  • Key-value pairs are stored in a B+-tree data structure.
  • Berkeley DB as the storage engine
Berkeley DB
  • Unlike traditional database systems such as MySQL, Berkeley DB comes in the form of a JAR file that is linked into the Java program and runs in the process space of the crawlers.
  • No need for inter-process communication or for waiting on context switches between processes.
  • You can think of it as a large HashMap of key-value pairs.

Adding a URL to Frontier

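public static synchronized int getDocID(String URL) {
    if there is a key-value pair for key = URL
        return value                  // URL already seen: return its existing docID
    else
        docID = lastDocID + 1
        lastDocID = docID
        put (URL, docID) in storage
        return -docID                 // negated, apparently to flag the URL as new
}

If the URL is new, we add it to the frontier:

put (docID, URL) in URL-Queue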
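
The pseudocode above can be sketched in plain Java as follows. This is a minimal illustration, not crawler4j's actual code: a HashMap stands in for the Berkeley DB store, and the class name DocIdServerSketch, the frontier queue, and the schedule method are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch only: an in-memory stand-in for the docID server described above.
class DocIdServerSketch {
    // Assumption: a HashMap replaces the Berkeley DB key-value store.
    private static final Map<String, Integer> storage = new HashMap<>();
    // Assumption: a simple FIFO queue plays the role of the URL queue.
    private static final Queue<String> frontier = new ArrayDeque<>();
    private static int lastDocId = 0;

    // Returns the stored docID for a known URL, or the negated new docID
    // for a URL seen for the first time (mirrors "return -docID").
    static synchronized int getDocID(String url) {
        Integer existing = storage.get(url);
        if (existing != null) {
            return existing;
        }
        lastDocId++;
        storage.put(url, lastDocId);
        return -lastDocId;
    }

    // Adds the URL to the frontier only if getDocID reports it as new.
    static synchronized boolean schedule(String url) {
        if (getDocID(url) < 0) {
            frontier.add(url);
            return true;
        }
        return false;
    }

    static synchronized int frontierSize() {
        return frontier.size();
    }
}
```

Calling schedule twice with the same URL enqueues it only once, which is the URL-level duplicate detection the slide describes.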

Things to Know:
  • Crawler4j only handles duplicate detection at the level of URLs, not at the level of content.
  • The frontier can be implemented as a priority queue.
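
To make the first point concrete, the sketch below contrasts URL-level and content-level duplicate detection. It is an illustration only and not part of crawler4j: the class ContentDedupSketch and its methods are hypothetical, and hashing the page text with SHA-256 is just one simple way to detect exact content duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Sketch only: URL-level versus content-level duplicate detection.
class ContentDedupSketch {
    private final Set<String> seenUrls = new HashSet<>();
    private final Set<String> seenContentHashes = new HashSet<>();

    // URL-level check: true the first time this exact URL string is seen.
    boolean isNewUrl(String url) {
        return seenUrls.add(url);
    }

    // Content-level check: true the first time this exact page text is seen,
    // regardless of which URL it was fetched from.
    boolean isNewContent(String pageText) {
        return seenContentHashes.add(sha256Hex(pageText));
    }

    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Two different URLs that serve the same page pass the URL check but fail the content check. Near-duplicates (pages that differ only slightly) would need a different technique and are not covered by this sketch.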
Why Priority Queue?
  • Politeness: do not hit a web server too frequently
  • Freshness: crawl some pages more often than others
    • E.g., pages (such as news sites) whose content changes often
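
The politeness requirement can be illustrated with a small per-host delay check. This is a sketch under assumptions, not how any particular crawler implements it: the class PolitenessSketch, the tryAcquire method, and the fixed delay are all hypothetical, and the clock is passed in explicitly so the behavior is easy to follow.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Sketch only: allow at most one request per host every delayMillis.
class PolitenessSketch {
    private final long delayMillis;
    // Earliest time (in milliseconds) at which each host may be hit again.
    private final Map<String, Long> nextAllowedTime = new HashMap<>();

    PolitenessSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Returns true and reserves the host if the URL may be fetched at
    // nowMillis; returns false if the host was contacted too recently.
    synchronized boolean tryAcquire(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        long allowedAt = nextAllowedTime.getOrDefault(host, Long.MIN_VALUE);
        if (nowMillis < allowedAt) {
            return false;
        }
        nextAllowedTime.put(host, nowMillis + delayMillis);
        return true;
    }
}
```

In a real crawler the caller would pass System.currentTimeMillis() and put a rejected URL back into the frontier for a later attempt.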
Assigning Priority
  • The prioritizer assigns each URL an integer priority between 1 and K
  • Heuristics for assigning priority
    • Refresh rate sampled from previous crawls
    • Application-specific (e.g., “crawl news sites more often”)
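
A frontier with integer priorities between 1 and K might be sketched as below. Several things here are assumptions rather than facts from the slides: priority 1 is treated as the most urgent, ties are served in insertion order, and the prioritize heuristic (URLs containing "news" get priority 1) is a made-up example of an application-specific rule.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch only: a frontier ordered by integer priority in the range 1..K.
class PriorityFrontierSketch {
    private static final class Entry {
        final String url;
        final int priority;
        final long seq; // insertion counter, used to break ties

        Entry(String url, int priority, long seq) {
            this.url = url;
            this.priority = priority;
            this.seq = seq;
        }
    }

    private final int k;
    private long counter = 0;
    // Assumption: a lower priority number is served first.
    private final PriorityQueue<Entry> queue = new PriorityQueue<>(
            Comparator.comparingInt((Entry e) -> e.priority)
                      .thenComparingLong(e -> e.seq));

    PriorityFrontierSketch(int k) {
        this.k = k;
    }

    void add(String url, int priority) {
        if (priority < 1 || priority > k) {
            throw new IllegalArgumentException("priority must be in 1.." + k);
        }
        queue.add(new Entry(url, priority, counter++));
    }

    // Returns the next URL to crawl, or null if the frontier is empty.
    String next() {
        Entry e = queue.poll();
        return e == null ? null : e.url;
    }

    // Hypothetical heuristic: news-like URLs get the most urgent priority.
    static int prioritize(String url, int k) {
        return url.contains("news") ? 1 : k;
    }
}
```

A heuristic based on refresh rates from previous crawls would replace prioritize with a lookup into stored crawl history; that data is not modeled here.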
Assignment 3 Programming Part
  • You can do Assignment 3 in groups of 1, 2, or 3.

Tiffany Siu

James Milewski, Matt Fritz

Kevin Boomhouwer

Azia Foster

James Rose, Sean Tsusaki, Jeff Gaskill

Tzu Yang Huang

Fiel Guhit, Sarah Lee

Rob Duncan, Ben Kahn, Dan Morgan

Qi Zhu (Chess), Zhuomin Wu

Lucy Luxiao, Melanie Sun, Norik Davtian

Alex Kaiser, Sam Kaufman

Nery Chapeton

Jason Gahagan

Melanie Cheung, Anthony Liu

Chad Curtis, Derek Lee, Rakesh Rajput

Andrew J. Santa Maria

Zack Pelz

Crawling One Digg Category

http://digg.com/arts_culture

http://digg.com/autos

http://digg.com/educational

http://digg.com/food_drink

http://digg.com/health

http://digg.com/travel_places

http://digg.com/arts_culture/popular/365days

http://digg.com/arts_culture/popular/30days

Initial Seeds
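
One way to keep a crawl inside a single category is to accept only URLs under that category's prefix. The helper below is a hypothetical sketch: the class CategoryFilterSketch and the method inCategory are not part of crawler4j, and whether the assignment expects exactly this rule is an assumption.

```java
import java.util.Locale;

// Sketch only: decide whether a URL belongs to one Digg category.
class CategoryFilterSketch {
    // categoryPrefix is assumed to be lowercase with no trailing slash,
    // e.g. "http://digg.com/arts_culture".
    static boolean inCategory(String url, String categoryPrefix) {
        String u = url.toLowerCase(Locale.ROOT);
        // Require a path or query boundary after the prefix so that
        // ".../arts_culture_extra" is not mistaken for the category.
        return u.equals(categoryPrefix)
                || u.startsWith(categoryPrefix + "/")
                || u.startsWith(categoryPrefix + "?");
    }
}
```

With the prefix http://digg.com/arts_culture, the two "popular" URLs listed above are accepted, while http://digg.com/autos is rejected.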

digg.com/robots.txt

User-agent: *

Disallow: /aboutpost

Disallow: /addfriends

Disallow: /addim

Disallow: /addlink

Disallow: /ajax

Disallow: /api

Disallow: /captcha

Disallow: /css/remote-skins/

Disallow: /deleteuserim

Disallow: /deleteuserlink

Disallow: /diginfull

...

User-agent: Referrer Karma/2.0

Disallow: /
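
A crawler can honor rules like these by collecting the Disallow prefixes that apply to it and checking each path before fetching. The following is a deliberately simplified sketch, not a complete robots.txt parser: the class RobotsSketch and the agent name inf141-crawler are hypothetical, Allow lines and wildcards are ignored, each User-agent line is treated on its own, and rules from the "*" group are merged with rules for the named agent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: simplified handling of User-agent and Disallow lines.
class RobotsSketch {
    // Collects Disallow prefixes from groups whose User-agent is "*"
    // or equals the given agent name (case-insensitive).
    static List<String> disallowedFor(String robotsTxt, String agent) {
        List<String> rules = new ArrayList<>();
        boolean applies = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                applies = value.equals("*") || value.equalsIgnoreCase(agent);
            } else if (field.equals("disallow") && applies && !value.isEmpty()) {
                rules.add(value);
            }
        }
        return rules;
    }

    // A path is allowed if no collected Disallow prefix matches it.
    static boolean isAllowed(List<String> rules, String path) {
        for (String rule : rules) {
            if (path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }
}
```

Under the rules shown above, a generic agent may fetch /arts_culture but not /api, while the agent Referrer Karma/2.0 is barred from every path.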

Things To Do
  • Download the jar file and import it.
  • Download the dependency libraries and import them.
  • Download Crawler4j-example-simple.zip and complete the source code to crawl digg.com
Extra Credit Question

1) Extract the story ID from a Digg page:

<div class="news-body" id="18765384">

2) Send it to the API: http://services.digg.com/1.0/endpoint?method=story.getDiggs&story_id=18765384&count=100&offset=0
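
The two steps can be sketched with a regular expression and simple string building. This is an illustration under assumptions: the class DiggStorySketch and its methods are hypothetical, the pattern only covers the attribute order shown in the example div, and the code builds the request URL without sending it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: pull the story ID out of the HTML and build the API URL.
class DiggStorySketch {
    // Matches <div class="news-body" id="..."> with or without whitespace
    // between the two attributes; assumes class comes before id.
    private static final Pattern STORY_ID = Pattern.compile(
            "<div\\s+class=\"news-body\"\\s*id=\"(\\d+)\"");

    // Returns the first story ID found in the HTML, or null if none.
    static String extractStoryId(String html) {
        Matcher m = STORY_ID.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Builds the story.getDiggs request URL in the form shown above.
    static String diggsUrl(String storyId, int count, int offset) {
        return "http://services.digg.com/1.0/endpoint?method=story.getDiggs"
                + "&story_id=" + storyId
                + "&count=" + count
                + "&offset=" + offset;
    }
}
```

A regular expression is enough for this one fixed pattern; pages with varying markup would be safer to handle with an HTML parser.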