
Distributed Web Crawling (a survey by Dustin Boswell)


Presentation Transcript


  1. Distributed Web Crawling (a survey by Dustin Boswell)

  2. Basic Crawling Algorithm

     UrlsTodo = { "yahoo.com/index.html" }
     Repeat:
         url = UrlsTodo.getNext()
         html = Download( url )
         UrlsDone.insert( url )
         newUrls = parseForLinks( html )
         For each newUrl not in UrlsDone:
             UrlsTodo.insert( newUrl )
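
     A minimal runnable version of this loop in Python. The helper names
     follow the slide's pseudocode; the HTMLParser-based link extraction,
     the timeout, and the page budget are assumptions added so the sketch
     is self-contained:

         import urllib.request
         from html.parser import HTMLParser
         from urllib.parse import urljoin

         class LinkParser(HTMLParser):
             """Collect href attributes from <a> tags."""
             def __init__(self):
                 super().__init__()
                 self.links = []

             def handle_starttag(self, tag, attrs):
                 if tag == "a":
                     for name, value in attrs:
                         if name == "href" and value:
                             self.links.append(value)

         def download(url):
             with urllib.request.urlopen(url, timeout=10) as resp:
                 return resp.read().decode("utf-8", errors="replace")

         def parse_for_links(base_url, html):
             parser = LinkParser()
             parser.feed(html)
             return [urljoin(base_url, link) for link in parser.links]

         def crawl(seed, max_pages=100):
             urls_todo = [seed]    # the frontier (FIFO queue)
             urls_done = set()     # URLs already fetched
             while urls_todo and len(urls_done) < max_pages:
                 url = urls_todo.pop(0)
                 if url in urls_done:
                     continue
                 try:
                     html = download(url)
                 except Exception:
                     continue      # skip unreachable pages
                 urls_done.add(url)
                 for new_url in parse_for_links(url, html):
                     if new_url not in urls_done:
                         urls_todo.append(new_url)
             return urls_done

         # crawl("http://yahoo.com/index.html")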

  3. Statistics to Keep in Mind
     Documents on the web:      3 billion+ (by Google's count)
     Avg. HTML size:            15 KB
     Avg. URL length:           50+ characters
     Links per page:            10
     External links per page:   2

     Download the entire web in a year: 95 URLs / second!

  4. Statistics to Keep in Mind
     Documents on the web:      3 billion+ (by Google's count)
     Avg. HTML size:            15 KB
     Avg. URL length:           50+ characters
     Links per page:            10
     External links per page:   2

     Download the entire web in a year: 95 URLs / second!
     3 billion * 15 KB    = 45 terabytes of HTML
     3 billion * 50 chars = 150 gigabytes of URLs!!
     ⇒ multiple machines required
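
     Checking the slide's arithmetic (decimal units assumed, i.e.
     1 TB = 10^9 KB and 1 GB = 10^9 bytes):

         DOCS = 3e9                   # documents on the web
         HTML_KB = 15                 # average HTML size, in KB
         URL_BYTES = 50               # average URL length, in bytes

         SECONDS_PER_YEAR = 365 * 24 * 3600
         print(DOCS / SECONDS_PER_YEAR)               # ~95 URLs/second
         print(DOCS * HTML_KB / 1e9, "TB of HTML")    # 45.0
         print(DOCS * URL_BYTES / 1e9, "GB of URLs")  # 150.0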

  5. Distributing the Workload
     [Diagram: Machine 0 … Machine N-1 on a LAN, each fetching from the Internet]
     • Each machine is assigned a fixed subset of the URL space

  6. Distributing the Workload
     [Diagram as above]
     • Each machine is assigned a fixed subset of the URL space
     • machine = hash( url's domain name ) % N

  7. Distributing the Workload
     [Diagram as above; e.g. one machine gets cnn.com/sports, cnn.com/weather,
      cbs.com/csi_miami, … while another gets bbc.com/us, bbc.com/uk,
      bravo.com/queer_eye, …]
     • Each machine is assigned a fixed subset of the URL space
     • machine = hash( url's domain name ) % N

  8. Distributing the Workload
     [Diagram as above]
     • Each machine is assigned a fixed subset of the URL space
     • machine = hash( url's domain name ) % N
     • Communication: a couple of URLs per page (very small)
     • DNS cache per machine
     • Maintain politeness: don't want to DOS-attack someone!
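
     A minimal sketch of this assignment rule. The choice of MD5 is an
     assumption; the slide only specifies hashing the domain name modulo N.
     A stable hash matters here: Python's built-in hash() is salted per
     process, so different machines would disagree on the assignment.

         import hashlib
         from urllib.parse import urlparse

         def assigned_machine(url, num_machines):
             """Map a URL to a crawler machine by hashing its domain.

             Hashing the domain (not the full URL) keeps every page of a
             site on one machine, so its DNS cache pays off and politeness
             delays can be enforced locally.
             """
             domain = urlparse(url).netloc.lower()
             digest = hashlib.md5(domain.encode("utf-8")).digest()
             return int.from_bytes(digest[:4], "big") % num_machines

         # All cnn.com URLs land on the same machine:
         print(assigned_machine("http://cnn.com/sports", 4))
         print(assigned_machine("http://cnn.com/weather", 4))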

  9. Software Hazards
     • Slow/unresponsive DNS servers
     • Slow/unresponsive HTTP servers
     ⇒ a parallel / asynchronous interface is desired

  10. Software Hazards
     • Slow/unresponsive DNS servers
     • Slow/unresponsive HTTP servers
     ⇒ a parallel / asynchronous interface is desired
     • Large or infinite-sized pages
     • Infinite links ("domain.com/time=100", "…101", "…102", …)
     • Broken HTML
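
     One stdlib-only way to blunt several of these hazards; the specific
     timeout, size cap, and thread pool are assumptions (the crawlers this
     survey covers mostly built custom asynchronous I/O instead). Note that
     urllib's timeout does not cover the blocking DNS lookup itself, which
     is part of why the per-machine DNS cache on slide 8 matters:

         import urllib.request
         from concurrent.futures import ThreadPoolExecutor

         TIMEOUT_SECONDS = 10       # bounds connect/read on slow HTTP servers
         MAX_PAGE_BYTES = 1 << 20   # 1 MB cap guards against huge/infinite pages

         def safe_download(url):
             """Fetch a page with bounded time and size; None on failure."""
             try:
                 with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
                     data = resp.read(MAX_PAGE_BYTES)   # stop reading at the cap
                 return data.decode("utf-8", errors="replace")  # tolerate bad bytes
             except Exception:
                 return None

         # A thread pool provides the parallel interface, so one stuck
         # server cannot stall the whole crawl:
         urls = ["http://example.com/", "http://example.org/"]
         with ThreadPoolExecutor(max_workers=20) as pool:
             pages = list(pool.map(safe_download, urls))

     Infinite link spaces ("…time=100", "…101", …) still need a per-site
     page budget or URL-depth limit on top of this, and broken HTML is a
     job for a tolerant parser rather than the fetcher.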

  11. Previous Web Crawlers

  12. Questions?
