
Crawling


Presentation Transcript


  1. Crawling
  Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
  Reading: 20.1, 20.2 and 20.3

  2. Spidering
  • Crawling is a 24-hours-a-day, 7-days-a-week "walking" over a graph
  • What about the graph? It has a BowTie structure
  • Directed graph G = (N, E)
    • N changes (inserts, deletes): more than 50 × 10⁹ nodes
    • E changes (inserts, deletes): more than 10 links per node
    • 10 × 50 × 10⁹ = 500 × 10⁹ 1-entries in the adjacency matrix
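As a back-of-the-envelope check of these figures, a small Python computation; the node count and average out-degree come from the slide, while the 8 bytes per stored edge is an illustrative assumption:

# Back-of-the-envelope sizes for the Web graph, using the slide's figures.
NODES = 50 * 10**9            # > 50 * 10^9 nodes
AVG_OUT_DEGREE = 10           # > 10 links per node

edges = NODES * AVG_OUT_DEGREE        # 1-entries in the adjacency matrix
matrix_cells = NODES * NODES          # cells of the full adjacency matrix
adj_list_bytes = edges * 8            # assumption: 8 bytes per stored edge

print(f"1-entries (edges): {edges:.2e}")            # 5.00e+11 = 500 * 10^9
print(f"full matrix cells: {matrix_cells:.2e}")     # 2.50e+21, i.e. almost all zeros
print(f"edges as adjacency lists: {adj_list_bytes / 10**12:.0f} TB at 8 bytes/edge")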

  3. Crawling Issues
  • How to crawl?
    • Quality: "best" pages first
    • Efficiency: avoid duplication (or near duplication)
    • Etiquette: robots.txt, server load concerns (minimize load); see the robots.txt sketch after this list
  • How much to crawl? How much to index?
    • Coverage: how big is the Web? How much do we cover?
    • Relative coverage: how much do competitors have?
  • How often to crawl?
    • Freshness: how much has changed?
  • How to parallelize the process?
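For the etiquette bullet above, a minimal Python sketch of consulting robots.txt before fetching, using the standard library's urllib.robotparser; the crawler name and the URLs are made up for illustration, and a polite crawler would also rate-limit its requests to each server:

from urllib import robotparser

# Hypothetical crawler name and target site; only the robotparser API is standard.
USER_AGENT = "MyCrawler"

rp = robotparser.RobotFileParser()
rp.set_url("https://www.di.unipi.it/robots.txt")
rp.read()                                  # fetch and parse the site's robots.txt

url = "https://www.di.unipi.it/some/page.html"
if rp.can_fetch(USER_AGENT, url):
    print("allowed to fetch:", url)        # a polite crawler also spaces out requests
else:
    print("robots.txt forbids fetching:", url)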

  4. Page selection
  • Given a page P, define how "good" P is (see the frontier sketch below)
  • Several metrics:
    • BFS, DFS, Random
    • Popularity driven (PageRank, full vs. partial)
    • Topic driven or focused crawling
    • Combined
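One way to read this slide: the crawl frontier is a priority queue, and the chosen metric only decides each URL's priority. Below is a minimal Python sketch with a pluggable, made-up scoring function standing in for BFS depth, (partial) PageRank, or a topic score; it is an illustration, not the crawler's actual data structure:

import heapq
import itertools

class Frontier:
    """Crawl frontier as a priority queue: lower score = fetched earlier."""
    def __init__(self, score):
        self._score = score                # pluggable metric (BFS depth, -PageRank, topic score, ...)
        self._heap = []
        self._tie = itertools.count()      # tie-breaker so the heap never compares URLs

    def push(self, url, **info):
        heapq.heappush(self._heap, (self._score(url, info), next(self._tie), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

# BFS-like ordering: priority = depth at which the URL was discovered.
frontier = Frontier(score=lambda url, info: info.get("depth", 0))
frontier.push("http://www.di.unipi.it/", depth=0)
frontier.push("http://www.di.unipi.it/deep/page.html", depth=3)
print(frontier.pop())                      # the depth-0 seed comes out first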

  5. Is this page a new one?
  • Check whether the page has been parsed or downloaded before
    • After 20 million pages, we have "seen" over 200 million URLs
    • Each URL is at least 100 bytes on average
    • Overall we have about 20 GB of URLs
  • Options: compress URLs in main memory, or use disk
    • Bloom filter (Archive); see the sketch after this list
    • Disk access with caching (Mercator, AltaVista)
    • Also, two-level indexing with front-coding compression
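A minimal Python sketch of the Bloom-filter option for the "already seen URL" test; the bit-array size, the number of hash functions, and the double-hashing construction are illustrative choices, not the Archive's actual implementation. The filter never reports a seen URL as new, and reports an unseen URL as seen only with a small false-positive probability:

import hashlib

class BloomFilter:
    """Approximate set membership: no false negatives, small false-positive rate."""
    def __init__(self, num_bits=8 * 2**20, num_hashes=7):   # illustrative sizes
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray(self.m // 8)

    def _positions(self, url):
        # Derive k bit positions from two halves of an MD5 digest (double hashing).
        digest = hashlib.md5(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
seen.add("http://www.di.unipi.it/")
print("http://www.di.unipi.it/" in seen)   # True
print("http://example.org/new" in seen)    # False (almost surely)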

  6. Crawler "cycle of life"
  Components: Link Extractor, Crawler Manager, Downloaders. Repositories: PQ (Priority Queue of URLs), AR (Assigned Repository), PR (Page Repository). The cycle: the Link Extractor reads PR and feeds PQ, the Crawler Manager reads PQ and feeds AR, the Downloaders read AR and feed PR.

  Link Extractor:
  while (<Page Repository PR is not empty>) {
    <take a page p (check if it is new)>
    <extract the links contained in p within href attributes>
    <extract the links contained in JavaScript>
    <extract .....>
    <insert these links into the Priority Queue PQ>
  }

  Downloaders:
  while (<Assigned Repository AR is not empty>) {
    <extract a URL u>
    <download page(u)>
    <send page(u) to the Page Repository PR>
    <store page(u) in a proper archive, possibly compressed>
  }

  Crawler Manager:
  while (<Priority Queue PQ is not empty>) {
    <extract some URLs u having the highest priority>
    foreach u extracted {
      if ((u ∉ "Already Seen Pages") ||
          (u ∈ "Already Seen Pages" && <u's version on the Web is more recent>)) {
        <resolve u with respect to DNS>
        <send u to the Assigned Repository AR>
      }
    }
  }
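Below is a compact single-process Python sketch that mirrors the control flow of the pseudocode above; the in-memory deques stand in for PQ, AR and PR, the download and link-extraction functions are stubs (no real HTTP, DNS resolution, JavaScript parsing, or recency check), and the endless while-loops are replaced by a few turns of the cycle:

from collections import deque

priority_queue = deque(["http://seed.example/"])   # PQ: URLs waiting to be scheduled
assigned_repo = deque()                            # AR: URLs handed to the downloaders
page_repo = deque()                                # PR: fetched pages awaiting link extraction
already_seen = set()                               # "Already Seen Pages" structure

def download(url):                      # stub: a real crawler would do DNS + HTTP here
    return f"<html><a href='{url}child'>x</a></html>"

def extract_links(page):                # stub: a real crawler would parse href and JavaScript
    return [page.split("'")[1]] if "'" in page else []

for _ in range(3):                      # a few turns of the cycle instead of while-forever
    # Crawler Manager: schedule new (or refreshed) URLs for download.
    while priority_queue:
        u = priority_queue.popleft()
        if u not in already_seen:       # recency check for re-crawls omitted
            already_seen.add(u)
            assigned_repo.append(u)
    # Downloaders: fetch assigned URLs and store the pages.
    while assigned_repo:
        u = assigned_repo.popleft()
        page_repo.append(download(u))
    # Link Extractor: parse stored pages and feed the links back into PQ.
    while page_repo:
        for link in extract_links(page_repo.popleft()):
            priority_queue.append(link)

print(sorted(already_seen))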

  7. Parallel Crawlers
  The Web is too big to be crawled by a single crawler, so the work should be divided while avoiding duplication.
  • Dynamic assignment
    • A central coordinator dynamically assigns URLs to crawlers
    • Links are handed to the central coordinator (a possible bottleneck?)
  • Static assignment
    • The Web is statically partitioned and assigned to crawlers
    • Each crawler only crawls its part of the Web

  8. Two problems with static assignment
  Let D be the number of downloaders. hash(URL) maps a URL to {0, ..., D-1}; downloader x fetches the URLs U such that hash(U) = x. Which hash would you use? (See the sketch after this list.)
  • Load balancing the number of URLs assigned to the downloaders:
    • Static schemes based on hosts may fail
      • www.geocities.com/....
      • www.di.unipi.it/
    • Dynamic "relocation" schemes may be complicated
  • Managing fault tolerance:
    • What about the death of a downloader? D → D-1 means a new hash function!
    • What about a new downloader? D → D+1 means a new hash function!
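A minimal Python sketch of the static scheme and of why a change in D hurts; the hash choice and the URL sample are illustrative. With assignment by hash(URL) mod D, losing one downloader reshuffles most of the assignment:

import hashlib

def downloader_of(url, d):
    """Static assignment: hash the full URL into {0, ..., d-1}."""
    h = int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big")
    return h % d

urls = [f"http://www.geocities.com/page{i}" for i in range(10_000)] \
     + [f"http://www.di.unipi.it/page{i}" for i in range(10_000)]

D = 4
before = {u: downloader_of(u, D) for u in urls}
after = {u: downloader_of(u, D - 1) for u in urls}     # one downloader dies: D -> D-1

moved = sum(before[u] != after[u] for u in urls)
print(f"URLs reassigned after losing one downloader: {moved}/{len(urls)}")
# Expect roughly 3/4 of the URLs to move: the whole assignment is reshuffled.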

  9. A nice technique: Consistent Hashing
  • Items and servers are mapped to the unit circle
  • Item K is assigned to the first server N such that ID(N) ≥ ID(K)
  • Each server gets replicated log S times
    • [monotone] adding a new server moves items only from some old server to the new one
    • [balance] the probability that an item goes to a given server is ≤ O(1)/S
    • [load] any server gets ≤ (I/S) log S items w.h.p.
    • [scale] you can replicate each server more times...
  • What if a downloader goes down?
  • What if a new downloader appears?
  • A tool for: spidering, Web caches, P2P, router load balancing, distributed file systems
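A minimal Python sketch of consistent hashing for the downloader-assignment problem; the ring construction, the hash function, and the fixed number of virtual copies (standing in for the log S replicas) are simplified for illustration. URLs and downloaders are mapped to points on the circle, each URL goes to the first downloader at or after its point, and removing a downloader only moves the URLs it owned:

import bisect
import hashlib

class ConsistentHash:
    """Map URLs to downloaders via points on a circle of hash values."""
    def __init__(self, servers, replicas=2):        # replicas ≈ log S virtual copies per server
        self.ring = sorted((self._point(f"{s}#{r}"), s)
                           for s in servers for r in range(replicas))

    @staticmethod
    def _point(key):
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def server_for(self, url):
        # First server whose point is >= the URL's point, wrapping around the circle.
        i = bisect.bisect_left(self.ring, (self._point(url), ""))
        return self.ring[i % len(self.ring)][1]

downloaders = ["d0", "d1", "d2", "d3"]
before = ConsistentHash(downloaders)
after = ConsistentHash(downloaders[:-1])            # downloader d3 dies

urls = [f"http://www.di.unipi.it/page{i}" for i in range(10_000)]
moved = sum(before.server_for(u) != after.server_for(u) for u in urls)
print(f"URLs reassigned: {moved}/{len(urls)}")      # only d3's share moves (about 1/4 on average)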

  10. Examples: Open Source
  • Nutch, also used by WikiSearch
    • http://nutch.apache.org/
