
CSCI 572’s Class Project

Measuring the performance of parallel crawlers in different modes. CSCI 572’s Class Project. Huy Pham PhD – Computer Science Spring 2011. Project inspired by the research paper on parallel crawlers. Site S1 is crawled by crawler C1 and site S2 is crawled by C2


Presentation Transcript


  1. Measuring the performance of parallel crawlers in different modes. CSCI 572’s Class Project. Huy Pham, PhD – Computer Science, Spring 2011

  2. Project inspired by the research paper on parallel crawlers
     • Site S1 is crawled by crawler C1 and site S2 is crawled by C2.
     • In Firewall mode, crawlers ignore inter-partition links (C1 ignores g and C2 ignores d). Firewall mode produces no overlap and runs quickly because the crawlers do not communicate, but some pages can be missed because inter-partition links are discarded.
     • In Cross-over mode, crawlers also follow inter-partition links, so they download more pages than in Firewall mode, but overlap becomes an issue (g and d get downloaded twice).
     • In Exchange mode, crawlers periodically and incrementally exchange inter-partition links, which avoids overlap and increases coverage.
     [Figure: Two parallel crawlers]
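The difference between Firewall and Cross-over mode can be sketched on a toy link graph. This is a minimal illustration, not the project's code: the pages a to h, their links, and the `crawl` and `run` helpers are all hypothetical. Pages g and d are linked from inside their own site as well as across the partition, while h is reachable only through an inter-partition link.

```python
from collections import deque

# Hypothetical toy graph: pages a-d belong to site S1, pages e-h to site S2.
LINKS = {
    "a": ["b", "c"],
    "b": ["d", "g"],   # b -> g crosses from S1 to S2
    "c": ["h"],        # c -> h crosses from S1 to S2; nothing in S2 links to h
    "d": [],
    "e": ["f", "d"],   # e -> d crosses from S2 to S1
    "f": ["g"],
    "g": [],
    "h": [],
}
SITE = {p: "S1" for p in "abcd"} | {p: "S2" for p in "efgh"}


def crawl(seed, own_site, mode):
    """Breadth-first crawl; returns (pages downloaded, foreign links seen)."""
    seen, queue = {seed}, deque([seed])
    downloaded, foreign = [], set()
    while queue:
        page = queue.popleft()
        downloaded.append(page)
        for target in LINKS[page]:
            if target in seen:
                continue
            if SITE[target] != own_site:
                foreign.add(target)
                if mode == "firewall":
                    continue  # firewall mode drops inter-partition links
            seen.add(target)
            queue.append(target)
    return downloaded, foreign


def run(mode):
    """Run C1 on S1 and C2 on S2; returns (total downloads, unique pages)."""
    c1, _ = crawl("a", "S1", mode)
    c2, _ = crawl("e", "S2", mode)
    return len(c1) + len(c2), len(set(c1) | set(c2))


print(run("firewall"))    # (7, 7): no overlap, but page h is never downloaded
print(run("cross-over"))  # (10, 8): full coverage, g and d downloaded twice
```

In Exchange mode, the `foreign` set returned by each crawler would be handed to the other crawler instead of being dropped or followed locally.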

  3. Implementation
     • Two websites are crawled in parallel: the USC School of Letters, Arts and Sciences (LAS) and the USC main page, usc.edu. Each site has its own data, and the two share many links pointing to each other.
     • In Firewall mode, data from domains other than LAS and usc.edu is ignored; only the two domains are crawled, so there is no overlap. This data is used to test the data from the cross-over and exchange modes.
     • In cross-over mode, besides the data from the two domains (Viterbi and LAS), only data from usc.edu is crawled, to limit the amount of data retrieved by the crawling processes. The reason is that pages in the two domains link to other sites such as experiencela.com, thegrovela.com…, and those sites often contain too much data to handle. Overlap is expected in cross-over mode because usc.edu and LAS link to each other, so the same data gets crawled twice.
     Data(firewall mode) – Data(exchange mode) = overlapping

  4. In exchange mode, the two crawlers (LAS and usc.edu) exchange batches of information. When a crawler sees a page whose domain is not the one it is supposed to crawl, it stores the URL in a batch; when the batch is full, it sends the batch to the corresponding crawler.
     [Diagram labels: Viterbi, usc.edu, LAS, crawler, Nutch, Solr, DBMS, Indexing]
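The batching behaviour described on this slide could look roughly like the sketch below. It is an assumption-laden illustration rather than the project's Nutch setup: the `ExchangeCrawler` class, its method names, the batch size, and the hostnames `las.example.edu` and `www.example.edu` are all placeholders.

```python
from urllib.parse import urlparse


class ExchangeCrawler:
    """Crawler that buffers foreign URLs and sends them to a peer in batches."""

    def __init__(self, host, batch_size=3):
        self.host = host              # hostname this crawler is responsible for
        self.batch_size = batch_size  # hypothetical batch capacity
        self.peers = {}               # hostname -> peer ExchangeCrawler
        self.batches = {}             # hostname -> URLs waiting to be sent
        self.frontier = []            # URLs this crawler still has to fetch
        self.known = set()            # URLs already queued, to avoid repeats

    def connect(self, peer):
        """Register another crawler as the owner of peer.host."""
        self.peers[peer.host] = peer
        self.batches[peer.host] = []

    def enqueue(self, url):
        """Add a URL to the local frontier unless it was already queued."""
        if url not in self.known:
            self.known.add(url)
            self.frontier.append(url)

    def receive(self, urls):
        """Accept a batch of URLs sent by a peer."""
        for url in urls:
            self.enqueue(url)

    def see_link(self, url):
        """Route a discovered link: keep it, batch it for a peer, or drop it."""
        host = urlparse(url).hostname
        if host == self.host:
            self.enqueue(url)
        elif host in self.peers:
            batch = self.batches[host]
            if url not in batch:
                batch.append(url)
            if len(batch) >= self.batch_size:
                # The batch is full: hand it over and start a new one.
                self.peers[host].receive(batch)
                self.batches[host] = []
        # Links to any other host are ignored.


las = ExchangeCrawler("las.example.edu", batch_size=2)
main = ExchangeCrawler("www.example.edu", batch_size=2)
las.connect(main)
main.connect(las)

las.see_link("http://www.example.edu/a")   # buffered, batch not yet full
las.see_link("http://www.example.edu/b")   # batch full, sent to main
print(main.frontier)
```

Because each URL is fetched only by the crawler that owns its host, no page is downloaded twice, and links that Firewall mode would discard still reach the right crawler.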

  5. Cross-over mode
     • Graph showing how the percentage of overlap depends on the total amount of data crawled.
     [Figure: example graph]
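One way to compute the quantity plotted on this slide is shown below. The slide does not define the percentage precisely, so this sketch assumes it means the share of all downloads that were redundant; the `overlap_percentage` function and the sample page names are hypothetical.

```python
def overlap_percentage(*download_logs):
    """Percentage of downloads that fetched a page already fetched elsewhere.

    Each argument is the list of pages one crawler downloaded.
    Assumed definition: 100 * (total downloads - unique pages) / total downloads.
    """
    total = sum(len(log) for log in download_logs)
    if total == 0:
        return 0.0
    unique = len(set().union(*download_logs))
    return 100.0 * (total - unique) / total


# Hypothetical logs: both crawlers downloaded pages "d" and "g".
crawler_1 = ["a", "b", "c", "d", "g", "h"]
crawler_2 = ["e", "f", "d", "g"]
print(overlap_percentage(crawler_1, crawler_2))  # 20.0
```

Evaluating the function on growing prefixes of the logs gives one point per crawl size, which is the kind of data such a graph needs.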

  6. Comparing Exchange and Cross-Over modes
     • Graph showing the true data (overlap excluded) that the two crawlers have retrieved over time. The total data retrieved by each crawler is approximately the same, but once the overlap has been calculated and excluded from the cross-over mode, its retrieved data is less than that of the exchange mode.
     [Figure: example graph; labels: Data retrieved, overlapping]
