
CSCI 572’s Class Project

Measuring the performance of parallel crawlers in different modes. CSCI 572’s Class Project. Huy Pham PhD – Computer Science Spring 2011. Project inspired by the research paper on parallel crawlers. Site S1 is crawled by crawler C1 and site S2 is crawled by C2


Presentation Transcript


  1. Measuring the performance of parallel crawlers in different modes. CSCI 572’s Class Project. Huy Pham, PhD – Computer Science, Spring 2011.

  2. Project inspired by the research paper on parallel crawlers
     • Site S1 is crawled by crawler C1 and site S2 is crawled by crawler C2.
     • In Firewall mode, crawlers ignore inter-partition links (C1 ignores g and C2 ignores d). Firewall mode produces no overlap and runs quickly (no communication between crawlers), but some pages can be missed because inter-partition links are dropped.
     • In Cross-over mode, crawlers also follow inter-partition links and therefore download more pages than in Firewall mode, but overlap becomes an issue (g and d get downloaded twice).
     [Figure: two parallel crawlers]

  3. [Diagram labels: Crawler 1, Crawler 2, Viterbi, LAS, usc.edu]

  4. Continued
     • In Exchange mode, crawlers periodically and incrementally exchange inter-partition links, which avoids overlap and increases coverage.
     • Implementation: two websites are crawled in parallel, the USC Viterbi School of Engineering and the USC School of Letters, Arts and Sciences (LAS). The two sites have their own data and also share many links (generally to each other and to the main USC website). Data from the USC website is ignored in Firewall mode, overlap occurs in Cross-over mode when the two sites point to each other, and Exchange mode is expected to prove the best of the three modes.
     [Diagram labels: Nutch, Solr, Viterbi crawler, LAS crawler, DBMS, Indexing]

  5. Evaluation
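
The three modes can be compared on a toy link graph. The sketch below is only an illustration, not the project's Nutch/Solr setup: the page names, the LINKS graph, the seeds, and the helper functions are all hypothetical. It assumes coverage is the share of all pages downloaded at least once and overlap is the number of redundant downloads divided by the number of unique pages. For simplicity, Exchange mode here forwards each inter-partition link to its owner immediately instead of in periodic batches.

```python
from collections import deque

# Hypothetical toy web: two partitions, "s1" and "s2".
# s1/d is reachable only through an inter-partition link from s2/f.
LINKS = {
    "s1/a": ["s1/b", "s1/c"],
    "s1/b": ["s2/g"],          # inter-partition link
    "s1/c": [],
    "s1/d": [],
    "s2/f": ["s2/g", "s1/d"],  # s1/d is an inter-partition link
    "s2/g": ["s2/h"],
    "s2/h": [],
}
SEEDS = {"s1": "s1/a", "s2": "s2/f"}  # one crawler per partition


def site_of(url):
    """Return the partition that owns a page (the part before '/')."""
    return url.split("/")[0]


def crawl(mode, seeds):
    """Run one crawler per partition, round-robin, in the given mode."""
    queues = {site: deque([seed]) for site, seed in seeds.items()}
    seen = {site: {seed} for site, seed in seeds.items()}
    downloads = {site: [] for site in seeds}
    while any(queues.values()):
        for site, queue in queues.items():
            if not queue:
                continue
            url = queue.popleft()
            downloads[site].append(url)
            for link in LINKS[url]:
                owner = site_of(link)
                if owner == site or mode == "cross-over":
                    target = site    # this crawler fetches the page itself
                elif mode == "exchange":
                    target = owner   # hand the link to the owning crawler
                else:
                    continue         # firewall: drop inter-partition links
                if link not in seen[target]:
                    seen[target].add(link)
                    queues[target].append(link)
    return downloads


def metrics(downloads):
    """Compute coverage and overlap under the assumptions stated above."""
    total = sum(len(pages) for pages in downloads.values())
    unique = len(set().union(*downloads.values()))
    return {
        "coverage": unique / len(LINKS),
        "overlap": (total - unique) / unique,
    }


if __name__ == "__main__":
    for mode in ("firewall", "cross-over", "exchange"):
        print(mode, metrics(crawl(mode, SEEDS)))
```

On this toy graph, Firewall mode never reaches s1/d, Cross-over mode reaches every page but downloads s2/g and s2/h twice, and Exchange mode reaches every page exactly once.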
