
Intelligent Crawling

Learn about the challenges of web crawling and how to develop effective crawling strategies. Explore importance metrics such as backlink count and page similarity, and discover crawling models like crawl-and-stop and limited buffer crawl. Experiment with ordering metrics to optimize the crawling process.

cjansen

Presentation Transcript


  1. Intelligent Crawling • Junghoo Cho, Hector Garcia-Molina • Stanford InfoLab

  2. What is a crawler? • Program that automatically retrieves pages from the Web. • Widely used for search engines.

  3. Challenges • There are many pages on the Web. (Major search engines indexed more than 100M pages.) • The size of the Web is growing enormously. • Most pages are not very interesting. → In most cases, it is too costly or not worthwhile to visit the entire Web space.

  4. Good crawling strategy • Make the crawler visit “important pages” first. • Save network bandwidth • Save storage space and management cost • Serve quality pages to the client application

  5. Outline • Importance metrics: what are important pages? • Crawling models: how is a crawler evaluated? • Experiments • Conclusion & future work

  6. Importance metric: the metric for determining whether a page is HOT • Similarity to a driving query • Location metric • Backlink count • PageRank

  7. Similarity to a driving query • Example queries: “Sports”, “Bill Clinton” • Targets the pages related to a specific topic • Importance is measured by the closeness of the page to the topic (e.g. the number of times the topic word appears in the page) • Personalized crawler
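As a rough illustration of the word-count measure mentioned on this slide, the sketch below scores a page by how often a single topic word appears in its text. The function name `similarity` and the tokenization rule are assumptions made for illustration; the slides do not specify how the count is computed, and a multi-word query such as “Bill Clinton” would need a different matching rule.

```python
import re

def similarity(page_text: str, topic_word: str) -> int:
    """Count case-insensitive whole-word occurrences of topic_word in page_text."""
    # Hypothetical tokenization: lowercase runs of letters and digits.
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    return words.count(topic_word.lower())

print(similarity("Sports news: local sports teams win.", "sports"))  # 2
```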

  8. Importance metric: the metric for determining whether a page is HOT • Similarity to a driving query • Location metric • Backlink count • PageRank

  9. Backlink-based metric • Backlink count: the number of pages pointing to the page (a citation metric) • PageRank: a weighted backlink count, where the weight is defined iteratively

  10. [Figure: link graph over pages A, B, C, D, E, F] • BackLinkCount(F) = 2 • PageRank(F) = PageRank(E)/2 + PageRank(C)
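The two formulas on this slide can be sketched in code. The link graph below is hypothetical: the transcript does not preserve the figure's edges, so the graph is only chosen to be consistent with the slide (F is linked from C and E, E has two outgoing links, C has one). The update rule is the simplified, undamped form shown on the slide, and the code assumes every page has at least one outgoing link.

```python
# Hypothetical link graph (page -> pages it links to); not taken from the figure.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["F"],
    "D": ["E"],
    "E": ["D", "F"],
    "F": ["A"],
}

def backlink_count(links, page):
    """Number of pages that link to `page`."""
    return sum(page in targets for targets in links.values())

def pagerank_step(links, rank):
    """One update: each page splits its current rank evenly among its out-links."""
    new_rank = {page: 0.0 for page in links}
    for source, targets in links.items():
        share = rank[source] / len(targets)  # assumes len(targets) > 0
        for target in targets:
            new_rank[target] += share
    return new_rank

def pagerank(links, iterations=50):
    """Repeat the update from a uniform start (the iterative definition)."""
    rank = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        rank = pagerank_step(links, rank)
    return rank

print(backlink_count(LINKS, "F"))  # 2
uniform = {page: 1.0 / len(LINKS) for page in LINKS}
# After one step, F holds rank(E)/2 + rank(C), matching the slide's formula.
print(pagerank_step(LINKS, uniform)["F"])
```

Practical PageRank implementations usually add a damping factor and handle pages without out-links; both are omitted here to stay close to the formula on the slide.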

  11. Ordering metric • The metric for a crawler to “estimate” the importance of a page • The ordering metric can be different from the importance metric

  12. Crawling models • Crawl and stop: keep crawling until the local disk space is full. • Limited buffer crawl: keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

  13. Crawl and stop model
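A minimal sketch of the crawl-and-stop model, under several assumptions: the Web is an in-memory link graph, "disk full" is modeled as a fixed page count `capacity`, and the crawler visits pages breadth-first. The helper `hot_fraction` is one plausible way to score a crawl (the share of a given set of hot pages that ended up stored); neither name comes from the slides.

```python
from collections import deque

def crawl_and_stop(links, seed, capacity):
    """Breadth-first crawl that stops once `capacity` pages are stored."""
    stored = []
    seen = {seed}
    frontier = deque([seed])
    while frontier and len(stored) < capacity:
        page = frontier.popleft()
        stored.append(page)  # stands in for downloading and saving the page
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return stored

def hot_fraction(stored, hot_pages):
    """Share of the hot pages that the crawl managed to store."""
    hot_pages = set(hot_pages)
    return len(hot_pages & set(stored)) / len(hot_pages)

# Hypothetical toy graph and hot set.
links = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": []}
stored = crawl_and_stop(links, "A", capacity=3)
print(stored)                            # ['A', 'B', 'C']
print(hot_fraction(stored, {"C", "D"}))  # 0.5
```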

  14. Crawling models • Crawl and stop: keep crawling until the local disk space is full. • Limited buffer crawl: keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

  15. Limited buffer model
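The limited buffer model can be sketched as keeping only the highest-scoring pages seen so far while the crawl continues. This is an illustrative sketch, not the authors' implementation: `importance` is a hypothetical scoring callback, and the buffer is a min-heap, so the least important stored page is the one thrown out when the buffer is full.

```python
import heapq

def limited_buffer_crawl(pages_in_visit_order, importance, buffer_size):
    """Visit every page but retain only the `buffer_size` highest-scoring ones."""
    buffer = []  # min-heap of (score, page); the root is the weakest kept page
    for page in pages_in_visit_order:
        entry = (importance(page), page)
        if len(buffer) < buffer_size:
            heapq.heappush(buffer, entry)
        else:
            # Push the new page, then discard whichever entry is now weakest.
            heapq.heappushpop(buffer, entry)
    return sorted(page for _, page in buffer)

# Hypothetical importance scores.
scores = {"A": 5, "B": 1, "C": 3, "D": 4}
print(limited_buffer_crawl(["A", "B", "C", "D"], scores.get, buffer_size=2))  # ['A', 'D']
```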

  16. Architecture [Diagram: components are the WebBase Crawler, Virtual Crawler, HTML parser, URL selector, URL pool, Page Info, Repository, and the Stanford WWW; labeled data flows are crawled page, extracted URL, page info, and selected URL]

  17. Experiments • Backlink-based importance metrics: backlink count, PageRank • Similarity-based importance metric: similarity to a query word

  18. Ordering metrics in experiments • Breadth first order • Backlink count • PageRank
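To make the difference between ordering metrics concrete, the sketch below crawls a toy graph twice: once breadth-first, and once choosing the frontier URL with the most backlinks seen so far. The graph, the function name `crawl`, and the tie-breaking rule (discovery order) are assumptions for illustration; PageRank ordering is left out to keep the example short.

```python
def crawl(links, seed, max_pages, ordering="backlink"):
    """Crawl up to max_pages pages, choosing the next URL by the ordering metric."""
    crawled = []
    frontier = [seed]             # discovered but unvisited URLs, in discovery order
    known_backlinks = {seed: 0}   # backlinks observed from pages crawled so far
    while frontier and len(crawled) < max_pages:
        if ordering == "backlink":
            # max() returns the first maximal item, so ties keep discovery order.
            url = max(frontier, key=lambda u: known_backlinks[u])
        else:  # "breadth-first"
            url = frontier[0]
        frontier.remove(url)
        crawled.append(url)
        for target in links.get(url, []):
            known_backlinks[target] = known_backlinks.get(target, 0) + 1
            if target not in crawled and target not in frontier:
                frontier.append(target)
    return crawled

# Hypothetical toy graph.
links = {"A": ["B", "C"], "B": ["D", "E"], "C": ["E"], "D": [], "E": []}
print(crawl(links, "A", 4, ordering="breadth-first"))  # ['A', 'B', 'C', 'D']
print(crawl(links, "A", 4, ordering="backlink"))       # ['A', 'B', 'C', 'E']
```

With room for only four pages, the backlink ordering reaches E (two known backlinks) before D (one), while breadth-first order does the opposite.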

  19. Similarity-based crawling • The content of the page is not available before it is visited • Essentially, the crawler should “guess” the content of the page • More difficult than backlink-based crawling

  20. [Figure: a parent page marked HOT links to unvisited pages (shown as “?”); labels: Promising page, Anchor Text (“Sports”, “Sports!!”), URL (“…/sports.html”)]

  21. Virtual crawler for similarity-based crawling • Promising page: • Query word appears in its anchor text • Query word appears in its URL • The page pointing to it is an “important” page • Visit “promising pages” first • Visit “non-promising pages” in the ordering metric order
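The heuristic on this slide can be sketched as a predicate plus a sort. The sketch assumes that any one of the three conditions is enough to make a page promising, that promising pages keep their discovery order among themselves, and that a higher `ordering_score` means an earlier visit; the `Candidate` class and the example.com URLs are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    url: str
    anchor_text: str
    parent_is_hot: bool      # whether the page linking here was judged important
    ordering_score: float    # value of the ordering metric (e.g. backlink count)

def is_promising(candidate, query):
    """True if any of the slide's three conditions holds (assumed to be an OR)."""
    q = query.lower()
    return (
        q in candidate.anchor_text.lower()
        or q in candidate.url.lower()
        or candidate.parent_is_hot
    )

def visit_order(candidates, query):
    """Promising pages first; the rest by descending ordering score."""
    def key(c):
        if is_promising(c, query):
            return (0, 0.0)  # equal keys: the stable sort keeps discovery order
        return (1, -c.ordering_score)
    return [c.url for c in sorted(candidates, key=key)]

candidates = [
    Candidate("http://example.com/news.html", "Latest news", False, 3.0),
    Candidate("http://example.com/sports.html", "Scores", False, 1.0),
    Candidate("http://example.com/weather.html", "Weather", False, 5.0),
    Candidate("http://example.com/team.html", "Sports!!", False, 0.5),
]
for url in visit_order(candidates, "sports"):
    print(url)
```

Here sports.html and team.html are visited first (URL match and anchor-text match), followed by weather.html and news.html in ordering-score order.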

  22. Conclusion • PageRank is generally good as an ordering metric. • By applying a good ordering metric, it is possible to gather important pages quickly.
