
Freshness Policy

Binoy Dharia, K. Rohan Gandhi, Madhura Kolwadkar. Department of Computer Science, University of Southern California, Los Angeles, CA.


Presentation Transcript


  1. Freshness Policy • Binoy Dharia, K. Rohan Gandhi, Madhura Kolwadkar • Department of Computer Science, University of Southern California, Los Angeles, CA

  2. Freshness Policy • A freshness policy, also known as a revisit policy, determines the order and timing with which a crawler re-crawls web pages. • By the time a web crawler has finished its crawl, many events may have occurred, including creations, updates, and deletions, which make the crawled data out of date. • To display the latest results to the user, a search engine must have an efficient revisit policy. • An efficient revisit policy not only saves time and bandwidth but also keeps the search engine's data up to date.

  3. Metrics for Evaluation of Freshness Policy • Two metrics describe how up to date the stored copy of a page is: • Freshness: a binary measure that indicates whether the local copy is accurate. The freshness of a page $p$ in the repository at time $t$ is defined as: $F_p(t) = 1$ if the local copy of $p$ is identical to the live page at time $t$, and $F_p(t) = 0$ otherwise. • Age: a measure of how outdated the local copy is. The age of a page $p$ in the repository at time $t$ is defined as: $A_p(t) = 0$ if $p$ has not been modified since it was crawled, and $A_p(t) = t - (\text{modification time of } p)$ otherwise.
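
The two metrics can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the use of exact string equality to compare the copies, and the timestamp arguments are hypothetical and are not taken from the slides.

```python
from datetime import datetime, timedelta
from typing import Optional


def freshness(local_copy: str, live_page: str) -> int:
    """Binary freshness: 1 if the stored copy equals the live page, else 0."""
    return 1 if local_copy == live_page else 0


def age(now: datetime, crawled_at: datetime,
        last_modified: Optional[datetime]) -> timedelta:
    """Age of the stored copy at time `now`.

    Zero if the page has not been modified since it was crawled,
    otherwise the time elapsed since the modification.
    """
    if last_modified is None or last_modified <= crawled_at:
        return timedelta(0)
    return now - last_modified


if __name__ == "__main__":
    crawled = datetime(2011, 3, 1, 8, 0)
    modified = datetime(2011, 3, 2, 8, 0)
    now = datetime(2011, 3, 4, 8, 0)
    print(freshness("old text", "new text"))   # copies differ
    print(age(now, crawled, modified).days)    # days since the change
```

In practice the comparison would be made on extracted page text rather than raw HTML, so that markup-only changes do not count as staleness.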

  4. Methodology • Tracked more than 90 sites over a period of 2 weeks. • Divided them into 4 categories: • Movies • Technology • Education • News • Selected sites based on Alexa traffic rankings - Rohan and Binoy • Developed a crawler in Java to download, twice a day, the original version of each web page as well as its cached versions from Google and Bing - Binoy and Rohan • Implemented our own code to extract the date and time from the cache for each web page - Rohan • Implemented our own diff functionality to detect changes in a web page over time; it ignores HTML tags and scripts and compares only the data between the tags - Madhura • Data integration - Madhura • Data analysis - Binoy, Rohan and Madhura • Study of the Nutch adaptive fetch policy - Binoy
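
The diff step described above (ignore tags and scripts, compare only the data between tags) could look roughly like the following. The authors' implementation was in Java and is not shown in the slides; this Python sketch uses the standard-library `html.parser` module, and the names `visible_text` and `has_changed` are hypothetical. Skipping `<style>` blocks as well as `<script>` blocks is an added assumption.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text between tags, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())   # normalise whitespace
            if text:
                self.chunks.append(text)


def visible_text(html: str) -> list:
    """Return the list of text fragments found between tags."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def has_changed(old_html: str, new_html: str) -> bool:
    """True if the text content differs; markup and scripts are ignored."""
    return visible_text(old_html) != visible_text(new_html)
```

With this approach a page whose markup or embedded script changes but whose text stays the same is treated as unchanged, which matches the behaviour the slide describes.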

  5. NUTCH 1.2 Setup • Installed Nutch with Lucene on a local machine for crawling • Settings used for Nutch crawling:
     <name>db.fetch.interval.default</name> <value>172800</value>
     <name>db.fetch.schedule.class</name> <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
     <name>db.fetch.schedule.adaptive.inc_rate</name> <value>0.4</value>
     <name>db.fetch.schedule.adaptive.dec_rate</name> <value>0.2</value>
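
To show what these settings do, here is a simplified Python model of an adaptive schedule of this kind: the re-fetch interval shrinks by `dec_rate` when a page was found modified and grows by `inc_rate` when it was not. This is a sketch of the general idea, not Nutch's code. The function name, the clamping bounds of 60 seconds and 365 days, and the omission of any other options the real class may apply are all assumptions.

```python
def next_interval(interval: float, modified: bool,
                  inc_rate: float = 0.4, dec_rate: float = 0.2,
                  min_interval: float = 60.0,
                  max_interval: float = 365 * 24 * 3600.0) -> float:
    """Return the next re-fetch interval in seconds.

    interval     -- current interval in seconds
    modified     -- whether the page changed since the last fetch
    min/max      -- assumed clamping bounds (not taken from the slides)
    """
    if modified:
        interval *= (1.0 - dec_rate)   # page changed: revisit sooner
    else:
        interval *= (1.0 + inc_rate)   # page unchanged: revisit later
    return max(min_interval, min(interval, max_interval))


if __name__ == "__main__":
    interval = 172800.0   # db.fetch.interval.default from the slide (2 days)
    for changed in (False, False, True):
        interval = next_interval(interval, changed)
        print(round(interval / 86400, 2), "days")
```

Starting from the 172800-second default (two days), an unchanged page is revisited after roughly 2.8 days on the next round, while a changed page is revisited after roughly 1.6 days.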

  6. Nutch Crawling Snapshot • Average freshness achieved with the Nutch fetch policy: 0.5

  7. Data Integration and Calculations (Excel) • Data snippet after integration • Age and freshness calculations • Per site: Average Age per Site = (Sum of Ages) / (Number of Crawls); Average Freshness per Site = (Sum of Freshness) / (Number of Crawls) • Per category: Average Age per Category = (Sum of Average Site Ages) / (Site Count); Average Freshness per Category = (Sum of Average Site Freshness) / (Site Count)
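
The per-site and per-category averages above were computed in Excel; the same two-level averaging can be sketched in Python as follows. The data layout (one list of per-crawl values per site, plus a site-to-category mapping), the function names, and the sample values are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def site_average(values):
    """Average of the per-crawl values (ages or freshness) for one site."""
    return sum(values) / len(values)


def category_averages(site_samples, site_category):
    """Average of the site averages within each category.

    site_samples  -- dict: site -> list of per-crawl values
    site_category -- dict: site -> category name
    """
    per_category = defaultdict(list)
    for site, values in site_samples.items():
        per_category[site_category[site]].append(site_average(values))
    return {cat: sum(avgs) / len(avgs) for cat, avgs in per_category.items()}


if __name__ == "__main__":
    # Made-up ages in days, for illustration only.
    ages = {"a.example": [1.0, 3.0], "b.example": [2.0, 2.0, 2.0],
            "c.example": [0.0, 1.0]}
    cats = {"a.example": "News", "b.example": "News", "c.example": "Movies"}
    print(category_averages(ages, cats))
```

Averaging the site averages, rather than pooling all crawls, gives every site equal weight within its category regardless of how many crawls succeeded for it.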

  8. Standard Deviation • Standard Deviation in Age for a Category (Days) = sqrt [ (sum of squared differences between each site's average age and the category average age) / Site Count ]
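
As a worked illustration of this formula, the sketch below computes the population standard deviation (dividing by the site count, as on the slide) over a list of per-site average ages. The function name and the sample values are hypothetical.

```python
import math


def category_age_std(site_average_ages):
    """Population standard deviation of per-site average ages, in days."""
    n = len(site_average_ages)
    mean = sum(site_average_ages) / n
    squared_diffs = sum((a - mean) ** 2 for a in site_average_ages)
    return math.sqrt(squared_diffs / n)


if __name__ == "__main__":
    # Made-up per-site average ages in days, for illustration only.
    print(category_age_std([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

The standard library's `statistics.pstdev` computes the same quantity and could be used instead.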

  9. Data Analysis • Age comparison between Google and Bing • Conclusions: • Google's database is much more up to date than Bing's • Google crawls news sites more than once a day • Google's crawling cycle is mostly consistent across categories • Google's average crawling cycle is 0.8 days • Bing's average crawling cycle is 4.6 days

  10. Data Analysis • Freshness comparison between Google and Bing • Conclusions: • News sites change frequently, so even though the age for news sites is low, the cached page is usually not fresh • Google's average freshness is 0.65 • Bing's average freshness is 0.28

  11. Data Analysis • Comparison of standard deviation across domains • Conclusions: • Google's standard deviation is low, which indicates that a site's category is not a major factor in deciding how often it is crawled • The same inference does not apply to Bing

  12. Data Analysis • Alexa rank (x-axis) vs Google cache age (y-axis) • Conclusion: • Google crawls sites with high traffic more frequently

  13. Data Analysis • Alexa rank (x-axis) vs Bing cache age (y-axis) • Conclusion: • Bing's crawling is uniform across sites with varying traffic volume
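
The slides draw these rank-versus-cache-age conclusions from plots. One way to put a number on such a relationship is a Spearman rank correlation between Alexa rank and average cache age; the slides do not say that the authors did this, so the sketch below is an added illustration. Note that a smaller Alexa rank number means more traffic, so a positive coefficient would mean that lower-traffic sites tend to have older cached copies. The function names are hypothetical.

```python
import math


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

A coefficient near 0 would correspond to crawling that is uniform across traffic levels, while a value closer to 1 would correspond to popular sites being refreshed more often.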

  14. Data Analysis • Date modified vs crawl date • Conclusion: • For high-ranking sites, Google's crawling appears to adapt to changes in the original site, while Bing's crawling is uniform

  15. Data Analysis • Date modified vs crawl date • Conclusion: • For low-ranking sites, both Google's and Bing's crawling appear to be uniform

  16. Conclusions • Google freshness policy factors identified • Popularity/traffic volume is considered • Category is not considered • Frequency of change of a page affects the crawling cycle (adaptive) • Bing freshness policy factors identified • Site popularity is not considered • Category is considered • Frequency of change of a page affects the crawling cycle (adaptive)

  17. Limitations and Future Work • Limitations • Conclusions are drawn from a limited random data sample because of • Crawling restrictions on Google cached data • Changes in Bing cached links every time Bing's cached repository is updated • A larger time frame is required to identify the crawling behavior of each search engine • High freshness was observed for Nutch because the crawling interval was short • Future Work • Additional factors, such as the number of incoming and outgoing links, can be recorded and their correlation with crawling observed • Factors such as ranking, popularity, and number of outgoing links can be incorporated into the Nutch adaptive fetch policy
