Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay • Ziv Bar-Yossef et al., IBM Almaden and T.J. Watson Research Centers • Mark Strohmaier
Problem Motivation • Determining if a link is dead is not trivial • Using dead links as a decay signal is very noisy
Estimating Proportion of Dead Pages • Begin a random walk at a page • At each step, with probability σ stop and declare success; with probability 1-σ follow a randomly chosen out-link • If the walk reaches a dead page, declare failure • The decay score of a page is the probability that a walk starting there ends in failure
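The walk on this slide can be approximated by simulation. The sketch below is a minimal Monte Carlo estimate, not the authors' implementation: the names `walk_fails` and `estimate_decay`, the toy link graph, and the choice to treat a live page with no out-links as a success are all assumptions made for illustration.

```python
import random

def walk_fails(start, links, dead, sigma, rng):
    """Run one random walk from `start`; return True if it hits a dead page."""
    page = start
    while True:
        if page in dead:
            return True               # reached a dead page: failure
        if rng.random() < sigma:
            return False              # stopped with probability sigma: success
        out_links = links.get(page, [])
        if not out_links:
            return False              # assumption: nowhere to go counts as success
        page = rng.choice(out_links)  # follow a uniformly random out-link

def estimate_decay(start, links, dead, sigma=0.1, trials=20000, seed=0):
    """Estimate the decay score of `start` as the fraction of failed walks."""
    if not 0 < sigma <= 1:
        raise ValueError("sigma must be in (0, 1] so that every walk terminates")
    rng = random.Random(seed)
    failures = sum(walk_fails(start, links, dead, sigma, rng) for _ in range(trials))
    return failures / trials

# Hypothetical toy graph: page "a" links only to the dead page "x".
links = {"a": ["x"], "b": ["c"], "c": ["b"]}
dead = {"x"}
score_a = estimate_decay("a", links, dead, sigma=0.1)
score_b = estimate_decay("b", links, dead, sigma=0.1)
```

In this toy graph the walk from "a" fails whenever it does not stop on the first step, so the estimate for "a" should be close to 1-σ, while "b" never reaches a dead page and scores 0.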
Differences from PageRank • The decay of a page depends only on the pages reachable from it, so it can be computed in isolation • A page's decay score is very easy to reduce, for example by removing its dead links
Three types of dead pages • Malformed URL • Host does not exist • Page does not exist on host
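The three cases on this slide can be checked in order, from cheapest to most expensive. The sketch below assumes two caller-supplied checks, `host_exists` and `page_exists`, which are hypothetical names standing in for a DNS lookup and an HTTP request; only the URL parsing uses a real library call.

```python
from enum import Enum
from typing import Callable, Optional
from urllib.parse import urlsplit

class DeadKind(Enum):
    MALFORMED_URL = "malformed URL"
    NO_SUCH_HOST = "host does not exist"
    NO_SUCH_PAGE = "page does not exist on host"

def classify_dead(
    url: str,
    host_exists: Callable[[str], bool],   # hypothetical: e.g. wraps a DNS lookup
    page_exists: Callable[[str], bool],   # hypothetical: e.g. wraps an HTTP request
) -> Optional[DeadKind]:
    """Return which kind of dead page `url` is, or None if it looks alive."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return DeadKind.MALFORMED_URL     # urlsplit rejects some invalid URLs outright
    if parts.scheme not in ("http", "https") or not host:
        return DeadKind.MALFORMED_URL
    if not host_exists(host):
        return DeadKind.NO_SUCH_HOST
    if not page_exists(url):
        return DeadKind.NO_SUCH_PAGE
    return None

# Usage with stand-in checks instead of real network calls.
known_hosts = {"example.com"}
known_pages = {"http://example.com/index.html"}
check_host = lambda h: h in known_hosts
check_page = lambda u: u in known_pages
result_bad = classify_dead("not a url", check_host, check_page)
result_host = classify_dead("http://nowhere.invalid/x", check_host, check_page)
result_page = classify_dead("http://example.com/gone.html", check_host, check_page)
result_live = classify_dead("http://example.com/index.html", check_host, check_page)
```

Passing the two checks in as functions keeps the classification logic separate from networking, so it can be exercised without touching a real server.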
Detecting 'soft-404s' • Query a given server for a page that is unlikely to exist • Record the server's response to the dummy request • Compare it with the server's behaviour when the legitimate request is sent; matching behaviour suggests a soft-404
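A simplified version of this probe might look as follows. It is a sketch under stated assumptions, not the procedure from the paper: the `Response` record, the `fetch` callable, the `dummy_sibling` helper, and the rule "same final URL or same body means soft-404" are all illustrative choices.

```python
import uuid
from dataclasses import dataclass
from typing import Callable
from urllib.parse import urlsplit, urlunsplit

@dataclass
class Response:
    status: int     # HTTP status after following redirects
    final_url: str  # URL the request ended up at
    body: str       # response body text

def dummy_sibling(url: str) -> str:
    """Build a URL in the same directory that almost surely does not exist."""
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0]
    random_path = f"{directory}/{uuid.uuid4().hex}.html"
    return urlunsplit((parts.scheme, parts.netloc, random_path, "", ""))

def is_soft_404(url: str, fetch: Callable[[str], Response]) -> bool:
    """Guess whether `url` is a soft-404 by comparing it with a dummy request."""
    real = fetch(url)
    if real.status == 404:
        return False   # an ordinary (hard) 404, not a soft one
    probe = fetch(dummy_sibling(url))
    if probe.status == 404:
        return False   # the server reports missing pages honestly
    # Assumed rule: the legitimate request behaves just like the dummy one.
    return real.final_url == probe.final_url or real.body == probe.body

# Stand-in server that redirects every unknown page to its home page.
def soft_server(u: str) -> Response:
    if u == "http://example.com/docs/live.html":
        return Response(200, u, "real content")
    return Response(200, "http://example.com/", "home page")

gone_is_soft = is_soft_404("http://example.com/docs/gone.html", soft_server)
live_is_soft = is_soft_404("http://example.com/docs/live.html", soft_server)
```

A real `fetch` would wrap an HTTP client. Comparing bodies exactly is deliberately crude, and a practical check would need to tolerate pages whose content changes between requests.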
Measure of Decay • If D is the set of all dead pages
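One way to write a decay score that is consistent with the random walk on the earlier slide is the recursion below. It is a reconstruction under that assumption rather than the slide's own formula. The symbols D_σ(p) for the decay score of page p and Out(p) for the set of out-links of p are introduced here for illustration.

```latex
D_\sigma(p) =
\begin{cases}
1, & p \in D, \\[4pt]
(1-\sigma)\,\dfrac{1}{|\mathrm{Out}(p)|}\displaystyle\sum_{q \in \mathrm{Out}(p)} D_\sigma(q), & p \notin D.
\end{cases}
```

For a live page whose only out-link is dead, this gives D_σ(p) = 1-σ, which matches the walk: it fails unless it stops on the first step.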
First Round of Experiments • 1000 pages were randomly chosen from a web crawl of two billion pages • 475 (47.5%) were already dead • Of 710 dead links, 207 (about 29%) pointed to soft-404s
Conclusions and Remarks • A number of tools exist for identifying dead links, but few exist for identifying decay • Incorporating decay scores into search engines could improve result rankings • Decay computations could also be used to improve crawling