
  1. Project Topic: Performance and Cost Tradeoffs in Web Search. Nick Craswell, Francis Crimmins, David Hawking, Alistair Moffat. Reviewed by: Johnny Sia (csia005), Allen Wang (awan015), Li Li (lli057), Hui Zhang (hzha113)

  2. Outline • Motivation and Introduction by Johnny • Background information by Allen • Case Study and cost model analysis (Research Finder) by Li • New hybrid approach and Conclusion by Henry

  3. Motivation • Web search engines crawl the web to gather the data that they index • Slowly crawling the web to download pages from many websites results in a large amount of data being transferred across networks • These network costs must be paid for! • In the case of Google, ONE crawl of the 3 billion pages it indexes would have a network cost of over $1.5 million
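That headline figure is easy to sanity-check. Here is a back-of-envelope sketch in Python, assuming roughly 20 kB per page and the $22.5/GB transfer cost used on slide 14 (both are the slides' own values, not independent estimates):

```python
# Rough sanity check on the "$1.5 million per crawl" claim.
# Assumes ~20 kB per page and $22.5/GB transfer cost (slide 14's Ct).
pages = 3e9            # pages in the index
page_size_gb = 2e-5    # ~20 kB per page, expressed in GB
cost_per_gb = 22.5     # transfer cost in $/GB

crawl_cost = pages * page_size_gb * cost_per_gb
print(f"One full crawl: ${crawl_cost:,.0f}")  # ~$1,350,000, on the order of $1.5M
```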

  4. 8 Billion web pages!!!

  5. Introduction • Two standard approaches to providing a search service • Periodic crawling (e.g. Google) • Metasearch (e.g. MetaCrawler) • A new alternative • A crawl-metasearch hybrid model

  6. Aim • Aim: To find the most cost-effective way to support web search services. • Where does cost come from? Nothing is FREE!! Data traffic costs a lot!! • Two common approaches: • Web-Crawling • Metasearch

  7. Web-Crawling • What is a crawler? A: a program that automatically collects web pages to create a local index. • Pros. • Less query processing time required • Fast response to users • Fixed, predictable cost • Cons. • Expensive!!! • Indexed data becomes stale
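To make the "collects web pages to create a local index" idea concrete, here is a minimal, hypothetical sketch (not the Panoptic crawler); a production crawler would also honour robots.txt, rate-limit per host, and persist its index rather than hold pages in memory:

```python
# Minimal sketch of a crawler that builds a local index (illustrative only).
import re
from collections import deque
from urllib.request import urlopen

def crawl(seed_url, max_pages=10):
    index = {}                      # url -> page text: the "local index"
    frontier = deque([seed_url])    # URLs waiting to be fetched
    seen = {seed_url}
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue                # unreachable page: skip it
        index[url] = html
        # Follow absolute links only, to keep the sketch short.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index
```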

  8. Metasearch How does it work? [Flow diagram: users send a QUERY to the metasearcher; query wrapping forwards it as parallel queries to the local search engines (University of Auckland, AUT, MIT, MSDN (Microsoft)); their results come back, and results merging produces the final results for the user.]
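The same flow in code, as a minimal sketch: the engine callables and their score conventions are hypothetical stand-ins, and a real metasearcher needs a per-engine wrapper plus far more careful result merging than a global sort.

```python
# Sketch of the metasearch flow: fan a query out to local engines in
# parallel, then merge their results. Engines are hypothetical callables
# mapping a query string to a list of (score, url) pairs.
from concurrent.futures import ThreadPoolExecutor

def metasearch(query, engines):
    """engines: dict of name -> search(query) returning [(score, url), ...]."""
    with ThreadPoolExecutor() as pool:
        # Query wrapping would normally translate `query` into each
        # engine's own syntax; here every engine takes the raw string.
        futures = {name: pool.submit(search, query)
                   for name, search in engines.items()}
    # Results merging: tag each hit with its engine, then do a naive
    # global sort on the engines' own scores.
    merged = [(score, url, name)
              for name, fut in futures.items()
              for score, url in fut.result()]
    return sorted(merged, reverse=True)
```

It would be called as, say, metasearch("cost models", {"auckland": search_auckland, "aut": search_aut}), where each search_* function is that engine's wrapper.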

  9. Metasearch • Pros. • Cheap to maintain (really?) • “Fresh” data • Cons. • Quality of the search depends on the local servers • A “wrapper” is needed to forward queries to each engine • Results from the various servers need to be merged

  10. Case Study • Panoptic (Research Finder) • A searchable full-text index-based retrieval system • Based on a regular crawl • The newest version also introduces a metasearch model • Operated by a range of Australian research institutions • The eight largest Australian universities contribute the most data to the Panoptic crawl

  11. Case Study (cont’d)

  12. Case Study (cont’d) • Rate of change: • (a) Pages which have disappeared • (b) Pages which changed so much that they give bad results • (c) Pages which changed only a little and still give good answers • Changes of type (c) can be ignored • Changes of types (a) and (b) are the most important in a search system. Why? As a crawl becomes stale, users are more likely to see an embarrassing result.

  13. Case Study (cont’d) • Over an eight-day period: • Disappearance: 1.6% • Small changes: 8.2% • Large changes: 6.4% • No changes: 83.8% Normally, pages in the .com domain change more frequently than those in the .edu domain (result from 151 million pages)

  14. Cost Model • Cost of crawling: Fc x Sd x So x Ct • Cost of answering queries: Fq x Sq x (Nc + 1) x Ct • Fc: crawl frequency (crawls/month), value 1 • Sd: combined data size (GB), value 33.3 • So: crawling overhead (bytes fetched / bytes indexed), value 1.7 • Ct: transfer cost ($/GB), value 22.5 = $0.0225/MB (for comparison: 0.07 NZD/MB = 0.066 AUD/MB at 1 NZD = 0.9368 AUD) • Fq: query arrival rate (queries/month), value 10,000 • Sq: size of a query response page (GB/query), value 2x10^(-5) = 20 kB • Nc: number of servers being federated, value 175

  15. Cost Model (cont’d) • Crawling: Fc x Sd x So x Ct • Answering queries: Fq x Sq x (Nc + 1) x Ct • If the number of queries per month is low, metasearch is cost-effective • If the number of queries per month is high, crawling is cost-effective
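Plugging the slide-14 values into the two formulas makes the tradeoff concrete; the break-even query rate below follows from the model rather than being a number on the slides:

```python
# Monthly costs under the slide-14 cost model.
Fc, Sd, So, Ct = 1, 33.3, 1.7, 22.5   # crawls/month, GB, overhead, $/GB
Fq, Sq, Nc = 10_000, 2e-5, 175        # queries/month, GB/query, servers

crawl_cost = Fc * Sd * So * Ct              # ~ $1,274/month
metasearch_cost = Fq * Sq * (Nc + 1) * Ct   # ~ $792/month at 10,000 queries

# Query rate at which the two costs are equal: metasearch is cheaper
# below it, crawling is cheaper above it.
break_even_fq = (Fc * Sd * So) / (Sq * (Nc + 1))   # ~ 16,000 queries/month
print(crawl_cost, metasearch_cost, break_even_fq)
```

So at the assumed 10,000 queries/month metasearch wins, but under this model roughly 16,000 queries/month flips the decision in favour of crawling.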

  16. Performance and Cost Tradeoffs in Web Search • New hybrid approach • A full index suits a large query load, but metasearch is better when the query arrival rate is lower • Metasearch the largest organizations and crawl the others • Can reduce the crawl cost by approximately half • e.g. proof-of-concept demonstration at: http://thylacine.panopticsearch.com/hybriddemo/index.cgi • Still faces the disadvantages of metasearch • e.g. the need to write wrappers, response-time issues, and the reliance on the quality of local search services
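One way to read that rule is per server: under the slide-14 model, forwarding queries to a server costs Fq x Sq x Ct per month regardless of its size, while crawling it costs in proportion to its data. The split below is my own decomposition of the model (and the per-server sizes are invented), not a computation from the slides:

```python
# Hybrid planning sketch: crawl a server when crawling its data is
# cheaper per month than metasearching it. Parameter values from
# slide 14; the per-server data sizes below are hypothetical.
Fc, So, Ct, Fq, Sq = 1, 1.7, 22.5, 10_000, 2e-5

def plan(server_sizes_gb):
    crawl, meta = [], []
    for name, size_gb in server_sizes_gb.items():
        crawl_cost = Fc * size_gb * So * Ct   # monthly cost to crawl it
        meta_cost = Fq * Sq * Ct              # monthly cost to forward queries to it
        (crawl if crawl_cost <= meta_cost else meta).append(name)
    return crawl, meta

# Big sites get metasearched, small ones get crawled:
print(plan({"big-university": 5.0, "small-lab": 0.05}))
# -> (['small-lab'], ['big-university'])
```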

  17. Performance and Cost Tradeoffs in Web Search

  18. Performance and Cost Tradeoffs in Web Search • Conclusion • The group presented useful cost models and discussed several alternative approaches • Regrettably, many of the discussed options are not currently feasible • The most promising cost-reduction alternative in the current situation seems to be incremental, variable-frequency crawling • This model could be incorporated into a hybrid metasearch model for further savings, provided result merging can be performed sufficiently well in the future • No single reasonable solution covers all operational search systems • The state of the art in this area remains a challenging and attractive research subject
