
Internet Search Engine freshness by Web Server help

Internet Search Engine freshness by Web Server help. Presented by: Barilari Alessandro. Introduction. Search engines are an important source of information and keeping them up-to-date will result in more accurate answers to search queries.


Presentation Transcript


  1. Internet Search Engine freshness by Web Server help Presented by: Barilari Alessandro

  2. Introduction • Search engines are an important source of information and keeping them up-to-date will result in more accurate answers to search queries. • Search engines create their databases by probing web servers on a per-URL basis with a little help from the web servers. Alessandro Barilari

  3. Main Problem • There is no standard for facilitating the push of updates from servers to search engines: • It takes up to six months for a new page to be indexed by popular web search engines; • The data indexed by the search engines is often stale.

  4. Solution… • Having web servers help keep search engines fresh results in a favorable situation for web sites, search engines and users.

  5. …and its problems • The number of updates per second is very large. • Must balance between: • The number of interactions between web sites and search engines, and • The freshness of the search engines.

  6. Page rank impact • Popular pages will have higher page ranks: • Use popularity, in addition to age and freshness, to compute the mismatch between a web site and a search engine.

  7. Summary • Definitions and Cost Model • Algorithm • Analysis • Practical issues

  8. Some definitions • Update: an update u to a file f is a modification to f that has been flushed to the disk; • Propagation of an update: an update is said to be propagated when the web site has informed the search engine about the update. The search engine may or may not retrieve that update; • Meta-update propagation: at any time t, let U(t) be the set of unpropagated updates. The web site informs the search engine about all the updates in U(t).
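
The three definitions above can be modelled with a small data structure. This is only an illustrative sketch: the names `Update`, `WebSite`, `record` and `meta_update_propagate` are hypothetical and do not come from the presentation, and the `notify` callback stands in for whatever channel the web site uses to inform the search engine.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Update:
    """A modification to a file that has been flushed to the disk."""
    file: str          # path of the modified content file (hypothetical field)
    flushed_at: float  # time at which the modification reached the disk


@dataclass
class WebSite:
    # U(t): the updates the search engine has not yet been informed about.
    unpropagated: List[Update] = field(default_factory=list)

    def record(self, update: Update) -> None:
        """Register a new update; it starts out unpropagated."""
        self.unpropagated.append(update)

    def meta_update_propagate(self, notify: Callable[[List[Update]], None]) -> List[Update]:
        """Inform the search engine about every update in U(t) at once.

        Propagation only means the engine has been informed; whether it
        then retrieves the files is outside this model.
        """
        batch = list(self.unpropagated)
        notify(batch)
        self.unpropagated.clear()
        return batch
```

After a meta-update propagation the set U(t) is empty until the next update is flushed to disk.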

  9. Some definitions (2) • Weight of a file: given a content file, its weight f (non-negative) denotes the importance of the file; the weights are chosen such that: • Last_modification_time(u,t): the last time before t when the file f(u) was updated.

  10. The Cost Model • Components: • Communication cost; • Opportunity cost: represents the staleness of the search engine data as compared to the data on the web server. • CPU cost is ignored.

  11. Opportunity cost (OC) • Given an unpropagated update u to a content file f, the opportunity cost for update u at time t is: OC(u,t) = f(u) × (t − last_modification_time(u,t)), that is, the weight of the updated file times the time elapsed since its last modification. • Definition for meta-update propagation:
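
A minimal sketch of the per-update opportunity cost as a function. The formula for the meta-update case is not preserved in this transcript, so `total_opportunity_cost` assumes it is simply the sum of OC(u,t) over the unpropagated updates U(t); that assumption and both function names are illustrative, not taken from the presentation.

```python
from typing import Iterable, Tuple


def opportunity_cost(weight: float, last_modified: float, t: float) -> float:
    """OC(u,t) = weight of the updated file * (t - last modification time)."""
    if weight < 0:
        raise ValueError("file weights are non-negative")
    if t < last_modified:
        raise ValueError("t must not precede the last modification time")
    return weight * (t - last_modified)


def total_opportunity_cost(unpropagated: Iterable[Tuple[float, float]], t: float) -> float:
    """ASSUMPTION: total OC is the sum of OC(u,t) over all updates in U(t).

    Each item is a (weight, last_modification_time) pair.
    """
    return sum(opportunity_cost(w, lm, t) for w, lm in unpropagated)


# A file of weight 0.5 modified at time 10, observed at time 30:
print(opportunity_cost(0.5, 10.0, 30.0))                             # 10.0
print(total_opportunity_cost([(0.5, 10.0), (0.25, 20.0)], 30.0))     # 12.5
```

Because the cost grows linearly with the time an update stays unpropagated, heavier (more important) files accumulate staleness cost faster than lighter ones.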

  12. Communication cost (CC) • size_{f(u)}(t): the size of file f(u) at time t;

  13. Potential Communication cost (PCC) • Represents the communication cost which would need to be incurred in case update u were to be propagated after time t:

  14. The Cost Function • Given that an update u is unpropagated at time t, the cost function for that update at time t is given by:

  15. Summary • Definitions and Cost Model • Algorithm • Analysis • Practical issues

  16. FreshFlow Algorithm • When OC_tot equals PCC_tot at any time t, the web server can inform the search engine about all the unpropagated updates.
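
The trigger rule on this slide can be sketched as a simple check. Several things here are assumptions rather than part of the presentation: the names `PendingUpdate` and `should_propagate`, the totals being plain sums over the unpropagated updates, and the potential communication cost being supplied by the caller as a number (its formula is not preserved in this transcript). The check uses `>=` instead of strict equality because a server that polls at discrete times may step past the exact crossing point.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PendingUpdate:
    weight: float         # importance of the updated file (non-negative)
    last_modified: float  # last modification time of the file
    pcc: float            # potential communication cost, supplied by the caller (assumption)


def should_propagate(pending: Sequence[PendingUpdate], t: float) -> bool:
    """Return True once the total opportunity cost has caught up with
    the total potential communication cost."""
    if not pending:
        return False  # nothing to tell the search engine about
    oc_tot = sum(p.weight * (t - p.last_modified) for p in pending)
    pcc_tot = sum(p.pcc for p in pending)
    return oc_tot >= pcc_tot


pending = [PendingUpdate(weight=1.0, last_modified=0.0, pcc=5.0)]
print(should_propagate(pending, 4.0))  # False: staleness cost is still below PCC
print(should_propagate(pending, 5.0))  # True: the two totals are equal
```

Waiting until the two totals meet is what lets the rule trade off the number of interactions against staleness: cheap-to-send or important updates go out sooner, expensive or unimportant ones are batched longer.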

  17. Summary • Definitions and Cost Model • Algorithm • Analysis • Practical issues

  18. Analysis • The cost of the FreshFlow algorithm (called FF) is compared with the cost of an optimal off-line algorithm (called ADV).

  19. Analysis (2) • Lemma (1): OC(u,t) is monotonically non-decreasing; • Lemma (2): consider an update u to a file f, and suppose FF transmits it but ADV does not. Then OC_ADV(u,t) ≥ OC_FF(u,t). • Lemma (3): if the update is transmitted by the adversary (ADV), then CC_ADV(u,t) ≥ CC_FF(u,t).

  20. Theorem • FF is 2-competitive: Cost_FF(u,t) ≤ 2 × Cost_ADV(u,t)

  21. Summary • Definitions and Cost Model • Algorithm • Analysis • Practical issues

  22. Practical issues • There are multiple search engines: • Synchronization effect: pushing the updates to all of them would put pressure on the last-hop link to the web server; • Search engine load: some search engines might refuse to receive updates.

  23. The middleman approach • Each web server contacts only one middleman for sending its updates; • The middleman could itself be a group of middlemen.

  24. Benefits • The middleman can solve some additional issues: • Verifying trustworthiness of web servers; • Restricting the rate at which updates get transmitted to search engines.
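
The middleman idea from the last two slides can be illustrated with a toy relay. Everything in this sketch is hypothetical: the `Middleman` class, its allow-list notion of "trustworthiness", and the fixed minimum interval used as a rate limit are stand-ins chosen for illustration, since the presentation does not say how either check would be implemented.

```python
from typing import Callable, Iterable, List, Optional


class Middleman:
    """Accepts updates from web servers and relays them to several search engines."""

    def __init__(self,
                 trusted_servers: Iterable[str],
                 engines: List[Callable[[List[str]], None]],
                 min_interval: float) -> None:
        self.trusted = set(trusted_servers)  # ASSUMPTION: trust is a simple allow-list
        self.engines = engines               # one callback per search engine
        self.min_interval = min_interval     # minimum time between two relays
        self.queue: List[str] = []
        self.last_forward: Optional[float] = None

    def receive(self, server: str, urls: Iterable[str]) -> bool:
        """Queue updated URLs from a web server; reject untrusted servers."""
        if server not in self.trusted:
            return False
        self.queue.extend(urls)
        return True

    def flush(self, now: float) -> int:
        """Relay the queued URLs to every engine, unless rate-limited.

        Returns the number of URLs relayed (0 if nothing was sent).
        """
        if not self.queue:
            return 0
        if self.last_forward is not None and now - self.last_forward < self.min_interval:
            return 0  # too soon since the previous relay; keep batching
        batch, self.queue = self.queue, []
        for engine in self.engines:
            engine(batch)
        self.last_forward = now
        return len(batch)
```

With this shape the web server sends each update once, to the middleman, so the fan-out to multiple search engines no longer loads the server's last-hop link.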

  25. Limitations • The algorithm has not been used in practice; • The search engines need the cooperation of the web servers to keep track of updates to their URLs. Whether web servers will incorporate such a service remains to be seen.

  26. Conclusions • The FreshFlow algorithm is a solution that improves how quickly search engines receive data updates, while maintaining a high level of efficiency and performance; • The authors are planning to implement the algorithm in a real system (and have a future publication!)
