
WIC : A General-Purpose Algorithm for Monitoring Web Information Sources


Presentation Transcript


  1. WIC: A General-Purpose Algorithm for Monitoring Web Information Sources Sandeep Pandey (speaker) Kedar Dhamdhere Christopher Olston Carnegie Mellon University

  2. Dynamic Information on the Web • Bulletin boards • Online auctions • News • Weather • Roadway conditions, Sports scores, etc…

  3. Online Shopping, Auctions

  4. Stock Market

  5. Continuous Query Systems • Process information from dynamic Web sources automatically • e.g., CONQUER [Liu et al. WWW 1999], Niagara [Naughton et al. SIGMOD 2000], WebCQ [Liu et al. CIKM 2000]

  6. Past Research on CQ Systems • Focus on language design, query processing • Assume “push” model of information access • Information shows up at doorstep • Web sources are “pull” oriented • Must explicitly download Web pages, check for changes, submit changes to CQ engine

  7. Converting Pull -> Push [Diagram: WIC pulls from Web sources (auction sites, sports sites) and pushes to the CQ engine]

  8. Converting Pull -> Push • Topic has received little attention • So far only heuristics with no formal guarantees • Periodic polling of sources • Not scalable • CAM [Pandey et al. WWW’03], Gal et al. [JACM 2001]: • Take into account predicted change behavior • Create monitoring schedule in advance

  9. A good first step, but … • No formal guarantees • Suits narrow range of applications

  10. Example Application Scenarios [Figure: 2×2 grid of scenarios; one axis: append-only vs. complete overwrite; other axis: timeliness not critical vs. timeliness is critical]

  11. Outline • Introduction • Problem statement • WIC: Web Information Collector • Formal results: • WIC is a 2-approximation • Experimental results: • Timeliness-completeness tradeoff

  12. Databases @Carnegie Mellon Model of Pull-Oriented Sources • Proposed by Wolf et al. [WWW 2002] • Set of Web pages of interest P1 … Pn • Importance weight associated with each page • Time is divided into discrete time instants • Change: An update posted on a Web page • Known probability πij that page Pi will change at time Tj • We do not address the problem of estimating change probabilities

  13. Our Model [Figure: change probabilities (between 0.1 and 1.0) for pages P1, P2, P3 laid out along a time axis]

  14. Modeling the Change Characteristics [Figure: 2×2 grid of scenarios; one axis: append-only vs. complete overwrite; other axis: timeliness not critical vs. timeliness is critical]

  15. Modeling the Change Characteristics • The probability that a change to page Pi at time Tj remains available at time Tk • Case 1: changes overwrite old info. • Case 2: append-only • Also: sliding window, others …
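The two cases on this slide can be sketched in code. This is only an illustration: the function names and the argument layout are hypothetical, and the overwrite case assumes that changes at different time instants are independent, which the slides do not state.

```python
from math import prod

def life_append_only(change_probs, j, k):
    """Append-only page: a change posted at T_j is never erased."""
    return 1.0

def life_overwrite(change_probs, j, k):
    """Overwrite page: a change posted at T_j is still available at T_k
    only if no further change occurred at T_{j+1} .. T_k.
    Assumption: changes at different instants are independent."""
    return prod((1.0 - p for p in change_probs[j + 1 : k + 1]), start=1.0)

# Hypothetical change probabilities for one page at T_0 .. T_3.
probs = [1.0, 0.5, 0.2, 0.0]
print(life_append_only(probs, 0, 3))
print(life_overwrite(probs, 0, 2))
```

With these inputs the overwrite case multiplies (1 - 0.5) by (1 - 0.2), so a change posted at T_0 has roughly a 40% chance of still being visible at T_2.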

  16. Web Monitoring Requirements [Figure: 2×2 grid of scenarios; one axis: append-only vs. complete overwrite; other axis: timeliness not critical vs. timeliness is critical]

  17. Conflicting Requirements • Completeness: maximize number of changes captured • Timeliness: minimize delay in capturing changes • Limited resources • Up to C pages can be monitored per time instant • When resources are not plentiful, the two objectives can be at odds with each other

  18. Timeliness-Completeness tradeoff • Resource constraint: C=1 [Figure: values over time for P1 (append-only) and P2 (overwrite)]

  19. Only Timeliness • Objective: Changes must be captured with zero delay [Figure: values over time for P1 (append-only) and P2 (overwrite)]

  20. Only Completeness • Objective: Maximize the number of changes captured [Figure: values over time for P1 (append-only) and P2 (overwrite)]

  21. Controlling the Tradeoff • Urgency: importance of information captured as a function of the delay in capturing it [Figure: example urgency functions]

  22. Web Monitoring Requirements [Figure: 2×2 grid of scenarios; one axis: append-only vs. complete overwrite; other axis: timeliness not critical (gradual urgency curve) vs. timeliness is critical (steep urgency curve)]

  23. Web Monitoring Objective • Maximize Utility • Utility = Expected number of changes captured, weighted by delay according to urgency function • Each monitoring action takes unit amount of resource • Resource constraint: amount of resource per time unit constrained
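One way to read this utility definition is as a sum over the changes that may have occurred since the page was last downloaded. The sketch below is an interpretation, not the authors' formula: the function name, the `weight` parameter, and the convention that a download at an instant also sees a change posted at that instant are all assumptions.

```python
def download_utility(change_probs, last, now, life, urgency, weight=1.0):
    """Expected urgency-weighted number of changes captured by downloading
    a page at time `now`, given it was last downloaded at time `last`
    (use -1 if it was never downloaded).

    life(j, k): probability a change at T_j is still available at T_k.
    urgency(d): value of capturing a change after a delay of d instants.
    """
    total = 0.0
    for j in range(last + 1, now + 1):
        total += change_probs[j] * life(j, now) * urgency(now - j)
    return weight * total

# Hypothetical inputs: append-only page, urgency halves with each instant.
u = download_utility(
    change_probs=[0.5, 0.5, 1.0],
    last=-1,
    now=2,
    life=lambda j, k: 1.0,
    urgency=lambda d: 0.5 ** d,
)
print(u)  # 0.5*0.25 + 0.5*0.5 + 1.0*1.0 = 1.375
```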

  24. Our Solution • Web Information Collector (WIC) • 2-approximation for all scenarios • Total utility accrued at least half that accrued by optimal monitoring schedule • Finds optimal solution in the following special case: • Timeliness is critical, changes overwrite

  25. Web Information Collector (WIC) • Online, greedy strategy • At each time instant, download page(s) with highest utility • Utility combines: • Probability that a change has occurred • Probability that change has not been erased • Delay in capturing change (weighted according to urgency function)
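The greedy strategy described on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the page representation, the helper names, and the tie-breaking rule are hypothetical, and the utility is recomputed from scratch at each instant, so it does not achieve the per-instant running time claimed for WIC.

```python
import heapq

def greedy_schedule(pages, horizon, capacity, urgency):
    """At each time instant, download the `capacity` pages with highest utility.

    pages: list of dicts with keys
        "probs": change probability per time instant,
        "life":  callable life(probs, j, k) -> probability that a change at
                 T_j is still available at T_k,
        "weight" (optional): importance weight, default 1.0.
    Returns a list with the page indices downloaded at each instant.
    """
    last = [-1] * len(pages)  # last download time per page (-1 = never)
    schedule = []
    for now in range(horizon):
        scores = []
        for i, page in enumerate(pages):
            probs = page["probs"]
            utility = sum(
                probs[j] * page["life"](probs, j, now) * urgency(now - j)
                for j in range(last[i] + 1, now + 1)
            )
            scores.append((page.get("weight", 1.0) * utility, i))
        chosen = [i for _, i in heapq.nlargest(capacity, scores)]
        for i in chosen:
            last[i] = now
        schedule.append(sorted(chosen))
    return schedule

# Hypothetical two-page example with capacity C = 1.
append_only = lambda probs, j, k: 1.0
pages = [
    {"probs": [1.0, 0.0], "life": append_only},
    {"probs": [0.0, 1.0], "life": append_only},
]
print(greedy_schedule(pages, horizon=2, capacity=1, urgency=lambda d: 0.5 ** d))
# [[0], [1]]
```

In the example, page 0 is certain to change at the first instant and page 1 at the second, so the greedy choice visits them in that order.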

  26. WIC continued • Running time: • O(# pages) per time instant under most settings of life and urgency • WIC is an online algorithm • Forecasting can be done at last minute

  27. Proof of 2-Approximation • See our paper

  28. Experiments [Figure: 2×2 grid of scenarios; one axis: append-only vs. complete overwrite; other axis: timeliness not critical vs. timeliness is critical] • Data: 7550 auction pages • Exponentially decaying urgency function parameterized by r
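The slide does not give the exact form of the urgency function. A minimal sketch, assuming the form exp(-r * delay), shows how a single parameter r could move between a steep and a gradual curve; the function name and the sample values of r are hypothetical.

```python
import math

def urgency(delay, r):
    """Assumed exponential-decay urgency: 1 at zero delay, decaying with rate r."""
    return math.exp(-r * delay)

# Larger r gives a steeper curve (timeliness is critical);
# smaller r gives a gradual curve (timeliness is less critical).
steep = [round(urgency(d, r=2.0), 3) for d in range(4)]
gradual = [round(urgency(d, r=0.1), 3) for d in range(4)]
print(steep)
print(gradual)
```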

  29. Experimental Results in Paper • Sensitivity to error in prediction • Not unduly sensitive • Comparison against prior approach (CAM) • Up to 80% improvement • Handles more applications • Timeliness-Completeness tradeoff

  30. Timeliness-Completeness tradeoff favor timeliness favor completeness

  31. Summary • Pull->push • Can’t have it all - Choose a combination of timeliness and completeness • Our solution: WIC - Handles many applications - Formal guarantee: 2-approximation - Online algorithm

  32. Urgency Parameter Controls Timeliness-Completeness Tradeoff • Best curve to use depends on application • Application 1: Agent to monitor and bid in online auctions on behalf of many customers • Use steep curve (timeliness is critical) • Application 2: Program to maintain database of large number of online resumes • Use gradual curve (timeliness less critical)

  33. Experiments • Determine exact change occurrence times • Add noise to simulate prediction inaccuracy: - False positives - False negatives - Gaussian spreading
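The noise procedure is only named on this slide, so the sketch below is one possible reading rather than the experimental setup itself. The function name, the rate parameters, and the choice to spread each predicted change with a normalized Gaussian kernel truncated to the observed horizon are all assumptions.

```python
import math
import random

def noisy_predictions(changes, fp_rate, fn_rate, sigma, seed=0):
    """Turn exact change occurrences (0/1 per time instant) into noisy
    change probabilities.

    fp_rate: chance of predicting a change where none occurred.
    fn_rate: chance of missing a change that did occur.
    sigma:   std. deviation (in time instants) of the Gaussian spreading;
             0 disables spreading.
    """
    rng = random.Random(seed)
    predicted = []
    for c in changes:
        if c:
            predicted.append(0 if rng.random() < fn_rate else 1)
        else:
            predicted.append(1 if rng.random() < fp_rate else 0)
    if sigma <= 0:
        return [float(x) for x in predicted]

    n = len(changes)
    probs = [0.0] * n
    for j, x in enumerate(predicted):
        if not x:
            continue
        # Spread one unit of probability mass around instant j.
        kernel = [math.exp(-((t - j) ** 2) / (2 * sigma ** 2)) for t in range(n)]
        total = sum(kernel)
        for t in range(n):
            probs[t] += kernel[t] / total
    return [min(1.0, p) for p in probs]

exact = [0, 0, 1, 0, 0]
print(noisy_predictions(exact, fp_rate=0.0, fn_rate=0.0, sigma=1.0))
```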
