
Monitoring the dynamic Web to respond to Continuous Queries

Sandeep Pandey, Krithi Ramamritham, Soumen Chakrabarti

IIT Bombay

Motivation
  • Web pages change rapidly (Sethuraman et al.):
    • 40% of commercial pages change per day
    • 23% of all pages change per day

  • Current search engine users
    • Need to repeat queries (how often?) and
    • Diff results with recent versions
    • Or poll frequently updated collections (e.g., Google News)
Continuous Queries (CQ)
  • Users register long-lived queries of interest
  • Pages of interest may be added, modified, and deleted
  • System continually updates responses
  • Example applications
    • Commuter updates: traffic and weather conditions
    • Alerts on cricket scores, stock portfolios
Discrete vs. continuous queries
  • Discrete queries
    • Query lives for an “instant”; one-shot answer
    • Optimize corpus freshness at all times
    • Objective penalizes delay from update to refresh
    • Usually handled by bulk crawls with diverse periods
  • Continuous queries
    • Queries have positive lifetime, with many updates over time
    • Updates must track changes closely
    • Objective penalizes number or importance of missed updates
    • Dynamic monitoring with more restrictive network resources
Talk outline
  • Introduction and motivation
  • Previous approaches
  • Our contributions
    • Continuous Adaptive Monitoring (CAM)
    • How to allocate limited polling resources among pages
    • How to schedule poll instants
  • Experiments
  • Conclusion
Related work
  • CONQUER and WebCQ (Liu, Pu and Tang)
    • Query language and architecture for CQ
    • Do not address monitoring for freshness optimization
  • NIAGARA (DeWitt and Naughton)
    • Query evaluation and optimization techniques
    • Database query optimization setting
  • ChangeDetector (Boyapati et al.)
    • Fixed-priority polling for given set of pages
  • Freshness for discrete queries
    • Poisson updates (Cho and Garcia-Molina)
    • Quasi-deterministic and other distributions (Sethuraman, Wolf, Squillante, Yu)
Our contributions
  • New statistical recency objective for CQs
  • New monitoring framework to fit statistical models of page change behavior
  • Recency optimization problem constrained by network resources
  • Two-phase solution to optimization tailored to CQ search systems
    • Resource allocation (knapsack)
    • Poll scheduling (flow-shop)
Continuous Adaptive Monitoring
  • Planning horizon or “epoch”
  • Time proceeds in discrete steps $\{j\}$ over the epoch
  • At each time step $j$, each page $i$ has probability $\rho_{i,j}$ of an update
    • Can capture predictable bursts, periodicity
    • $\sum_j \rho_{i,j} = \lambda_i$, the expected number of updates to page $i$ (the “change rate”)
  • Decision variables $y_{i,j}$
    • Is page i polled at time step j?
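
The model above can be illustrated with a small sketch. It assumes the update probabilities and poll decisions are stored as nested lists indexed by page and time step; the names `rho`, `y`, `change_rate`, and `polls_used` are illustrative and do not come from the slides.

```python
from typing import List

def change_rate(rho_i: List[float]) -> float:
    """Expected number of updates to one page over the epoch (sum over time steps)."""
    return sum(rho_i)

def polls_used(y: List[List[int]]) -> int:
    """Total number of polls planned over the epoch (count of y[i][j] == 1)."""
    return sum(sum(row) for row in y)

# Hypothetical epoch with two pages and four time steps.
rho = [
    [0.9, 0.0, 0.9, 0.0],  # page 0: predictable, periodic bursts
    [0.1, 0.1, 0.1, 0.1],  # page 1: rare updates, no pattern
]
# Decision variables: y[i][j] == 1 means "poll page i at time step j".
y = [
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]

rates = [change_rate(row) for row in rho]
total_polls = polls_used(y)
```

Storing a probability per time step, rather than a single rate per page, is what lets the plan place polls on the steps where a burst is expected.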
Profit, relevance and importance
  • Each registered query $q$ has an associated profit
  • Relevance $r_{i,q}$ of page $i$ w.r.t. query $q$
    • We use cosine similarity in TF-IDF space, as in IR
    • Other measures (e.g. PageRank) may be integrated
  • Page $i$ has “importance” $W_i$, a function of
    • Currently resident queries and their “profits”
    • Relevance of page i to each resident query
  • Importance
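
The slides describe importance only as a function of query profits and page relevance. One natural reading, used here purely as an assumption, is a profit-weighted sum of cosine relevances over the resident queries. The sparse-vector representation and all names below are hypothetical.

```python
import math
from typing import Dict

Vector = Dict[str, float]  # sparse TF-IDF vector: term -> weight

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def importance(page: Vector, queries: Dict[str, Vector], profit: Dict[str, float]) -> float:
    """ASSUMED form: sum over resident queries of profit * relevance(page, query)."""
    return sum(profit[q] * cosine(page, vec) for q, vec in queries.items())

# Hypothetical page and two resident queries.
page = {"cricket": 1.0, "score": 1.0}
queries = {"q1": {"cricket": 1.0}, "q2": {"weather": 1.0}}
profit = {"q1": 2.0, "q2": 5.0}

w_page = importance(page, queries, profit)
```

In this example the high-profit query `q2` contributes nothing because the page is irrelevant to it, so the page's importance comes entirely from `q1`.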
Returned Information Ratio
  • Update information reported for page $i$ is denoted $R_i$
  • Goal is to maximize importance-weighted updates reported, $\sum_i W_i R_i$, subject to the polling resource constraint
  • Returned info ratio (RIR) is

$$\mathrm{RIR} = \frac{\text{importance-weighted updates captured by the system}}{\text{total importance-weighted expected updates}}$$
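
A minimal sketch of the ratio follows. It assumes that a poll of page i at step j captures the update expected at that step, so the captured information for a page is the sum of `rho[i][j] * y[i][j]`. That per-page formula is an assumption of this sketch, not something stated in the transcript.

```python
from typing import List

def returned_information_ratio(
    W: List[float], rho: List[List[float]], y: List[List[int]]
) -> float:
    """RIR = importance-weighted captured updates / importance-weighted expected updates.

    ASSUMPTION: information captured for page i is sum_j rho[i][j] * y[i][j].
    """
    captured = sum(
        W[i] * rho[i][j] * y[i][j]
        for i in range(len(W))
        for j in range(len(rho[i]))
    )
    expected = sum(W[i] * sum(rho[i]) for i in range(len(W)))
    return captured / expected if expected > 0 else 0.0

# Hypothetical two-page, two-step epoch.
W = [2.0, 1.0]
rho = [[0.5, 0.5], [0.2, 0.8]]
y = [[1, 0], [0, 1]]

rir = returned_information_ratio(W, rho, y)
```

Here the captured total is 2(0.5) + 1(0.8) = 1.8 and the expected total is 3.0, so the ratio is 0.6.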

CAM system overview
  • Time proceeds in epochs
  • At the end of every epoch we re-evaluate
    • Relevance
    • Update probabilities
  • For the next epoch
    • We select instants at which to poll each page (resource allocation)
    • Schedule these instants subject to resource constraint

[Figure: CAM system overview; surviving label: “Determining relevant pages”]





Resource allocation
  • Existing policies
    • Uniform: Resources (#polls) distributed uniformly among all pages irrespective of their change frequency
    • Proportional: #polls allocated to a page is proportional to the frequency with which it changes
  • For discrete queries, uniform is better than proportional for any inter-update distribution
  • CAM: solve a knapsack problem
    • Better than uniform and proportional
    • In the CQ setting, proportional is better than uniform
    • Evidence that the CQ objective ≠ the discrete objective
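
The three policies can be contrasted in a short sketch. The uniform and proportional rules follow the descriptions above. The knapsack step is simplified under two assumptions: every poll costs one unit, and a poll of page i at step j is worth `W[i] * rho[i][j]`. With unit costs, picking the highest-valued polls is optimal, so a greedy selection stands in for a general knapsack solver. Function names are illustrative.

```python
from typing import List, Tuple

def uniform_allocation(n_pages: int, budget: int) -> List[int]:
    """Spread polls evenly over pages, ignoring change frequency."""
    base, extra = divmod(budget, n_pages)
    return [base + (1 if i < extra else 0) for i in range(n_pages)]

def proportional_allocation(rates: List[float], budget: int) -> List[int]:
    """Polls proportional to change rate, rounded by largest remainder."""
    total = sum(rates)
    raw = [budget * r / total for r in rates]
    alloc = [int(x) for x in raw]
    leftover = budget - sum(alloc)
    by_remainder = sorted(range(len(rates)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc

def greedy_unit_knapsack(
    W: List[float], rho: List[List[float]], budget: int
) -> List[Tuple[int, int]]:
    """Pick the `budget` (page, step) polls with the largest ASSUMED value W[i]*rho[i][j]."""
    candidates = [
        (W[i] * rho[i][j], i, j)
        for i in range(len(W))
        for j in range(len(rho[i]))
    ]
    candidates.sort(key=lambda c: -c[0])  # stable sort: ties keep (page, step) order
    return [(i, j) for value, i, j in candidates[:budget] if value > 0]

# Hypothetical inputs: page 0 is important and bursty, page 1 changes rarely.
W = [2.0, 1.0]
rho = [[0.9, 0.0, 0.9, 0.0], [0.1, 0.1, 0.1, 0.1]]
rates = [sum(row) for row in rho]

uniform = uniform_allocation(len(W), 4)
proportional = proportional_allocation(rates, 4)
chosen = greedy_unit_knapsack(W, rho, 3)
```

Unlike the first two policies, the greedy selection returns specific (page, step) pairs, which is what the scheduling phase needs as tentative poll instants.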

Poll scheduling

  • Suppose our crawler can fetch M pages concurrently, and
  • An epoch is T time steps long
  • Then we can fetch a total of $C = MT$ pages during an epoch
    • Ensured by resource allocation phase
  • But at each instant we cannot schedule more than M fetches
    • Want small planned-to-actual poll delays
    • May fail to schedule all poll jobs in an epoch




[Figure: scheduling diagram; surviving label: “Tentative $y_{i,j}$s”]


A flow-shop problem
  • M “machines” available at any time
  • Each $y_{i,j}$ that equals 1 is a “job”
  • Job $k$ is “released” at time step $r_k$ ($= j$)
  • “Processing time” = crawl time = $t_k$
  • “Completion time” of job $k$ is $C_k$
  • Want to minimize “total flow” $\sum_k (C_k - r_k)$
  • NP-hard problem
    • We use earliest deadline heuristic
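
An earliest-deadline heuristic of this kind might look like the sketch below. The slides do not say how deadlines are assigned or how long a crawl takes, so this sketch assumes each job arrives with a given deadline step and that every poll takes exactly one time step. Jobs whose deadline passes before a machine is free are dropped, which corresponds to failing to schedule some polls in the epoch.

```python
import heapq
from typing import Dict, List, Tuple

Job = Tuple[int, int, int]  # (job_id, release_step, deadline_step), all hypothetical

def schedule_polls(jobs: List[Job], M: int, T: int) -> Dict[int, List[int]]:
    """Earliest-deadline-first over T steps with at most M polls per step.

    ASSUMPTIONS: unit processing time; deadlines are supplied by the caller.
    Returns a mapping from time step to the job ids polled at that step.
    """
    by_release = sorted(jobs, key=lambda job: job[1])
    ready: List[Tuple[int, int]] = []  # min-heap of (deadline, job_id)
    plan: Dict[int, List[int]] = {}
    next_job = 0
    for step in range(T):
        # Admit every job released at or before this step.
        while next_job < len(by_release) and by_release[next_job][1] <= step:
            job_id, _, deadline = by_release[next_job]
            heapq.heappush(ready, (deadline, job_id))
            next_job += 1
        # Discard jobs that can no longer meet their deadline.
        while ready and ready[0][0] < step:
            heapq.heappop(ready)
        # Run up to M of the most urgent remaining jobs.
        chosen: List[int] = []
        while ready and len(chosen) < M:
            _, job_id = heapq.heappop(ready)
            chosen.append(job_id)
        if chosen:
            plan[step] = chosen
    return plan

# Four jobs compete for one machine over three steps.
jobs = [(0, 0, 0), (1, 0, 1), (2, 0, 2), (3, 1, 1)]
plan = schedule_polls(jobs, M=1, T=3)
```

In this run job 3 is never polled: it is released at step 1 with deadline 1, but job 1 has the same deadline and was queued first, so job 3 expires.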



Experiments
  • Synthetic data
    • Change frequency distribution: a few pages change very often (Zipfian)
    • Update probability distribution: a few $\rho_{i,j}$ values are large, most are small (Zipfian again)
    • Page importance distribution: also Zipfian (Wolman, 1999)
  • Real data
    • Eight cricket score sites
    • High update rate
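
A synthetic workload of the kind described could be generated roughly as follows. The exponent, the way the per-page rate is spread over time steps, and the function names are all assumptions of this sketch. The slides state only that the distributions are Zipfian.

```python
import random
from typing import List

def zipf_weights(n: int, s: float = 1.0) -> List[float]:
    """Normalized Zipfian weights: weight of rank k is proportional to 1 / k**s."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def synthetic_update_probs(
    n_pages: int, n_steps: int, updates_per_epoch: float, seed: int = 0
) -> List[List[float]]:
    """Build rho[i][j]: a few pages change often, and a few steps carry most of the mass."""
    rng = random.Random(seed)
    page_share = zipf_weights(n_pages)
    rho = []
    for i in range(n_pages):
        rate = updates_per_epoch * page_share[i]  # expected updates for page i
        step_share = zipf_weights(n_steps)
        rng.shuffle(step_share)  # place each page's bursts at random steps
        # Clip to 1.0 so every entry stays a valid probability.
        rho.append([min(1.0, rate * w) for w in step_share])
    return rho

weights = zipf_weights(5)
rho = synthetic_update_probs(n_pages=4, n_steps=6, updates_per_epoch=3.0)
```

Page importance could be drawn from the same `zipf_weights` helper to mirror the third synthetic distribution listed above.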


CAM > Proportional > Uniform
  • Uniform update and importance distributions
  • Plot RIR against the ratio of resources to expected changes
  • RIR for CAM is more than 3 times better
  • Proportional is better than uniform in the CQ setting
    • Intuition from “minimum total stale duration” does not apply to CQ
Resource allocation
  • Sort pages by increasing change rate
  • Place in ten equally populated bins (10=fastest)
  • Uniform spends same resource for each bin
  • Proportional wastes fewer resources on slow-changing bins, but is not aggressive enough
  • CAM invests more aggressively in fast-changing bins, achieving the greatest RIR
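
The binning analysis above can be reproduced with a few lines. The sketch assumes the per-page change rates and the number of polls each policy assigned are available as parallel lists; the function name is illustrative.

```python
from typing import List

def resources_per_bin(rates: List[float], polls: List[int], n_bins: int = 10) -> List[int]:
    """Sort pages by increasing change rate, split into equally populated bins,
    and total the polls spent on each bin (last bin = fastest-changing pages)."""
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    totals = [0] * n_bins
    for position, page in enumerate(order):
        bin_index = position * n_bins // len(order)
        totals[bin_index] += polls[page]
    return totals

# Hypothetical example with four pages and two bins.
rates = [5.0, 1.0, 3.0, 2.0]
polls = [4, 0, 2, 1]
per_bin = resources_per_bin(rates, polls, n_bins=2)
```

Running this once per policy gives one series per policy, which can then be compared bin by bin.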
Skew-handling and adaptation
  • Fixed monitoring/change ratio
  • Vary skew in update probability distribution
  • CAM’s gains increase with skew
  • CAM improves over initial epochs
  • Change distribution estimates stabilize within a few epochs


Experiments on real pages
  • Eight sites with dynamic cricket match information
    • In fact, Zipfian updates
  • Adversarial setup: monitor/change < 1
    • CAM close to best possible
  • For a monitor/change ratio of 2, CAM reports updates on 80% of the information that changed
Conclusion
  • Continuous queries are inherently different from discrete queries
  • Approach used in CAM
    • Identify relevant pages
    • Track the pages as they change
    • Characterize page change behavior
    • Decide when to monitor the pages in future
  • The CAM approach performs better than naïve approaches (uniform, proportional)
References
  • J. Cho, H. Garcia-Molina. Synchronizing a database to improve freshness. ACM SIGMOD, 2000.
  • J. Cho, H. Garcia-Molina. Estimating frequency of change. Technical Report, 2000.
  • J. Sethuraman, J. L. Wolf, M. S. Squillante, P. S. Yu. Optimal crawling strategies for Web search engines. World Wide Web, 2002.