
Monitoring the dynamic Web to respond to Continuous Queries

Sandeep Pandey, Krithi Ramamritham, Soumen Chakrabarti

IIT Bombay

www.cse.iitb.ac.in/laiir/


Motivation

  • Web pages change rapidly; per day (Sethuraman et al.):

    • 40% of commercial pages change

    • 23% of all pages change

  • Current search engine users

    • Need to repeat queries (how often?) and

    • Diff results with recent versions

    • Or poll frequently updated collections (e.g., Google News)


Continuous Queries (CQ)

  • Users register long-lived queries of interest

  • Pages of interest may be added, modified, and deleted

  • System continually updates responses

  • Example applications

    • Commuter updates: traffic and weather conditions

    • Alerts on cricket scores, stock portfolios


Discrete vs. continuous queries

  • Discrete queries

    • Query lives for an “instant”, one-shot answer

    • Optimize corpus freshness at all times

    • Objective penalizes delay from update to refresh

    • Usually handled by bulk crawls with diverse periods

  • Continuous queries

    • Queries have positive lifetime, many updates over time

    • Updates must track changes closely

    • Objective penalizes number or importance of missed updates

    • Dynamic monitoring with more restrictive network resources


Talk outline

  • Introduction and motivation

  • Previous approaches

  • Our contributions

    • Continuous Adaptive Monitoring (CAM)

    • How to allocate limited polling resources among pages

    • How to schedule poll instants

  • Experiments

  • Conclusion


Related work

  • CONQUER and WebCQ (Liu, Pu and Tang)

    • Query language and architecture for CQ

    • Do not address monitoring for freshness optimization

  • NIAGARA (DeWitt and Naughton)

    • Query evaluation and optimization techniques

    • Database query optimization setting

  • ChangeDetector (Boyapati et al.)

    • Fixed-priority polling for given set of pages

  • Freshness for discrete queries

    • Poisson updates (Cho and Garcia-Molina)

    • Quasi-deterministic and other distributions (Sethuraman, Wolf, Squillante, Yu)


Our contributions

  • New statistical recency objective for CQs

  • New monitoring framework to fit statistical models of page change behavior

  • Recency optimization problem constrained by network resources

  • Two-phase solution to optimization tailored to CQ search systems

    • Resource allocation (knapsack)

    • Poll scheduling (flow-shop)


Continuous Adaptive Monitoring

  • Planning horizon or “epoch”

  • Time proceeds in discrete steps {j } over epoch

  • At each time step j, each page i has probability $\rho_{i,j}$ of an update

    • Can capture predictable bursts, periodicity

    • $\sum_j \rho_{i,j} = \lambda_i$, the expected #updates to page i (“change rate”)

  • Decision variables $y_{i,j}$

    • Is page i polled at time step j?
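The model on this slide can be written down in a few lines. The following is a minimal sketch, not the authors' code: the matrices `rho` and `y`, the helper `change_rates`, and all numbers are hypothetical and chosen only to illustrate the notation.

```python
from typing import List


def change_rates(rho: List[List[float]]) -> List[float]:
    """Expected number of updates per page over the epoch.

    rho[i][j] is the probability that page i is updated at time step j,
    so the change rate of page i is the sum of row i.
    """
    return [sum(row) for row in rho]


# Hypothetical example: 2 pages, an epoch of 4 time steps.
rho = [
    [0.1, 0.1, 0.1, 0.1],  # page 0: steady trickle of updates
    [0.0, 0.9, 0.0, 0.9],  # page 1: predictable periodic bursts
]

# Decision variables: y[i][j] == 1 means page i is polled at time step j.
y = [
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]

rates = change_rates(rho)
polls_used = sum(sum(row) for row in y)
```

A per-step probability matrix is what lets the model express bursts and periodicity; a single change rate per page would hide the fact that page 1 only changes at steps 1 and 3.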


Profit, relevance and importance

  • Each registered query q has a profit $\mu_q$

  • Relevance $r_{i,q}$ of page i w.r.t. query q

    • We use cosine similarity in TF-IDF space, as in IR

    • Other measures (e.g. PageRank) may be integrated

  • Page i has “importance” $W_i$, a function of

    • Currently resident queries and their “profits”

    • Relevance of page i to each resident query

  • Importance
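The exact form of the importance function is not given in the text here. As a minimal sketch, assume importance is the profit-weighted sum of a page's cosine relevance to each resident query. That combination rule is an assumption made for illustration, as are the names `cosine` and `importance` and the toy term weights.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # sparse term -> TF-IDF weight


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def importance(page: Vector, queries: List[Vector], profits: List[float]) -> float:
    """ASSUMPTION: importance = sum over resident queries of profit * relevance."""
    return sum(p * cosine(page, q) for q, p in zip(queries, profits))


# Hypothetical page and two resident queries with profits 2.0 and 5.0.
page = {"cricket": 1.0, "score": 1.0}
queries = [{"cricket": 1.0}, {"stock": 1.0}]
profits = [2.0, 5.0]
w_page = importance(page, queries, profits)
```

Under this assumption a page that is irrelevant to a high-profit query gains nothing from it, so importance only grows with queries the page can actually help answer.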


Returned Information Ratio

  • Update information reported for page i is denoted $R_i$

  • Goal is to maximize importance-weighted updates reported, $\sum_i W_i R_i$, subject to the polling resource constraint

  • Returned info ratio (RIR) is

    $\text{RIR} = \dfrac{\text{importance-weighted updates captured by system}}{\text{total importance-weighted expected updates}}$
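To make the ratio concrete, here is a hedged sketch of an RIR computation. It relies on a simplifying assumption that is not stated in the slides: an update to page i at step j counts as captured only if page i is polled at that same step. The function name `returned_info_ratio` and the inputs are hypothetical.

```python
from typing import List


def returned_info_ratio(
    W: List[float], rho: List[List[float]], y: List[List[int]]
) -> float:
    """Importance-weighted captured updates over importance-weighted expected updates.

    ASSUMPTION: an update at step j is captured only when y[i][j] == 1.
    """
    captured = sum(
        W[i] * rho[i][j] * y[i][j]
        for i in range(len(rho))
        for j in range(len(rho[i]))
    )
    expected = sum(W[i] * sum(rho[i]) for i in range(len(rho)))
    return captured / expected if expected > 0 else 0.0


# Hypothetical inputs: 2 pages, 4 time steps, 3 polls in total.
W = [1.0, 2.0]
rho = [[0.1, 0.1, 0.1, 0.1], [0.0, 0.9, 0.0, 0.9]]
y = [[0, 0, 0, 1], [0, 1, 0, 1]]
ratio = returned_info_ratio(W, rho, y)
```

With these numbers, spending two of the three polls on the bursty, more important page recovers most of the weighted update mass even though half of the time steps are never polled.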


CAM system overview

  • Time proceeds in epochs

  • At the end of every epoch we re-evaluate

    • Relevance

    • Update probabilities

  • For the next epoch

    • We select instants at which to poll each page (resource allocation)

    • Schedule these instants subject to resource constraint

[Figure: CAM components: determining relevant pages, parameter tracking, monitoring, resource allocation, scheduling]


Resource allocation

  • Existing policies

    • Uniform: Resources (#polls) distributed uniformly among all pages irrespective of their change frequency

    • Proportional: #polls allocated to a page is proportional to the frequency with which it changes

  • For discrete queries, uniform better than proportional for any inter-update distribution

  • CAM: solve a knapsack problem

    • Better than uniform and proportional

    • Proportional better than uniform

    • Evidence that CQ objective ≠ discrete objective
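The three policies can be compared side by side in a small sketch. The uniform and proportional functions follow the definitions on this slide. The third function is only a stand-in for the knapsack step: it assumes every poll costs one unit and that the value of polling page i at step j is importance times update probability, in which case picking the highest-value slots is the knapsack optimum. All names and numbers are hypothetical.

```python
from typing import List, Tuple


def uniform_allocation(n_pages: int, budget: int) -> List[int]:
    """Spread polls evenly, ignoring change frequency."""
    base, extra = divmod(budget, n_pages)
    return [base + (1 if i < extra else 0) for i in range(n_pages)]


def proportional_allocation(rates: List[float], budget: int) -> List[int]:
    """Give each page a number of polls proportional to its change rate."""
    total = sum(rates)
    if total == 0:
        return uniform_allocation(len(rates), budget)
    shares = [budget * r / total for r in rates]
    polls = [int(s) for s in shares]
    # Hand out polls lost to rounding by largest fractional remainder.
    by_remainder = sorted(
        range(len(rates)), key=lambda i: shares[i] - polls[i], reverse=True
    )
    for i in by_remainder[: budget - sum(polls)]:
        polls[i] += 1
    return polls


def greedy_slot_allocation(
    W: List[float], rho: List[List[float]], budget: int
) -> List[Tuple[int, int]]:
    """ASSUMPTION: unit-cost polls, slot value W[i] * rho[i][j]; keep the best slots."""
    slots = [
        (W[i] * rho[i][j], i, j)
        for i in range(len(rho))
        for j in range(len(rho[i]))
    ]
    slots.sort(key=lambda s: (-s[0], s[1], s[2]))
    return sorted((i, j) for _, i, j in slots[:budget])


W = [1.0, 2.0]
rho = [[0.1, 0.1, 0.1, 0.1], [0.0, 0.9, 0.0, 0.9]]
rates = [sum(row) for row in rho]
uniform = uniform_allocation(len(rho), 3)
proportional = proportional_allocation(rates, 3)
chosen = greedy_slot_allocation(W, rho, 3)
```

Unlike the first two policies, the slot-based allocation decides when to poll as well as how often, which is what the decision variables in the CAM model call for.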


Scheduling

  • Suppose our crawler can fetch M pages concurrently, and

  • An epoch is T time steps long

  • Then we can fetch a total of C=MT pages during an epoch

    • Ensured by resource allocation phase

  • But at each instant we cannot schedule more than M fetches

    • Want small planned-to-actual poll delays

    • May fail to schedule all poll jobs in an epoch

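[Figure: CAM components (determining relevant pages, parameter tracking, monitoring, resource allocation, scheduling); resource allocation passes tentative $y_{i,j}$ values to scheduling]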


A flow-shop problem

  • M “machines” available at any time

  • Each $y_{i,j}$ that is equal to 1 is a “job”

  • Job k is “released” at time step $r_k$ (= j)

  • “Processing time” = crawl time = $t_k$

  • “Completion time” of job k is $C_k$

  • Want to minimize “total flow”

  • NP-hard problem

    • We use earliest deadline heuristic

[Figure: schedule chart with axes Time and Job]
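The scheduling step can be sketched as greedy list scheduling on M identical machines. Two assumptions are made here because the slides do not spell them out: total flow is taken in its usual scheduling sense, the sum over jobs of completion time minus release time, and each job's release time stands in for its deadline, so jobs are served in release order. The function `schedule_polls` and the job list are hypothetical.

```python
import heapq
from typing import List, Tuple


def schedule_polls(
    jobs: List[Tuple[int, int]], machines: int
) -> Tuple[List[int], int]:
    """Greedy schedule of poll jobs on identical machines.

    jobs[k] = (release time r_k, processing time t_k).
    Returns the completion time of every job and the total flow,
    sum over k of (C_k - r_k).
    ASSUMPTION: the release time is used as the deadline proxy.
    """
    order = sorted(range(len(jobs)), key=lambda k: jobs[k][0])
    free_at = [0] * machines  # min-heap of times at which machines become free
    heapq.heapify(free_at)
    completion = [0] * len(jobs)
    for k in order:
        release, duration = jobs[k]
        start = max(release, heapq.heappop(free_at))
        completion[k] = start + duration
        heapq.heappush(free_at, completion[k])
    total_flow = sum(completion[k] - jobs[k][0] for k in range(len(jobs)))
    return completion, total_flow


# Hypothetical epoch: four poll jobs, a crawler with M = 2 concurrent fetches.
jobs = [(0, 2), (0, 2), (0, 1), (1, 1)]
completion, total_flow = schedule_polls(jobs, machines=2)
```

In this example the third and fourth jobs start later than planned because both machines are busy, which is the planned-to-actual poll delay that the total flow objective penalizes.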


Experiments

  • Synthetic data

    • Change frequency distribution: a few pages change very often (Zipfian)

    • Update probability distribution: a few $\rho_{i,j}$ values are large, most are small (Zipfian again)

    • Page importance distribution: also Zipfian (Wolman, 1999)

  • Real data

    • Eight cricket score sites

    • High update rate
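The synthetic workload described above can be approximated as follows. This is a sketch under stated assumptions, not the authors' generator: the Zipf exponent of 1, the way page-level and step-level weights are multiplied, and the names `zipf_weights` and `synthetic_rho` are all hypothetical.

```python
import random
from typing import List


def zipf_weights(n: int, s: float = 1.0) -> List[float]:
    """Normalized Zipfian weights: weight of rank k is proportional to 1 / k**s."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def synthetic_rho(
    n_pages: int, n_steps: int, updates_per_epoch: float, seed: int = 0
) -> List[List[float]]:
    """Update probabilities with Zipfian skew across pages and across time steps.

    ASSUMPTION: page i receives a Zipfian share of the expected updates,
    and that share is spread over shuffled Zipfian step weights.
    Values are clipped to 1.0 so that they remain probabilities.
    """
    rng = random.Random(seed)
    page_share = zipf_weights(n_pages)
    rho = []
    for i in range(n_pages):
        step_share = zipf_weights(n_steps)
        rng.shuffle(step_share)
        page_rate = updates_per_epoch * page_share[i]
        rho.append([min(1.0, page_rate * share) for share in step_share])
    return rho


rho = synthetic_rho(n_pages=5, n_steps=10, updates_per_epoch=3.0)
```

Shuffling the step weights per page places each page's likely update instants at different points in the epoch, so the skew is not aligned across pages.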


CAM > Proportional > Uniform

  • Uniform update and importance distributions

  • Plot RIR against ratio of resources to expected changes

  • RIR for CAM is >3 times better

  • Proportional is better than uniform in the CQ setting

    • Intuition from “minimum total stale duration” does not apply to CQ


Resource allocation

  • Sort pages by increasing change rate

  • Place in ten equally populated bins (10=fastest)

  • Uniform spends same resource for each bin

  • Proportional wastes fewer resources on slow-changing bins, but is not aggressive enough

  • CAM invests more aggressively in fast-changing bins, achieving the greatest RIR
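The binning analysis above amounts to sorting pages by change rate, cutting the sorted list into equally populated bins, and summing the polls each policy spends per bin. A minimal sketch follows; the function `polls_per_bin` and the toy data are hypothetical.

```python
from typing import List


def polls_per_bin(
    rates: List[float], polls: List[int], n_bins: int = 10
) -> List[int]:
    """Total polls spent in each bin; bin 0 is slowest, the last bin is fastest."""
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    totals = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        end = (b + 1) * len(order) // n_bins
        totals.append(sum(polls[i] for i in order[start:end]))
    return totals


# Hypothetical data: 20 pages, page 0 changes fastest.
rates = [float(20 - i) for i in range(20)]
proportional_polls = [20 - i for i in range(20)]  # polls proportional to rate
uniform_polls = [3] * 20                           # same polls for every page

proportional_bins = polls_per_bin(rates, proportional_polls)
uniform_bins = polls_per_bin(rates, uniform_polls)
```

Comparing the per-bin totals of two policies shows directly where each one spends its budget: the uniform row is flat, while the proportional row rises toward the fast-changing bins.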


Skew-handling and adaptation

  • Fixed monitoring/change ratio

  • Vary skew in update probability distribution

  • CAM’s gains increase with skew

  • CAM improves over initial epochs

  • Change distribution estimates stabilize within a few epochs

[Figure: RIR plots]


Experiments on real pages

  • Eight sites with dynamic cricket match information

    • In fact, Zipfian updates

  • Adversarial setup: monitor/change < 1

    • CAM close to best possible

  • For monitor/change = 2, CAM reports updates on 80% of the changed information


Conclusion

  • Continual queries are inherently different from discrete queries

  • Approach used in CAM

    • Identify relevant pages

    • Track the pages as they change

    • Characterize page change behavior

    • Decide when to monitor the pages in future

  • The CAM approach performs better than naïve approaches


References

  • J. Cho, H. Garcia-Molina. Synchronizing a database to improve freshness. ACM SIGMOD, 2000.

  • J. Cho, H. Garcia-Molina. Estimating frequency of change. Technical Report, 2000.

  • J. Sethuraman, J. L. Wolf, M. S. Squillante, P. S. Yu. Optimal crawling strategies for Web search engines. World Wide Web, 2002.

