Clustering Web Content for Efficient Replication

Yan Chen, Lili Qiu*, Weiyu Chen, Luan Nguyen, Randy H. Katz

EECS Department

UC Berkeley

*Microsoft Research


Motivation

  • Amazing growth in WWW traffic

    • Daily growth of roughly 7M Web pages

    • Annual growth of 200% predicted for next 4 years

  • Content Distribution Network (CDN) commercialized to improve Web performance

    • Un-cooperative pull-based replication

  • Paradigm shift: cooperative push more cost-effective

    • Strategically pushed replicas can achieve close-to-optimal performance [JJKRS01, QPV01]

    • Improving availability during flash crowds and disasters

  • Orthogonal issue: granularity of replication

    • Per Website? Per URL? -> Clustering!

    • Clustering based on aggregated clients’ access patterns

  • Adapt to users’ dynamic access patterns

    • Incremental clustering (online and offline)


Outline

  • Motivation

  • Simulation methodology

  • Architecture

  • Problem Formulation

  • Granularity of replication

  • Dynamic replication

    • Static clustering

    • Incremental clustering

  • Conclusions


Simulation Methodology

  • Network Topology

    • Pure-random, Waxman & transit-stub models from GT-ITM

    • A real AS-level topology from 7 widely-dispersed BGP peers

  • Web Workload

    • Aggregate MSNBC Web clients by BGP prefix

      • BGP tables from a BBNPlanet router

      • 10K groups remain; choose the top 10%, covering >70% of requests

    • Aggregate NASA Web clients by domain name

    • Map the client groups onto the topology

  • Performance Metric: average retrieval cost

    • Sum of edge costs from client to its closest replica
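The metric above can be sketched in a few lines. This is a minimal illustration, not the authors' simulator: the toy topology, the node names, and the choice to weight each client group by its request count are all assumptions made here.

```python
import heapq

def shortest_costs(graph, source):
    """Dijkstra: cheapest path cost from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, edge_cost in graph[node].items():
            nd = d + edge_cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

def average_retrieval_cost(graph, requests, replicas):
    """requests: {client_node: request_count}; replicas: nodes holding the content.

    Assumption: the average is weighted by request count.
    """
    total_cost = 0
    total_requests = 0
    for client, count in requests.items():
        dist = shortest_costs(graph, client)
        nearest = min(dist[r] for r in replicas if r in dist)
        total_cost += count * nearest
        total_requests += count
    return total_cost / total_requests

# Hypothetical 4-node topology with symmetric edge costs.
graph = {
    "a": {"b": 1, "c": 4},
    "b": {"a": 1, "c": 2, "d": 5},
    "c": {"a": 4, "b": 2, "d": 1},
    "d": {"b": 5, "c": 1},
}
requests = {"a": 10, "d": 30}
cost = average_retrieval_cost(graph, requests, replicas={"c"})
```

Here client group `a` reaches the replica at `c` through `b` at cost 3, and `d` reaches it at cost 1, so the weighted average is (10·3 + 30·1) / 40.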


    Conventional CDN: Un-cooperative Pull

    [Figure: request flow among Client 1 (ISP 1), Client 2 (ISP 2), the local DNS servers, the CDN name server, the local CDN servers, and the Web content server]

    1. GET request

    2. Request for hostname resolution

    3. Reply: local CDN server IP address

    4. Local CDN server IP address

    5. GET request

    6. GET request if cache miss

    7. Response

    8. Response

    • Big waste of replication!

    Cooperative Push-based CDN

    [Figure: request flow among Client 1 (ISP 1), Client 2 (ISP 2), the local DNS servers, the CDN name server, the local CDN servers, and the Web content server]

    0. Push replicas

    1. GET request

    2. Request for hostname resolution

    3. Reply: nearby replica server or Web server IP address

    4. Redirected server IP address

    5. GET request / GET request if no replica yet

    6. Response

    • Significantly reduces the # of replicas and, consequently, the update cost (only 4% of un-coop pull)
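The redirection decision in step 3 can be sketched as follows. This is only an illustration under stated assumptions: the function `resolve`, the distance table, and the server names are hypothetical, and a real CDN name server would return IP addresses rather than labels.

```python
def resolve(url, client_group, replica_sites, distance, origin="origin"):
    """Pick the server returned to the client's local DNS server.

    replica_sites: {url: set of CDN servers that already hold a pushed replica}
    distance: {client_group: {server: cost}}  (hypothetical precomputed costs)
    """
    sites = replica_sites.get(url, set())
    if not sites:
        return origin  # no replica pushed yet: fall back to the Web server
    candidates = sorted(sites | {origin})  # sorted for deterministic tie-breaking
    return min(candidates, key=lambda server: distance[client_group][server])

# Hypothetical setup: one replica of /news.html pushed to cdn1 only.
distance = {
    "isp1": {"origin": 10, "cdn1": 1, "cdn2": 6},
    "isp2": {"origin": 10, "cdn1": 6, "cdn2": 1},
}
replica_sites = {"/news.html": {"cdn1"}}

target_isp1 = resolve("/news.html", "isp1", replica_sites, distance)
target_isp2 = resolve("/news.html", "isp2", replica_sites, distance)
target_new = resolve("/breaking.html", "isp2", replica_sites, distance)
```

In this toy setup clients in both ISPs share the single replica at `cdn1`, whereas a pull-based CDN would end up caching a copy at each local CDN server that sees a request.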


    Problem Formulation

    • Find a replication strategy that minimizes the total access cost

    • Subject to a bound on the total replication cost
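One way to write this down, offered as a sketch rather than the authors' exact formulation. All symbols are assumptions introduced here: $r_{c,u}$ is the number of requests from client group $c$ for URL $u$, $d(c,s)$ is the retrieval cost from $c$ to server $s$, $x_{u,s} \in \{0,1\}$ indicates whether $u$ is replicated at $s$, $o$ is the origin Web server (which always holds every URL), and $B$ is the replication budget, measured here simply as a number of URL replicas.

```latex
\min_{x} \; \sum_{c \in \mathcal{C}} \sum_{u \in \mathcal{U}}
    r_{c,u} \, \min_{s \in \mathcal{S} \cup \{o\} \,:\, x_{u,s} = 1} d(c, s)
\qquad \text{subject to} \qquad
\sum_{u \in \mathcal{U}} \sum_{s \in \mathcal{S}} x_{u,s} \le B,
\quad x_{u,o} = 1 \;\; \forall u,
\quad x_{u,s} \in \{0, 1\}
```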


    Replica Placement: Per Website vs. Per URL

    • Use greedy placement

    • 30 – 70% average retrieval cost reduction for Per URL

    • Per URL is too expensive for management!

    Where R: # of replicas/URL; K: # of clusters; M: # of URLs (M >> K);

    C: # of clients; S: # of CDN servers;

    f: placement adaptation frequency
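A greedy placement loop of the kind mentioned above can be sketched as below. The slides do not give the exact procedure, so this is an assumed variant: each step adds the single (unit, server) replica that lowers total access cost the most, where a "unit" can be a whole Website, a cluster, or a single URL. The names and the toy data are hypothetical.

```python
def total_cost(demand, dist, placement, origin):
    """Sum over (client, unit) pairs of requests x distance to the closest copy."""
    cost = 0
    for (client, unit), count in demand.items():
        holders = placement.get(unit, set()) | {origin}
        cost += count * min(dist[client][s] for s in holders)
    return cost

def greedy_placement(demand, dist, servers, origin, budget):
    """Add one replica at a time, always the one that lowers total cost most."""
    placement = {}
    units = sorted({unit for _, unit in demand})
    for _ in range(budget):
        base = total_cost(demand, dist, placement, origin)
        best, best_gain = None, 0
        for unit in units:
            for server in sorted(servers):
                if server in placement.get(unit, set()):
                    continue
                trial = {u: set(s) for u, s in placement.items()}
                trial.setdefault(unit, set()).add(server)
                gain = base - total_cost(demand, dist, trial, origin)
                if gain > best_gain:
                    best, best_gain = (unit, server), gain
        if best is None:
            break  # no remaining replica reduces the cost
        placement.setdefault(best[0], set()).add(best[1])
    return placement

# Hypothetical demand: per-URL request counts from two client groups.
dist = {
    "c1": {"origin": 10, "s1": 1, "s2": 8},
    "c2": {"origin": 10, "s1": 8, "s2": 1},
}
demand = {("c1", "/a"): 100, ("c2", "/b"): 50, ("c1", "/b"): 1}
placement = greedy_placement(demand, dist, {"s1", "s2"}, "origin", budget=2)
final_cost = total_cost(demand, dist, placement, "origin")
```

Because every candidate replica is re-evaluated at every step, the work grows with the number of units, which is one way to see why per-URL placement is costly to manage compared with per-cluster placement.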


    Clustering Web Content

    • General clustering framework

      • Define the correlation distance between URLs

      • Cluster diameter: the max distance between any two members

        • Worst correlation in a cluster

      • Generic clustering: minimize the max diameter of all clusters

    • Correlation distance definition based on

      • Spatial locality

      • Temporal locality

      • Popularity

      • Semantics (e.g., directory)
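The slides do not say which algorithm is used to minimize the maximum diameter. As an illustration only, the sketch below uses farthest-first traversal, a standard heuristic for this kind of objective; it is a swapped-in technique, and the 1-D toy items and the `distance` function are assumptions.

```python
def farthest_first_clusters(items, distance, k):
    """Pick k spread-out centers, then assign every item to its nearest center."""
    centers = [items[0]]
    while len(centers) < min(k, len(items)):
        # Next center: the item farthest from its nearest existing center.
        nxt = max(items, key=lambda x: min(distance(x, c) for c in centers))
        centers.append(nxt)
    clusters = {c: [] for c in centers}
    for item in items:
        nearest = min(centers, key=lambda c: distance(item, c))
        clusters[nearest].append(item)
    return list(clusters.values())

def max_diameter(clusters, distance):
    """Largest pairwise distance inside any single cluster."""
    return max(
        (distance(a, b) for cl in clusters for a in cl for b in cl),
        default=0,
    )

# Toy example: items are points on a line, distance is absolute difference.
points = [0, 1, 10, 11, 20]
abs_diff = lambda a, b: abs(a - b)
clusters = farthest_first_clusters(points, abs_diff, k=3)
worst = max_diameter(clusters, abs_diff)
```

Any of the correlation distances listed above (spatial, temporal, popularity, semantic) could be passed in as `distance`, as long as it accepts two URLs and returns a non-negative number.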


    Spatial Clustering

    • URL spatial access vector

      [Figure: spatial access vector of an example (blue) URL, with labels 1 to 4]

    • Correlation distance between two URLs defined as

      • Euclidean distance

      • Vector similarity
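Both distances can be computed directly from two spatial access vectors. In this sketch, "vector similarity" is interpreted as cosine similarity and turned into a distance as one minus the similarity; that interpretation, and the example vectors (one entry per client group), are assumptions.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two access vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 when the access patterns are proportional."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # a URL with no accesses is treated as uncorrelated
    return 1.0 - dot / norm

# Hypothetical access counts from four client groups.
url_a = [3, 0, 4, 0]
url_b = [6, 0, 8, 0]   # same client groups as url_a, twice the volume
url_c = [0, 5, 0, 0]   # a disjoint client group

d_euclid_ab = euclidean_distance(url_a, url_b)
d_cos_ab = cosine_distance(url_a, url_b)
d_cos_ac = cosine_distance(url_a, url_c)
```

The two measures disagree on `url_a` and `url_b`: the Euclidean distance is non-zero because the request volumes differ, while the cosine distance is zero because the requests come from the same client groups in the same proportions.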


    Clustering Web Content (cont’d)

    • Temporal clustering

      • Divide traces into multiple individuals’ access sessions [ABQ01]

      • In each session,

      • Average over multiple sessions in one day

    • Popularity-based clustering

      • Or, even simpler: sort the URLs by popularity and put the first N/K into the first cluster, the next N/K into the second, and so on (binary correlation)
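The sort-and-split variant of popularity-based clustering is short enough to sketch in full. The request counts below are made up, and splitting at positions i·N/K (so that cluster sizes differ by at most one) is an assumption about how uneven remainders are handled.

```python
def popularity_clusters(request_counts, k):
    """Sort URLs by descending request count and cut the ranking into k runs."""
    ranked = sorted(request_counts, key=lambda u: (-request_counts[u], u))
    n = len(ranked)
    clusters = [ranked[i * n // k:(i + 1) * n // k] for i in range(k)]
    return [c for c in clusters if c]  # drop empty runs when n < k

# Hypothetical per-URL request counts.
counts = {
    "/": 900, "/news": 400, "/sports": 350,
    "/weather": 40, "/about": 5, "/jobs": 3,
}
clusters = popularity_clusters(counts, k=3)
```

URLs in the same run are treated as fully correlated and URLs in different runs as uncorrelated, which is what makes the correlation binary.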


    Performance of Cluster-based Replication

    • Tested over various topologies and traces

    • Spatial clustering with Euclidean distance and popularity-based clustering perform the best

      • Even a small # of clusters (only 1-2% of the # of URLs) can achieve close to per-URL performance

    [Figures: MSNBC trace, 8/2/1999, 5 replicas/URL; NASA trace, 7/1/1995, 3 replicas/URL]


    Static Clustering and Replication

    • Two daily traces: old trace and new trace

    • Static clustering performs poorly beyond a week

      • Average retrieval cost almost doubles


    Incremental Clustering

    • Generic framework

      • If a new URL u matches an existing cluster c, add u to c and replicate u to the existing replicas of c

      • Else create new clusters and replicate them

    • Online incremental clustering

      • Push before accessed -> high availability

      • Predict access patterns based on semantics

      • Simplify to popularity prediction

      • Groups of URLs with similar popularity? Use hyperlink structures!

        • Groups of siblings

        • Groups of the same hyperlink depth: smallest # of links from root
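The generic framework can be sketched as a small driver with pluggable pieces. Everything named here is hypothetical: `match` decides whether a new URL fits an existing cluster, `recluster` groups the leftovers, and `place` picks servers for a new cluster. The directory-based `same_dir` matcher is just one example of a semantic rule.

```python
def incremental_update(clusters, replicas, new_urls, match, recluster, place):
    """Return the (url, server) pushes needed to absorb new_urls.

    clusters: {cluster_id: set of URLs}; replicas: {cluster_id: set of servers}.
    Both dictionaries are updated in place.
    """
    pushes = []
    unmatched = []
    for url in new_urls:
        cid = match(url, clusters)
        if cid is not None:
            clusters[cid].add(url)  # join the cluster and inherit its replicas
            pushes += [(url, s) for s in sorted(replicas[cid])]
        else:
            unmatched.append(url)
    for group in recluster(unmatched):
        cid = max(clusters, default=-1) + 1  # next free integer id
        clusters[cid] = set(group)
        replicas[cid] = set(place(group))
        pushes += [(u, s) for u in sorted(group) for s in sorted(replicas[cid])]
    return pushes

def same_dir(url, clusters):
    """Hypothetical matcher: first cluster containing a URL in the same directory."""
    prefix = url.rsplit("/", 1)[0]
    for cid in sorted(clusters):
        if any(member.rsplit("/", 1)[0] == prefix for member in clusters[cid]):
            return cid
    return None

clusters = {0: {"/news/a", "/news/b"}, 1: {"/sports/x"}}
replicas = {0: {"s1", "s2"}, 1: {"s3"}}
pushes = incremental_update(
    clusters, replicas, ["/news/c", "/video/v1"],
    match=same_dir,
    recluster=lambda urls: [urls] if urls else [],  # one new cluster for leftovers
    place=lambda group: {"s1"},                      # placeholder placement rule
)
```

The online and offline variants differ mainly in what `match` can look at: the online case has no access history for the new URL yet, so it has to rely on semantics such as hyperlink structure.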


    Online Popularity Prediction

    • Measure the divergence of URL popularity within a group:

      • access freq span = (definition not preserved in this transcript)

    • Experiments

      • Use WebReaper to crawl http://www.msnbc.com on 5/3/2002 with hyperlink depth 4, then group the URLs

      • Use corresponding access logs to analyze the correlation

      • Groups of siblings have the best correlation


    Online Incremental Clustering

    • Semantics-based incremental clustering

      • Put new URL into existing clusters with largest # of siblings

      • When there is a tie, choose the cluster with more replicas

    • Simulation on 5/3/2002 MSNBC

      • 8-10am trace: static popularity clustering + replication

      • At 10am: 16 new URLs emerged - online incremental clustering + replication

      • Evaluation with the 10am-12pm trace: the 16 URLs have 33,262 requests
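The sibling rule and its tie-break can be sketched as follows. This assumes each URL has a single parent page taken from a crawl, which simplifies a real hyperlink graph; the function name and the example data are hypothetical.

```python
def assign_by_siblings(new_url, parent_of, cluster_of, replica_count):
    """Pick the cluster holding the most siblings of new_url.

    parent_of: {url: parent page}   (assumed single parent per URL)
    cluster_of: {url: cluster id} for already-clustered URLs
    replica_count: {cluster id: number of replicas}, used to break ties
    Returns None when no sibling has been clustered yet.
    """
    parent = parent_of[new_url]
    votes = {}
    for url, p in parent_of.items():
        if p == parent and url != new_url and url in cluster_of:
            cid = cluster_of[url]
            votes[cid] = votes.get(cid, 0) + 1
    if not votes:
        return None
    # Most siblings first; on a tie, the cluster with more replicas wins.
    return max(votes, key=lambda cid: (votes[cid], replica_count[cid]))

parent_of = {"/a": "/", "/b": "/", "/c": "/", "/d": "/", "/new": "/", "/x": "/other"}
cluster_of = {"/a": 0, "/b": 0, "/c": 1, "/d": 1, "/x": 2}
replica_count = {0: 3, 1: 5, 2: 9}
chosen = assign_by_siblings("/new", parent_of, cluster_of, replica_count)
```

Clusters 0 and 1 each contain two siblings of `/new`, so the tie goes to cluster 1, which has more replicas. Cluster 2 has the most replicas overall but no siblings, so it is never considered.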


    Online Incremental Clustering & Replication Results

    Average retrieval cost reduction (16 URLs)

    • Compared with no replication of new URLs: - 12.5%

    • Compared with random replication of new URLs: - 21.7%

    • Compared with static clustering + replication (oracle): - 200%


    Conclusions

    • CDN operators: cooperative, clustering-based replication

      • Cooperative: big savings on replica management and update cost

      • Per URL replication outperforms per Website scheme by 60-70%

      • Clustering solves the scalability issues, and gives the full spectrum of flexibility

        • Spatial clustering and popularity-based clustering recommended

    • To adapt to users’ access patterns: incremental clustering

      • Hyperlink-based online incremental clustering for

        • High availability

        • Performance improvement

      • Offline incremental clustering performs close to optimal


    Offline Incremental Clustering

    • Study spatial clustering and popularity-based clustering

    • Step 1: assign new URLs into existing clusters

      • When the correlation within that cluster (diameter) is unchanged

      • Add it to existing replicas

    • Step 2: Un-matched URLs - static clustering and replication

    • Performance close to complete re-clustering + re-replication, with only 30-40% of the replication cost
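Step 1 can be sketched as a diameter check: a new URL joins a cluster only if its distance to every member is no larger than the cluster's current diameter, so the diameter stays unchanged. Taking the first cluster that fits, and the 1-D toy data, are assumptions; the slides do not say how competing clusters are ranked.

```python
def diameter(members, distance):
    """Largest pairwise distance within one cluster."""
    return max((distance(a, b) for a in members for b in members), default=0)

def offline_assign(new_items, clusters, distance):
    """Step 1: absorb items that leave a cluster's diameter unchanged.

    clusters is a list of non-empty lists and is updated in place.
    Returns the unmatched items, which step 2 would cluster statically.
    """
    unmatched = []
    for item in new_items:
        for members in clusters:
            current = diameter(members, distance)
            if max(distance(item, m) for m in members) <= current:
                members.append(item)  # joins; would be pushed to this cluster's replicas
                break
        else:
            unmatched.append(item)
    return unmatched

# Toy example: items are points on a line, distance is absolute difference.
abs_diff = lambda a, b: abs(a - b)
clusters = [[0, 4], [10, 12]]
leftover = offline_assign([2, 11, 30], clusters, abs_diff)
```

Only the leftover items need fresh clustering and placement, which is consistent with the much lower replication cost reported above compared with redoing everything.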

