
Efficient and Adaptive Replication using Content Clustering





Presentation Transcript


  1. Efficient and Adaptive Replication using Content Clustering Yan Chen EECS Department UC Berkeley

  2. Motivation • Internet has evolved to become a commercial infrastructure for service delivery • Web delivery, VoIP, streaming media … • Challenges for Internet-scale services • Scalability: 600M users, 35M Web sites, 28Tb/s • Efficiency: bandwidth, storage, management • Agility: dynamic clients/network/servers • Security, etc. • Focus on content delivery - Content Distribution Network (CDN) • 4 billion Web pages in total, daily growth of 7M pages • Annual growth of 200% for the next 4 years

  3. CDN and its Challenges

  4. CDN and its Challenges • No coherence for dynamic content • Inefficient replication • Unscalable network monitoring - O(M*N)

  5. SCAN: Scalable Content Access Network • CDN Applications (e.g. streaming media) • Provision: Cooperative Clustering-based Replication • Coherence: Update Multicast Tree Construction • Network Distance/Congestion/Failure Estimation • User Behavior/Workload Monitoring • Network Performance Monitoring (red: my work, black: out of scope)

  6. SCAN • Coherence for dynamic content • Cooperative clustering-based replication

  7. SCAN • Coherence for dynamic content • Cooperative clustering-based replication • Scalable network monitoring - O(M+N)

  8. Internet-scale Simulation • Network Topology • Pure-random, Waxman & transit-stub synthetic topology • An AS-level topology from 7 widely-dispersed BGP peers • Web Workload • Aggregate MSNBC Web clients with BGP prefix • BGP tables from a BBNPlanet router • Aggregate NASA Web clients with domain names • Map the client groups onto the topology

  9. Internet-scale Simulation – E2E Measurement • NLANR Active Measurement Project data set • 111 sites in America, Asia, Australia and Europe • Round-trip time (RTT) between every pair of hosts every minute • 17M daily measurements • Raw data: Jun. – Dec. 2001, Nov. 2002 • Keynote measurement data • Measure TCP performance from about 100 worldwide agents • Heterogeneous core network: various ISPs • Heterogeneous access network: • Dial-up 56K, DSL and high-bandwidth business connections • Targets • 40 most popular Web servers + 27 Internet Data Centers • Raw data: Nov. – Dec. 2001, Mar. – May 2002

  10. Clustering Web Content for Efficient Replication

  11. Overview • CDN uses non-cooperative replication - inefficient • Paradigm shift: cooperative push • Where to push – greedy algorithms can achieve close to optimal performance [JJKRS01, QPV01] • But which content should be pushed? • At what granularity? • Clustering of objects for replication • Close-to-optimal performance with small overhead • Incremental clustering • Push before accessed: improve availability during flash crowds

  12. Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research

  13. Conventional CDN: Non-cooperative Pull • 1. GET request • 2. Request for hostname resolution • 3. Reply: local CDN server IP address • 4. Local CDN server IP address • 5. GET request • 6. GET request if cache miss • 7. Response • 8. Response (diagram: Client 1 and Client 2 with local DNS servers, local CDN servers, a CDN name server and the Web content server across ISP 1 and ISP 2) • Inefficient replication

  14. SCAN: Cooperative Push • 0. Push replicas • 1. GET request • 2. Request for hostname resolution • 3. Reply: nearby replica server or Web server IP address • 4. Redirected server IP address • 5. GET request if no replica yet • 6. Response (diagram: Client 1 and Client 2 with local DNS servers, local CDN servers, a CDN name server and the Web content server across ISP 1 and ISP 2) • Significantly reduces the # of replicas and update cost

  15. Comparison between Conventional CDNs and SCAN

  16. Problem Formulation • Find a scalable, adaptive replication strategy to reduce • Clients’ average retrieval cost • Replica location computation cost • Amount of replica directory state to maintain • Subject to certain total replication cost (e.g., # of URL replicas)

  17. Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research

  18. (Diagram: Per URL vs. Per Web site replica placement, shown over the same four client groups)

  19. Replica Placement: Per Website vs. Per URL • 60 – 70% average retrieval cost reduction for Per URL scheme • Per URL is too expensive for management!

  20. Overhead Comparison • Per-URL replication overhead grows with both R and M, where R: # of replicas per URL, M: # of URLs • To compute on average 10 replicas/URL for the top 1000 URLs takes several days on a normal server!

  21. Overhead Comparison • Cluster-based replication overhead grows with R and K instead of M, where R: # of replicas per URL, K: # of clusters, M: # of URLs (M >> K)
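The gap between the two schemes can be illustrated with the slides' numbers. A back-of-the-envelope sketch (the linear cost model and the value K = 20 are illustrative assumptions, not from the slides):

```python
# Rough comparison of replica-placement work, assuming the greedy
# placement algorithm runs once per replicated unit. R and M come from
# the slide; K = 20 and the linear cost model are illustrative.

R = 10    # replicas per URL
M = 1000  # URLs placed individually (per-URL scheme)
K = 20    # clusters placed as units (per-cluster scheme, K << M)

per_url_runs = R * M      # placement decisions, per-URL scheme
per_cluster_runs = R * K  # placement decisions, per-cluster scheme

print(per_url_runs, per_cluster_runs,
      f"{per_url_runs // per_cluster_runs}x fewer placements")
```

With K at 1-2% of M (the ratio the later slides report works well), the placement work drops by roughly two orders of magnitude.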

  22. Clustering Web Content • General clustering framework • Define the correlation distance between URLs • Cluster diameter: the max distance between any two members • Worst correlation in a cluster • Generic clustering: minimize the max diameter of all clusters • Correlation distance definition based on • Spatial locality • Temporal locality • Popularity
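The generic framework above can be sketched as a greedy pass that caps each cluster's diameter (a minimal illustration; the function name, the first-fit assignment order, and the toy 1-D items are assumptions, not the authors' algorithm):

```python
def cluster_by_diameter(items, dist, max_diameter):
    """Greedy sketch of the generic framework: a cluster's diameter is
    the max correlation distance between any two members (the worst
    correlation); each item joins the first cluster that stays within
    the bound, else starts a new cluster."""
    clusters = []
    for x in items:
        for c in clusters:
            if all(dist(x, y) <= max_diameter for y in c):
                c.append(x)
                break
        else:
            clusters.append([x])
    return clusters

# Toy 1-D example: correlation distance is absolute difference.
print(cluster_by_diameter([0, 1, 10, 11], lambda a, b: abs(a - b), 2))
# → [[0, 1], [10, 11]]
```

Any of the correlation distances listed on the slide (spatial, temporal, popularity) can be plugged in as `dist`.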

  23. Spatial Clustering • URL spatial access vector: # of accesses to the URL from each client group (diagram: the blue URL is accessed from client groups 1-4) • Correlation distance between two URLs defined as • Euclidean distance • Vector similarity
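With each URL represented by its spatial access vector (accesses per client group), the two distance choices on the slide can be written directly (a sketch; the vector values are made up):

```python
import math

def euclidean(u, v):
    # Euclidean distance between two URL spatial access vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    # 1 - cosine similarity: 0 means identical access patterns,
    # regardless of absolute request volume.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Two URLs hit by four client groups (counts are made up):
u1 = [120, 0, 30, 50]
u2 = [240, 0, 60, 100]   # same pattern, twice the volume
print(euclidean(u1, u2))        # large: absolute volumes differ
print(cosine_distance(u1, u2))  # ~0: identical spatial pattern
```

The example shows why the two metrics behave differently: Euclidean distance is sensitive to popularity, while vector similarity isolates where the requests come from.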

  24. Clustering Web Content (cont’d) • Temporal clustering • Divide traces into multiple individuals’ access sessions [ABQ01] • In each session, compute the correlation between URLs accessed together • Average over multiple sessions in one day • Popularity-based clustering • Or even simpler, sort them and put the first N/K elements into the first cluster, etc. - binary correlation
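The "even simpler" popularity scheme at the end of the slide amounts to sorting URLs by access count and slicing the ranking into K equal parts (a sketch; the dict-based interface and the sample counts are assumptions):

```python
import math

def popularity_clusters(freqs, k):
    """Sort URLs by access frequency and cut the ranking into k
    roughly equal-size clusters (the slide's binary correlation).
    `freqs` maps URL -> access count."""
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    size = math.ceil(len(ranked) / k)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

hits = {"/news": 900, "/sports": 700, "/weather": 300,
        "/tech": 250, "/local": 40, "/archive": 10}
print(popularity_clusters(hits, 3))
# → [['/news', '/sports'], ['/weather', '/tech'], ['/local', '/archive']]
```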

  25. Performance of Cluster-based Replication • Spatial clustering with Euclidean distance and popularity-based clustering perform the best • Small # of clusters (with only 1-2% of # of URLs) can achieve close to per-URL performance, with much less overhead MSNBC, 8/2/1999, 5 replicas/URL

  26. Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research

  27. Static Clustering and Replication • Two daily traces: training trace and new trace • Static clustering performs poorly beyond a week • Performance of static clustering almost doubles the optimal!

  28. Incremental Clustering • Generic framework • If a new URL u matches an existing cluster c, add u to c and replicate u to the existing replicas of c • Else create new clusters and replicate them • Two types of incremental clustering • Online: without any access logs • High availability • Offline: with access logs • Close-to-optimal performance
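The generic framework on this slide is essentially an if/else over the existing clusters. A minimal sketch (the cluster dict layout and the `matches`/`replicate` callbacks are placeholders for whatever matching rule and push mechanism a deployment uses):

```python
def incremental_add(url, clusters, matches, replicate):
    """If the new URL matches an existing cluster, join it and push the
    URL to that cluster's current replicas; otherwise start a new
    cluster (to be replicated on its own)."""
    for c in clusters:
        if matches(url, c):
            c["urls"].append(url)
            replicate(url, c["replicas"])
            return c
    new_cluster = {"urls": [url], "replicas": []}
    clusters.append(new_cluster)
    return new_cluster
```

Online clustering plugs a hyperlink-based `matches` into this loop (no access logs needed); offline clustering would derive `matches` from the logs instead.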

  29. Online Incremental Clustering • Predict access patterns based on semantics • Simplify to popularity prediction • Groups of URLs with similar popularity? Use hyperlink structures! • Groups of siblings • Groups of the same hyperlink depth (smallest # of links from root) (diagram: URL1 links to URL2, URL3 and URL4; URL2 links to URL5 and URL6; URL4 links to URL3 and URL7)

  30. Online Popularity Prediction • Measure the divergence of URL popularity within a group: access frequency span • Experiments • Crawl http://www.msnbc.com on 5/3/2002 with hyperlink depth 4, then group the URLs • Use corresponding access logs to analyze the correlation • Groups of siblings have the best correlation
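A sketch of the span metric; the exact normalization is an assumption here (max minus min over the group mean), not necessarily the authors' precise definition:

```python
def access_freq_span(freqs):
    """Divergence of URL popularity within a group; 0 means every URL
    in the group is equally popular. Normalizing the spread by the
    group mean is an assumption made for this sketch."""
    mean = sum(freqs) / len(freqs)
    return (max(freqs) - min(freqs)) / mean

print(access_freq_span([10, 10, 10]))  # → 0.0
print(access_freq_span([5, 10, 30]))   # (30 - 5) / 15 ≈ 1.67
```

A low span within sibling groups is what makes siblings a good proxy for similar popularity.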

  31. Semantics-based Incremental Clustering • Put the new URL into the existing cluster with the largest # of siblings • In case of a tie, choose the cluster w/ more replicas • Simulation on 5/3/2002 MSNBC • 8-10am trace: static popularity clustering + replication • At 10am: 16 new URLs - online inc. clustering + replication • Evaluation with 10-12am trace: the 16 URLs have 33K requests
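The placement rule on this slide (most siblings, ties broken by replica count) can be sketched as a single `max` with a two-part key (the cluster layout and the `replica_count` map are assumptions):

```python
def assign_new_url(siblings, clusters, replica_count):
    """Pick the existing cluster containing the most hyperlink siblings
    of the new URL; on a tie, prefer the cluster with more replicas."""
    return max(
        clusters,
        key=lambda c: (len(siblings & set(c["urls"])),
                       replica_count[c["id"]]),
    )

clusters = [{"id": 0, "urls": ["a", "b"]}, {"id": 1, "urls": ["c", "d"]}]
replica_count = {0: 3, 1: 7}
print(assign_new_url({"c", "x"}, clusters, replica_count)["id"])  # → 1
print(assign_new_url({"x"}, clusters, replica_count)["id"])       # → 1 (tie: more replicas)
```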

  32. Online Incremental Clustering and Replication Results • Retrieval cost is 1/8 of that with no replication, and 1/5 of that with random replication

  33. Online Incremental Clustering and Replication Results Double the optimal retrieval cost, but only 4% of its replication cost

  34. Conclusions • Cooperative, clustering-based replication • Cooperative push: only 4 - 5% replication/update cost compared with existing CDNs • URL clustering reduces the management/computational overhead by two orders of magnitude • Spatial clustering and popularity-based clustering recommended • Incremental clustering to adapt to emerging URLs • Hyperlink-based online incremental clustering for high availability and performance improvement • Self-organize replicas into app-level multicast tree for update dissemination • Scalable overlay network monitoring • O(M+N) instead of O(M*N), given M client groups and N servers

  35. Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research

  36. Future Research (I) • Measurement-based Internet study and protocol/architecture design • Use inference techniques to develop Internet behavior models • Network operators reluctant to reveal internal network configurations • Root cause analysis: large, heterogeneous data mining • Leverage graphics/visualization for interactive mining • Apply deeper understanding of Internet behaviors for reassessment/design of protocol/architecture • E.g., Internet bottleneck – peering links? How and Why? Implications?

  37. Future Research (II) • Network traffic anomaly characterization, identification and detection • Many unknown flow-level anomalies revealed from real router traffic analysis (AT&T) • Profile traffic patterns of new applications (e.g. P2P) –> benign anomalies • Understand the cause, pattern and prevalence of other unknown anomalies • Identify malicious patterns for intrusion detection • E.g., fight against Sapphire/Slammer Worm

  38. Backup Materials

  39. SCAN • Coherence for dynamic content • Cooperative clustering-based replication • Scalable network monitoring - O(M+N)

  40. Problem Formulation • Subject to certain total replication cost (e.g., # of URL replicas) • Find a scalable, adaptive replication strategy to reduce avg access cost

  41. Simulation Methodology • Network Topology • Pure-random, Waxman & transit-stub synthetic topology • An AS-level topology from 7 widely-dispersed BGP peers • Web Workload • Aggregate MSNBC Web clients with BGP prefix • BGP tables from a BBNPlanet router • Aggregate NASA Web clients with domain names • Map the client groups onto the topology

  42. Online Incremental Clustering • Predict access patterns based on semantics • Simplify to popularity prediction • Groups of URLs with similar popularity? Use hyperlink structures! • Groups of siblings • Groups of the same hyperlink depth: smallest # of links from root

  43. Challenges for CDN • Over-provisioning for replication • Provide good QoS to clients (e.g., latency bound, coherence) • Small # of replicas with small delay and bandwidth consumption for update • Replica Management • Scalability: billions of replicas if replicating at URL granularity • O(10^4) URLs/server, O(10^5) CDN edge servers in O(10^3) networks • Adaptation to dynamics of content providers and customers • Monitoring • User workload monitoring • End-to-end network distance/congestion/failure monitoring • Measurement scalability • Inference accuracy and stability

  44. SCAN Architecture • Leverage Decentralized Object Location and Routing (DOLR) - Tapestry for • Distributed, scalable location with guaranteed success • Search with locality • Soft-state maintenance of dissemination tree (for each object) (diagram: data plane with the Web server as data source, SCAN servers, replicas, caches and clients over a Tapestry mesh; network plane provides dynamic replication/update, content management and request location; coherence: always update vs. adaptive)

  45. Wide-area Network Measurement and Monitoring System (WNMMS) • Select a subset of SCAN servers to be monitors • E2E estimation for • Distance • Congestion • Failures (diagram: clusters A, B and C of monitors, SCAN edge servers and clients on the network plane; distance measured from a host to its monitor and among monitors)

  46. Dynamic Provisioning • Dynamic replica placement • Meeting clients’ latency and servers’ capacity constraints • Close-to-minimal # of replicas • Self-organized replicas into app-level multicast tree • Small delay and bandwidth consumption for update multicast • Each node only maintains states for its parent & direct children • Evaluated based on simulation of • Synthetic traces with various sensitivity analysis • Real traces from NASA and MSNBC • Publication • IPTPS 2002 • Pervasive Computing 2002

  47. Effects of the Non-Uniform Size of URLs • Replication cost constraint: bytes • Similar trends exist • Per URL replication outperforms per Website dramatically • Spatial clustering with Euclidean distance and popularity-based clustering are very cost-effective

  48. Diagram of Internet Iso-bar (diagram: clusters A, B and C of end hosts, each with a landmark)

  49. Diagram of Internet Iso-bar (diagram: clusters A, B and C of end hosts with landmarks and monitors; distance probes from a monitor to its hosts and among monitors)

  50. Real Internet Measurement Data • NLANR Active Measurement Project data set • 119 sites in the US (106 after filtering out most offline sites) • Round-trip time (RTT) between every pair of hosts every minute • Raw data: 6/24/00 – 12/3/01 • Keynote measurement data • Measure TCP performance from about 100 agents • Heterogeneous core network: various ISPs • Heterogeneous access network: • Dial-up 56K, DSL and high-bandwidth business connections • Targets • Web site perspective: 40 most popular Web servers • 27 Internet Data Centers (IDCs)
