
Efficient and Adaptive Replication using Content Clustering



Presentation Transcript


  1. Efficient and Adaptive Replication using Content Clustering Yan Chen EECS Department UC Berkeley

  2. Motivation
• The Internet has evolved into a commercial infrastructure for service delivery: Web delivery, VoIP, streaming media, …
• Challenges for Internet-scale services
  • Scalability: 600M users, 35M Web sites, 2.1 Tb/s
  • Efficiency: bandwidth, storage, management
  • Agility: dynamic clients/network/servers
  • Security, etc.
• Focus on content delivery: Content Distribution Networks (CDNs)
  • 4 billion Web pages in total, daily growth of 7M pages
  • Annual traffic growth of 200% for the next 4 years

  3. How CDN Works

  4. New Challenges for CDN • Large multimedia files ― Efficient replication • Dynamic content ― Coherence support • Network congestion/failures ― Scalable network monitoring

  5. Existing CDNs Fail to Address these Challenges
• Non-cooperative replication: inefficient
• No coherence support for dynamic content
• Unscalable network monitoring: O(M×N), where M = # of client groups and N = # of server farms

  6. Design Space (SCAN: Scalable Content Access Network)
• Provisioning (replica placement)
  • Access/deployment mechanism: non-cooperative pull (existing CDNs) vs. cooperative push (SCAN)
  • Granularity: per object, per cluster, or per Web site
• Network monitoring: ad hoc pair-wise monitoring, O(M×N) (existing CDNs) vs. tomography-based monitoring, O(M+N) (SCAN)
• Coherence support: unicast or IP multicast vs. app-level multicast on a P2P DHT (SCAN)

  7. SCAN (diagram)
• Cooperative clustering-based replication
• Coherence for dynamic content (replicas s1, s4, s5)

  8. SCAN (diagram, cont'd)
• Cooperative clustering-based replication
• Coherence for dynamic content (replicas s1, s4, s5)
• Scalable network monitoring: O(M+N), where M = # of client groups and N = # of server farms

  9. Evaluation of Internet-scale Systems
• Iterate: algorithm design → analytical evaluation → realistic simulation
  • Network topology
  • Web workload
  • Network end-to-end latency measurement
• Real evaluation?

  10. Network Topology and Web Workload • Network Topology • Pure-random, Waxman & transit-stub synthetic topology • An AS-level topology from 7 widely-dispersed BGP peers • Web Workload • Aggregate MSNBC Web clients with BGP prefix • BGP tables from a BBNPlanet router • Aggregate NASA Web clients with domain names • Map the client groups onto the topology

  11. Network E2E Latency Measurement
• NLANR Active Measurement Project data set
  • 111 sites in America, Asia, Australia and Europe
  • Round-trip time (RTT) between every pair of hosts every minute
  • 17M measurements daily
  • Raw data: Jun.–Dec. 2001, Nov. 2002
• Keynote measurement data
  • Measures TCP performance from about 100 worldwide agents
  • Heterogeneous core network: various ISPs
  • Heterogeneous access networks: dial-up 56K, DSL and high-bandwidth business connections
  • Targets: 40 most popular Web servers + 27 Internet Data Centers
  • Raw data: Nov.–Dec. 2001, Mar.–May 2002

  12. Clustering Web Content for Efficient Replication

  13. Overview
• CDNs use non-cooperative replication, which is inefficient
• Paradigm shift: cooperative push
  • Where to push: greedy algorithms can achieve close-to-optimal performance [JJKRS01, QPV01]
  • But which content should be pushed, and at what granularity?
• Clustering of objects for replication
  • Close-to-optimal performance with small overhead
• Incremental clustering
  • Push before access: improves availability during flash crowds
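The cooperative-push placement cited above relies on a greedy algorithm: repeatedly place the next replica at the server site that most reduces total client retrieval cost. A minimal sketch under assumed inputs (the cost-matrix layout and function name are illustrative, not from the talk):

```python
def greedy_placement(cost, num_replicas):
    """Greedily pick replica sites minimizing total client retrieval cost.

    cost[i][j]: retrieval cost for client group i served from site j.
    Returns the list of chosen site indices.
    """
    num_clients = len(cost)
    num_sites = len(cost[0])
    chosen = []
    # best[i]: cheapest cost for client group i given replicas chosen so far
    best = [float("inf")] * num_clients

    for _ in range(num_replicas):
        def total_if_added(j):
            # Total cost if site j were added to the current placement
            return sum(min(best[i], cost[i][j]) for i in range(num_clients))

        j_star = min((j for j in range(num_sites) if j not in chosen),
                     key=total_if_added)
        chosen.append(j_star)
        best = [min(best[i], cost[i][j_star]) for i in range(num_clients)]
    return chosen
```

Each iteration is a full scan over candidate sites, which is why per-object placement (one greedy run per object) becomes expensive, motivating the per-cluster granularity discussed later.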

  14. Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research

  15. Conventional CDN: Non-cooperative Pull
(diagram: Client 1 and Client 2 in ISP 1 and ISP 2, each with a local DNS server and a local CDN server)
1. Request for hostname resolution
2. Reply: local CDN server IP address (via the CDN name server)
3. GET request to the local CDN server
4. GET request forwarded to the Web content server if cache miss
5. Response from the Web content server
6. Response to the client
→ Inefficient replication

  16. SCAN: Cooperative Push
(diagram: same setting as slide 15)
0. Push replicas to selected CDN servers (e.g. s2)
1. Request for hostname resolution
2. Reply: nearby replica server or Web content server IP address
3. GET request to that server if no replica yet
4. Response
→ Significantly reduces the # of replicas and the update cost

  17. Comparison between Conventional CDNs and SCAN

  18. Problem Formulation • How to use cooperative push for replication to reduce • Clients’ average retrieval cost • Replica location computation cost • Amount of replica directory state to maintain • Subject to certain total replication cost (e.g., # of object replicas)

  19. Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research

  20. Granularity of Replication
(diagram: placement of objects 1–4, per Web site vs. per object)

  21. Replica Placement: Per Site vs. Per Object
• 60–70% average retrieval cost reduction for the per-object scheme
• But per-object replication is too expensive to manage!

  22. Overhead Comparison
• Per-object overhead formula (shown on slide), where R = # of replicas per object and M = total # of objects in the Web site
• Computing on average 10 replicas/object for the top 1000 objects takes several days on a normal server!

  23. Overhead Comparison (cont'd)
• Per-cluster overhead formula (shown on slide), where R = # of replicas per object, K = # of clusters, and M = total # of objects in the Web site (M >> K)

  24. Clustering Web Content
• General clustering framework
  • Define the correlation distance between objects
  • Cluster diameter: the max distance between any two members, i.e. the worst correlation in a cluster
  • Generic clustering: minimize the max diameter over all clusters
• Correlation distance definitions based on
  • Spatial locality
  • Temporal locality
  • Popularity
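The generic framework above (grow clusters while keeping the max diameter small) can be illustrated with a toy sketch. This greedy threshold variant is an assumption for illustration, not the talk's exact algorithm: an object joins the first cluster whose diameter stays within a bound, otherwise it opens a new cluster.

```python
def cluster_by_diameter(objects, dist, max_diameter):
    """Greedy diameter-bounded clustering sketch (illustrative only).

    objects: iterable of items; dist(a, b): correlation distance.
    A cluster's diameter is the max pairwise distance of its members.
    """
    clusters = []
    for o in objects:
        for c in clusters:
            # Adding o keeps the diameter bounded iff o is within
            # max_diameter of every current member.
            if all(dist(o, m) <= max_diameter for m in c):
                c.append(o)
                break
        else:
            clusters.append([o])  # no cluster fits: start a new one
    return clusters
```

With a tighter `max_diameter` the number of clusters K grows toward per-object granularity; loosening it moves toward per-site granularity, which is exactly the trade-off the next slides quantify.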

  25. Spatial Clustering
• Object spatial access vector: # of accesses to the object from each client cluster (diagram: a blue object accessed from client clusters 1–4)
• Correlation distance between two objects defined as
  • Euclidean distance, or
  • Vector similarity
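The two distance options named above can be written out directly over spatial access vectors; a small sketch (function names are mine, and cosine distance is used here as one concrete choice of vector similarity):

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two object spatial access vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for vectors with identical access shape."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0
```

Note the difference: Euclidean distance treats objects with proportional but differently scaled access counts as far apart, while cosine distance treats them as identical, which matters when popular and unpopular objects share the same client distribution.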

  26. Clustering Web Content (cont'd)
• Temporal clustering
  • Divide traces into multiple individuals' access sessions [ABQ01]
  • Compute the correlation within each session, then average over multiple sessions in one day
• Popularity-based clustering
  • Group objects by access popularity; or even simpler, sort them and put the first N/K elements into the first cluster, etc.
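The "sort and split" variant of popularity-based clustering mentioned above is a one-liner in spirit; a minimal sketch (the function name and dict-based input are illustrative, not from the talk):

```python
def popularity_clusters(freq, k):
    """Sort objects by popularity and cut into k roughly equal clusters.

    freq: dict mapping object id -> access count.
    Returns a list of clusters, most popular cluster first.
    """
    ranked = sorted(freq, key=freq.get, reverse=True)
    size = -(-len(ranked) // k)  # ceiling division: N/K objects per cluster
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

Despite its simplicity, this is one of the two schemes the talk later reports as performing best, alongside spatial clustering with Euclidean distance.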

  27. Performance of Cluster-based Replication
• Use the greedy algorithm for replication
• Spatial clustering with Euclidean distance and popularity-based clustering perform the best
• A small # of clusters (only 1–2% of the # of objects) achieves close to per-object performance with much less overhead

  28. Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research

  29. Static Clustering and Replication
• Two daily traces: training trace and new trace
• Static clustering performs poorly beyond a week
• Retrieval cost of static clustering almost doubles the optimal!

  30. Incremental Clustering
• Generic framework
  • If a new object o matches an existing cluster c, add o to c and replicate o to the existing replicas of c
  • Else create a new cluster for o and replicate it
• Two types of incremental clustering
  • Online: without any access logs → high availability
  • Offline: with access logs → close-to-optimal performance
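The generic if/else framework above maps naturally to code. A hedged sketch with an illustrative data layout (the `matches` predicate stands in for whichever similarity test is plugged in, online or offline):

```python
def incremental_add(clusters, obj, matches):
    """One step of the generic incremental clustering framework.

    clusters: list of dicts {"objects": [...], "replicas": [...]}.
    matches(obj, cluster) -> bool: pluggable similarity test.
    Returns the replica sites obj must be pushed to.
    """
    for c in clusters:
        if matches(obj, c):
            c["objects"].append(obj)
            return c["replicas"]  # replicate obj to the cluster's replicas
    # No match: open a new cluster; its replicas are placed separately
    clusters.append({"objects": [obj], "replicas": []})
    return []
```

The point of the framework is that the new object inherits the cluster's existing placement, so it can be pushed before it is ever accessed, which is what improves availability during flash crowds.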

  31. Online Incremental Clustering
• Predict access patterns based on semantics; simplify to popularity prediction
• Which groups of objects have similar popularity? Use hyperlink structures!
  • Groups of siblings
  • Groups at the same hyperlink depth (smallest # of links from the root)
(diagram: hyperlink graph of objects 1–7; object 1 links to objects 2, 3 and 4; object 2 links to 5 and 6; object 4 links to 3 and 7)

  32. Online Popularity Prediction
• Measure the divergence of object popularity within a group: access frequency span (max/min access frequency within the group)
• Experiments
  • Crawl http://www.msnbc.com to hyperlink depth 4, then group the objects
  • Use the corresponding access logs to analyze the correlation
  • Groups of siblings have better correlation

  33. Semantics-based Incremental Clustering
(diagram: a new object being assigned to one of the existing clusters of objects 1–6)
• Put the new object into the existing cluster with the largest number of its siblings
• In case of a tie, choose the cluster with more replicas
• Simulation on MSNBC daily traces
  • 8–10am trace: static popularity-based clustering + replication
  • At 10am: M new objects arrive; online incremental clustering + replication
  • Evaluated with the 10am–12pm trace: each new object receives O(10^3) requests
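The sibling-count rule with its replica-count tie-breaker can be sketched in a few lines (the data layout and function name are illustrative assumptions):

```python
def assign_by_siblings(clusters, siblings):
    """Pick the cluster for a new object per the slide's rule.

    clusters: list of {"objects": set of ids, "replicas": list of sites}.
    siblings: set of ids sharing a hyperlink parent with the new object.
    Primary key: # of siblings in the cluster; tie-break: # of replicas.
    """
    def score(c):
        return (len(siblings & c["objects"]), len(c["replicas"]))
    return max(clusters, key=score)
```

Using a tuple as the sort key encodes the two-level rule directly: Python compares sibling counts first and falls back to replica counts only on a tie.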

  34. Online Incremental Clustering and Replication Results
• Retrieval cost is 1/8 of that with no replication, and 1/5 of that with random replication

  35. Online Incremental Clustering and Replication Results (cont'd)
• Doubles the optimal retrieval cost, but with only 4% of its replication cost

  36. Conclusions
• Cooperative, clustering-based replication
  • Cooperative push: only 4–5% of the replication/update cost of existing CDNs
  • Clustering reduces the management/computational overhead by two orders of magnitude
  • Spatial clustering and popularity-based clustering recommended
• Incremental clustering to adapt to emerging objects
  • Hyperlink-based online incremental clustering for high availability and performance improvement

  37. Tie Back to SCAN • Self-organize replicas into app-level multicast tree for update dissemination • Scalable overlay network monitoring • O(M+N) instead of O(M×N), given M client groups and N servers • For more info: http://www.cs.berkeley.edu/~yanchen/resume.html#Publications

  38. Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research

  39. Future Research (I) • Measurement-based Internet study and protocol/architecture design • Use inference techniques to develop Internet behavior models • Network operators reluctant to reveal internal network configs • Root cause analysis: large, heterogeneous data mining • Leverage graphics/visualization for interactive mining • Apply deeper understanding of Internet behaviors for reassessment/design of protocol/architecture • E.g., Internet bottleneck – peering links? How and Why? Implications?

  40. Future Research (II)
• Network traffic anomaly characterization, identification and detection
  • Many unknown flow-level anomalies revealed by real router traffic analysis (AT&T)
  • Profile traffic patterns of new applications (e.g., P2P) → benign anomalies
  • Understand the causes, patterns and prevalence of other unknown anomalies
• Apply malicious patterns for intrusion detection
  • E.g., fight against the Sapphire/Slammer worm
  • Leverage Forensix for auditing and querying

  41. Backup Materials

  42. Tomography-based Network Monitoring
• For a path A→B through links j with loss rates l_j, the path loss rate p satisfies 1 − p = (1 − l_0)(1 − l_1)(1 − l_2)…
• Given O(M+N) end hosts, a power-law degree topology implies O(M+N) links
• Transform to the topology matrix
• Pick O(M+N) of the M×N paths to compute the link loss rates
• Use the link loss rates to compute the loss rates of the other paths
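The product relation above composes a path's loss rate from its link loss rates, and is the basis for inferring unmeasured paths once the link rates are known; a one-function sketch:

```python
def path_loss(link_losses):
    """Path loss rate p from per-link loss rates: 1 - p = prod(1 - l_j).

    link_losses: iterable of per-link loss rates in [0, 1].
    """
    success = 1.0
    for l in link_losses:
        success *= (1.0 - l)  # packet survives each link independently
    return 1.0 - success
```

Taking logarithms turns the product into a sum, which is what makes the topology-matrix formulation on this slide a linear system solvable from O(M+N) measured paths.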

  43. Path Loss Rate Inference
• Ideal case: rank of the path matrix = # of links (K)
• Rank deficiency solved through topology transformation (diagram: real links collapsed into virtual links)

  44. Future Research (I) • Internet behavior modeling and protocol / architecture design • Use inference techniques to develop Internet behavior models • Root cause analysis: large, heterogeneous data mining • Leverage graphics/visualization for interactive mining • Leverage SciClone Cluster for parallel network tomography • Apply deeper understanding of Internet behaviors for reassessment/design of protocol/architecture • E.g., Internet bottleneck – peering links? How and Why? Implications?

  45. Tomography-based Network Monitoring
• Observations
  • The # of lossy links is small, but they dominate E2E loss
  • Loss rates are stable (on the order of hours to days)
  • Routing is stable (on the order of days)
• Identify the lossy links and monitor only a few paths to examine them
• Make inferences for the other paths
(diagram legend: end hosts, routers, normal links, lossy links)

  46. SCAN (diagram)
• Cooperative clustering-based replication
• Coherence for dynamic content (replicas s1, s4, s5)
• Scalable network monitoring: O(M+N)

  47. Problem Formulation • Subject to certain total replication cost (e.g., # of URL replicas) • Find a scalable, adaptive replication strategy to reduce avg access cost

  48. SCAN: Scalable Content Access Network
• CDN applications (e.g. streaming media)
• Provision: cooperative clustering-based replication
• Coherence: update multicast tree construction
• Underlying services: network distance/congestion/failure estimation, user behavior/workload monitoring, network performance monitoring
(red: my work, black: out of scope)

  49. Evaluation of Internet-scale System • Analytical evaluation • Realistic simulation • Network topology • Web workload • Network end-to-end latency measurement • Network topology • Pure-random, Waxman & transit-stub synthetic topology • A real AS-level topology from 7 widely-dispersed BGP peers

  50. Web Workload • Aggregate MSNBC Web clients with BGP prefix • BGP tables from a BBNPlanet router • Aggregate NASA Web clients with domain names • Map the client groups onto the topology
