
Efficient and Adaptive Replication using Content Clustering





Presentation Transcript


  1. Efficient and Adaptive Replication using Content Clustering Yan Chen EECS Department UC Berkeley

  2. Motivation • Internet has evolved to become a commercial infrastructure for service delivery • Web delivery, VoIP, streaming media … • Challenges for Internet-scale services • Scalability: 600M users, 35M Web sites, 28Tb/s • Efficiency: bandwidth, storage, management • Agility: dynamic clients/network/servers • Security, etc. • Focus on content delivery - Content Distribution Network (CDN) • 4 billion Web pages in total, daily growth of 7M pages • Annual growth of 200% for the next 4 years

  3. CDN and its Challenges

  4. CDN and its Challenges • No coherence for dynamic content • Inefficient replication • Unscalable network monitoring - O(M*N)

  5. SCAN: Scalable Content Access Network • CDN Applications (e.g. streaming media) • Provision: Cooperative Clustering-based Replication • Coherence: Update Multicast Tree Construction • Network Distance/Congestion/Failure Estimation • User Behavior/Workload Monitoring • Network Performance Monitoring (red: my work, black: out of scope)

  6. SCAN • Coherence for dynamic content • Cooperative clustering-based replication

  7. SCAN • Coherence for dynamic content • Cooperative clustering-based replication • Scalable network monitoring - O(M+N)

  8. Internet-scale Simulation • Network Topology • Pure-random, Waxman & transit-stub synthetic topology • An AS-level topology from 7 widely-dispersed BGP peers • Web Workload • Aggregate MSNBC Web clients with BGP prefix • BGP tables from a BBNPlanet router • Aggregate NASA Web clients with domain names • Map the client groups onto the topology

  9. Internet-scale Simulation – E2E Measurement • NLANR Active Measurement Project data set • 111 sites in America, Asia, Australia and Europe • Round-trip time (RTT) between every pair of hosts every minute • 17M daily measurements • Raw data: Jun. – Dec. 2001, Nov. 2002 • Keynote measurement data • Measure TCP performance from about 100 worldwide agents • Heterogeneous core network: various ISPs • Heterogeneous access network: • Dial-up 56K, DSL and high-bandwidth business connections • Targets • 40 most popular Web servers + 27 Internet Data Centers • Raw data: Nov. – Dec. 2001, Mar. – May 2002

  10. Clustering Web Content for Efficient Replication

  11. Overview • CDN uses non-cooperative replication - inefficient • Paradigm shift: cooperative push • Where to push – greedy algorithms can achieve close to optimal performance [JJKRS01, QPV01] • But which content should be pushed? • At what granularity? • Clustering of objects for replication • Close-to-optimal performance with small overhead • Incremental clustering • Push before accessed: improve availability during flash crowds

  12. Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research

  13. Conventional CDN: Non-cooperative Pull • 1. GET request • 2. Request for hostname resolution • 3. Reply: local CDN server IP address • 4. Local CDN server IP address • 5. GET request • 6. GET request if cache miss • 7. Response • 8. Response (diagram: Client 1 and Client 2 with local DNS servers, local CDN servers, a CDN name server and the Web content server across ISP 1 and ISP 2) • Inefficient replication

  14. SCAN: Cooperative Push • 0. Push replicas • 1. GET request • 2. Request for hostname resolution • 3. Reply: nearby replica server or Web server IP address • 4. Redirected server IP address • 5. GET request if no replica yet • 6. Response (diagram: Client 1 and Client 2 with local DNS servers, local CDN servers, a CDN name server and the Web content server across ISP 1 and ISP 2) • Significantly reduces the # of replicas and update cost

  15. Comparison between Conventional CDNs and SCAN

  16. Problem Formulation • Find a scalable, adaptive replication strategy to reduce • Clients’ average retrieval cost • Replica location computation cost • Amount of replica directory state to maintain • Subject to certain total replication cost (e.g., # of URL replicas)

  17. Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research

  18. (Diagram: Per URL vs. Per Web site replica placement, shown over the same four client groups)

  19. Replica Placement: Per Website vs. Per URL • 60 – 70% average retrieval cost reduction for Per URL scheme • Per URL is too expensive for management!

  20. Overhead Comparison • Per-URL replication overhead grows with both R and M, where R: # of replicas per URL, M: # of URLs • To compute on average 10 replicas/URL for the top 1000 URLs takes several days on a normal server!

  21. Overhead Comparison • Cluster-based replication overhead grows with R and K instead of M, where R: # of replicas per URL, K: # of clusters, M: # of URLs (M >> K)
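The gap between the two schemes can be illustrated with the slides' numbers. A back-of-the-envelope sketch (the linear cost model and the value K = 20 are illustrative assumptions, not from the slides):

```python
# Rough comparison of replica-placement work, assuming the greedy
# placement algorithm runs once per replicated unit. R and M come from
# the slide; K = 20 and the linear cost model are illustrative.

R = 10    # replicas per URL
M = 1000  # URLs placed individually (per-URL scheme)
K = 20    # clusters placed as units (per-cluster scheme, K << M)

per_url_runs = R * M      # placement decisions, per-URL scheme
per_cluster_runs = R * K  # placement decisions, per-cluster scheme

print(per_url_runs, per_cluster_runs,
      f"{per_url_runs // per_cluster_runs}x fewer placements")
```

With K at 1-2% of M (the ratio the later slides report works well), the placement work drops by roughly two orders of magnitude.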

  22. Clustering Web Content • General clustering framework • Define the correlation distance between URLs • Cluster diameter: the max distance between any two members • Worst correlation in a cluster • Generic clustering: minimize the max diameter of all clusters • Correlation distance definition based on • Spatial locality • Temporal locality • Popularity
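The generic framework above can be sketched as a greedy pass that caps each cluster's diameter (a minimal illustration; the function name, the first-fit assignment order, and the toy 1-D items are assumptions, not the authors' algorithm):

```python
def cluster_by_diameter(items, dist, max_diameter):
    """Greedy sketch of the generic framework: a cluster's diameter is
    the max correlation distance between any two members (the worst
    correlation); each item joins the first cluster that stays within
    the bound, else starts a new cluster."""
    clusters = []
    for x in items:
        for c in clusters:
            if all(dist(x, y) <= max_diameter for y in c):
                c.append(x)
                break
        else:
            clusters.append([x])
    return clusters

# Toy 1-D example: correlation distance is absolute difference.
print(cluster_by_diameter([0, 1, 10, 11], lambda a, b: abs(a - b), 2))
# → [[0, 1], [10, 11]]
```

Any of the correlation distances listed on the slide (spatial, temporal, popularity) can be plugged in as `dist`.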

  23. Spatial Clustering • URL spatial access vector: # of accesses to the URL from each client group (diagram: the blue URL is accessed from client groups 1-4) • Correlation distance between two URLs defined as • Euclidean distance • Vector similarity
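With each URL represented by its spatial access vector (accesses per client group), the two distance choices on the slide can be written directly (a sketch; the vector values are made up):

```python
import math

def euclidean(u, v):
    # Euclidean distance between two URL spatial access vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    # 1 - cosine similarity: 0 means identical access patterns,
    # regardless of absolute request volume.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Two URLs hit by four client groups (counts are made up):
u1 = [120, 0, 30, 50]
u2 = [240, 0, 60, 100]   # same pattern, twice the volume
print(euclidean(u1, u2))        # large: absolute volumes differ
print(cosine_distance(u1, u2))  # ~0: identical spatial pattern
```

The example shows why the two metrics behave differently: Euclidean distance is sensitive to popularity, while vector similarity isolates where the requests come from.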

  24. Clustering Web Content (cont’d) • Temporal clustering • Divide traces into multiple individuals’ access sessions [ABQ01] • In each session, compute the correlation between URLs accessed together • Average over multiple sessions in one day • Popularity-based clustering • Or even simpler, sort them and put the first N/K elements into the first cluster, etc. - binary correlation
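The "even simpler" popularity scheme at the end of the slide amounts to sorting URLs by access count and slicing the ranking into K equal parts (a sketch; the dict-based interface and the sample counts are assumptions):

```python
import math

def popularity_clusters(freqs, k):
    """Sort URLs by access frequency and cut the ranking into k
    roughly equal-size clusters (the slide's binary correlation).
    `freqs` maps URL -> access count."""
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    size = math.ceil(len(ranked) / k)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

hits = {"/news": 900, "/sports": 700, "/weather": 300,
        "/tech": 250, "/local": 40, "/archive": 10}
print(popularity_clusters(hits, 3))
# → [['/news', '/sports'], ['/weather', '/tech'], ['/local', '/archive']]
```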

  25. Performance of Cluster-based Replication • Spatial clustering with Euclidean distance and popularity-based clustering perform the best • Small # of clusters (with only 1-2% of # of URLs) can achieve close to per-URL performance, with much less overhead MSNBC, 8/2/1999, 5 replicas/URL

  26. Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research

  27. Static Clustering and Replication • Two daily traces: training trace and new trace • Static clustering performs poorly beyond a week • Performance of static clustering almost doubles the optimal!

  28. Incremental Clustering • Generic framework • If a new URL u matches an existing cluster c, add u to c and replicate u to the existing replicas of c • Else create new clusters and replicate them • Two types of incremental clustering • Online: without any access logs • High availability • Offline: with access logs • Close-to-optimal performance
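The generic framework on this slide is essentially an if/else over the existing clusters. A minimal sketch (the cluster dict layout and the `matches`/`replicate` callbacks are placeholders for whatever matching rule and push mechanism a deployment uses):

```python
def incremental_add(url, clusters, matches, replicate):
    """If the new URL matches an existing cluster, join it and push the
    URL to that cluster's current replicas; otherwise start a new
    cluster (to be replicated on its own)."""
    for c in clusters:
        if matches(url, c):
            c["urls"].append(url)
            replicate(url, c["replicas"])
            return c
    new_cluster = {"urls": [url], "replicas": []}
    clusters.append(new_cluster)
    return new_cluster
```

Online clustering plugs a hyperlink-based `matches` into this loop (no access logs needed); offline clustering would derive `matches` from the logs instead.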

  29. Online Incremental Clustering • Predict access patterns based on semantics • Simplify to popularity prediction • Groups of URLs with similar popularity? Use hyperlink structures! • Groups of siblings • Groups of the same hyperlink depth (smallest # of links from root) (diagram: URL1 links to URL2, URL3 and URL4; URL2 links to URL5 and URL6; URL4 links to URL3 and URL7)

  30. Online Popularity Prediction • Measure the divergence of URL popularity within a group: access frequency span • Experiments • Crawl http://www.msnbc.com on 5/3/2002 with hyperlink depth 4, then group the URLs • Use corresponding access logs to analyze the correlation • Groups of siblings have the best correlation
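A sketch of the span metric; the exact normalization is an assumption here (max minus min over the group mean), not necessarily the authors' precise definition:

```python
def access_freq_span(freqs):
    """Divergence of URL popularity within a group; 0 means every URL
    in the group is equally popular. Normalizing the spread by the
    group mean is an assumption made for this sketch."""
    mean = sum(freqs) / len(freqs)
    return (max(freqs) - min(freqs)) / mean

print(access_freq_span([10, 10, 10]))  # → 0.0
print(access_freq_span([5, 10, 30]))   # (30 - 5) / 15 ≈ 1.67
```

A low span within sibling groups is what makes siblings a good proxy for similar popularity.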

  31. Semantics-based Incremental Clustering • Put the new URL into the existing cluster with the largest # of siblings • In case of a tie, choose the cluster w/ more replicas • Simulation on 5/3/2002 MSNBC • 8-10am trace: static popularity clustering + replication • At 10am: 16 new URLs - online inc. clustering + replication • Evaluation with 10-12am trace: the 16 URLs have 33K requests
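The placement rule on this slide (most siblings, ties broken by replica count) can be sketched as a single `max` with a two-part key (the cluster layout and the `replica_count` map are assumptions):

```python
def assign_new_url(siblings, clusters, replica_count):
    """Pick the existing cluster containing the most hyperlink siblings
    of the new URL; on a tie, prefer the cluster with more replicas."""
    return max(
        clusters,
        key=lambda c: (len(siblings & set(c["urls"])),
                       replica_count[c["id"]]),
    )

clusters = [{"id": 0, "urls": ["a", "b"]}, {"id": 1, "urls": ["c", "d"]}]
replica_count = {0: 3, 1: 7}
print(assign_new_url({"c", "x"}, clusters, replica_count)["id"])  # → 1
print(assign_new_url({"x"}, clusters, replica_count)["id"])       # → 1 (tie: more replicas)
```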

  32. Online Incremental Clustering and Replication Results • Retrieval cost is 1/8 of that with no replication, and 1/5 of that with random replication

  33. Online Incremental Clustering and Replication Results Double the optimal retrieval cost, but only 4% of its replication cost

  34. Conclusions • Cooperative, clustering-based replication • Cooperative push: only 4 - 5% replication/update cost compared with existing CDNs • URL clustering reduces the management/computational overhead by two orders of magnitude • Spatial clustering and popularity-based clustering recommended • Incremental clustering to adapt to emerging URLs • Hyperlink-based online incremental clustering for high availability and performance improvement • Self-organize replicas into app-level multicast tree for update dissemination • Scalable overlay network monitoring • O(M+N) instead of O(M*N), given M client groups and N servers

  35. Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research

  36. Future Research (I) • Measurement-based Internet study and protocol/architecture design • Use inference techniques to develop Internet behavior models • Network operators reluctant to reveal internal network configurations • Root cause analysis: large, heterogeneous data mining • Leverage graphics/visualization for interactive mining • Apply deeper understanding of Internet behaviors for reassessment/design of protocol/architecture • E.g., Internet bottleneck – peering links? How and Why? Implications?

  37. Future Research (II) • Network traffic anomaly characterization, identification and detection • Many unknown flow-level anomalies revealed from real router traffic analysis (AT&T) • Profile traffic patterns of new applications (e.g. P2P) –> benign anomalies • Understand the cause, pattern and prevalence of other unknown anomalies • Identify malicious patterns for intrusion detection • E.g., fight against Sapphire/Slammer Worm

  38. Backup Materials

  39. SCAN • Coherence for dynamic content • Cooperative clustering-based replication • Scalable network monitoring - O(M+N)

  40. Problem Formulation • Subject to certain total replication cost (e.g., # of URL replicas) • Find a scalable, adaptive replication strategy to reduce avg access cost

  41. Simulation Methodology • Network Topology • Pure-random, Waxman & transit-stub synthetic topology • An AS-level topology from 7 widely-dispersed BGP peers • Web Workload • Aggregate MSNBC Web clients with BGP prefix • BGP tables from a BBNPlanet router • Aggregate NASA Web clients with domain names • Map the client groups onto the topology

  42. Online Incremental Clustering • Predict access patterns based on semantics • Simplify to popularity prediction • Groups of URLs with similar popularity? Use hyperlink structures! • Groups of siblings • Groups of the same hyperlink depth: smallest # of links from root

  43. Challenges for CDN • Over-provisioning for replication • Provide good QoS to clients (e.g., latency bound, coherence) • Small # of replicas with small delay and bandwidth consumption for update • Replica Management • Scalability: billions of replicas if replicating at URL granularity • O(10^4) URLs/server, O(10^5) CDN edge servers in O(10^3) networks • Adaptation to dynamics of content providers and customers • Monitoring • User workload monitoring • End-to-end network distance/congestion/failure monitoring • Measurement scalability • Inference accuracy and stability

  44. SCAN Architecture • Leverage Decentralized Object Location and Routing (DOLR) - Tapestry for • Distributed, scalable location with guaranteed success • Search with locality • Soft-state maintenance of dissemination tree (for each object) (diagram: data plane with the Web server as data source, SCAN servers, replicas, caches and clients over a Tapestry mesh; network plane provides dynamic replication/update, content management and request location; coherence: always update vs. adaptive)

  45. Wide-area Network Measurement and Monitoring System (WNMMS) • Select a subset of SCAN servers to be monitors • E2E estimation for • Distance • Congestion • Failures (diagram: clusters A, B and C of monitors, SCAN edge servers and clients on the network plane; distance measured from a host to its monitor and among monitors)

  46. Dynamic Provisioning • Dynamic replica placement • Meeting clients’ latency and servers’ capacity constraints • Close-to-minimal # of replicas • Self-organized replicas into app-level multicast tree • Small delay and bandwidth consumption for update multicast • Each node only maintains states for its parent & direct children • Evaluated based on simulation of • Synthetic traces with various sensitivity analysis • Real traces from NASA and MSNBC • Publication • IPTPS 2002 • Pervasive Computing 2002

  47. Effects of the Non-Uniform Size of URLs • Replication cost constraint: bytes • Similar trends exist • Per URL replication outperforms per Website dramatically • Spatial clustering with Euclidean distance and popularity-based clustering are very cost-effective

  48. Diagram of Internet Iso-bar (diagram: clusters A, B and C of end hosts, each with a landmark)

  49. Diagram of Internet Iso-bar (diagram: clusters A, B and C of end hosts with landmarks and monitors; distance probes from a monitor to its hosts and among monitors)

  50. Real Internet Measurement Data • NLANR Active Measurement Project data set • 119 sites in the US (106 after filtering out most offline sites) • Round-trip time (RTT) between every pair of hosts every minute • Raw data: 6/24/00 – 12/3/01 • Keynote measurement data • Measure TCP performance from about 100 agents • Heterogeneous core network: various ISPs • Heterogeneous access network: • Dial-up 56K, DSL and high-bandwidth business connections • Targets • Web site perspective: 40 most popular Web servers • 27 Internet Data Centers (IDCs)
