
Efficient On-Demand Operations in Large-Scale Infrastructures


Presentation Transcript


  1. Efficient On-Demand Operations in Large-Scale Infrastructures Final Defense Presentation by Steven Y. Ko

  2. Thesis Focus • One class of operations that faces one common set of challenges • cutting across four diverse and popular types of distributed infrastructures

  3. Large-Scale Infrastructures • Are everywhere • Research testbeds • CCT, PlanetLab, Emulab, … • For computer scientists • Grids • TeraGrid, Grid5000, … • For scientific communities • Wide-area peer-to-peer • BitTorrent, LimeWire, etc. • For Web users • Clouds • Amazon, Google, HP, IBM, … • For Web users, app developers

  4. On-Demand Operations • Operations that act upon the most up-to-date state of the infrastructure • E.g., what’s going on right now in my data center? • Four operations in the thesis • On-demand monitoring (prelim) • Clouds, research testbeds • On-demand replication (this talk) • Clouds, research testbeds • On-demand scheduling • Grids • On-demand key/value store • Peer-to-peer • Diverse infrastructures & essential operations • All have one common set of challenges

  5. Common Challenges • Scale and dynamism • On-demand monitoring • Scale: 200K machines (Google) • Dynamism: values can change any time (e.g., CPU-util) • On-demand replication • Scale: ~300TB total, adding 2TB/day (Facebook HDFS) • Dynamism: resource availability (failures, b/w, etc.) • On-demand scheduling • Scale: 4K CPUs running 44,000 tasks accessing 588,900 files (Coadd) • Dynamism: resource availability (CPU, mem, disk, etc.) • On-demand key/value store • Scale: Millions of peers (BitTorrent) • Dynamism: short-term unresponsiveness

  6. Common Challenges • Scale and dynamism • [Figure: chart of scale vs. dynamism (static to dynamic); on-demand operations sit in the large-scale, dynamic region]

  7. On-Demand Monitoring (Recap) • Need • Users and admins need to query & monitor the most up-to-date attributes (CPU-util, disk-cap, etc.) of an infrastructure • Scale • # of machines, the amount of monitoring data, etc. • Dynamism • Static attributes (e.g., CPU-type) vs. dynamic attributes (e.g., CPU-util) • Moara • Expressive queries, e.g., avg. CPU-util where disk > 10 GB or (mem-util < 50% and # of processes < 20) • Quick response time and bandwidth efficiency

  8. Common Challenges • On-demand monitoring: DB vs. Moara • [Figure: chart of (data) scale vs. (attribute) dynamism, from static attributes (e.g., CPU type) to dynamic attributes (e.g., CPU util.); a centralized DB handles small scale, a replicated DB scales only for static attributes, and Moara targets the large-scale, dynamic-attribute region]

  9. Common Challenges • On-demand replication of intermediate data (in detail later) • [Figure: chart of (data) scale vs. (availability) dynamism; local file systems (e.g., compilers) at small scale, distributed file systems (e.g., Hadoop w/ HDFS) at large scale but static, replication (e.g., Hadoop w/ HDFS) for dynamism at small scale, and ISS targeting the large-scale, dynamic region]

  10. Thesis Statement • On-demand operations can be implemented efficiently in spite of scale and dynamism in a variety of distributed systems.

  11. Thesis Statement • On-demand operations can be implemented efficiently in spite of scale and dynamism in a variety of distributed systems. • Efficiency: responsiveness & bandwidth

  12. Thesis Contributions • Identifying on-demand operations as an important class of operations in large-scale infrastructures • Identifying scale and dynamism as two common challenges • Arguing the need for an on-demand operation in each case • Showing how efficient each can be by simulations and real deployments

  13. Thesis Overview • ISS (Intermediate Storage System) • Implements on-demand replication • USENIX HotOS ’09 • Moara • Implements on-demand monitoring • ACM/IFIP/USENIX Middleware ’08 • Worker-centric scheduling • Implements on-demand scheduling • ACM/IFIP/USENIX Middleware ’07 • MPIL (Multi-Path Insertion and Lookup) • Implements on-demand key/value store • IEEE DSN ’05

  14. Thesis Overview • [Figure: the four infrastructures from slide 3, annotated with the corresponding solution] • Worker-centric scheduling: Grids (TeraGrid, Grid5000, …; for scientific communities) • Moara & ISS: research testbeds (CCT, PlanetLab, Emulab, …; for computer scientists) and clouds (Amazon, Google, HP, IBM, …; for Web users, app developers) • MPIL: wide-area peer-to-peer (BitTorrent, LimeWire, etc.; for Web users)

  15. Related Work • Management operations: MON [Lia05], vxargs, PSSH[Pla], etc. • Scheduling [Ros04, Ros05, Vis04] • Gossip-based multicast [Bir02,Gup02,Ker01] • On-demand scaling • Amazon AWS, Google AppEngine, MS Azure, RightScale, etc.

  16. ISS: Intermediate Storage System

  17. Our Position • Intermediate data as a first-class citizen for dataflow programming frameworks in clouds

  18. Our Position • Intermediate data as a first-class citizen for dataflow programming frameworks in clouds • Dataflow programming frameworks

  19. Our Position • Intermediate data as a first-class citizen for dataflow programming frameworks in clouds • Dataflow programming frameworks • The importance of intermediate data • Scale and dynamism

  20. Our Position • Intermediate data as a first-class citizen for dataflow programming frameworks in clouds • Dataflow programming frameworks • The importance of intermediate data • Scale and dynamism • ISS (Intermediate Storage System) with on-demand replication

  21. Dataflow Programming Frameworks • Runtime systems that execute dataflow programs • MapReduce (Hadoop), Pig, Hive, etc. • Gaining popularity for massive-scale data processing • Distributed and parallel execution on clusters • A dataflow program consists of • Multi-stage computation • Communication patterns between stages

  22. Example 1: MapReduce • Two-stage computation with all-to-all communication • Introduced by Google; open-sourced by Yahoo! as Hadoop • Two functions – Map and Reduce – supplied by a programmer • Massively parallel execution of Map and Reduce • [Figure: Stage 1 (Map) → Shuffle (all-to-all) → Stage 2 (Reduce)]
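
For readers unfamiliar with the two programmer-supplied functions, a minimal word-count example written against Hadoop's Java MapReduce API is sketched below. This specific example is not part of the presentation and the class names are illustrative; it only shows where Map emits intermediate key/value pairs and where Reduce consumes them after the all-to-all shuffle.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit <word, 1> for every word in the input split.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // intermediate data, shuffled all-to-all to reducers
    }
  }
}

// Reduce: after the shuffle, sum the counts received for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}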

  23. Example 2: Pig and Hive • Multi-stage computation with either all-to-all or 1-to-1 communication • [Figure: Stage 1 (Map) → Shuffle (all-to-all) → Stage 2 (Reduce) → 1-to-1 comm. → Stage 3 (Map) → Stage 4 (Reduce)]

  24. Usage • Google (MapReduce) • Indexing: a chain of 24 MapReduce jobs • ~200K jobs processing 50PB/month (in 2006) • Yahoo! (Hadoop + Pig) • WebMap: a chain of 100 MapReduce jobs • Facebook (Hadoop + Hive) • ~300TB total, adding 2TB/day (in 2008) • 3K jobs processing 55TB/day • Amazon • Elastic MapReduce service (pay-as-you-go) • Academic clouds • Google-IBM Cluster at UW (Hadoop service) • CCT at UIUC (Hadoop & Pig service)

  25. One Common Characteristic • Intermediate data • Intermediate data: the data produced between stages • Similar to traditional intermediate data [Bak91, Vog99] • E.g., .o files • Critical to produce the final output • Short-lived, written once, read once, and used immediately • Acts as a computational barrier

  26. One Common Characteristic • Computational barrier • [Figure: Stage 1 (Map) → computational barrier → Stage 2 (Reduce)]

  27. Why Important? • Scale and dynamism • Large scale: a possibly very large amount of intermediate data • Dynamism: loss of intermediate data means the task can't proceed • [Figure: Stage 1 (Map) → Stage 2 (Reduce)]

  28. Failure Stats • 5 average worker deaths per MapReduce job (Google, 2006) • One disk failure in every run of a 6-hour MapReduce job on 4,000 machines (Google, 2008) • 50 machine failures out of a 20K-machine cluster (Yahoo!, 2009)

  29. Hadoop Failure Injection Experiment • Emulab setting • 20 machines sorting 36GB • 4 LANs and a core switch (all 100 Mbps) • 1 failure after Map • Re-execution of Map-Shuffle-Reduce • ~33% increase in completion time • [Figure: timeline with Map → Shuffle → Reduce executed twice]

  30. Re-Generation for Multi-Stage • Cascaded re-execution: expensive • [Figure: Stage 1 (Map) → Stage 2 (Reduce) → Stage 3 (Map) → Stage 4 (Reduce), with a loss in a later stage forcing re-execution of earlier stages]

  31. Importance of Intermediate Data • Why? • Scale and dynamism • A lot of data + very costly when lost • Current systems handle it themselves • Re-generate when lost: can lead to expensive cascaded re-execution • We believe that the storage layer can provide a better solution than the dataflow programming frameworks

  32. Our Position • Intermediate data as a first-class citizen for dataflow programming frameworks in clouds • Dataflow programming frameworks • The importance of intermediate data • ISS (Intermediate Storage System) • Why storage? • Challenges • Solution hypotheses • Hypotheses validation

  33. Why Storage? On-Demand Replication • Stops cascaded re-execution • [Figure: Stage 1 (Map) → Stage 2 (Reduce) → Stage 3 (Map) → Stage 4 (Reduce), with replicated intermediate data stopping the cascade]

  34. So, Are We Done? • No! • Challenge: minimal interference • The network is heavily utilized during Shuffle • Replication also requires network transmission, and a large amount of data must be replicated • Minimizing interference is critical for the overall job completion time • Efficiency: completion time + bandwidth • HDFS (Hadoop Distributed File System): much interference

  35. Default HDFS Interference • On-demand replication of Map and Reduce outputs (2 copies in total)

  36. Background Transport Protocols • TCP-Nice [Ven02] & TCP-LP [Kuz06] • Support background & foreground flows • Pros • Background flows do not interfere with foreground flows (functionality) • Cons • Designed for wide-area Internet • Application-agnostic • Not designed for data center replication • Can do better!
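
As an intuition for the background/foreground distinction above, the sketch below is a simple application-level rate limiter for a replication transfer. It is only an analogue and is hypothetical: TCP-Nice and TCP-LP achieve the effect inside the transport layer by backing off on early congestion signals, whereas this class merely caps the sender at a fixed byte budget per second so that foreground (shuffle) traffic keeps most of the bandwidth.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative, application-level stand-in for a "background flow":
// cap the replication sender at a fixed byte budget per second.
class ThrottledCopier {
  private final long bytesPerSecond;

  ThrottledCopier(long bytesPerSecond) {
    this.bytesPerSecond = bytesPerSecond;
  }

  void copy(InputStream in, OutputStream out) throws IOException, InterruptedException {
    byte[] buf = new byte[64 * 1024];
    long windowStart = System.nanoTime();
    long sentInWindow = 0;
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
      sentInWindow += n;
      if (sentInWindow >= bytesPerSecond) {
        // Budget for this one-second window is used up; sleep out the remainder.
        long elapsedMillis = (System.nanoTime() - windowStart) / 1_000_000;
        Thread.sleep(Math.max(0, 1000 - elapsedMillis));
        windowStart = System.nanoTime();
        sentInWindow = 0;
      }
    }
    out.flush();
  }
}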

  37. Revisiting Common Challenges • Scale: need to replicate a large amount of intermediate data • Dynamism: failures, foreground traffic (bandwidth availability) • Two challenges create much interference.

  38. Revisiting Common Challenges • Intermediate data management • [Figure: same chart of (data) scale vs. (availability) dynamism as slide 9; local file systems (e.g., compilers), distributed file systems (e.g., Hadoop w/ HDFS), replication (e.g., Hadoop w/ HDFS), and ISS in the large-scale, dynamic region]

  39. Our Position • Intermediate data as a first-class citizen for dataflow programming frameworks in clouds • Dataflow programming frameworks • The importance of intermediate data • ISS (Intermediate Storage System) • Why is storage the right abstraction? • Challenges • Solution hypotheses • Hypotheses validation

  40. Three Hypotheses • Asynchronous replication can help. • HDFS replication works synchronously. • The replication process can exploit the inherent bandwidth heterogeneity of data centers (next). • Data selection can help (later).
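
A minimal sketch of the asynchronous-replication hypothesis, assuming a hypothetical ReplicaTransport abstraction (this is illustrative only, not ISS's actual implementation): the task hands a finished output file to a background thread and continues immediately, instead of blocking until the copy completes as a synchronous HDFS-style write would.

import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical transport that copies a local file to another node,
// e.g., over HTTP or an RPC channel.
interface ReplicaTransport {
  void copy(Path localFile, String remoteNode);
}

// Replication is taken off the task's critical path: the caller returns
// right away and a single background thread performs the copy.
class AsyncReplicator {
  private final ExecutorService background = Executors.newSingleThreadExecutor();
  private final ReplicaTransport transport;

  AsyncReplicator(ReplicaTransport transport) {
    this.transport = transport;
  }

  // Called when a map task finishes writing an output partition.
  void replicateAsync(Path localOutput, String remoteNode) {
    background.submit(() -> transport.copy(localOutput, remoteNode));
  }

  void shutdown() {
    background.shutdown();
  }
}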

  41. Bandwidth Heterogeneity • Data center topology: hierarchical • Top-of-the-rack switches (under-utilized) • Shared core switch (fully-utilized)

  42. Data Selection • Only replicate locally-consumed data • [Figure: Stage 1 (Map) → Stage 2 (Reduce) → Stage 3 (Map) → Stage 4 (Reduce), highlighting the data consumed on the node that produced it]
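
A hedged sketch of the data-selection rule above (the Partition fields and ReplicationSelector class are hypothetical, not ISS code): a map output partition needs an explicit off-node copy only if its consuming reducer runs on the node that produced it, because remotely-consumed partitions already reach another node during the shuffle and thus have an implicit copy.

import java.util.ArrayList;
import java.util.List;

// One intermediate partition: where it was produced and where it will be consumed.
class Partition {
  final String id;
  final String producerNode;   // node that ran the map task
  final String consumerNode;   // node that will run the reduce task

  Partition(String id, String producerNode, String consumerNode) {
    this.id = id;
    this.producerNode = producerNode;
    this.consumerNode = consumerNode;
  }
}

// Select only the locally-consumed partitions for explicit replication.
class ReplicationSelector {
  List<Partition> selectForReplication(List<Partition> partitions) {
    List<Partition> selected = new ArrayList<>();
    for (Partition p : partitions) {
      if (p.producerNode.equals(p.consumerNode)) {
        selected.add(p);   // lost with the node unless explicitly replicated
      }
    }
    return selected;
  }
}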

  43. Three Hypotheses • Asynchronous replication can help. • The replication process can exploit the inherent bandwidth heterogeneity of data centers. • Data selection can help. • The question is not if, but how much. • If effective, these become techniques.

  44. Experimental Setting • Emulab with 80 machines • 4 LANs with 20 machines each • 4 × 100 Mbps top-of-the-rack switches • 1 × 1 Gbps core switch • Various configurations give similar results • Input data: 2 GB/machine, randomly generated • Workload: sort • 5 runs • Std. dev. ~100 sec.: small compared to the overall completion time • 2 replicas of Map outputs in total

  45. Asynchronous Replication • Modification for asynchronous replication • With an increasing level of interference • Four levels of interference • Hadoop: original, no replication, no interference • Read: disk read, no network transfer, no actual replication • Read-Send: disk read & network send, no actual replication • Rep.: full replication

  46. Asynchronous Replication • Network utilization makes the difference • Both Map & Shuffle get affected • Some Maps need to read remotely

  47. Three Hypotheses (Validation) • Asynchronous replication can help, but still can’t eliminate the interference. • The replication process can exploit the inherent bandwidth heterogeneity of data centers. • Data selection can help.

  48. Rack-Level Replication • Rack-level replication is effective. • Only 20~30 rack failures per year, mostly planned (Google 2008)
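
A sketch of rack-level replica placement, assuming a node-to-rack mapping is available (the rackOf map and the class below are hypothetical, not ISS's implementation): the replica destination is chosen inside the producer's rack, so the copy crosses only the under-utilized top-of-rack switch and stays off the shared core switch.

import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

// Prefer a replica destination in the same rack as the producer, so that
// replication traffic avoids the contended core switch.
class RackAwarePlacement {
  private final Map<String, String> rackOf;   // node -> rack, assumed given

  RackAwarePlacement(Map<String, String> rackOf) {
    this.rackOf = rackOf;
  }

  Optional<String> chooseReplicaNode(String producerNode, List<String> liveNodes) {
    String producerRack = rackOf.get(producerNode);
    return liveNodes.stream()
        .filter(n -> !n.equals(producerNode))                       // never the producer itself
        .filter(n -> Objects.equals(producerRack, rackOf.get(n)))   // stay inside the rack
        .findFirst();
  }
}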

  49. Three Hypotheses (Validation) • Asynchronous replication can help, but still can’t eliminate the interference • The rack-level replication can reduce the interference significantly. • Data selection can help.

  50. Locally-Consumed Data Replication • It significantly reduces the amount of replication.
