
Cloud Computing Systems

Cloud Computing Systems. Lin Gu. Hong Kong University of Science and Technology. Sept. 14, 2011. How to effectively compute in a datacenter? Is MapReduce the best answer to computation in the cloud? What are the limitations of MapReduce?


Presentation Transcript


  1. Cloud Computing Systems Lin Gu Hong Kong University of Science and Technology Sept. 14, 2011

  2. How to effectively compute in a datacenter? Is MapReduce the best answer to computation in the cloud? What is the limitation of MapReduce? How to provide general-purpose parallel processing in DCs?

  3. Program Execution on Web-Scale Data The MapReduce Approach • MapReduce—parallel computing for Web-scale data processing • Fundamental component in Google’s technological architecture • Why didn’t Google use parallel Fortran, MPI, …? • Followed by many technology firms

  4. MapReduce • Map and Fold • Map: do something to all elements in a list • Fold: aggregate elements of a list • Used in functional programming languages such as Lisp • Old ideas can be fabulous, too! (Lisp = "Lost In Silly Parentheses"?)

  5. MapReduce • Map is a higher-order function: apply an op to all elements in a list • Result is a new list • Parallelizable • Example: (map (lambda (x) (* x x)) '(1 2 3 4 5)) → '(1 4 9 16 25)
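In Python, the same idea looks like the following minimal sketch (mirroring the slide's Lisp example of squaring each element):

    # Square every element of a list with map; each application of the
    # function is independent, which is what makes map parallelizable.
    squares = list(map(lambda x: x * x, [1, 2, 3, 4, 5]))
    print(squares)  # [1, 4, 9, 16, 25]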

  6. Program Execution on Web-Scale Data The MapReduce Approach • Reduce is also a higher-order function • Like "fold": aggregate elements of a list • Accumulator set to initial value • Function applied to list element and the accumulator • Result stored in the accumulator • Repeated for every item in the list • Result is the final value in the accumulator • Examples: (fold + 0 '(1 2 3 4 5)) → 15, (fold * 1 '(1 2 3 4 5)) → 120
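A minimal Python sketch of the same fold operations, using functools.reduce with an explicit initial accumulator value:

    from functools import reduce

    # (fold + 0 '(1 2 3 4 5)) -> 15: the accumulator starts at 0 and each element is added to it.
    total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0)

    # (fold * 1 '(1 2 3 4 5)) -> 120: the accumulator starts at 1 and each element multiplies it.
    product = reduce(lambda acc, x: acc * x, [1, 2, 3, 4, 5], 1)

    print(total, product)  # 15 120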

  7. Program Execution on Web-Scale Data The MapReduce Approach • Massive parallel processing made simple • Example: word count • Map: parse a document and generate <word, 1> pairs • Reduce: receive all pairs for a specific word, and count (sum)
    Map(D):                // D is a document
      for each word w in D: output <w, 1>
    Reduce(w, items):      // all pairs for key w
      count = 0
      for each input item: count = count + 1
      output <w, count>
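A single-machine Python sketch of the word-count example; map_doc emits <word, 1> pairs and reduce_word sums the pairs for one key (the names and the toy driver are illustrative, not part of any MapReduce API):

    from collections import defaultdict

    def map_doc(document):
        # Map: parse a document and emit a <word, 1> pair for every word.
        for word in document.split():
            yield (word, 1)

    def reduce_word(word, counts):
        # Reduce: receive all pairs for one word and sum them.
        return word, sum(counts)

    # Toy driver: group intermediate pairs by key, then reduce each group.
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    groups = defaultdict(list)
    for doc in docs:
        for word, one in map_doc(doc):
            groups[word].append(one)

    counts = dict(reduce_word(w, vals) for w, vals in groups.items())
    print(counts)  # {'the': 3, 'fox': 2, 'quick': 1, ...}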

  8. Design Context • Big data, but simple dependence – relatively easy to partition data • Supported by a distributed system – distributed OS services across thousands of commodity PCs (e.g., GFS) • First users are search oriented – crawl, index, search • Designed years ago, still working today, growing adoption

  9. Workflow • Single master node, numerous worker threads

  10. 1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines. 2. One of the copies of the program is the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task. Workflow

  11. 3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory. 4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. Workflow

  12. 5. When a reduce worker is notified by the master about these locations, it uses RPCs to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. 6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition. 7. When all map tasks and reduce tasks have been completed, the MapReduce call returns to the user code. Workflow
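The seven steps can be condensed into a toy, single-process Python sketch: split the input into M map tasks, partition the intermediate pairs into R regions, sort each region by key, and reduce each group. It only illustrates the data flow; the distributed machinery (master, workers, RPCs, local disks) is omitted:

    from collections import defaultdict

    M, R = 3, 2                         # number of map tasks and reduce tasks
    splits = ["a b a", "b c", "a c c"]  # stand-in for the M input splits

    def map_fn(split):                  # user-defined Map: emit <word, 1> pairs
        return [(w, 1) for w in split.split()]

    def partition(key):                 # partitioning function: key -> one of R regions
        return hash(key) % R

    # Map phase: each "map task" writes its output into R regions.
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits[:M]:
        for k, v in map_fn(split):
            regions[partition(k)][k].append(v)

    # Reduce phase: each "reduce task" sorts its region by key and reduces each group.
    output = {}
    for r in range(R):
        for k in sorted(regions[r]):
            output[k] = sum(regions[r][k])  # user-defined Reduce: sum the values

    print(output)  # {'a': 3, 'b': 2, 'c': 3}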

  13. Programming • How to write a MapReduce program to generate inverted indices? To sort? • How to express more sophisticated logic? • What if some workers (slaves) or the master fail?

  14. Workflow • Initial data split into 64MB blocks • Map tasks computed, results locally stored • Master informed of result locations • R reducers retrieve data from mappers • Final output written • Where is the communication-intensive part?

  15. Data Storage – Key-Value Store • Distributed, scalable storage for key-value pairs • Example: Dynamo (Amazon) • Another example may be P2P storage (e.g., Chord) • Key-value store can be a general foundation for more complex data structures • But performance may suffer

  16. Data Storage – Key-Value Store • Dynamo: a decentralized, scalable key-value store • Used in Amazon • Uses consistent hashing to distribute data among nodes • Replication, versioning, load balancing • Easy-to-use interface: put()/get()
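A minimal Python sketch of consistent hashing in the spirit of Dynamo's put()/get() interface. Replication, versioning, and virtual nodes are omitted, and the node names and use of MD5 are illustrative assumptions rather than Dynamo's actual implementation:

    import hashlib
    from bisect import bisect_right

    def ring_hash(key):
        # Hash a string onto the ring (MD5 is just an illustrative choice).
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes):
            # Place each node at a point on the ring, kept sorted by hash value.
            self.nodes = sorted(nodes, key=ring_hash)
            self.positions = [ring_hash(n) for n in self.nodes]
            self.store = {n: {} for n in nodes}

        def node_for(self, key):
            # A key is owned by the first node clockwise from the key's position.
            i = bisect_right(self.positions, ring_hash(key))
            return self.nodes[i % len(self.nodes)]

        def put(self, key, value):
            self.store[self.node_for(key)][key] = value

        def get(self, key):
            return self.store[self.node_for(key)].get(key)

    ring = Ring(["node-a", "node-b", "node-c"])
    ring.put("cart:42", {"items": 3})
    print(ring.node_for("cart:42"), ring.get("cart:42"))

Because only the keys between a node and its predecessor on the ring move when membership changes, adding or removing a node disturbs only a small fraction of the data.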

  17. Data Storage – Network Block Device • Networked block storage • ND by SUN Microsystems • Remote block storage over Internet • Use S3 as a block device [Brantner] • Block-level remote storage may become slow in networks with long latencies

  18. Data Storage – Traditional File Systems • PC file systems • Link together all clusters of a file • Directory entry: filename, attributes, date/time, starting cluster, file size • Boot sector (superblock): file-system-wide information • File allocation table, root directory, … • On-disk layout: boot sector | FAT 1 | FAT 2 (duplicate) | root directory | normal directories and files
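A small Python sketch of how a FAT-style file system links together the clusters of a file: the directory entry stores the starting cluster, and the file allocation table maps each cluster to the next one (the table contents below are made up for illustration):

    EOC = -1  # end-of-chain marker (real FAT uses a reserved range of values)

    # fat[c] gives the cluster that follows cluster c in whatever file owns it.
    fat = {5: 6, 6: 9, 9: EOC,   # one file occupying clusters 5 -> 6 -> 9
           2: 3, 3: EOC}         # another file occupying clusters 2 -> 3

    def clusters_of(start_cluster):
        # Follow the chain from the directory entry's starting cluster to end-of-chain.
        chain, c = [], start_cluster
        while c != EOC:
            chain.append(c)
            c = fat[c]
        return chain

    print(clusters_of(5))  # [5, 6, 9]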

  19. Data Storage – Network File System • NFS—Network File System [Sandberg] • Designed by SUN Microsystems in the 1980’s • Transparent remote access to files stored remotely • XDR, RPC, VNode, VFS • Mountable file system, synchronous behavior • Stateless server

  20. Data Storage – Network File System • NFS organization (figure: client and server)

  21. Data Storage – Google File System (GFS) • A distributed file system at work (GFS) • A single master and numerous slaves communicate with each other • The file data unit, a "chunk", is up to 64MB; chunks are replicated • The "master" is a single point of failure and a scalability bottleneck; the consistency model is difficult to use
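A one-line Python sketch of how a byte offset within a file maps to a chunk index when chunks are 64MB; the client then asks the master which chunkservers hold that chunk (the function name is illustrative):

    CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks

    def chunk_index(byte_offset):
        # The chunk holding a given byte is simply the offset divided by the chunk size.
        return byte_offset // CHUNK_SIZE

    print(chunk_index(0), chunk_index(200 * 1024 * 1024))  # 0 3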

  22. Data Storage – Database • PNUTS – a relational database service designed and used by Yahoo! • Structured schema, e.g., CREATE TABLE Parts ( ID VARCHAR, StockNumber INT, Status VARCHAR … ) • Indexes and views • Replication • Parallel database

  23. MapReduce/Hadoop • Around 2004, Google invented MapReduce to parallelize computation over large data sets. It has been a key component in Google's technology foundation • Around 2008, Yahoo! developed the open-source variant of MapReduce named Hadoop • After 2008, MapReduce/Hadoop became a key technology component in cloud computing • In 2010, the U.S. granted Google a patent on MapReduce

  24. MapReduce—Limitations • MapReduce provides an easy-to-use framework for parallel programming, but is it the most efficient and best solution to program execution in datacenters? • MapReduce has its discontents • DeWitt and Stonebraker: “MapReduce: A major step backwards” – MapReduce is far less sophisticated and efficient than parallel query processing • MapReduce is a parallel processing framework, not a database system, nor a query language • It is possible to use MapReduce to implement some of the parallel query processing functions • What are the real limitations? • Inefficient for general programming (and not designed for that) • Hard to handle data with complex dependence, frequent updates, etc. • High overhead, bursty I/O, difficult to handle long streaming data • Limited opportunity for optimization

  25. Critiques • "MapReduce: A major step backwards" – David J. DeWitt and Michael Stonebraker • (MapReduce) is: • A giant step backward in the programming paradigm for large-scale data-intensive applications • A sub-optimal implementation, in that it uses brute force instead of indexing • Not novel at all • Missing features • Incompatible with all of the tools DBMS users have come to depend on

  26. MapReduce—Limitations • Inefficient for general programming (and not designed for that) • Hard to handle data with complex dependence, frequent updates, etc. • High overhead, bursty I/O • Experience with developing a Hadoop-based distributed compiler • Workload: compile Linux kernel • 4 machines available to Hadoop for parallel compiling • Observation: parallel compiling on 4 nodes with Hadoop can be even slower than sequential compiling on one node

  27. Re-thinking MapReduce • Proprietary solution developed in an environment with one prevailing application (web search) • The assumptions introduce several important constraints in data and logic • Not a general-purpose parallel execution technology • Design choices in MapReduce • Optimizes for throughput rather than latency • Optimizes for large data set rather than small data structures • Optimizes for coarse-grained parallelism rather than fine-grained

  28. MRlite: Lightweight Parallel Processing • A lightweight parallelization framework following the MapReduce paradigm • Implemented in C++ • More than just an efficient implementation of MapReduce • Goal: a lightweight “parallelization” service that programs can invoke during execution • MRlite follows several principles • Memory is media—avoid touching hard drives • Static facility for dynamic utility—use and reuse threads for map tasks
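The two principles above can be illustrated with a small Python sketch. It is purely hypothetical and not the actual MRlite C++ API: intermediate results stay in memory, and a fixed thread pool is created once and reused across successive map invocations:

    from concurrent.futures import ThreadPoolExecutor

    # "Static facility for dynamic utility": create the worker threads once, reuse them.
    pool = ThreadPoolExecutor(max_workers=8)

    def parallel_map(func, items):
        # "Memory is media": results are collected in memory, never spilled to disk.
        return list(pool.map(func, items))

    # A program can invoke the parallelization service repeatedly during execution.
    print(parallel_map(lambda x: x * x, range(5)))     # [0, 1, 4, 9, 16]
    print(parallel_map(str.upper, ["map", "reduce"]))  # ['MAP', 'REDUCE']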

  29. MRlite: Towards Lightweight, Scalable, and General Parallel Processing • The MRlite master accepts jobs from clients and schedules them to execute on slaves • Distributed nodes (slaves) accept tasks from the master and execute them • Linked together with the app, the MRlite client library accepts calls from the app and submits jobs to the master • High-speed distributed storage holds intermediate files • (Architecture figure: application + MRlite client, MRlite master/scheduler, slaves, and high-speed distributed storage, with data flow and command flow)

  30. Computing Capability • Z. Ma and L. Gu. The Limitation of MapReduce: A Probing Case and a Lightweight Solution. CLOUD COMPUTING 2010 • Using MRlite, the parallel compilation tool, mrcc, runs 10 times faster than it does on Hadoop!

  31. Inside MapReduce-Style Computation Network activities under MapReduce/Hadoop workload • Hadoop: open-source implementation of MapReduce • Processing data with 3 servers (20 cores) • 116.8GB input data • Network activities captured with Xen virtual machines

  32. Workflow • Initial data split into 64MB blocks • Map tasks computed, results locally stored • Master informed of result locations • R reducers retrieve data from mappers • Final output written • Where is the communication-intensive part?

  33. Inside MapReduce • Packet reception under MapReduce/Hadoop workload • Large data volume • Bursty network traffic • Generality – widely observed in MapReduce workloads • Packet reception on a slave server (figure)

  34. Inside MapReduce Packet reception on the master server

  35. Inside MapReduce Packet transmission on the master server

  36. Datacenter Networking Major Components of a Datacenter • Computing hardware (equipment racks) • Power supply and distribution hardware • Cooling hardware and cooling fluid distribution hardware • Network infrastructure • IT Personnel and office equipment

  37. Growth Trends in Datacenters Datacenter Networking • Load on network & servers continues to rapidly grow • Rapid growth: a rough estimate of annual growth rate: enterprise data centers: ~35%, Internet data centers: 50% - 100% • Information access anywhere, anytime, from many devices • Desktops, laptops, PDAs & smart phones, sensor networks, proliferation of broadband • Mainstream servers moving towards higher speed links • 1-GbE to 10-GbE in 2008-2009 • 10-GbE to 40-GbE in 2010-2012 • High-speed datacenter-MAN/WAN connectivity • High-speed datacenter syncing for disaster recovery

  38. Datacenter Networking • A large part of the total cost of the DC hardware • Large routers and high-bandwidth switches are very expensive • Relatively unreliable – many components may fail. • Many major operators and companies design their own datacenter networking to save money and improve reliability/scalability/performance. • The topology is often known • The number of nodes is limited • The protocols used in the DC are known • Security is simpler inside the data center, but challenging at the border • We can distribute applications to servers to distribute load and minimize hot spots

  39. Datacenter Networking • Networking components (examples) • High-performance & high-density switches & routers • Scaling to 512 10GbE ports per chassis • No need for proprietary protocols to scale • Example: 64 10-GbE ports upstream, 768 1-GbE ports downstream • Highly scalable DC border routers • 3.2 Tbps capacity in a single chassis • 10 million routes, 1 million in hardware • 2,000 BGP peers • 2K L3 VPNs, 16K L2 VPNs • High port density for GE and 10GE application connectivity • Security

  40. Datacenter Networking • Common data center topology: Internet → layer-3 routers (core) → layer-2/3 switches (aggregation) → layer-2 switches (access) → servers

  41. Datacenter Networking Data center network design goals • High network bandwidth, low latency • Reduce the need for large switches in the core • Simplify the software, push complexity to the edge of the network • Improve reliability • Reduce capital and operating cost

  42. Data Center Networking Avoid this… and simplify this…

  43. Interconnect Can we avoid using high-end switches? • Expensive high-end switches to scale up • Single point of failure and bandwidth bottleneck • Experiences from real systems ? • One answer: DCell

  44. Interconnect DCell Ideas • #1: Use mini-switches to scale out • #2: Leverage servers to be part of the routing infrastructure • Servers have multiple ports and need to forward packets • #3: Use recursion to scale and build a complete graph to increase capacity (see the sketch below)
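As a rough sketch of idea #3, the published DCell construction builds a level-k DCell out of t(k-1)+1 level-(k-1) DCells wired as a complete graph, so the number of servers grows roughly doubly exponentially with the level (the code below only counts servers under that recurrence):

    def dcell_servers(n, k):
        # A DCell_0 is n servers attached to one mini-switch.
        # A DCell_k is built from (t_{k-1} + 1) copies of DCell_{k-1} wired as a
        # complete graph, so the server count follows t_k = t_{k-1} * (t_{k-1} + 1).
        t = n
        for _ in range(k):
            t = t * (t + 1)
        return t

    # With 4-port mini-switches (n = 4) the count explodes as the level k grows:
    print([dcell_servers(4, k) for k in range(4)])  # [4, 20, 420, 176820]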

  45. Data Center Networking One approach: switched network with a hypercube interconnect • Leaf switch: 40 1Gbps ports + 2 10Gbps ports • One switch per rack • Not replicated (if a switch fails, lose one rack of capacity) • Core switch: 10 10Gbps ports • Core switches form a hypercube • Hypercube – the n-dimensional analogue of a cube

  46. Interconnect Hypercube properties • Minimum hop count • Even load distribution for all-to-all communication • Can route around switch/link failures • Simple routing: Outport = f(Dest xor NodeNum) • No routing tables
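A minimal Python sketch of the table-free routing rule on the slide: XOR the current node number with the destination and exit on a port corresponding to one of the differing bits (fixing the lowest differing bit first is one common choice; the function names are illustrative):

    def outport(node, dest):
        # Outport = f(dest XOR node): pick the lowest dimension in which the labels differ.
        diff = node ^ dest
        if diff == 0:
            return None                          # already at the destination
        return (diff & -diff).bit_length() - 1   # index of the lowest set bit

    def route(src, dest):
        # Each hop fixes one differing bit, so the path length equals the Hamming
        # distance between src and dest, i.e., the minimum hop count.
        path, node = [src], src
        while node != dest:
            node ^= 1 << outport(node, dest)
            path.append(node)
        return path

    print(route(0b0000, 0b1011))  # [0, 1, 3, 11] in a dimension-4 hypercube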

  47. Interconnect A 16-node (dimension 4) hypercube

  48. Interconnect • Leaf switch: 1Gbps port x 40 + 10Gbps port x 2 • Core switch: 10Gbps port x 10 • How many servers can be connected in this system? 81,920 servers, each with 1Gbps bandwidth
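A small arithmetic sketch of the leaf layer implied by these numbers; how the 2 x 10Gbps uplinks per leaf are wired into the core hypercube is not spelled out on the slide, so only the leaf-level counts are computed:

    servers          = 81920   # stated system size, 1Gbps per server
    servers_per_leaf = 40      # 40 x 1Gbps downlinks per leaf switch
    uplinks_per_leaf = 2       # 2 x 10Gbps uplinks per leaf switch

    leaf_switches = servers // servers_per_leaf       # 2048 leaf switches (one per rack)
    core_links    = leaf_switches * uplinks_per_leaf  # 4096 x 10Gbps links into the core
    oversub       = (servers_per_leaf * 1) / (uplinks_per_leaf * 10)  # 2:1 oversubscription

    print(leaf_switches, core_links, oversub)  # 2048 4096 2.0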

  49. Data Center Networking The Black Box

  50. Data Center Network Shipping Container as Data Center Module • Data Center Module • Contains network gear, compute, storage, & cooling • Just plug in power, network, & chilled water • Increased cooling efficiency • Water & air flow • Better air flow management • Meet seasonal load requirements
