Lecture 12Scalable Computing Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National Taiwan University
Scalable Internet Services • Lessions from Giant-Scale Serviceshttp://www.computer.org/internet/ic2001/w4046abs.htm • Access anywhere, anytime. • Availability via multiple devices. • Groupware support. • Lower overall cost. • Simplified service updates.
Network Interface • A simple network connecting two machines • Message
Network Topologies • Relative performance for 64 nodes
Load Management • Balancing loads (load balancer) • Round-robin DNS • Layer-4 (Transport layer, e.g. TCP) switches • Layer-7 (Application layer) switches
The 7 OSI (Open System Interconnection) Layers • Application (Layer 7) This layer supports application and end-user processes. Communication partners are identified, quality of service is identified, user authentication and privacy are considered, and any constraints on data syntax are identified. Everything at this layer is application-specific. file transfers, e-mail, and other networksoftware services. Telnet and FTP. • Presentation (Layer 6) This layer provides independence from differences in data representation (e.g., encryption) by translating from application to network format, and vice versa. The presentation layer works to transform data into the form that the application layer can accept. This layer formats and encrypts data to be sent across a network, providing freedom from compatibility problems. It is sometimes called the syntax layer. • Session (Layer 5) This layer establishes, manages and terminates connections between applications. The session layer sets up, coordinates, and terminates conversations, exchanges, and dialogues between the applications at each end. It deals with session and connection coordination.
The 7 OSI (Open System Interconnection) Layers • Transport (Layer 4) This layer provides transparent transfer of data between end systems, or hosts, and is responsible for end-to-end error recovery and flow control. It ensures complete data transfer. TCP/IP. • Network (Layer 3) This layer provides switching and routing technologies, creating logical paths, known as virtual circuits, for transmitting data from node to node. Routing and forwarding are functions of this layer, as well as addressing, internetworking, error handling, congestion control and packet sequencing. • Data Link (Layer 2) At this layer, data packets are encoded and decoded into bits. It furnishes transmission protocol knowledge and management and handles errors in the physical layer, flow control and frame synchronization. The data link layer is divided into two sublayers: The Media Access Control (MAC) layer and the Logical Link Control (LLC) layer. The MAC sublayer controls how a computer on the network gains access to the data and permission to transmit it. The LLC layer controls frame synchronization, flow control and error checking. • Physical (Layer 1) This layer conveys the bit stream - electrical impulse, light or radio signal -- through the network at the electrical and mechanical level. It provides the hardware means of sending and receiving data on a carrier, including defining cables, cards and physical aspects. Fast Ethernet, RS232, and ATM are protocols with physical layer components.
High Availability • High availability is a major driving requirement behind giant-scale system design. • Uptime: typically measured in nines, and traditional infrastructure systems such as the phone system aim for four or five nines (“four nines” implies 0.9999 uptime, or less than 60 seconds of downtime per week). • Meantime-between-failure (MTBF) • Mean-time-to-repair (MTTR) • uptime = (MTBF – MTTR)/MTBF • yield = queries completed/queries offered • harvest = data available/complete data • DQ Principle: Data per query × queries per second →constant • Graceful Degradation
Clusters in Giant-Scale Services • Scalability • Cost/performance • Independent components
Lesson Learned • Get the basics right. Start with a professional data center and layer-7 switches, and use symmetry to simplify analysis and management. • Decide on your availability metrics. Everyone should agree on the goals and how to measure them daily. Remember that harvest and yield are more useful than just uptime. • Focus on MTTR at least as much as MTBF. Repair time is easier to affect for an evolving system and has just as much impact. • Understand load redirection during faults. Data replication is insufficient for preserving uptime under faults; you also need excess DQ. • Graceful degradation is a critical part of a high-availability strategy. Intelligent admission control and dynamic database reduction are the key tools for implementing the strategy. • Use DQ analysis on all upgrades. Evaluate all proposed upgrades ahead of time, and do capacity planning. • Automate upgrades as much as possible. Develop a mostly automatic upgrade method, such as rolling upgrades. Using a staging area will reduce downtime, but be sure to have a fast, simple way to revert to the old version.
Deep Scientific ComputingKramer et. al., IBM J. R&D March 2004 • High-performance computing (HPC) • Resolution of a simulation • Complexity of an analysis • Computational power • Data storage • New paradigms of computing • Grid computing • Network
Themes (1/2) • Deep science applications must now integrate simulation with data analysis. In many ways this integration is inhibited by limitations in storing, transferring, and manipulating the data required. • Very large, scalable, high-performance archives, combining both disk and tape storage, are required to support this deep science. These systems must respond to large amounts of data—both many files and some very large files. • High-performance shared file systems are critical to large systems. The approach here separates the project into three levels—storage systems, interconnect fabric, and global file systems. All three levels must perform well, as well as scale, in order to provide applications with the performance they need. • New network protocols are necessary as the data flows are beginning to exceed the capability of yesterdays protocols. A number of elements can be tuned and improved in the interim, but long-term growth requires major adjustments.
Themes (2/2) • Data management methods are key to being able to organize and find the relevant information in an acceptable time. • Security approaches are needed that allow openness and service while providing protection for systems. The security methods must understand not just the application levels but also the underlying functions of storage and transfer systems. • Monitoring and control capabilities are necessary to keep pace with the system improvements. This is key, as the application developers for deep computing must be able to drill through virtualization layers in order to understand how to achieve the needed performance.
Networking for HPC Systems • End-to-end network performance is a product of • Application behavior • Machine capabilities • Network path • Network protocol • Competing traffic • Difficult to ascertain the limiting factor without monitoring/diagnostic capabilities • End host issues • Routers and gateways • Deep Security
End Host Issues • Throughput limit • Time to copy data from user memory to kernel across memory bus (2 memory cycles) • Time to copy from kernel to NIC (1 I/O cycle) • If limited by memory BW:
Memory & I/O Bandwidth • Memory BW • DDR: 650-2500 MB/s • I/O BW • 32-bit/33Mhz PCI: 132 MB/s • 64-bit/33Mhz PCI: 264 MB/s • 64-bit/66Mhz PCI: 528 MB/s • 64-bit/133Mhz PCI-X: 1056 MB/s • PCI-E x1: ~1 Gbit/s • PCI-E x16: ~16 Gbit/s
Network Bandwidth • VT600, 32-bit/33mhz PCI, DDR400, AMD2700+, 850 MB/s memory BW • 485 Mbit/s • 64-bit/133Mhz PCI-X, 1100-2500 MB/s memory BW • Limited to 5000 Mbit/s • Also limited by DMA overhead • Only reach half of 10Gb NIC • I/O architecture • On-chip NIC? • OS architecture • Reduce number of memory copy? Zero copy? • TCP/IP overhead • TCP/IP offload • Maximum Transfer Unit (MTU)
Conclusion • High performance storage and network • End host performance • Data management • Security • Monitoring and control
Science-driven System Architecture • Leadership Computing Systems • Processor performance • Interconnect performance • Software: scalability & optimized lib • Blue Planet • Redesigned Power5-based HPC system • Single core node • High-memory bandwidth per processor • ViVA (Virtual Vector Architecture) allows the eight processors in a node to be treated as a single processor with peak performance of 60+ Gigaflop/s.
ViVA-2: Application Accelerator • Accelerates particular application-specific or domain-specific features. • Irregular access patterns • High load/store issue rates • Low cache line utilization • ISA enhancement • Inst to support prefetch irregular data access • Inst to upport sparse, non-cache-resident loads • More registers for SW pipelining • Inst to initiate many dense/indexed/sparce loads • Proper compiler support will be a critical component
Leadership Computing Applications • Major computational advances • Nanoscience • Combustion • Fusion • Climate • Life Sciences • Astrophysics • Teamwork • Project team • Facilities • Computational scientist
Supercomputers 1993-2000 • Clusters vs MPPs
Clusters • Cost-performance
Google • Built with lots of PC’s • 80 PC’s in one rack
Google • Performance • Latency: <0.5s • Bandwidth, scaled with # of users • Cost • Cost of PC keeps shrinking • Switches, Rack, etc. • Power • Reliability • Software failure • Hardware failure (1/10 of SW failure) • DRAM (1/3) • Disks (2/3) • Switch failure • Power outage • Network outage