Parallel Scalable Operating Systems Presented by Dr. Florin IsailaUniversidad Carlos III de MadridVisiting Scholar Argonne National Lab
Contents • Preliminaries • Top500 • Scalability • Blue Gene • History • BG/L, BG/C, BG/P, BG/Q • Scalable OS for BG/L • Scalable file systems • BG/P at Argonne National Lab • Conclusions
Generalities • Since 1993 twice a year: June and November • Ranking of the most powerful computing systems in the world • Ranking criteria: performance of the LINPACK benchmark • Jack Dongarra alma máter • Site web: www.top500.org
HPL: High-Performance Linpack • solves a dense system of linear equations • Variant of LU factorization of matrices of size N • measure of a computer’s floating-point rate of execution • computation done in 64 bit floating point arithmetic • Rpeak : theoretic system performance • upper bound for the real performance (in MFLOP) • Ex: Intel Itanium 2 at 1.5 GHz 4 FP/s -> 6GFLOPS • Nmax: obtained by varying N and choosing the maximum performance • Rmax : maximum real performance achieved for Nmax • N1/2: size of problem needed to achieve ½ ofRmax
Amdahl´s law • Suppose a fraction f of your application is not parallelizable • 1-f : parallelizable on p processors Speedup(P) = T1 /Tp <= T1/(f T1 + (1-f) T1 /p) = 1/(f + (1-f)/p) <= 1/f
Sequential Work Speedup ≤ Max Work on any Processor Load Balance • Work: data access, computation • Not just equal work, but must be busy at same time • Ex: Speedup ≤1000/400 = 2.5
Sequential Work Speedup < Max (Work + Synch Wait Time + Comm Cost) Communication and synchronization • Communication is expensive! • Measure: communication to computation ratio • Inherent communication • Determined by assignment of tasks to processes • Actual communication may be larger (artifactual) • One principle: Assign tasks that access same data to same process Process 1 Process 2 Process 3 Communication Work Synchronization point Synchronization wait time
Blue Gene partners • IBM • “Blue”: The corporate color of IBM • “Gene”: The intended use of the Blue Gene clusters – Computational biology, specifically, protein folding • Lawrence Livermore National Lab • Department of Energy • Academia
Family • BG/L • BG/C • BG/P • BG/Q
System Blue Gene/L 64 Racks, 64x32x32 Rack 32 Node Cards Node Card 180/360 TF/s 32 TB (32 chips 4x4x2) 16 compute, 0-2 IO cards 2.8/5.6 TF/s 512 GB Compute Card 2 chips, 1x2x1 90/180 GF/s 16 GB Chip 2 processors 5.6/11.2 GF/s 1.0 GB 2.8/5.6 GF/s 4 MB
Technical specifications • 64 cabinets which contain 65.536 high-performance compute nodes (chips) • 1.024 I/O nodes. • 32-bit PowerPC processors • 5 networks • The main memory has a size of 33 terabytes. • Maximum performance of 183.5 TFLOPS when using one processor for computation and the other one for communication, and 367 TFLOPS if using both for computation.
Blue Gene / L • Networks: • 3D Torus • Collective Network • Global Barrier/Interrupt • Gigabit Ethernet (I/O & Connectivity) • Control (system boot, debug, monitoring)
Networks • Three dimensional torus • - Compute nodes • Global tree • - collective communication • - I/O • Ethernet • Control network
Three-dimensional (3D) torus network in which the nodes (red balls) are connected to their six nearest-neighbor nodes in a 3D mesh.
Blue Gene / L • Processor: PowerPC 440 700Mhz • Low power allows dense packaging • External Memory: 512MB SDRAM per node / 1GB • Slow embedded core at a clock speed of 700 Mhz • 32 KB L1 cache • L2 is a small prefetch buffer • 4MB Embedded DRAM L3 cache
BG/L compute ASIC • Non-cache coherent L1 • Pre-fetch buffer L2 • Shared 4MB DRAM (L3) • Interface to external DRAM • 5 network interfaces • Torus, collective, global barrier, Ethernet, control
Blue Gene / L • Compute Nodes: • Dual processor, 1024 per Rack • I/O Nodes: • Dual processor, 16-128 per Rack
Blue Gene / L • Compute Nodes: • Proprietary kernel (tailored to processor design) • I/O Nodes: • Embedded Linux • Front-end and service nodes: • Suse SLES 9 Linux (familiarity with users)
Blue Gene / L • Performance: • Peak performance per rack: 5,73 TFlops • Linpack performance per rack: 4,71 TFlops
Blue Gene / C • a.k.a Cyclops64 • massively parallel (first supercomputer on a chip) • Processors with a 96 port, 7 stage non-internally blocking crossbar switch. • Theoretical peak performance (chip): 80 GFlops
Blue Gene / C • Cellular architecture • 64-bit Cyclops64 chip: • 500 Mhz • 80 processors ( each has 2 thread units and a FP unit) • Software • Cyclops64 exposes much of the underyling hardware to the programmer, allowing the programer to write very high performance, finely tuned software.
Blue Gene / C • Picture of BG/C • Performances: • Board: 320 GFlops • Rack: 15,76 Tflops • System: 1,1 PFlops
Blue Gene / P • Similar Architecture to BG/L, but • Cache coherent L1 cache • 4 cores per nodes • 10 Gbit Ethernet external IO infrastructure • Scales upto 3-PFLOPS • More energy efficient • 167TF/s by 2007, 1PF by 2008
Blue Gene / Q • Continuation of Blue Gene/L and /P • Targeting 10PF/s by 2010/2011 • Higher freq at similar performance / watt • Similar number of nodes • Many more cores • More generally useful • Aggressive compiler • New network: Scalable and cheap
Motivationfor a scalable OS • Blue Gene/L is currently the world’s fastest and most scalable supercomputer • Several system components contribute to that scalability. • The Operating Systems for the different nodes of Blue Gene/L are among the components responsible for that scalability. • The OS overhead on one node affects the scalability of the whole system • Goal: design a scalable solution for the OS.
High level view of BG/L • Principle: the structure of the software should reflect the structure of the hardware.
BG/L Partitioning • Space-sharing • Divided along natural boundaries into partitions • Each partition can run only one job • Each node can be in one of this modes • Coprocessor: one processor assists the other • Virtual node: two separate processors, each of them with its own memory space
OS • Compute nodes: dedicated OS • I/O nodes: dedicated OS • Service nodes: conventional off-the-shelf OS • Front-end nodes: program compilation, debug, submit • File servers: store data , no specific for BG/L
BG/L OS solution • Components: I/O, service nodes, CNK • The compute and I/O nodes organized into logical entities called processing sets or psets: 1 I/O node + a collection of CNs • 8, 16, 64, 128 CNs • Logical concept • Should reflect physical proximity => fast communication • Job: collection of N compute processes (on CNs) • Own private address space • Message passing • MPI: ranks 0, N-1
BG/L OS solution:CNK • Compute node: run only compute processes an all the compute nodes of a particular partition can execute in two different modes: • Coprocessor mode • Virtual node mode • Compute Node Kernel (CNK): simple OS • Creates an address spaces • Load code and initialize data • Transfer processor control to the loaded executable
CNK • consumes 1MB • Creates either • One address space of 511/1023MB • 2 address spaces of 255/511MB • No virtual memory, no paging • The entire mapping fits into the TLB of PowerPC • Load in push mode: 1 CN reads the executable from FS and sends to all the others • One image loaded and then stays out of the way!!!
CNK • No OS scheduling (one thread) • No memory management (No TLB overhead) • No local file services • User level execution until: • Process requests a system call • Hardware interrupts: timer (requested by application), abnormal events • Syscall • Simple: handled locally (getting the time, set an alarm) • Complex: forward to I/O nodes • Unsupported (fork/mmap): error
Benefits of the simple solution • Robustness: simple design, implementation, test, debugging • Scalability: no interference among compute nodes • Low system noise • Performance measurements
I/O node • Two roles in Blue Gene/L: • Act as an effective master of its corresponding pset • To offer services request from compute nodes in its pset • Mainly I/O operations on locally mounted FSs • Only one processor used: due to the lack of memory coherency • Executes an embedded version of the Linux operating system: • Does not use any swap space • it has an in-memory root file system • it uses little memory • lacks the majority of LINUX daemons.
I/O node • Complete TCP/IP stack • Supported FS: NFS, GPFS, Lustre, PVFS • Main process: Control and I/O daemon (CIOD) • Launch a job • Job manager sends the request to the service node • Service node contacts the CIOD • CIOD sends the executable to all processes in pset
Service nodes • run the Blue Gene/L control system. • Tight integration with CNs and IONs • CN and IONs: stateless, no persistent memory • Responsible for operation and monitoring the CNs and I/ONs • Creates system partitions and isolates it • Computes network routing for torus, collective and global interrupt networks • loads OS code for CNs and I/ONs
Problems • Not fully POSIX compliant • Many applications need • Process/thread creation • Full server sockets • Shared memory segments • Memory mapped files
File systems for BG systems • Need for scalable file systems: NFS is not a solution • Most supercomputers and clusters in top 500 use one of these parallel file systems • GPFS • Lustre • PVFS2
GPFS/PVFS/Lustre mounted on the I/O nodes File system servers