
Presentation Transcript


  1. Overview of the Blue Gene supercomputers. Dr. Dong Chen, IBM T.J. Watson Research Center, Yorktown Heights, NY

  2. Supercomputer trends • Blue Gene/L and Blue Gene/P architecture • Blue Gene applications
  Terminology:
  • FLOPS = Floating-Point Operations Per Second; Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18
  • Peak speed vs. sustained speed
  • Top 500 list (top500.org): based on the Linpack benchmark, which solves the dense linear system A x = b, where A is an N x N dense matrix; total FP operations ~ 2/3 N^3 + 2 N^2
  • Green 500 list (green500.org): rates the Top 500 supercomputers in FLOPS/Watt
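
As a concrete illustration of these terms, the short C sketch below converts a Linpack run (problem size N and measured wall-clock time) into a sustained GFLOPS figure using the ~ 2/3 N^3 + 2 N^2 operation count quoted above; the N and time values are made-up placeholders, not measurements from any Blue Gene system:

    /* Estimate sustained Linpack performance from problem size and run time. */
    #include <stdio.h>

    int main(void)
    {
        double N = 100000.0;      /* hypothetical matrix dimension */
        double seconds = 3600.0;  /* hypothetical measured wall-clock time */

        double flops  = (2.0 / 3.0) * N * N * N + 2.0 * N * N;  /* total FP operations */
        double gflops = flops / seconds / 1e9;                  /* sustained rate */

        printf("%.3e FP ops in %.0f s -> %.1f GFLOPS sustained\n",
               flops, seconds, gflops);
        return 0;
    }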

  3. CMOS Scaling in the Petaflop Era
  • Three decades of exponential clock rate (and electrical power!) growth have ended
  • Instruction-level parallelism (ILP) growth has ended
  • Single-threaded performance improvement is dead (Bill Dally)
  • Yet Moore's Law continues in transistor count
  • Industry response: multi-core, i.e. double the number of cores every 18 months instead of the clock frequency (and power!)
  Source: "The Landscape of Computer Architecture," John Shalf, NERSC/LBNL, presented at ISC07, Dresden, June 25, 2007

  4. TOP500 Performance Trend
  Over the long haul IBM has demonstrated continued leadership in various TOP500 metrics, even as performance continues its relentless growth.
  • IBM has the most aggregate performance for the last 22 lists
  • IBM has the #1 system in 10 of the last 12 lists (13 in total)
  • IBM has the most systems in the Top 10 for the last 14 lists
  • IBM has the most systems in 14 of the last 22 lists
  [Chart: TOP500 performance over time; total aggregate performance 32.43 PF, #1 system 1.759 PF, #10 system 433.2 TF, #500 system 24.67 TF. Blue square markers indicate IBM leadership. Source: www.top500.org]

  5. President Obama Honors IBM's Blue Gene Supercomputer With National Medal Of Technology And Innovation
  Ninth time IBM has received the nation's most prestigious tech award; Blue Gene has led to breakthroughs in science, energy efficiency and analytics.
  WASHINGTON, D.C. - 18 Sep 2009: President Obama recognized IBM (NYSE: IBM) and its Blue Gene family of supercomputers with the National Medal of Technology and Innovation, the country's most prestigious award given to leading innovators for technological achievement. President Obama will personally bestow the award at a special White House ceremony on October 7. IBM, which earned the National Medal of Technology and Innovation on eight other occasions, is the only company recognized with the award this year.
  Blue Gene's speed and expandability have enabled business and science to address a wide range of complex problems and make more informed decisions -- not just in the life sciences, but also in astronomy, climate, simulations, modeling and many other areas. Blue Gene systems have helped map the human genome, investigated medical therapies, safeguarded nuclear arsenals, simulated radioactive decay, replicated brain power, flown airplanes, pinpointed tumors, predicted climate trends, and identified fossil fuels -- all without the time and money that would have been required to physically complete these tasks.
  The system also reflects breakthroughs in energy efficiency. With the creation of Blue Gene, IBM dramatically shrank the physical size and energy needs of a computing system whose processing speed would otherwise have required a dedicated power plant capable of powering thousands of homes. The influence of the Blue Gene supercomputer's energy-efficient design and computing model can be seen today across the Information Technology industry. Today, 18 of the top 20 most energy-efficient supercomputers in the world are built on IBM high performance computing technology, according to the latest Supercomputing 'Green500 List' announced by Green500.org in July 2009.

  6. Blue Gene Roadmap
  • BG/L (5.7 TF/rack), 130 nm ASIC (1999–2004 GA): 104 racks, 212,992 cores, 596 TF/s, 210 MF/W; dual-core system-on-chip; 0.5/1 GB per node
  • BG/P (13.9 TF/rack), 90 nm ASIC (2004–2007 GA): 72 racks, 294,912 cores, 1 PF/s, 357 MF/W; quad-core SoC with DMA; 2/4 GB per node; SMP support, OpenMP, MPI
  • BG/Q (209 TF/rack): targeting 20 PF/s

  7. Blue Gene Technology Roadmap
  [Roadmap chart, performance vs. year]
  • Blue Gene/L (PPC 440 @ 700 MHz, 2004): scalable to 595 TFlops
  • Blue Gene/P (PPC 450 @ 850 MHz, 2007): scalable to 3.56 PF
  • Blue Gene/Q (Power multi-core, 2010+): scalable to 100 PF
  Note: All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

  8. BlueGene/L System Buildup
  • Chip: 2 processors; 2.8/5.6 GF/s; 4 MB
  • Compute Card: 2 chips (1x2x1); 5.6/11.2 GF/s; 2.0 GB
  • Node Card: 16 compute cards, 0–2 I/O cards (32 chips, 4x4x2); 90/180 GF/s; 32 GB
  • Rack: 32 node cards; 2.8/5.6 TF/s; 1 TB
  • System: 64 racks (64x32x32); 180/360 TF/s; 64 TB
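
For readers tallying the numbers, the peak figures above compound as follows; the lower value of each pair corresponds to using one of the two cores for compute and the higher to using both (the usual coprocessor vs. virtual-node distinction, added here as an interpretation rather than quoted from the slide):

    2.8 GF/s per core           = 700 MHz x 4 flops/cycle (double FPU: 2 fused multiply-adds)
    2.8/5.6 GF/s per chip       = 1 or 2 cores x 2.8 GF/s
    5.6/11.2 GF/s per card      = 2 chips per compute card
    90/180 GF/s per node card   = 16 compute cards
    2.8/5.6 TF/s per rack       = 32 node cards
    180/360 TF/s per system     = 64 racks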

  9. BlueGene/L Compute ASIC

  10. Double Floating-Point Unit
  [Diagram: primary FPR P0–P31 and secondary FPR S0–S31 register files, fed by quadword load data and feeding quadword store data]
  • Two replicas of a standard single-pipe PowerPC FPU
  • 2 x 32 64-bit registers
  • Attached to the PPC440 core using the APU interface; instructions are issued across the APU interface, and instruction decode is performed in the Double FPU
  • Separate APU interface from the LSU provides up to 16 B of data per load or store
  • Datapath width is 16 bytes, feeding the two FPUs with 8 bytes each every cycle
  • Two FP multiply-add operations per cycle
  • 2.8 GF/s peak
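
The 2.8 GF/s figure is a small arithmetic consequence of the design (this check is added here, not part of the slide):

    2 multiply-add pipes x 2 flops per multiply-add x 700 MHz = 2.8 GFLOPS per core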

  11. Blue Gene/L Memory Characteristics
  Memory (per node; system totals for a 64K-node machine):
  • L1: 32 kB I / 32 kB D
  • L2: 2 kB per processor
  • SRAM: 16 kB
  • L3: 4 MB (ECC) per node
  • Main store: 512 MB (ECC) per node, 32 TB system-wide
  Bandwidth:
  • L1 to registers: 11.2 GB/s, independent read/write and instruction
  • L2 to L1: 5.3 GB/s, independent read/write and instruction
  • L3 to L2: 11.2 GB/s
  • Main memory (DDR): 5.3 GB/s
  Latency:
  • L1 miss, L2 hit: 13 processor cycles (pclks)
  • L2 miss, L3 hit: 28 pclks (eDRAM page hit/page miss)
  • L2 miss to main store: 75 pclks for DDR closed-page access (L3 disabled/enabled)

  12. Blue Gene Interconnection Networks
  3-Dimensional Torus
  • Interconnects all compute nodes (65,536)
  • Virtual cut-through hardware routing
  • 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  • Communications backbone for computations
  • 0.7/1.4 TB/s bisection bandwidth, 67 TB/s total bandwidth
  Global Collective Network
  • One-to-all broadcast functionality
  • Reduction operations functionality
  • 2.8 Gb/s of bandwidth per link; latency of tree traversal 2.5 µs
  • ~23 TB/s total binary tree bandwidth (64K machine)
  • Interconnects all compute and I/O nodes (1,024)
  Low-Latency Global Barrier and Interrupt
  • Round-trip latency 1.3 µs
  Control Network
  • Boot, monitoring and diagnostics
  Ethernet
  • Incorporated into every node ASIC
  • Active in the I/O nodes (1:64)
  • All external communication (file I/O, control, user interaction, etc.)
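
Applications typically reach the 3D torus through MPI's Cartesian topology routines, which let the MPI library place ranks onto the physical torus. Below is a minimal, generic C sketch (standard MPI only, not Blue Gene-specific code from this presentation); the grid dimensions are whatever MPI_Dims_create chooses, not an actual partition size:

    /* Create a periodic 3D Cartesian communicator so nearest-neighbor
     * exchanges can map onto a 3D torus. Standard MPI only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int dims[3] = {0, 0, 0};        /* let MPI factor nprocs into a 3D grid */
        MPI_Dims_create(nprocs, 3, dims);

        int periods[3] = {1, 1, 1};     /* periodic in all dimensions, like the torus */
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1 /* allow reorder */, &cart);

        int rank, coords[3], xminus, xplus;
        MPI_Comm_rank(cart, &rank);
        MPI_Cart_coords(cart, rank, 3, coords);
        MPI_Cart_shift(cart, 0, 1, &xminus, &xplus);  /* neighbors along X */

        printf("rank %d at (%d,%d,%d): -X neighbor %d, +X neighbor %d\n",
               rank, coords[0], coords[1], coords[2], xminus, xplus);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }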

  13. BlueGene/P System Buildup
  • Chip: 4 processors; 13.6 GF/s; 8 MB eDRAM
  • Compute Card: 1 chip, 20 DRAMs; 13.6 GF/s; 2.0 GB DDR2 (4.0 GB from 6/30/08)
  • Node Card: 32 compute cards, 0–1 I/O cards (32 chips, 4x4x2); 435 GF/s; 64 (128) GB
  • Rack: 32 node cards, cabled 8x8x16; 13.9 TF/s; 2 (4) TB
  • System: 72 racks (72x32x32); 1 PF/s; 144 (288) TB

  14. BlueGene/P Compute ASIC
  [Block diagram] Four PPC450 cores, each with 32 kB L1 instruction and 32 kB L1 data caches, a Double FPU, a snoop filter, and a private L2, connect through multiplexing switches to two shared 4 MB eDRAM banks (L3 cache or on-chip memory, each with a shared directory and ECC, 512-bit data + 72-bit ECC interfaces), a shared SRAM, an arbiter, a DMA engine, and a hybrid PMU with 256x64b SRAM. Two DDR-2 controllers with ECC drive the 13.6 GB/s DDR-2 DRAM bus. On-chip network interfaces: torus (6 links at 3.4 Gb/s bidirectional), collective (3 links at 6.8 Gb/s bidirectional), 4 global barriers or interrupts, 10 Gb Ethernet, and JTAG access.

  15. Blue Gene/P Memory Characteristics
  Memory (per node):
  • L1: 32 kB I / 32 kB D
  • L2: 2 kB per processor
  • L3: 8 MB (ECC) per node
  • Main store: 2–4 GB (ECC) per node
  Bandwidth:
  • L1 to registers: 6.8 GB/s instruction read, 6.8 GB/s data read, 6.8 GB/s write
  • L2 to L1: 5.3 GB/s, independent read/write and instruction
  • L3 to L2: 13.6 GB/s
  • Main memory (DDR): 13.6 GB/s
  Latency:
  • L1 hit: 3 processor cycles (pclks)
  • L1 miss, L2 hit: 13 pclks
  • L2 miss, L3 hit: 46 pclks (eDRAM page hit/page miss)
  • L2 miss to main store: 104 pclks for DDR closed-page access (L3 disabled/enabled)
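
A ratio worth noting (arithmetic added here, not stated on the slide): 13.6 GB/s of main-memory bandwidth against a 13.6 GF/s peak node gives about one byte of memory bandwidth per peak flop, the balance plotted later in the "Peak Memory Bandwidth per node (byte/flop)" chart:

    13.6 GB/s main memory bandwidth / 13.6 GF/s peak per node = 1 byte/flop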

  16. BlueGene/P Interconnection Networks
  3-Dimensional Torus
  • Interconnects all compute nodes (73,728)
  • Virtual cut-through hardware routing
  • 3.4 Gb/s on all 12 node links (5.1 GB/s per node)
  • 0.5 µs latency between nearest neighbors, 5 µs to the farthest node
  • MPI: 3 µs latency for one hop, 10 µs to the farthest node
  • Communications backbone for computations
  • 1.7/3.9 TB/s bisection bandwidth, 188 TB/s total bandwidth
  Collective Network
  • One-to-all broadcast functionality
  • Reduction operations functionality
  • 6.8 Gb/s of bandwidth per link per direction
  • Latency of one-way tree traversal 1.3 µs, MPI 5 µs
  • ~62 TB/s total binary tree bandwidth (72K machine)
  • Interconnects all compute and I/O nodes (1,152)
  Low-Latency Global Barrier and Interrupt
  • Latency of one way to reach all 72K nodes 0.65 µs, MPI 1.6 µs
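
The collective network is what backs global reductions and broadcasts on these machines. A minimal C sketch of the kind of operation it accelerates (standard MPI, nothing Blue Gene-specific; the local value is just illustrative):

    /* Global sum: the kind of reduction the collective network
     * accelerates in hardware. Standard MPI only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = (double)rank;   /* each rank's local contribution */
        double global = 0.0;

        /* One-to-all / all-to-one traffic maps onto the tree-shaped network. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }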

  17. November 2007 Green 500
  [Chart: Linpack energy efficiency in GFLOPS/W; plotted values include 0.09, 0.05 and 0.02 GFLOPS/W]

  18. IBM BG/P Relative Power, Space and Cooling Efficiencies (published specs per peak performance)

  19. System Power Efficiency Linpack GF/Watt Source: www.top500.org

  20. HPCC 2009
  IBM BG/P, 0.557 PF peak (40 racks):
  • Class 1: Number 1 on G-RandomAccess (117 GUPS)
  • Class 2: Number 1
  Cray XT5, 2.331 PF peak:
  • Class 1: Number 1 on G-HPL (1,533 TF/s)
  • Class 1: Number 1 on EP-Stream (398 TB/s)
  • Class 1: Number 1 on G-FFT (11 TF/s)
  Source: www.top500.org

  21. Main Memory Capacity per Rack

  22. Peak Memory Bandwidth per node (byte/flop)

  23. Main Memory Bandwidth per Rack

  24. Interprocessor Peak Bandwidth per node (byte/flop)

  25. Failures per Month per TF
  From: http://acts.nersc.gov/events/Workshop2006/slides/Simon.pdf

  26. Execution Modes in BG/P per Node
  (In the original diagram, hardware abstractions are shown in black and software abstractions in blue.)
  • SMP mode: 1 process per node, 1–4 threads per process
  • Dual mode: 2 processes per node, 1–2 threads per process
  • Quad mode (virtual node mode, VNM): 4 processes per node, 1 thread per process
  Context: next-generation HPC, many cores, expensive memory, two-tiered programming model (see the hybrid sketch after this slide).
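
A minimal hybrid MPI + OpenMP sketch of the two-tiered model, assuming SMP mode (one MPI process per node with up to 4 OpenMP threads); this is generic code, not taken from the Blue Gene software stack:

    /* One MPI process per node with OpenMP threads inside it.
     * In SMP mode on BG/P, OMP_NUM_THREADS would typically be set to 4. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            printf("MPI rank %d, OpenMP thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }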

  27. Blue Gene Software Hierarchical Organization
  • Compute nodes are dedicated to running the user application and almost nothing else: a simple compute node kernel (CNK)
  • I/O nodes run Linux and provide a more complete range of OS services: files, sockets, process launch, signaling, debugging, and termination
  • The service node performs system management services (e.g., partitioning, heartbeating, error monitoring), transparently to application software
  • Front-end nodes and the file system attach over the functional 10 Gb Ethernet; control uses 1 Gb Ethernet

  28. Noise measurements (from Adolfy Hoisie)

  29. Blue Gene/P System Architecture
  [Diagram] Compute nodes (C-Node 0 … C-Node n) run CNK with the user application and connect over the torus and collective (tree) networks to I/O nodes, which run Linux with ciod and a file-system client. The I/O nodes reach the file servers and front-end nodes over the functional Ethernet (10 Gb). The service node (system console, MMCS, DB2, LoadLeveler) manages the machine over the control Ethernet (1 Gb), reaching the hardware through FPGA/JTAG and I2C.

  30. BG/P Software Stack and Source Availability
  The stack spans the I/O and compute nodes and the service/front-end nodes, layered as application, user/scheduler, system, firmware, and hardware:
  • Application and runtime: MPI, GA, GPSHMEM, MPI-IO, message layer, ESSL, XL runtime, open toolchain runtime, HPC Toolkit, CIOD, totalviewd
  • User/scheduler: LoadLeveler, mpirun, Bridge API, ISV schedulers and debuggers, CSM, PerfMon, BG Nav
  • System: CNK on compute nodes, Linux kernel on I/O nodes, GPFS (1); high-level control system (MMCS) providing partitioning, job management and monitoring, RAS, administrator interfaces, CIODB; DB2
  • Firmware: messaging SPIs, node SPIs, common node services (hardware init, RAS, recovery, mailbox), low-level control system (power on/off, hardware probe, hardware init, parallel monitoring, parallel boot, mailbox), diags, bootloader
  • Hardware: compute nodes, I/O nodes, node cards, link cards, service card, service node (SN), front-end nodes (FEN)
  Source availability ranges from closed components (no source provided, not buildable), to closed but buildable source with no redistribution of derivative works allowed under license, to new open-source reference implementations and communities under the CPL with active IBM participation, to existing open-source communities under various licenses to which BG code will be contributed and/or new sub-communities started.
  Note: (1) GPFS does have an open build license available which customers may utilize.

  31. Areas Where BG Is Used
  • Weather/climate modeling (GOVERNMENT / INDUSTRY / UNIVERSITIES)
  • Computational fluid dynamics: airplane and jet engine design, chemical flows, turbulence (ENGINEERING / AEROSPACE)
  • Seismic processing (PETROLEUM, NUCLEAR INDUSTRY)
  • Particle physics: lattice gauge QCD
  • Systems biology: classical and quantum molecular dynamics (PHARMA / MED INSURANCE / HOSPITALS / UNIVERSITIES)
  • Modeling complex systems (PHARMA / BUSINESS / GOVERNMENT / UNIVERSITIES)
  • Large database search
  • Nuclear industry
  • Astronomy (UNIVERSITIES)
  • Portfolio analysis via Monte Carlo (BANKING / FINANCE / INSURANCE)

  32. LLNL Applications

  33. IDC Technical Computing Systems Forecast

  34. What is driving the need for more HPC cycles?
  Genome sequencing, biological modeling, materials science, drug discovery, fluid dynamics, financial modeling, climate modeling, geophysical data processing, pandemic research.

  35. HPC Use Cases
  Capability: calculations not possible on small machines. These calculations usually involve systems where many disparate scales are modeled: one scale defines the required work per computation step, while a different scale determines the total time to solution. Examples: protein folding (10^-15 s timescales up to ~1 s), refined grids in weather forecasting (10 km today -> 1 km in a few years), full simulation of the human brain. Useful as proofs of concept.
  Complexity: calculations which seek to combine multiple components to produce an integrated model of a complex system. Individual components can have significant computational requirements, and coupling between components requires that all components be modeled simultaneously, with changes at the interfaces constantly transferred between the components. Examples: water-cycle modeling in climate/environment, geophysical modeling for oil recovery, virtual fab, multisystem/coupled-systems modeling. Critical to manage multiple scales in physical systems.
  Understanding: repetition of a basic calculation many times with different model parameters, inputs and boundary conditions. The goal is to develop a clear understanding of the behavior, dependencies and sensitivities of the solution over a range of parameters. Examples: multiple independent simulations of hurricane paths to develop probability estimates of possible paths and strengths, thermodynamics of protein/drug interactions, sensitivity analysis in oil reservoir modeling, optimization of aircraft wing design. Essential to develop parameter understanding and sensitivity analysis.

  36. Capability

  37. Complexity: Modern Integrated Water Management
  Time horizons: historical – present – near future – seasonal – long term – far future.
  • Sensors: physical, chemical, biological, environmental; in-situ and remotely sensed; planning and placement
  • Partner ecosystem: climatologists, environmental observation systems companies, sensor companies, environmental sciences consultants, engineering services companies, subject matter experts, universities
  • Enabling IT: advanced water management reference IT architecture; HPC, visualization, data management
  • Model strategy: selection, integration and coupling, validation, temporal/spatial scales
  • Physical models: climate, hydrological, meteorological, ecological
  • Analyses: stochastic models and statistics, machine learning, optimization

  38. Overall Efficiencies of BG Applications - Major Scientific Advances
  (Percentages are fractions of peak; "L" and "P" denote Blue Gene/L and Blue Gene/P rack counts where given.)
  • Qbox (DFT), LLNL: 56.5%; 2006 Gordon Bell Award; 64 L racks, 16 P
  • CPMD, IBM: 30%; highest scaling 64 L
  • MGDC: highest scaling 32 P
  • ddcMD (classical MD), LLNL: 27.6%; 2005 Gordon Bell Award; 64 L
  • New ddcMD, LLNL: 17.4%; 2007 Gordon Bell Award; 104 L
  • MDCASK (LLNL), SPaSM (LANL): highest scaling 64 L
  • LAMMPS, SNL: highest scaling 64 L, 32 P
  • RXFF, GMD: highest scaling 64 L
  • Rosetta, UW: highest scaling 20 L
  • AMBER: 4 L
  • Quantum chromodynamics, CPS: 30%; 2006 Gordon Bell Special Award; 64 L, 32 P
  • MILC, Chroma: 32 P
  • sPPM (CFD), LLNL: 18%; highest scaling 64 L
  • Miranda, Raptor, LLNL: highest scaling 64 L
  • DNS3D: highest scaling 32 P
  • NEK5 (thermal hydraulics), ANL: 22%; 32 P
  • HYPO4D, PLB (lattice Boltzmann): 32 P
  • ParaDis (dislocation dynamics), LLNL: highest scaling 64 L
  • WRF (weather), NCAR: 10%; highest scaling 64 L
  • POP (oceanography): highest scaling 8 L
  • HOMME (climate), NCAR: 12%; highest scaling 32 L, 24Ki P
  • GTC (plasma physics), PPPL: 7%; highest scaling 20 L, 32 P
  • Nimrod, GA: 17%
  • FLASH (Type Ia supernova): highest scaling 64 L, 40 P
  • Cactus (general relativity): highest scaling 16 L, 32 P
  • DOCK5, DOCK6: highest scaling 32 P
  • Argonne v18 nuclear potential: 16%; 2010 Bonner Prize; 32 P
  • "Cat" brain: 2009 Gordon Bell Special Award; 36 P

  39. High Performance Computing Trends
  [Chart: aggregate performance over time, with markers at 1 PF in 2008 and 10 PF in 2011, divided into Past, Near Term and Long Term phases]
  Three distinct phases:
  • Past: exponential growth in processor performance, mostly through CMOS technology advances
  • Near term: exponential (or faster) growth in the level of parallelism
  • Long term: power cost = system cost; invention required
  The curve is indicative not only of peak performance but also of performance/$.
