
HPCF Architecture - 2

Norbert Kreitz, March 2004

Based on material provided by Daniel Boulet and IBM


Topics

  • superscalar attributes

  • computational units and their pipelines

  • memory hierarchy

  • floating point

For more details, see the POWER4 System Microarchitecture white paper at http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power4.html

(or search for "power4 architecture" in the Search window at www.ibm.com)


Superscalar architecture

  • each POWER4 processor has eight independent computational units

  • each computational unit can execute instructions in parallel with other computational units

  • This ability to execute up to 8 instructions in parallel is called superscalar execution.


POWER4 computational units

  • Each processor contains:

    • two FXUs (fixed point units)

    • two FPUs (floating point units)

    • two load/store units

    • a branching unit

    • a CR (condition register) unit


POWER4 instruction execution

  • Each processor can:

    • fetch up to eight instructions per cycle

    • complete up to five instructions per cycle

    • track over 200 instructions at any given time

    • perform out-of-order execution of independent groups of instructions

    • perform speculative execution

    • perform register renaming


Instruction execution rate vs duration

  • The POWER4's superscalar architecture means that instructions can complete at a rate of up to five per cycle, even though each individual instruction takes longer than one cycle to execute.


Pipelined architecture

  • Typical floating point instructions take six cycles but can complete at the rate of one instruction per cycle:
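
To make the arithmetic concrete (an illustrative calculation, not from the slide): with a six-stage pipeline, N independent floating point instructions finish in about N + 5 cycles, approaching one result per cycle for large N, whereas a chain of N dependent instructions takes about 6N cycles.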


Vector vs superscalar pipelines

  • Vector pipeline

    • uses pre-loaded vector registers as input

    • dependent operations not possible - code must be "vectorised"

  • Superscalar pipeline

    • uses pool of registers

    • result from one operation can be fed back as input to the next

    • dependent operations can cause pipeline stalls (see the sketch below)
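
A hedged C sketch of the difference (illustrative code, not from the presentation): the single-accumulator loop is one long dependence chain, so each add must wait for the previous one to leave the pipeline; splitting the sum across independent accumulators keeps the pipeline full.

    /* One dependence chain: every iteration waits for the previous sum. */
    double dot_chain(const double *a, const double *b, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];              /* each FMA depends on the last */
        return s;
    }

    /* Four independent chains: the FPU pipelines can stay full. */
    double dot_split(const double *a, const double *b, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)                 /* leftover elements */
            s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    }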


Amdahl's Law (1 of 2)

  • The performance of any system is constrained by the speed or capacity of the slowest point.

(diagram: a system constrained by input availability)


Amdahl's Law (2 of 2)

(diagram: a system constrained by the potential output rate)
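
In its classic quantitative form (the standard formula, added here for reference): if a fraction p of the work can be sped up by a factor s, the overall speedup is at most 1 / ((1 - p) + p / s), so the part that cannot be accelerated sets the limit.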


Performance constraints

  • On a single processor, performance is constrained by:

    • memory bandwidth (speed at which data can be fetched from or stored to memory)

    • speed at which calculations can take place

    • availability of independent operations for independent computational units

    • availability of flow of operations for individual computational units

    • availability of registers


POWER4 instruction roles

  • All computational instructions use registers

  • Separate load and store instructions perform memory accesses


Floating point data paths

(diagram: FPU 0 and FPU 1 read and write the pool of floating point registers; load/store units L/S 0 and L/S 1 move values between those registers and the cache and memory hierarchy)


Register renaming (1 of 3)

  • each POWER4 processor maintains a pool of physical registers

  • architectural registers (i.e. registers defined in the POWER4 instruction set) are mapped to physical registers

  • This mechanism is called register renaming.


Register renaming (2 of 3)

  • A pending store operation from a register could delay a pending load or compute operation into the same register.

  • Register renaming eliminates or reduces the severity of this potential bottleneck.
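
A hedged C illustration (my example, not from the slides): the temporary t is reused, which would otherwise create a false dependence between two independent computations; renaming maps each assignment of t to a fresh physical register.

    /* Both pairs reuse the architectural register holding t, but they
       are logically independent; renaming lets them execute in parallel. */
    void two_fmas(const double *a, const double *b, const double *c, double *r)
    {
        double t;
        t = a[0] * b[0];
        r[0] = t + c[0];       /* first use of t */
        t = a[1] * b[1];       /* renamed: need not wait for r[0] */
        r[1] = t + c[1];
    }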


Register renaming (3 of 3)

  • 32 architectural general purpose registers (GPRs) are mapped to 80 physical GPRs

  • 32 architectural FP registers are mapped to 72 physical FP registers

  • the condition register (CR) and other architectural registers are also renamed


Pipeline stalls (1 of 3)

  • An observation:

    • each FPU and each FXU can consume multiple input values every cycle

    • each FPU and each FXU can produce a new output value every cycle

  • A pipeline stall occurs if an FPU or an FXU runs out of work to do because the two load/store units can't keep up.


Pipeline stalls (2 of 3)

  • Another observation:

    • each of the two load/store units can deliver a maximum of one new input value from memory each cycle

    • each of the two FPUs can request multiple new input values each cycle

  • Pipeline stalls are inevitable if all input values for a computation come from memory.
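
A worked example of the imbalance (my arithmetic, based on the numbers above): each FMA consumes three input values, so the two FPUs running flat out could demand six new values per cycle, while the two load/store units can deliver at most two; unless most operands are reused from registers, the FPUs must stall.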


Pipeline stalls (3 of 3)

  • Therefore:

    • A POWER4 processor requires an exceptionally fast memory subsystem if the computational units are to be kept busy.

  • AND

    • A programmer seeking to achieve maximum performance in a POWER4 program must pay careful attention to how memory is being used.
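
One hedged example of what "paying attention to memory" can mean (illustrative C, not from the presentation): two separate passes stream the array x through the memory hierarchy twice, while the fused loop loads each cache line of x once and does both operations while the data is close to the processor.

    /* Two passes: x is streamed through the memory hierarchy twice. */
    void two_passes(double *y, double *z, const double *x, int n)
    {
        for (int i = 0; i < n; i++) y[i] = 2.0 * x[i];
        for (int i = 0; i < n; i++) z[i] = x[i] + 1.0;
    }

    /* One fused pass: each element of x is loaded from memory once. */
    void one_pass(double *y, double *z, const double *x, int n)
    {
        for (int i = 0; i < n; i++) {
            y[i] = 2.0 * x[i];
            z[i] = x[i] + 1.0;
        }
    }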



POWER4 memory hierarchy

  • L1 instruction cache (128K/chip; 64K/processor)

  • L1 data cache (64K/chip; 32K/processor)

  • L2 cache (1440K/chip; shared between processors)

  • L3 cache (128M/MCM; shared)

  • real memory (8G/LPAR or 32G/LPAR in ECMWF systems)

  • paging space and file system working storage (size depends on system configuration)


Memory coherency

  • IMPORTANT:

    • All memory within a single p690 system is coherent.

  • This means that any value stored into memory by any POWER4 processor is IMMEDIATELY available/visible to all other POWER4 processors.


The POWER4 L1 instruction cache

(diagram: the L1 instruction cache holds 256 lines of 128 bytes each (32 KB), direct mapped; memory addresses 0, 32 KB, 64 KB, 96 KB, ..., 32*n KB all load, via the L2 cache, into the same cache line)

The L1 instruction cache is "direct mapped", which means that each memory location can be cached in exactly one 128 byte cache line.
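
A hedged sketch of the mapping (illustrative C using the sizes on this slide): dividing an address by the 128-byte line size and taking the remainder modulo the number of lines gives the single line that can hold it, so addresses 32 KB apart always collide.

    #include <stdint.h>

    enum { LINE_BYTES = 128, CACHE_BYTES = 32 * 1024 };
    enum { NUM_LINES = CACHE_BYTES / LINE_BYTES };   /* 256 lines */

    /* The one line of a direct-mapped cache that can hold this address. */
    unsigned direct_mapped_line(uintptr_t addr)
    {
        return (unsigned)((addr / LINE_BYTES) % NUM_LINES);
    }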


The POWER4 L1 data cache

(diagram: the L1 data cache holds 128 lines of 128 bytes each (16 KB), organised as 64 congruence classes with 2 possible locations for any particular line; memory addresses 0, 16 KB, 32 KB, 48 KB, ..., 16*n KB all load, via the L2 cache, into the same congruence class)

The L1 data cache is "2-way set associative", which means that each memory location can be cached in either of two 128 byte lines.
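
The corresponding sketch for the data cache (again using the slide's sizes): an address selects one of the 64 congruence classes, and the line may then sit in either of that class's two locations; three arrays spaced a multiple of 16 KB apart all map to the same class and would evict one another.

    #include <stdint.h>

    enum { LINE_BYTES = 128, NUM_CLASSES = 64 };

    /* The congruence class for this address; the line can occupy
       either of the class's 2 locations ("ways"). */
    unsigned congruence_class(uintptr_t addr)
    {
        return (unsigned)((addr / LINE_BYTES) % NUM_CLASSES);
    }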


The POWER4 L2 cache

  • Each POWER4 chip has three L2 cache controllers:

    • each L2 cache controller manages 480K of cache

    • the L2 cache is unified (instructions, data and page table entries)

    • it is shared between the two POWER4 processors on each chip

  • Each L2 cache controller can deliver 32 bytes per cycle to the L1 caches; on a 1.3GHz POWER4 system that is 32 bytes x 1.3 billion cycles/second = 41.6 gigabytes/second, for an aggregate rate of 3 x 41.6 = 124.8 gigabytes/second per POWER4 chip!


The POWER4 L3 cache

  • Each Multi-Chip Module (MCM) has 128M of L3 cache:

    • 8-way set associative

    • 512 byte blocks managed as four contiguous 128 byte lines

  • Systems with multiple MCMs share their L3 caches

  • Memory coherency is primarily managed by the L3 caches


Peak bandwidths on a 1.3GHz p690

(diagram, for each 2-way chip: memory, the L3 and the shared L2 are linked through the distributed switch at 13.9GB/sec and 12.8GB/sec (x16 = 222GB/sec and 205GB/sec respectively for a 32-way system); a GX bus connects to an I/O hub and an I/O drawer holding 20 PCI adapters at 2GB/sec (x8 = 16GB/sec for 32-way))


POWER4 memory fetches

  • The POWER4 processor requests data from the appropriate L1 cache if it is available there (maximum of two 8 byte requests per cycle)

    • otherwise, it 'reloads' the 128 byte L1 cache line from the L2 cache if available there

      • otherwise, it 'reloads' the 128 byte L2 cache line from the L3 cache if available there

        • otherwise, it loads a 512 byte L3 cache block from memory and provides the appropriate 128 bytes to the L2 cache

          • otherwise, a page fault (or a segmentation fault) occurs


POWER4 data prefetching

  • POWER4 processors detect sequential memory access patterns (in either direction).

  • If the POWER4 determines that memory is being referenced sequentially then data will be pre-fetched into the L1, L2 and L3 caches.
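
A hedged C illustration (my example): traversing a row-major matrix along its rows produces exactly the sequential pattern the prefetch hardware looks for; traversing it down columns jumps 8 KB between consecutive loads, so no sequential cache-line pattern is seen.

    /* Row traversal: consecutive elements, prefetch-friendly. */
    double sum_rows(const double (*m)[1024], int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < 1024; j++)
                s += m[i][j];
        return s;
    }

    /* Column traversal: each load is 8 KB past the previous one,
       so the sequential-access detector never engages. */
    double sum_cols(const double (*m)[1024], int n)
    {
        double s = 0.0;
        for (int j = 0; j < 1024; j++)
            for (int i = 0; i < n; i++)
                s += m[i][j];
        return s;
    }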


POWER4 memory stores

  • POWER4 stores at most one 8 byte value per cycle to L1 cache

  • L1, L2 and L3 caches are store-through:

    • data stored to the L1 cache is immediately queued to be stored to the L2 cache, the L3 cache and main memory

    • separate stores to separate parts of a cache line are merged if possible


Multi-Chip Module (MCM) physical packaging

(diagram: one POWER4 chip is a 2-way SMP system on a single chip (174 million transistors), containing two 1.1 or 1.3GHz POWER4 microprocessors, a shared L2 and a distributed switch; four such chips on a Multi-Chip Module form an 8-way (4 chip) POWER4 SMP system, the basic building block of the pSeries 690)


Multi-Chip Module (MCM) interconnections

(diagram: the four chips of an MCM are connected through their distributed switches; each chip has its own L3 cache and a memory controller leading to a memory slot, with the L3 cache shared across all processors; four GX bus links provide external connections)


Interconnections in a fully configured p690

(diagram: four MCMs, each carrying four chips with shared L2 caches, are interconnected; each chip has its own L3 cache and GX links, with memory books and memory slots attached around the MCMs and GX slots providing I/O, so that the L3 caches and memory are shared across the whole 32-way system)


Timing information

  • a load into a register from L1 cache takes 1 cycle

  • a load from L2 cache takes 6 or 7 cycles

  • a load from L3 cache takes about 36 cycles!

  • subsequent loads from the same 128 byte line will take 1 cycle (because the entire line is loaded into the L1 cache regardless of the size of the original request)

  • a page fault takes roughly 10,000,000 cycles!
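
A hedged sketch of how such latencies are often measured (illustrative, generic C; the sizes and names are my own): a pointer chase makes every load depend on the previous one, so the time per iteration approximates the load-to-use latency for the chosen working-set size.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const size_t n = 1024 * 1024;          /* 8 MB of pointers */
        void **chain = malloc(n * sizeof *chain);
        size_t *idx = malloc(n * sizeof *idx);
        if (chain == NULL || idx == NULL) return 1;

        /* Random cyclic permutation, so prefetch cannot guess the next line. */
        for (size_t i = 0; i < n; i++) idx[i] = i;
        srand(1);
        for (size_t i = n - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < n; i++)
            chain[idx[i]] = &chain[idx[(i + 1) % n]];
        free(idx);

        /* Every load depends on the previous one: time ~ latency per load. */
        void **p = chain;
        clock_t t0 = clock();
        for (size_t i = 0; i < n; i++)
            p = *p;
        clock_t t1 = clock();

        printf("%.1f ns per dependent load (end: %p)\n",
               1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)n, (void *)p);
        free(chain);
        return 0;
    }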


POWER4 hardware FP instructions

  • Fused Multiply Add (FMA)

  • Single Precision (SP) equivalents

    • use same registers as Double Precision (DP)

    • same in-register performance as DP

    • better memory hierarchy performance because SP values occupy half the space

  • DIVIDE

  • SQRT


The Fused Multiply Add (FMA) instruction

  • combines a floating point multiply with an add

    • i.e. (a * b) + c

  • each FPU can complete one per cycle (if there are no pipeline stalls)

  • four flavours:

    • Mult/Add frt = fra*frc + frb

    • Mult/Sub frt = fra*frc - frb

    • -ve Mult/Add frt = -(fra*frc + frb)

    • -ve Mult/Sub frt = -(fra*frc - frb)
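
A hedged C illustration of the fused operation (my example; C99's fma() requests a fused multiply-add, which compilers for POWER typically also generate for a plain a*b + c): the fused form rounds only once, so it can preserve a tiny result that two separate roundings destroy.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 + 0x1p-30;      /* 1 + 2^-30, exactly representable */
        double b = 1.0 - 0x1p-30;
        double c = -1.0;

        double separate = a * b + c;    /* two roundings: multiply, then add */
        double fused    = fma(a, b, c); /* one rounding of the exact a*b + c */

        printf("separate = %.17g\n", separate);  /* 0 */
        printf("fused    = %.17g\n", fused);     /* -2^-60, the true value */
        return 0;
    }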


Floating point performance (1 of 2)

  • FMAs:

    • about 5 cycle latency

    • pipeline capable of completing one FMA per FPU per cycle (i.e. 2 FLOPs per cycle per FPU)

    • a floating point MULTIPLY or a floating point ADD is just a FMA with one of the operands omitted


Floating point performance (2 of 2)

  • DIVIDE and SQRT:

    • not pipelined so very costly

    • but each FPU can do one if they are independent (i.e. processor can perform two concurrently)

    • divides take about 30 cycles (or an average of about 15 cycles if each FPU is performing a divide)

    • square roots take about 36 cycles (or an average of about 18 cycles if each FPU is performing one)
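
A hedged example of working around the divide cost (illustrative C): dividing every element by the same value can be replaced by one divide and a stream of pipelined multiplies; the two forms can differ in the last bit, so this is a judgement call (optimising compilers may do it for you).

    /* ~30 cycles per iteration: the divides cannot be pipelined. */
    void scale_naive(double *a, int n, double d)
    {
        for (int i = 0; i < n; i++)
            a[i] = a[i] / d;
    }

    /* One divide, then pipelined multiplies at up to one per FPU per cycle. */
    void scale_recip(double *a, int n, double d)
    {
        double r = 1.0 / d;
        for (int i = 0; i < n; i++)
            a[i] = a[i] * r;
    }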


POWER4 vs VPP5000 single-CPU performance

    Function           VPP5000 (ops/cycle)   POWER4 (ops/cycle)   VPP5000 : POWER4
    Multiply                16                     2                   1.85
    Add                     16                     2                   1.85
    Multiply and Add        16                     2                   1.85
    Divide                  16/4                   2/30                13.85
    Square root             48/20                  2/36                10

Note: values are "per processor"; the relative performance column takes into account the ECMWF VPP5000's clock speed of 300MHz vs the ECMWF POWER4's clock speed of 1.3GHz.


IEEE floating point

  • FMA (DP & SP) does not round between M and A

  • more accurate than IEEE (except in pathological cases)

    • for example, d = a*b - a*b may yield rounding error instead of zero

  • technically violates the IEEE 754 standard

  • -qnomaf compiler option forces IEEE conformance

    • results will be slower and almost certainly less accurate!

    • don't use it unless you REALLY need to
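
A hedged sketch of the d = a*b - a*b effect (my example; the explicit fma() below imitates what fused code computes): the fused form subtracts the rounded product from the exact one, revealing the product's rounding error instead of yielding zero.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 / 3.0, b = 3.0;

        double plain = a * b - a * b;      /* both products rounded: 0 */
        double p     = a * b;
        double fused = fma(a, b, -p);      /* exact a*b minus rounded a*b */

        printf("unfused: %.17g\n", plain);   /* 0 */
        printf("fused:   %.17g\n", fused);   /* -2^-54: the rounding error */
        return 0;
    }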

