
HPCF Architecture - 2

Norbert Kreitz, March 2004

Based on material provided by Daniel Boulet and IBM


Topics

  • superscalar attributes

  • computational units and their pipelines

  • memory hierarchy

  • floating point

For more details, see the POWER4 System Microarchitecture white paper at http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power4.html

(or search for "power4 architecture" in the Search window at www.ibm.com)


Superscalar architecture

  • each POWER4 processor has eight independent computational units

  • each computational unit can execute instructions in parallel with other computational units

  • This ability to execute up to 8 instructions in parallel is called superscalar execution.


POWER4 computational units

  • Each processor contains:

    • two FXUs (fixed point units)

    • two FPUs (floating point units)

    • two load/store units

    • a branching unit

    • a CR (condition register) unit


POWER4 instruction execution

  • Each processor can:

    • fetch up to eight instructions per cycle

    • complete up to five instructions per cycle

    • track over 200 instructions at any given time

    • perform out-of-order execution of independent groups of instructions

    • perform speculative execution

    • perform register renaming


Instruction execution rate vs duration

  • The POWER4's superscalar architecture means that instructions can complete at a rate of up to five per cycle, even though each individual instruction takes longer than one cycle to execute.


Pipelined architecture

  • Typical floating point instructions take six cycles but can complete at the rate of one instruction per cycle:
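
To make the arithmetic concrete (an illustrative calculation, not from the slide): with a six-stage pipeline, N independent floating point instructions finish in about N + 5 cycles, approaching one result per cycle for large N, whereas a chain of N dependent instructions takes about 6N cycles.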


Vector vs superscalar pipelines

  • Vector pipeline

    • uses pre-loaded vector registers as input

    • dependent operations not possible - code must be "vectorised"

  • Superscalar pipeline

    • uses pool of registers

    • result from one operation can be fed back as input to the next

    • dependent operations can cause pipeline stalls (see the sketch below)
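
A hedged C sketch of the difference (illustrative code, not from the presentation): the single-accumulator loop is one long dependence chain, so each add must wait for the previous one to leave the pipeline; splitting the sum across independent accumulators keeps the pipeline full.

    /* One dependence chain: every iteration waits for the previous sum. */
    double dot_chain(const double *a, const double *b, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];              /* each FMA depends on the last */
        return s;
    }

    /* Four independent chains: the FPU pipelines can stay full. */
    double dot_split(const double *a, const double *b, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)                 /* leftover elements */
            s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    }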


Amdahl's Law (1 of 2)

  • The performance of any system is constrained by the speed or capacity of the slowest point.

(diagram: a system constrained by input availability)


Amdahl's Law (2 of 2)

(diagram: a system constrained by the potential output rate)
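
In its classic quantitative form (the standard formula, added here for reference): if a fraction p of the work can be sped up by a factor s, the overall speedup is at most 1 / ((1 - p) + p / s), so the part that cannot be accelerated sets the limit.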


Performance constraints

  • On a single processor, performance is constrained by:

    • memory bandwidth (speed at which data can be fetched from or stored to memory)

    • speed at which calculations can take place

    • availability of independent operations for independent computational units

    • availability of flow of operations for individual computational units

    • availability of registers


POWER4 instruction roles

  • All computational instructions use registers

  • Separate load and store instructions perform memory accesses


Floating point data paths

(diagram: FPU 0 and FPU 1 read and write the pool of floating point registers; load/store units L/S 0 and L/S 1 move values between those registers and the cache and memory hierarchy)


Register renaming (1 of 3)

  • each POWER4 processor maintains a pool of physical registers

  • architectural registers (i.e. registers defined in the POWER4 instruction set) are mapped to physical registers

  • This mechanism is called register renaming.


Register renaming (2 of 3)

  • A pending store operation from a register could delay a pending load or compute operation into the same register.

  • Register renaming eliminates or reduces the severity of this potential bottleneck.
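
A hedged C illustration (my example, not from the slides): the temporary t is reused, which would otherwise create a false dependence between two independent computations; renaming maps each assignment of t to a fresh physical register.

    /* Both pairs reuse the architectural register holding t, but they
       are logically independent; renaming lets them execute in parallel. */
    void two_fmas(const double *a, const double *b, const double *c, double *r)
    {
        double t;
        t = a[0] * b[0];
        r[0] = t + c[0];       /* first use of t */
        t = a[1] * b[1];       /* renamed: need not wait for r[0] */
        r[1] = t + c[1];
    }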


Register renaming (3 of 3)

  • 32 architectural general purpose registers (GPRs) are mapped to 80 physical GPRs

  • 32 architectural FP registers are mapped to 72 physical FP registers

  • the condition register (CR) and other architectural registers are also renamed


Pipeline stalls (1 of 3)

  • An observation:

    • each FPU and each FXU can consume multiple input values every cycle

    • each FPU and each FXU can produce a new output value every cycle

  • A pipeline stall occurs if an FPU or an FXU runs out of work to do because the two load/store units can't keep up.


Pipeline stalls (2 of 3)

  • Another observation:

    • each of the two load/store units can deliver a maximum of one new input value from memory each cycle

    • each of the two FPUs can request multiple new input values each cycle

  • Pipeline stalls are inevitable if all input values for a computation come from memory.
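
A worked example of the imbalance (my arithmetic, based on the numbers above): each FMA consumes three input values, so the two FPUs running flat out could demand six new values per cycle, while the two load/store units can deliver at most two; unless most operands are reused from registers, the FPUs must stall.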


Pipeline stalls (3 of 3)

  • Therefore:

    • A POWER4 processor requires an exceptionally fast memory subsystem if the computational units are to be kept busy.

  • AND

    • A programmer seeking to achieve maximum performance in a POWER4 program must pay careful attention to how memory is being used.
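
One hedged example of what "paying attention to memory" can mean (illustrative C, not from the presentation): two separate passes stream the array x through the memory hierarchy twice, while the fused loop loads each cache line of x once and does both operations while the data is close to the processor.

    /* Two passes: x is streamed through the memory hierarchy twice. */
    void two_passes(double *y, double *z, const double *x, int n)
    {
        for (int i = 0; i < n; i++) y[i] = 2.0 * x[i];
        for (int i = 0; i < n; i++) z[i] = x[i] + 1.0;
    }

    /* One fused pass: each element of x is loaded from memory once. */
    void one_pass(double *y, double *z, const double *x, int n)
    {
        for (int i = 0; i < n; i++) {
            y[i] = 2.0 * x[i];
            z[i] = x[i] + 1.0;
        }
    }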



POWER4 memory hierarchy

  • L1 instruction cache (128K/chip; 64K/processor)

  • L1 data cache (64K/chip; 32K/processor)

  • L2 cache (1440K/chip; shared between processors)

  • L3 cache (128M/MCM; shared)

  • real memory (8G/LPAR or 32G/LPAR in ECMWF systems)

  • paging space and file system working storage (size depends on system configuration)


Memory coherency

  • IMPORTANT:

    • All memory within a single p690 system is coherent.

  • This means that any value stored into memory by any POWER4 processor is IMMEDIATELY available/visible to all other POWER4 processors.


The POWER4 L1 instruction cache

(diagram: the L1 instruction cache holds 256 lines of 128 bytes each (32 KB), direct mapped; memory addresses 0, 32 KB, 64 KB, 96 KB, ..., 32*n KB all load, via the L2 cache, into the same cache line)

The L1 instruction cache is "direct mapped", which means that each memory location can be cached in exactly one 128 byte cache line.
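
A hedged sketch of the mapping (illustrative C using the sizes on this slide): dividing an address by the 128-byte line size and taking the remainder modulo the number of lines gives the single line that can hold it, so addresses 32 KB apart always collide.

    #include <stdint.h>

    enum { LINE_BYTES = 128, CACHE_BYTES = 32 * 1024 };
    enum { NUM_LINES = CACHE_BYTES / LINE_BYTES };   /* 256 lines */

    /* The one line of a direct-mapped cache that can hold this address. */
    unsigned direct_mapped_line(uintptr_t addr)
    {
        return (unsigned)((addr / LINE_BYTES) % NUM_LINES);
    }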


The POWER4 L1 data cache

(diagram: the L1 data cache holds 128 lines of 128 bytes each (16 KB), organised as 64 congruence classes with 2 possible locations for any particular line; memory addresses 0, 16 KB, 32 KB, 48 KB, ..., 16*n KB all load, via the L2 cache, into the same congruence class)

The L1 data cache is "2-way set associative", which means that each memory location can be cached in either of two 128 byte lines.
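
The corresponding sketch for the data cache (again using the slide's sizes): an address selects one of the 64 congruence classes, and the line may then sit in either of that class's two locations; three arrays spaced a multiple of 16 KB apart all map to the same class and would evict one another.

    #include <stdint.h>

    enum { LINE_BYTES = 128, NUM_CLASSES = 64 };

    /* The congruence class for this address; the line can occupy
       either of the class's 2 locations ("ways"). */
    unsigned congruence_class(uintptr_t addr)
    {
        return (unsigned)((addr / LINE_BYTES) % NUM_CLASSES);
    }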


The POWER4 L2 cache

  • Each POWER4 chip has three L2 cache controllers:

    • each L2 cache controller manages 480K of cache

    • the L2 cache is unified (instructions, data and page table entries)

    • it is shared between the two POWER4 processors on each chip

  • Each L2 cache controller can deliver 32 bytes per cycle to the L1 caches; on a 1.3GHz POWER4 system that is 32 bytes x 1.3 billion cycles/second = 41.6 gigabytes/second, for an aggregate rate of 3 x 41.6 = 124.8 gigabytes/second per POWER4 chip!


The POWER4 L3 cache

  • Each Multi-Chip Module (MCM) has 128M of L3 cache:

    • 8-way set associative

    • 512 byte blocks managed as four contiguous 128 byte lines

  • Systems with multiple MCMs share their L3 caches

  • Memory coherency is primarily managed by the L3 caches


Peak bandwidths on a 1.3GHz p690

(diagram, for each 2-way chip: memory, the L3 and the shared L2 are linked through the distributed switch at 13.9GB/sec and 12.8GB/sec (x16 = 222GB/sec and 205GB/sec respectively for a 32-way system); a GX bus connects to an I/O hub and an I/O drawer holding 20 PCI adapters at 2GB/sec (x8 = 16GB/sec for 32-way))


POWER4 memory fetches

  • The POWER4 processor requests data from the appropriate L1 cache if it is available there (maximum of two 8 byte requests per cycle)

    • otherwise, it 'reloads' the 128 byte L1 cache line from the L2 cache if available there

      • otherwise, it 'reloads' the 128 byte L2 cache line from the L3 cache if available there

        • otherwise, it loads a 512 byte L3 cache block from memory and provides the appropriate 128 bytes to the L2 cache

          • otherwise, a page fault (or a segmentation fault) occurs


POWER4 data prefetching

  • POWER4 processors detect sequential memory access patterns (in either direction).

  • If the POWER4 determines that memory is being referenced sequentially then data will be pre-fetched into the L1, L2 and L3 caches.
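
A hedged C illustration (my example): traversing a row-major matrix along its rows produces exactly the sequential pattern the prefetch hardware looks for; traversing it down columns jumps 8 KB between consecutive loads, so no sequential cache-line pattern is seen.

    /* Row traversal: consecutive elements, prefetch-friendly. */
    double sum_rows(const double (*m)[1024], int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < 1024; j++)
                s += m[i][j];
        return s;
    }

    /* Column traversal: each load is 8 KB past the previous one,
       so the sequential-access detector never engages. */
    double sum_cols(const double (*m)[1024], int n)
    {
        double s = 0.0;
        for (int j = 0; j < 1024; j++)
            for (int i = 0; i < n; i++)
                s += m[i][j];
        return s;
    }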


POWER4 memory stores

  • POWER4 stores at most one 8 byte value per cycle to L1 cache

  • L1, L2 and L3 caches are store-through:

    • data stored to the L1 cache is immediately queued to be stored to the L2 cache, the L3 cache and main memory

    • separate stores to separate parts of a cache line are merged if possible


Multi-Chip Module (MCM) physical packaging

(diagram: one POWER4 chip is a 2-way SMP system on a single chip (174 million transistors), containing two 1.1 or 1.3GHz POWER4 microprocessors, a shared L2 and a distributed switch; four such chips on a Multi-Chip Module form an 8-way (4 chip) POWER4 SMP system, the basic building block of the pSeries 690)


Multi-Chip Module (MCM) interconnections

(diagram: the four chips of an MCM are connected through their distributed switches; each chip has its own L3 cache and a memory controller leading to a memory slot, with the L3 cache shared across all processors; four GX bus links provide external connections)


Interconnections in a fully configured p690

(diagram: four MCMs, each carrying four chips with shared L2 caches, are interconnected; each chip has its own L3 cache and GX links, with memory books and memory slots attached around the MCMs and GX slots providing I/O, so that the L3 caches and memory are shared across the whole 32-way system)


Timing information

  • a load into a register from L1 cache takes 1 cycle

  • a load from L2 cache takes 6 or 7 cycles

  • a load from L3 cache takes about 36 cycles!

  • subsequent loads from the same 128 byte line will take 1 cycle (because the entire line is loaded into the L1 cache regardless of the size of the original request)

  • a page fault takes roughly 10,000,000 cycles!
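
A hedged sketch of how such latencies are often measured (illustrative, generic C; the sizes and names are my own): a pointer chase makes every load depend on the previous one, so the time per iteration approximates the load-to-use latency for the chosen working-set size.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const size_t n = 1024 * 1024;          /* 8 MB of pointers */
        void **chain = malloc(n * sizeof *chain);
        size_t *idx = malloc(n * sizeof *idx);
        if (chain == NULL || idx == NULL) return 1;

        /* Random cyclic permutation, so prefetch cannot guess the next line. */
        for (size_t i = 0; i < n; i++) idx[i] = i;
        srand(1);
        for (size_t i = n - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < n; i++)
            chain[idx[i]] = &chain[idx[(i + 1) % n]];
        free(idx);

        /* Every load depends on the previous one: time ~ latency per load. */
        void **p = chain;
        clock_t t0 = clock();
        for (size_t i = 0; i < n; i++)
            p = *p;
        clock_t t1 = clock();

        printf("%.1f ns per dependent load (end: %p)\n",
               1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)n, (void *)p);
        free(chain);
        return 0;
    }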


POWER4 hardware FP instructions

  • Fused Multiply Add (FMA)

  • Single Precision (SP) equivalents

    • use same registers as Double Precision (DP)

    • same in-register performance as DP

    • better memory hierarchy performance because SP values occupy half the space

  • DIVIDE

  • SQRT


The Fused Multiply Add (FMA) instruction

  • combines a floating point multiply with an add

    • i.e. (a * b) + c

  • each FPU can complete one per cycle (if there are no pipeline stalls)

  • four flavours:

    • Mult/Add frt = fra*frc + frb

    • Mult/Sub frt = fra*frc - frb

    • -ve Mult/Add frt = -(fra*frc + frb)

    • -ve Mult/Sub frt = -(fra*frc - frb)
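
A hedged C illustration of the fused operation (my example; C99's fma() requests a fused multiply-add, which compilers for POWER typically also generate for a plain a*b + c): the fused form rounds only once, so it can preserve a tiny result that two separate roundings destroy.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 + 0x1p-30;      /* 1 + 2^-30, exactly representable */
        double b = 1.0 - 0x1p-30;
        double c = -1.0;

        double separate = a * b + c;    /* two roundings: multiply, then add */
        double fused    = fma(a, b, c); /* one rounding of the exact a*b + c */

        printf("separate = %.17g\n", separate);  /* 0 */
        printf("fused    = %.17g\n", fused);     /* -2^-60, the true value */
        return 0;
    }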


Floating point performance (1 of 2)

  • FMAs:

    • about 5 cycle latency

    • pipeline capable of completing one FMA per FPU per cycle (i.e. 2 FLOPs per cycle per FPU)

    • a floating point MULTIPLY or a floating point ADD is just a FMA with one of the operands omitted


Floating point performance (2 of 2)

  • DIVIDE and SQRT:

    • not pipelined so very costly

    • but each FPU can do one if they are independent (i.e. processor can perform two concurrently)

    • divides take about 30 cycles (or an average of about 15 cycles if each FPU is performing a divide)

    • square roots take about 36 cycles (or an average of about 18 cycles if each FPU is performing one)
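
A hedged example of working around the divide cost (illustrative C): dividing every element by the same value can be replaced by one divide and a stream of pipelined multiplies; the two forms can differ in the last bit, so this is a judgement call (optimising compilers may do it for you).

    /* ~30 cycles per iteration: the divides cannot be pipelined. */
    void scale_naive(double *a, int n, double d)
    {
        for (int i = 0; i < n; i++)
            a[i] = a[i] / d;
    }

    /* One divide, then pipelined multiplies at up to one per FPU per cycle. */
    void scale_recip(double *a, int n, double d)
    {
        double r = 1.0 / d;
        for (int i = 0; i < n; i++)
            a[i] = a[i] * r;
    }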


POWER4 vs VPP5000 single-CPU performance

    Function           VPP5000 (ops/cycle)   POWER4 (ops/cycle)   VPP5000 : POWER4
    Multiply                16                     2                   1.85
    Add                     16                     2                   1.85
    Multiply and Add        16                     2                   1.85
    Divide                  16/4                   2/30                13.85
    Square root             48/20                  2/36                10

Note: values are "per processor"; the relative performance column takes into account the ECMWF VPP5000's clock speed of 300MHz vs the ECMWF POWER4's clock speed of 1.3GHz.


IEEE floating point

  • FMA (DP & SP) does not round between M and A

  • more accurate than IEEE (except in pathological cases)

    • for example, d = a*b - a*b may yield rounding error instead of zero

  • technically violates the IEEE 754 standard

  • -qnomaf compiler option forces IEEE conformance

    • results will be slower and almost certainly less accurate!

    • don't use it unless you REALLY need to
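
A hedged sketch of the d = a*b - a*b effect (my example; the explicit fma() below imitates what fused code computes): the fused form subtracts the rounded product from the exact one, revealing the product's rounding error instead of yielding zero.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 / 3.0, b = 3.0;

        double plain = a * b - a * b;      /* both products rounded: 0 */
        double p     = a * b;
        double fused = fma(a, b, -p);      /* exact a*b minus rounded a*b */

        printf("unfused: %.17g\n", plain);   /* 0 */
        printf("fused:   %.17g\n", fused);   /* -2^-54: the rounding error */
        return 0;
    }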

