slide1

HPCF Architecture - 2

Norbert Kreitz, March 2004

Based on material provided by

Daniel Boulet and IBM

slide2

Topics

  • superscalar attributes
  • computational units and their pipelines
  • memory hierarchy
  • floating point

For more details, see the POWER4 System Microarchitecture white paper at http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power4.html

(or search for "power4 architecture" in the Search window at www.ibm.com)

slide3

Superscalar architecture

  • each POWER4 processor has eight independent computational units
  • each computational unit can execute instructions in parallel with other computational units
  • This ability to execute up to 8 instructions in parallel is called superscalar execution.
slide4

POWER4 computational units

  • Each processor contains:
    • two FXUs (fixed point units)
    • two FPUs (floating point units)
    • two load/store units
    • a branching unit
    • a CR (condition register) unit
slide5

POWER4 instruction execution

  • Each processor can:
    • fetch up to eight instructions per cycle
    • complete up to five instructions per cycle
    • track over 200 instructions at any given time
    • perform out-of-order execution of independent groups of instructions
    • perform speculative execution
    • perform register renaming
slide6

Instruction execution rate vs duration

  • The POWER4's superscalar architecture means that instructions may complete at a rate of up to five per cycle, even though each individual instruction takes longer than one cycle to execute.
slide7

Pipelined architecture

  • Typical floating point instructions take six cycles but can complete at the rate of one instruction per cycle:
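
To make the distinction concrete, here is a minimal C sketch (hypothetical code, assuming the six-cycle latency above; not from the slides): when loop iterations are independent, a new operation can enter the pipeline every cycle, so n operations finish in roughly n + 6 cycles rather than 6n.

```c
/* Independent operations keep a six-cycle pipeline full: iteration i
 * does not depend on iteration i-1, so a new multiply can enter the
 * pipeline every cycle and one result emerges per cycle. */
void scale(double *y, const double *x, double a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];   /* total ~ n + pipeline depth cycles */
}
```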
slide8

Vector vs superscalar pipelines

  • Vector pipeline
    • uses pre-loaded vector registers as input
    • dependent operations not possible - code must be "vectorised"
  • Superscalar pipeline
    • uses pool of registers
    • result from one operation can be fed back as input to the next
    • dependent operations can cause pipeline stalls
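
The stall case and its usual fix look like this in C (a hypothetical sketch, reusing the six-cycle FP latency from the previous slide): a single running sum serialises the pipeline, while several independent partial sums let the hardware overlap them.

```c
/* Dependent chain: each += must wait ~6 cycles for the previous
 * result, so the pipeline sits mostly idle between iterations. */
double sum_naive(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators: four chains interleave in the
 * pipeline, hiding most of the latency. */
double sum_unrolled(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)       /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

(Reassociating the sum this way can change the rounding of the result slightly, which is why compilers typically need explicit permission to do it for you.)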
slide9

Amdahl's Law (1 of 2)

  • The performance of any system is constrained by the speed or capacity of the slowest point.

[Figure: a system which is constrained by input availability]

slide10

Amdahl's Law (2 of 2)

[Figure: a system which is constrained by the potential output rate]
slide11

Performance constraints

  • On a single processor, performance is constrained by:
    • memory bandwidth (speed at which data can be fetched from or stored to memory)
    • speed at which calculations can take place
    • availability of independent operations for independent computational units
    • availability of flow of operations for individual computational units
    • availability of registers
slide12

POWER4 instruction roles

  • All computational instructions use registers
  • Separate load and store instructions perform memory accesses
slide13

Floating point data paths

[Figure: floating point data paths — the two floating point units (FPU 0, FPU 1) read and write the pool of floating point registers; the two load/store units (L/S 0, L/S 1) move data between the registers and the cache and memory hierarchy]

slide14

Register renaming (1 of 3)

  • each POWER4 processor maintains a pool of physical registers
  • architectural registers (i.e. registers defined in the POWER4 instruction set) are mapped to physical registers
  • This mechanism is called register renaming.
slide15

Register renaming (2 of 3)

  • A pending store operation from a register could delay a pending load or compute operation into the same register.
  • Register renaming eliminates or reduces the severity of this potential bottleneck.
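
A sketch of the kind of false dependence renaming removes (hypothetical C, with f1 standing in for one architectural FP register): the two halves of the function are independent, but reusing one register would serialise them without renaming.

```c
/* "t" plays the role of a single architectural register (say f1).
 * The second assignment to t starts an unrelated computation, but
 * without renaming it would have to wait for the first use of f1
 * to finish (a write-after-read hazard).  Renaming gives each
 * lifetime of t its own physical register, so both multiply/add
 * pairs can execute in parallel. */
void two_uses(double a, double b, double c,
              double d, double e, double f,
              double *x, double *y)
{
    double t;
    t = a * b;     /* f1 <- a*b                               */
    *x = t + c;    /* read f1                                 */
    t = d * e;     /* f1 <- d*e: would stall without renaming */
    *y = t + f;    /* read f1                                 */
}
```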
slide16

Register renaming (3 of 3)

  • 32 architectural general purpose registers (GPRs) are mapped to 80 physical GPRs
  • 32 architectural FP registers are mapped to 72 physical FP registers
  • the condition register (CR) and other architectural registers are also renamed
slide17

Pipeline stalls (1 of 3)

  • An observation:
    • each FPU and each FXU can consume multiple input values every cycle
    • each FPU and each FXU can produce a new output value every cycle
  • A pipeline stall occurs if an FPU or an FXU runs out of work to do because the two load/store units can't keep up.
slide18

Pipeline stalls (2 of 3)

  • Another observation:
    • each of the two load/store units can deliver a maximum of one new input value from memory each cycle
    • each of the two FPUs can request multiple new input values each cycle
  • Pipeline stalls are inevitable if all input values for a computation come from memory.
slide19

Pipeline stalls (3 of 3)

  • Therefore:
    • A POWER4 processor requires an exceptionally fast memory subsystem if the computational units are to be kept busy.
  • AND
    • A programmer seeking to achieve maximum performance in a POWER4 program must pay careful attention to how memory is being used.
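
The classic example of such attention is cache blocking. The sketch below (hypothetical code; BLOCK is an arbitrary illustrative tile size, and n is assumed to be a multiple of BLOCK) restructures a matrix multiply so each tile of data is reused many times while it is still in cache, keeping the load/store units ahead of the FPUs.

```c
/* Blocked matrix multiply: rather than streaming whole rows of b
 * through the cache n times, work on BLOCK x BLOCK tiles that stay
 * resident, so each loaded cache line is reused many times. */
#define BLOCK 32

void matmul_blocked(int n, const double *a, const double *b, double *c)
{
    for (int ii = 0; ii < n; ii += BLOCK)
    for (int jj = 0; jj < n; jj += BLOCK)
    for (int kk = 0; kk < n; kk += BLOCK)
        for (int i = ii; i < ii + BLOCK; i++)
            for (int k = kk; k < kk + BLOCK; k++) {
                double aik = a[i*n + k];            /* held in a register */
                for (int j = jj; j < jj + BLOCK; j++)
                    c[i*n + j] += aik * b[k*n + j]; /* one FMA per element */
            }
}
```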
slide21

POWER4 memory hierarchy

  • L1 instruction cache (128K/chip; 64K/processor)
  • L1 data cache (64K/chip; 32K/processor)
  • L2 cache (1440K/chip; shared between processors)
  • L3 cache (128M/MCM; shared)
  • real memory (8G/LPAR or 32G/LPAR in ECMWF systems)
  • paging space and file system working storage (size depends on system configuration)
slide22

Memory coherency

  • IMPORTANT:
    • All memory within a single p690 system is coherent.
  • This means that any value stored into memory by any POWER4 processor is IMMEDIATELY available/visible to all other POWER4 processors.
slide23

The POWER4 L1 instruction cache

[Figure: the L1 instruction cache — 256 lines of 128 bytes each (32 KB), direct mapped, loaded from memory via the L2 cache; memory addresses 0, 32 KB, 64 KB, 96 KB, 128 KB, ..., 32*n KB all map to the same cache line]

The L1 instruction cache is "direct mapped" which means that each memory location can be cached in exactly one 128 byte cache line.
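
In code terms (a hypothetical helper using the slide's parameters, not an IBM API): the line an address occupies is pure address arithmetic, and any two addresses a multiple of 32 KB apart compete for the same line.

```c
#include <stdint.h>

/* Direct-mapped placement: 128-byte lines, 256 lines (32 KB).
 * Two addresses with the same line index evict each other. */
#define LINE_BYTES 128u
#define NUM_LINES  256u

unsigned icache_line_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_LINES);
}
```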

slide24

The POWER4 L1 data cache

[Figure: the L1 data cache — 128 lines of 128 bytes each (16 KB), 64 congruence classes with 2 locations for any particular line; loads and stores go to memory via the L2 cache; memory addresses 0, 16 KB, 32 KB, 48 KB, 64 KB, ..., 16*n KB map to the same pair of locations]

The L1 data cache is "2-way set associative" which means that each memory location can be cached in either of two 128 byte lines.
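
The corresponding arithmetic for the data cache (again a hypothetical helper built from the slide's parameters): addresses that are a multiple of line-size × class-count bytes apart fall in the same congruence class, so a loop walking three arrays allocated at exactly that power-of-two offset keeps evicting one of its own lines, since only two can be cached per class.

```c
#include <stdint.h>

/* 2-way set associative placement: 128-byte lines, 64 congruence
 * classes, 2 lines per class (the slide's parameters).  Addresses
 * LINE_BYTES * NUM_CLASSES bytes apart map to the same class, and
 * a third such address always evicts one of the other two. */
#define LINE_BYTES  128u
#define NUM_CLASSES  64u

unsigned dcache_class(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_CLASSES);
}
```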

slide25

The POWER4 L2 cache

  • Each POWER4 chip has three L2 cache controllers:
    • each L2 cache controller manages 480K of cache
    • L2 cache is unified (instructions, data and page table entries)
    • shared between the two POWER4 processors on each chip
  • Each L2 cache controller can deliver 32 bytes per cycle to the L1 caches, for a rate of 41.6 gigabytes/second on 1.3GHz POWER4 systems: that's an aggregate rate of 124.8 gigabytes/second per POWER4 chip!
slide26

The POWER4 L3 cache

  • Each Multi-Chip Module (MCM) has 128M of L3 cache:
    • 8-way set associative
    • 512 byte blocks managed as four contiguous 128 byte lines
  • Systems with multiple MCMs share their L3 caches
  • Memory coherency is primarily managed by the L3 caches
slide27

Peak bandwidths on a 1.3GHz p690

[Figure: peak bandwidths, for each 2-way chip — memory to L3: 13.9GB/sec (x16 = 222GB/sec for 32-way); L3 to the shared L2 via the distributed switch: 12.8GB/sec (x16 = 205GB/sec for 32-way); GX bus to the I/O hub and I/O drawer with 20 PCI adapters: 2GB/sec (x8 = 16GB/sec for 32-way)]

slide28

POWER4 memory fetches

  • POWER4 processor requests data from appropriate L1 cache if available (maximum of two 8 byte requests per cycle)
    • otherwise, 'reload' 128 byte L1 cache line from L2 cache if available
      • otherwise, 'reload' 128 byte L2 cache line from L3 cache if available
        • otherwise, load 512 byte L3 cache from memory and provide appropriate 128 bytes to L2 cache
          • otherwise, page fault or segmentation fault
slide29

POWER4 data prefetching

  • POWER4 processors detect sequential memory access patterns (in either direction).
  • If the POWER4 determines that memory is being referenced sequentially then data will be pre-fetched into the L1, L2 and L3 caches.
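
The practical consequence (a hypothetical sketch): traverse arrays in memory order. C stores 2-D arrays row-major, so keeping the rightmost index in the innermost loop produces the sequential pattern the prefetch hardware recognises; swapping the loops produces a large stride it may not.

```c
/* Row-major traversal: consecutive accesses are 8 bytes apart, so
 * the hardware detects the stream and prefetches lines ahead. */
double sum_rowwise(int n, int m, const double a[n][m])
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            s += a[i][j];        /* stride 1: prefetch-friendly */
    return s;
}

/* Column-major traversal of the same array jumps m*8 bytes per
 * access; large strides defeat sequential prefetch detection. */
double sum_columnwise(int n, int m, const double a[n][m])
{
    double s = 0.0;
    for (int j = 0; j < m; j++)
        for (int i = 0; i < n; i++)
            s += a[i][j];        /* stride m doubles: cache-hostile */
    return s;
}
```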
slide30

POWER4 memory stores

  • POWER4 stores at most one 8 byte value per cycle to L1 cache
  • L1, L2 and L3 caches are store-through:
    • data stored to the L1 cache is immediately queued to be stored to the L2 cache, the L3 cache and main memory
    • separate stores to separate parts of a cache line are merged if possible
slide31

Multi-Chip Module (MCM) physical packaging

[Figure: the pSeries 690 basic building block — each POWER4 microprocessor (1.1 or 1.3GHz, 174 million transistors) is a 2-way SMP system on a single chip with a shared L2 and a distributed switch; four chips form an 8-way (4 chip) POWER4 SMP system on a Multi-Chip Module (MCM)]

slide32

Multi-Chip Module (MCM) interconnections

[Figure: the four chips on an MCM, each with its shared L2, are linked through their distributed switches; each chip has a memory controller and L3 cache connecting it to a memory slot, plus a GX bus, giving 4 GX bus links for external connections; the L3 cache is shared across all processors]

slide33

Interconnections in a fully configured p690

[Figure: four MCMs (16 chips, each with its shared L2 and attached L3 caches) interconnected through their distributed switches, with memory books and memory slots attached via the L3 caches and GX buses feeding the GX (i/o) slots]

slide34

Timing information

  • a load into a register from L1 cache takes 1 cycle
  • a load from L2 cache takes 6 or 7 cycles
  • a load from L3 cache takes about 36 cycles!
  • subsequent loads from the same 128 byte line will take 1 cycle (because the entire line is loaded into the L1 cache regardless of the size of the original request)
  • a page fault takes roughly 10,000,000 cycles!!!???
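
Because a 128 byte line holds sixteen 8 byte doubles, the reload cost is amortised over the line. A hypothetical stride-1 sweep that misses to L3 pays about 36 cycles for the first double and 1 cycle for each of the next fifteen, roughly (36 + 15)/16 ≈ 3.2 cycles per element; a stride-16 sweep pays the full 36 cycles for every element it touches.

```c
/* One 128-byte line = 16 doubles.  Illustrative only. */
double sweep(const double *x, long n, long stride)
{
    double s = 0.0;
    for (long i = 0; i < n; i += stride)
        s += x[i];
    return s;
}
/* sweep(x, n, 1)  : one L3 reload per 16 accesses ~ 3.2 cycles/element
 * sweep(x, n, 16) : one L3 reload per access      ~ 36  cycles/element */
```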
slide35

POWER4 hardware FP instructions

  • Fused Multiply Add (FMA)
  • Single Precision (SP) equivalents
    • use same registers as Double Precision (DP)
    • same in-register performance as DP
    • better memory hierarchy performance because SP values take half the space of DP
  • DIVIDE
  • SQRT
slide36

The Fused Multiply Add (FMA) instruction

  • combines a floating point multiply with an add
    • i.e. (a * b) + c
  • each FPU can complete one per cycle (if there are no pipeline stalls)
  • four flavours:
    • Mult/Add frt = fra*frc + frb
    • Mult/Sub frt = fra*frc - frb
    • -ve Mult/Add frt = -(fra*frc + frb)
    • -ve Mult/Sub frt = -(fra*frc - frb)
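
C99 exposes the fused operation as fma() in <math.h>, which computes a*c + b with a single rounding like the hardware instruction; the mapping below is an illustrative sketch (actual instruction selection is up to the compiler, which also fuses plain a*c + b expressions on its own).

```c
#include <math.h>

/* The four flavours expressed with C99's fma().  Negation of the
 * result is exact, so -fma(...) matches the negative forms. */
double madd (double a, double c, double b) { return  fma(a, c,  b); } /* fra*frc + frb    */
double msub (double a, double c, double b) { return  fma(a, c, -b); } /* fra*frc - frb    */
double nmadd(double a, double c, double b) { return -fma(a, c,  b); } /* -(fra*frc + frb) */
double nmsub(double a, double c, double b) { return -fma(a, c, -b); } /* -(fra*frc - frb) */
```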
slide37

Floating point performance (1 of 2)

  • FMAs:
    • about 5 cycle latency
    • pipeline capable of completing one FMA per FPU per cycle (i.e. 2 FLOPs per cycle per FPU)
    • a floating point MULTIPLY or a floating point ADD is just an FMA with one of the operands omitted
slide38

Floating point performance (2 of 2)

  • DIVIDE and SQRT:
    • not pipelined so very costly
    • but each FPU can perform one, so two independent operations can proceed concurrently
    • divides take about 30 cycles (or an average of about 15 cycles if each FPU is performing a divide)
    • square roots take about 36 cycles (or an average of about 18 cycles if each FPU is performing one)
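
A standard coping strategy (hypothetical sketch): hoist a loop-invariant divide out of the loop and multiply by the reciprocal, trading one non-pipelined ~30-cycle divide per element for fully pipelined multiplies. The result can differ in the last bit, so this is a transformation compilers only make with explicit permission.

```c
/* One divide per element: ~30 cycles each, and divides do not
 * pipeline, so the FPU is blocked while each one completes. */
void scale_div(double *y, const double *x, double d, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] / d;
}

/* One divide total, then multiplies at up to one per FPU per
 * cycle.  Rounding may differ slightly from the divide version. */
void scale_recip(double *y, const double *x, double d, int n)
{
    double r = 1.0 / d;
    for (int i = 0; i < n; i++)
        y[i] = x[i] * r;
}
```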
slide39

POWER4 vs VPP5000 single-CPU performance

Function            VPP5000 (ops/cycle)   POWER4 (ops/cycle)   Relative performance
Multiply            16                    2                    1.85
Add                 16                    2                    1.85
Multiply and Add    16                    2                    1.85
Divide              16/4                  2/30                 13.85
Square root         48/20                 2/36                 10

Note: values are "per processor" and take into account the ECMWF VPP5000's clock speed of 333MHz vs the ECMWF POWER4's clock speed of 1.3GHz.

slide40

IEEE floating point

  • FMA (DP & SP) does not round between M and A
  • more accurate than IEEE (except in pathological cases)
    • for example, d = a*b - a*b may yield the (nonzero) rounding error of a*b instead of zero
  • technically violates IEEE standard
  • -qnomaf compiler option forces IEEE conformance
    • results will be slower and almost certainly less accurate!
    • don't use it unless you REALLY need to
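
The d = a*b - a*b case can be reproduced portably with C99's fma() (a hypothetical demo; whether a plain a*b - a*b expression is fused depends on compiler options such as -qmaf/-qnomaf):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0 / 3.0, b = 3.0;
    double p = a * b;           /* product rounded to double     */
    double d = fma(a, b, -p);   /* exact a*b minus the rounded p */

    /* Strict IEEE evaluation of a*b - a*b is exactly 0.0; the
     * fused form returns the tiny rounding error of the product
     * instead -- more accurate, but surprising to code that
     * compares the result with 0.0. */
    printf("d = %g\n", d);
    return 0;
}
```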