
Instruction Sets



  1. INF5062: Programming asymmetric multi-core processors. Instruction Sets. August 22, 2014

  2. Instruction Sets • Why? • Give insight into the designers’ minds • Allow debugging • Classical programming • Von Neumann machines

  3. Instruction Sets • CISC • Complex instructions • Many cycles for each instruction • Irregular number of cycles per instruction • Two or more memory locations per instruction possible • Implicit load and store operations • Small code size • Chip surface used for instruction storage • RISC • Simple instructions • Few cycles for each instruction • Fixed or few different numbers of cycles per instruction • No memory locations in computing instructions • Explicit load and store operations • Large code size • Chip surface used for memory

  4. Instruction Sets

  5. Instruction Sets • Typical mechanisms for performance improvement • Caching • Data and instruction prefetching • Pipelining • Speculative execution • Out-of-order execution • Simultaneous multithreading (“hyperthreading”) • These mechanisms are not generally applied in heterogeneous multi-cores

  6. INF5062: Programming asymmetric multi-core processors IXP 2400 Microengine instruction set

  7. IXP Microengine Assembler • SISD architecture • Assembler provides • Detailed control of timing • Detailed control of hardware context switching • Understanding of actual instructions • Assembler requires • Knowledge of the supporting hardware • Need to know • There is no stack

  8. IXP: I/O and Context Swap Instructions • Can transfer bursts of data • Operate asynchronously • Can generate signals when finished • Need two registers to specify addresses • Where memory is aligned, address register bits aren’t shifted, but the least significant bits are ignored • Can be combined with command tokens • Support so-called indirect references: the result of a previous ALU instruction modifies the meaning

  9. IXP: general instructions

  10. IXP: general instructions • Aligning long sequences of registers • This can be aligned with asynchronous load and store ops
    LOCAL_CSR_WR[byte_index,2]
    NOP
    NOP
    NOP
    BYTE_ALIGN_BE[--,r0]
    BYTE_ALIGN_BE[dest1,r1]
    BYTE_ALIGN_BE[dest2,r2]
    BYTE_ALIGN_BE[dest3,r3]
    ...
    ...

  11. IXP: general instructions • General arithmetic operations on all kinds of registers
    • Status check
      ALU[r1,--,B,r1] : move r1 into r1, just updating the condition codes: N = (r1 < 0), Z = (r1 == 0), V = 0, C = 0 (negative, zero, overflow, carry)
    • Arithmetic operations
      ALU[r3,r1,+,r2] : r3 = r1 + r2; N = (r3 < 0), Z = (r3 == 0), V = (r1 + r2 > 2^31), C = (r1 + r2 > 2^32)
      ALU[r5,r1,+,r2] ; ALU[r6,r3,+carry,r4] : r5 = r1 + r2; r6 = r3 + r4 + carry bit
      ALU[r3,r1,+8,r2] : r3 = r1 + (r2 & 0xff)
    • Boolean operations
      ALU[r3,r1,AND~,r2] : r3 = r1 AND NOT r2
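
For readers who think in C rather than in microengine assembly, the condition-code behaviour of an ALU add can be modelled roughly as follows. This is a plain-C sketch of the flag definitions given on the slide, not IXP code:

    #include <stdint.h>

    /* Rough C model of the condition codes an ALU[rt,ra,+,rb] sets,
       following the slide's definitions of N, Z, V and C. */
    typedef struct { int n, z, v, c; } flags_t;

    static uint32_t alu_add(uint32_t ra, uint32_t rb, flags_t *f)
    {
        uint64_t wide = (uint64_t)ra + (uint64_t)rb;   /* keep the 33-bit result */
        uint32_t rt   = (uint32_t)wide;
        f->n = ((int32_t)rt < 0);                      /* negative */
        f->z = (rt == 0);                              /* zero */
        f->v = (wide > 0x7fffffffULL);                 /* the slide's "sum > 2^31" rule */
        f->c = (wide > 0xffffffffULL);                 /* carry: sum exceeds 32 bits */
        return rt;
    }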

  12. IXP: general instructions
    • Replacing bytes
      LD_FIELD_W_CLEAR[r2,0100,r1,<<rot12], not load_cc
      • move the content of r1 to temporary space t
      • rotate t left by 12 bits
      • set r2 = ( r2 & 0xff00ffff ) OR ( t & 0x00ff0000 )
      • don’t change any status bits
    • Find first set bit
      IMMED[r1,0xc00]
      FFS[r2,r1]
      • set r2 to the lowest index of a set bit in r1, here 19
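
In C terms, the two examples above can be sketched like this. This is a rough model of the semantics as described on the slide (not IXP code, and the helper names are invented):

    #include <stdint.h>

    /* Model of LD_FIELD_W_CLEAR[r2,0100,r1,<<rot12] as described above:
       rotate r1 left by 12 bits and copy only byte field 0100 into r2. */
    static uint32_t ld_field_0100_rot12(uint32_t r2, uint32_t r1)
    {
        uint32_t t = (r1 << 12) | (r1 >> 20);          /* rotate left by 12 */
        return (r2 & 0xff00ffff) | (t & 0x00ff0000);   /* replace one byte */
    }

    /* Model of FFS: index of the lowest set bit (caller ensures r1 != 0). */
    static int ffs_index(uint32_t r1)
    {
        int i = 0;
        while (((r1 >> i) & 1) == 0)
            i++;
        return i;
    }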

  13. IXP: branching instructions

  14. IXP: branching instructions • Jump table: • IMMED[r1,2] • JUMP[r1,label0#], targets[label0#, label1#, label2#, label3#] • ... • label0#: • BR[elsewhere0#] • label1#: • BR[elsewhere1#] • label2#: • BR[elsewhere2#] • label3#: • BR[elsewhere3#] • example jumps to label2
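
The same dispatch pattern written in C is simply an indexed table of targets; a minimal sketch with hypothetical handler names (elsewhere0 .. elsewhere3 stand in for the elsewhere#-labels of the slide):

    #include <stdio.h>

    /* Hypothetical handlers standing in for elsewhere0# .. elsewhere3#. */
    static void elsewhere0(void) { puts("elsewhere0"); }
    static void elsewhere1(void) { puts("elsewhere1"); }
    static void elsewhere2(void) { puts("elsewhere2"); }
    static void elsewhere3(void) { puts("elsewhere3"); }

    int main(void)
    {
        /* The targets[...] list of the JUMP instruction becomes an array. */
        void (*targets[4])(void) = { elsewhere0, elsewhere1, elsewhere2, elsewhere3 };
        int r1 = 2;          /* IMMED[r1,2] */
        targets[r1]();       /* JUMP[r1,label0#]: ends up in elsewhere2 */
        return 0;
    }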

  15. IXP: branching instructions • Emulating a function call: • ... • LOAD_ADDR[r1,return_label#] • BR[subroutine_label#] • return_label#: • ... • subroutine_label#: • ... • RTN[r1] • ... • example jumps to subroutine_label and returns to return_label (unless r1 gets overwritten)

  16. IXP: branching instructions • Branching operations can be deferred to “do something else” before jumping • Great for confusing the mortals • Especially when combined with ctx_arb • Saves registers • Don’t need to save temp2 in the example
    ...
    immed[temp1,20,<<0]
    immed[temp3,10,<<0]
    alu[temp2,temp1,-,temp3]
    bgt[label#], defer[3]
    alu[temp,temp,+,temp2]
    alu[temp,--,B,temp2]
    alu[--,temp,-,temp3]
    alu[temp,--,B,temp1]
    ...
    label#:
    ...

  17. IXP: content addressable memory • Operate on CAM memory • 1 per Microengine • 16 longwords of 32 bits • operated as an LRU cache of 32-bit write operations • a lookup hit refreshes an entry’s LRU position • a 4-bit state can be associated with each CAM entry • written by CAM_WRITE_STATE • retrieved by CAM_LOOKUP or CAM_READ_STATE • The IXP1200 had CAM access to all SRAM memory!!! • apparently it wasn’t worth it ...
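
A plain-C model of the lookup behaviour described above may help; the structure and helper names are invented for illustration and are not the IXP microcode API (lru[] is assumed to be initialised to 0..15):

    #include <stdint.h>

    #define CAM_ENTRIES 16

    /* Model of the per-microengine CAM: 16 tagged entries with a 4-bit state,
       managed as an LRU cache. */
    typedef struct {
        uint32_t tag[CAM_ENTRIES];
        uint8_t  state[CAM_ENTRIES];   /* the 4-bit state per entry */
        uint8_t  lru[CAM_ENTRIES];     /* lru[0] = least recently used index */
        int      valid[CAM_ENTRIES];
    } cam_t;

    /* Move entry idx to the most-recently-used end of the LRU order. */
    static void cam_touch(cam_t *c, int idx)
    {
        int i, pos = 0;
        for (i = 0; i < CAM_ENTRIES; i++)
            if (c->lru[i] == idx) pos = i;
        for (i = pos; i < CAM_ENTRIES - 1; i++)
            c->lru[i] = c->lru[i + 1];
        c->lru[CAM_ENTRIES - 1] = (uint8_t)idx;
    }

    /* A hit returns the entry index and refreshes its LRU position;
       a miss returns -1 and reports the LRU entry as the victim to reuse. */
    static int cam_lookup(cam_t *c, uint32_t tag, int *victim)
    {
        for (int i = 0; i < CAM_ENTRIES; i++) {
            if (c->valid[i] && c->tag[i] == tag) {
                cam_touch(c, i);
                return i;
            }
        }
        *victim = c->lru[0];
        return -1;
    }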

  18. INF5062: Programming asymmetric multi-core processors Cell Broadband Engine SPU instruction set

  19. CELL • The CELL SPU instruction set is very similar to AltiVec • A set of SIMD instructions for floating-point operations • Single precision is/was considered sufficient

  20. CELL • The CBE is big-endian • The SPEs have 16-byte (128-bit) registers • [Figure: a 16-byte register; byte 0 holds the most significant bits and sits at the lowest memory address, byte 15 holds the least significant bits and sits at the highest memory address] • Big-endian escapes the typical x86 pitfall:
    char c[4];
    int *e;
    c[0]=0x01; c[1]=0x02; c[2]=0x03; c[3]=0x04;
    e = (int*)c;
    printf("%x\n", *e);   /* prints 4030201 on little-endian x86, 1020304 on the big-endian CBE */

  21. CELL • The CBE is big-endian • The SPEs have 16-byte (128-bit) registers • [Figure: one 128-bit register seen as 16 bytes/chars (0-15, byte 0 at the MSB end, byte 15 at the LSB end), 8 half-words, 4 words, 2 doublewords, or 1 quadword]

  22. CELL • “Preferred slots” of ABI types in registers • [Figure: where the scalar data types char, short, int and double sit in a 128-bit register, shown next to the layouts of the data types qword, vector double, vector signed int and vector signed short] • So what have they smoked?

  23. CELL: groups of instructions

  24. CELL: groups of instructions
    • il rt,0x12 : straight-forward; int[0] = int[1] = int[2] = int[3] = 0x12
    • fsmi rt,0x9c1a : see together with the shufb instruction; 0x9c1a = 1001110000011010, and each bit of the immediate is expanded into one byte of rt: Byte 0 = 0xff, Byte 1 = 0x00, Byte 2 = 0x00, Byte 3 = 0xff, Byte 4 = 0xff, ..., Byte 15 = 0x00
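
A small C model of the fsmi expansion shown above (illustrative only, not SPU source):

    #include <stdint.h>
    #include <stdio.h>

    /* fsmi model: expand each bit of a 16-bit immediate (MSB first) into one
       byte of the 16-byte result, 0xff for a 1 bit and 0x00 for a 0 bit. */
    static void fsmi_model(uint16_t imm, uint8_t rt[16])
    {
        for (int i = 0; i < 16; i++)
            rt[i] = (imm & (1u << (15 - i))) ? 0xff : 0x00;
    }

    int main(void)
    {
        uint8_t rt[16];
        fsmi_model(0x9c1a, rt);
        for (int i = 0; i < 16; i++)
            printf("%02x ", rt[i]);   /* ff 00 00 ff ff ff 00 00 00 00 00 ff ff 00 ff 00 */
        printf("\n");
        return 0;
    }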

  25. CELL: groups of instructions
    • ah rt,ra,rb : halfword-wise addition; for each of the eight halfword slots i: rt[i] = ra[i] + rb[i]
    • cg rt,ra,rb : carry generate; slot-wise: rt[i] = ( ra[i] + rb[i] > 2^16 ) ? 0x01 : 0x00
    • bgx rt,ra,rb : borrow generate with extend; slot-wise: rt[i] = ( ... ) ? 0x00 : 0x01
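
As a plain-C picture of what a SIMD halfword add does on its eight independent slots (a sketch, not SPU code):

    #include <stdint.h>

    /* Model of ah rt,ra,rb: element-wise addition of eight 16-bit slots;
       each slot wraps around on its own, carries never cross slot borders. */
    static void ah_model(uint16_t rt[8], const uint16_t ra[8], const uint16_t rb[8])
    {
        for (int i = 0; i < 8; i++)
            rt[i] = (uint16_t)(ra[i] + rb[i]);
    }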

  26. CELL: groups of instructions
    • mpyhha rt,ra,rb : for each word slot, multiply the upper halfwords and accumulate:
      rt{0:3} = ra{0:1} * rb{0:1} + rt{0:3}
      rt{4:7} = ra{4:5} * rb{4:5} + rt{4:7}
      rt{8:11} = ra{8:9} * rb{8:9} + rt{8:11}
      rt{12:15} = ra{12:13} * rb{12:13} + rt{12:15}

  27. CELL: groups of instructions
    • fsm rt,ra : form select mask; each of the four low bits of ra’s preferred slot is expanded into one word of rt:
      rt{0:3} = ( bit == 0 ) ? 0x00000000 : 0xffffffff
      rt{4:7} = ( bit == 0 ) ? 0x00000000 : 0xffffffff
      rt{8:11} = ( bit == 0 ) ? 0x00000000 : 0xffffffff
      rt{12:15} = ( bit == 0 ) ? 0x00000000 : 0xffffffff

  28. CELL: groups of instructions
    • sumb rt,ra,rb : sum bytes into halfwords:
      rt{0:1} = rb{0} + rb{1} + rb{2} + rb{3}
      rt{2:3} = ra{0} + ra{1} + ra{2} + ra{3}
      rt{4:5} = rb{4} + rb{5} + rb{6} + rb{7}
      ...
      rt{14:15} = ra{12} + ra{13} + ra{14} + ra{15}
    • and rt,ra,rb : bit-wise: t = a AND b
    • eqv rt,ra,rb : bit-wise: t = NOT ( a XOR b )
    • selb rt,ra,rb,rc : bit-wise: t = c ? b : a
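
Because the compare instructions produce all-ones or all-zero masks, selb gives branch-free selection. A one-line C model of selb on a 32-bit chunk of the registers (illustrative only):

    #include <stdint.h>

    /* Model of selb rt,ra,rb,rc: every bit of rc picks the corresponding bit
       from rb (if the rc bit is 1) or from ra (if it is 0). */
    static uint32_t selb_model(uint32_t ra, uint32_t rb, uint32_t rc)
    {
        return (rc & rb) | (~rc & ra);
    }

Combined with an instruction that produces such all-ones/all-zero masks (fsm above, or the compares ceqb and clgt later in this section), this is the usual way to write data-dependent choices on the SPU without branching.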

  29. CELL: groups of instructions
    il rd,0
    chd rc,4(rd)
    shufb rt,ra,rb,rc
    • rd{0:15} = 0
    • chd generates shuffle controls for inserting a halfword at byte offset 4(rd)
    • [Figure: each byte of the control register rc selects one byte of rt: control values 0x00-0x0f pick the corresponding byte of ra, values 0x10-0x1f pick the corresponding byte of rb]

  30. CELL: groups of instructions
    shufb rt,ra,rb,rc
    • [Figure: a shuffle example; besides the byte-selecting control values 0x00-0x1f, the special control bytes 0x80, 0xc0 and 0xe0 produce the constant result bytes 0x00, 0xff and 0x80]
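
The full selection rule of shufb can be written compactly in C; this is a model of the commonly documented behaviour, for illustration only:

    #include <stdint.h>

    /* Model of shufb rt,ra,rb,rc: each control byte of rc either picks one
       byte out of ra or rb, or produces one of the constants 0x00/0xff/0x80. */
    static void shufb_model(uint8_t rt[16], const uint8_t ra[16],
                            const uint8_t rb[16], const uint8_t rc[16])
    {
        for (int i = 0; i < 16; i++) {
            uint8_t c = rc[i];
            if ((c & 0xc0) == 0x80)      rt[i] = 0x00;           /* 10xxxxxx */
            else if ((c & 0xe0) == 0xc0) rt[i] = 0xff;           /* 110xxxxx */
            else if ((c & 0xe0) == 0xe0) rt[i] = 0x80;           /* 111xxxxx */
            else if (c & 0x10)           rt[i] = rb[c & 0x0f];   /* pick from rb */
            else                         rt[i] = ra[c & 0x0f];   /* pick from ra */
        }
    }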

  31. CELL: groups of instructions

  32. CELL: groups of instructions
    • shlqbybi rt,ra,rb : shift left quadword by bytes, the byte count being derived from a bit-shift count in rb
    • [Figure: with a count corresponding to 3 bytes, the 16 bytes of ra move three positions towards the most significant end of rt and three zero bytes are shifted in at the least significant end]

  33. CELL: groups of instructions
    • ceqb rt,ra,rb : byte-wise: t = ( a == b ) ? 0xff : 0x00
    • clgt rt,ra,rb : word-wise: t = ( a > b ) ? 0xffffffff : 0x00000000
    • brsl rt,symbol : rt = PC+1; jump to PC+symbol
    • hbra brinst,brtarg : warns the processor that there will be a jump to brtarg, coming brinst instructions from the current PC
    • brz rt,symbol : if ( rt{4:7} == 0 ) jump to symbol

  34. CELL: groups of instructions

  35. CELL: groups of instructions • Interrupts: • an SPE can have one registered interrupt handler, at address 0 • iret returns from such an interrupt • alternatively • interrupts can be disabled, and • programs can use the bisled command (branch indirect and set link if external data) to check conditions that might otherwise be handled by an interrupt • Channels: • A CELL chip has 128 so-called channels • A channel is an atomic pipe for communication between the SPE and its environment • The meaning of channels is defined by integrators and implemented in the MMIO (memory-mapped I/O) unit • In current-day CELLs, channels are among other things used for implementing mailboxes and DMA operations • So the meaning of these commands varies with the platform

  36. INF5062: Programming asymmetric multi-core processors nVIDIA GPGPU CUDA instruction set

  37. nVIDIA G92 • [Figure: one Streaming Multiprocessor (SM): instruction fetch, instruction L1 cache, thread/instruction dispatch, shared memory, eight stream processors SP0-SP7 with register files RF0-RF7, two SFUs, and constant and texture L1 caches with load/store paths to memory] • Streaming Multiprocessor (SM) • 8 Stream Processors (SPs) • Sharing local ”shared memory” • Sharing L1 cache • Instruction set: PTX • Parallel Thread Execution • PTX is an abstraction, meant to make GPGPU programming easier to handle • Parts of the card’s operation are not formulated in PTX • PTX programs are sandboxed, running in a virtual machine-like environment

  38. nVIDIA G92 • [Same SM figure as on the previous slide] • SPs are not as independent as PTX seems to imply • From the manual: ”Each multiprocessor has a Single Instruction, Multiple Data architecture (SIMD): At any given clock cycle, each processor of the multiprocessor executes the same instruction, but operates on different data.“ • Implications • You program several different threads • Threads in the same SM execute the same instruction, some may wait • If threads don’t execute the same instruction, they are divergent

  39. nVIDIA G92 • [Same SM figure as on the previous slides] • Parallelism • Threads are grouped into warps • Threads of one warp run on the same SM at the same time • Efficiency is lost if they diverge • The warps form cooperative thread arrays, or CTAs • One CTA runs on only one SM • CTAs have a 1D, 2D or 3D shape • software-defined • determines the threads’ 3D ID (%tid) • An SM can handle several CTAs in time-sharing mode • Time-sharing is preemptive • The instruction set hides this entirely
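
In CUDA C terms, the CTA shape is the block dimension chosen at kernel-launch time, and %tid is what the kernel sees as threadIdx. A minimal sketch with a hypothetical kernel (not from the slides):

    #include <cstdio>

    // Each thread derives a linear index from its 3D ID (PTX: %tid) and the
    // software-defined CTA shape (PTX: %ntid), and records it in global memory.
    __global__ void record_tid(int *out)
    {
        int linear = threadIdx.x
                   + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;
        out[linear] = linear;
    }

    int main(void)
    {
        dim3 cta(8, 4, 2);                 // 3D CTA shape: 8*4*2 = 64 threads
        int *d_out;
        cudaMalloc((void **)&d_out, 64 * sizeof(int));
        record_tid<<<1, cta>>>(d_out);     // one CTA; it runs on a single SM
        int h_out[64];
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
        printf("thread 63 recorded %d\n", h_out[63]);
        cudaFree(d_out);
        return 0;
    }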

  40. CUDA • PTX doesn’t have conditional branches • However • it has absolute branches • it has the ability to hop over instructions (“predicate”) • Want to emulate: if(i<n) j=j+1 • Steps:
    .reg .pred p
    setp.lt.s32 p, i, n
    @p add.s32 j, j, 1
    • Branching:
    setp.lt.s32 p, i, n
    @p bra label
    ...
    label:
    ...

  41. CUDA • Branching is by default divergent • Must tell the assembler if this is not the case:
    setp.lt.s32 p, i, n
    @p bra.uni label
    ...
    label:
    ...
    • The threads can declare, load and store vectors of intrinsic types • .v2, .v3, .v4 of .b8, .b16, .b32, .b64, .s8, .s16, .s32, .s64, .u8, .u16, .u32, .u64, .f32, .f64 • vector double a as on the Cell would be: .v2 .f64 • but PTX can even say: ld.shared.v4.f64 rt,[a] • And, of course, this instruction is effectively working on 8 of these in parallel (2048 bits)

  42. CUDA

  43. CUDA
    • Given predicate p and general registers rt, ra, rb: selp.s32 rt, ra, rb, p sets rt = p ? ra : rb
    • General registers rt, ra, rb: mul.hi.u32 rt, ra, rb computes tmp = ra*rb and sets rt = high 32 bits of tmp
    • Load an int into rt from shared memory addressed by ra plus an offset: ld.shared.u32 rt,[ra+16]
    • Load a double vector into rt from global memory, absolutely addressed: ld.global.v4.f64 rt,[240]
    • A true function call, recursion forbidden, in this case non-divergent: call.uni label
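
At the CUDA C level some of these appear as intrinsics and ordinary expressions; whether the compiler emits exactly selp or mul.hi is up to it. A hedged sketch with a hypothetical kernel:

    #include <cstdio>

    // __umulhi() is the CUDA C counterpart of mul.hi.u32; the ternary
    // expression is a natural candidate for selp.
    __global__ void demo(unsigned int a, unsigned int b, unsigned int *out)
    {
        unsigned int hi = __umulhi(a, b);   // high 32 bits of the 64-bit product
        out[0] = (a <= b) ? hi : a;         // select without branching
    }

    int main(void)
    {
        unsigned int *d_out, h_out;
        cudaMalloc((void **)&d_out, sizeof(unsigned int));
        demo<<<1, 1>>>(0x10000u, 0x10000u, d_out);
        cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
        printf("%u\n", h_out);              // 0x10000 * 0x10000 = 2^32 -> prints 1
        cudaFree(d_out);
        return 0;
    }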

  44. CUDA
    • A barrier, not initialized; requires all threads (SPs) to arrive here: bar.sync 0
    • Atomically adds ra to the signed integer at the position in global memory pointed to by a: atom.global.add.s32 d,[a], ra
    • Reciprocal value, rt = 1/ra: rcp.f64 rt, ra
    • Square root, rt = sqrt(ra): sqrt.f32 rt, ra
    • Base-2 logarithm, rt = log2(ra): lg2.f32 rt, ra
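
The first two have direct CUDA C counterparts: __syncthreads() compiles to a bar.sync, and atomicAdd() on global memory to an atom.global.add. A minimal sketch with a hypothetical kernel:

    #include <cstdio>

    // Every thread of every CTA atomically increments one global counter;
    // the barrier makes all threads of a CTA wait for thread 0's write.
    __global__ void count(int *counter)
    {
        __shared__ int ready;            // per-CTA shared memory
        if (threadIdx.x == 0) ready = 1;
        __syncthreads();                 // bar.sync: all threads of the CTA arrive here
        if (ready)
            atomicAdd(counter, 1);       // atom.global.add.s32 on the counter
    }

    int main(void)
    {
        int *d_counter, h_counter;
        cudaMalloc((void **)&d_counter, sizeof(int));
        cudaMemset(d_counter, 0, sizeof(int));
        count<<<2, 64>>>(d_counter);
        cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d\n", h_counter);       // 2 CTAs * 64 threads = 128
        cudaFree(d_counter);
        return 0;
    }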

  45. INF5062: Programming asymmetric multi-core processors Discussion

  46. A look into the ISAs of three specialized processors • Which IS did you like best? • What tasks would you have them do? • What do you think about IS abstractions à la PTX? • How hard is using the SIMD paradigm? • What don’t you know from knowing the ISA?
