What is an ISA? • Hardware-software interface • Instruction Set Architecture (ISA) defines: • STATE OF THE PROGRAM (processor registers, memory) • WHAT INSTRUCTIONS DO: Semantics of instructions, how they update state • HOW INSTRUCTIONS ARE REPRESENTED: Syntax (bit encodings) • …selected so that the implications of the above for hardware design and compiler design are optimal • Example: if the register specifier sits in different bit positions in different instruction formats, the decoder needs multiple wire paths and a mux in front of the register file.
Why is the ISA important? • Fixed h/w-s/w interface for a generation of processors • IBM realized early the value of a fixed ISA • But: “stuck” with bad decisions for long time • Recent developments mitigate ISA problems (e.g., x86 micro-ops, Transmeta, virtual machines) • ISA decisions affect: (Revisit RISC vs. CISC…) • Memory cost of the machine • Short vs. long bit encodings • high vs. low semantic meaning per instruction • Hardware design • Simple, uniform-complexity ops => efficient pipeline • Don’t build hardware for instructions that never get used • Compiler and programming language issues • How much can compiler exploit ISA to optimize perf. • How well does ISA support high-level lang. constructs • Choice for hand coding vs. compiler generated code: semantics are easy to use vs. easy to generate code for
ISA Design Decisions & Outline: • Style of operand specification: stack, accumulator, registers, etc. • Operand access limitations • Addressing modes for operands • Semantics: • Mix of operations supported • Control transfers • Encoding tradeoffs • Compiler influence • Example: MIPS
Styles of ISAs • Stack, accumulator, register-memory, load/store (register-register) • All implement: • C = A + B
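To make the contrast concrete, here is a minimal Python sketch of two of these styles evaluating C = A + B. The mini-interpreters, the `mem` dictionary, and the instruction comments are illustrative assumptions, not any real ISA:

```python
# Hypothetical mini-interpreters contrasting two ISA styles on C = A + B.
mem = {"A": 3, "B": 4, "C": 0}

def run_stack(mem):
    """Stack style: calculation ops take zero explicit operands."""
    stack = []
    stack.append(mem["A"])            # PUSH A
    stack.append(mem["B"])            # PUSH B
    b, a = stack.pop(), stack.pop()   # ADD (operands implicit: top of stack)
    stack.append(a + b)
    mem["C"] = stack.pop()            # POP C

def run_loadstore(mem):
    """Load/store register style: arithmetic only touches registers."""
    r = [0] * 4
    r[1] = mem["A"]                   # LOAD  R1, A
    r[2] = mem["B"]                   # LOAD  R2, B
    r[3] = r[1] + r[2]                # ADD   R3, R1, R2
    mem["C"] = r[3]                   # STORE C, R3

run_stack(mem)
run_loadstore(mem)
```

Note how the stack version needs no operand fields on the ADD at all (the compact-encoding point made below), while the load/store version names all three registers explicitly.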
Why stacks, accumulators • Stacks: • Very compact format • All calculation operations take zero operands • Example use: Java bytecode (low network b/w) • Theoretically shortest code for implementing arithmetic expressions • All HP calculator fanatics know this • Accumulator: • Also a very compact format • Less dependence on memory than stack-based • For both: • Compact implies memory efficient • Good if memory is expensive
Why registers? • Faster than memory • Latency: raw access time (once address is known) • Cache access: 2-3 cycles (typical) • Register access: 1 cycle • Register file typically smaller than data cache • Register file doesn’t need tag check logic • Bandwidth: more practical to multiport a register file • ILP requires large number of operand ports • ILP requirements • High-performance scheduling (ILP) requires detecting data dependent/independent operations early in pipeline • Register “addresses” are known at instruction decode time • Memory addresses are known quite late due to address computation
Why Registers? (cont.) • Less memory traffic if values are in registers • Program runs faster if variables are inside registers (compiler does “register allocation”) • Bus can be used for other things (e.g., I/O) • More flexible for compiler/hardware scheduling • (A*B) - (C*D) - (E*F) • A*B in R1, -C*D in R2, -E*F in R3: can easily rearrange ADD instructions • A to F on the stack: less flexible • Need to add swaps/rotates or completely rewrite code
How many registers? • Depends on: • Compiler ability • Program characteristics • Lots-o-registers enable two important optimizations: • Register allocation (more variables can be in registers) • Limiting reuse of registers improves parallelism • Reuse example: Load R2, A; Load R3, B; Load R4, C; Load R5, D Add R1, R2, R3 Add R2, R5, R4 (reuse of R2) vs. Add R1, R2, R3 Add R6, R4, R5 (no reuse: uses R6) • Without reuse the Adds are "parallelizable" if there are two adders; the reuse conflict artificially serializes the two instructions • Instruction level parallelism (ILP) • ILP ~ average IPC = (CPI)^-1, and grows with the number of available registers
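The reuse example above can be checked mechanically. This is a small sketch (the `hazards` helper and the (dest, srcs) tuple encoding are assumptions for illustration) that classifies the dependences between two instructions:

```python
def hazards(first, second):
    """Classify dependences between two instructions given as (dest, srcs)."""
    d1, s1 = first
    d2, s2 = second
    h = set()
    if d1 in s2:
        h.add("RAW")   # true dependence: second reads what first wrote
    if d2 in s1:
        h.add("WAR")   # anti-dependence caused by register reuse
    if d2 == d1:
        h.add("WAW")   # output dependence caused by register reuse
    return h

# Reuse of R2:  Add R1,R2,R3  then  Add R2,R5,R4  -> artificially serialized
print(hazards(("R1", ["R2", "R3"]), ("R2", ["R5", "R4"])))   # WAR hazard
# No reuse:     Add R1,R2,R3  then  Add R6,R4,R5  -> fully parallel
print(hazards(("R1", ["R2", "R3"]), ("R6", ["R4", "R5"])))   # no hazard
```

With more registers available, the compiler can always pick a fresh destination and avoid the WAR/WAW cases entirely, which is exactly why register count correlates with ILP.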
Operand access limitations • Load/store (0,3) • (+) Fixed-length instructions possible: easy fetch/decode • (+) Simpler h/w: efficient pipeline & potentially lower CT • (-) Higher instruction count (IC) • (-) Fixed-length instructions are wasteful • Register/memory (1,2) • (+) No need for extra loads • (+) "A few lengths" better uses bits • (-) Destroys source operand (e.g., Add R1,R2) • (-) May impact CPI • Memory/memory • (+) Most compact (good code density) • (-) High memory traffic (memory bottleneck)
Alignment • Byte alignment • Any access is accommodated • Word alignment • Only accesses aligned at natural word boundaries are accommodated, due to DRAM/SRAM organization • Reduces number of reads/writes to memory • Eliminates hardware for alignment (typically expensive) • Often handle misalignment via software: • Compiler detects & generates appropriate instructions • …or O/S detects and runs "fixit" routine • [Figure: memory as bytes 0-7 with word size = 4 bytes. Asking for words beginning at byte 0 or 4 is OK (one read each). Asking for a word starting at byte 2 is unaligned: it requires read #1 (bytes 0-3) and read #2 (bytes 4-7), then a reorder step to assemble bytes 2,3,4,5.]
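The cost of misalignment can be stated as a one-line formula: an access spans as many aligned words as there are word boundaries it crosses, plus one. A minimal sketch (the `words_touched` helper is an assumption for illustration):

```python
WORD = 4  # word size in bytes, as in the figure above

def words_touched(addr, size=WORD):
    """Number of aligned words an access to [addr, addr+size) spans."""
    first = addr // WORD
    last = (addr + size - 1) // WORD
    return last - first + 1

print(words_touched(0))  # aligned: one read
print(words_touched(4))  # aligned: one read
print(words_touched(2))  # unaligned: two reads, then byte reordering
```

A word-aligned-only machine can simply omit the two-read-plus-reorder path; that is the hardware it "eliminates."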
Endian-ness • Where is the most-significant byte (MSB) in a word? • Little-endian (e.g., x86) • "little"-endian comes from interpreting byte address 0 as the "least"-significant byte • Big-endian (e.g., IBM PowerPC) • "big"-endian comes from interpreting byte address 0 as the "most"-significant byte • [Figure: byte addresses 0-7 laid out MSB-to-LSB under each convention, showing the MSB at the highest byte address for little-endian and at byte address 0 for big-endian.]
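The two conventions are easy to observe with Python's standard `struct` module, which packs an integer into bytes under an explicit byte order:

```python
import struct

value = 0x01020304
little = struct.pack("<I", value)  # little-endian, x86-style
big = struct.pack(">I", value)     # big-endian, PowerPC-style

print(hex(little[0]))  # byte address 0 holds the LEAST-significant byte
print(hex(big[0]))     # byte address 0 holds the MOST-significant byte
print(little == big[::-1])  # same word, byte order reversed
```

This is also why the same word dumped from memory on the two machine families prints its bytes in opposite order.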
Common addressing modes • Register • Add R4, R3 • R4 = R4 + R3 • Used when value is in a register • Immediate • Add R4, #3 • R4 = R4 + 3 • Useful for small constants, which occur frequently • Displacement • Add R4, 100(R1) • R4 = R4 + Mem[100+R1] • Accesses the frame (arguments, local variables) • Accesses the global data segment • Accesses fields of a data struct
Addressing modes (cont.) • Register deferred/Register indirect • Add R3, (R1) • R3 = R3 + Mem[R1] • Access using a computed address • Indexed • Add R3, (R1 + R2) • R3 = R3 + Mem[R1 + R2] • Array accesses • R1 = base, R2 = index • Direct/Absolute • Add R1, (1001) • R1 = R1 + M[1001] • Accessing global (“static”) data
Addressing modes (cont.) • Memory indirect/Memory deferred • Add R1, @(R3) • R1 = R1 + Mem[Mem[R3]] • Pointer dereferencing: x = *p; (if p is not register-allocated) • Autoincrement/Postincrement • Add R1, (R2)+ • R1 = R1 + Mem[R2]; R2 = R2 + d (d is size of operation) • Looping through arrays, stack pop • Autodecrement/Predecrement • Add R1, -(R2) • R2 = R2 - d; R1 = R1 + Mem[R2] (d is size of operation) • Same uses as autoincrement, stack push • Scaled • Add R1, 100(R2)[R3] • R1 = R1 + Mem[100+R2+R3*d] (d is size of operation) • Array accesses for non-byte-sized elements
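The effective-address arithmetic behind these modes can be sketched directly. Everything here (the `mem` dictionary standing in for memory, the register values, the helper names) is a made-up illustration of the mode semantics listed above:

```python
# Toy memory and register file; addresses and values chosen arbitrarily.
mem = {108: 5, 1001: 7, 2000: 9, 500: 2000}
R = {"R1": 8, "R3": 500}

def displacement(disp, reg):       # 100(R1):  Mem[100 + R1]
    return mem[disp + R[reg]]

def register_indirect(reg):        # (R3):     Mem[R3]
    return mem[R[reg]]

def direct(addr):                  # (1001):   Mem[1001]
    return mem[addr]

def memory_indirect(reg):          # @(R3):    Mem[Mem[R3]]
    return mem[mem[R[reg]]]

print(displacement(100, "R1"))     # mem[108]
print(direct(1001))                # mem[1001]
print(memory_indirect("R3"))       # mem[mem[500]] = mem[2000]
```

Note that memory indirect is the only mode here that needs two memory accesses per operand, which is one reason RISC ISAs drop it.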
Wisdom about modes • Need: • Register, Displacement, Immediate and optionally Indexed (indexed simplifies array accesses) • Displacement size 12-16 bits (empirical) • Immediate: 8 to 16 bits (empirical) • Can synthesize the rest from simpler instructions • Example-- MIPS architecture: • Register, displacement, Immediate modes only • both immediate and displacement: 16 bits • Choice depends on workload! • For example, floating-point codes might require larger immediates, or 64bit wordsize machines might also require larger immediates (for *p++ kind of operations)
Control transfer semantics • Types of branches • Conditional • Unconditional • Normal • Call • Return • PC Relative (Branch) vs. Absolute (Jump) • Branch allows relocatable (“position independent”) code • Jump allows branching further than PC relative
Parts of a control transfer • WHERE • Determine target address • WHETHER • Determine if transfer should occur or not • WHEN • Determine when in time the transfer should occur • Each of the three decisions can be decoupled
Types of control transfer (cont.) • All three together: Compare and branch instruction • Br (R1 = R2), destination • (+) A single instruction • (-) Heavy hardware requirement, inflexible scheduling • WHETHER separate from WHERE/WHEN: • Condition code register (CMP R1,R2 … BEQ dest) • (+) Sometimes test happens "for free" • (-) Hard for compiler to figure out which instructions depend on CC register • Condition register (SUB R1,R2 … BEQ R1, dest) • (+) Simple to implement, dependencies between instructions are obvious to compiler • (-) Uses a register ("register pressure")
Prepare-to-branch • Decouple all three of WHERE / WHETHER / WHEN • WHERE: PBR BTR1 = destination • BTR1 = “Branch target register #1” • WHETHER: CMP PR2 = (R1 = R2) • PR2 = “Predicate register #2” • WHEN BR BTR1 if PR2 • (+) Schedule each instruction so it happens during “free time” when hardware is idle • (-) Three instructions: higher IC • From the HP Labs PlayDoh architecture
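The three decoupled steps can be spelled out as straight-line state updates. This is a sketch of the semantics only; the `state` dictionary and variable names are assumptions, not PlayDoh syntax:

```python
# Machine state touched by a prepare-to-branch sequence.
state = {"BTR1": None, "PR2": None, "PC": 0}
R1, R2, DEST = 5, 5, 100

state["BTR1"] = DEST              # WHERE:   PBR  BTR1 = destination
state["PR2"] = (R1 == R2)         # WHETHER: CMP  PR2 = (R1 = R2)
if state["PR2"]:                  # WHEN:    BR   BTR1 if PR2
    state["PC"] = state["BTR1"]

print(state["PC"])
```

Because each step writes an architecturally visible register (BTR1, PR2), the compiler can hoist the WHERE and WHETHER instructions arbitrarily far ahead of the WHEN, filling idle slots; the cost is exactly the extra instruction count noted above.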
Instruction Encoding tradeoffs • Variable width • Common instructions are short (1-2 bytes), less common or more complex instructions are long (>2 bytes) • (+) Very versatile, uses memory efficiently • (-) Instruction words must be decoded before number of instructions is known • Fixed width • Typically 1 instruction per 32-bit word (Alpha is 2 instructions per word) • (+) Every instruction word is an instruction, Easier to fetch/decode • (-) Uses memory inefficiently
Addressing mode encoding • Each operand has a “mode” field • Also called “address specifiers” • VAX, 68000 • (+) Very versatile • (-) Encourages variable-width instructions (hard decode) • Opcode specifies addressing mode • Most RISCs • (+) Encourages fixed-width instructions (easy decode) • (+) “Natural” for a load/store ISA • (-) Limits what every instruction can do • But only matters for loads and stores
Compiler impact • High-level opt: • Use a “virtual source level” representation • Loop interchange, etc. • Low-level opt: • Clean up parser refuse • Each “optimization pass” runs as a filter • Enhance parallelism • Code generation: • Allocate registers • Schedule code for high performance • More later on this Parse High-level intermediate language High-level Optimize Low-level intermediate language Low-level Optimize Low-level intermediate language Code generation: Allocate, Schedule translate Assembly code
Example: MIPS • A load/store, fixed-encoding architecture with a "condition register" architecture • Opcode is in the same place for every instruction • I-type instruction: Opcode (6) | rs1 (5) | rd (5) | Immediate (16) • Load, store, all immediate operations, conditional branches (rd unused) • Jump through register, call through register ("jump and link register") • R-type instruction: Opcode (6) | rs1 (5) | rs2 (5) | rd (5) | Shamt (5) | Func (6) • Register-register ALU operations; "Func" is an opcode extension • J-type instruction: Opcode (6) | Offset added to PC (26) • Jump, call ("jump and link"), trap and return from exception
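Fixed-width encoding makes packing and unpacking these formats pure shift-and-mask arithmetic. A sketch for the R-type layout above (the helper names and the sample field values are assumptions for illustration):

```python
def encode_rtype(opcode, rs1, rs2, rd, shamt, func):
    """Pack the R-type fields: opcode(6)|rs1(5)|rs2(5)|rd(5)|shamt(5)|func(6)."""
    return ((opcode << 26) | (rs1 << 21) | (rs2 << 16) |
            (rd << 11) | (shamt << 6) | func)

def opcode_of(word):
    """Opcode is in the same place for every instruction: top 6 bits."""
    return word >> 26

def rd_of_rtype(word):
    return (word >> 11) & 0x1F

word = encode_rtype(opcode=0, rs1=2, rs2=3, rd=1, shamt=0, func=0x20)
print(word.bit_length() <= 32)   # fits in one 32-bit instruction word
print(opcode_of(word))           # decodable before knowing the format
print(rd_of_rtype(word))
```

Because `opcode_of` works on any word without knowing its format, fetch/decode hardware can classify every instruction with one fixed extractor, which is the "(+) Easier to fetch/decode" advantage from the encoding-tradeoffs slide.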
ISA of MIPS • 32 integer registers R0-R31, each 64 bits; R0 is permanently 0; 64-bit PC • 32 floating-point registers F0-F31, each 64 bits; use one half of Fi for single-precision ops • Load/store architecture • Transfer sizes: B (byte), H (halfword), W (word), D (double word) • No unaligned accesses allowed • Only 3 addressing modes: register, immediate, displacement
MIPS example code
      DADDI R1,R0,#10    ; put 10 into R1 (R0 = 0)
      LD    R2,A         ; put address of A in R2
Loop: L.D   F0,0(R2)     ; load double FP value into F0
      ADD.D F4,F0,F2     ; add F2 to F0
      S.D   F4,0(R2)     ; store result back to memory
      DADDI R1,R1,#-1    ; decrement loop count
      DADDI R2,R2,#8     ; increment loop pointer (8 bytes = one double)
      BNEZ  R1,Loop      ; branch back while R1 != 0
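For reference, the loop above behaves like the following Python equivalent (assuming, as the code suggests, that A is a 10-element array of doubles and F2 holds a scalar to be added to each element):

```python
# Python rendering of the MIPS loop; array contents and F2 are assumed values.
A = [1.0] * 10   # 10 doubles starting at label A
F2 = 2.5         # scalar assumed to be preloaded into F2

i = 10                 # DADDI R1,R0,#10
p = 0                  # LD    R2,A      (R2 points at A[0])
while True:            # Loop:
    f0 = A[p]          # L.D   F0,0(R2)
    f4 = f0 + F2       # ADD.D F4,F0,F2
    A[p] = f4          # S.D   F4,0(R2)
    i -= 1             # DADDI R1,R1,#-1
    p += 1             # DADDI R2,R2,#8  (8 bytes = one double)
    if i == 0:         # BNEZ  R1,Loop
        break

print(A[0], A[9])
```

Each MIPS instruction maps to exactly one line, which illustrates the load/store discipline: the only statements touching `A` (memory) are the load and the store.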