Advanced Computer Architectures

Laboratory on DLX Pipelining
Vittorio Zaccaria

DLX
• Load/Store architecture
• Registers are faster than memory
• The compiler can perform deeper optimizations
• 16-bit offsets and immediates
• 32-bit integer registers
• 64-bit floating-point registers
• Fixed operation encoding:
  • Addressing mode contained in the operation code
  • Fits in one word
  • Faster decoding
Vittorio Zaccaria – Laboratory of Architectures

DLX (cont.)
• 32 general-purpose registers
• 32-bit instructions:

DLX Pipeline

Pipeline Visualization

Hazards
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  • Structural hazards: the hardware cannot support this combination of instructions
  • Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
  • Control hazards: pipelining of branches and other instructions that change the PC
• Common solution: stall the pipeline until the hazard is resolved, inserting one or more “bubbles” into the pipeline

Structural Hazards

Data Hazards

Control Hazards

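An example program:

            .data
    dati_a: .word 1,2,3,4,5,6,7,8
    dati_b: .word 2,3,4,5,6,7,7,9
            .text
            .global main
            addi r3,r0,0          ; r3 = byte offset into both arrays
    loop:   lw   r4,dati_a(r3)    ; r4 = current element of dati_a
            lw   r5,dati_b(r3)    ; r5 = current element of dati_b
            sub  r5,r5,r4         ; r5 = difference of the two elements
            addi r3,r3,4          ; advance to the next word
            bnez r5,loop          ; repeat while the difference is non-zero
    exit: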

1st Exercise:
• Draw the pipeline chart
• Indicate:
  • Data hazards between WB stages and ID stages
  • Control hazards between the EX stage and the IF stage
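As a cross-check for the hand-drawn chart, the data hazards of the example program can be listed mechanically. The following is a minimal sketch, not part of the original lab material: the tuple encoding of the program and the name `raw_hazards` are hypothetical. It assumes no forwarding and a register file that is written in WB early enough to be read by ID in the same cycle, so only consumers 1 or 2 slots behind their producer are at risk. It looks at straight-line order only and ignores dependencies carried around the loop by `bnez`.

```python
# Hypothetical helper for the exercise, not part of the original lab material.
# Each entry is (text, destination register or None, source registers).
PROGRAM = [
    ("addi r3,r0,0",       "r3", ["r0"]),
    ("lw   r4,dati_a(r3)", "r4", ["r3"]),
    ("lw   r5,dati_b(r3)", "r5", ["r3"]),
    ("sub  r5,r5,r4",      "r5", ["r5", "r4"]),
    ("addi r3,r3,4",       "r3", ["r3"]),
    ("bnez r5,loop",       None, ["r5"]),
]


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each RAW hazard.

    A consumer is flagged when one of its source registers is written by an
    instruction at most `window` slots earlier (assumption: no forwarding).
    """
    hazards = []
    for j, (_, _, sources) in enumerate(program):
        # r0 is hard-wired to zero, so it never carries a dependency.
        pending = {s for s in sources if s != "r0"}
        # Scan backwards so only the nearest producer of a register counts.
        for i in range(j - 1, max(-1, j - window - 1), -1):
            dest = program[i][1]
            if dest in pending:
                hazards.append((i, j, dest))
                pending.discard(dest)
    return hazards


if __name__ == "__main__":
    for i, j, reg in raw_hazards(PROGRAM):
        print(f"{PROGRAM[j][0]:<20} needs {reg} from {PROGRAM[i][0]}")
```

Each reported pair corresponds to one WB-to-ID arrow in the pipeline chart.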

Hazard Identification

2nd Exercise: Hazard Resolution
• Software solution
  • NOP insertion
• Hardware solutions
  • Bubble/stall generation
  • Register forwarding
• Software optimizations
  • Code rescheduling

NOP insertion

            addi r3,r0,0
            nop
            nop
    loop:   lw   r4,dati_a(r3)
            lw   r5,dati_b(r3)
            nop
            nop
            sub  r5,r5,r4
            addi r3,r3,4
            nop
            bnez r5,loop
            nop
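The NOP placement above follows a simple rule that can be sketched in code. This is an illustrative sketch under stated assumptions, not the procedure used in the lab: the name `insert_nops` and the tuple encoding are hypothetical, forwarding is assumed to be off, a consumer is assumed to be safe once it issues at least 3 slots after its producer, and one NOP is placed after every branch. Only straight-line order is considered.

```python
# Hypothetical sketch: pad straight-line DLX code with NOPs (no forwarding).
# Each entry is (text, destination register or None, sources, is_branch).
PROGRAM = [
    ("addi r3,r0,0",       "r3", ["r0"],       False),
    ("lw   r4,dati_a(r3)", "r4", ["r3"],       False),
    ("lw   r5,dati_b(r3)", "r5", ["r3"],       False),
    ("sub  r5,r5,r4",      "r5", ["r5", "r4"], False),
    ("addi r3,r3,4",       "r3", ["r3"],       False),
    ("bnez r5,loop",       None, ["r5"],       True),
]


def insert_nops(program, min_distance=3):
    """Return the list of issue slots (instruction text or "nop")."""
    out = []          # scheduled slots so far
    last_write = {}   # register -> slot index of its most recent producer
    for text, dest, sources, is_branch in program:
        for src in sources:
            # r0 is constant zero and never needs to be waited for.
            if src != "r0" and src in last_write:
                while len(out) - last_write[src] < min_distance:
                    out.append("nop")
        out.append(text)
        if dest is not None:
            last_write[dest] = len(out) - 1
        if is_branch:
            out.append("nop")   # fills the slot fetched right after the branch
    return out


if __name__ == "__main__":
    print("\n".join(insert_nops(PROGRAM)))
```

On the example program this yields 6 instructions and 6 NOPs in the same positions as the listing above.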

NOP dynamic execution
• First loop, second loop, ... (execution trace not reproduced)
• Each loop iteration is composed of 5 instructions and 4 NOPs

Performance Indexes
• CPI = average number of clock cycles per instruction
• Clock cycles = no. of instructions + no. of stalls/NOPs + 4
  (4 is the number of extra cycles the last instruction needs to leave the pipeline after it has been fetched)
• CPI = [clock cycles] / [no. of instructions]
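The two indexes can be checked numerically. The following is a minimal sketch; the function names `total_cycles`, `cpi` and `mips` are illustrative and not part of the lab material. The sample values (7 instructions, 6 NOPs, 200 MHz clock) are the ones used in the NOP evaluation of this deck.

```python
def total_cycles(instructions, stalls, drain=4):
    """Clock cycles = instructions + stalls/NOPs + cycles to drain the pipe."""
    return instructions + stalls + drain


def cpi(instructions, stalls, drain=4):
    """Average clock cycles per effective instruction."""
    return total_cycles(instructions, stalls, drain) / instructions


def mips(clock_hz, cpi_value):
    """Millions of instructions per second at the given clock frequency."""
    return clock_hz / (cpi_value * 1e6)


if __name__ == "__main__":
    c = cpi(instructions=7, stalls=6)      # (7 + 6 + 4) / 7
    print(f"CPI  = {c:.2f}")
    print(f"MIPS = {mips(200e6, c):.2f}")
```

NOPs and hardware bubbles are both passed as `stalls`, since they are counted the same way in the formula.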

Performance evaluation of NOPs
• Actual CPI = (Instructions + NOPs + 4) / Instructions = (13 + 4) / 7 ≈ 2.43
• MIPS = frequency [= 200 MHz] / (CPI × 10^6) ≈ 82.35 MIPS

NOPs Manual Exercise
• Manually execute the loop for two iterations (finishing on the NOP after the 2nd bnez) and calculate CPI and MIPS
• 10 minutes

Results
• CPI = (21+4)/11 ≈ 2.27
• MIPS ≈ 88

Asymptotic loop performance
• Consider an intermediate iteration of the loop.
• Count the instructions + NOPs of that iteration and divide by the number of effective instructions -> asymptotic CPI
• 10 minutes

Performance evaluation of NOPs (asymptotic)
• Asymptotic loop CPI = [(Instructions + NOPs)*n + 4] / (Instructions*n) = (9n + 4) / 5n ≈ 1.8
• MIPS = frequency [= 200 MHz] / (CPI × 10^6) ≈ 111 MIPS
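To see why the "+4" term disappears, the loop CPI can be evaluated for a growing number of iterations n. This is a small sketch with a hypothetical function name (`loop_cpi`); it models the loop body only (5 instructions and 4 NOPs per iteration) and leaves out the instructions before the loop.

```python
def loop_cpi(n, instructions_per_iter=5, extra_per_iter=4, drain=4):
    """CPI of n loop iterations: ((instr + NOPs) * n + drain) / (instr * n)."""
    cycles = (instructions_per_iter + extra_per_iter) * n + drain
    return cycles / (instructions_per_iter * n)


if __name__ == "__main__":
    for n in (1, 2, 10, 100, 1000):
        print(f"n = {n:>4}: CPI = {loop_cpi(n):.4f}")
```

As n grows the drain cycles are amortized over more instructions and the value approaches 9/5 = 1.8.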

Bubbles
• Bubbles are NOPs inserted by the hardware.
• Branch instructions cause the hardware to generate a NOP
• Subsequent instructions are stalled
• Preceding instructions continue to execute.

Bubbles Example

Performance evaluation of bubbles
• Actual CPI = (Instructions + Bubbles/aborts + 4) / Instructions = (7 + 6 + 4) / 7 ≈ 2.43
• MIPS = frequency [= 200 MHz] / (CPI × 10^6) ≈ 82.35 MIPS

Verify on the simulator
• File -> load code ... -> pipe1.s -> select -> load -> yes
• Configuration -> disable forwarding
• Open the clock cycle diagram
• Execute -> single cycle (until the 1st load of the 2nd iteration has been executed)

Result

Manual Exercise
• Predict what happens in an intermediate iteration
• Calculate the asymptotic CPI and MIPS
• 10 minutes

Let’s simulate it
• Simulate the program until the 4th iteration

Solutions
• After the 1st iteration, the same behavior repeats:
  • 5 instructions
  • 1 NOP
  • 3 stalls
• Asymptotic values:
  • CPI = 1.8
  • MIPS = 111.11

Result Forwarding

Forwarding Example

Simulation of 2 iterations of the loop
• Configuration -> enable forwarding
• Open the clock cycle diagram
• File -> Reset DLX
• Execute -> single cycle
  • Up to the WB of the 2nd bnez

Simulation results

Manual Exercise
• Calculate CPI and MIPS for the 2 iterations.
• Calculate the asymptotic CPI and MIPS.
• 15 minutes

Results
• 2 iterations:
  • 11 instructions
  • 1 NOP
  • 2 stalls
  • 4 cycles to flush the pipe
• CPI = 18/11 ≈ 1.64
• MIPS ≈ 122

Asymptotic Results
• 5 instructions
• 1 NOP
• 1 stall
• CPI = [7n+4]/5n ≈ 1.4
• MIPS = 142.86

Speedup
• Speedup of A w.r.t. B = Exec. Time B / Exec. Time A

Calculate asymptotic speedup
• Speedup(NOPs, Bubbles)
• Speedup(Forwarding, NOPs)
• Speedup(Forwarding, Bubbles)
• 5 minutes

Calculate asymptotic speedup
• Speedup(NOPs, Bubbles) = 1
• Speedup(Forwarding, NOPs) = 1.29
• Speedup(Forwarding, Bubbles) = 1.29
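These values follow directly from the asymptotic CPIs found earlier in the deck (1.8 with NOPs, 1.8 with bubbles, 1.4 with forwarding). The sketch below assumes that all three variants execute the same effective instructions at the same clock frequency, so the ratio of execution times reduces to a ratio of CPIs; the function name `speedup` and the dictionary are illustrative only.

```python
# Asymptotic CPIs taken from the results in this deck.
ASYMPTOTIC_CPI = {"NOPs": 1.8, "Bubbles": 1.8, "Forwarding": 1.4}


def speedup(a, b, cpi=ASYMPTOTIC_CPI):
    """Speedup of A w.r.t. B = exec. time of B / exec. time of A.

    Assumption: same instruction count and clock, so time is proportional
    to CPI and the ratio becomes CPI_B / CPI_A.
    """
    return cpi[b] / cpi[a]


if __name__ == "__main__":
    for a, b in [("NOPs", "Bubbles"),
                 ("Forwarding", "NOPs"),
                 ("Forwarding", "Bubbles")]:
        print(f"Speedup({a},{b}) = {speedup(a, b):.2f}")
```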

Scheduling Optimizations
• Change the order of operations to minimize stalls/bubbles (forwarding enabled):

  Original order:
        lw   r3,0(r2)
        add  r3,r3,r7
        lw   r4,0(r2)
        add  r4,r4,r8
        add  r4,r4,r3
  CPI = (5+2+4)/5 = 2.2

  Rescheduled:
        lw   r3,0(r2)
        lw   r4,0(r2)
        add  r3,r3,r7
        add  r4,r4,r8
        add  r4,r4,r3
  CPI = (5+4)/5 = 1.8
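The stall counts behind the two CPIs can be reproduced with a very small model. This is a sketch under stated assumptions, not the simulator's algorithm: with forwarding enabled, the only remaining data stall is assumed to be the load-use case, where the instruction immediately after a `lw` reads the loaded register and costs one stall. The name `load_use_stalls` and the tuple encoding are hypothetical.

```python
# Each entry is (opcode, destination register, source registers).
ORIGINAL = [
    ("lw",  "r3", ["r2"]),
    ("add", "r3", ["r3", "r7"]),
    ("lw",  "r4", ["r2"]),
    ("add", "r4", ["r4", "r8"]),
    ("add", "r4", ["r4", "r3"]),
]
RESCHEDULED = [
    ("lw",  "r3", ["r2"]),
    ("lw",  "r4", ["r2"]),
    ("add", "r3", ["r3", "r7"]),
    ("add", "r4", ["r4", "r8"]),
    ("add", "r4", ["r4", "r3"]),
]


def load_use_stalls(program):
    """Count one stall for every instruction that directly follows a load
    and reads the register being loaded (forwarding assumed enabled)."""
    stalls = 0
    for (prev_op, prev_dest, _), (_, _, sources) in zip(program, program[1:]):
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1
    return stalls


def trace_cpi(program, drain=4):
    """CPI of the whole trace: (instructions + stalls + drain) / instructions."""
    n = len(program)
    return (n + load_use_stalls(program) + drain) / n


if __name__ == "__main__":
    print(load_use_stalls(ORIGINAL), trace_cpi(ORIGINAL))
    print(load_use_stalls(RESCHEDULED), trace_cpi(RESCHEDULED))
```

Moving the second load between the first load and its consumer removes both stalls without changing the result of the computation.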

1st Exercise

            addi r1,r0,1
            seq  r2,r1,r1
            add  r3,r3,r3
    Loop:   lw   r4,0(r3)
            sub  r3,r3,r4
            bnez r1,Loop

Manual Exercises
• Draw the conflicts between operations up to the end of the 3rd iteration of the loop (the last instruction is bnez). Forwarding is not available.
• Insert bubbles/aborts in the right places to resolve the hazards.
• Calculate the CPI and throughput of the trace.
• Calculate the asymptotic CPI of the loop.
• 20 minutes

Hazard Diagram

Bubbles/Stall insertion

CPIs
• Trace CPI = [24+4]/12 ≈ 2.33
• Asymptotic CPI = [6n+4]/3n ≈ 2

Manual Exercises
• Suppose now that forwarding is available.
• Draw the new pipeline execution diagram (up to the execution of the 3rd bnez) and indicate where the hardware must generate stalls.
• Calculate CPI and MIPS
• Calculate the asymptotic CPI and MIPS
• 20 minutes

Pipeline Diagram

Results
• CPI = 21/12 = 1.75
• Asymptotic CPI = [(4+1)n+4]/3n ≈ 5/3 ≈ 1.67

2nd Exercise

    loop:   lw   r2,dati_a(r4)
            lw   r3,dati_b(r5)
            add  r1,r2,r3
            sw   dati_a(r6),r1
            addi r4,r4,4
            addi r5,r5,4
            addi r6,r6,4
            j    loop
