EE457 Discussion, Fall 2006, Final Review. Brandon Franzke, Maryam Soltan, USC 2006, and Wei-Jen Hsu, USC 2005
Review Questions • Question 2, Fall 2004 (Multi-cycle CPU) • Question 3, Summer 2004 (Pipeline CPU) • Question 1, Fall 2004 (Based on lab 7 pipeline) • Carry Look Ahead Adder • Question 5, Summer 2004 (CLA)
Example - Multicycle CPU • Modifications to the 2nd Edition CU (state diagram) and DPU. • Mr. Trojan has already modified the DPU. • Notice: standalone registers (MDR, ALUout) are fast even though the RegFile is not. • Standalone register write = instantaneous • Register File write = ½ clock • So we want to skip states 4 (lw) and 7 (R-type) • Implement a “posted-write” (next page)
Posted-Write • We needed states 4 & 7 because Register Writing takes ½ clock • But we already have the data stored in MDR and ALUout for these states. • Can we delay writing until the beginning of the next instruction? (state 0) • What about control signals? • This is a “Posted-Write” • a write operation “posted” (scheduled) to occur later
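The saving from skipping states 4 and 7 can be sketched as a cycle count. This is a minimal sketch, not the exam solution: the state sequences in `STATE_SEQUENCE` are an assumption based on the usual textbook multicycle state diagram (0 = fetch, 1 = decode, 2 = address computation, 3 = memory read, 4 = lw write-back, 5 = sw memory write, 6 = execute, 7 = R-type completion), and the names `STATE_SEQUENCE`, `POSTED_STATES`, and `cycles` are hypothetical.

```python
# Assumed state sequences of the textbook multicycle control unit.
STATE_SEQUENCE = {
    "lw":    [0, 1, 2, 3, 4],
    "sw":    [0, 1, 2, 5],
    "rtype": [0, 1, 6, 7],
}

# States whose only job is the register-file write; with a posted-write
# that write overlaps state 0 of the next instruction.
POSTED_STATES = {4, 7}


def cycles(instr, posted_write=False):
    """Clocks spent on one instruction, with or without posted-write."""
    seq = STATE_SEQUENCE[instr]
    if posted_write:
        seq = [s for s in seq if s not in POSTED_STATES]
    return len(seq)


if __name__ == "__main__":
    for name in STATE_SEQUENCE:
        print(name, cycles(name), "->", cycles(name, posted_write=True))
```

Under these assumptions lw and R-type instructions each save one clock, while sw is unchanged because it never writes the register file.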
Posted-Write Implementation • We simply hold the control signals for 1 extra clock using flip-flops! • RegDst, RegWrite, MemToReg • Now the signals are available for 1 extra clock
Questions • The DPU modifications are complete; modify the CU to implement the register posted-write. • DPU and CU on the next pages • What justification did Mr. Trojan give his boss for using positive-edge-triggered flip-flops? • The design team says that positive-edge-triggered FFs cost extra. Can Mr. Trojan use negative-edge-triggered FFs instead?
When to load FF • Ms. Bruin suggested a RegWrite_FF_Write as shown below. Comment on the design and its necessity.
Posted Write for sw • Ms. Bruin was given another chance by the lead engineer. She tried to copy Mr. Trojan and suggested saving a clock in the sw instruction by skipping state 5 and adding the following 2 FFs. Advice?
Example – Pipeline CPU: A new 4-stage Pipeline • MEM before EX • No spurious stalls • New R-type instructions (e.g., addm, …) • Use a memory operand as a source operand • Writing to the RegFile takes very little time => no separate WB stage • Memory: one read port • beq in the EX stage • EAC not possible => revised lw and sw
New 4-stage Pipeline …. • addm • Investigate data dependencies and implement HDU and FU • Avoid any spurious stalls. (really dependent) • No internal forwarding in memory • Cannot write and read to/from memory simultaneously.
New 4-stage Pipeline …. (sw, lw) beq rs,rt, Target; sw rt, (rs); MEM[(rs)]<= (rt) • BEQ is executing after _____stage in ____ stage. • Where should we execute sw? • Where should we execute lw?
sw $1, ($2); lw $4, ($2); addm $8, ($2), $4; subm $16, ($8), $4;
Lab7, modified • Now implement SUB3 and SUB6 instructions (SUB3 in EX1 and EX2). • still have NOP • Optimize performance by performing SUB3 in EX1 or EX2 (i.e. minimize stalling) • The new stalling policy: • Never stall SUB3 and stall SUB6 iff it is dependent on the preceding instruction.
Logic Blocks • Postponing logic • assertions to perform SUB3 in EX1 or EX2 • prefer EX1 so data is available to forward. • HDU • Stall only dependent SUB6 instructions • FU1 and FU2 • forwarding from EX2→EX1 and WB→EX2
Stalling vs. Flushing • When do you flush and when do you stall? • How many instructions do you flush at a time? • How many instructions in the pipe do you stall? • Do flushing and stalling have anything in common? • Which of them produces bubbles? • Is the penalty due to flushing / stalling more severe in deeper pipelines (say 7-10 stages)? • How do delay slots affect the penalty?
1-bit CLA adder [Figure: 1-bit adder cell (+) with inputs A, B, Cin and outputs S, p, g] • p: propagator => p = A+B (OR: if either A or B is 1, Cin = 1 causes Cout = 1) • g: generator => g = AB (AND: if both A and B are 1, Cout = 1 for sure) • p and g are available 1 gate delay after A and B arrive. • Note that Cin is not needed to produce p and g. • S is available 2 gate delays after Cin arrives (2-level SOP).
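The 1-bit cell can be checked with a few lines of Python. This is a minimal sketch; the function names `pg` and `full_add` are hypothetical, and the carry-out expression Cout = g + p·Cin is the standard consequence of the p and g definitions above.

```python
def pg(a, b):
    """Return (p, g) for one bit position; Cin is not needed."""
    p = a | b  # propagate: p = A + B (OR)
    g = a & b  # generate:  g = AB (AND)
    return p, g


def full_add(a, b, cin):
    """Return (S, Cout) of a 1-bit adder expressed through p and g."""
    p, g = pg(a, b)
    s = a ^ b ^ cin        # sum bit
    cout = g | (p & cin)   # carry out: generated, or propagated from Cin
    return s, cout


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for cin in (0, 1):
                print(a, b, cin, full_add(a, b, cin))
```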
4-bit CLA [Figure: four 1-bit adder cells (+) with inputs A3..A0, B3..B0 and carries C3..C0, producing S3..S0, p3..p0, g3..g0, feeding the CLL (carry look-ahead logic)] • The CLL takes p, g from all 4 bits plus C0 as inputs and generates all C’s in 2 gate delays. • C1 = g0 + p0C0 • C2 = g1 + p1g0 + p1p0C0 • C3 = g2 + p2g1 + p2p1g0 + p2p1p0C0 • C4 = g3 + p3g2 + p3p2g1 + p3p2p1g0 + p3p2p1p0C0 • (Note: C4 looks complicated, but it is still a 2-level SOP expression.)
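The carry equations of the CLL can be transcribed directly and checked against ordinary addition. This is a minimal sketch; the names `cll4` and `cla4` and the list-of-bits representation (index 0 = least significant bit) are assumptions made for illustration.

```python
def cll4(p, g, c0):
    """Carry look-ahead logic: return [C1, C2, C3, C4] from p, g and C0."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def cla4(a, b, c0=0):
    """Add two 4-bit numbers; return (4-bit sum, carry out C4)."""
    abits = [(a >> i) & 1 for i in range(4)]
    bbits = [(b >> i) & 1 for i in range(4)]
    p = [x | y for x, y in zip(abits, bbits)]  # all p's in parallel
    g = [x & y for x, y in zip(abits, bbits)]  # all g's in parallel
    c = [c0] + cll4(p, g, c0)                  # carry into each bit, then C4
    s = sum((abits[i] ^ bbits[i] ^ c[i]) << i for i in range(4))
    return s, c[4]


if __name__ == "__main__":
    print(cla4(9, 8))  # 9 + 8 = 17, i.e. sum bits 0001 with carry out 1
```

Every carry depends only on the p, g values and C0, never on another carry, which is why all four can be produced at the same time.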
4-bit CLA [Figure: four 1-bit adder cells (+) with inputs A3..A0, B3..B0 and C0, producing S3..S0, p3..p0, g3..g0; the CLL (carry look-ahead logic) returns C3, C2, C1] • Given the A’s and B’s, all p, g’s are generated in 1 gate delay in parallel. • Given all p, g’s, all C’s are generated in 2 gate delays in parallel. • Given all C’s, all S’s are generated in 2 gate delays in parallel. • Key virtue of the CLA: the sequential carry rippling of the RCA is broken into parallel operations!
16-bit CLA • As before, all p, g’s are generated in parallel in 1 gate delay. • Without an input carry, the first-tier CLLs cannot generate C’s yet. Instead they generate P, G’s (group propagator and group generator) in 2 gate delays. • P => this group will propagate the input carry through the group: P = p0p1p2p3 • G => this group will generate an output carry: G = g3 + p3g2 + p3p2g1 + p3p2p1g0 • The second-tier CLL takes the P, G’s from the first-tier CLLs and C0 to generate the “seed C’s” for the first-tier CLLs in 2 gate delays. (Note that the logic for generating seed C’s from P, G’s is exactly the same as the logic for generating C’s from p, g’s!) • With the seed C’s as Cin, the first-tier CLLs use Cin and the p, g’s to generate the C’s in 2 gate delays. • With all C’s in place, the S’s are calculated in 2 gate delays. • Therefore, the total is 1+2+2+2+2 = 9 gate delays to finish the whole thing!
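The two-tier scheme can be sketched in Python to confirm that the same carry logic serves both tiers. This is a minimal sketch under stated assumptions: the names `carries4`, `group_pg`, `cla16`, and `GATE_DELAYS` are hypothetical, bits are stored with index 0 as the least significant bit, and the delay list simply restates the per-step figures from the slide.

```python
def carries4(p, g, c0):
    """Shared 2-level SOP carry logic: return [C1, C2, C3, C4]."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def group_pg(p, g):
    """Group propagator P and group generator G of one 4-bit group."""
    P = p[0] & p[1] & p[2] & p[3]
    G = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
         | (p[3] & p[2] & p[1] & g[0]))
    return P, G


def cla16(a, b, c0=0):
    """Add two 16-bit numbers with a two-tier CLA; return (sum, C16)."""
    abits = [(a >> i) & 1 for i in range(16)]
    bbits = [(b >> i) & 1 for i in range(16)]
    p = [x | y for x, y in zip(abits, bbits)]
    g = [x & y for x, y in zip(abits, bbits)]
    # First tier, first pass: group P, G for bits 3..0, 7..4, 11..8, 15..12.
    groups = [group_pg(p[4 * k:4 * k + 4], g[4 * k:4 * k + 4])
              for k in range(4)]
    P = [pg_pair[0] for pg_pair in groups]
    G = [pg_pair[1] for pg_pair in groups]
    # Second tier: seed carries [C0, C4, C8, C12, C16] from the group P, G.
    seeds = [c0] + carries4(P, G, c0)
    # First tier, second pass: carries inside each group from its seed.
    c = []
    for k in range(4):
        inner = carries4(p[4 * k:4 * k + 4], g[4 * k:4 * k + 4], seeds[k])
        c += [seeds[k]] + inner[:3]
    s = sum((abits[i] ^ bbits[i] ^ c[i]) << i for i in range(16))
    return s, seeds[4]


# Per-step gate delays as listed on the slide:
# p/g, group P/G, seed C's, in-group C's, sum bits.
GATE_DELAYS = [1, 2, 2, 2, 2]

if __name__ == "__main__":
    print(cla16(40000, 30000), sum(GATE_DELAYS))
```

Note that `carries4` is called once with the group P, G’s and four times with the bit-level p, g’s, mirroring the remark that the seed-carry logic is identical to the bit-carry logic.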
Delay: Example - 64-bit CLA • S39 takes longer to become valid. • List the primary and intermediate signals in producing S39 (back tracking: S39 = A39 ⊕ B39 ⊕ C39, S39 <- C39 <- C36 …) • Do we need P39_36* and G39_36*? • Primary inputs: • Gate delay to generate p38_0, g38_0 : • Gate delay for second level P*, G*: • Gate delay for second level P**, G**: • Gate delay C32: • p38, p37, p36, and g38, g37, g36 • C32 C36 C39
Other Topics • Usually there is a question on non-linear pipelines. • Please make sure that you are comfortable with cache and virtual memory organization.