EE457 Discussion Fall 2006

EE457 Discussion, Fall 2006: Final Review. Brandon Franzke, Maryam Soltan (USC, 2006) and Wei-Jen Hsu (USC, 2005)

Review Questions • Question 2, Fall 2004 (Multi-cycle CPU) • Question 3, Summer 2004 (Pipeline CPU) • Question 1, Fall 2004 (Based on lab 7 pipeline) • Carry Look Ahead Adder • Question 5, Summer 2004 (CLA)

Example - Multicycle CPU • Modifications to the 2nd Edition CU (state diagram) and DPU. • Mr Trojan already modified DPU • Notice: Standalone registers (MDR, ALUout) are fast even though RegFile is not. • Standalone = instantaneous • Register File = ½ clock • So we want to skip states 4 (lw) & 7 (r-type) • implement “posted-write” (next page)

Posted-Write • We needed states 4 & 7 because Register Writing takes ½ clock • But we already have the data stored in MDR and ALUout for these states. • Can we delay writing until the beginning of the next instruction? (state 0) • What about control signals? • This is a “Posted-Write” • a write operation “posted” (scheduled) to occur later

Posted-Write Implementation • Well, we just save the control signals for 1 extra clock with Flip-Flops! • RegDst, RegWrite, MemToReg • Now the signals are available for 1 clock extra
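The saving from skipping states 4 and 7 can be tallied with a minimal sketch. The state numbering below is an assumption based on the 2nd-edition multicycle state diagram the slides refer to (0 fetch, 1 decode, 2 address computation, 3 memory read, 4 lw write-back, 5 memory write, 6 execute, 7 R-type completion); the names `STATES`, `POSTED` and `cycles` are hypothetical. It only counts clocks and does not model the flip-flops or the control unit.

```python
# Assumed state sequences per instruction class (2nd-edition multicycle FSM).
STATES = {
    "lw":     [0, 1, 2, 3, 4],
    "sw":     [0, 1, 2, 5],
    "r-type": [0, 1, 6, 7],
}

# States that only perform the register-file write; with posted-write the
# write is deferred to state 0 of the next instruction, so they are skipped.
POSTED = {4, 7}

def cycles(instr, posted_write=False):
    """Number of clocks an instruction spends in the control FSM."""
    seq = STATES[instr]
    if posted_write:
        seq = [s for s in seq if s not in POSTED]
    return len(seq)

for name in STATES:
    print(name, cycles(name), cycles(name, posted_write=True))
```

Under these assumptions lw and R-type each save one clock, while sw is unchanged because it never writes the register file.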

Questions • The DPU modifications are complete; modify the CU to implement the register posted-write. • The DPU and CU are on the next pages. • What justification did Mr. Trojan give his boss for using positive edge-triggered flip-flops? • The design team says that positive edge-triggered FFs cost extra. Can Mr. Trojan use negative edge-triggered FFs instead?

When to load FF • Ms. Bruin suggested a RegWrite_FF_Write as shown below. Comment on the design and its necessity.

Posted Write for sw • Ms. Bruin was given another chance by the lead engineer. She tried to copy Mr. Trojan and suggested saving a clock in the sw instruction by skipping state 5 and adding the following 2 FF. Advice?

Example – Pipeline CPU A new 4-stage Pipeline • MEM before EX • No spurious stalls • New R-Type instr. (ex.:addm,…) • Use memory operand as a source operand • Writing to RegFile takes very little time => No separate WB stage • Memory : One read port • Beq in Ex stage • EAC not possible => Revised lw and sw

New 4-stage Pipeline …. • addm • Investigate data dependencies and implement HDU and FU • Avoid any spurious stalls. (really dependent) • No internal forwarding in memory • Cannot write and read to/from memory simultaneously.

New 4-stage Pipeline …. (sw, lw) beq rs,rt, Target; sw rt, (rs); MEM[(rs)]<= (rt) • BEQ is executing after _____stage in ____ stage. • Where should we execute sw? • Where should we execute lw?

New 4-stage Pipeline …. (Hazard and stalling)

New 4-stage Pipeline …. (Hazard and stalling…)

sw $1, ($2); lw $4, ($2); addm $8, ($2), $4; subm $16, ($8), $4;

Lab7, modified • Now implement SUB3 and SUB6 instructions (SUB3 in EX1 and EX2). • still have NOP • Optimize performance by performing SUB3 in EX1 or EX2 (i.e. minimize stalling) • The new stalling policy: • Never stall SUB3 and stall SUB6 iff it is dependent on the preceding instruction.

Logic Blocks • Postponing logic • assertions to perform SUB3 in EX1 or EX2 • prefer EX1 so data is available to forward. • HDU • Stall only dependent SUB6 instructions • FU1 and FU2 • forwarding from EX2→EX1 and WB→EX2

Stall vs. Flushing • When do you flush and when do you stall? • How many instructions do you flush at a time? • How many instructions in the pipe do you stall? • Do flushing & stalling have anything in common? • Which of them result in producing bubbles? • Is the penalty due to flushing / stalling more severe in deeper pipelines? (say 7-10 stages) • How do delay slots affect the penalty?

1-bit CLA adder (figure: one adder cell with inputs A, B, Cin and outputs S, p, g) • p: propagator => p = A+B (if either A or B is 1, Cin = 1 causes Cout = 1) • g: generator => g = AB (if both A and B are 1, Cout = 1 for sure) • p and g are generated in 1 gate delay after A and B are available. • Note that Cin is not needed to produce p and g. • S is generated in 2 gate delays after Cin arrives (SOP).
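The definitions of p and g can be checked against an ordinary full adder with a short sketch. The function names are hypothetical; the check relies on the identity Cout = g + p·Cin, which holds for p = A+B (OR) as defined on the slide.

```python
from itertools import product

def pg(a, b):
    """Return (p, g) for one bit position; Cin is not needed."""
    return a | b, a & b

def full_adder(a, b, cin):
    """Reference 1-bit full adder: returns (sum, cout)."""
    total = a + b + cin
    return total & 1, total >> 1

def cla_bit(a, b, cin):
    """1-bit CLA cell: the carry is expressed through p and g."""
    p, g = pg(a, b)
    s = a ^ b ^ cin          # sum bit
    cout = g | (p & cin)     # Cout = g + p*Cin
    return s, cout

# Compare the two cells over the full truth table (8 input combinations).
mismatches = [(a, b, c) for a, b, c in product((0, 1), repeat=3)
              if cla_bit(a, b, c) != full_adder(a, b, c)]
print(mismatches)
```

An empty list means the p/g formulation agrees with the full adder on every input.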

4-bit CLA (figure: four 1-bit adder cells with inputs A3..A0 and B3..B0, carries C3..C0, sums S3..S0, sending p3..p0 and g3..g0 to the CLL, the carry look-ahead logic) • The CLL takes p, g from all 4 bits and C0 as input to generate all C's in 2 gate delays. • C1 = g0 + p0C0 • C2 = g1 + p1g0 + p1p0C0 • C3 = g2 + p2g1 + p2p1g0 + p2p1p0C0 • C4 = g3 + p3g2 + p3p2g1 + p3p2p1g0 + p3p2p1p0C0 • (Note: C4 looks complicated, but it is still a 2-level SOP expression.)
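The carry equations above can be exercised directly. The following is a minimal sketch, with hypothetical function names `cll4` and `cla4`, that evaluates each carry as a flat sum of products and compares the result with integer addition for every 4-bit input pair.

```python
def cll4(p, g, c0):
    """Carry look-ahead logic; p and g are lists indexed by bit (0 = LSB).
    Returns [C0, C1, C2, C3, C4], each written as a 2-level sum of products."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c0, c1, c2, c3, c4]

def cla4(a, b, c0=0):
    """Add two 4-bit integers; returns (4-bit sum, carry out C4)."""
    abits = [(a >> i) & 1 for i in range(4)]
    bbits = [(b >> i) & 1 for i in range(4)]
    p = [x | y for x, y in zip(abits, bbits)]   # p = A + B
    g = [x & y for x, y in zip(abits, bbits)]   # g = AB
    c = cll4(p, g, c0)
    s = sum((abits[i] ^ bbits[i] ^ c[i]) << i for i in range(4))
    return s, c[4]

# Exhaustive check against ordinary addition (16 * 16 * 2 cases).
ok = all(cla4(a, b, c) == ((a + b + c) & 0xF, (a + b + c) >> 4)
         for a in range(16) for b in range(16) for c in (0, 1))
print(ok)
```

No carry depends on another computed carry, only on p, g and C0, which is what lets the hardware produce all of them in parallel.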

4-bit CLA (figure: same four adder cells and CLL as on the previous slide) • Given the A's and B's, all p, g's are generated in 1 gate delay in parallel. • Given all p, g's, all C's are generated in 2 gate delays in parallel. • Given all C's, all S's are generated in 2 gate delays in parallel. • Key virtue of CLA: sequential operation in RCA is broken into parallel operation!!

16-bit CLA • Same as before, the p, g's are generated in parallel in 1 gate delay. • Now, without an input carry, the first-tier CLLs cannot generate C's. Instead they generate P, G's (group propagator and group generator) in 2 gate delays. • P => this group will propagate the input carry to the next group: P = p0p1p2p3 • G => this group will generate an output carry: G = g3 + p3g2 + p3p2g1 + p3p2p1g0 • The second-tier CLL takes the P, G's from the first-tier CLLs and C0 to generate "seed C's" for the first-tier CLLs in 2 gate delays. (Note that the logic for generating "seed C's" from P, G's is exactly the same as the logic for generating C's from p, g's!) • With the seed C's as input, the first-tier CLLs use Cin and the p, g's to generate C's in 2 gate delays. • With all C's in place, the S's are calculated in 2 gate delays. • Therefore, a total of 1+2+2+2+2 = 9 gate delays to finish the whole thing!!
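The two-tier scheme can be sketched as follows. All names are hypothetical. For brevity `lookahead` computes the carries with a short loop; this is logically equivalent to the flattened 2-level SOP that the hardware would use, so the sketch checks the logic, not the timing. The per-step gate delays from the slide are recorded separately in `DELAYS`.

```python
def lookahead(p, g, cin):
    """Carries into positions 0..3 of a 4-input look-ahead unit.
    The same logic serves bit-level p, g and group-level P, G."""
    c = [cin]
    for i in range(3):
        c.append(g[i] | (p[i] & c[i]))
    return c

def group_pg(p, g):
    """Group propagate and generate for one 4-bit group (index 0 = LSB)."""
    P = p[0] & p[1] & p[2] & p[3]
    G = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
         | (p[3] & p[2] & p[1] & g[0]))
    return P, G

def cla16(a, b, c0=0):
    """Add two 16-bit integers; returns (16-bit sum, carry out C16)."""
    abits = [(a >> i) & 1 for i in range(16)]
    bbits = [(b >> i) & 1 for i in range(16)]
    p = [x | y for x, y in zip(abits, bbits)]
    g = [x & y for x, y in zip(abits, bbits)]
    groups = [group_pg(p[k:k + 4], g[k:k + 4]) for k in range(0, 16, 4)]
    P = [grp[0] for grp in groups]
    G = [grp[1] for grp in groups]
    seeds = lookahead(P, G, c0)          # carries into bits 0, 4, 8, 12
    carries = []
    for k in range(4):                   # first-tier CLLs, now with a seed C
        carries += lookahead(p[4 * k:4 * k + 4], g[4 * k:4 * k + 4], seeds[k])
    s = sum((abits[i] ^ bbits[i] ^ carries[i]) << i for i in range(16))
    c16 = G[3] | (P[3] & seeds[3])       # carry out of the top group
    return s, c16

# Gate delays per step, as listed on the slide.
DELAYS = {"p,g": 1, "P,G": 2, "seed C's": 2, "C's": 2, "S's": 2}
TOTAL_DELAY = sum(DELAYS.values())

print(cla16(40000, 30000), TOTAL_DELAY)
```

Reusing `lookahead` for both tiers mirrors the slide's remark that the seed-carry logic is the same as the bit-level carry logic.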

Delay: Example - 64-bit CLA • S39 takes longer to become valid. • List of primary and intermediate signals in producing S39: (Back tracking: S39 = A39 ⊕ B39 ⊕ C39, S39 <- C39 <- C36 …) • Do we need P39_36* and G39_36*? • Primary inputs: • Gate delay to generate p38_0, g38_0: • Gate delay for second level P*, G*: • Gate delay for second level P**, G**: • Gate delay C32: • p38, p37, p36, and g38, g37, g36 • C32 → C36 → C39

Other Topics • Usually there is a question on non-linear pipeline. • Please make sure that you are comfortable with cache and virtual memory organization.
