EE457 Discussion Fall 2006

EE457 Discussion Fall 2006. Final Review. Brandon Franzke, Maryam Soltan, USC2006 and Wei-Jen Hsu, USC 2005. Review Questions. Question 2, Fall 2004 (Multi-cycle CPU) Question 3, Summer 2004 (Pipeline CPU) Question 1, Fall 2004 (Based on lab 7 pipeline) Carry Look Ahead Adder


EE457 Discussion Fall 2006

Final Review

Brandon Franzke, Maryam Soltan, USC 2006

and Wei-Jen Hsu, USC 2005

Review Questions
  • Question 2, Fall 2004 (Multi-cycle CPU)
  • Question 3, Summer 2004 (Pipeline CPU)
  • Question 1, Fall 2004 (Based on lab 7 pipeline)
  • Carry Look Ahead Adder
  • Question 5, Summer 2004 (CLA)
Example - Multicycle CPU
  • Modifications to the 2nd Edition CU (state diagram) and DPU.
    • Mr Trojan already modified DPU
  • Notice: Standalone registers (MDR, ALUout) are fast even though RegFile is not.
    • Standalone = instantaneous
    • Register File = ½ clock
  • So we want to skip states 4 (lw) & 7 (r-type)
    • implement “posted-write” (next page)
Posted-Write
  • We needed states 4 & 7 because Register Writing takes ½ clock
    • But we already have the data stored in MDR and ALUout for these states.
    • Can we delay writing until the beginning of the next instruction? (state 0)
    • What about control signals?
  • This is a “Posted-Write”
    • a write operation “posted” (scheduled) to occur later
Posted-Write Implementation
  • Well, we just save the control signals for 1 extra clock with Flip-Flops!
    • RegDst, RegWrite, MemToReg
  • Now the signals are available for 1 clock extra
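
The posted-write idea can be sketched behaviorally as follows. This is only an illustrative model, not the exam's DPU: the class and method names (`PostedWriteRegs`, `Datapath`, `post_write`, `state0`) are hypothetical, and it assumes that MDR, ALUout and the rt/rd fields still hold the old instruction's values during state 0 of the next instruction.

```python
from dataclasses import dataclass, field


@dataclass
class PostedWriteRegs:
    """Flip-flops holding the write-back control signals for one extra clock."""
    reg_write: bool = False
    reg_dst: bool = False     # True: destination is rd (R-type); False: rt (lw)
    mem_to_reg: bool = False  # True: data comes from MDR; False: from ALUout


@dataclass
class Datapath:
    """Hypothetical, heavily simplified datapath state."""
    regfile: list = field(default_factory=lambda: [0] * 32)
    mdr: int = 0
    alu_out: int = 0
    rt: int = 0
    rd: int = 0
    ff: PostedWriteRegs = field(default_factory=PostedWriteRegs)

    def post_write(self, reg_dst: bool, mem_to_reg: bool) -> None:
        """Last state of lw / R-type: latch the signals instead of writing now."""
        self.ff = PostedWriteRegs(True, reg_dst, mem_to_reg)

    def state0(self) -> None:
        """State 0 of the next instruction: complete the posted write, if any."""
        if self.ff.reg_write:
            dest = self.rd if self.ff.reg_dst else self.rt
            self.regfile[dest] = self.mdr if self.ff.mem_to_reg else self.alu_out
            self.ff = PostedWriteRegs()  # clear so the write happens only once


# Usage: an R-type result sitting in ALUout is written during the next state 0.
dp = Datapath(rd=3, alu_out=42)
dp.post_write(reg_dst=True, mem_to_reg=False)
before = dp.regfile[3]
dp.state0()
after = dp.regfile[3]
```

The register file is untouched when the write is posted and is updated only when `state0` runs, which is the clock that states 4 and 7 used to occupy.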
Questions
  • DPU modifications are complete, modify the CU to implement register posted-write.
    • DPU and CU next pages
  • What justification did Mr. T tell his boss for using Positive Edge-triggered flip-flops?
    • The design team says that positive-edged FF’s cost extra. Can Mr. Trojan use negative-edged FF instead?
When to load FF
  • Ms. Bruin suggested a RegWrite_FF_Write as shown below. Comment on the design and its necessity.
Posted Write for sw
  • Ms. Bruin was given another chance by the lead engineer. She tried to copy Mr. Trojan and suggested saving a clock in the sw instruction by skipping state 5 and adding the following 2 FF. Advice?
Example – Pipeline CPU: A New 4-stage Pipeline
  • MEM before EX
  • No spurious stalls
  • New R-Type instr. (ex.:addm,…)
    • Use memory operand as a source operand
  • Writing to RegFile takes very little time
    • => No separate WB stage
  • Memory: One read port
  • Beq in Ex stage
  • EAC (effective address calculation) not possible
    • => Revised lw and sw

New 4-stage Pipeline ….
  • addm
  • Investigate data dependencies and implement HDU and FU
  • Avoid any spurious stalls. (really dependent)
  • No internal forwarding in memory
    • Cannot write and read to/from memory simultaneously.
New 4-stage Pipeline …. (sw, lw)

beq rs,rt, Target;

sw rt, (rs); MEM[(rs)]<= (rt)

  • BEQ is executing after _____stage in ____ stage.
  • Where should we execute sw?
  • Where should we execute lw?

sw $1, ($2);

lw $4, ($2);

addm $8, ($2), $4;

subm $16, ($8), $4;
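
As a starting point for investigating the data dependencies in this sequence, the register read-after-write (RAW) pairs can be listed mechanically. This is a rough sketch under stated assumptions: the first operand of lw, addm and subm is taken to be the destination, the parenthesized register is taken to be a memory address, and the names `program` and `raw_dependencies` are hypothetical. It does not model which stage needs each operand, nor the memory dependence between sw and lw on MEM[($2)].

```python
# Each entry: (text, registers read, register written or None).
# Operand roles are assumptions, as stated in the lead-in.
program = [
    ("sw $1, ($2)",        {1, 2}, None),
    ("lw $4, ($2)",        {2},    4),
    ("addm $8, ($2), $4",  {2, 4}, 8),
    ("subm $16, ($8), $4", {4, 8}, 16),
]


def raw_dependencies(prog):
    """Return (producer_index, consumer_index, register) for each RAW pair.

    For every register a consumer reads, only the most recent earlier
    instruction that writes that register is reported.
    """
    deps = []
    for j, (_, reads, _) in enumerate(prog):
        for reg in sorted(reads):
            for i in range(j - 1, -1, -1):
                if prog[i][2] == reg:
                    deps.append((i, j, reg))
                    break
    return deps


deps = raw_dependencies(program)
for producer, consumer, reg in deps:
    print(f"{program[consumer][0]!r} reads ${reg} written by {program[producer][0]!r}")
```

Whether each pair needs a stall or only forwarding depends on the stage in which the consumer uses the register (as an address in MEM or as an operand in EX), which is the part the HDU and FU design has to settle.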

Lab7, modified
  • Now implement SUB3 and SUB6 instructions (SUB3 in EX1 and EX2).
    • still have NOP
  • Optimize performance by performing SUB3 in EX1 or EX2 (i.e. minimize stalling)
  • The new stalling policy:
    • Never stall SUB3 and stall SUB6 iff it is dependent on the preceding instruction.
Logic Blocks
  • Postponing logic
    • assertions to perform SUB3 in EX1 or EX2
    • prefer EX1 so data is available to forward.
  • HDU
    • Stall only dependent SUB6 instructions
  • FU1 and FU2
    • forwarding from EX2→EX1 and WB→EX2
Stall vs. Flushing
  • When do you flush and when do you stall?
    • How many instructions do you flush at a time?
    • How many instructions in the pipe do you stall?
    • Do flushing & stalling have anything in common?
    • Which of them result in producing bubbles?
    • Is the penalty due to flushing / stalling more severe in deeper pipelines? (say 7-10 stages)
    • How do delay slots affect the penalty?
1-bit CLA adder

[Figure: 1-bit adder cell (+) with inputs A, B, Cin and outputs S, p, g]

  • p: propagator => p = A+B (If either A or B is 1, Cin = 1 causes Cout = 1)
  • g: generator => g = AB (If both A and B are 1, Cout = 1 for sure)
  • p, g are generated in 1 gate delay after we have A, B.
  • Note that Cin is not needed to produce p and g.
  • S is generated in 2 gate delays after Cin arrives (SOP).
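
The 1-bit cell described above can be written out as a small sketch. The function name `cla_bit` is hypothetical; p uses OR and g uses AND as defined on the slide, and Cout is included only to check the cell against ordinary addition.

```python
def cla_bit(a: int, b: int, cin: int):
    """One CLA adder cell; all arguments are single bits (0 or 1)."""
    p = a | b              # propagate: a carry-in of 1 is passed on
    g = a & b              # generate: carry-out is 1 regardless of carry-in
    s = a ^ b ^ cin        # sum bit, the only output that needs Cin
    cout = g | (p & cin)   # carry-out written in terms of p and g
    return s, p, g, cout


# Usage: check the cell against a + b + cin for all 8 input combinations.
results = {
    (a, b, cin): cla_bit(a, b, cin)
    for a in (0, 1) for b in (0, 1) for cin in (0, 1)
}
```

Note that `p` and `g` are computed without reading `cin`, which is why they are ready after 1 gate delay.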
4-bit CLA

[Figure: four 1-bit adder cells (+) with inputs A3..A0 and B3..B0 and carries C3..C0, producing S3..S0; their p3..p0 and g3..g0 feed the CLL (carry look-ahead logic)]

  • The CLL takes the p, g's from all 4 bits and C0 as inputs and generates all C's in 2 gate delays.
  • C1 = g0 + p0C0
  • C2 = g1 + p1g0 + p1p0C0
  • C3 = g2 + p2g1 + p2p1g0 + p2p1p0C0
  • C4 = g3 + p3g2 + p3p2g1 + p3p2p1g0 + p3p2p1p0C0
  • (Note: C4 looks complicated, but it is still a 2-level SOP expression, so it also takes 2 gate delays.)
4-bit CLA

[Figure: CLL with inputs p3..p0 and g3..g0, carries C3, C2, C1 and sums S3..S0]


  • Given the A, B's, all p, g's are generated in 1 gate delay in parallel.
  • Given all p, g's, all C's are generated in 2 gate delays in parallel.
  • Given all C's, all S's are generated in 2 gate delays in parallel.
4-bit CLA

[Figure: four 1-bit adder cells (+) with inputs A3..A0, B3..B0 and C0, connected to the CLL (carry look-ahead logic)]


  • Key virtue of CLA: the sequential carry propagation of a ripple-carry adder (RCA) is replaced by parallel carry computation!
16-bit CLA
  • As before, all p, g's are generated in parallel in 1 gate delay.
  • Without an input carry, the first-tier CLLs cannot yet generate C's. Instead they generate P, G's (group propagator and group generator) in 2 gate delays.
    • P => this group will propagate the input carry: P = p0p1p2p3
    • G => this group will generate an output carry: G = g3 + p3g2 + p3p2g1 + p3p2p1g0
  • The second-tier CLL takes the P, G's from the first-tier CLLs and C0 to generate “seed C's” for the first-tier CLLs in 2 gate delays. (Note that the logic for generating seed C's from P, G's is exactly the same as the logic for generating C's from p, g's.)
  • With the seed C's as Cin, the first-tier CLLs use the p, g's to generate the remaining C's in 2 gate delays.
  • With all C's in place, the S's are calculated in 2 gate delays.
Therefore, in total 1 + 2 + 2 + 2 + 2 = 9 gate delays to finish the whole addition!
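
The two-tier scheme can be sketched as below, with each step annotated with the gate delay it contributes to the total of 9. The names `lookahead`, `group_pg` and `add16` are hypothetical, and the final carry-out C16 = G3 + P3·C12 is added here only so the result can be checked against ordinary addition.

```python
def lookahead(p, g, cin):
    """Carry equations shared by both tiers; returns the carries into positions 1..3."""
    c1 = g[0] | (p[0] & cin)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & cin)
    return [c1, c2, c3]


def group_pg(p, g):
    """Group propagate and generate for one 4-bit group (no input carry needed)."""
    P = p[0] & p[1] & p[2] & p[3]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return P, G


def add16(a: int, b: int, c0: int = 0):
    """16-bit two-tier CLA addition; returns (16-bit sum, carry-out C16)."""
    abits = [(a >> i) & 1 for i in range(16)]
    bbits = [(b >> i) & 1 for i in range(16)]
    p = [x | y for x, y in zip(abits, bbits)]                 # 1 gate delay
    g = [x & y for x, y in zip(abits, bbits)]
    groups = [group_pg(p[4 * k:4 * k + 4], g[4 * k:4 * k + 4]) for k in range(4)]
    P = [pg[0] for pg in groups]                              # +2 gate delays
    G = [pg[1] for pg in groups]
    seeds = [c0] + lookahead(P, G, c0)                        # +2: C0, C4, C8, C12
    carries = []
    for k in range(4):                                        # +2: remaining C's
        lo = 4 * k
        carries += [seeds[k]] + lookahead(p[lo:lo + 4], g[lo:lo + 4], seeds[k])
    total = sum((abits[i] ^ bbits[i] ^ carries[i]) << i for i in range(16))  # +2
    c16 = G[3] | (P[3] & seeds[3])
    return total, c16


# Usage: 0xFFFF + 1 wraps to 0 with a carry-out of 1.
wrapped = add16(0xFFFF, 0x0001)
```

The same `lookahead` function serves both tiers, which mirrors the note that seed carries come from P, G's by exactly the logic that produces C's from p, g's.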

Delay:

Example - 64bit-CLA
  • S39 takes longer to become valid.
  • List the primary and intermediate signals needed to produce S39. (Back-tracking: S39 = A39 ⊕ B39 ⊕ C39, S39 <- C39 <- C36 …)
    • Do we need P39_36* and G39_36*?
    • Primary inputs:
    • Gate delay to generate p38_0, g38_0 :
    • Gate delay for second level P*, G*:
    • Gate delay for second level P**, G**:
    • Gate delay C32:
    • p38, p37, p36, and g38, g37, g36
    • C32 → C36 → C39
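
One way to tally the delay along the path to S39 is sketched below. It is not an official solution: it assumes a three-tier 64-bit CLA built from 4-bit CLLs, with the same per-step delays as the 16-bit case (1 for p, g and 2 for every CLL step and for the sum). The step labels and the names `path` and `timeline` are hypothetical.

```python
# (step, gate delays this step adds) along the assumed path C32 -> C36 -> C39 -> S39
path = [
    ("p, g for all bits",                 1),
    ("P*, G* for each 4-bit group",       2),
    ("P**, G** for each 16-bit group",    2),
    ("C32 from the top-tier CLL",         2),
    ("C36 from the CLL covering 32..47",  2),
    ("C39 from the CLL covering 36..39",  2),
    ("S39 from A39, B39, C39",            2),
]

elapsed = 0
timeline = []
for step, delay in path:
    elapsed += delay
    timeline.append((step, elapsed))

for step, t in timeline:
    print(f"{t:2d} gate delays: {step}")
```

Under these assumptions the tally grows by one extra CLL pass down and one extra pass up compared with the 16-bit adder, because the carry has to descend through one more tier before it reaches bit 39.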
Other Topics
  • Usually there is a question on non-linear pipeline.
  • Please make sure that you are comfortable with cache and virtual memory organization.