
CS 152, Spring 2010 Section 5





CS 152, Spring 2010 Section 5

Andrew Waterman

University of California, Berkeley



Mystery Die

  • NVIDIA GTX280

  • 240 cores * 1.296 GHz * 3 flops/cycle

    • = 933 GFLOPS (Nehalem: 8 * 3 GHz * 8 = 192 GFLOPS)
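The peak-throughput arithmetic on this slide can be checked with a short sketch. Reading the parenthesized CPU figure as units * clock (GHz) * flops/cycle is an assumption made here to mirror the GPU line; the function name `peak_gflops` is hypothetical.

```python
def peak_gflops(units: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Peak throughput in GFLOPS = units * clock (GHz) * flops per cycle per unit."""
    return units * clock_ghz * flops_per_cycle


# GPU line from the slide: 240 cores * 1.296 GHz * 3 flops/cycle
gpu = peak_gflops(240, 1.296, 3)

# CPU figure in parentheses on the slide: 8 * 3 GHz * 8
# (assumed here to mean units * clock * flops/cycle)
cpu = peak_gflops(8, 3.0, 8)

print(f"GPU peak: {gpu:.0f} GFLOPS")
print(f"CPU peak: {cpu:.0f} GFLOPS")
print(f"Ratio:    {gpu / cpu:.2f}x")
```

These are peak numbers only; they say nothing about sustained throughput on real code.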


Agenda

  • Quiz 1 Post-Mortem

  • VM & Caches

  • Return PS1

    • Graded only for completeness


Quiz 1, Q1

  • Microcode for JALM offset(rs)

  • Corner case didn’t hurt performance

  • Straightforward sol’n: (27/29 points)

    • A <- R[rs]

    • B <- sExt16(imm)

    • MA <- A+B

    • A <- PC // PC = PC+4 already happened

    • R[31] <- A

    • PC <- M[MA]


Quiz 1, Q1

  • Cleverer sol’n:

    • B <- R[rs] // use commutative property

    • R[31] <- A+4 // A still has old PC

    • A <- sExt16(imm)

    • MA <- A+B

    • PC <- M[MA]

  • AFAIK, this is the only 5-line solution
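To see that the two microcode sequences for JALM leave the same architectural state, here is a minimal simulation sketch. It assumes, as the slides state, that after fetch A still holds the old PC and PC already holds PC+4. The dictionary-based machine state, the helper names, and the test values are hypothetical, and 32-bit wraparound is ignored.

```python
def sext16(imm: int) -> int:
    """Sign-extend a 16-bit immediate."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def after_fetch(old_pc, regs, mem):
    """State when JALM's microcode starts: A = old PC, PC = old PC + 4."""
    return {"A": old_pc, "B": 0, "MA": 0, "PC": old_pc + 4,
            "R": dict(regs), "M": dict(mem)}


def jalm_straightforward(s, rs, imm):
    s["A"] = s["R"][rs]
    s["B"] = sext16(imm)
    s["MA"] = s["A"] + s["B"]
    s["A"] = s["PC"]            # PC = PC+4 already happened
    s["R"][31] = s["A"]
    s["PC"] = s["M"][s["MA"]]
    return 6                    # number of microcode lines


def jalm_clever(s, rs, imm):
    s["B"] = s["R"][rs]         # addition commutes, so rs can go in B
    s["R"][31] = s["A"] + 4     # A still has the old PC
    s["A"] = sext16(imm)
    s["MA"] = s["A"] + s["B"]
    s["PC"] = s["M"][s["MA"]]
    return 5


def run(microcode, rs, imm, old_pc, regs, mem):
    s = after_fetch(old_pc, regs, mem)
    lines = microcode(s, rs, imm)
    return s["PC"], s["R"][31], lines


# Hypothetical test data; imm = 0xFFFC encodes an offset of -4.
REGS = {5: 0x1000, 31: 0x2000}
MEM = {0x0FFC: 0x4000, 0x1FFC: 0x5000}

for rs in (5, 31):              # rs = 31 checks that R[rs] is read before R[31] is written
    a = run(jalm_straightforward, rs, 0xFFFC, 0x100, REGS, MEM)
    b = run(jalm_clever, rs, 0xFFFC, 0x100, REGS, MEM)
    print(f"rs={rs}: straightforward={a}, clever={b}")
```

Both versions read R[rs] before writing R[31], so they agree even when rs is 31.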


Quiz 1, Q1

  • Common problems:

    • Forgetting that A already had the old PC, so took an extra cycle

    • Forgetting that PC was already incremented, so did R[31] <- oldPC+8

    • Being overly conservative with don’t-cares

      • Can destroy IR as soon as you’ve read rs, imm

      • Can set load-enable to DC the cycle the value is used

  • Almost all points deducted were nit-picks


Quiz 1, Q2

  • 6-stage pipeline; new writeback at end of EX

  • When ALUop has proceeded to M1, the writeback value is available to insn in ID

    • Second write port doesn’t help the immediately-subsequent insn—just the one after it

    • Example insn sequence that benefits from it:

      • add r1, r2, r3

      • sub r11, r12, r13

      • add r21, r1, r23
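The timing argument above can be sketched with a small cycle-counting model. It assumes a stall-free pipeline with the stage names IF, ID, EX, M1, M2, WB (only M1, M2, EX, ID and WB appear on the slides; IF is assumed), and that a value written at the end of a stage becomes readable from the register file in the following cycle. All function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "M1", "M2", "WB"]


def stage_cycle(index: int, stage: str) -> int:
    """Cycle in which instruction number `index` occupies `stage` (no stalls)."""
    return index + STAGES.index(stage)


def reads_from_regfile(producer: int, consumer: int, write_stage: str = "EX") -> bool:
    """True if the consumer's ID-stage read sees the producer's register-file write.

    Assumes a value written at the end of `write_stage` is readable
    starting the following cycle.
    """
    return stage_cycle(consumer, "ID") > stage_cycle(producer, write_stage)


program = [
    "add r1, r2, r3",     # producer of r1
    "sub r11, r12, r13",  # independent filler
    "add r21, r1, r23",   # consumer of r1, two instructions later
]

for consumer in (1, 2):
    ok = reads_from_regfile(producer=0, consumer=consumer)
    print(f"insn {consumer} ({program[consumer]}): reads new value from RF = {ok}")
```

Under these assumptions the instruction immediately after the producer is still in ID while the producer is in EX, so only the instruction two slots later can read the early-written value from the register file.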


Quiz 1, Q2

  • 6-stage pipeline; new writeback at end of EX

  • When ALUop has proceeded to M1, the writeback value is available to insn in ID

    • Can remove bypass from end of M1 to end of ID

      • Equivalently, start of M2 to start of EX

    • Can also remove *ALU* bypass from end of M2 to end of ID, and end of WB to end of ID

      • Still needed for bypassing load results

      • Didn’t require this answer


Quiz 1, Q2

  • 6-stage pipeline; new writeback at end of EX

  • Problem with precise state:

    • Memory address exceptions not detected until M2

    • By then, a subsequent ALU op has written back

      • lw r1,-1(r0) // misaligned address

      • xor r2,r3,r4 // r2 modified anyway

    • Fix with interlock:

      • Stall any ALU op immediately following any load/store

      • Actually reduces control logic (interlock is already there for a load followed by a dependent ALU op)
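A rough timing sketch of why the interlock restores precise state. It assumes the same stall-free six-stage layout (IF, ID, EX, M1, M2, WB; IF is assumed), that an ALU op writes the register file at the end of EX, and that an exception detected during M2 can still squash a write scheduled for the end of that same cycle. That last point is a modelling assumption, not something the slides state, and the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "M1", "M2", "WB"]


def alu_write_cycle(index: int, stall: int = 0) -> int:
    """Cycle at whose end ALU op number `index` writes the register file."""
    return index + stall + STAGES.index("EX")


def mem_exception_cycle(index: int) -> int:
    """Cycle in which memory op number `index` detects an address exception."""
    return index + STAGES.index("M2")


def state_is_imprecise(mem_index: int, alu_index: int, stall: int = 0) -> bool:
    """True if the later ALU op has already written back when the exception is detected."""
    return alu_write_cycle(alu_index, stall) < mem_exception_cycle(mem_index)


# lw r1,-1(r0) is instruction 0; xor r2,r3,r4 is instruction 1.
print("no interlock:       ", state_is_imprecise(0, 1))
print("one-cycle interlock:", state_is_imprecise(0, 1, stall=1))
print("ALU op two later:   ", state_is_imprecise(0, 2))
```

In this model only the ALU op immediately following the load or store needs the stall, which matches the interlock described above.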


Quiz 1, Q2

  • 6-stage pipeline; new writeback at end of EX

  • Problem with precise state:

    • Memory address exceptions not detected until M2

    • By then, a subsequent ALU op has written back

      • lw r1,-1(r0) // misaligned address

      • xor r2,r3,r4 // r2 modified anyway

    • Fix with additional read port:

      • Use read port to read *rd* (r2 in above example)

      • If lw causes trap, can then restore old value of rd
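The read-port fix amounts to keeping a one-entry undo record for the early write. The sketch below is illustrative only: the `RegFile` class and its method names are hypothetical, and it assumes at most one early ALU write can be outstanding when the trap is detected.

```python
class RegFile:
    """Register file with a one-entry undo record for early ALU writes."""

    def __init__(self):
        self.regs = [0] * 32
        self.saved = None  # (rd, old value) captured via the extra read port

    def early_write(self, rd: int, value: int) -> None:
        """ALU op writes at end of EX; the extra read port captures the old rd first."""
        self.saved = (rd, self.regs[rd])
        self.regs[rd] = value

    def commit(self) -> None:
        """Preceding memory op completed without a trap; discard the undo record."""
        self.saved = None

    def trap(self) -> None:
        """Preceding memory op trapped; restore the old value of rd."""
        if self.saved is not None:
            rd, old = self.saved
            self.regs[rd] = old
            self.saved = None


rf = RegFile()
rf.regs[2], rf.regs[3], rf.regs[4] = 7, 0b1100, 0b1010

# xor r2,r3,r4 writes back early, then the preceding lw traps.
rf.early_write(2, rf.regs[3] ^ rf.regs[4])
print("r2 after early write:", rf.regs[2])
rf.trap()
print("r2 after trap:       ", rf.regs[2])
```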


Quiz 1, Q3

  • Reducing number of registers in ISA

    • Increases instructions/program because more registers must be spilled to the stack

    • Increases CPI because of load-use delay (these loads will be harder to schedule around)

      • Little penalty for “no effect”

      • Subtle: could decrease CPI for some programs with bad D$ hit rates; stack accesses will almost always hit

    • Smaller RF could shorten critical path


Quiz 1, Q3

  • Adding a branch delay slot

    • Compiler can’t always fill delay slot usefully, so more NOPs => more insns/program

    • CPI decreases because fewer control hazards are possible. Also, new NOPs have low CPI

    • Small critical path reduction: don’t need control signal to squash instructions after a taken branch

      • Credit still given for “no effect”


Quiz 1, Q3

  • Merging Execute and Memory Stages

    • No effect on insns/program: not ISA visible

    • Decreases CPI: eliminates load-use delay

      • NOT just because the pipeline depth is reduced

    • Address calculation added to critical path
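A back-of-the-envelope sketch of the CPI effect of the load-use delay. The instruction-mix numbers are hypothetical, and the model assumes load-use stalls are the only source of CPI above 1.

```python
def cpi_with_load_use(load_fraction: float, dependent_fraction: float,
                      penalty_cycles: int) -> float:
    """CPI = 1 + (fraction of loads) * (fraction followed by a dependent use) * penalty."""
    return 1.0 + load_fraction * dependent_fraction * penalty_cycles


# Hypothetical mix: 25% loads, 40% of them immediately followed by a dependent insn.
cpi_split = cpi_with_load_use(0.25, 0.40, penalty_cycles=1)   # separate EX and MEM
cpi_merged = cpi_with_load_use(0.25, 0.40, penalty_cycles=0)  # merged stage, no delay

# Merging pays off only if the cycle time grows by less than this factor.
breakeven_cycle_growth = cpi_split / cpi_merged

print(f"CPI with separate stages: {cpi_split:.2f}")
print(f"CPI with merged stage:    {cpi_merged:.2f}")
print(f"Break-even cycle-time growth: {breakeven_cycle_growth:.2f}x")
```

With these made-up numbers, the merged design wins only if putting address calculation on the critical path lengthens the cycle by less than about 10%.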


Quiz 1, Q3

  • Microcoded CISC -> pipelined RISC

    • Increases insns/program: CISCs take fewer insns to encode a given program

    • Decreases CPI: RISC pipelines can sustain CPIs close to 1, whereas microcoded machines take several clocks per insn

    • Toss-up on seconds/cycle

      • Bypasses and extra control signals in pipeline are slow

      • Shared bus in microcoded machine could be slow, too
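The three factors discussed in Q3 multiply together: time/program = insns/program * cycles/insn * seconds/cycle. The sketch below plugs in purely hypothetical numbers for the CISC-to-RISC comparison; none of them come from the quiz.

```python
def execution_time(insns: float, cpi: float, seconds_per_cycle: float) -> float:
    """Execution time = instructions/program * cycles/instruction * seconds/cycle."""
    return insns * cpi * seconds_per_cycle


# Hypothetical numbers: the RISC needs 30% more instructions, but its CPI is
# near 1 while the microcoded CISC takes several clocks per instruction.
# Cycle time is held equal, reflecting the "toss-up" on seconds/cycle.
cisc = execution_time(insns=1_000_000, cpi=6.0, seconds_per_cycle=10e-9)
risc = execution_time(insns=1_300_000, cpi=1.2, seconds_per_cycle=10e-9)

print(f"Microcoded CISC: {cisc * 1e3:.1f} ms")
print(f"Pipelined RISC:  {risc * 1e3:.1f} ms")
print(f"Speedup:         {cisc / risc:.2f}x")
```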

