CS61c – Final Review
Fall 2004

Zhangxi Tan

12/11/2004

  • CPU Design
  • Pipelining
  • Caches
  • Virtual Memory
  • I/O and Performance
Single Cycle CPU Design
  • Overview Picture
  • Two Major Issues
    • Datapath
    • Control
  • Control is the hard part, but it is made easier by the regular format of MIPS instructions
Single-Cycle CPU Design

[Overview diagram: the PC and next-address logic drive an ideal instruction memory; the fetched instruction feeds the control unit, which turns instruction bits and condition inputs into control signals; the datapath below contains the register file (Rw/Ra/Rb ports on 32 x 32-bit registers, addressed by the 5-bit Rs/Rt/Rd fields), the ALU, and an ideal data memory, all driven by the clock.]


CPU Design – Steps to Design/Understand a CPU

  • 1. Analyze instruction set architecture (ISA) => datapath requirements
  • 2. Select set of datapath components and establish clocking methodology
  • 3. Assemble datapath meeting requirements
  • 4. Analyze implementation of each instruction to determine setting of control points.
  • 5. Assemble the control logic
Putting it All Together: A Single-Cycle Datapath

[Full datapath diagram: the instruction memory fetches Instruction<31:0>; its Rs <25:21>, Rt <20:16>, Rd <15:11>, and Imm16 <15:0> fields feed the register file (Rw/Ra/Rb on 32 x 32-bit registers) and the sign extender; muxes controlled by RegDst, ALUSrc, and MemtoReg steer values through the ALU and the data memory; nPC_sel selects between PC + 4 and the branch target; the remaining control signals are ALUctr, MemWr, RegWr, and ExtOp.]
CPU Design – Components of the Datapath
  • Memory (MEM)
    • instructions & data
  • Registers (R: 32 x 32)
    • read RS
    • read RT
    • write RT or RD
  • PC
  • Extender (sign extend)
  • ALU (Add and Sub register or extended immediate)
  • Add 4 or extended immediate to PC
CPU Design – Instruction Implementation
  • Instructions supported (for our sample processor):
    • lw, sw
    • beq
    • R-format (add, sub, and, or, slt)
    • corresponding I-format (addi …)
  • You should be able to:
    • given instructions, write control signals
    • given control signals, write corresponding instructions in MIPS assembly
What Does An ADD Look Like?

[The same single-cycle datapath diagram: work out which values the control signals (RegDst, nPC_sel, ALUctr, MemWr, MemtoReg, RegWr, ExtOp, ALUSrc) must take for an add.]

Add

  • R[rd] = R[rs] + R[rt]

R-format encoding: op <31:26>, rs <25:21>, rt <20:16>, rd <15:11>, shamt <10:6>, funct <5:0>

[Datapath diagram with the control settings for add: PCSrc = 0, RegDst = 1, ALUctr = Add, RegWr = 1, MemtoReg = 0, MemWr = 0, ALUSrc = 0]

How About ADDI?

[The same datapath diagram again: determine the control signal settings for addi.]

Addi

  • R[rt] = R[rs] + SignExt[Imm16]

I-format encoding: op <31:26>, rs <25:21>, rt <20:16>, immediate <15:0>

[Datapath diagram with the control settings for addi: PCSrc = 0, RegDst = 0, ALUctr = Add, RegWr = 1, MemtoReg = 0, MemWr = 0, ALUSrc = 1]

Control

[Diagram: the Control block takes the Op and Fun fields of Instruction<31:0>, plus the Zero flag from the datapath, and produces the control signals that drive the datapath: ALUctr, MemWr, MemtoReg, ALUSrc, RegWr, RegDst, and PCSrc.]
Topics Since Midterm
  • CPU Design
  • Pipelining
  • Caches
  • Virtual Memory
  • I/O and Performance
Pipelining
  • View the processing of an instruction as a sequence of potentially independent steps
  • Use this separation of stages to optimize your CPU by starting to process the next instruction while still working on the previous one
  • In the real world, you have to deal with some interference between instructions
Review Datapath

[Five-stage datapath diagram: 1. Instruction Fetch (PC, +4, instruction memory), 2. Decode/Register Read (rs, rt, rd, imm, register file), 3. Execute (ALU), 4. Memory (data memory), 5. WriteBack.]
Sequential Laundry

[Timeline from 6 PM to 2 AM: tasks A through D each consist of 30-minute stages and run strictly one after another.]

  • Sequential laundry takes 8 hours for 4 loads
Pipelined Laundry

[Timeline from 6 PM: the 30-minute stages of tasks A through D overlap, so all four loads finish by 9:30 PM.]

  • Pipelined laundry takes 3.5 hours for 4 loads!
Pipelining -- Key Points
  • Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
  • Multiple tasks operating simultaneously using different resources
  • Potential speedup = Number pipe stages
  • Time to “fill” the pipeline and time to “drain” it reduce speedup
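  • Example from the laundry slides: 4 stages suggest a potential 4x speedup, but the actual speedup is 8 hours / 3.5 hours ≈ 2.3x, precisely because of the fill and drain time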
Pipelining -- Limitations
  • Pipeline rate limited by slowest pipeline stage
  • Unbalanced lengths of pipe stages also reduce speedup
  • Interference between instructions – called a hazard
Pipelining -- Hazards
  • Hazards prevent next instruction from executing during its designated clock cycle
    • Structural hazards: HW cannot support this combination of instructions
    • Control hazards: branches and other instructions that change the PC stall the pipeline until the hazard is resolved, leaving “bubbles” in the pipeline
    • Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
Pipelining – Structural Hazards
  • Avoid memory hazards by having two L1 caches – one for data and one for instructions
  • Avoid register conflicts by always writing in the first half of the clock cycle and reading in the second half
    • This is OK because register access is much faster than the rest of the critical path
Pipelining – Control Hazards
  • Occurs on a branch or jump instruction
  • Optimally you would always branch when needed, never execute instructions you shouldn’t have, and always have a full pipeline
    • This generally isn’t possible
  • Do the best we can
    • Optimize to 1 problem instruction
    • Stall
    • Branch Delay Slot
Pipelining – Data Hazards
  • Occur when one instruction is dependent on the results of an earlier instruction
  • Can be solved by forwarding for all cases except a load immediately followed by a dependent instruction
    • In this case we detect the problem and stall (for lack of a better plan)
Pipelining -- Exercise

Stages: I = instruction fetch, D = decode/register read, A = ALU, M = memory, W = write back; “•” marks a stall bubble. One possible timing:

addi $t0, $t1, 100  I D A M W
lw   $t2, 4($t0)      I D A M W
add  $t3, $t1, $t2      I D • A M W
sw   $t3, 8($t0)          I • D A M W
lw   $t5, 0($t6)            • I D A M W
or   $t5, $t0, $t3            I D A M W

The add must stall one cycle for the lw result ($t2), which is then forwarded; the bubble ripples through the instructions behind it. There is no stall between the second lw and the or, because or does not read $t5.
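The load-use rule is mechanical enough to check with a short program. Here is a minimal Python sketch (mine, not the slides'; the tuple encoding and helper names are invented for illustration) that flags exactly the instructions that must stall:

def dest(instr):
    # register written by the instruction, if any (sw writes memory, not a register)
    return instr[1] if instr[0] in ("addi", "lw", "add", "or") else None

def sources(instr):
    # registers read by the instruction
    op, args = instr[0], instr[1:]
    if op in ("add", "or"):
        return args[1:]              # add rd, rs, rt reads rs and rt
    if op in ("addi", "lw"):
        return (args[2],)            # reads its source/base register
    if op == "sw":
        return (args[1], args[2])    # reads the stored value and the base
    return ()

def load_use_stalls(program):
    # forwarding covers every other dependence; only a load followed
    # immediately by a consumer of its result needs a one-cycle stall
    return [i for i in range(1, len(program))
            if program[i - 1][0] == "lw"
            and dest(program[i - 1]) in sources(program[i])]

program = [
    ("addi", "$t0", 100,   "$t1"),
    ("lw",   "$t2", 4,     "$t0"),
    ("add",  "$t3", "$t1", "$t2"),   # reads $t2 right after the lw
    ("sw",   None,  "$t3", "$t0"),
    ("lw",   "$t5", 0,     "$t6"),
    ("or",   "$t5", "$t0", "$t3"),   # does not read $t5, so no stall
]
print(load_use_stalls(program))      # [2] -- only the add stalls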

Topics Since Midterm
  • Digital Logic
    • Verilog
    • State Machines
  • CPU Design
  • Pipelining
  • Caches
  • Virtual Memory
  • I/O and Performance
Caches
  • The Problem: Memory is slow compared to the CPU
  • The Solution: Create a fast layer in between the CPU and memory that holds a subset of what is stored in memory
  • We call this creature a Cache
Caches – Reference Patterns
  • Holding an arbitrary subset of memory in a faster layer would not, by itself, provide any performance increase
    • Therefore we must carefully choose what to put there
  • Temporal Locality: When a piece of data is referenced once it is likely to be referenced again soon
  • Spatial Locality: When a piece of data is referenced it is likely that the data near it in the address space will be referenced soon
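A toy loop (an illustration of mine, not from the slides) shows both kinds at once:

data = list(range(1024))

total = 0          # total is reused every iteration: temporal locality
for x in data:     # consecutive elements sit next to each other in memory:
    total += x     # spatial locality, which is why fetching whole blocks pays off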
Caches – Format & Mapping
  • Tag: Unique identifier for each block in memory that maps to same index in cache
  • Index: Which “row” in the cache the data block will map to (for a direct-mapped cache each row is a single cache block)
  • Block Offset: Byte offset within the block for the particular word or byte you want to access
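To make the three fields concrete, here is a small Python sketch that splits an address for a direct-mapped cache (the parameters are illustrative, not from a particular exercise):

def split_address(addr, block_bytes, num_rows):
    # field widths follow the slide: offset = log2(block size), index = log2(rows)
    offset_bits = block_bytes.bit_length() - 1
    index_bits = num_rows.bit_length() - 1
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_rows - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# 32-byte blocks and 4 rows: 5 offset bits, 2 index bits, the rest is tag
print(split_address(0x1234, block_bytes=32, num_rows=4))   # (36, 1, 20)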
Caches – Exercise 1

How many bits would be required to implement the following cache?

Size: 1MB

Associativity: 8-way set associative

Write policy: Write back

Block size: 32 bytes

Replacement policy: Clock LRU (requires one bit per data block)

Caches – Solution 1

Number of blocks = 1 MB / 32 bytes = 32K blocks (2^15)

Number of sets = 32K blocks / 8 ways = 4K sets (2^12) → 12 bits of index

Bits per set = 8 * (8 * 32 + (32 - 12 - 5) + 1 + 1 + 1) = 2192
(per block: 32 bytes of data + tag bits + valid + LRU + dirty)

Bits total = (bits per set) * (# of sets) = 2192 * 2^12 = 8,978,432 bits
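The same arithmetic as a quick Python check (parameters straight from the exercise):

cache_bytes = 1 << 20                  # 1 MB
block_bytes, ways, addr_bits = 32, 8, 32

blocks = cache_bytes // block_bytes    # 2^15
sets = blocks // ways                  # 2^12
offset_bits, index_bits = 5, 12
tag_bits = addr_bits - index_bits - offset_bits    # 15

# per block: data + tag + valid + LRU + dirty
bits_per_block = 8 * block_bytes + tag_bits + 1 + 1 + 1   # 274
print(sets * ways * bits_per_block)    # 8978432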

Caches – Exercise 2

Given the following cache and access pattern, classify each access as hit, compulsory miss, conflict miss, or capacity miss:

Cache:

2-way set associative 32 byte cache with 1 word blocks. Use LRU to determine which block to replace.

Access Pattern:

4, 0, 32, 16, 8, 20, 40, 8, 12, 100, 0, 36, 68

Caches – Solution 2

With 1-word blocks the byte offset is (as always) 2 bits; the 32-byte, 2-way cache has 4 sets, so the index is 2 bits and the tag is 28 bits.

Working the pattern through (block = address / 4, set = block mod 4, LRU within each set):

4 miss (compulsory), 0 miss (compulsory), 32 miss (compulsory), 16 miss (compulsory, evicts 0), 8 miss (compulsory), 20 miss (compulsory), 40 miss (compulsory), 8 hit, 12 miss (compulsory), 100 miss (compulsory, evicts 4), 0 miss (conflict: it was evicted, but a fully associative cache of the same size would have hit), 36 miss (compulsory, evicts 20), 68 miss (compulsory, evicts 100)
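A short simulator reproduces this classification. The cache parameters come from the exercise; the miss taxonomy is the standard one (a conflict miss would have hit in a fully associative cache of the same total size), not something stated on the slides:

def classify(accesses, num_blocks=8, ways=2, block_bytes=4):
    num_sets = num_blocks // ways
    cache = [[] for _ in range(num_sets)]   # per-set LRU order, MRU last
    shadow = []                             # fully associative LRU shadow cache
    seen = set()
    results = []
    for addr in accesses:
        block = addr // block_bytes
        s = block % num_sets
        if block in cache[s]:
            kind = "hit"
        elif block not in seen:
            kind = "compulsory miss"
        elif block in shadow:
            kind = "conflict miss"          # only the mapping caused the miss
        else:
            kind = "capacity miss"          # would miss even fully associatively
        results.append((addr, kind))
        if block in cache[s]:               # update the set's LRU state
            cache[s].remove(block)
        cache[s].append(block)
        if len(cache[s]) > ways:
            cache[s].pop(0)
        if block in shadow:                 # update the shadow's LRU state
            shadow.remove(block)
        shadow.append(block)
        if len(shadow) > num_blocks:
            shadow.pop(0)
        seen.add(block)
    return results

for addr, kind in classify([4, 0, 32, 16, 8, 20, 40, 8, 12, 100, 0, 36, 68]):
    print(addr, kind)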

Topics Since Midterm
  • Digital Logic
    • Verilog
    • State Machines
  • CPU Design
  • Pipelining
  • Caches
  • Virtual Memory
  • I/O and Performance
Virtual Memory
  • Caching works well – why not extend the basic concept to another level?
  • We can make the CPU think it has a much larger memory than it actually does by “swapping” things in and out to disk
  • While we’re doing this, we might as well go ahead and separate the physical address space and the virtual address space for protection and isolation
VM – Virtual Address
  • Address space broken into fixed-sized pages
  • Two fields in a virtual address
    • VPN
    • Offset
  • Size of offset = log2(page size); e.g., 4 KB pages give a 12-bit offset
VM – Address Translation
  • VPN used to index into page table and get Page Table Entry (PTE)
  • PTE is located by indexing off of the Page Table Base Register, which is changed on context switches
  • PTE contains valid bit, Physical Page Number (PPN), access rights
VM – Translation Look-Aside Buffer (TLB)
  • VM provides a lot of nice features, but requires several memory accesses for its indirection – this really kills performance
  • The solution? Another level of indirection: the TLB
  • Very small fully associative cache containing the most recently used mappings from VPN to PPN
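In software terms the lookup order is roughly the sketch below (the dicts, sizes, and sample mappings are illustrative only; a real TLB is a small fully associative hardware cache, not a Python dict):

PAGE_BITS = 12                       # 4 KB pages

tlb = {}                             # VPN -> PPN, most recently used mappings
page_table = {0x00400: 0x12345}      # hypothetical valid PTEs: VPN -> PPN

def translate(vaddr):
    vpn, offset = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
    if vpn in tlb:                   # TLB hit: no page-table access needed
        ppn = tlb[vpn]
    elif vpn in page_table:          # TLB miss: walk the page table, refill TLB
        ppn = page_table[vpn]
        tlb[vpn] = ppn
    else:                            # invalid PTE: page fault, go to disk
        raise MemoryError("page fault")
    return (ppn << PAGE_BITS) | offset

print(hex(translate(0x00400ABC)))    # 0x12345abc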
VM – Exercise

Given a processor with the following parameters, how many bytes would be required to hold the entire page table in memory?

  • Addresses: 32-bits
  • Page Size: 4KB
  • Access modes: RO, RW
  • Write policy: Write back
  • Replacement: Clock LRU (needs 1-bit)
VM – Solution

Number of bits of page offset = 12

Number of bits of VPN/PPN = 20

Number of pages in address space = 2^32 bytes / 2^12 bytes = 2^20 = 1M pages

Size of PTE = 20 + 1 + 1 + 1 = 23 bits (PPN + access-mode, dirty, and clock-LRU bits)

Size of PT = 2^20 * 23 bits = 24,117,248 bits = 3,014,656 bytes
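The same computation as a quick check (parameters from the exercise):

addr_bits, page_bytes = 32, 4 * 1024

offset_bits = (page_bytes - 1).bit_length()   # 12
vpn_bits = addr_bits - offset_bits            # 20
num_pages = 1 << vpn_bits                     # 2^20

pte_bits = vpn_bits + 1 + 1 + 1               # PPN + access + dirty + clock
total_bits = num_pages * pte_bits
print(total_bits, total_bits // 8)            # 24117248 3014656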

The Big Picture: CPU – TLB – Cache – Memory – VM

[Flow diagram: the processor sends a virtual address (VA) to the TLB lookup; a TLB miss goes through translation (the page table) first; the resulting physical address goes to the cache, which returns data on a hit; a cache miss is filled from main memory.]
Big Picture – Exercise

What happens in the following cases?

  • TLB Miss
  • Page Table Entry Valid Bit 0
  • Page Table Entry Valid Bit 1
  • TLB Hit
  • Cache Miss
  • Cache Hit
Big Picture – Solution
  • TLB Miss – Go to the page table to fill in the TLB. Retry instruction.
  • Page Table Entry Valid Bit 0 / Miss – Page not in memory. Fetch from backing store. (Page Fault)
  • Page Table Entry Valid Bit 1 / Hit – Page in memory. Fill in TLB. Retry instruction.
  • TLB Hit – Use PPN to check cache.
  • Cache Miss – Use PPN to retrieve data from main memory. Fill in cache. Retry instruction.
  • Cache Hit – Data successfully retrieved.
  • Important thing to note: Data is always retrieved from the cache.
Topics Since Midterm
  • Digital Logic
    • Verilog
    • State Machines
  • CPU Design
  • Pipelining
  • Caches
  • Virtual Memory
  • I/O and Performance
IO and Performance
  • IO Devices
  • Polling
  • Interrupts
  • Networks
IO – Problems Created by Device Speeds
  • CPU runs far faster than even the fastest of IO devices
  • CPU runs many orders of magnitude faster than the slowest of currently used IO devices
  • Solved by adhering to well-defined conventions
    • Control Registers
IO – Polling
  • CPU continuously checks the device’s control registers to see if it needs to take any action
  • Extremely easy to program
  • Extremely inefficient. The CPU spends potentially huge amounts of time polling IO devices.
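As a sketch, polling is just a busy-wait loop on a status register. Everything below (read_status, read_data, the READY bit) is an invented stand-in for reads of a device's memory-mapped control registers:

READY = 0x1    # hypothetical "data ready" bit in the status register

def poll(read_status, read_data, handle):
    # spin until the device is ready -- the CPU burns cycles here
    # whether or not the device ever needs attention
    while not (read_status() & READY):
        pass
    handle(read_data())

# fake device that becomes ready after three polls, just to run the loop
state = {"polls": 0}
def read_status():
    state["polls"] += 1
    return READY if state["polls"] > 3 else 0

poll(read_status, read_data=lambda: 42, handle=print)   # prints 42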
IO – Interrupts
  • Asynchronous notification to the CPU that there is something to be dealt with for an IO device
  • Not associated with any particular instruction – this implies that the interrupt must carry some information with it
  • Causes the processor to stop what it was doing and execute some other code
Networks – Protocols
  • A protocol establishes a logical format and API for communication
  • Actual work is done by a layer beneath the protocol, so as to protect the abstraction
  • Allows for encapsulation – carry higher level information within lower level “envelope”
  • Fragmentation – packets can be broken into smaller units and later reassembled
Networks – Complications
  • Packet headers eat into your total bandwidth
  • Software overhead for transmission limits your effective bandwidth significantly
Networks – Exercise

What percentage of your total bandwidth is being used for protocol overhead in this example:

  • Application sends 1MB of true data
  • TCP has a segment size of 64KB and adds a 20B header to each packet
  • IP adds a 20B header to each packet
  • Ethernet breaks data into 1500B packets and adds 24B worth of header and trailer
Networks – Solution

1 MB / 64 KB = 16 TCP packets

16 TCP packets = 16 IP packets

64 KB / 1500 B = 43.7, so 44 Ethernet packets per TCP packet

16 TCP packets * 44 = 704 Ethernet packets

20 B per TCP packet + 20 B per IP packet + 24 B per Ethernet packet:

20 B * 16 + 20 B * 16 + 24 B * 704 = 17,536 B of overhead

We send a total of 1,066,112 B. Of that, 17,536 B / 1,066,112 B ≈ 1.64% is protocol overhead.
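The arithmetic, checked in a few lines (numbers from the exercise; like the solution above, each 64 KB segment is fragmented independently):

import math

data = 1 << 20                                  # 1 MB of true data
tcp_packets = data // (64 * 1024)               # 16 (= IP packets too)
eth_per_tcp = math.ceil(64 * 1024 / 1500)       # 44
eth_packets = tcp_packets * eth_per_tcp         # 704

overhead = 20 * tcp_packets + 20 * tcp_packets + 24 * eth_packets
total = data + overhead
print(overhead, total, f"{overhead / total:.2%}")   # 17536 1066112 1.64%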

CPI

  • Calculate the CPI for the following mix of instructions: 30% of instructions with CPI 1, 40% with CPI 3, 30% with CPI 2

average CPI = 0.30 * 1 + 0.40 * 3 + 0.30 * 2 = 2.1

  • How long will it take to execute 1 billion instructions on a 2 GHz machine with the above CPI?

time = 1/2,000,000,000 (seconds/cycle) * 2.1 (cycles/instruction) * 1,000,000,000 (instructions) = 1.05 seconds
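As a quick check (mix and clock rate from the slide):

mix = [(0.30, 1), (0.40, 3), (0.30, 2)]        # (fraction of instructions, CPI)
cpi = sum(frac * c for frac, c in mix)

clock_hz = 2_000_000_000                       # 2 GHz
instructions = 1_000_000_000                   # 1 billion
seconds = instructions * cpi / clock_hz
print(f"{cpi:.2f} CPI, {seconds:.2f} s")       # 2.10 CPI, 1.05 s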