
Lecture 4: CPU Performance


Presentation Transcript


  1. Lecture 4: CPU Performance

  2. A Modern Processor: Intel Core i7

  3. Processor Performance
  Two bounds characterize the maximum achievable performance:
  • Latency bound
    • Applies when operations must be performed in strict sequence (e.g. because of a data dependency)
    • Minimum time to perform the operations sequentially
  • Throughput bound
    • Characterizes the raw computing capacity of the processor's functional units
    • Maximum operations per cycle
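To make the two bounds concrete, here is a minimal C sketch (the function names, array size, and unrolling factor are illustrative, not from the slides): the first loop is latency-bound because every addition waits on the previous one, while the two-accumulator version breaks the dependency chain and moves closer to the throughput bound of the adder units.

    #include <stdio.h>

    #define N 10000

    /* Latency-bound: each addition depends on the previous sum,
     * so the adds form one sequential dependency chain. */
    double sum_serial(const double *a, int n) {
        double acc = 0.0;
        for (int i = 0; i < n; i++)
            acc += a[i];
        return acc;
    }

    /* Closer to the throughput bound: two independent accumulators
     * break the chain, letting the adder hardware overlap work. */
    double sum_unrolled2(const double *a, int n) {
        double acc0 = 0.0, acc1 = 0.0;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            acc0 += a[i];
            acc1 += a[i + 1];
        }
        for (; i < n; i++)      /* handle a leftover element */
            acc0 += a[i];
        return acc0 + acc1;
    }

    int main(void) {
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0;
        printf("%f %f\n", sum_serial(a, N), sum_unrolled2(a, N));
        return 0;
    }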

  4. Pipelining
  [Diagram: tasks built from stages s1, s2, s3 plotted against time, without pipelining (each task runs to completion before the next starts) and with pipelining (stages of successive tasks overlap)]

  5. Pipelining
  Let s be the number of stages, n the number of tasks, and t the time per stage.
  • Without pipeline: T1 = s · t · n
  • With pipeline: Tp = s · t + (n - 1) · t
  • Speedup = T1 / Tp = s · n / (s + n - 1) = s / (s/n + 1 - 1/n), which approaches s as n grows
  • Throughput = n / Tp
  [Diagram: the same with/without-pipeline stage charts, annotated with T1 and Tp]
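A few lines of C evaluate these formulas; the stage count, task count, and stage time below are arbitrary example values, not figures from the slide.

    #include <stdio.h>

    /* Evaluate the slide's formulas for s stages, n tasks, t time per stage. */
    int main(void) {
        double s = 5, n = 100, t = 1.0;
        double t1 = s * t * n;             /* without pipeline */
        double tp = s * t + (n - 1) * t;   /* with pipeline    */
        printf("speedup    = %.2f (limit is s = %.0f as n grows)\n", t1 / tp, s);
        printf("throughput = %.3f tasks per time unit\n", n / tp);
        return 0;
    }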

  6. Pipelining
  • The slowest stage determines the pipeline performance.
  [Diagram: stages s1, s2, s3 with delays of 10, 30, and 20 time units; the 30-unit stage sets the cycle time in the with-pipeline vs. without-pipeline charts]

  7. Computational Pipelines
  [Diagram: combinational logic blocks A, B, and C separated by pipeline registers (Reg / R), all driven by a common clock]

  8. Limitations of Pipelining
  • Nonuniform partitioning
    • Stage delays may be nonuniform
    • Throughput is limited by the slowest stage
  • Deep pipelining
    • Large number of stages
    • Modern processors have deep pipelines (15 or more stages) to increase the clock rate
  [Diagram: a pipeline with 50 ps, 150 ps, and 100 ps logic stages, each followed by a 20 ps register, contrasted with a pipeline of uniform 50 ps stages]
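A small C sketch of the timing argument, using the 50/150/100 ps stage delays and 20 ps register overhead from the slide's figure: the clock period is set by the slowest stage plus the register delay, no matter how fast the remaining stages are.

    #include <stdio.h>

    /* Clock period = slowest stage delay + register overhead. */
    int main(void) {
        double stage_ps[] = {50.0, 150.0, 100.0};   /* nonuniform stage delays */
        double reg_ps = 20.0;                       /* register delay per stage */
        int nstages = sizeof stage_ps / sizeof stage_ps[0];

        double slowest = 0.0;
        for (int i = 0; i < nstages; i++)
            if (stage_ps[i] > slowest)
                slowest = stage_ps[i];

        double period_ps = slowest + reg_ps;        /* set by the 150 ps stage */
        printf("clock period = %.0f ps (%.2f GHz)\n",
               period_ps, 1000.0 / period_ps);
        return 0;
    }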

  9.–13. Pipelined Parallel Adder
  [Animation across five slides: groups of operand pairs (a1,b1)...(a4,b4), then (c,d), (e,f), and (g,h) stream into a pipelined adder; each clock cycle the pairs advance one position and completed sums (a1+b1, a2+b2, ...) appear one after another, so once the pipeline is full one result is produced per cycle]
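The slides' animation can be mimicked with a short simulation. This is only a sketch under assumptions the slides do not state: a four-stage adder pipeline, one operand pair entering per cycle, and made-up operand values.

    #include <stdio.h>

    #define STAGES 4   /* pipeline depth assumed from the slides' layout */

    typedef struct { int a, b; int valid; } pair_t;

    int main(void) {
        /* Operand stream: a1..a4 paired with b1..b4 (values are illustrative). */
        pair_t in[] = {{1, 10, 1}, {2, 20, 1}, {3, 30, 1}, {4, 40, 1}};
        int n_in = 4, next = 0;

        pair_t pipe[STAGES] = {{0, 0, 0}};   /* contents of the pipeline registers */

        for (int cycle = 1; cycle <= n_in + STAGES; cycle++) {
            /* The pair in the last stage completes this cycle. */
            if (pipe[STAGES - 1].valid)
                printf("cycle %d: result %d\n", cycle,
                       pipe[STAGES - 1].a + pipe[STAGES - 1].b);

            /* Shift every pair forward one stage. */
            for (int s = STAGES - 1; s > 0; s--)
                pipe[s] = pipe[s - 1];

            /* Feed a new pair into the first stage, if any remain. */
            if (next < n_in) pipe[0] = in[next++];
            else             pipe[0].valid = 0;
        }
        return 0;
    }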

  14. Instruction Execution Pipeline
  • Instruction fetch cycle (IF)
    • Fetch the current instruction from memory
    • Increment the PC
  • Instruction decode / register fetch cycle (ID)
    • Decode the instruction
    • Compute the possible branch target
    • Read registers from the register file
  • Execution / effective address cycle (EX)
    • Form the effective address
    • ALU performs the operation specified by the opcode
  • Memory access (MEM)
    • Memory read for a load instruction
    • Memory write for a store instruction
  • Write-back cycle (WB)
    • Write the result into the register file
  [Diagram: the five stages IF, ID, EX, MEM, WB in sequence]
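For illustration, the C sketch below prints the classic pipeline chart for five instructions flowing through IF/ID/EX/MEM/WB with no hazards; the instruction count and the text layout are my choices, not the slide's.

    #include <stdio.h>

    /* Print a pipeline chart: instruction i enters IF in cycle i+1
     * and occupies one stage per cycle (no hazards assumed). */
    int main(void) {
        const char *stage[] = {"IF", "ID", "EX", "MEM", "WB"};
        int n_instr = 5, n_stage = 5;

        for (int i = 0; i < n_instr; i++) {
            printf("instr %d:", i + 1);
            for (int c = 1; c <= n_instr + n_stage - 1; c++) {
                int s = c - 1 - i;                 /* stage index in cycle c */
                if (s >= 0 && s < n_stage) printf(" %3s", stage[s]);
                else                       printf("    ");
            }
            printf("\n");
        }
        return 0;
    }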

  15. Instruction Execution Pipeline
  [Diagram: pipeline chart with successive instructions overlapping in the IF, ID, EX, MEM, and WB stages, one stage per cycle]

  16. Pipeline Hazards
  • Structural hazards
  • Data hazards
  • Control hazards

  17. Pipeline Hazards: Structural Hazards
  • Arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution.
  [Diagram: pipeline chart in which two overlapped instructions need the same resource in the same cycle, so the later one is stalled and a bubble is inserted]

  18. Pipeline Hazards: Data Hazards
  • Arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions.
  Example sequence (every instruction after the ADD reads R1, which the ADD writes):
    ADD R1, R2, R3
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9
    XOR R10, R1, R11
  [Diagram: pipeline chart showing the dependent instructions reaching their register read before the ADD has written R1 back]
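As a sketch of how such dependences can be spotted, the C program below scans the slide's sequence for read-after-write conflicts. The three-instruction hazard window is an assumption about a five-stage pipeline without forwarding (the exact window depends on how the register file is clocked) and is not specified on the slide.

    #include <stdio.h>
    #include <string.h>

    typedef struct { const char *text, *dst, *src1, *src2; } instr_t;

    int main(void) {
        instr_t prog[] = {
            {"ADD R1, R2, R3",   "R1",  "R2", "R3"},
            {"SUB R4, R1, R5",   "R4",  "R1", "R5"},
            {"AND R6, R1, R7",   "R6",  "R1", "R7"},
            {"OR  R8, R1, R9",   "R8",  "R1", "R9"},
            {"XOR R10, R1, R11", "R10", "R1", "R11"},
        };
        int n = 5, window = 3;   /* assumed distance before write-back completes */

        for (int i = 1; i < n; i++)
            for (int j = i - 1; j >= 0 && j >= i - window; j--)
                if (strcmp(prog[i].src1, prog[j].dst) == 0 ||
                    strcmp(prog[i].src2, prog[j].dst) == 0)
                    printf("RAW hazard: \"%s\" needs %s before \"%s\" writes it back\n",
                           prog[i].text, prog[j].dst, prog[j].text);
        return 0;
    }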

  19. Pipeline Hazards: Data Hazards
  • Forwarding (bypassing): the result is routed from the ALU output (or a later pipeline register) directly back to a dependent instruction's ALU input, instead of waiting for write-back.
  [Diagram: pipeline chart with forwarding paths from the ALU output of earlier instructions to the ALU inputs of later ones]
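A minimal sketch of the selection logic behind forwarding, assuming pipeline registers named EX/MEM and MEM/WB and illustrative register values (none of these names come from the slide): the newest in-flight result wins over the stale register-file copy.

    #include <stdio.h>

    #define NREGS 16

    typedef struct { int dst; int value; int writes_reg; } stage_reg_t;

    /* Pick one ALU source operand, preferring in-flight results. */
    static int forward_operand(int src, const int regfile[],
                               stage_reg_t ex_mem, stage_reg_t mem_wb) {
        if (ex_mem.writes_reg && ex_mem.dst == src)
            return ex_mem.value;          /* bypass from the EX/MEM register  */
        if (mem_wb.writes_reg && mem_wb.dst == src)
            return mem_wb.value;          /* bypass from the MEM/WB register  */
        return regfile[src];              /* no hazard: use the register file */
    }

    int main(void) {
        int regfile[NREGS] = {0};
        regfile[1] = 111;                     /* stale value of R1           */
        stage_reg_t ex_mem = {1, 999, 1};     /* the ADD just computed R1    */
        stage_reg_t mem_wb = {0, 0, 0};

        /* SUB R4, R1, R5 reads R1: forwarding supplies 999, not 111. */
        printf("operand R1 = %d\n", forward_operand(1, regfile, ex_mem, mem_wb));
        return 0;
    }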

  20. Pipeline Hazards: Control (Branch) Hazards
  • Arise from pipelining instructions that change the PC (e.g. branches).
  Example loop (computing c[i] = a[i] + b[i] for i = n down to 1), which ends every iteration with a conditional branch:
    LOOP: LOAD  100,X
          ADD   200,X
          STORE 300,X
          DECX
          BNE   LOOP
  [Diagram: pipeline chart showing instructions fetched after the branch before its outcome is known]
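The same loop in C, for reference; the array length and values are made up. The point is that the compiled loop ends in a conditional branch whose outcome the pipeline must guess on every iteration.

    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Counts down like the DECX / BNE pair on the slide; each
         * iteration ends in a conditional branch back to the top. */
        for (int i = N - 1; i >= 0; i--)
            c[i] = a[i] + b[i];

        printf("c[0] = %f\n", c[0]);
        return 0;
    }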

  21. Pipeline Hazards: Control (Branch) Hazards
  • Freeze (flush): instructions fetched after the branch are flushed (or fetch is stalled) until the branch outcome and target are known.
    BRA L1
    ...
    L1: NEXT
        NEXT
        NEXT
        NEXT
  [Diagram: pipeline chart with the slots behind the branch turned into bubbles]

  22. Pipeline Hazards: Control (Branch) Hazards
  • Predicted-not-taken: keep fetching the sequential successors and squash them if the branch turns out to be taken.
    BNE L1
    NEXT
    NEXT
    NEXT
    ...
    L1: NEXT
        NEXT
        NEXT
  [Diagram: pipeline charts for the not-taken case (no penalty) and the taken case (the fetched successors are squashed)]

  23. Pipeline Hazards: Control (Branch) Hazards
  • Predicted-taken: fetch from the branch target as soon as it is known and squash those instructions if the branch is not taken.
    BNE L1
    NEXT
    NEXT
    NEXT
    ...
    L1: NEXT
        NEXT
        NEXT
  [Diagram: pipeline charts for the taken and not-taken cases]
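To compare the two static policies of the last two slides, here is a small C sketch that counts mispredictions on a loop-closing branch; the outcome stream (taken seven times, then falling through) is an illustrative assumption, not data from the slides.

    #include <stdio.h>

    int main(void) {
        int taken[] = {1, 1, 1, 1, 1, 1, 1, 0};   /* recorded branch outcomes */
        int n = 8, miss_not_taken = 0, miss_taken = 0;

        for (int i = 0; i < n; i++) {
            if (taken[i] != 0) miss_not_taken++;  /* not-taken policy misses when taken   */
            if (taken[i] == 0) miss_taken++;      /* taken policy misses on fall-through  */
        }
        printf("predicted-not-taken: %d/%d mispredictions\n", miss_not_taken, n);
        printf("predicted-taken:     %d/%d mispredictions\n", miss_taken, n);
        return 0;
    }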

  24. Pipeline Hazards: Control (Branch) Hazards
  • Delayed branch: the instruction in the delay slot (the branch's sequential successor) always executes, whether or not the branch is taken; the compiler tries to fill the slot with a useful instruction, e.g. one moved from before the branch.
  Before scheduling:
    ADD R1, R2, R3
    if (R2 = 0) branch L1
    <delay slot>
    NEXT
    NEXT
    ...
    L1: NEXT
        NEXT
        NEXT
  After scheduling (the ADD fills the delay slot):
    if (R2 = 0) branch L1
    ADD R1, R2, R3
    NEXT
    NEXT
    ...
    L1: NEXT
        NEXT
        NEXT
  [Diagram: pipeline charts for the taken (branch target) and not-taken (sequential successor) cases; the delay-slot instruction completes in both]

  25. Levels of Parallelism
  • Bit-level parallelism
    • Within arithmetic logic circuits
  • Instruction-level parallelism
    • Multiple instructions execute per clock cycle
  • Memory-system parallelism
    • Overlap of memory operations with computation
  • Operating-system parallelism
    • More than one processor
    • Multiple jobs run in parallel on an SMP
  • Loop level
  • Procedure level
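As one concrete illustration of loop-level parallelism, the sketch below marks a loop whose iterations are independent so they can be split across processors. The OpenMP pragma is my choice of mechanism, not something named on the slide; build with an OpenMP flag such as gcc's -fopenmp, and without it the pragma is simply ignored and the loop runs sequentially.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Iterations are independent, so they may run in parallel
         * across processors (loop-level parallelism). */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %.1f\n", c[N - 1]);
        return 0;
    }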
