Review of ECE301: Computer Organization


1. Review of ECE301: Computer Organization AMD Barcelona: 4 cores ECE610 - Fall 2013

2. Abstractions • Abstraction helps us deal with complexity • Hide lower-level detail • Instruction set architecture (ISA) • The hardware/software interface • Application binary interface • The ISA plus system software interface • Implementation • The details underlying the interface E. W. Dijkstra: “… the main challenge of computer science is how not to get lost in the complexities of their own making.” ECE610 - Fall 2013

3. Defining Performance • Which airplane has the best performance? ECE610 - Fall 2013

4. Response Time and Throughput • Response time • How long it takes to do a task • Throughput • Total work done per unit time • e.g., tasks/transactions/… per hour • How are response time and throughput affected by • Replacing the processor with a faster version? • Adding more processors? • We’ll focus on response time for now… ECE610 - Fall 2013

5. Relative Performance • Define Performance = 1/Execution Time • “X is n times faster than Y” means Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n • Example: time taken to run a program • 10s on A, 15s on B • Execution Time_B / Execution Time_A = 15s / 10s = 1.5 • So A is 1.5 times faster than B ECE610 - Fall 2013
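The slide's arithmetic can be checked directly from the definition (a minimal Python sketch; the 10s/15s figures come from the slide):

```python
# Performance = 1 / Execution Time (slide 5).
time_a = 10.0  # seconds to run the program on A
time_b = 15.0  # seconds on B

perf_a = 1.0 / time_a
perf_b = 1.0 / time_b

# "A is n times faster than B": n = perf_a / perf_b = time_b / time_a
n = perf_a / perf_b
print(n)  # 1.5
```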

6. Measuring Execution Time • Elapsed time • Total response time, including all aspects • Processing, I/O, OS overhead, idle time • Determines system performance • CPU time • Time spent processing a given job • Discounts I/O time, other jobs’ shares • Comprises user CPU time and system CPU time • Different programs are affected differently by CPU and system performance ECE610 - Fall 2013

7. CPU Clocking • Operation of digital hardware governed by a constant-rate clock (Figure: clock waveform — the clock period spans one cycle; data transfer and computation happen within the cycle, with a state update at its edge) • Clock period: duration of a clock cycle • e.g., 250ps = 0.25ns = 250×10⁻¹²s • Clock frequency (rate): cycles per second • e.g., 4.0GHz = 4000MHz = 4.0×10⁹Hz ECE610 - Fall 2013
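Period and rate are reciprocals, so the slide's two examples describe the same clock (a quick Python check):

```python
# Clock period and clock rate are reciprocals (slide 7).
period_s = 250e-12        # 250 ps
rate_hz = 1.0 / period_s  # cycles per second
print(rate_hz / 1e9)      # 4.0 (GHz)
```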

8. CPU Time • CPU Time = Clock Cycles × Clock Cycle Time = Clock Cycles / Clock Rate • Performance improved by • Reducing number of clock cycles • Increasing clock rate • Hardware designer must often trade off clock rate against cycle count ECE610 - Fall 2013

9. CPU Time Example • Computer A: 2GHz clock, 10s CPU time • Designing Computer B • Aim for 6s CPU time • Can do faster clock, but causes 1.2 × clock cycles • How fast must Computer B clock be? ECE610 - Fall 2013
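The worked answer is omitted from the transcript; it can be sketched from CPU Time = Clock Cycles / Clock Rate (Python, using only the numbers on the slide):

```python
# Computer A: 2 GHz clock, 10 s CPU time. Computer B must finish in
# 6 s but its faster clock costs 1.2x the clock cycles (slide 9).
rate_a = 2e9                  # Hz
time_a = 10.0                 # s
cycles_a = time_a * rate_a    # cycles executed by A: 20e9
cycles_b = 1.2 * cycles_a     # 24e9 cycles on B
rate_b = cycles_b / 6.0       # required clock rate for B
print(rate_b / 1e9)           # 4.0 (GHz)
```

So Computer B needs a 4 GHz clock, twice A's rate, to run the program in 6 s.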

10. Levels of Program Code • High-level language • Level of abstraction closer to problem domain • Provides for productivity and portability • Assembly language • Textual representation of instructions • Hardware representation • Binary digits (bits) • Encoded instructions and data ECE610 - Fall 2013

11. Instruction Count and CPI • Instruction Count for a program • Determined by program, ISA and compiler • Average cycles per instruction • Determined by CPU hardware • If different instructions have different CPI • Average CPI affected by instruction mix ECE610 - Fall 2013

12. CPI Example • Computer A: Cycle Time = 250ps, CPI = 2.0 • Computer B: Cycle Time = 500ps, CPI = 1.2 • Same ISA • Which is faster, and by how much? • CPU time per instruction: A = 2.0 × 250ps = 500ps; B = 1.2 × 500ps = 600ps • A is faster, by 600ps / 500ps = 1.2 times ECE610 - Fall 2013
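The comparison reduces to time per instruction = CPI × cycle time, which a two-line Python check confirms:

```python
# Time per instruction = CPI x cycle time (slide 12).
t_a = 2.0 * 250  # ps per instruction on A
t_b = 1.2 * 500  # ps per instruction on B
print(t_a, t_b, t_b / t_a)  # 500.0 600.0 1.2 -> A is 1.2x faster
```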

13. CPI in More Detail • If different instruction classes take different numbers of cycles • Weighted average CPI: CPI = Clock Cycles / Instruction Count = Σᵢ (CPIᵢ × ICᵢ) / Instruction Count, where ICᵢ / Instruction Count is the relative frequency of class i ECE610 - Fall 2013

14. CPI Example • Alternative compiled code sequences using instructions in classes A, B, C • Sequence 1: IC = 5 • Clock Cycles= 2×1 + 1×2 + 2×3= 10 • Avg. CPI = 10/5 = 2.0 • Sequence 2: IC = 6 • Clock Cycles= 4×1 + 1×2 + 1×3= 9 • Avg. CPI = 9/6 = 1.5 ECE610 - Fall 2013
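The two sequences above can be reproduced with a small weighted-CPI helper (Python sketch; class CPIs of 1, 2, 3 for A, B, C are taken from the slide's arithmetic):

```python
# Weighted-average CPI for the two compiled sequences (slide 14).
cpi = {"A": 1, "B": 2, "C": 3}  # cycles for instruction classes A, B, C

def cycles_and_cpi(counts):
    """Return (clock cycles, instruction count, average CPI)."""
    clock_cycles = sum(cpi[cls] * n for cls, n in counts.items())
    ic = sum(counts.values())
    return clock_cycles, ic, clock_cycles / ic

print(cycles_and_cpi({"A": 2, "B": 1, "C": 2}))  # (10, 5, 2.0)
print(cycles_and_cpi({"A": 4, "B": 1, "C": 1}))  # (9, 6, 1.5)
```

Note that Sequence 2 executes more instructions yet fewer cycles: instruction count alone does not determine performance.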

15. Performance Summary The BIG Picture • Performance depends on • Algorithm: affects IC, possibly CPI • Programming language: affects IC, CPI • Compiler: affects IC, CPI • Instruction set architecture: affects IC, CPI, Tc ECE610 - Fall 2013

16. Power Trends • In CMOS IC technology, dynamic power ∝ capacitive load × voltage² × frequency (source: intel.com) (Figure: over the period shown, clock rate grew roughly ×1000 and supply voltage dropped 5V → 1V, yet power still grew about ×30) ECE610 - Fall 2013

17. Reducing Power • Suppose a new CPU has • 85% of capacitive load of old CPU • 15% voltage and 15% frequency reduction • The power wall • We can’t reduce voltage further • We can’t remove more heat • How else can we improve performance? ECE610 - Fall 2013
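Using the CMOS dynamic-power relation (power ∝ capacitive load × voltage² × frequency), the slide's reductions multiply out as follows (Python sketch):

```python
# Dynamic power scales as capacitive load x voltage^2 x frequency.
# New CPU: 85% of the load, 85% of the voltage, 85% of the frequency.
ratio = 0.85 * 0.85**2 * 0.85   # = 0.85^4
print(round(ratio, 3))           # 0.522 -> roughly half the dynamic power
```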

18. Uniprocessor Performance Constrained by power, instruction-level parallelism, memory latency ECE610 - Fall 2013

19. Multiprocessors • Multicore microprocessors • More than one processor per chip • Requires explicitly parallel programming • Compare with instruction level parallelism • Hardware executes multiple instructions at once • Hidden from the programmer • Hard to do • Programming for performance • Load balancing • Optimizing communication and synchronization (source: Intel Inc. via Embedded.com) ECE610 - Fall 2013

20. Manufacturing ICs • Yield: proportion of working dies per wafer ECE610 - Fall 2013

21. AMD Opteron X2 Wafer • X2: 300mm wafer, 117 chips, 90nm technology • X4: 45nm technology ECE610 - Fall 2013

22. Integrated Circuit Cost • Nonlinear relation to area and defect rate • Wafer cost and area are fixed • Defect rate determined by manufacturing process • Die area determined by architecture and circuit design ECE610 - Fall 2013
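The nonlinear cost relation can be sketched with a textbook-style cost and yield model; the functional forms below are standard approximations and any numbers plugged in are purely illustrative:

```python
def die_yield(defects_per_cm2, die_area_cm2):
    # Common yield approximation: 1 / (1 + defect rate x area / 2)^2.
    # Bigger dies catch more defects, so yield falls with area.
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2 / 2.0) ** 2

def cost_per_die(wafer_cost, wafer_area_cm2, die_area_cm2, defects_per_cm2):
    # Dies per wafer ~ wafer area / die area (ignoring edge loss).
    dies_per_wafer = wafer_area_cm2 / die_area_cm2
    return wafer_cost / (dies_per_wafer * die_yield(defects_per_cm2, die_area_cm2))
```

Doubling die area halves dies per wafer and also lowers yield, so cost per die more than doubles — the nonlinearity the slide refers to.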

23. Example ECE610 - Fall 2013

24. SPEC CPU Benchmark • Programs used to measure performance • Supposedly typical of actual workload • Standard Performance Evaluation Corp (SPEC) • Develops benchmarks for CPU, I/O, Web, … • SPEC CPU2006 • Elapsed time to execute a selection of programs • Negligible I/O, so focuses on CPU performance • Normalize relative to reference machine • Summarize as geometric mean of performance ratios • CINT2006 (integer) and CFP2006 (floating-point) ECE610 - Fall 2013

25. CINT2006 for Opteron X4 2356 High cache miss rates ECE610 - Fall 2013

26. Processor design ECE610 - Fall 2013

27. Instruction Execution • PC → instruction memory, fetch instruction • Register numbers → register file, read registers • Depending on instruction class • Use ALU to calculate • Arithmetic result • Memory address for load/store • Branch target address • Access data memory for load/store • PC ← target address or PC + 4 ECE610 - Fall 2013

28. MIPS Instruction Set Microprocessor without Interlocked Pipeline Stages ECE610 - Fall 2013

29. Introduction • CPU performance factors • Instruction count • Determined by ISA and compiler • CPI and Cycle time • Determined by CPU hardware • We will examine two MIPS implementations • A simplified version • A more realistic pipelined version • Simple subset, shows most aspects • Memory reference: lw, sw • Arithmetic/logical: add, sub, and, or, slt • Control transfer: beq ECE610 - Fall 2013

30. Three Instruction Classes ECE610 - Fall 2013

31. CPU Overview ECE610 - Fall 2013

32. Multiplexers • Can’t just join wires together • Use multiplexers ECE610 - Fall 2013

33. Control ECE610 - Fall 2013

34. Full Datapath ECE610 - Fall 2013

35. Datapath With Control ECE610 - Fall 2013

36. R-Type Instruction ECE610 - Fall 2013

37. Load Instruction ECE610 - Fall 2013

38. Branch-on-Equal Insn. ECE610 - Fall 2013

39. Performance Issues • Longest delay determines clock period • Critical path: load instruction • Instruction memory → register file → ALU → data memory → register file • Not feasible to vary period for different instructions • Violates design principle • Making the common case fast • We will improve performance by pipelining ECE610 - Fall 2013

40. Pipeline Performance • Assume time for stages is • 100ps for register read or write • 200ps for other stages • Compare pipelined datapath with single-cycle datapath ECE610 - Fall 2013

41. Pipeline Performance Single-cycle (Tc= 800ps) Pipelined (Tc= 200ps) ECE610 - Fall 2013
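With the stage times above, a short calculation shows where the pipelined datapath wins (Python sketch for three consecutive lw instructions):

```python
# Single-cycle Tc = 800 ps (longest instruction); pipelined Tc = 200 ps
# (longest stage), 5 stages (slides 40-41).
n = 3                              # three lw instructions
single_cycle = n * 800             # each instruction takes a full 800 ps
pipelined = (5 + (n - 1)) * 200    # fill the pipe, then one finish per cycle
print(single_cycle, pipelined)     # 2400 1400
```

As n grows, the pipelined time per instruction approaches 200 ps, giving a speedup approaching 800/200 = 4 (not 5, because the stages are unbalanced).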

42. MIPS Pipeline • Five stages, one step per stage • IF: Instruction fetch from memory • ID: Instruction decode & register read • EX: Execute operation or calculate address • MEM: Access memory operand • WB: Write result back to register ECE610 - Fall 2013

43. Pipeline Speedup • If all stages are balanced • i.e., all take the same time • Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages • If not balanced, speedup is less • Speedup due to increased throughput • Latency (time for each instruction) does not decrease ECE610 - Fall 2013

44. Hazards • Situations that prevent starting the next instruction in the next cycle • Structure hazards • A required resource is busy • Data hazard • Need to wait for previous instruction to complete its data read/write • Control hazard • Deciding on control action depends on previous instruction ECE610 - Fall 2013

45. Data Hazards • An instruction depends on completion of data access by a previous instruction • add $s0, $t0, $t1 followed by sub $t2, $s0, $t3: the sub must wait for $s0 ECE610 - Fall 2013

46. Forwarding (aka Bypassing) • Use result when it is computed • Don’t wait for it to be stored in a register • Requires extra connections in the datapath ECE610 - Fall 2013

47. Load-Use Data Hazard • Can’t always avoid stalls by forwarding • If value not computed when needed • Can’t forward backward in time! ECE610 - Fall 2013

48. Code Scheduling to Avoid Stalls • Reorder code to avoid use of load result in the next instruction • C code for A = B + E; C = B + F;

Original (13 cycles, two stalls):
lw  $t1, 0($t0)
lw  $t2, 4($t0)
add $t3, $t1, $t2   ← stall (waits on lw $t2)
sw  $t3, 12($t0)
lw  $t4, 8($t0)
add $t5, $t1, $t4   ← stall (waits on lw $t4)
sw  $t5, 16($t0)

Reordered (11 cycles, no stalls):
lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)

ECE610 - Fall 2013
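The 13- vs 11-cycle counts can be reproduced with a toy cycle counter for the 5-stage pipeline. This is a sketch under two assumptions: full forwarding (so only a load feeding the very next instruction stalls one cycle), and a made-up instruction encoding of (op, destination, source registers...):

```python
# Toy stall counter: with forwarding, only a load-use pair (a lw whose
# result is read by the immediately following instruction) stalls.
def total_cycles(insns):
    stalls = 0
    for prev, cur in zip(insns, insns[1:]):
        if prev[0] == "lw" and prev[1] in cur[2:]:
            stalls += 1
    # 5-stage pipeline: 5 cycles for the first insn, 1 per extra, + stalls
    return 5 + (len(insns) - 1) + stalls

# Encoding: (op, dest, sources...); sw writes memory, so dest is None.
original = [
    ("lw", "$t1", "$t0"), ("lw", "$t2", "$t0"),
    ("add", "$t3", "$t1", "$t2"), ("sw", None, "$t3", "$t0"),
    ("lw", "$t4", "$t0"),
    ("add", "$t5", "$t1", "$t4"), ("sw", None, "$t5", "$t0"),
]
scheduled = [
    ("lw", "$t1", "$t0"), ("lw", "$t2", "$t0"), ("lw", "$t4", "$t0"),
    ("add", "$t3", "$t1", "$t2"), ("sw", None, "$t3", "$t0"),
    ("add", "$t5", "$t1", "$t4"), ("sw", None, "$t5", "$t0"),
]
print(total_cycles(original), total_cycles(scheduled))  # 13 11
```

Moving the third lw up hides both load-use delays behind useful work, which is exactly the compiler scheduling the slide describes.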

49. Control Hazards • Branch determines flow of control • Fetching next instruction depends on branch outcome • Pipeline can’t always fetch correct instruction • Still working on ID stage of branch • In MIPS pipeline • Need to compare registers and compute target early in the pipeline • Add hardware to do it in ID stage ECE610 - Fall 2013

50. Stall on Branch • Wait until branch outcome determined before fetching next instruction ECE610 - Fall 2013