Unit 2

darva
Presentation Transcript


  1. Unit 2

  2. Organization: Multicore Chips • Single-core: one CPU with its registers and L1 cache, backed by an L2 cache and main memory • Dual-core: two CPUs, each with its own registers and L1 cache, sharing one L2 cache and main memory

  3. Performance Balance • Processor speed increased • Memory capacity increased • Memory speed lags behind processor speed

  4. Logic and Memory Performance Gap

  5. Solutions • Increase number of bits retrieved at one time • Make DRAM “wider” rather than “deeper” • Change DRAM interface • Cache • Reduce frequency of memory access • More complex cache and cache on chip • Increase interconnection bandwidth • High speed buses • Hierarchy of buses

  6. I/O Devices • Peripherals with intensive I/O demands • Large data throughput demands • Processors can handle this • The problem is moving data • Solutions: • Caching • Buffering • Higher-speed interconnection buses • More elaborate bus structures • Multiple-processor configurations

  7. Key is Balance • Processor components • Main memory • I/O devices • Interconnection structures

  8. Improvements in Chip Organization and Architecture • Increase hardware speed of processor • Fundamentally due to shrinking logic gate size • More gates, packed more tightly, increasing clock rate • Propagation time for signals reduced • Increase size and speed of caches • Dedicating part of processor chip • Cache access times drop significantly • Change processor organization and architecture • Increase effective speed of execution • Parallelism

  9. Problems with Clock Speed and Logic Density • Power • Power density increases with density of logic and clock speed • Dissipating heat • RC delay • Speed at which electrons flow limited by resistance and capacitance of metal wires connecting them • Delay increases as RC product increases • Wire interconnects thinner, increasing resistance • Wires closer together, increasing capacitance • Memory latency • Memory speeds lag processor speeds • Solution: • More emphasis on organizational and architectural approaches

  10. Increased Cache Capacity • Typically two or three levels of cache between processor and main memory • Chip density increased • More cache memory on chip • Faster cache access • Pentium chip devoted about 10% of chip area to cache • Pentium 4 devotes about 50%

  11. Performance Assessment: Clock Speed • Key parameters • Performance, cost, size, security, reliability, power consumption • System clock speed • Measured in Hz or multiples thereof • Clock rate, clock cycle, clock tick, cycle time • Signals in the CPU take time to settle to 1 or 0 • Signals may change at different speeds • Operations need to be synchronised • Instruction execution in discrete steps • Fetch, decode, load and store, arithmetic or logical • Usually requires multiple clock cycles per instruction • Pipelining gives simultaneous execution of instructions • So, clock speed is not the whole story
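
The relation between clock rate and cycle time on this slide can be sketched in a few lines. This is a minimal illustration only: the function names and the 4-cycle instruction are assumptions, not taken from the slides, and the sketch ignores pipelining.

```python
def cycle_time_seconds(clock_rate_hz: float) -> float:
    """Cycle time is the reciprocal of the clock rate."""
    return 1.0 / clock_rate_hz


def instruction_time_seconds(cycles: int, clock_rate_hz: float) -> float:
    """Time for one instruction that needs `cycles` clock cycles (no pipelining)."""
    return cycles * cycle_time_seconds(clock_rate_hz)


if __name__ == "__main__":
    rate_hz = 1e9  # 1 GHz clock
    print(f"cycle time: {cycle_time_seconds(rate_hz) * 1e9:.2f} ns")
    # Hypothetical instruction taking 4 clock cycles
    print(f"4-cycle instruction: {instruction_time_seconds(4, rate_hz) * 1e9:.2f} ns")
```

Doubling the clock rate halves the cycle time, but the instruction time also depends on how many cycles the instruction needs, which is why clock speed alone is not the whole story.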

  12. System Clock

  13. Instruction Execution Rate • Millions of instructions per second (MIPS) • Millions of floating-point operations per second (MFLOPS) • Heavily dependent on instruction set, compiler design, processor implementation, cache & memory hierarchy

  14. Performance • A measure of how fast something works • Boeing 747: DC to Paris in 6.5 hours, 610 mph, 470 passengers, 286,700 PMPH • Concorde: DC to Paris in 3 hours, 1,350 mph, 132 passengers, 178,200 PMPH • PMPH = person miles per hour (speed * passengers) • Latency (response time): time to run the task (travel time for each passenger); flight time of Concorde < flight time of Boeing 747 • Throughput (bandwidth): tasks run per unit time (person miles per hour); throughput of Boeing 747 > throughput of Concorde
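
The plane comparison can be reproduced with a short sketch. The figures come from the slide; the `Plane` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Plane:
    name: str
    flight_hours: float  # DC to Paris travel time (latency)
    speed_mph: int
    passengers: int

    def pmph(self) -> int:
        """Person miles per hour = speed * passengers (throughput)."""
        return self.speed_mph * self.passengers


planes = [
    Plane("Boeing 747", 6.5, 610, 470),
    Plane("Concorde", 3.0, 1350, 132),
]

# Lowest latency: the plane with the shortest flight time.
lowest_latency = min(planes, key=lambda p: p.flight_hours)
# Highest throughput: the plane with the most person miles per hour.
highest_throughput = max(planes, key=lambda p: p.pmph())

if __name__ == "__main__":
    for p in planes:
        print(f"{p.name}: {p.flight_hours} h, {p.pmph():,} PMPH")
    print("lowest latency:", lowest_latency.name)
    print("highest throughput:", highest_throughput.name)
```

The two metrics pick different winners, which is the point of the slide: Concorde has the lower latency, the Boeing 747 the higher throughput.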

  15. 1. Latency & Throughput • 1. How long does it take for my job to run? Latency • 2. How many jobs can the machine run at once? Throughput • 3. What is the average execution rate? Throughput • 4. How long does it take to execute a job? Latency • 5. How much work is getting done? Throughput • 6. How long must I wait for the database query? Latency • Our concern: latency (response time), i.e. “execution time”

  16. Execution Time • Our focus: user CPU time • (CPU) Execution Time = IC * CPI * cycle time • CPU time doesn’t count I/O or time spent running other programs • System CPU time: time spent in the operating system • User CPU time: time spent in the program

  17. 2. CPU Execution Time • CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle) • A program is running on a RISC machine with the following characteristics: • 40,000,000 instructions • 6 cycles/instruction • 1 GHz clock rate • What is the CPU execution time for this program? • CPU Exec. Time = IC * CPI * Clock cycle time = 40,000,000 * 6 * (1 / 10^9) s = 0.24 seconds
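
The worked example on this slide can be checked with a small sketch. The function name `cpu_time_seconds` is an assumption for illustration; the formula and the numbers are the slide's.

```python
def cpu_time_seconds(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = IC * CPI * clock cycle time, with cycle time = 1 / clock rate."""
    return instruction_count * cpi / clock_rate_hz


if __name__ == "__main__":
    # Slide values: 40,000,000 instructions, 6 cycles/instruction, 1 GHz clock
    t = cpu_time_seconds(40_000_000, 6, 1e9)
    print(f"CPU execution time: {t:.2f} s")  # prints "CPU execution time: 0.24 s"
```

Dividing by the clock rate rather than multiplying by a separately computed cycle time keeps the arithmetic to a single rounding step.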

  18. Ex: Performance • A program is running on a RISC machine with the following characteristics: • 20,000,000 instructions • 5 cycles/instruction • 1 GHz clock rate • Using the same program with a new compiler: • 5,000,000 instructions • 2 cycles/instruction • 1 GHz clock rate • What is the speedup with the changes? • Old execution time = 20,000,000 * 5 * (1 / 10^9) s = 0.1 s • New execution time = 5,000,000 * 2 * (1 / 10^9) s = 0.01 s • Speedup = old execution time / new execution time = 0.1 / 0.01 = 10 (times faster after the change)
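
A hedged sketch of the same speedup calculation follows. The `Run` tuple and the `speedup` helper are hypothetical names introduced here; the input values are the slide's.

```python
from typing import NamedTuple


class Run(NamedTuple):
    instruction_count: int
    cpi: float
    clock_rate_hz: float

    def seconds(self) -> float:
        # Execution time = IC * CPI / clock rate
        return self.instruction_count * self.cpi / self.clock_rate_hz


def speedup(old: Run, new: Run) -> float:
    """Speedup = old execution time / new execution time."""
    return old.seconds() / new.seconds()


if __name__ == "__main__":
    old = Run(20_000_000, 5, 1e9)  # original compiler
    new = Run(5_000_000, 2, 1e9)   # new compiler
    print(f"old: {old.seconds():.2f} s, new: {new.seconds():.2f} s")
    print(f"speedup: {speedup(old, new):.1f}x")
```

Because the clock rate is the same in both runs, the speedup here comes entirely from the lower instruction count and lower CPI.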

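  19. Aspects of CPU Performance • CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle) • What affects each factor (instruction count, CPI, clock rate): • Program: instruction count • Compiler: instruction count, CPI • Instruction set: instruction count, CPI • Organization: CPI, clock rate • Technology: clock rate • Caching, pipelining, parallelism, …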

  20. Benchmark Programs • To evaluate two computer systems, a user can simply compare the execution time of the workload on the two computers, or • The user can rely on other methods that measure the performance of a candidate computer • This second alternative usually means evaluating the computer with a set of benchmarks: programs specifically chosen to measure performance • Small benchmarks • Nice for architects and designers, easy to standardize • The motivation is to tune the system to the benchmark to achieve peak performance

  21. SPEC • SPEC (Standard Performance Evaluation Corporation): benchmarks for the performance of a computer’s processor, memory architecture, compiler, client/server systems, etc. • Refer to the whole Benchmarks topic, including SPEC, in Section 2.5 (Performance Assessment) of William Stallings (8th Ed.)

  22. Ex: CPI and Instruction FREQi • Compute the average (effective) CPI for the following: • (Sol) Average (Effective) CPI = 3*0.4 + 4*0.4 + 2*0.2 = 3.2
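
The weighted sum in the solution can be sketched as below. The (CPI, frequency) pairs are the ones that appear in the slide's solution; the function name and the frequency check are assumptions added for illustration.

```python
def effective_cpi(mix: list[tuple[float, float]]) -> float:
    """Average CPI = sum over instruction classes of CPI_i * FREQ_i.

    `mix` holds (cpi, frequency) pairs; frequencies must sum to 1.
    """
    total_freq = sum(freq for _, freq in mix)
    if abs(total_freq - 1.0) > 1e-9:
        raise ValueError(f"frequencies sum to {total_freq}, expected 1.0")
    return sum(cpi * freq for cpi, freq in mix)


if __name__ == "__main__":
    # (CPI, frequency) pairs from the slide's solution
    mix = [(3, 0.4), (4, 0.4), (2, 0.2)]
    print(f"effective CPI: {effective_cpi(mix):.1f}")  # prints "effective CPI: 3.2"
```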

  23. Ex: Average CPI and Average MIPS • Compute the average (effective) CPI for the following: • (Sol) Average (Effective) CPI = 3*0.4 + 4*0.4 + 2*0.2 = 3.2 • If the processor is a Pentium II (320 MHz), what is the MIPS rate?
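
The slide leaves the MIPS question open. One way to answer it, assuming the usual definition MIPS rate = clock rate / (CPI * 10^6), is sketched below. The function name is hypothetical, and the result is only as good as that assumed definition.

```python
def mips_rate(clock_rate_hz: float, cpi: float) -> float:
    """MIPS rate = clock rate / (CPI * 10^6), assuming the usual definition."""
    return clock_rate_hz / (cpi * 1e6)


if __name__ == "__main__":
    # Slide values: 320 MHz clock, effective CPI of 3.2
    print(f"MIPS rate: {mips_rate(320e6, 3.2):.1f}")  # prints "MIPS rate: 100.0"
```

Under that definition, a 320 MHz processor with an effective CPI of 3.2 executes about 100 million instructions per second.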

  24. Practice Example A benchmark program is run on a 40 MHz processor. The executed program consists of 100,000 instruction executions, with the following instruction mix and clock cycle count: Determine the effective CPI, MIPS rate, and execution time for this program.

  25. Practice Example #2 Consider two different machines, with two different instruction sets, both of which have a clock rate of 200 MHz. The following measurements are recorded on the two machines running a given set of benchmark programs: a. Determine the effective CPI, MIPS rate, and execution time for each machine. b. Comment on the results

  26. Practice Example #3 Early examples of CISC and RISC design are the VAX 11/780 and the IBM RS/6000, respectively. Using a typical benchmark program, the following machine characteristics result: The final column shows that the VAX required 12 times longer than the IBM measured in CPU time. a. What is the relative size of the instruction count of the machine code for this benchmark program running on the two machines? b. What are the CPI values for the two machines?
