Exam2 Review



  1. Exam2 Review Dr. Bernard Chen Ph.D. University of Central Arkansas Spring 2010

  2. Outline • Pipeline • Memory Hierarchy

  3. Parallel processing • A parallel processing system is able to perform concurrent data processing to achieve faster execution time • The system may have two or more ALUs and be able to execute two or more instructions at the same time • The goal is to increase the throughput: the amount of processing that can be accomplished during a given interval of time

  4. Parallel processing classification • Single instruction stream, single data stream – SISD • Single instruction stream, multiple data stream – SIMD • Multiple instruction stream, single data stream – MISD • Multiple instruction stream, multiple data stream – MIMD

  5. 9.2 Pipelining • Instruction execution is divided into k segments, or stages • An instruction exits pipe stage k−1 and proceeds into pipe stage k • All pipe stages take the same amount of time, called one processor cycle • The length of the processor cycle is determined by the slowest pipe stage

  6. SPEEDUP • A k-segment pipeline completes n tasks in (k + n − 1) clock cycles • If we execute the same n tasks sequentially in a single processing unit, it takes k * n clock cycles (more generally n * tn, where tn is the non-pipelined time per task) • The speedup gained by using the pipeline is: S = (n * tn) / ((k + n − 1) * tp), where tp is the pipeline clock period (the delay of the slowest segment)

  7. Example • A non-pipeline system takes 100 ns to process a task • The same task can be processed in a five-segment pipeline at 20 ns per segment • Speedup ratio for 1000 tasks: 100*1000 / ((5 + 1000 − 1) * 20) ≈ 4.98 • However, if the task cannot be evenly divided…

  8. Example • A non-pipeline system takes 100 ns to process a task • The same task can be processed in a six-segment pipeline with segment delays of 20 ns, 25 ns, 30 ns, 10 ns, 15 ns, and 30 ns • Determine the speedup ratio of the pipeline for 10, 100, and 1000 tasks. What is the maximum speedup that can be achieved?

  9. Example Answer • The pipeline clock period is set by the slowest segment, 30 ns • Speedup ratio for 10 tasks: 100*10 / ((6 + 10 − 1) * 30) ≈ 2.22 • Speedup ratio for 100 tasks: 100*100 / ((6 + 100 − 1) * 30) ≈ 3.17 • Speedup ratio for 1000 tasks: 100*1000 / ((6 + 1000 − 1) * 30) ≈ 3.32 • Maximum speedup: 100*N / ((6 + N − 1) * 30) → 100/30 = 10/3 as N grows
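
The speedup formula is easy to check numerically. Below is a minimal Python sketch (the function name and parameter names are mine, not from the slides) that reproduces the slide-9 ratios.

    def pipeline_speedup(n, k, t_n, t_p):
        # Speedup of a k-segment pipeline over a non-pipelined unit.
        # n   -- number of tasks
        # k   -- number of pipeline segments
        # t_n -- time per task without the pipeline (ns)
        # t_p -- pipeline clock period = delay of the slowest segment (ns)
        return (n * t_n) / ((k + n - 1) * t_p)

    # Slide-8 example: 100 ns non-pipelined task, 6 segments, slowest segment 30 ns.
    for n in (10, 100, 1000):
        print(n, round(pipeline_speedup(n, k=6, t_n=100, t_p=30), 2))
    # Prints 2.22, 3.17, 3.32; the limit for large n is 100/30 = 10/3.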

  10. Instruction steps • 1. Fetch the instruction • 2. Decode the instruction • 3. Fetch the operands from memory • 4. Execute the instruction • 5. Store the results in the proper place

  11. 5-Stage Pipelining • S1: Fetch Instruction (FI) • S2: Decode Instruction (DI) • S3: Fetch Operand (FO) • S4: Execute Instruction (EI) • S5: Write Operand (WO) • (Space-time diagram: nine instructions enter the pipeline one clock cycle apart, so instruction i occupies stage Sj during cycle i + j − 1)

  12. Pipeline Hazards • There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated cycle • There are three classes of hazards • Structural hazard • Data hazard • Branch hazard

  13. Data hazard • Example: • ADD R1 ← R2 + R3 • SUB R4 ← R1 − R5 • AND R6 ← R1 AND R7 • OR R8 ← R1 OR R9 • XOR R10 ← R1 XOR R11

  14. Data hazard • FO is the stage where a data value is fetched; WO is the stage where the executed value is stored • The conflict: a following instruction reaches its FO stage before the previous instruction has written its result in WO • (Space-time diagram with stages FI, DI, FO, EI, WO)

  15. Data hazard • The delayed-load approach inserts no-operation instructions to avoid the data conflict (a small sketch of this idea follows below): • ADD R1 ← R2 + R3 • No-op • No-op • SUB R4 ← R1 − R5 • AND R6 ← R1 AND R7 • OR R8 ← R1 OR R9 • XOR R10 ← R1 XOR R11
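
As an illustration of the delayed-load idea (not the slides' exact mechanism), the Python sketch below scans a simple instruction list and inserts no-ops whenever an instruction reads a register written by one of the two immediately preceding instructions. The instruction encoding and the two-slot gap are assumptions made for this example.

    def insert_noops(program, gap=2):
        # Insert NOPs (None) so that no instruction reads a register written
        # within the previous `gap` instructions (a read-after-write hazard).
        # Each instruction is a tuple (dest_register, source_registers).
        out = []
        for dest, srcs in program:
            # Look back at the last `gap` emitted instructions for a conflicting write.
            for back in range(1, gap + 1):
                if len(out) >= back and out[-back] and out[-back][0] in srcs:
                    out.extend([None] * (gap - back + 1))  # pad until the writer is far enough behind
                    break
            out.append((dest, srcs))
        return out

    program = [("R1", ("R2", "R3")),   # ADD R1 <- R2 + R3
               ("R4", ("R1", "R5"))]   # SUB R4 <- R1 - R5 (reads R1)
    print(insert_noops(program))       # two no-ops (None) appear between ADD and SUB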

  16. Data hazard

  17. Data hazard • The hazard can also be solved by a simple hardware technique called forwarding (also called bypassing or short-circuiting) • The insight behind forwarding is that SUB does not really need the result until after ADD has actually produced it • If the forwarding hardware detects that the previous ALU operation has written the register that is a source for the current ALU operation, control logic selects the result at the ALU output instead of the value fetched from memory
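
That selection amounts to a multiplexer in front of the ALU. A hedged Python sketch of the decision logic (the function and its arguments are illustrative, not a model of any particular pipeline):

    def select_operand(src_reg, regfile, prev_dest, prev_alu_result):
        # Forwarding (bypass) mux: if the previous instruction wrote the register
        # we are about to read, take the freshly computed ALU result instead of
        # the stale value from the register file / memory.
        if prev_dest is not None and prev_dest == src_reg:
            return prev_alu_result
        return regfile[src_reg]

    # After ADD R1 <- R2 + R3, a following SUB that reads R1 gets ADD's ALU output:
    print(select_operand("R1", {"R1": 0, "R5": 7}, prev_dest="R1", prev_alu_result=42))  # 42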

  18. Data hazard

  19. Branch hazards • Branch hazards can cause a greater performance loss for pipelines than data hazards do • When a branch instruction is executed, it may or may not change the PC • If a branch changes the PC to its target address, it is a taken branch • Otherwise, it is untaken

  20. Branch hazards • There are FOUR schemes to handle branch hazards • Freeze scheme • Predict-untaken scheme • Predict-taken scheme • Delayed branch

  21. Branch Untaken (Freeze approach) • The simplest method of dealing with branches is to redo the fetch following a branch • (Pipeline timing diagram with stages FI, DI, FO, EI, WO)

  22. Branch Taken (Freeze approach) • The simplest method of dealing with branches is to redo the fetch following a branch • (Pipeline timing diagram with stages FI, DI, FO, EI, WO)

  23. Branch Untaken (Predicted-untaken) • (Pipeline timing diagram with stages FI, DI, FO, EI, WO)

  24. Branch Taken (Predicted-untaken) • (Pipeline timing diagram with stages FI, DI, FO, EI, WO)

  25. Branch Untaken (Predicted-taken) • (Pipeline timing diagram with stages FI, DI, FO, EI, WO)

  26. Branch Taken (Predicted-taken) • (Pipeline timing diagram with stages FI, DI, FO, EI, WO)

  27. Delayed Branch • A fourth scheme, used in some processors, is called delayed branch • It is done at compile time; the compiler modifies the code • The general format is: • branch instruction • delay slot • branch target (if taken)

  28. Delayed Branch • Optimal

  29. Outline • Pipeline • Memory Hierarchy

  30. Memory Hierarchy • The main memory occupies a central position by being able to communicate directly with the CPU and with auxiliary memory devices through an I/O processor • A special very-high-speed memory called cache is used to increase the speed of processing by making current programs and data available to the CPU at a rapid rate

  31. RAM

  32. ROM

  33. Memory Address Map

  34. Cache memory • When the CPU refers to memory and finds the word in cache, it is said to produce a hit • Otherwise, it is a miss • The performance of cache memory is frequently measured in terms of a quantity called hit ratio • Hit ratio = hit / (hit+miss)

  35. Cache memory • The basic characteristic of cache memory is its fast access time • Therefore, very little or no time must be wasted when searching for words in the cache • The transformation of data from main memory to cache memory is referred to as a mapping process; there are three types of mapping: • Associative mapping • Direct mapping • Set-associative mapping

  36. Associative mapping

  37. Direct Mapping

  38. Direct Mapping

  39. Set-Associative Mapping

  40. Average memory access time • Average memory access time = % instructions * (hit time + instruction miss rate * miss penalty) + % data * (hit time + data miss rate * miss penalty)

  41. Average memory access time • Assume 40% of memory accesses are data accesses (the other 60% are instruction fetches) • Let a hit take 1 clock cycle and the miss penalty be 100 clock cycles • Assume the instruction miss rate is 4% and the data miss rate is 12%; what is the average memory access time? • 60% * (1 + 4% * 100) + 40% * (1 + 12% * 100) = 0.6 * 5 + 0.4 * 13 = 8.2 clock cycles
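
Coding the formula makes the arithmetic easy to check; a minimal Python sketch (variable names are mine) that reproduces the 8.2-cycle result:

    def amat(instr_frac, data_frac, hit_time, instr_miss_rate, data_miss_rate, miss_penalty):
        # Average memory access time in clock cycles, weighted over
        # instruction accesses and data accesses.
        return (instr_frac * (hit_time + instr_miss_rate * miss_penalty)
                + data_frac * (hit_time + data_miss_rate * miss_penalty))

    # Slide-41 numbers: 60% instruction accesses, 40% data accesses,
    # 1-cycle hit, 100-cycle miss penalty, 4% / 12% miss rates.
    print(amat(0.6, 0.4, 1, 0.04, 0.12, 100))   # 8.2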

  42. Page Fault

  43. Performance of Demand Paging • Page fault rate p, with 0 ≤ p ≤ 1.0 • If p = 0, there are no page faults • If p = 1, every reference is a fault • Effective Access Time (EAT) = (1 − p) * ma + p * page fault time, where ma is the memory access time
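
The same formula in Python, with illustrative numbers; the 200 ns memory access time and 8 ms page-fault service time are assumed values, not from the slides:

    def effective_access_time(p, memory_access_ns, fault_service_ns):
        # EAT = (1 - p) * memory access time + p * page-fault service time.
        return (1 - p) * memory_access_ns + p * fault_service_ns

    # Assumed: 200 ns memory access, 8 ms = 8,000,000 ns page-fault service time.
    print(effective_access_time(0.001, 200, 8_000_000))   # 8199.8 ns, roughly a 40x slowdown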

  44. 9.4 Page Replacement • What if there is no free frame? • Page replacement: find some page in memory that is not really in use and swap it out • In this case, the same page may be brought into memory several times

  45. 9.4 Page Replacement • Many approaches: • FIFO • Optimal page-replacement algorithm • Least-recently-used (LRU) • Second-chance algorithm • Least-frequently-used (LFU) • Most-frequently-used (MFU)
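
To make the difference between the approaches concrete, here is a minimal Python sketch of FIFO and LRU replacement that counts page faults for a reference string; the reference string and frame count are illustrative assumptions.

    from collections import OrderedDict, deque

    def fifo_faults(refs, frames):
        # Count page faults with first-in-first-out replacement.
        queue, resident, faults = deque(), set(), 0
        for page in refs:
            if page not in resident:
                faults += 1
                if len(resident) == frames:          # evict the page loaded earliest
                    resident.discard(queue.popleft())
                resident.add(page)
                queue.append(page)
        return faults

    def lru_faults(refs, frames):
        # Count page faults with least-recently-used replacement.
        recent, faults = OrderedDict(), 0
        for page in refs:
            if page in recent:
                recent.move_to_end(page)             # mark as most recently used
            else:
                faults += 1
                if len(recent) == frames:            # evict the least recently used page
                    recent.popitem(last=False)
                recent[page] = True
        return faults

    refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2]    # made-up reference string
    print(fifo_faults(refs, 3), lru_faults(refs, 3))  # LRU needs fewer faults here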
