1 / 32

CPU Continued

CPU Continued. Pentium II. Fetch/Decode Unit pulls instructions from the cache in order predicted by BTB. 3 decoders to break instruction into uops (micro instructions). 2 restricted decoders, 1 general decoder If more than four uops, instruction is sent to MIS unit

woods
Download Presentation

CPU Continued

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CPU Continued

  2. Pentium II • Fetch/Decode Unit • pulls instructions from the cache in order predicted by BTB. • 3 decoders to break instruction into uops (micro instructions). • 2 restricted decoders, 1 general decoder • If more than four uops, instruction is sent to MIS unit • can generate 6 uops/clock cycle

  3. ReOrder Buffer (ROB) - instruction pool • circular buffer. • Handles register renaming (40) • Reservation Station • passes uops to available execution units • acts as buffer --> stores up to 20 uops and data

  4. Dispatch/Execute • checks each uop in ROB to make sure it can be processed • searches L1 D-cache then L2 cache • Instead of being idle, execute unit looks at each uop in ROB until it finds one it can execute. - speculative execution.

  5. After uop is processed, execution has to compare it to what the BTB predicted. If BTB has failed, then Jump Execution Unit, moves pointer in ROB appropriately. • BTB updates so that next time it can predict better.

  6. At same time, the Retirement Unit is inspecting the ROB to see if the uop at the head of the ROB has been executed. If so, it checks 2nd and 3rd as well. Finally, it sends all 3 to store buffer. • While in store buffer, results checked before sent to L2 cache.’

  7. Pipelining • decompose sequential process into subprocesses • performed by collection of processing segments, output of one is input to another. • Clock line is used to connect all registers and is pulsed after enough time has passed.

  8. Example - A[I] x B[I] + C[I], I = 1, 2, …7 • suboperations: Segment 1: R1  A[ i ], R2  B[ i ] Input A[ i ] and B[ i ] Segment 2: R3  R1 x R2, R4  C[ i ] Multiply and input C[ i ] Segment 3: R5  R3 + R4 Add C[ i ] to product

  9. After 3rd clock pulse, results start being output. • Potential problem: empty pipeline • P2 and P3 have 12 stage pipelines in the integer ALUs

  10. Branch Prediction • pipelining works best on linear code • programs are not linear code only if (i == 0) CMP i, 0 :compare i to 0 k = 1; BNE else :branch to else if not equal else then: MOV k, 1 :move 1 to k k = 2; BR next :unconditional branch else: MOV k,2 :move 2 to k next: code fragment same code fragment in generic assembly language

  11. fetch occurs before know if instruction is a branch and which branch to take. • Delay slot - instruction after a branch -- NOPs • stall pipeline v.s. prediction • secret registers • rollback • solution is complex

  12. Superscalar Architecture • program steps - sequential in nature • superscalar provides 2 or more execution paths.

  13. Pentium has 2 five-stage pipelines (u and v) • P2 and P3 - single pipeline with multiple Functional • Units

  14. Out of Order Execution • difficult to divide workload equally • dependencies cause trouble - RAW, WAR • Results are held in buffers until all out of order instructions finish and results are put back together

  15. Register Renaming • to avoid register access problems, secret registers are used • these are dynamically allocated and not visible to programmers • so any register reference has to be switched to the secret registers and then the chip has to remember which is used by which

  16. RISC • John Cocke 80/20 rule • Optimize the 20% instructions and then redo the 80% as combinations of the 20%

  17. RISC characteristics • single-cycle or better execution of instructions • Uniformity of instructions (same length) • Lack of Microcode - hardwired logic • software does work of optimizing • Design Simplicity

  18. Micro-Ops and CISC/RISC computers • CISC executed on RISC • uops or AMDs R-ops

  19. Single Instruction - Multiple Data • single microprocessor, multiple data bytes • 64-bit in MMX • Streaming SIMD (SSE) • augmented with 70 new instructions • Katmai New Instructions

  20. Operating Modes • real mode • protected mode • most commonly used • flat memory model, demand paging, no addressing limits • virtual 8086 mode

  21. Microprocessor Competition • Moore’s Law -- “Microprocessor power double’s every 18 months” • micron sized parts (expect miniaturization wall in 2017) • other methods • To keep getting better, either switch media or improve processing power through techniques such as parallel processing.

  22. AMD • Advanced Micro Devices, 1969 • cross-licensing agreement with Intel • With 386, Intel separated from partnership • AMD built own 386 • Intel sued, AMD won

  23. AMD chips • K5 (586) • RISC core • socket compatible with Pentium • six stage pipeline • matches P133

  24. K6, K6-2 (with 3D Now!) • fits Intel’s Socket 7 • core built by NexGen • 2 six stage pibpelines • RISC • no 32-bit instructions, but no slowdown • P-rating --> rate speed: actual speed

  25. K6-3 (Sharptooth) • 64KB L1, 256 KB L2, 1MB L3 -- Tri-Level Cache • seven state pipelines (2) • performance equivalent to P3 at one speed higher (450:400 in P-rating) • 21.3 x 106 transistors

  26. K7 • nine instruction superscalar design • 3 parallel decoders • 3 integer pipelines, 3 multimedia • 128K L1, 64 bit back-side bus • 512K - 8 M L2 cache

  27. Cyrix Corporation • MII • MMX pipeline -- 10 stages • Jalapeno -- 3D graphics, new FPU.

  28. IDT/Centaur • WinChip C6 • Intel compatible • owned by Integrated Device Technology

  29. Intel Itanium • 64 - bit RISC • dual mode - IA 32 and IA 64 • instructions processed in bundles • compilers break instructions apart and arrange scheduling (not CPU)

  30. branching done using predication • predicate registers help with conditions for instructions

More Related