1 / 74

Intel Pentium M

Intel Pentium M. Outline. History P6 Pipeline in detail New features Improved Branch Prediction Micro-ops fusion Speed Step technology Thermal Throttle 2 Power and Performance. Quick Review of x86. 8080 - 8-bit

Download Presentation

Intel Pentium M

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.


Presentation Transcript

  1. Intel Pentium M

  2. Outline • History • P6 Pipeline in detail • New features • Improved Branch Prediction • Micro-ops fusion • Speed Step technology • Thermal Throttle 2 • Power and Performance

  3. Quick Review of x86 • 8080 - 8-bit • 8086/8088 - 16-bit (8088 had 8-bit external data bus) - segmented memory model • 286 - introduction of protected mode, which included: segment limit checking, privilege levels, read- and exe-only segment options • 386 - 32-bit - segmented and flat memory model - paging • 486 - first pipeline - expanded the 386's ID and EX units into five-stage pipeline - first to include on-chip cache - integrated x87 FPU (before it was a coprocessor) • Pentium (586) - first superscalar - included two pipelines, u and v - virtual-8086 mode - MMX soon after • Pentium Pro (686 or P6) - three-way superscalar - dynamic execution - out-of-order execution, branch prediction, speculative execution - very successful micro-architecture • Pentium 2 and 3 - both P6 • Pentium 4 - new NetBurst architecture • Pentium M - enhanced P6

  4. Pentium Pro Roots • NexGen 586 (1994) • Decomposes IA32 instructions into simplerRISC-like operations (R-ops or micro-ops) • Decoupled Approach • NexGen bought by AMD • AMD K5 (1995) – also used micro-ops • Intel Pentium Pro • Intel’s first use of decoupled architecture

  5. Pentium-M Overview • Introduced March 12, 2003 • Initially called Banias • Created by Israeli team • Missed deadline by less than 5 days • Marketed with Intel’s Centrino Initiative • Based on P6 microarchitechture

  6. P6 Pipeline in a Nutshell • Divided into three clusters (front, middle, back) • In-order Front-End • Out-of-order Execution Core • Retirement • Each cluster is independent • I.e. if a mispredicted branch is detected in the front-end, the front-end will flush and retch from the corrected branch target, all while the execution core continues working on previous instructions

  7. P6 Pipeline in a Nutshell

  8. P6 Front-End • Major units: IFU, ID, RAT, Allocator, BTB, BAC • Fetching (IFU) • Includes I-cache, I-streaming cache, ITLB, ILD • No pre-decoding • Boundary markings by instruction-length decoder (ILD) • Branch Prediction • Predicted (speculative) instructions are marked • Decoding (ID) • Conversion of instructions (macro-ops) into micro-ops • Allocation of Buffer Entries: RS, ROB, MOB

  9. P6 Execution Core • Reservation Station (RS) • Waiting micro-ops ready to go • Scheduler • Out-of-order Execution of micro-ops • Independent execution units (EU) • Must be careful about out-of-order memory access • Memory ordering buffer (MOB) interfaces to the memory subsystem • Requirements for execution • Available operands, EU, and write-back bus • Optimal performance

  10. P6 Retirement • In-order updating of architected machine state • Re-order buffer (ROB) • Micro-op retirement – “all or none” • Architecturally illegal to retire only partof an IA-32 instruction • In-ordering handling of exceptions • Legal to handle mid-execution, but illegalto handle mid-retirement

  11. PM Changes to P6 • Most changes made in P6 front-end • Added and expanded on P4 branch predictor • Micro-ops fusion • Addition of dedicated stack engine • Pipeline length • Longer than P3, shorter than P4 • Accommodates extra features above

  12. PM Changes to P6, cont. • Intel has not released the exact length of the pipeline. • Known to be somewhere between the P4 (20 stage)and the P3 (10 stage). Rumored to be 12 stages. • Trades off slightly lower clock frequencies (than P4) for better performance per clock, less branch prediction penalties, …

  13. Blue Man Group Commercial Break

  14. Banias • 1st version • 77 million transistors, 23 million more than P4 • 1 MB on die Level 2 cache • 400 MHz FSB (quad pumped 100 MHZ) • 130 nm process • Frequencies between 1.3 – 1.7 GHz • Thermal Design Point of 24.5 watts http://www.intel.com/pressroom/archive/photos/centrino.htm

  15. Dothan • Launched May 10, 2004 • 140 million transistors • 2 MB Level 2 cache • 400 or 533 MHz FSB • Frequencies between 1.0 to 2.26 GHz • Thermal Design Point of 21(400 MHz FSB) to 27 watts http://www.intel.com/pressroom/archive/photos/centrino.htm

  16. Dothan cont. • 90 nm process technology on 300 mm wafer. • Provide twice the capacity of the 200 mm while the process dimensions double the transistor density • Gate dimensions are 50nm or approx half the diameter if the influenza virus • P and n gate voltages are reduced by enhancing the carrier mobility of the Si lattice by 10-20% • Draws less than 1 W average power

  17. Bus • Utilizes a split transaction deferred reply protocol • 64-bit width • Delivers up to 3.2 Gbps (Banis) or 4.2 Gbps (Dothan) in and out of the processor • Utilizes source synchronous transfer of addresses and data • Data transferred 4 times per bus clock • Addresses can be delivered times per bus clock

  18. Bus update in Dothan • http://www.intel.com/technology/itj/2005/volume09issue01/art05_perf_power

  19. L1 Cache • 64KB total • 32 K instruction • 32 K data (4 times P4M) • Write-back vs. write-through on P4 • In write-through cache, data is written to both L1 and main memory simultaneously • In write-back cache, data can be loaded without writing to main memory, increasing speed by reducing the number of slow memory writes

  20. L2 cache • 1 – 2 MB • 8-way set associative • Each set is divided into 4 separate power quadrants. • Each individual power quadrant can be set to a sleep mode, shutting off power to those quadrants • Allows for only 1/32 of cache to be powered at any time • Increased latency vs. improved power consumption

  21. Prefetch • Prefetch logic fetches data to the level 2 cache before L1 cache requests occur • Reduces compulsory misses due to an increase of valid data in cache • Reduces bus cycle penalties

  22. Schedule • P6 Pipeline in detail • Front-End • Execution Core • Back-End • Power Issues • Intel SpeedStep • Testing the Features • x86 system registers • Performance Testing

  23. P6 Front-end: Instruction Fetching • IA-32 Memory Management • Classic segmented model (cannot be disabled in protected mode) • Separation of code, data, and stack into "segments“ • Optional paging • Segments divided into pages (typically 4KB) • Additional protection to segment-protection • I.e. provides read-write protection on a page-by-page basis • Stage 11 (stage 1) - Selection of address for next I-cache access • Speculation – address chosen from competing sources (i.e. BTB, BAC, loop detector, etc.) • Calculation of linear address from logical (segment selector + offset) • Segment selector – index into a table of segment descriptors, which include base address, size, type, and access right of the segment • Remember: only six segment selectors, so only six usable at a time • 32-bit code nowadays uses flat model, so OS can make do with only a few (typically four) segments • IFU chooses address with highest priority and sends it to stage two

  24. P6 Front-end: Instruction Fetching • Stage 12-13 - Accessing of caches • Accesses instruction caches with address calculated in stage one • Includes standard cache, victim cache, and streaming buffer • With paging, consults ITLB to determine physical page number (tag bits) • Without paging, linear address from stage one becomes physical address • Obtains branch prediction from branch target buffer (BTB) • BTB takes two cycles to complete one access • Instruction boundary (ILD) and BTB markings • Stage 14 - Completion of instruction cache access • Instructions and their marks are sent to instruction buffer or steered to ID

  25. P6 Front-end: Instruction Fetching

  26. P6 Front-end: Instruction Decoding • Stage 15-16 - Decoding of IA32 Instructions • Alignment of instruction bytes • Identification of the ends of up to three instructions • Conversion of instructions into micro-ops • Stage 17 - Branch Decoding • If the ID notices a branch that went unpredicted by the BTB (i.e. if the BTB had never seen the branch before), flushes the in-order pipe, and re-fetches from the branch target • Branch target calculated by BAC • Early catch saves speculative instructions from being sent through the pipeline • Stage 21 - Register Allocation and Renaming • Synonymous with stage 17 (a reminder of independent working units) • Allocator used to allocate required entries in ROB, RS, LB, and SB • Register Alias Table (RAT) consulted • Maps logical sources/destinations to physical entries in the ROB (or sometimes RRF) • Stage 22 – Completion of Front-End • Marked micro-ops are forwarded to RS and ROB, where theyawait execution and retirement, respectively.

  27. P6 Front-end: Instruction Decoding

  28. Register Alias Table Introduction • Provides register renaming of integer and floating-point registers and flags • Maps logical (architected) entries to physical entries usually in the re-order buffer (ROB) • Physical entries are actually allocated by the Allocator • The physical entry pointers become a part of the micro-op’s overall state as it travels through the pipeline

  29. RAT Details • P6 is 3-way super-scalar, so the RAT must be able to rename up to six logical sources per cycle • Any data dependences must be handled • Ex: op1) ADD EAX, EBX, ECX (dest. = EAX) op2) ADD EAX, EAX, EDX op3) ADD EDX, EAX, EDX • Instead of making op2 wait for op1 to retire, the RAT provides data forwarding • Same case for op3, but RAT must make sure that it gets the result from op2 and not op1

  30. RAT Implementation Difficulties • Speculative Renaming • Since speculative micro-ops flow by, the RAT must be able to undo its mappings in the case of a branch misprediction • Partial-width register reads and writes • Consider a partial-width write followed by a larger-width read • Data required by the read is an assimilation of multiple previous writes to the register – to make sure, RAT must stall the pipeline • Retirement Overrides • Common interaction between RAT and ROB • When a micro-op retires, its ROB entry is removed and its result may be latched into an architected destination register • If any active micro-ops source the retired op’s destination, they must not reference the outdated ROB entry • Mismatch stalls • Associated with flag renaming

  31. The Allocator • Works in conjunction with RAT to allocate required entries • In each cycle, assumes three ROB, RS, and LB and two SB entries • Once micro-ops arrive, it determines how many entries are really needed • ROB Allocation • If three entries aren’t available the allocator will stall • RS Allocation • A bitmap is used to determine which entries are free • If the RS is full, pipeline is stalled • RS must make sure valid entries are not overwritten • MOB Allocation • Allocation of LB and SB entries also done by allocator

  32. PM Changes to P6 Front-End • Micro-op fusion • Dedicated Stack Engine • Enhanced branch prediction • Additional stages • Intel’s secret • Most likely required for extra functionality above

  33. Micro-ops Fusion • Fusion of multiple micro-ops into one micro-op • Less contention for buffer entries • Similarity to SIMD data packing • Two examples of fusion from Intel documentation: • IA32 load-and-operate and store instructions • Not known for certain whether these are the only cases of fusion • Possibly inspired by MacroOps used in K7 (Athlon)

  34. Dedicated Stack Engine • Traditional out-of-order implementations update the Stack Pointer Register (ESP) by sending a µop to update the ESP register with every stack related instruction • Pentium M implementation • A delta register (ESPD) is maintained in the front end • A historic ESP (ESPO) is then kept in the out-of-order execution core • Dedicated logic was added to update the ESP by adding the ESPO with the ESPD

  35. Improvements • The ESPO value kept in the out-of-order machine is not changed during a sequence of stack operations, this allows for more parallelism opportunities to be realized • Since ESPD updates are now done by a dedicated adder, the execution unit is now free to work on other µops and the ALU’s are freed to work on more complex operations • Decreased power consumption since large adders are not used for small operations and the eliminated µops do not toggle through the machine • Approximately 5% of the µops have been eliminated

  36. Complications • Since the new adder lives in the front end all of its calculations are speculative. This necessitates the addition of recovery table for all values of ESPO and ESPD • If the architectural value of ESP is needed inside of the out-of-order machine the decode logic then needs to insert a µop that will carry out the ESP calculation

  37. Branch Prediction • Longer pipelines mean higher penalties for mispredicted branches • Improvements result in added performance and hence less energy spent per instruction retired

  38. Branch Prediction in Pentium M • Enhanced version of Pentium 4 predictor • Two branch predictors added that run in tandem with P4 predictor: • Loop detector • Indirect branch detector • 20% lower misprediction rate than PIII resulting in up to 7% gain in real performance

  39. Branch Prediction Based on diagram found here: http://www.cpuid.org/reviews/PentiumM/index.php

  40. Loop Detector • A predictor that always branches in a loop will always incorrectly branch on the last iteration • Detector analyzes branches for loop behavior • Benefits a wide variety of program types http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm

  41. Indirect Branch Predictor • Picks targets based on global flow control history • Benefits programs compiled to branch to calculated addresses http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm

  42. Reservation Station • Used as a store for µops to wait for their operands and execution units to become available • Consists of 20 entries • Control portion of the entry can be written to from one of three ports • Data portion can be written to from one of 6 available ports • 3 for ROB • 3 for EU write backs • Scheduler then uses this to schedule up to 5 µops at a time • During pipeline stage 31 entries that are ready for dispatch are then sent to stage 32

  43. Cancellation • Reservation Station assumes that all cache accesses will be hits • In the case of a cache miss micro-ops that are dependant on the write-back data need to be cancelled and rescheduled at a later time • Can also occur due to a future resource conflict

  44. Retirement • Takes 2 clock cycles to complete • Utilizes reorder buffer (ROB) to control retirement or completion of μops • ROB is a multi-ported register file with separate ports for • Allocation time writes of µop fields needed at retirement • Execution Unit write-backs • ROB reads of sources for the Reservation Station • Retirement logic reads of speculative result data • Consists of 40 entries with each entry 157 bits wide • The ROB participates in • Speculative execution • Register renaming • Out-of-order execution

  45. Speculative Execution • Buffers results of the execution unit before commit • Allows maximum rate for fetch and execute by assuming that branch prediction is perfect and no exceptions have occurred • If a misprediction occurs: • Speculative results stored in the ROB are immediately discarded • Microengine will restart by examining the committed state in the ROB

  46. Register Renaming • Entries in the ROB that will hold the results of speculative µops are allocated during stage 21 of the pipeline • In stage 22 the sources for the µops are delivered based upon the allocation in stage 21. • Data is written to the ROB by the Execution Unit into the renamed register during stage 83

  47. Out-of-order Execution • Allows µops to complete and write back their results without concern for other µops executing simultaneously • The ROB reorders the completed µops into the original sequence and updates the architectural state • Entries in ROB are treated as FIFO during retirement • µops are originally allocated in sequential order so the retirement will also follow the original program order • Happens during pipeline stage 92 and 93

  48. Exception Handling • Events are sent to the ROB by the EU during stage 83 • Results sent to the ROB from the Execution Unit are speculative results, therefore any exceptions encountered may not be real • If the ROB determines that branch prediction was incorrect it inserts a clear signal at the point just before the retirement of this operation and then flushes all the speculative operations from the machine • If speculation is correct, the ROB will invoke the correct microcode exception handler • All event records are saved to allow the handler to repair the result or invoke the correct macro handler • Pointers for the macro and micro instructions are also needed to allow the program to resume after completion by the event handler • If the ROB retires an operation that faults, both the in-order and out-of-order sections are cleared. This happens during pipeline stages 93 and 94

  49. Memory Subsystem • Memory Ordering Buffer (MOB) • Execution is out-of-order, but memory accesses cannot just be done in any order • Contains mainly the LB and the SB • Speculative loads and stores • Not all loads can be speculative • I.e. a memory-mapped I/O ld could have unrecoverable side effects • Stores are never speculative (can’t get back overwritten bits) • But to improve performance, stores are queued in the store buffer (SB) to allow pending loads to proceed • Similar to a write-back cache

More Related