intel pentium m n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
IntelPentiumM_r4.ppt PowerPoint Presentation
Download Presentation
IntelPentiumM_r4.ppt

Loading in 2 Seconds...

play fullscreen
1 / 74

IntelPentiumM_r4.ppt - PowerPoint PPT Presentation


  • 615 Views
  • Uploaded on

Pentium (586) - first superscalar - included two pipelines, u and v - virtual-8086 mode ... Pentium Pro (686 or P6) - three-way superscalar ...

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'IntelPentiumM_r4.ppt' - Sharma


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
outline
Outline
  • History
  • P6 Pipeline in detail
  • New features
    • Improved Branch Prediction
    • Micro-ops fusion
    • Speed Step technology
    • Thermal Throttle 2
  • Power and Performance
quick review of x86
Quick Review of x86
  • 8080 - 8-bit
  • 8086/8088 - 16-bit (8088 had 8-bit external data bus) - segmented memory model
  • 286 - introduction of protected mode, which included: segment limit checking, privilege levels, read- and exe-only segment options
  • 386 - 32-bit - segmented and flat memory model - paging
  • 486 - first pipeline - expanded the 386's ID and EX units into five-stage pipeline - first to include on-chip cache - integrated x87 FPU (before it was a coprocessor)
  • Pentium (586) - first superscalar - included two pipelines, u and v - virtual-8086 mode - MMX soon after
  • Pentium Pro (686 or P6) - three-way superscalar - dynamic execution - out-of-order execution, branch prediction, speculative execution - very successful micro-architecture
  • Pentium 2 and 3 - both P6
  • Pentium 4 - new NetBurst architecture
  • Pentium M - enhanced P6
pentium pro roots
Pentium Pro Roots
  • NexGen 586 (1994)
    • Decomposes IA32 instructions into simplerRISC-like operations (R-ops or micro-ops)
      • Decoupled Approach
  • NexGen bought by AMD
    • AMD K5 (1995) – also used micro-ops
  • Intel Pentium Pro
    • Intel’s first use of decoupled architecture
pentium m overview
Pentium-M Overview
  • Introduced March 12, 2003
  • Initially called Banias
  • Created by Israeli team
  • Missed deadline by less than 5 days
  • Marketed with Intel’s Centrino Initiative
  • Based on P6 microarchitechture
p6 pipeline in a nutshell
P6 Pipeline in a Nutshell
  • Divided into three clusters (front, middle, back)
    • In-order Front-End
    • Out-of-order Execution Core
    • Retirement
  • Each cluster is independent
    • I.e. if a mispredicted branch is detected in the front-end, the front-end will flush and retch from the corrected branch target, all while the execution core continues working on previous instructions
p6 front end
P6 Front-End
  • Major units: IFU, ID, RAT, Allocator, BTB, BAC
  • Fetching (IFU)
    • Includes I-cache, I-streaming cache, ITLB, ILD
    • No pre-decoding
    • Boundary markings by instruction-length decoder (ILD)
  • Branch Prediction
    • Predicted (speculative) instructions are marked
  • Decoding (ID)
    • Conversion of instructions (macro-ops) into micro-ops
  • Allocation of Buffer Entries: RS, ROB, MOB
p6 execution core
P6 Execution Core
  • Reservation Station (RS)
    • Waiting micro-ops ready to go
    • Scheduler
  • Out-of-order Execution of micro-ops
    • Independent execution units (EU)
    • Must be careful about out-of-order memory access
      • Memory ordering buffer (MOB) interfaces to the memory subsystem
  • Requirements for execution
    • Available operands, EU, and write-back bus
    • Optimal performance
p6 retirement
P6 Retirement
  • In-order updating of architected machine state
    • Re-order buffer (ROB)
  • Micro-op retirement – “all or none”
    • Architecturally illegal to retire only partof an IA-32 instruction
  • In-ordering handling of exceptions
    • Legal to handle mid-execution, but illegalto handle mid-retirement
pm changes to p6
PM Changes to P6
  • Most changes made in P6 front-end
  • Added and expanded on P4 branch predictor
  • Micro-ops fusion
  • Addition of dedicated stack engine
  • Pipeline length
    • Longer than P3, shorter than P4
    • Accommodates extra features above
pm changes to p6 cont
PM Changes to P6, cont.
  • Intel has not released the exact length of the pipeline.
  • Known to be somewhere between the P4 (20 stage)and the P3 (10 stage). Rumored to be 12 stages.
  • Trades off slightly lower clock frequencies (than P4) for better performance per clock, less branch prediction penalties, …
banias
Banias
  • 1st version
  • 77 million transistors, 23 million more than P4
  • 1 MB on die Level 2 cache
  • 400 MHz FSB (quad pumped 100 MHZ)
  • 130 nm process
  • Frequencies between 1.3 – 1.7 GHz
  • Thermal Design Point of 24.5 watts

http://www.intel.com/pressroom/archive/photos/centrino.htm

dothan
Dothan
  • Launched May 10, 2004
  • 140 million transistors
  • 2 MB Level 2 cache
  • 400 or 533 MHz FSB
  • Frequencies between 1.0 to 2.26 GHz
  • Thermal Design Point of 21(400 MHz FSB) to 27 watts

http://www.intel.com/pressroom/archive/photos/centrino.htm

dothan cont
Dothan cont.
  • 90 nm process technology on 300 mm wafer.
  • Provide twice the capacity of the 200 mm while the process dimensions double the transistor density
  • Gate dimensions are 50nm or approx half the diameter if the influenza virus
  • P and n gate voltages are reduced by enhancing the carrier mobility of the Si lattice by 10-20%
  • Draws less than 1 W average power
slide18
Bus
  • Utilizes a split transaction deferred reply protocol
  • 64-bit width
  • Delivers up to 3.2 Gbps (Banis) or 4.2 Gbps (Dothan) in and out of the processor
  • Utilizes source synchronous transfer of addresses and data
    • Data transferred 4 times per bus clock
    • Addresses can be delivered times per bus clock
slide19
Bus update in Dothan
  • http://www.intel.com/technology/itj/2005/volume09issue01/art05_perf_power
l1 cache
L1 Cache
  • 64KB total
    • 32 K instruction
    • 32 K data (4 times P4M)
  • Write-back vs. write-through on P4
  • In write-through cache, data is written to both L1 and main memory simultaneously
  • In write-back cache, data can be loaded without writing to main memory, increasing speed by reducing the number of slow memory writes
l2 cache
L2 cache
  • 1 – 2 MB
  • 8-way set associative
  • Each set is divided into 4 separate power quadrants.
  • Each individual power quadrant can be set to a sleep mode, shutting off power to those quadrants
  • Allows for only 1/32 of cache to be powered at any time
  • Increased latency vs. improved power consumption
prefetch
Prefetch
  • Prefetch logic fetches data to the level 2 cache before L1 cache requests occur
  • Reduces compulsory misses due to an increase of valid data in cache
  • Reduces bus cycle penalties
schedule
Schedule
  • P6 Pipeline in detail
    • Front-End
    • Execution Core
    • Back-End
  • Power Issues
    • Intel SpeedStep
  • Testing the Features
    • x86 system registers
    • Performance Testing
p6 front end instruction fetching
P6 Front-end: Instruction Fetching
  • IA-32 Memory Management
    • Classic segmented model (cannot be disabled in protected mode)
      • Separation of code, data, and stack into "segments“
    • Optional paging
      • Segments divided into pages (typically 4KB)
      • Additional protection to segment-protection
        • I.e. provides read-write protection on a page-by-page basis
  • Stage 11 (stage 1) - Selection of address for next I-cache access
    • Speculation – address chosen from competing sources (i.e. BTB, BAC, loop detector, etc.)
    • Calculation of linear address from logical (segment selector + offset)
      • Segment selector – index into a table of segment descriptors, which include base address, size, type, and access right of the segment
      • Remember: only six segment selectors, so only six usable at a time
        • 32-bit code nowadays uses flat model, so OS can make do with only a few (typically four) segments
    • IFU chooses address with highest priority and sends it to stage two
p6 front end instruction fetching1
P6 Front-end: Instruction Fetching
  • Stage 12-13 - Accessing of caches
    • Accesses instruction caches with address calculated in stage one
      • Includes standard cache, victim cache, and streaming buffer
    • With paging, consults ITLB to determine physical page number (tag bits)
      • Without paging, linear address from stage one becomes physical address
    • Obtains branch prediction from branch target buffer (BTB)
      • BTB takes two cycles to complete one access
    • Instruction boundary (ILD) and BTB markings
  • Stage 14 - Completion of instruction cache access
    • Instructions and their marks are sent to instruction buffer or steered to ID
p6 front end instruction decoding
P6 Front-end: Instruction Decoding
  • Stage 15-16 - Decoding of IA32 Instructions
    • Alignment of instruction bytes
    • Identification of the ends of up to three instructions
    • Conversion of instructions into micro-ops
  • Stage 17 - Branch Decoding
    • If the ID notices a branch that went unpredicted by the BTB (i.e. if the BTB had never seen the branch before), flushes the in-order pipe, and re-fetches from the branch target
      • Branch target calculated by BAC
    • Early catch saves speculative instructions from being sent through the pipeline
  • Stage 21 - Register Allocation and Renaming
    • Synonymous with stage 17 (a reminder of independent working units)
    • Allocator used to allocate required entries in ROB, RS, LB, and SB
    • Register Alias Table (RAT) consulted
      • Maps logical sources/destinations to physical entries in the ROB (or sometimes RRF)
  • Stage 22 – Completion of Front-End
    • Marked micro-ops are forwarded to RS and ROB, where theyawait execution and retirement, respectively.
register alias table introduction
Register Alias Table Introduction
  • Provides register renaming of integer and floating-point registers and flags
  • Maps logical (architected) entries to physical entries usually in the re-order buffer (ROB)
  • Physical entries are actually allocated by the Allocator
  • The physical entry pointers become a part of the micro-op’s overall state as it travels through the pipeline
rat details
RAT Details
  • P6 is 3-way super-scalar, so the RAT must be able to rename up to six logical sources per cycle
  • Any data dependences must be handled
    • Ex: op1) ADD EAX, EBX, ECX (dest. = EAX) op2) ADD EAX, EAX, EDX

op3) ADD EDX, EAX, EDX

    • Instead of making op2 wait for op1 to retire, the RAT provides data forwarding
      • Same case for op3, but RAT must make sure that it gets the result from op2 and not op1
rat implementation difficulties
RAT Implementation Difficulties
  • Speculative Renaming
    • Since speculative micro-ops flow by, the RAT must be able to undo its mappings in the case of a branch misprediction
  • Partial-width register reads and writes
    • Consider a partial-width write followed by a larger-width read
      • Data required by the read is an assimilation of multiple previous writes to the register – to make sure, RAT must stall the pipeline
  • Retirement Overrides
    • Common interaction between RAT and ROB
    • When a micro-op retires, its ROB entry is removed and its result may be latched into an architected destination register
    • If any active micro-ops source the retired op’s destination, they must not reference the outdated ROB entry
  • Mismatch stalls
    • Associated with flag renaming
the allocator
The Allocator
  • Works in conjunction with RAT to allocate required entries
  • In each cycle, assumes three ROB, RS, and LB and two SB entries
    • Once micro-ops arrive, it determines how many entries are really needed
  • ROB Allocation
    • If three entries aren’t available the allocator will stall
  • RS Allocation
    • A bitmap is used to determine which entries are free
    • If the RS is full, pipeline is stalled
      • RS must make sure valid entries are not overwritten
  • MOB Allocation
    • Allocation of LB and SB entries also done by allocator
pm changes to p6 front end
PM Changes to P6 Front-End
  • Micro-op fusion
  • Dedicated Stack Engine
  • Enhanced branch prediction
  • Additional stages
    • Intel’s secret
    • Most likely required for extra functionality above
micro ops fusion
Micro-ops Fusion
  • Fusion of multiple micro-ops into one micro-op
    • Less contention for buffer entries
  • Similarity to SIMD data packing
  • Two examples of fusion from Intel documentation:
    • IA32 load-and-operate and store instructions
    • Not known for certain whether these are the only cases of fusion
  • Possibly inspired by MacroOps used in K7 (Athlon)
dedicated stack engine
Dedicated Stack Engine
  • Traditional out-of-order implementations update the Stack Pointer Register (ESP) by sending a µop to update the ESP register with every stack related instruction
  • Pentium M implementation
    • A delta register (ESPD) is maintained in the front end
    • A historic ESP (ESPO) is then kept in the out-of-order execution core
    • Dedicated logic was added to update the ESP by adding the ESPO with the ESPD
improvements
Improvements
  • The ESPO value kept in the out-of-order machine is not changed during a sequence of stack operations, this allows for more parallelism opportunities to be realized
  • Since ESPD updates are now done by a dedicated adder, the execution unit is now free to work on other µops and the ALU’s are freed to work on more complex operations
  • Decreased power consumption since large adders are not used for small operations and the eliminated µops do not toggle through the machine
  • Approximately 5% of the µops have been eliminated
complications
Complications
  • Since the new adder lives in the front end all of its calculations are speculative. This necessitates the addition of recovery table for all values of ESPO and ESPD
  • If the architectural value of ESP is needed inside of the out-of-order machine the decode logic then needs to insert a µop that will carry out the ESP calculation
branch prediction
Branch Prediction
  • Longer pipelines mean higher penalties for mispredicted branches
  • Improvements result in added performance and hence less energy spent per instruction retired
branch prediction in pentium m
Branch Prediction in Pentium M
  • Enhanced version of Pentium 4 predictor
  • Two branch predictors added that run in tandem with P4 predictor:
    • Loop detector
    • Indirect branch detector
  • 20% lower misprediction rate than PIII resulting in up to 7% gain in real performance
branch prediction1
Branch Prediction

Based on diagram found here: http://www.cpuid.org/reviews/PentiumM/index.php

loop detector
Loop Detector
  • A predictor that always branches in a loop will always incorrectly branch on the last iteration
  • Detector analyzes branches for loop behavior
  • Benefits a wide variety of program types

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm

indirect branch predictor
Indirect Branch Predictor
  • Picks targets based on global flow control history
  • Benefits programs compiled to branch to calculated addresses

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm

reservation station
Reservation Station
  • Used as a store for µops to wait for their operands and execution units to become available
  • Consists of 20 entries
  • Control portion of the entry can be written to from one of three ports
  • Data portion can be written to from one of 6 available ports
    • 3 for ROB
    • 3 for EU write backs
  • Scheduler then uses this to schedule up to 5 µops at a time
  • During pipeline stage 31 entries that are ready for dispatch are then sent to stage 32
cancellation
Cancellation
  • Reservation Station assumes that all cache accesses will be hits
  • In the case of a cache miss micro-ops that are dependant on the write-back data need to be cancelled and rescheduled at a later time
  • Can also occur due to a future resource conflict
retirement
Retirement
  • Takes 2 clock cycles to complete
  • Utilizes reorder buffer (ROB) to control retirement or completion of μops
  • ROB is a multi-ported register file with separate ports for
    • Allocation time writes of µop fields needed at retirement
    • Execution Unit write-backs
    • ROB reads of sources for the Reservation Station
    • Retirement logic reads of speculative result data
  • Consists of 40 entries with each entry 157 bits wide
  • The ROB participates in
    • Speculative execution
    • Register renaming
    • Out-of-order execution
speculative execution
Speculative Execution
  • Buffers results of the execution unit before commit
  • Allows maximum rate for fetch and execute by assuming that branch prediction is perfect and no exceptions have occurred
  • If a misprediction occurs:
    • Speculative results stored in the ROB are immediately discarded
    • Microengine will restart by examining the committed state in the ROB
register renaming
Register Renaming
  • Entries in the ROB that will hold the results of speculative µops are allocated during stage 21 of the pipeline
  • In stage 22 the sources for the µops are delivered based upon the allocation in stage 21.
  • Data is written to the ROB by the Execution Unit into the renamed register during stage 83
out of order execution
Out-of-order Execution
  • Allows µops to complete and write back their results without concern for other µops executing simultaneously
  • The ROB reorders the completed µops into the original sequence and updates the architectural state
  • Entries in ROB are treated as FIFO during retirement
    • µops are originally allocated in sequential order so the retirement will also follow the original program order
  • Happens during pipeline stage 92 and 93
exception handling
Exception Handling
  • Events are sent to the ROB by the EU during stage 83
  • Results sent to the ROB from the Execution Unit are speculative results, therefore any exceptions encountered may not be real
  • If the ROB determines that branch prediction was incorrect it inserts a clear signal at the point just before the retirement of this operation and then flushes all the speculative operations from the machine
  • If speculation is correct, the ROB will invoke the correct microcode exception handler
  • All event records are saved to allow the handler to repair the result or invoke the correct macro handler
  • Pointers for the macro and micro instructions are also needed to allow the program to resume after completion by the event handler
  • If the ROB retires an operation that faults, both the in-order and out-of-order sections are cleared. This happens during pipeline stages 93 and 94
memory subsystem
Memory Subsystem
  • Memory Ordering Buffer (MOB)
    • Execution is out-of-order, but memory accesses cannot just be done in any order
    • Contains mainly the LB and the SB
  • Speculative loads and stores
    • Not all loads can be speculative
      • I.e. a memory-mapped I/O ld could have unrecoverable side effects
    • Stores are never speculative (can’t get back overwritten bits)
      • But to improve performance, stores are queued in the store buffer (SB) to allow pending loads to proceed
        • Similar to a write-back cache
schedule1
Schedule
  • P6 Pipeline in detail
    • Front-End
    • Execution Core
    • Back-End
  • Power Issues
    • Intel SpeedStep
  • Testing the Features
    • x86 system registers
    • Performance Testing
power issues
Power Issues
  • Power use = α * C * V2 * F
    • α = activity factor
    • C = effective capacitance
    • V = voltage
    • F = operating frequency
  • Power use can be reduced linearly by lowering frequency and capacitance and quadratically by scaling voltage
mobile use
Mobile Use
  • Mobile is bursty – full power is only necessary for brief periods
  • Intel developed SpeedStep technology to take advantage of this fact and reduce power consumption during periods of inactivity

http://www.intel.com/technology/itj/2003/volume07issue02/art05_power/p05_thermal.htm

speedstep i and ii
SpeedStep I and II
  • SpeedStep I and II used in previous generations
    • Only two states:
      • High performance (High frequency mode)
      • Lower power use (Low frequency mode)
  • Problems
    • Slow transition times
    • Limited opportunity for optimization
pentium m goals
Pentium M Goals
  • Optimize for performance when plugged in
  • Optimize for long battery-life when unplugged
speedstep iii
SpeedStep III
  • Optimized to fix limitations of previous generations
  • Three innovations:
    • Voltage-Frequency switching separation
    • Clock partitioning and recovery
    • Event blocking

The 6 states of the Pentium M 1,6GHz

voltage frequency switching separation
Voltage-Frequency switching separation
  • Voltage scaling is stepped up and down incrementally
  • This prevents clock noise and allows the processor to remain responsive during transition
  • Once voltage target is reached, frequency is throttled

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm

clock partitioning and recovery
Clock partitioning and recovery
  • During transition, only the core clock and phase-locked-loop are stopped
  • This keeps logic active even while the clock is stopped

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm

event blocking
Event blocking
  • To prevent loss of events during frequency and voltage scaling when the core clock is stopped, interrupts, pin events, and snoop requests are sampled and saved
  • These events are retransmitted once the core clock becomes available

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm

leakage
Leakage
  • Transistors in off state still draw current
  • As transistors shrink and clock speed increases, transistors leak more current causing higher temperatures and more power use
strained silicon
Strained Silicon

http://www.research.ibm.com/resources/press/strainedsilicon/

benefits of strained silicon
Benefits of Strained Silicon
  • Electrons flow up to 70% faster due to reduced resistance
  • This leads to chips which are up to 35% faster, without decrease in chip size
  • Intel’s "uni-axial" strained silicon process reduces leakage by at least five times without reducing performance – the 65nm process will realize another reduction of at least four times
high k transistor gate dielectric coming soon
High-K Transistor Gate Dielectric (coming soon)
  • The dielectric used since the 1960s, silicon dioxide, is so thin now that leakage is a significant problem
  • A high-k (high dielectric constant) material has been developed by Intel to replace silicon dioxide
  • This high-k material reduces leakage by a factor of 100 below silicon dioxide
more advances to expect
More Advances to Expect
  • Continued lowering of capacitance has helped reduce power consumption
  • Tri-gate transistors decreases leakage by increasing the amount of surface area for electrons to flow through
schedule2
Schedule
  • P6 Pipeline in detail
    • Front-End
    • Execution Core
    • Back-End
  • Power Issues
    • Intel SpeedStep
  • Testing the Features
    • x86 system registers
    • Performance Testing
x86 system registers
x86 System Registers
  • EFLAGS
    • Various system flags
  • CPUID
    • Exposes type and available features of processor
  • Model Specific Registers (MSRs)
    • rdmsr and wrmsr
    • Examples
      • Enabling/Disabling SpeedStep
      • Determining and changing voltage/frequency points
      • More
performance testing
Performance Testing
  • P4 2.2GHz vs. PM 1.6GHz
future processors
Future Processors
  • Yonah
    • Dual-core processor
    • Manufactured on a 65 nm process
    • Starting at 2.16GHz with a 667 MHz FSB (166MHz quad-pumped)
    • Shared 2MB L2 cache
    • Increased floating point performance with SSE3 instructions
  • Merom
    • Based on EM64T ISA
    • Consume ~0.5 W of power, half of what the Dothan consumes
    • Possibility of laptops with 10 hours of battery life