Intel Pentium M


Outline

  • History

  • P6 Pipeline in detail

  • New features

    • Improved Branch Prediction

    • Micro-ops fusion

    • Speed Step technology

    • Thermal Throttle 2

  • Power and Performance


Quick Review of x86

  • 8080 - 8-bit

  • 8086/8088 - 16-bit (8088 had 8-bit external data bus) - segmented memory model

  • 286 - introduction of protected mode, which included: segment limit checking, privilege levels, read-only and execute-only segment options

  • 386 - 32-bit - segmented and flat memory model - paging

  • 486 - first pipeline - expanded the 386's ID and EX units into five-stage pipeline - first to include on-chip cache - integrated x87 FPU (before it was a coprocessor)

  • Pentium (586) - first superscalar - included two pipelines, u and v - virtual-8086 mode extensions (virtual-8086 mode itself arrived with the 386) - MMX soon after

  • Pentium Pro (686 or P6) - three-way superscalar - dynamic execution - out-of-order execution, branch prediction, speculative execution - very successful micro-architecture

  • Pentium 2 and 3 - both P6

  • Pentium 4 - new NetBurst architecture

  • Pentium M - enhanced P6


Pentium Pro Roots

  • NexGen 586 (1994)

    • Decomposes IA32 instructions into simpler RISC-like operations (R-ops or micro-ops)

      • Decoupled Approach

  • NexGen bought by AMD

    • AMD K5 (1995) – also used micro-ops

  • Intel Pentium Pro

    • Intel’s first use of decoupled architecture


Pentium-M Overview

  • Introduced March 12, 2003

  • Initially called Banias

  • Created by Israeli team

  • Missed deadline by less than 5 days

  • Marketed with Intel’s Centrino Initiative

  • Based on P6 microarchitecture


P6 Pipeline in a Nutshell

  • Divided into three clusters (front, middle, back)

    • In-order Front-End

    • Out-of-order Execution Core

    • Retirement

  • Each cluster is independent

    • E.g. if a mispredicted branch is detected in the front-end, the front-end will flush and re-fetch from the corrected branch target, all while the execution core continues working on previous instructions


P6 Pipeline in a Nutshell


P6 Front-End

  • Major units: IFU, ID, RAT, Allocator, BTB, BAC

  • Fetching (IFU)

    • Includes I-cache, I-streaming cache, ITLB, ILD

    • No pre-decoding

    • Boundary markings by instruction-length decoder (ILD)

  • Branch Prediction

    • Predicted (speculative) instructions are marked

  • Decoding (ID)

    • Conversion of instructions (macro-ops) into micro-ops

  • Allocation of Buffer Entries: RS, ROB, MOB


P6 Execution Core

  • Reservation Station (RS)

    • Waiting micro-ops ready to go

    • Scheduler

  • Out-of-order Execution of micro-ops

    • Independent execution units (EU)

    • Must be careful about out-of-order memory access

      • Memory ordering buffer (MOB) interfaces to the memory subsystem

  • Requirements for execution

    • Available operands, EU, and write-back bus

    • Optimal performance


P6 Retirement

  • In-order updating of architected machine state

    • Re-order buffer (ROB)

  • Micro-op retirement – “all or none”

    • Architecturally illegal to retire only part of an IA-32 instruction

  • In-order handling of exceptions

    • Legal to handle mid-execution, but illegal to handle mid-retirement


PM Changes to P6

  • Most changes made in P6 front-end

  • Added and expanded on P4 branch predictor

  • Micro-ops fusion

  • Addition of dedicated stack engine

  • Pipeline length

    • Longer than P3, shorter than P4

    • Accommodates extra features above


PM Changes to P6, cont.

  • Intel has not released the exact length of the pipeline.

  • Known to be somewhere between the P3 (10 stages) and the P4 (20 stages). Rumored to be 12 stages.

  • Trades slightly lower clock frequencies (than P4) for better performance per clock, smaller branch misprediction penalties, …


Blue Man Group Commercial Break


Banias

  • 1st version

  • 77 million transistors, 23 million more than P4

  • 1 MB on die Level 2 cache

  • 400 MHz FSB (quad pumped 100 MHz)

  • 130 nm process

  • Frequencies between 1.3 – 1.7 GHz

  • Thermal Design Point of 24.5 watts

http://www.intel.com/pressroom/archive/photos/centrino.htm


Dothan

  • Launched May 10, 2004

  • 140 million transistors

  • 2 MB Level 2 cache

  • 400 or 533 MHz FSB

  • Frequencies from 1.0 to 2.26 GHz

  • Thermal Design Point of 21 W (400 MHz FSB) to 27 W

http://www.intel.com/pressroom/archive/photos/centrino.htm


Dothan cont.

  • 90 nm process technology on 300 mm wafer.

  • 300 mm wafers provide twice the capacity of 200 mm wafers, while the smaller process dimensions double the transistor density

  • Gate dimensions are 50 nm, or approximately half the diameter of the influenza virus

  • P and n gate voltages are reduced by enhancing the carrier mobility of the Si lattice by 10-20%

  • Draws less than 1 W average power


Bus

  • Utilizes a split transaction deferred reply protocol

  • 64-bit width

  • Delivers up to 3.2 GB/s (Banias) or 4.2 GB/s (Dothan) in and out of the processor

  • Utilizes source synchronous transfer of addresses and data

    • Data transferred 4 times per bus clock

    • Addresses can be delivered 2 times per bus clock
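As a rough check on the peak bus figures, bandwidth is the bus width multiplied by the number of transfers per second. The sketch below assumes a 100 MHz bus clock behind the 400 MHz FSB and a 133.33 MHz bus clock behind the 533 MHz FSB; the function name is illustrative and not taken from Intel documentation.

```python
def peak_bandwidth_gb_per_s(bus_clock_mhz, transfers_per_clock, width_bits):
    """Peak transfer rate in GB/s (10^9 bytes per second)."""
    transfers_per_second = bus_clock_mhz * 1e6 * transfers_per_clock
    bytes_per_transfer = width_bits // 8
    return transfers_per_second * bytes_per_transfer / 1e9


# 64-bit data bus, data transferred 4 times per bus clock.
banias = peak_bandwidth_gb_per_s(100, 4, 64)      # 400 MHz FSB
dothan = peak_bandwidth_gb_per_s(133.33, 4, 64)   # 533 MHz FSB (assumed bus clock)
print(f"Banias: {banias:.2f} GB/s, Dothan: {dothan:.2f} GB/s")
```

Under these assumptions the 400 MHz FSB works out to 3.2 GB/s and the 533 MHz FSB to a little under 4.3 GB/s.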


  • Bus update in Dothan

  • http://www.intel.com/technology/itj/2005/volume09issue01/art05_perf_power


L1 Cache

  • 64KB total

    • 32 K instruction

    • 32 K data (4 times P4M)

  • Write-back vs. write-through on P4

  • In write-through cache, data is written to both L1 and main memory simultaneously

  • In write-back cache, data can be written to the cache without immediately writing to main memory, increasing speed by reducing the number of slow memory writes
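The difference between the two policies can be illustrated by counting main-memory writes for a burst of stores. This is a deliberately simplified sketch: it assumes every store hits in the cache and that each dirty line is written back exactly once when it is eventually evicted. The function name and addresses are hypothetical.

```python
def memory_writes(store_lines, policy):
    """Count main-memory writes caused by a sequence of stores.

    store_lines: cache-line addresses written by the CPU (all assumed to hit).
    policy: "write-through" or "write-back".
    """
    if policy == "write-through":
        # Every store is sent to main memory as well as the cache.
        return len(store_lines)
    if policy == "write-back":
        # Stores only mark the line dirty; each dirty line is written
        # to memory once, when it is evicted (assumed to happen later).
        return len(set(store_lines))
    raise ValueError(f"unknown policy: {policy}")


# Eight stores to one line, two stores to another (hypothetical addresses).
stores = [0x100] * 8 + [0x140] * 2
write_through = memory_writes(stores, "write-through")
write_back = memory_writes(stores, "write-back")
print(write_through, write_back)
```

For this burst the write-through policy issues ten memory writes and the write-back policy two, which is the saving the slide refers to.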


L2 Cache

  • 1 – 2 MB

  • 8-way set associative

  • Each set is divided into 4 separate power quadrants.

  • Each individual power quadrant can be set to a sleep mode, shutting off power to those quadrants

  • Allows for only 1/32 of cache to be powered at any time

  • Increased latency vs. improved power consumption


Prefetch

  • Prefetch logic fetches data to the level 2 cache before L1 cache requests occur

  • Reduces compulsory misses due to an increase of valid data in cache

  • Reduces bus cycle penalties


Schedule

  • P6 Pipeline in detail

    • Front-End

    • Execution Core

    • Back-End

  • Power Issues

    • Intel SpeedStep

  • Testing the Features

    • x86 system registers

    • Performance Testing


P6 Front-end: Instruction Fetching

  • IA-32 Memory Management

    • Classic segmented model (cannot be disabled in protected mode)

      • Separation of code, data, and stack into "segments"

    • Optional paging

      • Segments divided into pages (typically 4KB)

      • Additional protection to segment-protection

        • I.e. provides read-write protection on a page-by-page basis

  • Stage 11 (stage 1) - Selection of address for next I-cache access

    • Speculation – address chosen from competing sources (e.g. BTB, BAC, loop detector)

    • Calculation of linear address from logical (segment selector + offset)

      • Segment selector – index into a table of segment descriptors, which include base address, size, type, and access rights of the segment

      • Remember: only six segment registers, so only six segments usable at a time

        • 32-bit code nowadays uses flat model, so OS can make do with only a few (typically four) segments

    • IFU chooses address with highest priority and sends it to stage two
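The logical-to-linear translation described above can be sketched as a table lookup followed by an add. This is a simplified model: a real selector also carries table-indicator and privilege bits, and a real descriptor holds type and access-rights fields that are checked as well. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class SegmentDescriptor:
    base: int    # linear address where the segment starts
    limit: int   # highest valid offset within the segment


def linear_address(descriptor_table, selector_index, offset):
    """Translate (segment selector, offset) into a 32-bit linear address."""
    descriptor = descriptor_table[selector_index]
    if offset > descriptor.limit:
        # Real hardware raises a protection fault here.
        raise ValueError("offset exceeds segment limit")
    return (descriptor.base + offset) & 0xFFFFFFFF


# Flat model: base 0 and a 4 GB limit, so linear address == offset.
flat = SegmentDescriptor(base=0x00000000, limit=0xFFFFFFFF)
# A small segment with a non-zero base (hypothetical values).
small = SegmentDescriptor(base=0x00400000, limit=0xFFFF)
table = [flat, small]

flat_result = linear_address(table, 0, 0x1234)
small_result = linear_address(table, 1, 0x10)
print(hex(flat_result), hex(small_result))
```

With the flat descriptor the offset passes through unchanged, which is why 32-bit operating systems can make do with a handful of segments.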


P6 Front-end: Instruction Fetching

  • Stage 12-13 - Accessing of caches

    • Accesses instruction caches with address calculated in stage one

      • Includes standard cache, victim cache, and streaming buffer

    • With paging, consults ITLB to determine physical page number (tag bits)

      • Without paging, linear address from stage one becomes physical address

    • Obtains branch prediction from branch target buffer (BTB)

      • BTB takes two cycles to complete one access

    • Instruction boundary (ILD) and BTB markings

  • Stage 14 - Completion of instruction cache access

    • Instructions and their marks are sent to instruction buffer or steered to ID


P6 Front-end: Instruction Fetching


P6 Front-end: Instruction Decoding

  • Stage 15-16 - Decoding of IA32 Instructions

    • Alignment of instruction bytes

    • Identification of the ends of up to three instructions

    • Conversion of instructions into micro-ops

  • Stage 17 - Branch Decoding

    • If the ID notices a branch that went unpredicted by the BTB (i.e. if the BTB had never seen the branch before), it flushes the in-order pipe and re-fetches from the branch target

      • Branch target calculated by BAC

    • Early catch saves speculative instructions from being sent through the pipeline

  • Stage 21 - Register Allocation and Renaming

    • Overlaps with stage 17 (a reminder of independent working units)

    • Allocator used to allocate required entries in ROB, RS, LB, and SB

    • Register Alias Table (RAT) consulted

      • Maps logical sources/destinations to physical entries in the ROB (or sometimes RRF)

  • Stage 22 – Completion of Front-End

    • Marked micro-ops are forwarded to RS and ROB, where they await execution and retirement, respectively.


P6 Front-end: Instruction Decoding


Register Alias Table Introduction

  • Provides register renaming of integer and floating-point registers and flags

  • Maps logical (architected) entries to physical entries usually in the re-order buffer (ROB)

  • Physical entries are actually allocated by the Allocator

  • The physical entry pointers become a part of the micro-op’s overall state as it travels through the pipeline


RAT Details

  • P6 is 3-way super-scalar, so the RAT must be able to rename up to six logical sources per cycle

  • Any data dependences must be handled

    • Ex:

      op1) ADD EAX, EBX, ECX (dest. = EAX)

      op2) ADD EAX, EAX, EDX

      op3) ADD EDX, EAX, EDX

    • Instead of making op2 wait for op1 to retire, the RAT provides data forwarding

      • Same case for op3, but RAT must make sure that it gets the result from op2 and not op1
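The renaming of the three-operation example can be traced with a small table that maps each logical register to the ROB entry holding its newest value. This is a sequential sketch; the hardware renames up to three micro-ops in the same cycle and needs bypass logic to reach the same result. The function name and the `ROBn` tag format are made up for illustration.

```python
def rename(uops, first_free_rob=0):
    """Rename (dest, src1, src2) micro-ops to ROB entries.

    Returns a list of (allocated_entry, src1_tag, src2_tag).
    """
    rat = {}                    # logical register -> index of newest ROB entry
    next_rob = first_free_rob
    renamed = []
    for dest, src1, src2 in uops:
        # Sources read the current mapping; a register with no in-flight
        # producer is read from the architectural register file.
        tags = tuple(f"ROB{rat[s]}" if s in rat else s for s in (src1, src2))
        rat[dest] = next_rob    # later readers of dest now see this entry
        renamed.append((f"ROB{next_rob}",) + tags)
        next_rob += 1
    return renamed


ops = [
    ("EAX", "EBX", "ECX"),   # op1
    ("EAX", "EAX", "EDX"),   # op2
    ("EDX", "EAX", "EDX"),   # op3
]
result = rename(ops)
for line in result:
    print(line)
```

Here op2 takes its EAX source from op1's entry (ROB0), and op3 takes EAX from op2's entry (ROB1) rather than op1's, matching the forwarding requirement in the bullets.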


RAT Implementation Difficulties

  • Speculative Renaming

    • Since speculative micro-ops flow by, the RAT must be able to undo its mappings in the case of a branch misprediction

  • Partial-width register reads and writes

    • Consider a partial-width write followed by a larger-width read

      • Data required by the read is a combination of multiple previous writes to the register – to be safe, the RAT must stall the pipeline

  • Retirement Overrides

    • Common interaction between RAT and ROB

    • When a micro-op retires, its ROB entry is removed and its result may be latched into an architected destination register

    • If any active micro-ops source the retired op’s destination, they must not reference the outdated ROB entry

  • Mismatch stalls

    • Associated with flag renaming


The Allocator

  • Works in conjunction with RAT to allocate required entries

  • In each cycle, assumes three ROB, three RS, and three LB entries plus two SB entries will be needed

    • Once micro-ops arrive, it determines how many entries are really needed

  • ROB Allocation

    • If three entries aren’t available the allocator will stall

  • RS Allocation

    • A bitmap is used to determine which entries are free

    • If the RS is full, pipeline is stalled

      • RS must make sure valid entries are not overwritten

  • MOB Allocation

    • Allocation of LB and SB entries also done by allocator
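A bitmap-based free list of the kind mentioned for the RS might look like the following. This is only a sketch: the all-or-nothing stall policy and the lowest-index-first choice are assumptions, and the class name is hypothetical.

```python
class ReservationStationAllocator:
    def __init__(self, entries=20):
        self.entries = entries
        self.busy = 0   # bit i set means entry i holds a valid micro-op

    def allocate(self, count):
        """Return `count` free entry indices, or None to signal a stall."""
        free = [i for i in range(self.entries) if not (self.busy >> i) & 1]
        if len(free) < count:
            return None             # not enough room: stall the pipeline
        chosen = free[:count]
        for i in chosen:
            self.busy |= 1 << i     # mark as occupied so it is not overwritten
        return chosen

    def release(self, index):
        """Free an entry once its micro-op has been dispatched."""
        self.busy &= ~(1 << index)


# A tiny 4-entry station makes the stall easy to see.
rs = ReservationStationAllocator(entries=4)
first = rs.allocate(3)
second = rs.allocate(3)     # only one entry left, so this stalls
rs.release(1)
third = rs.allocate(2)
print(first, second, third)
```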


PM Changes to P6 Front-End

  • Micro-op fusion

  • Dedicated Stack Engine

  • Enhanced branch prediction

  • Additional stages

    • Intel’s secret

    • Most likely required for extra functionality above


Micro-ops Fusion

  • Fusion of multiple micro-ops into one micro-op

    • Less contention for buffer entries

  • Similarity to SIMD data packing

  • Two examples of fusion from Intel documentation:

    • IA32 load-and-operate instructions and store instructions

    • Not known for certain whether these are the only cases of fusion

  • Possibly inspired by MacroOps used in K7 (Athlon)


Dedicated Stack Engine

  • Traditional out-of-order implementations send a µop to update the Stack Pointer Register (ESP) with every stack-related instruction

  • Pentium M implementation

    • A delta register (ESPD) is maintained in the front end

    • A historic ESP (ESPO) is then kept in the out-of-order execution core

    • Dedicated logic was added to update the ESP by adding ESPD to ESPO


Improvements

  • The ESPO value kept in the out-of-order machine is not changed during a sequence of stack operations; this exposes more opportunities for parallelism

  • Since ESPD updates are now done by a dedicated adder, the execution units are free to work on other µops and the ALUs are freed to work on more complex operations

  • Decreased power consumption since large adders are not used for small operations and the eliminated µops do not toggle through the machine

  • Approximately 5% of the µops have been eliminated


Complications

  • Since the new adder lives in the front end, all of its calculations are speculative. This necessitates the addition of a recovery table for all values of ESPO and ESPD

  • If the architectural value of ESP is needed inside of the out-of-order machine the decode logic then needs to insert a µop that will carry out the ESP calculation
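The interplay of ESPO, ESPD, and the inserted synchronization µop can be modelled in a few lines. This is a toy model assuming 4-byte stack slots; it ignores the recovery table needed for mispredictions, and the class name and textual µop format are invented for illustration.

```python
class StackEngine:
    def __init__(self, esp):
        self.espo = esp   # ESP value held in the out-of-order core
        self.espd = 0     # delta accumulated by the front-end adder
        self.uops = []    # micro-ops actually sent to the core

    def push(self, reg):
        self.espd -= 4
        # The address is expressed relative to ESPO, so no separate
        # ESP-update micro-op is emitted.
        self.uops.append(f"store [ESPO{self.espd:+d}] <- {reg}")

    def pop(self, reg):
        self.uops.append(f"load {reg} <- [ESPO{self.espd:+d}]")
        self.espd += 4

    def read_esp(self):
        """An instruction needs the architectural ESP inside the core."""
        if self.espd != 0:
            # Decode logic inserts one micro-op that folds the delta into ESPO.
            self.uops.append(f"sync ESPO <- ESPO{self.espd:+d}")
            self.espo += self.espd
            self.espd = 0
        return self.espo


engine = StackEngine(esp=0x1000)
engine.push("EAX")
engine.push("EBX")
engine.pop("ECX")
esp = engine.read_esp()
print(hex(esp))
for uop in engine.uops:
    print(uop)
```

Three stack instructions plus one read of ESP produce four µops here; a design that updates ESP with every stack instruction would have issued three extra ESP-update µops.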


Branch Prediction

  • Longer pipelines mean higher penalties for mispredicted branches

  • Improvements result in added performance and hence less energy spent per instruction retired


Branch Prediction in Pentium M

  • Enhanced version of Pentium 4 predictor

  • Two branch predictors added that run in tandem with P4 predictor:

    • Loop detector

    • Indirect branch detector

  • 20% lower misprediction rate than PIII resulting in up to 7% gain in real performance


Branch Prediction

Based on diagram found here: http://www.cpuid.org/reviews/PentiumM/index.php


Loop Detector

  • A predictor that always predicts a loop branch as taken will always mispredict the last iteration

  • Detector analyzes branches for loop behavior

  • Benefits a wide variety of program types

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
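The benefit of a loop detector can be seen by counting mispredictions on the loop-closing branch. The sketch below assumes a loop with a fixed trip count that is executed repeatedly and a detector that simply remembers the trip count from the previous run. It is not a description of Intel's actual counter logic, and the function names are illustrative.

```python
def loop_branch_outcomes(trip_count):
    """Outcomes of the loop-closing branch for one run of the loop:
    taken on every iteration except the last."""
    return [True] * (trip_count - 1) + [False]


def misses_always_taken(trip_count, runs):
    misses = 0
    for _ in range(runs):
        for outcome in loop_branch_outcomes(trip_count):
            misses += outcome is not True   # prediction is always "taken"
    return misses


def misses_with_loop_detector(trip_count, runs):
    learned = None      # trip count observed on the previous run
    misses = 0
    for _ in range(runs):
        outcomes = loop_branch_outcomes(trip_count)
        for iteration, outcome in enumerate(outcomes, start=1):
            # Before anything is learned, fall back to "taken".
            predicted = True if learned is None else iteration < learned
            misses += predicted != outcome
        learned = trip_count
    return misses


baseline = misses_always_taken(trip_count=10, runs=100)
detected = misses_with_loop_detector(trip_count=10, runs=100)
print(baseline, detected)
```

Under these assumptions the always-taken predictor misses once per run of the loop, while the detector misses only on the first run.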


Indirect Branch Predictor

  • Picks targets based on global control-flow history

  • Benefits programs compiled to branch to calculated addresses

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm


Reservation Station

  • Used as a store for µops to wait for their operands and execution units to become available

  • Consists of 20 entries

  • Control portion of the entry can be written to from one of three ports

  • Data portion can be written to from one of 6 available ports

    • 3 for ROB

    • 3 for EU write backs

  • Scheduler then uses this to schedule up to 5 µops at a time

  • During pipeline stage 31 entries that are ready for dispatch are then sent to stage 32
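The selection step can be sketched as a scan over the waiting entries that picks micro-ops whose operands are ready and whose execution port is free, up to five per cycle. Oldest-first ordering and the one-micro-op-per-port rule are assumptions made for this sketch; the entry fields and names are hypothetical.

```python
def select_for_dispatch(entries, free_ports, limit=5):
    """Pick ready micro-ops for dispatch in one cycle.

    entries: list of dicts with 'name', 'ready', and 'port', oldest first.
    free_ports: set of execution ports available this cycle.
    """
    ports = set(free_ports)
    selected = []
    for uop in entries:
        if len(selected) == limit:
            break
        if uop["ready"] and uop["port"] in ports:
            ports.remove(uop["port"])   # assume one micro-op per port per cycle
            selected.append(uop["name"])
    return selected


waiting = [
    {"name": "add1", "ready": True, "port": 0},
    {"name": "load1", "ready": True, "port": 2},
    {"name": "add2", "ready": True, "port": 0},    # port 0 already claimed
    {"name": "mul1", "ready": False, "port": 0},   # operands not yet available
    {"name": "add3", "ready": True, "port": 1},
]
dispatched = select_for_dispatch(waiting, free_ports={0, 1, 2, 3, 4})
print(dispatched)
```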


Cancellation

  • Reservation Station assumes that all cache accesses will be hits

  • In the case of a cache miss, micro-ops that are dependent on the write-back data need to be cancelled and rescheduled at a later time

  • Can also occur due to a future resource conflict


Retirement

  • Takes 2 clock cycles to complete

  • Utilizes reorder buffer (ROB) to control retirement or completion of μops

  • ROB is a multi-ported register file with separate ports for

    • Allocation time writes of µop fields needed at retirement

    • Execution Unit write-backs

    • ROB reads of sources for the Reservation Station

    • Retirement logic reads of speculative result data

  • Consists of 40 entries with each entry 157 bits wide

  • The ROB participates in

    • Speculative execution

    • Register renaming

    • Out-of-order execution


Speculative Execution

  • Buffers results of the execution unit before commit

  • Allows maximum rate for fetch and execute by assuming that branch prediction is perfect and no exceptions have occurred

  • If a misprediction occurs:

    • Speculative results stored in the ROB are immediately discarded

    • Microengine will restart by examining the committed state in the ROB


Register Renaming

  • Entries in the ROB that will hold the results of speculative µops are allocated during stage 21 of the pipeline

  • In stage 22 the sources for the µops are delivered based upon the allocation in stage 21.

  • Data is written to the ROB by the Execution Unit into the renamed register during stage 83


Out-of-order Execution

  • Allows µops to complete and write back their results without concern for other µops executing simultaneously

  • The ROB reorders the completed µops into the original sequence and updates the architectural state

  • Entries in ROB are treated as FIFO during retirement

    • µops are originally allocated in sequential order so the retirement will also follow the original program order

  • Happens during pipeline stages 92 and 93
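The FIFO behaviour of the ROB during retirement can be illustrated with a queue in which micro-ops complete in any order but leave only from the head. The retirement width of three per cycle is an assumption of this sketch, and the class and method names are illustrative.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size=40):
        self.size = size
        self.entries = deque()   # oldest micro-op at the left

    def allocate(self, name):
        """Allocate in program order; returns False if the ROB is full."""
        if len(self.entries) == self.size:
            return False
        self.entries.append({"name": name, "done": False})
        return True

    def complete(self, name):
        """Execution units may report completion in any order."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True
                return
        raise KeyError(name)

    def retire(self, width=3):
        """Retire completed micro-ops from the head only."""
        retired = []
        while self.entries and self.entries[0]["done"] and len(retired) < width:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
for name in ("a", "b", "c"):
    rob.allocate(name)
rob.complete("c")
rob.complete("b")
blocked = rob.retire()    # "a" has not finished, so nothing can retire
rob.complete("a")
in_order = rob.retire()
print(blocked, in_order)
```

Even though `c` and `b` finish first, nothing retires until `a` is done, and the three then leave in their original program order.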


Exception Handling

  • Events are sent to the ROB by the EU during stage 83

  • Results sent to the ROB from the Execution Unit are speculative results, therefore any exceptions encountered may not be real

  • If the ROB determines that branch prediction was incorrect it inserts a clear signal at the point just before the retirement of this operation and then flushes all the speculative operations from the machine

  • If speculation is correct, the ROB will invoke the correct microcode exception handler

  • All event records are saved to allow the handler to repair the result or invoke the correct macro handler

  • Pointers for the macro and micro instructions are also needed to allow the program to resume after completion by the event handler

  • If the ROB retires an operation that faults, both the in-order and out-of-order sections are cleared. This happens during pipeline stages 93 and 94


Memory Subsystem

  • Memory Ordering Buffer (MOB)

    • Execution is out-of-order, but memory accesses cannot just be done in any order

    • Contains mainly the LB and the SB

  • Speculative loads and stores

    • Not all loads can be speculative

      • E.g. a memory-mapped I/O load could have unrecoverable side effects

    • Stores are never speculative (can’t get back overwritten bits)

      • But to improve performance, stores are queued in the store buffer (SB) to allow pending loads to proceed

        • Similar to a write-back cache


Schedule

  • P6 Pipeline in detail

    • Front-End

    • Execution Core

    • Back-End

  • Power Issues

    • Intel SpeedStep

  • Testing the Features

    • x86 system registers

    • Performance Testing


Power Issues

  • Power use = α * C * V² * F

    • α = activity factor

    • C = effective capacitance

    • V = voltage

    • F = operating frequency

  • Power use can be reduced linearly by lowering frequency and capacitance and quadratically by scaling voltage
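To see the linear and quadratic effects side by side, the power relation can be evaluated at a few operating points. The activity factor, capacitance, voltages, and frequencies below are hypothetical values chosen for illustration and are not Pentium M specifications.

```python
def dynamic_power(alpha, capacitance_f, voltage_v, frequency_hz):
    """Dynamic power in watts: activity factor * C * V^2 * F."""
    return alpha * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical operating points.
high = dynamic_power(0.5, 1e-9, 1.4, 1.6e9)
freq_only = dynamic_power(0.5, 1e-9, 1.4, 0.8e9)       # frequency halved
freq_and_volt = dynamic_power(0.5, 1e-9, 1.0, 0.8e9)   # voltage lowered as well

print(f"frequency halved: {freq_only / high:.2f}x")
print(f"frequency halved, voltage 1.4 V -> 1.0 V: {freq_and_volt / high:.2f}x")
```

Halving the frequency alone halves the power, while also dropping the voltage from 1.4 V to 1.0 V brings it to roughly a quarter, which is why voltage scaling matters most.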


Mobile Use

  • Mobile is bursty – full power is only necessary for brief periods

  • Intel developed SpeedStep technology to take advantage of this fact and reduce power consumption during periods of inactivity

http://www.intel.com/technology/itj/2003/volume07issue02/art05_power/p05_thermal.htm


SpeedStep I and II

  • SpeedStep I and II used in previous generations

    • Only two states:

      • High performance (High frequency mode)

      • Lower power use (Low frequency mode)

  • Problems

    • Slow transition times

    • Limited opportunity for optimization


Pentium M Goals

  • Optimize for performance when plugged in

  • Optimize for long battery-life when unplugged


SpeedStep III

  • Optimized to fix limitations of previous generations

  • Three innovations:

    • Voltage-Frequency switching separation

    • Clock partitioning and recovery

    • Event blocking

The 6 states of the Pentium M 1.6 GHz


Voltage-Frequency switching separation

  • Voltage scaling is stepped up and down incrementally

  • This prevents clock noise and allows the processor to remain responsive during transition

  • Once voltage target is reached, frequency is throttled

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm


Clock partitioning and recovery

  • During transition, only the core clock and phase-locked-loop are stopped

  • This keeps logic active even while the clock is stopped

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm


Event blocking

  • To prevent loss of events during frequency and voltage scaling when the core clock is stopped, interrupts, pin events, and snoop requests are sampled and saved

  • These events are retransmitted once the core clock becomes available

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm
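Event blocking amounts to buffering incoming events while the core clock is stopped and replaying them afterwards. The sketch below assumes events are replayed in arrival order; the class name and event labels are hypothetical.

```python
class EventBlocker:
    def __init__(self):
        self.clock_running = True
        self.pending = []     # events sampled while the core clock is stopped
        self.delivered = []   # events the core has actually seen

    def stop_clock(self):
        self.clock_running = False

    def signal(self, event):
        if self.clock_running:
            self.delivered.append(event)
        else:
            self.pending.append(event)   # save instead of losing the event

    def start_clock(self):
        self.clock_running = True
        # Retransmit saved events, oldest first (assumed ordering).
        self.delivered.extend(self.pending)
        self.pending.clear()


blocker = EventBlocker()
blocker.signal("interrupt-1")
blocker.stop_clock()              # voltage/frequency transition begins
blocker.signal("snoop")
blocker.signal("interrupt-2")
seen_during_transition = list(blocker.delivered)
blocker.start_clock()             # transition complete
print(seen_during_transition, blocker.delivered)
```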


Leakage

  • Transistors in off state still draw current

  • As transistors shrink and clock speed increases, transistors leak more current causing higher temperatures and more power use


Strained Silicon

http://www.research.ibm.com/resources/press/strainedsilicon/


Benefits of Strained Silicon

  • Electrons flow up to 70% faster due to reduced resistance

  • This leads to chips which are up to 35% faster, without decrease in chip size

  • Intel’s "uni-axial" strained silicon process reduces leakage by at least five times without reducing performance – the 65nm process will realize another reduction of at least four times


High-K Transistor Gate Dielectric (coming soon)

  • The dielectric used since the 1960s, silicon dioxide, is so thin now that leakage is a significant problem

  • A high-k (high dielectric constant) material has been developed by Intel to replace silicon dioxide

  • This high-k material reduces leakage by a factor of 100 below silicon dioxide


More Advances to Expect

  • Continued lowering of capacitance has helped reduce power consumption

  • Tri-gate transistors decrease leakage by increasing the amount of surface area for electrons to flow through


Schedule

  • P6 Pipeline in detail

    • Front-End

    • Execution Core

    • Back-End

  • Power Issues

    • Intel SpeedStep

  • Testing the Features

    • x86 system registers

    • Performance Testing


x86 System Registers

  • EFLAGS

    • Various system flags

  • CPUID

    • Exposes type and available features of processor

  • Model Specific Registers (MSRs)

    • rdmsr and wrmsr

    • Examples

      • Enabling/Disabling SpeedStep

      • Determining and changing voltage/frequency points

      • More
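As an example of reading a voltage/frequency point, the sketch below decodes a performance-state value of the kind read from the `IA32_PERF_STATUS` MSR (0x198). The field layout (bus ratio in bits 15:8, voltage ID in bits 7:0, voltage = 700 mV + 16 mV per step) and the 100 MHz bus clock are assumptions based on the encoding commonly used for Banias, so check them against the datasheet for a given part. `read_msr` relies on the Linux `/dev/cpu/N/msr` interface, which needs root and the `msr` kernel module; the value `0x1031` is only an illustrative input.

```python
import os

IA32_PERF_STATUS = 0x198   # current performance state
IA32_PERF_CTL = 0x199      # requested performance state


def decode_perf_state(value, bus_mhz=100):
    """Return (frequency in MHz, core voltage in mV).

    Assumed layout: bits 15:8 = bus ratio, bits 7:0 = voltage ID.
    """
    ratio = (value >> 8) & 0xFF
    vid = value & 0xFF
    return ratio * bus_mhz, 700 + 16 * vid


def read_msr(msr, cpu=0):
    """Read a 64-bit MSR through the Linux msr driver (root only)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return int.from_bytes(os.pread(fd, 8, msr), "little")
    finally:
        os.close(fd)


# Decoding an illustrative value; on real hardware one would pass
# read_msr(IA32_PERF_STATUS) & 0xFFFF instead.
mhz, millivolts = decode_perf_state(0x1031)
print(mhz, "MHz at", millivolts, "mV")
```

Writing a new operating point would go through `wrmsr` on `IA32_PERF_CTL`; that is left out here because a wrong voltage/frequency pair can destabilize the machine.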


Performance Testing

  • P4 2.2GHz vs. PM 1.6GHz


Benchmark


Battery Life


Pentium M vs AMD Turion


Gaming


Battery Life


Future Processors

  • Yonah

    • Dual-core processor

    • Manufactured on a 65 nm process

    • Starting at 2.16GHz with a 667 MHz FSB (166MHz quad-pumped)

    • Shared 2MB L2 cache

    • Increased floating point performance with SSE3 instructions

  • Merom

    • Based on EM64T ISA

    • Consume ~0.5 W of power, half of what the Dothan consumes

    • Possibility of laptops with 10 hours of battery life

