Mobile Pentium 4 Architecture Supporting Hyper-Threading Technology

Mobile Pentium 4 Architecture Supporting Hyper-Threading Technology Hakan Burak Duygulu 2003701387 CmpE 511 15.12.2005

Outline • Introduction • Intel® NetBurst™ microarchitecture • Hyper-Threading Technology • Microarchitecture Pipeline and Hyper-Threading Technology • Streaming SIMD Extensions 3 (SSE3) • Enhanced Intel SpeedStep® technology • Conclusion

Introduction • 16-bit Processors and Segmentation (1978) • The 8086 and 8088. • 20-bit addressing, 1-Mbyte address space. • The Intel® 286 Processor (1982) • 24-bit addressing, 16-Mbyte address space. • The Intel386™ Processor (1985) • 32-bit addressing, 4-Gbyte address space. • The Intel486™ Processor (1989) • Expanded instruction decode and execution units into five pipelined stages.

Introduction • The Intel® Pentium® Processor (1993) • Added a second execution pipeline to achieve superscalar performance. • Added branch prediction. • The P6 Family of Processors (1995-1999) • Intel Pentium Pro processor • Intel Pentium II processor • Pentium II Xeon processor • Intel Celeron processor • Intel Pentium III processor • Pentium III Xeon processor

Introduction • The Intel Pentium 4 Processor Family (2000-2005) • Based on Intel NetBurst® microarchitecture. • Introduced Streaming SIMD Extensions 2 (SSE2). • Introduced Streaming SIMD Extensions 3 (SSE3). • The Intel® Xeon Processor (2001-2005) • Introduced support for Hyper-Threading Technology. • The Intel® Pentium® M Processor (2003-2005) • Designed for extending battery life. • The Intel Pentium Processor Extreme Edition (2005) • 64-bit addressing, 1024-Gbyte address space.

Intel® NetBurst™ microarchitecture. • Design Goals • to execute legacy IA-32 applications and applications based on single-instruction, multiple-data (SIMD) technology at high throughput • to operate at high clock rates and to scale to higher performance and clock rates in the future

Intel® NetBurst™ microarchitecture. • Design Advantages • a deeply pipelined design (a 20-stage pipeline) that allows for high clock rates, with different parts of the chip running at different clock rates • a pipeline that optimizes for the common case of frequently executed instructions: the most frequently executed instructions in common circumstances (such as a cache hit) are decoded efficiently and executed with short latencies

Intel® NetBurst™ microarchitecture. • Design Advantages • techniques to hide stall penalties, among them parallel execution, buffering, and speculation. The microarchitecture executes instructions dynamically and out of order, so the time it takes to execute each individual instruction is not always deterministic

Intel® NetBurst™ microarchitecture. Figure 1. The Intel NetBurst Microarchitecture

Intel® NetBurst™ microarchitecture. • Caches • The Intel NetBurst microarchitecture supports up to three levels of on-chip cache. • The first-level cache (nearest to the execution core) contains separate caches for instructions and data. These include the first-level data cache and the trace cache (an advanced first-level instruction cache). All other caches are shared between instructions and data. • All caches use a pseudo-LRU (least recently used) replacement algorithm.
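
The slide names pseudo-LRU replacement without giving the exact variant. A minimal sketch of one common variant, tree pseudo-LRU for a single 4-way set, is shown below. The class name `TreePLRUSet`, the 4-way associativity, and the tree layout are assumptions made for illustration, not the Pentium 4's documented organization.

```python
class TreePLRUSet:
    """One 4-way cache set with tree pseudo-LRU replacement (toy model)."""

    def __init__(self):
        self.tags = [None] * 4  # tag held in each way, None = empty
        # bits[0]: root, bits[1]: chooses between ways 0-1, bits[2]: ways 2-3.
        # Each bit points toward the half that should be replaced next.
        self.bits = [0, 0, 0]

    def _touch(self, way):
        # Point every bit on the path away from the way that was just used.
        self.bits[0] = 1 if way < 2 else 0
        if way < 2:
            self.bits[1] = 1 if way == 0 else 0
        else:
            self.bits[2] = 1 if way == 2 else 0

    def _victim(self):
        # Follow the bits from the root down to the way to replace.
        if self.bits[0] == 0:
            return 0 if self.bits[1] == 0 else 1
        return 2 if self.bits[2] == 0 else 3

    def access(self, tag):
        """Return True on a hit, False on a miss (the line is then filled)."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        way = self.tags.index(None) if None in self.tags else self._victim()
        self.tags[way] = tag
        self._touch(way)
        return False
```

In this toy model, accessing A, B, C, D, then A again, then missing on E evicts C, whereas true LRU would evict B. Three bits per set approximate LRU order at much lower cost than tracking the full ordering.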

Intel® NetBurst™ microarchitecture. • The Front End Pipeline • Consists of two parts: • Fetch/decode unit: • a hardware instruction fetcher that automatically prefetches instructions • a hardware mechanism that automatically fetches data and instructions into the unified second-level cache • Execution trace cache • The execution trace cache (TC) is the primary instruction cache in the Intel NetBurst microarchitecture. The TC stores decoded IA-32 instructions (µops).

Intel® NetBurst™ microarchitecture. • The Front End Pipeline • Prefetches IA-32 instructions that are likely to be executed • Fetches instructions that have not already been prefetched • Decodes IA-32 instructions into micro-operations • Generates microcode for complex instructions and special-purpose code • Delivers decoded instructions from the execution trace cache • Predicts branches using a highly advanced algorithm

Intel® NetBurst™ microarchitecture. • The Front End Pipeline - Branch Prediction • Enables the processor to begin executing instructions long before the branch outcome is certain. Branch delay is the penalty that is incurred in the absence of correct prediction. • Branch prediction in the Intel NetBurst microarchitecture predicts all near branches (conditional calls, unconditional calls, returns and indirect branches). It does not predict far transfers (far calls and software interrupts).

Intel® NetBurst™ microarchitecture. • The Front End Pipeline - Branch Prediction • Mechanisms have been implemented to aid in predicting branches accurately and to reduce the cost of taken branches: • the ability to dynamically predict the direction and target of branches based on an instruction’s linear address, using the branch target buffer (BTB) • if no dynamic prediction is available or if it is invalid, the ability to statically predict the outcome based on the offset of the target: a backward branch is predicted to be taken, a forward branch is predicted to be not taken • the ability to predict return addresses using the 16-entry return address stack • the ability to build a trace of instructions across predicted taken branches to avoid branch penalties.
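
The 16-entry return address stack can be sketched as a small bounded LIFO. The class and method names are hypothetical, and dropping the oldest entry on overflow is an assumption; the slide does not say how the hardware handles call chains deeper than 16.

```python
from collections import deque


class ReturnAddressStack:
    """Toy model of a fixed-depth return address predictor."""

    def __init__(self, entries=16):
        # Assumption: when full, the oldest return address is discarded.
        self.stack = deque(maxlen=entries)

    def on_call(self, return_address):
        # A call pushes the address of the instruction after the call.
        self.stack.append(return_address)

    def predict_return(self):
        # A return pops the most recent entry; None means "no prediction".
        return self.stack.pop() if self.stack else None
```

Under this assumption, nested calls up to 16 deep are predicted correctly on the way back out. Deeper recursion loses the outermost return addresses, so those returns would go unpredicted.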

Intel® NetBurst™ microarchitecture. • The Static Predictor. • Once a branch instruction is decoded, the direction of the branch (forward or backward) is known. If there was no valid entry in the BTB for the branch, the static predictor makes a prediction based on the direction of the branch. The static prediction mechanism predicts backward conditional branches (those with negative displacement, such as loop-closing branches) as taken. Forward branches are predicted not taken. • To take advantage of the forward-not-taken and backward-taken static predictions, code should be arranged so that the likely target of the branch immediately follows forward branches.
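
The static rule reduces to a comparison of the branch address and its target. The function name and the example addresses below are hypothetical and only illustrate the rule.

```python
def static_predict_taken(branch_address, target_address):
    """Backward branch (negative displacement) -> predict taken.
    Forward branch -> predict not taken."""
    return target_address <= branch_address


# A loop-closing branch jumps backward to the top of the loop.
loop_branch_taken = static_predict_taken(0x1040, 0x1000)

# An error-handling branch jumps forward past the common path.
error_branch_taken = static_predict_taken(0x1040, 0x1080)
```

This is why the likely path should fall through after a forward branch: the predictor assumes the fall-through is what executes.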

Intel® NetBurst™ microarchitecture. Figure 2. Pentium 4 Processor Static Branch Prediction Algorithm

Intel® NetBurst™ microarchitecture. • Branch Target Buffer. • Once branch history is available, the Pentium 4 processor can predict the branch outcome even before the branch instruction is decoded. The processor uses a branch history table and a branch target buffer (collectively called the BTB) to predict the direction and target of branches based on an instruction’s linear address. Once the branch is retired, the BTB is updated with the target address.
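
A toy model of the predict-then-update-at-retirement flow is sketched below. The `ToyBTB` class, the dictionary keyed by linear address, and the 2-bit saturating counter are textbook assumptions for illustration; the slide does not describe the actual Pentium 4 history format.

```python
class ToyBTB:
    """Toy branch history + target table indexed by linear address."""

    def __init__(self):
        self.entries = {}  # linear address -> [target, 2-bit counter]

    def predict(self, branch_address, decoded_target):
        """Return (predicted_taken, predicted_target_or_None)."""
        entry = self.entries.get(branch_address)
        if entry is None:
            # No history yet: backward taken, forward not taken.
            taken = decoded_target <= branch_address
            return (taken, decoded_target if taken else None)
        target, counter = entry
        taken = counter >= 2
        return (taken, target if taken else None)

    def retire(self, branch_address, taken, target):
        """Update history once the branch outcome is known."""
        entry = self.entries.get(branch_address)
        if entry is None:
            self.entries[branch_address] = [target, 2 if taken else 1]
        elif taken:
            entry[0] = target
            entry[1] = min(3, entry[1] + 1)
        else:
            entry[1] = max(0, entry[1] - 1)
```

A forward branch at a hypothetical address 0x2000 is first predicted not taken by the static fallback. After it retires as taken once, the table predicts it taken with the recorded target.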

Intel® NetBurst™ microarchitecture. • Out-Of-Order Execution Core • The ability to execute instructions out of order is a key factor in enabling parallelism. This feature enables the processor to reorder instructions so that if one µop is delayed, other µops may proceed around it. The processor employs several buffers to smooth the flow of µops. • The core is designed to facilitate parallel execution. It can dispatch up to six µops per cycle. Most pipelines can start executing a new µop every cycle, so several instructions can be in flight at a time for each pipeline. A number of arithmetic logic unit (ALU) instructions can start at two per cycle; many floating-point instructions can start once every two cycles.

Intel® NetBurst™ microarchitecture. Figure 3. Execution Units and Ports in the Out-Of-Order Core

Intel® NetBurst™ microarchitecture. • Retirement Unit • The retirement unit receives the results of the executed µops from the out-of-order execution core and processes the results so that the architectural state updates according to the original program order. • When a µop completes and writes its result, it is retired. Up to three µops may be retired per cycle. The Reorder Buffer (ROB) is the unit in the processor which buffers completed µops, updates the architectural state in order, and manages the ordering of exceptions. The retirement section also keeps track of branches and sends updated branch target information to the branch target buffer (BTB). The BTB then purges pre-fetched traces that are no longer needed.
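
In-order retirement from the reorder buffer can be sketched as draining completed µops from the head of a queue, at most three per cycle. The data layout (a deque of dictionaries with `id` and `done` fields) and the function name are hypothetical.

```python
from collections import deque

RETIRE_WIDTH = 3  # up to three µops retire per cycle


def retire_cycle(rob):
    """Retire completed µops from the head of the ROB, in program order.

    rob is a deque of {"id": int, "done": bool}, oldest µop first.
    Returns the ids retired in this cycle.
    """
    retired = []
    while rob and rob[0]["done"] and len(retired) < RETIRE_WIDTH:
        retired.append(rob.popleft()["id"])
    return retired


# µop 2 has not finished, so µops 3 and 4 must wait behind it
# even though they completed out of order.
rob = deque({"id": i, "done": i != 2} for i in range(5))
first_cycle = retire_cycle(rob)
```

Execution may finish in any order, but the head of the buffer gates retirement, which is what keeps architectural state and exceptions in program order.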

Hyper-Threading Technology • Enables software to take advantage of task-level, or thread-level, parallelism by providing multiple logical processors within a physical processor package. • The two logical processors each have a complete set of architectural registers while sharing one single physical processor's resources. By maintaining the architecture state of two processors, an HT Technology capable processor looks like two processors to software, including operating system and application code.

Hyper-Threading Technology Figure 4. Comparison of an IA-32 Processor Supporting Hyper-Threading Technology and a Traditional Dual Processor System

Hyper-Threading Technology • Replicated Resources • Control registers (architectural registers, AR) • 8 general-purpose registers (AR) • Machine state registers (AR) • Debug registers (AR) • Instruction pointers (IP) • Register renaming tables (RNT) • Return stack predictor (RSP)

Hyper-Threading Technology • Replicated Resources • ARs are used by the operating system and application code to control program behavior and store data for computations. • The IP and RNT are replicated to simultaneously track execution and state changes of the two logical processors. • The RSP is replicated to improve branch prediction of return instructions.

Hyper-Threading Technology • Partitioned Resources • Re-order buffers (ROBs) • Load/store buffers • Various queues, like the scheduling and µop queues

Hyper-Threading Technology • Partitioned Resources • Partitioning provides: • operational fairness • the ability for operations from one logical processor to bypass operations of the other logical processor that may have stalled. • For example: a cache miss, a branch misprediction, or instruction dependencies may prevent a logical processor from making forward progress for some number of cycles. The partitioning prevents the stalled logical processor from blocking forward progress.
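
A minimal sketch of why partitioning protects forward progress: each logical processor may occupy at most half of a buffer, so a stalled one cannot fill it. The class name, the even split, and the 8-entry size are assumptions chosen for illustration.

```python
class PartitionedQueue:
    """Toy buffer split evenly between two logical processors (LP0, LP1)."""

    def __init__(self, total_entries):
        self.limit = total_entries // 2  # entries available to each LP
        self.entries = {0: [], 1: []}

    def try_insert(self, lp, uop):
        """Insert a µop for logical processor lp; False if its half is full."""
        if len(self.entries[lp]) >= self.limit:
            return False
        self.entries[lp].append(uop)
        return True

    def drain(self, lp):
        """Remove and return the oldest µop of lp, or None if it has none."""
        return self.entries[lp].pop(0) if self.entries[lp] else None


queue = PartitionedQueue(total_entries=8)

# LP0 is stalled (say, on a cache miss) and keeps inserting without draining.
lp0_results = [queue.try_insert(0, f"lp0-uop{i}") for i in range(5)]

# LP1 still finds room, because LP0 is confined to its own half.
lp1_accepted = queue.try_insert(1, "lp1-uop0")
```

With a fully shared buffer, the fifth LP0 insert would have succeeded and LP0 could eventually have occupied every entry, starving LP1.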

Hyper-Threading Technology • Shared Resources • Caches: trace cache, L1, L2, L3 • Execution units are fully shared to improve the dynamic utilization of the resource.

Microarchitecture Pipeline and Hyper-Threading Technology • Front End Pipeline • Execution trace cache access is arbitrated between the two logical processors every clock. If a cache line is fetched for one logical processor in one clock cycle, the next clock cycle a line would be fetched for the other logical processor, provided that both logical processors are requesting access to the trace cache. • If one logical processor is stalled or is unable to use the execution trace cache, the other logical processor can use the full bandwidth of the trace cache.
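
The alternating policy described above can be sketched as a two-way round-robin arbiter. The function names and the request-trace format are hypothetical.

```python
def arbitrate(requests, last_granted):
    """Pick which logical processor accesses the trace cache this clock.

    requests is (lp0_requesting, lp1_requesting).
    Returns 0, 1, or None if neither is requesting.
    """
    other = 1 - last_granted
    if requests[other]:
        return other          # alternate when the other LP wants access
    if requests[last_granted]:
        return last_granted   # otherwise the same LP keeps the bandwidth
    return None


def simulate(request_trace, last_granted=1):
    """Run the arbiter over a list of per-clock request pairs."""
    grants = []
    for requests in request_trace:
        winner = arbitrate(requests, last_granted)
        grants.append(winner)
        if winner is not None:
            last_granted = winner
    return grants


both_active = simulate([(True, True)] * 4)    # the two LPs alternate
lp1_stalled = simulate([(True, False)] * 3)   # LP0 gets every clock
```

The same shape of policy applies wherever the deck describes alternation between logical processors with fallback to full bandwidth when one is idle.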

Microarchitecture Pipeline and Hyper-Threading Technology • Front End Pipeline • After fetching the instructions and building traces of µops, the µops are placed in a queue. This queue decouples the execution trace cache from the register rename pipeline stage. If both logical processors are active, the queue is partitioned so that both logical processors can make independent forward progress.

Microarchitecture Pipeline and Hyper-Threading Technology • Execution Core • The core can dispatch up to six µops per cycle, provided the µops are ready to execute. Once the µops are placed in the queues waiting for execution, there is no distinction between instructions from the two logical processors. • After execution, instructions are placed in the re-order buffer. The re-order buffer decouples the execution stage from the retirement stage. The re-order buffer is partitioned such that each logical processor uses half the entries.

Microarchitecture Pipeline and Hyper-Threading Technology • Retirement • The retirement logic tracks when instructions from the two logical processors are ready to be retired. It retires the instructions in program order for each logical processor by alternating between the two logical processors. If one logical processor is not ready to retire any instructions, then all retirement bandwidth is dedicated to the other logical processor. • Once stores have retired, the processor needs to write the store data into the level-one data cache. Selection logic alternates between the two logical processors to commit store data to the cache.

Streaming SIMD Extensions 3 (SSE3) • Beginning with the Pentium II and Pentium with Intel MMX technology processor families, four extensions have been introduced into the IA-32 architecture to permit IA-32 processors to perform single-instruction multiple-data (SIMD) operations. • These extensions include the MMX technology, SSE extensions, SSE2 extensions, and SSE3 extensions.

Streaming SIMD Extensions 3 (SSE3) Figure 5. Typical SIMD Operations

Streaming SIMD Extensions 3 (SSE3) • MMX™ Technology • MMX Technology introduced: • 64-bit MMX registers • support for SIMD operations on packed byte, word, and doubleword integers • MMX instructions are useful for multimedia and communications software.
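
A packed-byte SIMD operation treats one 64-bit register as eight independent 8-bit lanes. The sketch below emulates that in plain Python; the helper names are hypothetical. The two add variants mirror the wraparound and unsigned-saturating behaviour that MMX offers (PADDB and PADDUSB respectively).

```python
def unpack_bytes(value64):
    """Split a 64-bit integer into eight 8-bit lanes, lowest lane first."""
    return [(value64 >> (8 * i)) & 0xFF for i in range(8)]


def pack_bytes(lanes):
    """Combine eight 8-bit lanes back into one 64-bit integer."""
    result = 0
    for i, lane in enumerate(lanes):
        result |= (lane & 0xFF) << (8 * i)
    return result


def packed_add_wrap(a, b):
    """Lane-wise add; each lane wraps around modulo 256."""
    lanes = [(x + y) & 0xFF for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(lanes)


def packed_add_saturate(a, b):
    """Lane-wise add; each lane clamps at 255 instead of wrapping."""
    lanes = [min(x + y, 0xFF) for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(lanes)
```

For example, adding 0x0101 to 0x00FF gives 0x0100 with wraparound and 0x01FF with saturation: the carry out of lane 0 never spills into lane 1. Saturation is what makes these operations suit pixel and audio data, where overflow should clip rather than wrap.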

Streaming SIMD Extensions 3 (SSE3) • Streaming SIMD Extensions • Streaming SIMD extensions introduced: • 128-bit XMM registers • data prefetch instructions • non-temporal store instructions and other cacheability and memory ordering instructions • extra 64-bit SIMD integer support • SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding.

Streaming SIMD Extensions 3 (SSE3) • Streaming SIMD Extensions 2 • Streaming SIMD extensions 2 add the following: • support for SIMD arithmetic on 64-bit integer operands • instructions for converting between new and existing data types • extended support for data shuffling • extended support for cacheability and memory ordering operations • SSE2 instructions are useful for 3D graphics, video decoding/encoding, and encryption.

Streaming SIMD Extensions 3 (SSE3) • Streaming SIMD Extensions 3 • Streaming SIMD extensions 3 add the following: • SIMD floating-point instructions for asymmetric and horizontal computation • a special-purpose 128-bit load instruction to avoid cache line splits • instructions to support thread synchronization • SSE3 instructions are useful for scientific, video and multi-threaded applications.
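
"Horizontal" and "asymmetric" are easiest to see lane by lane. The sketch below emulates, on plain 4-element Python lists, the lane arithmetic of the SSE3 single-precision horizontal add (HADDPS) and add/subtract (ADDSUBPS) instructions. The function names are hypothetical and this is an emulation of the semantics, not a use of the instructions themselves.

```python
def horizontal_add(a, b):
    """Add adjacent pairs within each operand (HADDPS lane pattern)."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def add_sub(a, b):
    """Subtract in even lanes, add in odd lanes (ADDSUBPS lane pattern)."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

pair_sums = horizontal_add(a, b)
mixed = add_sub(a, b)
```

Earlier SIMD arithmetic combined lane i of one operand with lane i of the other. Horizontal operations combine neighbours inside one register, and the add/subtract pattern matches the mixed signs that appear in complex multiplication.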

Figure 6. Summary of SIMD Technologies

Enhanced Intel SpeedStep® technology • Enables real-time dynamic switching between multiple voltage and operating frequency points. • The processor features the Auto Halt, Stop Grant, Deep Sleep, and Deeper Sleep low power states.
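
Switching voltage together with frequency matters because of the standard first-order CMOS approximation that dynamic power scales with capacitance times voltage squared times frequency. The sketch below applies that approximation; the formula is a general rule of thumb rather than something stated in the slides, and the two operating points are hypothetical values chosen for easy arithmetic, not figures from the processor datasheet.

```python
def relative_dynamic_power(voltage, frequency, ref_voltage, ref_frequency):
    """Dynamic power relative to a reference point, using P ~ C * V^2 * f.

    Switched capacitance C is assumed equal at both points and cancels.
    """
    return (voltage / ref_voltage) ** 2 * (frequency / ref_frequency)


# Hypothetical operating points (volts, GHz), for illustration only.
HIGH_POINT = (1.25, 3.2)
LOW_POINT = (1.00, 1.6)

ratio = relative_dynamic_power(LOW_POINT[0], LOW_POINT[1],
                               HIGH_POINT[0], HIGH_POINT[1])
```

With these assumed numbers, halving the frequency alone would halve dynamic power, while also lowering the voltage by 20% brings it to roughly a third of the reference.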

Enhanced Intel SpeedStep® technology • The processor includes an address bus powerdown capability, which removes power from the address and data pins when the FSB is not in use.

Conclusion • The deeply pipelined, 20-stage NetBurst microarchitecture achieves higher clock rates. • Hyper-Threading Technology improves performance with very little additional die area. • SSE3 offers 13 instructions that accelerate performance of Streaming SIMD Extensions technology.

References • http://www.intel.com/design/pentium4/manuals/index_new.htm - IA-32 Intel® Architecture Software Developer's Manual, Volume 1: Basic Architecture - IA-32 Intel® Architecture Optimization Reference Manual • http://www.intel.com/design/mobile/datashts/302424.htm • http://arstechnica.com/articles/paedia/cpu.ars • http://www.extremenano.com/print_article/PC+Processor+Microarchitecture/1621.aspx

Questions?
