The Pentium 4 CPSC 321

Andreas Klappenecker

Today’s Menu

Advanced Pipelining

Brief overview of the Pentium 4

Instruction Level Parallelism

Pipelining exploits the potential parallelism among instructions. There are two main methods to increase the potential amount of parallelism:

  • Increase the depth of the pipeline to overlap more instructions

  • Replicate the internal components of the computer so that it can launch multiple instructions in every pipeline stage

Washer-Dryer Example

Suppose that the washer cycle is longer than the other cycles. We can divide our washer into three machines that perform the wash, rinse, and spin steps of a traditional washer.

(This moves from a four-stage to a six-stage pipeline.)

A multiple issue laundry would replace our household washer and dryer with, say, three washers and three dryers.
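
The effect of both changes can be estimated with a small idealized model. The sketch below is only an illustration: the function name `laundry_time` and the stage times (a washer that takes 3 time units, every other step 1 unit) are hypothetical values chosen for this example, and stalls are ignored.

```python
import math

def laundry_time(loads, stage_times, width=1):
    """Total time to finish `loads` on an idealized pipeline without stalls.

    stage_times: duration of each pipeline stage (hypothetical units).
    width: number of loads started together (multiple issue).
    """
    if loads == 0:
        return 0
    cycle = max(stage_times)            # the slowest stage sets the clock period
    groups = math.ceil(loads / width)   # groups of loads that enter together
    cycles = len(stage_times) + groups - 1
    return cycles * cycle

# Four stages, with a slow washer (assumed: 3 units versus 1 unit elsewhere).
four_stage = laundry_time(12, [3, 1, 1, 1])
# Washer split into wash, rinse, and spin: six balanced stages.
six_stage = laundry_time(12, [1, 1, 1, 1, 1, 1])
# Original four stages, but three washers and three dryers (issue width 3).
multi_issue = laundry_time(12, [3, 1, 1, 1], width=3)

print(four_stage, six_stage, multi_issue)
```

Under these assumed numbers, the deeper pipeline helps because the clock period shrinks, while multiple issue helps because fewer groups pass through the pipeline.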

Multiple-Issue Processors

We have two different approaches to multiple-issue processors:

  • The approach to decide at compile time which instructions should be issued is called static multiple issue

  • The approach to decide at execution time which instructions should be issued is called dynamic multiple issue

Multiple Issues with Multiple-Issue

  • Package instructions into issue slots: How does the processor determine how many instructions and which instructions can be issued in a given clock cycle?

  • Dealing with data and control hazards: In static issue processors, some or all consequences of these hazards are handled statically by the compiler. Dynamic issue processors attempt to alleviate at least some classes of hazards using hardware techniques
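
As a rough illustration of the issue-slot question, the following sketch packs instructions into issue packets in program order. It is a simplified model, not the scheme of any particular processor: the function `pack_issue_slots`, the tuple format `(name, dest, sources)`, and the two-wide issue width are assumptions made for this example. An instruction starts a new packet when the current one is full or when it reads or writes a register written earlier in the same packet.

```python
def pack_issue_slots(instrs, width=2):
    """Greedy in-order packing of instructions into issue packets.

    instrs: list of (name, dest_register, [source_registers]).
    """
    packets, current, written = [], [], set()
    for name, dest, srcs in instrs:
        # Hazard inside the packet: reading or rewriting a register
        # that an earlier instruction of the same packet writes.
        conflict = dest in written or any(s in written for s in srcs)
        if current and (len(current) == width or conflict):
            packets.append(current)
            current, written = [], set()
        current.append(name)
        written.add(dest)
    if current:
        packets.append(current)
    return packets

program = [
    ("i1", "r1", ["r2", "r3"]),
    ("i2", "r4", ["r1", "r5"]),   # needs r1 from i1, so it cannot join i1
    ("i3", "r6", ["r7", "r8"]),   # independent of i2, shares its packet
    ("i4", "r9", ["r6", "r4"]),   # packet is full, starts a new one
]
packets = pack_issue_slots(program)
print(packets)
```

A static multiple-issue design would run a pass of this kind in the compiler; a dynamic design makes the equivalent decision in hardware every clock cycle.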

Speculation

The most important method to exploit more ILP is speculation. The compiler or the processor guesses the properties of an instruction so that instructions depending on it can start executing early.

For example, a compiler can use speculation to reorder instructions and move instructions beyond a branch.

Recovery from Wrong Speculations

  • Speculation in software: the compiler inserts additional instructions that check the accuracy of a speculation and provide a fix-up routine for the case that the speculation was incorrect.

  • Speculation in hardware: The processor usually buffers the results until it knows that they are no longer speculative. If the speculation was correct, then the instructions are completed by allowing the contents to be written to registers or memory; otherwise the buffers are flushed and the correct instruction sequence is re-executed.
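
The hardware scheme described above can be sketched as a buffer that holds results until the speculation is resolved. This is a minimal model under simplifying assumptions: the class `SpeculativeBuffer` and its method names are hypothetical, registers are a plain dictionary, and re-execution of the correct instruction sequence is left to the caller.

```python
class SpeculativeBuffer:
    """Holds speculative results until they may be committed."""

    def __init__(self, registers):
        self.registers = registers   # architectural state (dict: name -> value)
        self.pending = []            # buffered (dest, value) results

    def execute(self, dest, value):
        # The result is buffered; architectural state is not touched yet.
        self.pending.append((dest, value))

    def resolve(self, speculation_correct):
        if speculation_correct:
            # Commit: write buffered results to the registers in order.
            for dest, value in self.pending:
                self.registers[dest] = value
        # In both cases the buffer is emptied; on a wrong guess this is
        # the flush, and the correct path must then be executed again.
        self.pending.clear()

regs = {"r1": 0}
buf = SpeculativeBuffer(regs)
buf.execute("r1", 5)
buf.resolve(speculation_correct=False)   # wrong guess: r1 stays 0
after_flush = regs["r1"]
buf.execute("r1", 7)
buf.resolve(speculation_correct=True)    # correct guess: r1 becomes 7
after_commit = regs["r1"]
```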

Register Renaming

A compiler can get more performance from loops by so-called loop unrolling, a technique in which multiple copies of the loop body are made. This exposes more ILP by overlapping instructions from different iterations.

During loop unrolling, the compiler usually introduces additional registers to eliminate dependences that are not true data dependences (they are only name dependences). This process is called register renaming.
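
The idea of renaming can be sketched in a few lines. The function `rename`, the `(dest, sources)` instruction format, and the fresh register names `p0, p1, ...` are assumptions made for illustration; a real compiler or processor works with a finite register pool.

```python
from itertools import count

def rename(instrs):
    """Give every written register a fresh name; sources follow the mapping.

    instrs: list of (dest_register, [source_registers]).
    """
    fresh = (f"p{i}" for i in count())
    mapping = {}                      # current name of each original register
    renamed = []
    for dest, srcs in instrs:
        new_srcs = [mapping.get(s, s) for s in srcs]  # true dependences are kept
        new_dest = next(fresh)                        # name dependences disappear
        mapping[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed

# Two unrolled iterations that both reuse the temporary register t0.
unrolled = [
    ("t0", ["x"]),
    ("y",  ["t0"]),
    ("t0", ["z"]),    # reuses t0: only a name dependence on the first copy
    ("w",  ["t0"]),
]
renamed = rename(unrolled)
print(renamed)
```

After renaming, the second iteration no longer shares a register with the first, so the two copies can be overlapped freely.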

Intel’s History

(Timeline figure. Processors shown: Intel Pentium® Pro Processor, Intel Pentium® Processor with MMX™ technology, Pentium® II Processor, Pentium® II Xeon™ Processor, Celeron™ Processor, Pentium® III and Xeon™ Processors, Pentium® 4 Processor. Other labels: 100 Mbit E-Net Card, 1 Gbit E-Net Card, Exchange Architecture, First Intel, First DRAM, Intel Inside®, 1st Pb-Free, First Intel Inside® Brand TV Ad.)

Slide courtesy of Intel

The Pentium 4 Architecture

Graphic courtesy of Tom’s hardware guide

A Glance at a Pentium 4 Chip

Picture courtesy of Tom’s hardware guide

Pentium 4

  • The Pentium 4 was first released in 2000. Some of its features are:

    • fast system bus

    • advanced transfer cache

    • advanced dynamic execution (execution trace cache and enhanced branch prediction)

    • “hyper” pipeline technology

    • rapid execution engine

    • enhanced floating point and multimedia (SSE2)

Some Features

  • The processor uses micro-operations/operands

    • simple instructions of unified length

    • easier sequencing than variable length x86 instr.

    • understood by the execution units

    • the length is not exactly small

System Bus

  • The system bus is clocked at 100 MHz, is 64 bits wide, and is “quad-pumped”, meaning that it can transfer

    8 bytes × 100 million/s × 4 = 3,200 MB/s

    (this is about 3 times the speed of the system bus of the Pentium 3)

  • Intel introduced the 850 chipset to sustain high data exchange rates between processor and system
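
The bandwidth figure for the system bus can be reproduced with a short calculation. The helper `bus_bandwidth_mb_per_s` is a hypothetical name, and the Pentium 3 comparison assumes a 133 MHz, 64-bit bus with one transfer per clock, which the slide does not state.

```python
def bus_bandwidth_mb_per_s(clock_mhz, width_bits, transfers_per_clock):
    """Peak bandwidth in MB/s (1 MB taken as 10**6 bytes)."""
    bytes_per_transfer = width_bits // 8
    return clock_mhz * bytes_per_transfer * transfers_per_clock

# Pentium 4: 100 MHz, 64 bits wide, quad-pumped.
p4 = bus_bandwidth_mb_per_s(100, 64, 4)
# Pentium 3 (assumed here: 133 MHz, 64 bits wide, one transfer per clock).
p3 = bus_bandwidth_mb_per_s(133, 64, 1)

print(p4, p3, round(p4 / p3, 1))
```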

Data Caches

  • Data passes a level 2 cache (256 KB)

    (8-way associative; the 128-byte cache lines are divided into 64-byte blocks that are read in one burst; read latency is 7 clock cycles; we come back to such issues later)

  • Data passes a small level 1 cache (8 KB)

  • Hardware pre-fetch unit

    (allows the processor to guess and fetch some data that is presumably used next; good for streaming video applications).
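
From the level 2 parameters above one can derive the number of sets and the way an address would be split. This is a sketch under the assumption of conventional set-associative indexing with the 128-byte line as the unit; the names `split_address`, `OFFSET_BITS`, and `INDEX_BITS` are introduced only for this example, and the real hardware may organize its lookup differently.

```python
CACHE_BYTES = 256 * 1024   # level 2 cache size
WAYS = 8                   # 8-way associative
LINE_BYTES = 128           # cache line size

# Each set holds WAYS lines, so sets = size / (ways * line size).
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # log2(128) = 7
INDEX_BITS = SETS.bit_length() - 1          # log2(number of sets)

def split_address(addr):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(SETS, split_address(0x12345))
```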

Execution Pipeline: The Trace Cache

  • The Pentium 4 does not use an L1 instruction cache, but rather an “execution trace cache”.

  • Note that the decoding of x86 instructions is much more complex than on MIPS

  • The execution trace cache is basically an instruction cache after the decoding unit (which generates the micro-operations), so that decoding does not have to be repeated.

  • Supplies next pipeline stage with 6 micro-operations every 2 clock cycles.
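
The benefit of caching after the decoder can be illustrated with a toy model. Everything in this sketch is a simplification invented for the example: the class `TraceCache` only remembers decoded micro-operations per address, whereas the real trace cache stores sequences of micro-operations along the predicted execution path, and `toy_decode` is a made-up decoder, not the x86 one.

```python
def toy_decode(instr):
    """Made-up decoder: splits a memory-destination instruction into micro-ops."""
    op, operands = instr.split(" ", 1)
    dst, src = [s.strip() for s in operands.split(",")]
    if dst.startswith("["):
        return [f"load tmp, {dst}", f"{op} tmp, {src}", f"store {dst}, tmp"]
    return [f"{op} {dst}, {src}"]

class TraceCache:
    """Caches decoded micro-operations so that decoding is not repeated."""

    def __init__(self, decode):
        self.decode = decode
        self.lines = {}      # address -> list of micro-operations
        self.decodes = 0     # how often the (expensive) decoder ran

    def fetch(self, address, raw_instruction):
        if address not in self.lines:
            self.decodes += 1
            self.lines[address] = self.decode(raw_instruction)
        return self.lines[address]

cache = TraceCache(toy_decode)
for _ in range(3):                       # a loop body executed three times
    uops = cache.fetch(0x400, "add [m], r1")
print(uops, cache.decodes)
```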

The Trace Cache

Actual program instructions

Trace cache can contain instructions of both branches

The Pipeline

The branch prediction aids the execution trace cache; it has a fairly large branch target buffer

  • The 20 stage hyper pipeline

  • The pipeline can keep up to 126 instructions in flight

The Pipeline

Trace cache

Rapid Execution Engine

  • The rapid execution engine consists of two ALUs and two AGUs that run at twice the clock speed.

  • Not every instruction can be processed by the rapid execution engine; instructions that cannot must use, e.g., the slower ALU

  • AGU = address generation unit, which computes the correct address for a load or store (used whenever you have indirect addressing such as a[i]).
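
The work of an AGU for an access such as a[i] amounts to one multiply-add. The function `effective_address` and the sample base address are hypothetical and serve only to show the arithmetic.

```python
def effective_address(base, index, element_size):
    """Address of element `index` in an array that starts at `base`."""
    return base + index * element_size

# a[3] in an array of 4-byte integers that starts at 0x1000 (assumed values).
addr = effective_address(0x1000, 3, 4)
print(hex(addr))
```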

Streaming SIMD Extensions SSE2

The Pentium 4 can operate on 128 bit data as

  • 4 single precision FP values (SSE)

  • 2 double precision FP values (SSE2)

  • 16 byte values (SSE2)

  • 8 word values (SSE2)

  • 4 double word values (SSE2)

  • 2 quad word values

  • 1 128-bit value

    using single-instruction, multiple-data (SIMD) instructions
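
The lane counts listed above follow directly from dividing 128 bits by the element width. The sketch below checks this and models one lane-wise operation in plain Python; it does not use real SSE2 instructions, and the names `LANES` and `packed_add_bytes` are introduced only for illustration.

```python
import struct

REGISTER_BITS = 128

# Element widths in bits for the packed data types listed above.
ELEMENT_BITS = {
    "single precision FP": 32,
    "double precision FP": 64,
    "byte": 8,
    "word": 16,
    "double word": 32,
    "quad word": 64,
}

# Number of lanes = register width / element width.
LANES = {name: REGISTER_BITS // bits for name, bits in ELEMENT_BITS.items()}

def packed_add_bytes(a, b):
    """Model of a lane-wise add on 16 byte values, wrapping around in each lane."""
    assert len(a) == len(b) == 16
    return [(x + y) & 0xFF for x, y in zip(a, b)]

# The same 16 bytes can be viewed as 4 single precision values or 2 quad words.
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
as_quads = struct.unpack("<2Q", raw)

print(LANES)
print(packed_add_bytes([250] * 16, [10] * 16))
```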

Pentium 4 Pipeline

  • Trace cache access, predictor: 5 clock cycles

    • Microoperation queue

  • Reorder buffer allocation, register renaming: 4 clock cycles

    • functional unit queues

  • Scheduling and dispatch unit: 5 clock cycles

  • Register file access: 2 clock cycles

  • Execution: 1 clock cycle

    • reorder buffer

  • Commit: 3 clock cycles (total: 20 clock cycles)
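
The total of 20 clock cycles can be checked by adding up the stage groups listed above. The list name `STAGE_GROUPS` is introduced only for this example.

```python
# (stage group, clock cycles) as listed on the slide.
STAGE_GROUPS = [
    ("trace cache access, predictor", 5),
    ("reorder buffer allocation, register renaming", 4),
    ("scheduling and dispatch unit", 5),
    ("register file access", 2),
    ("execution", 1),
    ("commit", 3),
]

total_cycles = sum(cycles for _, cycles in STAGE_GROUPS)
print(total_cycles)
```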

Pentium 4 Generations

  • Willamette

  • Northwood (smaller transistors, later hyper-threading)

  • Extreme Edition (added 2MB level 3 cache)

  • Prescott (90 nm process, new micro architecture)

  • Irwindale (as Prescott, but with doubled L2 cache)

  • Dual Core

Hyper-Threading

A typical thread of code of the IA-32 architecture uses about 35% of the microarchitecture execution resources.

Intel added a little bit of hardware to schedule and control two threads.

The operating system sees two logical processors

To Probe Further

  • Read Chapter 6

  • Hennessy and Patterson, Computer Architecture: A Quantitative Approach

  • Intel website

  • AMD website