The Pentium 4
CPSC 321
Andreas Klappenecker
Today’s Menu
Advanced Pipelining
Brief overview of the Pentium 4
Instruction Level Parallelism
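Pipelining exploits the potential parallelism among instructions. There are two main methods to increase the potential amount of parallelism: increasing the depth of the pipeline, and replicating components so that multiple instructions can be launched in every pipeline stage (multiple issue).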
Suppose that the washer cycle is longer than the other cycles. We can divide our washer into three machines that perform the wash, rinse, and spin steps of a traditional washer.
(This moves the laundry from a four-stage to a six-stage pipeline.)
A multiple issue laundry would replace our household washer and dryer with, say, three washers and three dryers.
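The effect of both methods can be sketched with an idealized timing model. This is only a rough illustration: the function names, the stage counts, and the cycle times below are assumptions chosen to mirror the laundry analogy, and the model ignores hazards and stalls entirely.

```python
def pipeline_cycles(n_items, stages, issue_width=1):
    """Ideal cycle count: fill the pipeline once, then finish
    `issue_width` items per cycle (no hazards, no stalls)."""
    if n_items == 0:
        return 0
    groups = -(-n_items // issue_width)  # ceiling division
    return stages + groups - 1


def total_time(n_items, stages, cycle_time, issue_width=1):
    """Total time in the same (arbitrary) units as cycle_time."""
    return pipeline_cycles(n_items, stages, issue_width) * cycle_time


# Hypothetical numbers: the slow washer forces a cycle time of 3 units
# in the four-stage laundry; splitting it into wash, rinse, and spin
# gives six stages with a cycle time of 1 unit.
baseline = total_time(100, stages=4, cycle_time=3)
deeper = total_time(100, stages=6, cycle_time=1)
# Three washers and three dryers: issue three loads per cycle.
wider = total_time(100, stages=4, cycle_time=3, issue_width=3)
```

Under these assumed numbers both the deeper pipeline and the three-wide laundry finish 100 loads in roughly a third of the baseline time, which is the point of the two methods.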
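We have two different approaches to multiple-issue processors: static multiple issue, where the decisions are made by the compiler before execution, and dynamic multiple issue, where the decisions are made by the processor during execution.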
The most important method to exploit more ILP is speculation. The compiler or the processor guesses the properties of an instruction so that execution can begin for other instructions that depend on it.
For example, a compiler can use speculation to reorder instructions and move an instruction across a branch.
A compiler can get more performance from loops by so-called loop unrolling. In this technique, multiple copies of the loop body are made, which exposes more ILP by overlapping instructions from different iterations.
During loop unrolling, the compiler will usually introduce additional registers to eliminate dependences that are not true data dependences (they are only name dependences). This process is called register renaming.
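A minimal sketch of loop unrolling with renaming, written in Python purely for illustration. A real compiler performs this transformation on machine registers; the function names and the temporaries t0 to t3 are hypothetical stand-ins for those registers.

```python
def add_scalar_rolled(a, s):
    """Original loop: one temporary `t` is reused in every iteration."""
    out = list(a)
    for i in range(len(out)):
        t = out[i]   # load
        t = t + s    # add
        out[i] = t   # store
    return out


def add_scalar_unrolled4(a, s):
    """Loop unrolled four times. Each copy of the body gets its own
    temporary (t0..t3) instead of reusing `t`, so the four copies share
    no names and could be overlapped by a multiple-issue machine."""
    out = list(a)
    n = len(out)
    limit = n - n % 4
    for i in range(0, limit, 4):
        t0 = out[i]          # four independent loads
        t1 = out[i + 1]
        t2 = out[i + 2]
        t3 = out[i + 3]
        t0 = t0 + s          # four independent adds
        t1 = t1 + s
        t2 = t2 + s
        t3 = t3 + s
        out[i] = t0          # four independent stores
        out[i + 1] = t1
        out[i + 2] = t2
        out[i + 3] = t3
    for i in range(limit, n):  # leftover iterations
        out[i] = out[i] + s
    return out
```

If the unrolled body had reused a single temporary, each copy would have to wait for the previous one to finish with that name, even though no value flows between them. Renaming removes exactly that constraint.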
Figure: Intel processor and branding timeline (first Intel Inside® brand TV ad, processor with MMX™ technology, Pentium® II processor, Pentium® II Xeon™ processor, and Xeon™ processors). Slide courtesy of Intel.
Graphic and picture courtesy of Tom’s Hardware Guide.
System bus bandwidth: 8 bytes × 100 million cycles/s × 4 transfers per cycle = 3,200 MB/s
(This is about 3 times the speed of the system bus of the Pentium 3.)
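The arithmetic above can be checked with a short sketch. The function name is hypothetical, and the Pentium 3 figures (a 133 MHz bus with one 8-byte transfer per cycle) are an assumption used only to reproduce the "about 3 times" comparison; they are not given on the slide.

```python
def bus_bandwidth_mb_per_s(width_bytes, clock_mhz, transfers_per_clock):
    """Peak bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return width_bytes * clock_mhz * transfers_per_clock


# Pentium 4 figures from the slide: 8 bytes, 100 MHz, 4 transfers per clock.
p4 = bus_bandwidth_mb_per_s(8, 100, 4)

# ASSUMPTION: Pentium 3 bus at 133 MHz with one transfer per clock.
p3 = bus_bandwidth_mb_per_s(8, 133, 1)

ratio = p4 / p3
```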
(The cache is 8-way set associative with 128-byte cache lines, each divided into two 64-byte blocks that are read in one burst; the read latency is 7 clock cycles. We come back to such issues later.)
(This allows the processor to guess and fetch data that will presumably be used next, which is good for streaming video applications.)
Actual program instructions
Trace cache can contain instructions of both branches
The branch prediction aids the execution trace cache; it has a fairly large branch target buffer
The Pentium 4 can operate on 128-bit data with single instruction, multiple data (SIMD) instructions.
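The idea of one instruction operating on several data elements packed into 128 bits can be modeled as follows. This is a software sketch only, not the Pentium 4 instruction set: the helper names are hypothetical, and the choice of four unsigned 32-bit lanes with wraparound addition is an assumption made for the example.

```python
import struct

MASK32 = 0xFFFFFFFF  # keep each lane within 32 bits


def pack128(lanes):
    """Pack four unsigned 32-bit integers into one 16-byte (128-bit) value."""
    return struct.pack("<4I", *lanes)


def unpack128(reg):
    """Split a 16-byte value back into its four 32-bit lanes."""
    return list(struct.unpack("<4I", reg))


def simd_add_u32(a, b):
    """One 'instruction': add all four lanes, each wrapping independently."""
    return pack128([(x + y) & MASK32
                    for x, y in zip(unpack128(a), unpack128(b))])


a = pack128([1, 2, 3, 0xFFFFFFFF])
b = pack128([10, 20, 30, 1])
result = unpack128(simd_add_u32(a, b))
```

Note that the overflow in the last lane does not carry into its neighbor; each lane behaves like a separate 32-bit adder.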
A typical thread of code of the IA-32 architecture uses about 35% of the microarchitecture execution resources.
Intel added a little bit of hardware to schedule and control two threads.
The operating system sees two logical processors