Exploring P4 Trace Cache Features

Ed Carpenter, Marsha Robinson, Jana Wooten

Problem Statement
• Explore characteristics of the P4 trace cache using microbenchmarks and performance counters related to branching and the trace cache

Approach
• Determine characteristics of the Pentium 4 processor that will help us evaluate the P4’s trace cache
• Use a performance monitoring tool (Intel’s VTune Performance Analyzer) to measure the data we need, then analyze it to find the limitations of the trace cache

Some P4 Characteristics
• Like most high-performance processors, the P4 has special on-chip hardware for performance monitoring. This hardware typically includes:
  • Event detectors and counters
  • Qualification of event detection and counting by privilege mode and event characteristics
  • Support for event-based sampling

P4 Characteristics cont.
• Common performance-monitoring problems faced by modern processors:
  • Small number of counters
  • Inability to distinguish between speculative and non-speculative events
  • Imprecise event-based sampling
• With 42 million transistors (compared to the P3’s 28 million), the P4 has overcome these problems:
  • 48 event detectors and 18 event counters
  • Provides instruction tagging to enable counting of non-speculative performance events
  • Provides support for both imprecise event-based sampling (IEBS) and precise event-based sampling (PEBS)

Trace Cache
• Special instruction cache for capturing long dynamic instruction sequences
• Each line stores a snapshot, or trace, of the dynamic instruction stream
• The P4 executes stored traces whenever the needed code hits in the trace cache, its L1 instruction cache (over 90% of the time)

Characteristics of Trace Cache
• Stores instructions after they have already been decoded into μops (“micro-ops”)
  • μops: RISC-style instructions
• Cache line size: 6 μops
• Trace cache size: 12K μops
• Branch prediction hardware is used
  • It knows about any branch and fetches the instructions that follow the branch
• Conditional branches can cause problems
  • A misprediction is not detected until the branch condition is checked in ALU0
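The two size figures above imply a line count. The following is a minimal sketch of that arithmetic, assuming "12K" means 12 × 1024 μops and that lines are packed full; the function names are hypothetical and real trace lines may hold fewer than 6 μops.

```c
#include <assert.h>
#include <stdio.h>

/* Figures quoted on the slide; "12K" is assumed to mean 12 * 1024. */
enum { TRACE_CACHE_UOPS = 12 * 1024, TRACE_LINE_UOPS = 6 };

/* Number of trace lines implied by a capacity and a line size (both in uops). */
unsigned trace_cache_lines(unsigned capacity_uops, unsigned line_uops)
{
    assert(line_uops > 0);
    return capacity_uops / line_uops;
}

/* Lines needed to hold a straight-line run of n uops, rounded up.
 * Assumes every line is packed full, which real hardware may not do. */
unsigned lines_for_uops(unsigned n_uops, unsigned line_uops)
{
    assert(line_uops > 0);
    return (n_uops + line_uops - 1) / line_uops;
}

/* Print the derived geometry for the slide's numbers. */
void print_trace_cache_geometry(void)
{
    printf("%u lines of %d uops\n",
           trace_cache_lines(TRACE_CACHE_UOPS, TRACE_LINE_UOPS),
           TRACE_LINE_UOPS);
}
```

Under these assumptions the slide's numbers work out to 2048 lines, so a loop body only starts to overflow the trace cache once it decodes to more than 12,288 μops.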

Entering The Execution Pipeline - Pentium 4's Trace Cache Tom’s Hardware Guide http://www6.tomshardware.com/cpu/20001120/p4-06.html

Advantages of Trace Cache
• More efficient use of limited cache space
• Trace cache lines contain both branch instructions and the code after the branch instruction
• No extra latency for branches
• Does not use a TLB check

The P4’s Critical Execution Path: “Execute Mode” (when the needed code is in the L1 cache)

Execute Mode vs. Trace Segment Build Mode
• Execute Mode
  • Trace cache feeds stored traces to the execution logic to be executed
  • Trace cache normally runs in this mode
• Trace Segment Build Mode
  • Used when there is an L1 cache miss
  • Front end fetches x86 code from the L2 cache
  • Translates it into μops
  • Builds a “trace segment” with it
  • Loads that segment into the trace cache to be executed
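The switch between the two modes can be sketched as a toy model. This is an illustration only: the direct-mapped layout, the 8-line size, and all names are hypothetical and do not describe the real P4 organization.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_LINES 8   /* hypothetical toy size, not the real P4 geometry */

typedef enum { MODE_EXECUTE, MODE_BUILD } tc_mode;

typedef struct {
    bool     valid[TOY_LINES];
    unsigned tag[TOY_LINES];   /* start address of the trace held in each line */
    unsigned hits;             /* fetches served in execute mode */
    unsigned builds;           /* fetches that forced trace segment build mode */
} toy_trace_cache;

void tc_init(toy_trace_cache *tc)
{
    assert(tc != NULL);
    memset(tc, 0, sizeof *tc);
}

/* Look up the trace starting at addr and report which mode serves it.
 * On a miss the model "builds" the trace (standing in for fetching x86
 * code from L2, decoding it to uops, and filling the line), so the next
 * fetch of the same address hits. */
tc_mode tc_fetch(toy_trace_cache *tc, unsigned addr)
{
    assert(tc != NULL);
    unsigned idx = addr % TOY_LINES;

    if (tc->valid[idx] && tc->tag[idx] == addr) {
        tc->hits++;
        return MODE_EXECUTE;
    }
    tc->valid[idx] = true;
    tc->tag[idx] = addr;
    tc->builds++;
    return MODE_BUILD;
}
```

In this model the first fetch of an address reports MODE_BUILD and later fetches report MODE_EXECUTE, until a conflicting address evicts the line.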

Branch Prediction
• For x86 code with a branch in it:
  • The trace cache builds a trace from instructions up to and including the branch instruction
  • Then picks which branch it thinks the program will take
  • Continues to build the trace along that speculative branch
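Building a trace along the predicted path can be sketched as follows. This is a toy model under stated assumptions: instructions are array indices, each branch carries a fixed prediction bit, and the type and function names are hypothetical rather than anything the P4 exposes.

```c
#include <assert.h>
#include <stdbool.h>

/* One instruction in a toy program; the index in the array is its address. */
typedef struct {
    bool is_branch;       /* true for a conditional branch */
    bool predict_taken;   /* what the (assumed) predictor says for this branch */
    int  target;          /* branch target index, used only when taken */
} toy_insn;

/* Follow the program from start, recording each visited index in trace.
 * At a branch, the branch itself is recorded and the walk continues along
 * the predicted direction, so the trace holds the speculative path.
 * Stops at max_len entries or when the path leaves the program.
 * Returns the number of entries written. */
int build_trace(const toy_insn *prog, int n, int start, int *trace, int max_len)
{
    assert(prog != 0 && trace != 0);
    int len = 0;
    int pc = start;

    while (pc >= 0 && pc < n && len < max_len) {
        trace[len++] = pc;
        if (prog[pc].is_branch && prog[pc].predict_taken)
            pc = prog[pc].target;   /* continue at the predicted target */
        else
            pc = pc + 1;            /* fall through */
    }
    return len;
}
```

If the prediction turns out wrong, the entries recorded after the branch are the ones that would have to be discarded.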

Microcode ROM
• Used by the P4 to process longer instructions
• Allows the regular hardware decoder to concentrate on decoding the smaller, faster instructions
• Stores a sequence of μops for each long instruction encountered
• Inserts a tag into the trace segment that points to the section of the microcode ROM where the μop sequence is held
• The trace cache gives control to the microcode ROM when a tag is encountered, until the proper sequence of μops is produced
• The execution engine does not care whether instructions come from the trace cache or the microcode ROM
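The tag mechanism can be sketched as a stream expansion. This is a toy model: the entry layout, the ROM table, and the integer μop ids are hypothetical and only illustrate that the execution engine sees one flat μop stream.

```c
#include <assert.h>
#include <stdbool.h>

/* A trace entry is either a plain uop or a tag naming a ROM sequence. */
typedef struct {
    bool is_rom_tag;
    int  value;        /* uop id, or index into the ROM table for a tag */
} trace_entry;

/* One stored uop sequence for a long instruction. */
typedef struct {
    const int *uops;
    int        count;
} rom_sequence;

/* Produce the uop stream the execution engine would see: plain uops are
 * copied through, and each tag is replaced by its ROM sequence.
 * Returns the number of uops written, or -1 on a bad tag or a full buffer. */
int expand_trace(const trace_entry *trace, int n,
                 const rom_sequence *rom, int rom_entries,
                 int *out, int max_out)
{
    assert(trace != 0 && rom != 0 && out != 0);
    int len = 0;

    for (int i = 0; i < n; i++) {
        if (!trace[i].is_rom_tag) {
            if (len == max_out) return -1;
            out[len++] = trace[i].value;
            continue;
        }
        int idx = trace[i].value;
        if (idx < 0 || idx >= rom_entries) return -1;
        for (int k = 0; k < rom[idx].count; k++) {
            if (len == max_out) return -1;
            out[len++] = rom[idx].uops[k];
        }
    }
    return len;
}
```

The output array carries no marker of where each μop came from, which mirrors the last bullet above.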

VTune Experiment
for (i = 0; i < 1000000; i++) {   /* 1M iterations */
    _asm {
        mov eax, 10
        mov eax, 20
    }
}

VTune Experiment
for (i = 0; i < 1000000; i++) {   /* 1M iterations */
    _asm {
        mov eax, 10
        …
        mov eax, 4990
    }
}
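Loop bodies of different lengths like the two above are tedious to write by hand, so a small generator can help. The sketch below is an assumption, not the authors' tooling: it emits MSVC-style inline assembly (32-bit x86 only), and it assumes the immediates step by 10, which the slide does not state because the middle of the sequence is elided. Under that assumption, 499 instructions end at mov eax, 4990. Function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the source of a microbenchmark loop with n_movs "mov eax, imm"
 * instructions into buf. Immediates are 10, 20, ..., 10 * n_movs (an
 * assumed pattern). Returns the number of characters written, or -1 if
 * buf is too small. */
int emit_benchmark(char *buf, size_t cap, int n_movs, long iterations)
{
    assert(buf != NULL && n_movs > 0);
    size_t used;
    int w;

    w = snprintf(buf, cap, "for (i = 0; i < %ld; i++) {\n    _asm {\n", iterations);
    if (w < 0 || (size_t)w >= cap) return -1;
    used = (size_t)w;

    for (int k = 1; k <= n_movs; k++) {
        w = snprintf(buf + used, cap - used, "        mov eax, %d\n", 10 * k);
        if (w < 0 || (size_t)w >= cap - used) return -1;
        used += (size_t)w;
    }

    w = snprintf(buf + used, cap - used, "    }\n}\n");
    if (w < 0 || (size_t)w >= cap - used) return -1;
    used += (size_t)w;
    return (int)used;
}

/* Count the mov instructions in generated source, as a sanity check. */
int count_movs(const char *src)
{
    assert(src != NULL);
    int n = 0;
    for (const char *p = strstr(src, "mov eax,"); p != NULL;
         p = strstr(p + 1, "mov eax,"))
        n++;
    return n;
}
```

Sweeping n_movs while watching the trace cache counters in VTune is one way to look for the point where the loop body stops fitting in the trace cache.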

VTune Results

VTune Results cont.

VTune Results for P4m

Sources:
• M. Milenkovic, A. Milenkovic, J. Kulick, “Demystifying Intel Branch Predictors,” Proceedings of the Workshop on Duplicating, Deconstructing, and Debunking (held in conjunction with the 29th ISCA), Anchorage, Alaska, May 2002
• E. Rotenberg, S. Bennett, J. E. Smith, “A Trace Cache Microarchitecture and Evaluation,” IEEE Transactions on Computers, Vol. 48, No. 2, February 1999
• http://www6.tomshardware.com/cpu/20001120/p4-06.html
• http://www.extremetech.com/article2/0,3973,1488,00.asp
• http://www.arstechnica.com/cpu/01q2/p4andg4e/p4andg4e-5.htm
