Exploring P4 Trace Cache Features

Ed Carpenter, Marsha Robinson, Jana Wooten. Problem statement: explore characteristics of the Pentium 4 (P4) trace cache using microbenchmarks and performance counters related to branching and the trace cache.


Presentation Transcript


  1. Exploring P4 Trace Cache Features Ed Carpenter Marsha Robinson Jana Wooten

  2. Problem Statement • Explore characteristics of the P4 Trace Cache using microbenchmarks and performance counters related to branching and Trace Cache

  3. Approach • Determine the characteristics of the Pentium 4 processor that will help us evaluate the P4’s trace cache • Using a performance monitoring tool (Intel’s VTune Performance Analyzer), measure the data we need and analyze it to find limitations of the trace cache

  4. Some P4 Characteristics • Like most high performance processors, the P4 has special on-chip hardware for performance monitoring. This hardware typically includes • Event detectors and counters • Qualification of event detections and counting by privilege mode and event characteristics • Support for event-based sampling

  5. P4 Characteristics cont. • Common problems faced by modern processors • Small number of counters • Inability to distinguish between speculative and non-speculative events • Imprecise event-based sampling • With 42 million transistors (compared to 28 million in the P3), the P4 has overcome these problems • 48 event detectors and 18 event counters • Provides instruction tagging to enable counting of non-speculative performance events • Provides support for imprecise event-based sampling (IEBS) and precise event-based sampling (PEBS)

  6. Trace Cache • Special instruction cache for capturing long dynamic instruction sequences • Each line stores a snapshot, or trace, of the dynamic instruction stream • The P4 executes from the trace cache whenever the needed code hits in it (the trace cache serves as the L1 instruction cache), which is over 90% of the time

  7. Characteristics of Trace Cache • Stores instructions after they have already been decoded into μops (“micro-ops”) • μops: RISC-style instructions • Cache line size: 6 μops • Trace cache size: 12K μops • Branch prediction hardware is used • It knows about any branch and fetches the instructions that follow the branch • Conditional branches can cause problems • A wrong prediction is not detected until the branch condition is checked in ALU0
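The two size figures on this slide imply a line count, which a short calculation makes concrete. This is a minimal sketch, assuming "12K" means 12 × 1024 μops and that a trace always starts on a fresh line; the names `tc_line_count` and `tc_lines_needed` are hypothetical helpers, not anything from the slides.

```c
#include <assert.h>

/* Figures taken from the slide; reading "12K" as 12 * 1024 is an assumption. */
enum { TC_CAPACITY_UOPS = 12 * 1024, TC_LINE_UOPS = 6 };

/* Number of trace lines implied by capacity / line size. */
static int tc_line_count(void)
{
    return TC_CAPACITY_UOPS / TC_LINE_UOPS;
}

/* Lines needed to hold n_uops, rounding up (assumes traces do not share lines). */
static int tc_lines_needed(int n_uops)
{
    assert(n_uops >= 0);
    return (n_uops + TC_LINE_UOPS - 1) / TC_LINE_UOPS;
}
```

Under these assumptions the cache holds 2048 lines, and a straight-line sequence of 7 μops would occupy 2 of them.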

  8. Entering the Execution Pipeline: Pentium 4's Trace Cache (source: Tom’s Hardware Guide, http://www6.tomshardware.com/cpu/20001120/p4-06.html)

  9. Advantages of Trace Cache • More efficient use of limited cache space. • Trace cache lines contain both branch instructions and the code after the branch instruction. • No extra latency for branches • Does not use TLB check

  10. The P4’s Critical Execution Path: “Execute Mode” (when needed code is in L1 cache)

  11. Execute Mode Vs. Trace Segment Build Mode • Execute Mode • Trace cache feeds stored traces to the execution logic to be executed. • Trace cache normally runs in this mode. • Trace Segment Build Mode • Used when there is an L1 cache miss • Front end fetches x86 code from the L2 cache, • Translates into μops, • Builds a “trace segment” with it, • Loads that segment into the trace cache to be executed.
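The switch between the two modes can be illustrated with a toy lookup. This is only a sketch under loud assumptions: a four-entry, direct-mapped table indexed by start address stands in for the real trace cache, whose organization the slides do not describe, and `toy_trace_cache` and `toy_fetch` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_TC_LINES = 4 };   /* hypothetical size, far smaller than real hardware */

typedef enum { EXECUTE_MODE, BUILD_MODE } tc_mode;

typedef struct {
    bool     valid[TOY_TC_LINES];
    unsigned tag[TOY_TC_LINES];   /* x86 address at which the stored trace starts */
} toy_trace_cache;

/* Hit: the stored trace is fed to the execution logic (execute mode).
 * Miss: the front end would fetch from L2, decode to uops, and build a
 * trace segment; here that is modeled by simply installing the tag. */
static tc_mode toy_fetch(toy_trace_cache *tc, unsigned addr)
{
    unsigned slot = addr % TOY_TC_LINES;   /* direct-mapped: an assumption */
    assert(tc != 0);
    if (tc->valid[slot] && tc->tag[slot] == addr)
        return EXECUTE_MODE;
    tc->valid[slot] = true;
    tc->tag[slot] = addr;
    return BUILD_MODE;
}
```

Fetching the same address twice in a row reports build mode and then execute mode; a different address that maps to the same slot evicts the earlier trace and forces a rebuild.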

  12. Branch Prediction • For x86 code with a branch in it: • The trace cache builds a trace from the instructions up to and including the branch instruction • Then it picks the path it predicts the program will take • It continues to build the trace along that speculative path
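Trace construction along a predicted path can be sketched as a walk over a small program. The sketch assumes each branch carries a fixed prediction bit in place of real predictor hardware; `toy_insn` and `toy_build_trace` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool is_branch;
    int  taken_target;    /* index of the instruction the branch jumps to */
    bool predict_taken;   /* stand-in for the branch predictor's guess */
} toy_insn;

/* Record instruction indices starting at 'start', including each branch
 * itself, then continue along the predicted direction. Stops at the end
 * of the program or after max_len entries. Returns the trace length. */
static int toy_build_trace(const toy_insn *prog, int n, int start,
                           int *trace, int max_len)
{
    int pc = start, len = 0;
    assert(prog != 0 && trace != 0);
    while (pc >= 0 && pc < n && len < max_len) {
        trace[len++] = pc;
        if (prog[pc].is_branch && prog[pc].predict_taken)
            pc = prog[pc].taken_target;   /* follow the speculative path */
        else
            pc = pc + 1;                  /* fall through */
    }
    return len;
}
```

With a six-instruction program whose instruction 1 is a branch to instruction 4, a taken prediction yields the trace 0, 1, 4, 5, while a not-taken prediction yields all six instructions in order.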

  13. Microcode ROM • Used by P4 to process longer instructions • Allows regular hardware decoder to concentrate on decoding the smaller, faster instructions. • Stores a sequence of μops for each long instruction encountered. • Inserts a tag into the trace segment that points to the section of the microcode ROM where the μop sequence is held. • Trace Cache gives control to the Microcode ROM when a tag is encountered until the proper sequence of μops is produced. • Execution Engine does not care if instructions come from the Trace Cache or the Microcode ROM
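The tag mechanism can be pictured as expanding a trace in which some entries are pointers into a ROM table. Everything concrete here is an assumption made for illustration: the tag encoding (`UCODE_TAG_BASE`), the two ROM sequences, and the function `toy_expand` are hypothetical and do not reflect the real hardware format.

```c
#include <assert.h>

enum { UCODE_TAG_BASE = 1000, TOY_ROM_SEQS = 2 };  /* hypothetical encoding */

/* Hypothetical ROM contents: sequence k is toy_rom[k][0 .. toy_rom_len[k]-1]. */
static const int toy_rom[TOY_ROM_SEQS][4] = { {11, 12, 13, 14}, {21, 22, 0, 0} };
static const int toy_rom_len[TOY_ROM_SEQS] = { 4, 2 };

/* Copy uops from 'trace' to 'out'. An entry >= UCODE_TAG_BASE is treated as
 * a tag: control passes to the ROM until its whole sequence is emitted.
 * Returns the number of uops written, or -1 on a bad tag or full buffer. */
static int toy_expand(const int *trace, int n, int *out, int cap)
{
    int len = 0;
    assert(trace != 0 && out != 0);
    for (int i = 0; i < n; i++) {
        if (trace[i] >= UCODE_TAG_BASE) {
            int k = trace[i] - UCODE_TAG_BASE;
            if (k >= TOY_ROM_SEQS)
                return -1;
            for (int j = 0; j < toy_rom_len[k]; j++) {
                if (len == cap)
                    return -1;
                out[len++] = toy_rom[k][j];
            }
        } else {
            if (len == cap)
                return -1;
            out[len++] = trace[i];
        }
    }
    return len;
}
```

The consumer of `out` sees one flat μop stream, which mirrors the slide's point that the execution engine does not care where the μops came from.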

  14. VTune Experiment (short loop body)
      for (i = 0; i < 1000000; i++)    /* "1M" iterations on the slide */
          _asm {
              mov eax, 10
              mov eax, 20
          }

  15. VTune Experiment (long loop body)
      for (i = 0; i < 1000000; i++)    /* "1M" iterations on the slide */
          _asm {
              mov eax, 10
              …
              mov eax, 4990
          }
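Long loop bodies like the one in this experiment are tedious to write by hand, so a small generator is a natural aid. The sketch below does not reconstruct the elided lines of the slide; `count` and `step` are hypothetical parameters chosen by the user, and `format_mov_body` is an illustrative helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write 'count' lines of the form "mov eax, <i*step>" into buf.
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the buffer is too small. */
static int format_mov_body(char *buf, size_t cap, int count, int step)
{
    size_t used = 0;
    assert(buf != NULL && cap > 0);
    buf[0] = '\0';
    for (int i = 1; i <= count; i++) {
        int n = snprintf(buf + used, cap - used, "mov eax, %d\n", i * step);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;   /* output would have been truncated */
        used += (size_t)n;
    }
    return (int)used;
}
```

The generated text can be pasted into an `_asm { }` block; varying `count` changes how many μops the loop body contributes, which is the quantity the trace cache capacity limits.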

  16. VTune Results

  17. VTune Results cont.

  18. VTune Results for P4m

  19. Sources: • M. Milenkovic, A. Milenkovic, J. Kulick, “Demystifying Intel Branch Predictors,” Proceedings of the Workshop on Duplicating, Deconstructing, and Debunking (held in conjunction with 29th ISCA), Anchorage, Alaska, May 2002 • E. Rotenberg, S. Bennett, J. E. Smith, “A Trace Cache Microarchitecture and Evaluation,” IEEE Transactions on Computers, (Vol. 48, No. 2) February 1999 • http://www6.tomshardware.com/cpu/20001120/p4-06.html • http://www.extremetech.com/article2/0,3973,1488,00.asp • http://www.arstechnica.com/cpu/01q2/p4andg4e/p4andg4e-5.htm
