
Understanding Performance Counter Data


Presentation Transcript


1. Load/Store Machine

• How do we calculate the cache-miss rate with performance counters on the POWER3?
• Available events:
• Load miss @ L1
• Load dispatched
• Load completed
• Store miss @ L1
• Store dispatched
• Store completed
• In the pipeline, instructions are DISPATCHED before they are COMPLETED. (Speculative execution: not all dispatched instructions may get completed.)
• Cache-miss rate overestimated: (LOAD MISS @ L1 + STORE MISS @ L1) / (LOAD CMPL + STORE CMPL)
• Cache-miss rate underestimated: (LOAD MISS @ L1 + STORE MISS @ L1) / (LOAD DISP + STORE DISP)
• The dispatched-based ratio should be the better approximation (recommended by PCAT).

Assembler micro-benchmark example:

    .                    # Set up input parameters
    .
    lfd   fp1,64(SP)     # load a value into a register
    lfd   fp2,72(SP)     # load a value into a register
    fa    fp3,fp1,fp2    # perform a floating-point add on the two values
    stfd  fp3,72(SP)     # store the result of the floating-point operation to memory
    .
    .
    .                    # More operations

Understanding Performance Counter Data
Authors: Alonso Bayona, Michael Maxwell, Manuel Nieto, Leonardo Salayandia, Seetharami Seelam
Mentor: Dr. Patricia J. Teller

RECENT WORK
DoD application programmers working on the IBM POWER3 architecture had questions about performance counter data for that specific architecture. The questions were forwarded to our team by collaborators at UTK (University of Tennessee, Knoxville), and we took on the task of answering the questions UTK had not answered.

LIST OF QUESTIONS
1. Exactly how are floating-point (FP) operations counted? (What is counted?) We have observed that FP loads and stores are not counted, and that FMAs (FP multiply-adds) are counted as one FP op. FP round-to-single is also counted as one FP op. Are divides and SQRTs (square roots) included? Are SQRTs counted as one FP op?
2. Kevin London said that PAPI_L1_DCH will return L1 data cache (DC) hits; however, that event is not available on the POWER3.
We can derive L1 DC hits as "total references to the L1 DC" minus the number of L1 misses (PAPI_L1_DCM). How do we get total L1 references? Obviously we should include the number of loads (PAPI_LD_CMPL) plus the number of stores (PAPI_ST_CMPL), but do we count prefetches (data fetched speculatively)?
3. Are prefetches already part of the load count? (They probably should not be, since the result goes to the cache but not to a register.) Are prefetches part of the L1 miss count? Apparently there is a counter for prefetch hits (PM_DC_PREF_HIT). Should the hit rate be calculated as PAPI_L1_DCH / (PAPI_LD_INS + PAPI_ST_INS), or as (PAPI_L1_DCH + PM_DC_PREF_HIT) / (PAPI_LD_INS + PAPI_ST_INS + number of prefetches)? If the latter, how do you count the number of prefetches?
4. Same question as 3, but for the L2 cache. This question is further complicated by the fact that the L2 cache is unified (data and instructions), I think. If this is true, how do instruction prefetches fit into the calculation?
5. What (on earth) is the difference between the events PM_LD_MISS_L1, PM_LD_MISS_EXCEED_L2, and PM_LS_MISS_EXCEED_NO_L2? Also, the latter two events take a "threshold" as an argument; how do you specify this to PAPI?
6. On the POWER3 SP, does the sum PM_FPU_FADD_FMUL + PM_FPU_FCMP + PM_FPU_FDIV + PM_FPU_FEST + PM_FPU_FMA + PM_FPU_FPSCR + PM_FPU_FRSP + PM_FPU_FSQRT equal PM_FPU0_CMPL + PM_FPU1_CMPL?
7. On the POWER3 SP, does PM_IC_HIT + PM_IC_MISS equal PM_INST_CMPL or PM_INST_DISP?
8. More generally, on a speculative processor more instructions are dispatched than completed, and at some point some instructions are cancelled (is this correct?). Are instructions cancelled before or after they touch the cache? This matters when calculating the cache-miss rate, since (hopefully) the miss rate is either (PM_LD_MISS_L1 + PM_ST_MISS_L1) / (PM_LD_CMPL + PM_ST_CMPL) or (PM_LD_MISS_L1 + PM_ST_MISS_L1) / (PM_LD_DISP + PM_ST_DISP). Which is it?
POWER3 processor die [image]

What is being counted in floating-point operations?
• Simple math operations (+, -, *, /) each count as 1 FLOP (floating-point operation).
• A multiply followed by an add, called an FMA instruction, is handled by special hardware and counts as only 1 FLOP.
• A square-root operation (sqrt), when handled by a software routine, counts as 21 or 22 FLOPs.
• The POWER3 has special hardware to handle sqrt operations, but the compiler does not always use it.
• Other operations counted: rounding operations and register moves.
• Operations not counted: floating-point data loads and stores.

[Flowchart: each FLOP completed by FPU 0 or FPU 1 increments the counter by 1; if the FLOP is a software SQRT routine, the counter is incremented by 21 or 22.]

Miss rate on the L1 and L2 data caches
• L1 and L2 data cache miss rates are not easy to estimate because of the prefetching mechanism present in almost all modern processors.
• Indirect methods of measuring the L1 and L2 data cache miss rates need to be researched.
• Prefetching reduces the miss rate for sequential data access.
• The complement of the miss rate can be computed as follows:
• L1 hit rate = 100 * (1 - (load misses in L1 + store misses in L1) / total loads and stores)
• L2 hit rate = 100 * (1 - (load misses in L2 + store misses in L2) / total L1 misses)
• These metrics were obtained from: http://www.sdsc.edu/SciApps/IBM_tools/ibm_toolkit/HPM_2_4_3.html

L1 Instruction Cache Hits
• An in-depth understanding of the architecture being studied is often required to correctly analyze performance data.
• Assuming that the commonly used definition of an event holds on every platform may lead to misinterpretation of the performance data.
• For instance, on the POWER3 the instruction cache hit event is triggered when a block of instructions (up to 8) is fetched from the cache into the instruction buffer, not on every single instruction fetch.
• By experimentation, while trying to answer question 7, we found that on the POWER3 the following relation holds for a sequential program:
• ((PM_IC_HIT - IC_PREF_USED) + PM_IC_MISS) * 8 ≈ PM_INST_CMPL

Which floating-point operations contribute to the total floating-point operations completed?
• The micro-benchmarks had to be written in assembler because of the difficulty of triggering specific events from a high-level language such as C.
• A different micro-benchmark was written for each of the operations tested.
• The micro-benchmarks revealed that the equation previously thought to give the total number of floating-point operations does not hold.
• PM_FPU0_FMOV_FEST must be added to the equation.
• (The fres instruction, which gives an estimate of the reciprocal of a floating-point operand, is counted by the PM_FPU0_FMOV_FEST event as well as by the PM_FPU_FEST event. Thus, when any kind of estimate instruction is used, the proposed equation counts fres instructions twice.)
• Division and square-root floating-point operations are counted as FMA operations. (STILL UNDER INVESTIGATION)

SPONSORS
Department of Defense (DoD), the MIE (Model Institutions for Excellence) REU (Research Experiences for Undergraduates) Program, and the Dodson Endowment
