
Static and Dynamic Helper Thread Prefetching



  1. Static and Dynamic Helper Thread Prefetching Wei Hsu 8/02/2006

  2. Outline • Existing HW and SW Data Cache Prefetching Techniques • Helper Thread Prefetching for CMP/CMT Processors • Static Helper Thread Generation • Itanium VMT based experiments • Pentium hyper-threading based experiments • Dynamic Helper Thread Generation • Sparc Panther based prototype

  3. Cycle Breakdown (case 1) [chart: cycle breakdown for the target hotspot vs. real-world apps]

  4. Cycle Breakdown (case 2) • Data collected on USIII+ with base binaries compiled without feedback • Stalls from the data cache form a significant portion of execution cycles, even on a machine with a relatively large second-level cache (8MB)

  5. Existing HW/SW Prefetching Techniques • Prediction based • HW techniques • Stride based prefetch • Correlation based prefetch • SW techniques • Stride based • Pointer chasing based • Pre-computation based

  6. Stride Predictor [diagram] • RPT (Reference Prediction Table): indexed by load PC; each entry holds a tag, prev_addr, stride, and a 2-bit finite-state-machine state; predicted addresses are queued in an Outstanding Request List • Lookahead prefetching: an LA-PC (LookAhead PC), advanced with help from the branch predictor, is used to look into the RPT and generate prefetch addresses ahead of the real PC
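
A minimal sketch of such a predictor in C, in the spirit of the diagram above; the table size, the hash, and the simplified 2-bit state machine are assumptions for illustration, not details from the slides:

    #include <stdint.h>

    #define RPT_ENTRIES 256

    typedef enum { INITIAL, TRANSIENT, STEADY, NO_PRED } rpt_state_t; /* 2-bit FSM */

    typedef struct {
        uint64_t    tag;        /* PC of the load */
        uint64_t    prev_addr;  /* last data address accessed by this load */
        int64_t     stride;     /* last observed address delta */
        rpt_state_t state;
    } rpt_entry_t;

    static rpt_entry_t rpt[RPT_ENTRIES];

    /* Called on each load; returns a prefetch address to queue in the
     * Outstanding Request List, or 0 if no prediction is made. */
    uint64_t rpt_lookup(uint64_t pc, uint64_t addr)
    {
        rpt_entry_t *e = &rpt[(pc >> 2) % RPT_ENTRIES];

        if (e->tag != pc) {                      /* new load: allocate entry */
            e->tag = pc;
            e->prev_addr = addr;
            e->stride = 0;
            e->state = INITIAL;
            return 0;
        }
        int64_t stride = (int64_t)(addr - e->prev_addr);
        if (stride == e->stride) {               /* stride confirmed */
            e->state = (e->state == INITIAL) ? TRANSIENT : STEADY;
        } else {                                 /* stride broken: retrain */
            e->state = (e->state == STEADY) ? TRANSIENT : NO_PRED;
            e->stride = stride;
        }
        e->prev_addr = addr;
        return (e->state == STEADY && e->stride != 0) ? addr + e->stride : 0;
    }

With lookahead prefetching, the same table would be probed with the LA-PC rather than the retiring PC, letting prefetches run several branches ahead of execution.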

  7. SW directed, HW initiated prefetch • The PA7200 implemented a hardware-initiated prefetch scheme with an undirected and a directed mode. • In undirected mode, the cache automatically fetches the next sequential line (i.e., next-line prefetch). • In directed mode (i.e., compiler directed), the processor determines the prefetch direction (forward or backward) and the prefetch stride from the auto-increment amount encoded in the load/store instructions (auto-increment is similar to the post-increment feature in IA64).

  8. SW controlled strided prefetch • The compiler inserts explicit prefetch instructions (most modern processors support data cache prefetch instructions such as lfetch, prefetch, etc.). • The compiler determines the stride, which can be computed from the loop index variable, and inserts prefetch instructions for data to be used a few iterations later. • Loop unrolling is often performed to reduce instruction overhead. • Can support indirect array prefetching. • Can insert instructions to compute the runtime stride for pointer-chasing loops.
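
A minimal sketch of what the inserted prefetches look like; this is hand-written to mimic the transformation, with PREFETCH_DIST as an assumed tuning knob and __builtin_prefetch as the GCC/Clang prefetch intrinsic (a compiler would emit lfetch/prefetch instructions directly):

    #define PREFETCH_DIST 8   /* iterations of lookahead (illustrative) */

    /* Strided case: the address stream is a linear function of i. */
    void scale(double *a, const double *b, long n, double k)
    {
        for (long i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&b[i + PREFETCH_DIST], 0 /* read */);
            a[i] = k * b[i];
        }
    }

    /* Indirect case: use the already-loaded index array to prefetch
     * x[idx[i + PREFETCH_DIST]] ahead of its use. */
    double gather_sum(const double *x, const int *idx, long n)
    {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&x[idx[i + PREFETCH_DIST]], 0);
            s += x[idx[i]];
        }
        return s;
    }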

  9. Pros and Cons of SW Controlled Stride Prefetching • Pros: • reduces hardware overhead, • can avoid the first round misses, • can handle complex address equations, • can prefetch much ahead of time, • can support indirect array prefetching • Cons: • code bloat, • instruction overhead, • control flow is often a problem, • unpredictable latencies (L2/L3? Network delay?)

  10. Correlation based prefetching using a Markov predictor • Assumes cache misses are correlated: a miss is usually followed by misses to the same few addresses. For example, if miss A is often followed by B, C, and D, then when miss A is detected, the predictor triggers prefetches for B, C, and D. • The correlation can be set up with a Markov model (based on transition probabilities); however, LRU replacement may work even better. • Uses only the cache-miss address stream rather than the full address stream, so the predictor can be located off-chip. • Supports more than strided prefetching; however, it requires a great deal of space for the prediction table.
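
A minimal sketch of such a correlation table; the direct-mapped organization, the choice of four LRU-ordered successors per entry, and the issue_prefetch hook are illustrative assumptions:

    #include <stdint.h>
    #include <string.h>

    #define TABLE_ENTRIES 4096
    #define SUCCESSORS    4

    typedef struct {
        uint64_t miss_addr;              /* the miss this entry describes */
        uint64_t next[SUCCESSORS];       /* following misses, most recent first */
    } corr_entry_t;

    static corr_entry_t corr[TABLE_ENTRIES];
    static uint64_t last_miss;

    extern void issue_prefetch(uint64_t addr);   /* hypothetical hook */

    /* Called on every cache miss: record the (last_miss -> addr)
     * transition, then prefetch the known successors of addr. */
    void on_miss(uint64_t addr)
    {
        corr_entry_t *e = &corr[(last_miss >> 6) % TABLE_ENTRIES];
        if (e->miss_addr == last_miss) {
            int i;                       /* move addr to the front (LRU) */
            for (i = 0; i < SUCCESSORS - 1 && e->next[i] != addr; i++)
                ;
            memmove(&e->next[1], &e->next[0], i * sizeof e->next[0]);
            e->next[0] = addr;
        } else {                         /* evict and reallocate the entry */
            memset(e, 0, sizeof *e);
            e->miss_addr = last_miss;
            e->next[0] = addr;
        }
        last_miss = addr;

        corr_entry_t *p = &corr[(addr >> 6) % TABLE_ENTRIES];
        if (p->miss_addr == addr)
            for (int i = 0; i < SUCCESSORS; i++)
                if (p->next[i])
                    issue_prefetch(p->next[i]);
    }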

  11. Correlation based prefetching (cont.) • Recent region-based prefetching is also based on correlations: one cache miss can trigger the prefetch of nearby cache lines with temporal and spatial locality. • My group is interested in HPM support for correlation-based prefetching: for example, using hardware event history to identify opportunities for correlation-based prefetching, and deploying such prefetches at runtime in software. • Correlation-based prefetch makes sense for instruction cache misses. Data misses also correlate, but require a large table to remember them.

  12. Pre-computation based prefetching • Prefetching is driven by address pre-computation • Dynamic pre-computation • Static pre-computation • Helper thread: a speculative prefetch thread runs ahead of the main thread and triggers early cache misses on its behalf • Static helper thread • Dynamic helper thread • HW run-ahead

  13. What are Helper Threads? [timeline: the helper thread initiates prefetches ahead of the main thread, so the main thread's cache misses are avoided]

  14. What are Helper Threads? • Step 1: Identify delinquent loads • Step 2: Compute the backward slice as a helper thread, including live-in register computation • Step 3: Determine the trigger point [timeline as on the previous slide]

  15. Compilation flow: first compilation pass → regular binary + cache profiles → post-pass control flow graph builder → second compilation pass (delinquent load identification; slicing, scheduling, and trigger point identification; SSP-enabled binary generation) → enhanced binary with triggers + slices

  16. Context Sensitive vs. Context Insensitive Slice • Context-insensitive slice: all instructions on the backward slice (following both control and data dependences) from the delinquent load are included; in-parameters are not considered.

  17. Context Sensitive vs. Context Insensitive Slice (cont.) • A more efficient slice can be derived if the calling context is also considered (a context-sensitive slice).
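
A tiny hypothetical example of the difference (the code and the call-site mix are invented for illustration):

    typedef struct item { int cost; struct item *next; } item_t;

    static item_t *hot_list, *cold_list;
    static long total;

    int lookup(item_t *p) { return p->cost; }    /* delinquent load: p->cost */

    void hot_caller(void)  { total += lookup(hot_list);  }  /* dominant call site */
    void cold_caller(void) { total += lookup(cold_list); }  /* rare call site */

A context-insensitive slice of p->cost must pull in the computations of both hot_list and cold_list, since in-parameters are ignored; a context-sensitive slice built for the hot call site keeps only the hot_list computation, giving a leaner helper thread.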

  18. Region Based Slicing • Slice needs to be large enough to create sufficient slack (time difference between prefetch and actual load), small enough to avoid losing prefetched data. • Slice selection can be based on regions. A region represents a loop, a loop body, or a procedure in the program. • A region graph is a hierarchical program representation that uses edges to connect a parent region to its child regions, e.g. from callers to callees, from an outer scope to an inner scope. • A slice is expanded from inner region to outer until it is large enough.
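
A rough sketch of the expansion loop, assuming each region carries a cycle estimate; both thresholds are invented for illustration, not taken from the slides:

    #include <stddef.h>

    typedef struct region {
        struct region *parent;      /* enclosing scope or caller */
        long est_cycles;            /* estimated cycles per execution */
    } region_t;

    #define MIN_SLACK_CYCLES  300   /* want slack > memory latency */
    #define MAX_SLICE_CYCLES 5000   /* beyond this, prefetched data is lost */

    /* Expand from the delinquent load's innermost region outward
     * (loop body -> loop -> procedure) until the region is large
     * enough to create slack, but refuse regions so large that
     * prefetched lines would be evicted before use. */
    region_t *choose_slice_region(region_t *r)
    {
        while (r->parent != NULL && r->est_cycles < MIN_SLACK_CYCLES)
            r = r->parent;
        return (r->est_cycles <= MAX_SLICE_CYCLES) ? r : NULL;
    }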

  19. Speculative Slicing • Static slicing may introduce a large number of unnecessary instructions into the slice. • Dynamic slicing can be prohibitively expensive. • A hybrid approach, called control-flow speculative slicing, alleviates the imprecision problem by exploiting block profiling and dynamic call graphs. • This control flow information is used to filter out unexecuted paths and unrealized calls (especially indirect calls).

  20. Slice Statistics

  21. Constructing Helper Threads

Main Thread:

    while( node != root ) {
        while( node ) {
            if( node->orientation == UP )
                node->ptent = node->arc->cost + node->pred->ptent;
            else { /* == DOWN */
                node->ptent = node->pred->ptent - node->arc->cost;
                checksum++;
            }
            tmp = node;
            node = node->child;
        }
        node = tmp;
        while( node->pred ) {
            tmp = node->sibling;
            if( tmp ) {
                node = tmp;
                break;
            }
            else
                node = node->pred;
        }
    }

Helper Thread:

    while (node != root) {
        tmp1 = node->arc;
        tmp2 = node->pred;
        prefetch tmp1->cost;
        prefetch tmp2->ptent;
        node = node->child;
    }

  22. Virtual Multithreading (VMT) Model • Experimental system: Itanium 2 • Stall on L3 misses: 200+ cycles • No multithreading support • Switch-on-event + helper threads • Thread switch on processor stall for an L3 miss • Helper thread prefetches for future misses under a pending L3 miss

  23. VMT Illustrated [timeline t0 through t6: main thread and helper thread alternating via fast thread switches] • Three ingredients: fast thread switch, async triggers, thread generation

  24. Fast thread switch • "Fly-weight" thread switching implemented in the Processor Abstraction Layer (PAL), beneath the application level, the OS level, and the SAL (System Abstraction Layer) • Prototype system: ~70-cycle thread switch

  25. Async triggers: trigger on L3 misses • Itanium 2 trigger/response mechanism: instruction (IP, opcode, etc.), data (address, etc.), and event (PMU, etc.) signals feed PAL, which can raise responses (OS interrupts, thread switch, etc.) • Designate literal nops (a special nop) as triggers • Trigger on an L3 miss plus pipeline stall • Jump to the thread switch handler

  26. Nop example • Main thread:

    load r3 = [r2]      // if the load misses in L3, execution
    add  r4 = r1, r3    // stalls at the use for 200+ cycles
    sub  r5 = r4, r6
    nop  0x100000       // literal nop acts as the trigger:
                        // fast context switch to the helper

• Helper thread:

    load r8  = [r7]     // chases the pointer and performs
    add  r8  = r8, r9   // prefetches within the 200+ cycle
    load r10 = [r9]     // miss window, then fast context
                        // switch back

  27. Thread generation: AutoHelper • Implemented in the Intel compiler infrastructure: C++ and Fortran front ends, profiler, interprocedural analysis & optimizations, global scalar optimizations, code generation, IA-32 and Itanium back ends • Given a loop with a load that misses in L3, e.g. void foo() { while() { /* bad load */ } }, AutoHelper generates a companion void helper() { /* prefetch */ } procedure

  28. Configuration and Benchmarks • Workstation apps: dot, mcf, vpr • DSS queries accessing a 100GB DB2 database • 16-disk array, 90+% CPU utilization • 4-way Itanium 2 • 16KB 4-way L1 data cache, 256KB 8-way shared L2 cache, 6MB 24-way shared L3 cache • 16GB RAM • Linux EL3.0 and Windows Server 2003

  29. VMT Speedups

  30. The above performance opportunity might also be captured by other HW/SW solutions • HW run-ahead • Aggressive out-of-order implementation • A new instruction that branches on an L3 miss (or branches and links): for example, on an L3 miss, jump to a prefetch routine

  31. Static Helper Thread on Pentium 4 • Processor • Pentium 4 with Hyper-Threading • Operating System • Windows XP Professional • Benchmark • MCF, BZIP2, ART (SPEC CPU2000) • MST, EM3D (Olden)

  32. Resource Contention • HW resource management in hyper-threaded processor • HW resource contention • Unhelpful helper threads potentially degrade performance • Cannot launch a helper thread all the time • Dynamic mode transition between ST and MT modes

  33. Dynamic Performance Monitoring • Don’t trigger helper thread unless there are frequent cache misses • Monitor dynamic program behavior • Fine-grain chronology • Low overhead • EmonLite • User-level library routines • Monitor microarchitectural events • e.g. cycles, cache misses

  34. EmonLite Code Instrumentation

    while (arcin) {
        /* emonlite_sample() */
        if (!(num_iter++ % SAMPLE_PERIOD)) {
            current_val = readpmc(16);
            L2miss[num_sample++] = current_val - prev_val;
            prev_val = current_val;
        }
        tail = arcin->tail;
        if (tail->time + arcin->org_cost > latest) {
            arcin = (arc_t *)tail->mark;
            continue;
        }
        ...
    }

    main( int argc, char *argv[] )
    {
        ...
        emonlite_begin();
        ...
    }

• EmonLite enables analysis of the time-varying behavior of a workload at fine granularity (VTune gives only a summary) • Adjustable profiling interval & overhead • Prototyped in the Intel research compiler infrastructure

  35. Helper Threading Scenarios • Two helper threading scenarios to invoke and synchronize helper threads • Baseline loop:

    while (arcin) {
        tail = arcin->tail;
        if (tail->time + arcin->org_cost > latest) {
            arcin = (arc_t *)tail->mark;
            continue;
        }
        ...
    }

  36. Helper Threading Scenarios: loop-based trigger • Invoke the helper thread once before the loop:

    helper_invoke();
    while (arcin) {
        tail = arcin->tail;
        if (tail->time + arcin->org_cost > latest) {
            arcin = (arc_t *)tail->mark;
            continue;
        }
        ...
    }

+ Low thread synchronization overhead
- Lack of synchronization

  37. Helper Threading Scenarios: sample-based trigger • Re-invoke the helper thread every SAMPLE_PERIOD iterations:

    while (arcin) {
        if (!(num_iter++ % SAMPLE_PERIOD))
            helper_invoke();
        tail = arcin->tail;
        if (tail->time + arcin->org_cost > latest) {
            arcin = (arc_t *)tail->mark;
            continue;
        }
        ...
    }

+ Avoids run-away helper thread
- Thread synchronization and execution overhead

  38. Thread Synchronization Cost • Two mechanisms to invoke & suspend threads • 1) Win32 API: SetEvent() & WaitForSingleObject() • Jumps to the OS scheduler • Non-deterministic transition time: 10K~30K cycles • 2) Light-weight hardware mechanism • A lockbox-like instruction that can suspend/activate a thread and switch between MT and ST modes • Deterministic transition time: ~1,500 cycles
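
A minimal sketch of the Win32-event handshake; the structure and the run_prefetch_slice hook are illustrative, but CreateEvent, SetEvent, WaitForSingleObject, and CreateThread are the standard Win32 calls. Each invoke/suspend round trip goes through the OS scheduler, which is why it costs tens of thousands of cycles:

    #include <windows.h>

    extern void run_prefetch_slice(void);   /* hypothetical slice body */

    static HANDLE helper_wake;              /* signaled by the main thread */

    DWORD WINAPI helper_main(LPVOID arg)
    {
        (void)arg;
        for (;;) {
            /* Suspend until the main thread invokes us. */
            WaitForSingleObject(helper_wake, INFINITE);
            run_prefetch_slice();
        }
        return 0;
    }

    void helper_invoke(void)                /* called at the trigger point */
    {
        SetEvent(helper_wake);
    }

    void helper_setup(void)
    {
        helper_wake = CreateEvent(NULL, FALSE /* auto-reset */, FALSE, NULL);
        CreateThread(NULL, 0, helper_main, NULL, 0, NULL);
    }

The proposed hardware mechanism would replace the two OS calls with a single lockbox-like instruction, making the transition time short and deterministic.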

  39. Do Helper Threads Really Help? • Wall-clock speedup on real silicon • More speedup with light-weight HW mechanism • Synchronization is important

  40. Cache Miss Coverage • Large cache miss reduction by helper threads • More headroom for higher miss coverage

  41. Conclusions from P4 experiments • Helper threads can provide wall-clock speedups on real machines, even on hyper-threaded (one form of SMT) processors. • Dynamic program behavior must be taken into account (e.g., using EmonLite monitoring to manage the helper threads at runtime). • Synchronization cost is important: light-weight synchronization allows for higher speedups.

  42. Motivation for Dynamic Helper Thread • CMP-type processors are here to stay, and they are likely to share L2/L3 caches. • Single-thread performance is often limited by cache misses. • Current hardware and software schemes are not always effective. • Our goal: improve single-thread performance on the latest CMPs using a fully dynamic helper-threaded solution.

  43. RTO Idea of Helper Threads on a CMP [timeline; Core 1 and Core 2 share the L2 cache] • The main program runs on Core 1; a monitoring thread is started on Core 2 to watch performance information. • When cache misses are observed, the second thread begins acting as a helper thread, prefetching data into the shared L2 cache. • The main thread now finds its data in the cache (cache hits).

  44. Comparison with previous work

  45. Adore/Sparc Framework • Main thread and runtime optimization thread run side by side • Pipeline: Hardware Performance Monitoring Unit (PMU) → kernel (init PMU; interrupt on event) → phase detection (interrupt on buffer overflow) → trace selection (on phase change) → optimization (traces passed to the optimizer) → deployment (optimized traces patched into the code cache)

  46. Flow of Control • Sleep; wake up from sleep on a new profile • New phase? If no, go back to sleep • If yes: does the code require help? If no, go back to sleep • If yes: build new helper threads, then start/synchronize the helper thread • When the helper thread is done, go back to sleep
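
A sketch of the optimizer thread's control loop implied by the flowchart; every type and function here is a hypothetical placeholder, not Adore's actual API:

    typedef struct profile profile_t;
    typedef struct helper  helper_t;

    extern profile_t *wait_for_new_profile(void);        /* sleep until samples arrive */
    extern int        is_new_phase(profile_t *p);
    extern int        code_requires_help(profile_t *p);
    extern helper_t  *build_helper_threads(profile_t *p); /* slice + trigger generation */
    extern void       start_helper(helper_t *h);          /* start/synchronize helper */
    extern void       wait_helper_done(helper_t *h);

    void rto_thread(void)
    {
        for (;;) {
            profile_t *p = wait_for_new_profile();  /* wake up from sleep */
            if (!is_new_phase(p))
                continue;                           /* back to sleep */
            if (!code_requires_help(p))
                continue;
            helper_t *h = build_helper_threads(p);
            start_helper(h);
            wait_helper_done(h);                    /* then back to sleep */
        }
    }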

  47. [Flow-of-control diagram repeated, highlighting the profile collection stage]

  48. Profile Collection • The profile is collected at runtime (dynamic profile) • Hardware performance counters are sampled • Samples are stored in a kernel buffer • On kernel buffer overflow, samples are copied to the user process • Events are interleaved on the hardware counters • More information for the profile • Overhead is around 1-2% • Profiling accounts for the majority of total system overhead

  49. [Flow-of-control diagram repeated, highlighting the phase detection stage]

  50. Phase Detection • Keep a history buffer of average PC values (samples M1, M2, M3, ...) • Compute the average (E) and standard deviation (D) of the PC values in the history buffer • The band of tolerance runs from E-D to E+D; if a new sample Mk falls outside the band, a phase change is triggered
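
A minimal sketch of this band-of-tolerance detector; the history buffer size and the restart policy after a phase change are assumptions for illustration:

    #include <math.h>

    #define HIST 8                     /* history buffer size (assumed) */
    static double hist[HIST];
    static int nhist;

    /* Called once per sampling window with the window's average PC.
     * Returns nonzero when a phase change is detected. */
    int phase_changed(double avg_pc)
    {
        if (nhist < HIST) {            /* still filling the history buffer */
            hist[nhist++] = avg_pc;
            return 0;
        }
        double e = 0.0, d = 0.0;
        for (int i = 0; i < HIST; i++) e += hist[i];
        e /= HIST;                     /* E: mean of the history */
        for (int i = 0; i < HIST; i++) d += (hist[i] - e) * (hist[i] - e);
        d = sqrt(d / HIST);            /* D: standard deviation */

        int changed = (avg_pc < e - d) || (avg_pc > e + d);
        if (changed) {
            nhist = 0;                 /* restart the history in the new phase */
        } else {                       /* slide the window forward */
            for (int i = 1; i < HIST; i++) hist[i - 1] = hist[i];
            hist[HIST - 1] = avg_pc;
        }
        return changed;
    }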
