
Compiler Speculative Optimizations




  1. Compiler Speculative Optimizations Wei Hsu 7/05/2006

  2. Speculative Execution • Speculative execution is the early execution of code whose result may not be needed (the work may be wasted). • In pipelined processors, speculative execution is often used to reduce the cost of branch mispredictions. • Some processors automatically prefetch the next instruction and/or data cache lines into the on-chip caches. Prefetching has also been used for disk reads. • More aggressive speculative execution is used in “run-ahead” or “execute-ahead” processors to warm up the caches. • Value prediction/speculation is another example.

  3. Compiler Controlled Speculation • Speculation is one of the most important methods for finding and exploiting ILP. • It allows the execution to exploit statistical ILP (e.g., a branch is taken 90% of the time, or the address of pointer *p differs from the address of pointer *q most of the time). • It overcomes the two most common constraints on instruction scheduling (and other optimizations): • Control dependence • Memory dependence

  4. Compiler Controlled Speculation (cont.) • Allows the compiler to issue an operation early, before a dependence is resolved • Removes the latency of the operation from the critical path • Helps hide long-latency memory operations • Control Speculation • the execution of an operation before the branch that guards it • Data Speculation • the execution of a memory load ahead of a preceding store that may alias with it

  5. Control Speculation • Example
  Source:          If (cond) { A=p[i]->b; }
  Previous block:  lw $r6,…
                   sub $r3, $r6…
                   bne …
  Guarded block:   lw $r1, 0($r2)
                   add $r3, $r1,$r4
                   lw $r5,4($r3)
                   sw $r5,4($sp)
                   …
  In this block, there is no room to schedule the load !! Why not move the load instruction into the previous block?

  6. Control Speculation • Example
  Source:          If (cond) { A=p[i]->b; }
  Previous block:  lw $r6,…
                   lw $r1, 0($r2)    ← load hoisted above the branch
                   sub $r3, $r6…
                   bne …
  Guarded block:   add $r3, $r1,$r4
                   lw $r5,4($r3)
                   sw $r5,4($sp)
                   …
  1) Is the cond most likely to be true? Profile feedback may guide the optimization. 2) What if the address of p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults?

  7. Control Speculation • Example
  Source:          If (cond) { A=p[i]->b; }
  Previous block:  lw $r6,…
                   lw $r1, 0($r2)    ← Fault!! Core dump
                   sub $r3, $r6…
                   bne …
  Guarded block:   add $r3, $r1,$r4
                   lw $r5,4($r3)
                   sw $r5,4($sp)
                   …
  What if the address of p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults?

  8. Control Speculation • Example
  Source:          If (cond) { A=p[i]->b; }
  Previous block:  lw $r6,…
                   lw.s $r1, 0($r2)    ← make this a special instruction, so it never faults!!
                   sub $r3, $r6…
                   bne …
  Guarded block:   add $r3, $r1,$r4
                   lw $r5,4($r3)
                   sw $r5,4($sp)
                   …
  What if the address of p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults? For example, SPARC supports non-faulting load instructions that can ignore segmentation faults…

  9. Architecture Support in SparcV9 • SparcV9 provides non-faulting loads (similar to the silent loads used in Multiflow’s Trace and Cydrome’s Cydra-5 computers). • Non-faulting loads execute like any other loads except that segmentation-fault conditions do not cause program termination. • To minimize page faults when a speculative load references a null pointer (address zero), it is desirable to map low addresses (especially address zero) to a page with a special attribute. • Non-faulting loads are often used in data prefetching, but not for general code motion.
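The behavior of a non-faulting load can be sketched in C. This is a minimal simulation, not the real mechanism: an actual SPARC non-faulting load is a hardware instruction selected via an alternate address space, whereas the hypothetical `nf_load` helper below checks the address against a region known to be mapped and silently returns 0 for anything else — mirroring how a non-faulting load returns a dummy value instead of trapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a non-faulting load.  The "mapped region" stands in
 * for the pages the OS has actually mapped for the program. */
typedef struct {
    const int *base;   /* start of the mapped region */
    size_t     len;    /* number of valid elements   */
} region_t;

/* Returns *p if p lies inside the mapped region; otherwise returns 0
 * without faulting, like a SPARC non-faulting load would. */
static int nf_load(const region_t *r, const int *p) {
    if (p >= r->base && p < r->base + r->len)
        return *p;
    return 0;          /* bad address: dummy value, no trap */
}
```

A regular load through a bad pointer would crash the program; `nf_load` instead yields 0, which is exactly why address zero should map to a readable page on real hardware — speculative loads through null pointers then complete cheaply.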

  10. Using non-faulting loads for prefetching
  Source:
    While (j < k) {
      i = Index[j][1];
      x = array[i];
      y = x + …
      j += m;
    }
  Compiled loop (may incur cache misses on each iteration):
    While (j < k) {
      load $r1, index[j][1]
      load $r2, array($r1)
      add $r3, $r2,$r4
      ….
    }
  With prefetching:
    While (j < k) {
      load $r1, index[j][1]
      load $r5, index[j+m][1]    ← load $r5 may fault !!
      load $r2, array($r1)
      prefetch array($r5)
      add $r3, $r2,$r4
      ….
    }

  11. Using non-faulting loads for prefetching
  Source:
    While (j < k) {
      i = Index[j][1];
      x = array[i];
      y = x + …
      j += m;
    }
  Compiled loop (may incur cache misses on each iteration):
    While (j < k) {
      load $r1, index[j][1]
      load $r2, array($r1)
      add $r3, $r2,$r4
      ….
    }
  With prefetching, using a non-faulting load for the speculative index read:
    While (j < k) {
      load $r1, index[j][1]
      nf-ld $r5, index[j+m][1]    ← a plain load of $r5 may fault; the nf-ld never does
      load $r2, array($r1)
      prefetch array($r5)
      add $r3, $r2,$r4
      ….
    }
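The same prefetching pattern can be sketched in C (a sketch under stated assumptions, not compiler output): the speculative read of `index[j+m]` is guarded by a bounds check standing in for the non-faulting load, and the prefetch uses the GCC/Clang `__builtin_prefetch` hint. The function name `sum_indirect` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Indirect-access loop with software prefetching.  The next iteration's
 * index is read speculatively so array[next] can be prefetched early. */
static long sum_indirect(const int *index, size_t n,
                         const long *array, size_t m) {
    long sum = 0;
    for (size_t j = 0; j < n; j += m) {
        int i = index[j];                    /* load $r1, index[j]      */
        if (j + m < n) {                     /* guard plays the nf-ld role:
                                                the read never runs past
                                                the end of index[]      */
            int next = index[j + m];         /* speculative index load  */
            __builtin_prefetch(&array[next]); /* hint only, no semantics */
        }
        sum += array[i];                     /* load $r2, array($r1)    */
    }
    return sum;
}
```

With a real non-faulting load the bounds check disappears: the hardware simply returns a dummy index, and prefetching a garbage address is harmless because prefetches never fault.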

  12. Non-faulting Loads Are Insufficient
  Source:          If (cond) { A=p[i]->b; }
  Previous block:  lw $r6,…
                   lw.s $r1, 0($r2)
                   sub $r3, $r6…
                   beq …
  Guarded block:   add $r3, $r1,$r4
                   lw $r5,4($r3)
                   sw $r5,4($sp)
                   …
  What if the address of p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults? But what if the real load of p causes a memory fault? We cannot just ignore it!!

  13.
  Source:          If (cond) { A=p[i]->b; }
  Previous block:  lw $r6,…
                   lw.s $r1, 0($r2)
                   sub $r3, $r6…
                   bne …
  Guarded block:   check.s $r1
                   add $r3, $r1,$r4
                   lw $r5,4($r3)
                   sw $r5,4($sp)
                   …
  What if the address of p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults? But what if the real load of p causes a memory fault? We cannot just ignore it!! Let’s remember the fault status, and check it when the loaded data is actually used.

  14. Recovery Code
  Source:          If (cond) { A=p[i]->b; }
  Previous block:  lw $r6,…
                   lw.s $r1, 0($r2)
                   sub $r3, $r6…
                   bne …
  Guarded block:   check.s $r1, recovery
                   add $r3, $r1,$r4
                   lw $r5,4($r3)
                   sw $r5,4($sp)
                   …
  Recovery block:  recovery: lw $r1, 0($r2)

  15. Recovery Code
  Source:          If (cond) { A=p[i]->b; }
  Previous block:  lw $r6,…
                   lw.s $r1, 0($r2)
                   sub $r3, $r6…
                   add $r3, $r1,$r4    ← dependent instruction speculated as well
                   bne …
  Guarded block:   check.s $r3, recovery
                   lw $r5,4($r3)
                   sw $r5,4($sp)
                   …
  Recovery block:  recovery: lw $r1, 0($r2)
                   add $r3, $r1,$r4
  All instructions that are data dependent on the speculative load and moved with it must go into the recovery block.
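The lw.s / check.s / recovery protocol can be modeled in C. Assumptions for this sketch: a NULL pointer stands in for any faulting address, and a struct field plays the role of the deferred-fault bit (IA-64 calls this NaT, "Not a Thing"); `ld_s` and `chk_s` are hypothetical helpers, not real intrinsics.

```c
#include <assert.h>
#include <stddef.h>

/* A speculative load yields a value plus a deferred-fault bit instead of
 * trapping; the check tests the bit when the value is actually used. */
typedef struct { int value; int nat; } spec_val;

static spec_val ld_s(const int *p) {             /* lw.s: never faults */
    spec_val v = { 0, 1 };                       /* default: fault deferred */
    if (p != NULL) { v.value = *p; v.nat = 0; }
    return v;
}

/* check.s + recovery: if the speculative load "faulted", re-execute the
 * real load; a genuinely bad address now reports the fault to the caller
 * (in real hardware, this is where the program would finally trap). */
static int chk_s(spec_val v, const int *p, int *faulted) {
    *faulted = 0;
    if (v.nat) {                                 /* speculation failed */
        if (p == NULL) { *faulted = 1; return 0; }  /* the real fault */
        return *p;                               /* recovery: lw $r1, 0($r2) */
    }
    return v.value;                              /* speculation succeeded */
}
```

The key property: a bad address only raises a fault at the check — i.e., only if the branch actually reaches the guarded use — which is exactly why the speculated load can be hoisted above the branch safely.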

  16. Architecture Support in IA64 • Control speculation (breaking a control dependence)
  original:
    (p1) br.cond
         ld8 r1 = [ r2 ]
  transformed:
         ld8.s r1 = [ r2 ]
         . . .
    (p1) br.cond
         . . .
         chk.s r1, recovery

  17. Data Speculation • Example
  Source:  { *p = a; b = *q + 1; }
  Code:    lw $r3,4($sp)
           sw $r3, 0($r1)
           lw $r5,0($r2)
           addi $r6,$r5,1
           sw $r6,8($sp)
  In this block, there is no room to schedule the load !! How can we move the load instruction ahead of the store? $r2 and $r1 may be different most of the time, but could possibly be the same.

  18. Data Speculation • Example
  Source:  { *p = a; b = *q + 1; }
  Code:    lw $r3,4($sp)
           lw $r5,0($r2)    ← load hoisted above the store
           sw $r3, 0($r1)
           addi $r6,$r5,1
           sw $r6,8($sp)

  19. Data Speculation • Example
  Source:  { *p = a; b = *q + 1; }
  Code:    lw $r3,4($sp)
           lw $r5,0($r2)              ← load hoisted above the store
           sw $r3, 0($r1)
           If (r1==r2) copy $r5,$r3   ← fix-up when the speculation fails
           addi $r6,$r5,1
           sw $r6,8($sp)
  • What if there are m loads moving above n stores? • m x n comparisons must be generated !! • So some HW/architecture support is needed.
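The fix-up scheme above can be written out in C for the single-load, single-store case. `store_then_load` is a hypothetical helper that makes the semantics concrete; with m loads hoisted above n stores the same idea requires m × n such comparisons, which is why hardware support is attractive.

```c
#include <assert.h>

/* Implements { *p = a; b = *q + 1; } with the load of *q hoisted above the
 * store to *p, plus a runtime comparison that repairs the speculatively
 * loaded value when p and q actually alias. */
static void store_then_load(int *p, int *q, int a, int *b) {
    int r5 = *q;          /* lw $r5, 0($r2): hoisted above the store */
    *p = a;               /* sw $r3, 0($r1)                          */
    if (p == q)           /* If (r1==r2) copy $r5,$r3                */
        r5 = a;           /* speculation failed: take the stored value */
    *b = r5 + 1;          /* addi $r6,$r5,1; sw $r6,8($sp)           */
}
```

Both orderings compute the same result as the original program: when p and q differ the hoisted load was correct, and when they alias the copy re-establishes the store-to-load dependence.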

  20. Architecture Support in IA64 • Data speculation (breaking a memory dependence)
  original:
    st4 [ r3 ] = r7
    ld8 r1 = [ r2 ]
  transformed:
    ld8.a r1 = [ r2 ]
    . . .
    st4 [ r3 ] = r7
    . . .
    chk.a r1, recovery

  21. ALAT (Advance Load Address Table) • Data speculation
  original:
    st4 [ r3 ] = r7
    ld8 r1 = [ r2 ]
  transformed:
    ld8.a r1 = [ r2 ]
    . . .
    st4 [ r3 ] = r7
    . . .
    chk.a r1, recovery
  Assume (r2) = 0x00001ab0. After ld8.a executes, the ALAT contains the entry: r1 → 0x1ab0.

  22. ALAT (Advance Load Address Table) • Data speculation
  transformed:
    ld8.a r1 = [ r2 ]      ALAT: r1 → 0x1ab0
    . . .
    st4 [ r3 ] = r7        assume (r3) = 0x0000111a
    . . .
    chk.a r1, recovery
  The store address does not match any entry in the ALAT, so the ALAT is unchanged. chk.a finds the entry for r1 in the ALAT and turns into a NOP.

  23. ALAT (Advance Load Address Table) • Data speculation
  transformed:
    ld8.a r1 = [ r2 ]      ALAT: r1 → 0x1ab0
    . . .
    st4 [ r3 ] = r7        assume (r3) = 0x00001ab0
    . . .
    chk.a r1, recovery
  The store address matches the entry in the ALAT, so the r1 entry is removed. chk.a finds no entry for r1 in the ALAT; the check fails and branches to the recovery routine.
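The ALAT protocol walked through on slides 21–23 can be simulated with a toy table in C. Assumed simplifications: one entry per register number and exact address matching, whereas the real ALAT is a small associative structure that also checks access sizes and overlapping ranges.

```c
#include <assert.h>
#include <stddef.h>

/* Toy ALAT: indexed by destination register number. */
#define NREGS 8
typedef struct { int valid; const void *addr; } alat_t[NREGS];

/* ld8.a: perform an advanced load and allocate an ALAT entry. */
static void ld_a(alat_t t, int reg, const void *addr) {
    t[reg].valid = 1;
    t[reg].addr  = addr;
}

/* Every store snoops the table; a matching address invalidates the entry,
 * because the advanced load's value may now be stale. */
static void st(alat_t t, const void *addr) {
    for (int r = 0; r < NREGS; r++)
        if (t[r].valid && t[r].addr == addr)
            t[r].valid = 0;
}

/* chk.a: 1 = entry still present (the check becomes a NOP),
 *        0 = entry gone (branch to the recovery routine). */
static int chk_a(alat_t t, int reg) {
    return t[reg].valid;
}
```

The test below replays the two scenarios from the slides: a store to a different address leaves the entry intact, while a store to the advanced load's address removes it and forces recovery.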

  24. More Cases for Data Speculation • Many high performance architectural features are not effectively exploited by compilers due to imprecise analysis. Examples: • Automatic vectorization / parallelization • Local memory allocation / assignment • Register allocation • …

  25. Examples
  • Vectorization:
      loop (k=1; k<n; k++)
        a[k] = a[j] * b[k];
      end
    What if a, b are pointers? What if j == k?
  • Register Allocation:
      … = a->b;
      *p = …
      … = a->b;
    Can we allocate a->b to a register? Could *p modify a->b? or a?

  26. Complete alias and dependence analyses are costly and difficult • they need inter-procedural analysis • it is hard to handle dynamically allocated memory objects • runtime disambiguation is expensive • But … true memory dependences rarely happen!!

  27. Static and Dynamic Dependences • Most ambiguous data dependences identified by the compiler do not occur at runtime.

  28. Speculation can compensate for imprecise alias information if speculation failures can be efficiently detected and recovered from • Can we effectively use hardware support to speculatively promote memory references to registers? • Can we speculatively vectorize or parallelize loops?

  29. Motivation Example
  Original program:
    … = *p
    *q = ..
    … = *p
  Traditional compiler code:
    ld r32=[r31]
    *q = …
    ld r32=[r31]    ← *p must be reloaded because *q may alias it
    … = r32

  30. Another Example
  Original program:
    if (p->s1->s1->x1) {
      ….
      *ip = 0;
      p->s1->s1->x1++;
      ….
    }
  Traditional compiled code:
    ld8 r14=[r32]
    adds r14=8,r14
    ld8 r14 = [r14]
    ld4 r14 = [r14]
    cmp4 p6,p7=0,r14
    (p6) br….
    st [r16] = r0       // *ip = 0
    ld8 r14=[r32]       // the pointer chain is re-loaded: *ip may alias it
    adds r14=8,r14
    ld8 r15 = [r14]
    ld4 r14 = [r15]
    adds r14=1,r14
    st4 [r15] = r14

  31. Our approach at UM • Use alias profile or compiler heuristics to obtain approximated alias information • Use data speculation to verify such alias information at run time • Use the Advance Load Address Table (ALAT) in IA64 for the necessary support of data speculation

  32. Background of ALAT in IA64 (figure omitted)

  33. Speculative Register Promotion • Use ld.a for the first load • Check the subsequent loads • Scheme 1: use ld.c for subsequent reads to the same reference. • Scheme 2: use chk.a for subsequent reads. This allows promotion of multi-level pointer variables. (e.g. if a->b->c is speculatively promoted to a register, but a is aliased and modified, then the recovery code to reload a, a->b and a->b->c must be executed)

  34. Examples
  a. read after read:
     Source:  =*p+1; *q=… ; =*p+3;
     Code:    ld.a r1=[p]
              add r3=r1,1
              *q = ….
              ld.c r1=[p]
              add r4=r1, 3
  b. read after write:
     Source:  *p= ; *q =…. ; …=*p+3;
     Code:    st [p]=r1
              ld.a r1=[p]
              *q = ….
              ld.c r1=[p]
              add r4=r1, 3
  c. multiple redundant loads:
     Source:  =*p; *q = … ; =*p; *q = … ; =*p;
     Code:    ld.a r1=[p]
              *q = …
              ld.c.nc r1=[p]
              *q = …
              ld.c.clr r1=[p]

  35. Compiler Support for Speculative Register Promotion • Enhanced SSA form with the notion of data speculation • SSA form for indirect memory references • χ operator : MayMod • μ operator : MayUse • Speculative SSA form • χs operator: the variable in the χs is unlikely to be updated by the corresponding definition statement • μs operator: the variable in the μs is unlikely to be referenced by the indirect reference

  36. Speculative SSA Form According To Alias Profiling
    *p =        b2 = χ(b1)   a2 = χs(a1)   v2 = χ(v1)
    … = *p      μ(b1)        μs(a1)        μ(v1)
  The two examples assume that the points-to set of p generated by the compiler is {a, b}, and the points-to set of p obtained from alias profiling is {b}. v is the virtual variable for *p. aj stands for version j of variable a.

  37. Overview of Speculative Register Promotion* • Phi insertion • Rename • Down_safety • Will_be_available • Finalize • Code motion * Based on SSAPRE [Kennedy, et al., ACM TOPLAS ’99]

  38. Enhanced Rename
  (a) Traditional Renaming:
      … = a1
      *p1 = …     v2 = χ(v1)   a2 = χ(a1)   b2 = χ(b1)
      … = a2
  (b) Speculative Renaming:
      … = a1
      *p1 = …     v2 = χ(v1)   a2 = χs(a1)   b2 = χ(b1)
      … = a1 <speculative>
  The target set of *p generated by the compiler is {a, b} and v is the virtual variable for *p. The target set of *p generated by alias profiling is {b}.

  39. Example of Speculative Code Motion
  (a) Before Code Motion:
      … = a1
      *p1 = …     v4 = χs(v3)   a2 = χs(a1)   b4 = χ(b3)
      … = a1 <speculative>
  (b) Final Output:
      t1 = a1 (ld.a)
      …
      *p1 = …     v4 = χ(v3)   a2 = χs(a1)   b4 = χ(b3)
      t4 = a1 (ld.c)
      …

  40. Implementation • Open Research Compiler v1.1 • Benchmark • Spec2000 C programs • Platform • HP i2000, 733 MHz Itanium processor, 1GB SDRAM • Redhat Linux v7.1 • Pfmon v1.1

  41. Example from Equake
  Call site: smvp(,,,, disp[dispt], disp[disptplus]);
  void smvp(int nodes, double ***A, int *Acol, int *Aindex, double **v, double **w) {
    . . .
    for (i = 0; i < nodes; i++) {
      . . .
      while (Anext < Alast) {
        col = Acol[Anext];
        sum0 += A[Anext][0][0] *…
        sum1 += A[Anext][1][1] *…
        sum2 += A[Anext][2][2] *…
        w[col][0] += A[Anext][0][0]*v[i][0] + …
        w[col][1] += A[Anext][1][1]*v[i][1] + …
        w[col][2] += A[Anext][2][2]*v[i][2] + …
        Anext++;
      }}}
  A[][][] and v[][] are not promoted to registers due to possible aliasing with w[][].

  42. Example from Equake (cont.) • Same smvp loop as the previous slide. • Promoting A[][][] and v[][] to registers using the ALAT improves this procedure by 10%.

  43. Example from Equake (cont.) • Same smvp loop as slide 41. • Using heuristic rules, our compiler can promote both ***A and **v to registers. But using the alias profile, our compiler fails to promote **v, because at the call site v and w are passed with the same array name.

  44. Performance Improvement of Speculative Register Promotion

  45. Effectiveness of Speculative Register Promotion

  46. Performance Improvement of Speculative Register Promotion based on Heuristic Rules

  47. Performance Improvement of Speculative Register Promotion on Itanium-2

  48. Advantages of Using Heuristic Rules • Full coverage. • Input-insensitive. • Efficient. • Scalable.

  49. A Case for Using Profiles
  DO 140 L = L3, L4, 2
    Q(IJ(L))     = Q(IJ(L))     + W1(L)*QMLT(L)
    Q(IJ(L)+1)   = Q(IJ(L)+1)   + W2(L)*QMLT(L)
    ….
    Q(IJ(L+1))   = Q(IJ(L+1))   + W1(L+1)*QMLT(L+1)
    Q(IJ(L+1)+1) = Q(IJ(L+1)+1) + W2(L+1)*QMLT(L+1)
    ……
  140 CONTINUE
  Heuristic rules assume Q(IJ(L)) is different from Q(IJ(L+1)), but the two are often identical, since IJ() is often sorted, e.g. 1,1,2,2,2,5,5,6,6,6,6,9,9,9.
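The dependence the heuristic misses can be demonstrated in C (hypothetical `serial` and `unsafe_parallel` helpers, simplified to one update per iteration): when IJ contains duplicate indices, a transformation that treats iterations as independent — gathering all old values of Q before scattering the updates — loses one of the accumulations.

```c
#include <assert.h>

/* The original reduction: each iteration accumulates into Q(IJ(L)). */
static void serial(double *Q, const int *IJ, const double *W, int n) {
    for (int l = 0; l < n; l++)
        Q[IJ[l]] += W[l];
}

/* An (incorrect) "vectorized" version that assumes all IJ[l] are distinct:
 * gather the old values first, then scatter the updated values.  When two
 * iterations share an index, the last scatter wins and one update is lost.
 * (Assumes n <= 8 for the scratch array.) */
static void unsafe_parallel(double *Q, const int *IJ, const double *W, int n) {
    double old[8];
    for (int l = 0; l < n; l++) old[l] = Q[IJ[l]];       /* gather  */
    for (int l = 0; l < n; l++) Q[IJ[l]] = old[l] + W[l]; /* scatter */
}
```

This is why a runtime or profile-based check on IJ matters: the transformation is only valid for the iterations whose indices are actually distinct.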
