
Idempotent Code Generation: Implementation, Analysis, and Evaluation


Presentation Transcript


  1. Idempotent Code Generation: Implementation, Analysis, and Evaluation. Marc de Kruijf, Karthikeyan Sankaralingam. CGO 2013, Shenzhen

  2. Example: source code. int sum(int *array, int len) { int x = 0; for (int i = 0; i < len; ++i) x += array[i]; return x; }

  3. Example: assembly code. R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 (figure: execution may be interrupted by exceptions, mis-speculations, or faults)

  4. Example: assembly code. R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 (annotation: Bad Stuff Happens!)

  5. Example: assembly code. R0 and R1 are unmodified: R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3. Just re-execute! (The conventional approach is to use checkpoints/buffers.)

  6. It’s Idempotent! (idempoh… what…?) int sum(int *data, int len) { int x = 0; for (int i = 0; i < len; ++i) x += data[i]; return x; }
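
To make the property concrete, here is a minimal, self-contained C sketch (my own harness around the slide's sum() function; the main() driver is hypothetical): because sum() never overwrites its inputs, running it again, which is exactly what recovery does, produces the same answer.

    #include <assert.h>

    /* sum() as on the slide: its inputs (array, len) are never overwritten. */
    int sum(int *array, int len) {
        int x = 0;
        for (int i = 0; i < len; ++i)
            x += array[i];
        return x;
    }

    int main(void) {
        int data[4] = {1, 2, 3, 4};
        int first  = sum(data, 4);   /* normal execution                 */
        int second = sum(data, 4);   /* "recovery" by plain re-execution */
        assert(first == second);     /* idempotent: same result, no harm */
        return 0;
    }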

  7. Idempotent Region Construction: previously… in PLDI ’12: idempotent regions All The Time (figure: code before and after being partitioned into idempotent regions)

  8. Idempotent Code Generation: now… in CGO ’13. int sum(int *array, int len) { int x = 0; for (int i = 0; i < len; ++i) x += array[i]; return x; } How do we get from here...

  9. Idempotent Code Generation: now… in CGO ’13. R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 ...to here...

  10. Idempotent Code Generation: now… in CGO ’13. R2 = load [R1] R1 = 0 LOOP: R4 = load [R0 + R2] R1 = add R1, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 ...not here (this is not idempotent: the live-in register R1 is overwritten)...
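
A hypothetical C analogue of why this version breaks (my own example, not from the slides): the accumulator reuses a value that the region still needs at its entry, so restarting the region after a fault reads a corrupted input.

    /* Hypothetical analogue of slide 10: 'len' is an input of the region
     * but is reused as the accumulator.  If a fault hits inside the loop
     * and execution restarts at the top of the region, 'i = len' now
     * picks up a partial sum instead of the length, so the result is wrong. */
    int sum_clobbered(int *array, int len) {
        /* ---- region entry: re-execution restarts here ---- */
        int i = len;             /* reads the live-in 'len'              */
        len = 0;                 /* clobbers the live-in: not idempotent */
        for (; i > 0; --i)
            len += array[i - 1];
        return len;
    }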

  11. Idempotent Code Generation: now… in CGO ’13. R3 = R1 R2 = load [R3] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 ...and not here (this is slow)...

  12. Idempotent Code Generation: now… in CGO ’13. R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 ...here...

  13. Idempotent Code Generation: applications to prior work. Exceptions: Hampton & Asanović, ICS ’06; De Kruijf & Sankaralingam, MICRO ’11; Menon et al., ISCA ’12. Mis-speculations: Kim et al., TOPLAS ’06; Zhang et al., ASPLOS ’13. Faults: De Kruijf et al., ISCA ’10; Feng et al., MICRO ’11; De Kruijf et al., PLDI ’12.

  14. Idempotent Code Generation: executive summary. (1) How do we generate efficient idempotent code? Not covered in this talk; the algorithms are available in source code form at http://research.cs.wisc.edu/vertical/iCompiler. (2) How do external factors affect overhead? Idempotent region size, instruction set (ISA) characteristics, and control flow side-effects can each affect overheads by 10% or more.

  15. Presentation Overview: ❶ Introduction ❷ Analysis (idempotent region size, ISA characteristics, control flow side-effects) ❸ Evaluation

  16. Analysis (a): idempotent region size. (plot: overhead vs. region size) As regions grow, the number of inputs increases and the likelihood of spills grows; once the maximum spill cost is reached, that cost is amortized over more instructions.
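
A back-of-the-envelope way to read that curve (my own toy model, not a formula from the paper): if a region of N dynamic instructions needs C extra copies or spills to preserve its inputs, the overhead is roughly C / N, and once C stops growing the overhead shrinks as N grows.

    /* Toy overhead model (illustrative assumption only): preservation
     * instructions divided by region size.  A bounded preservation cost
     * is amortized over more work as the region grows. */
    double overhead_estimate(int preservation_insns, int region_size) {
        return (double)preservation_insns / (double)region_size;
    }

For example, 3 preservation instructions in a 30-instruction region is a 10% overhead, which is the rough magnitude seen in the evaluation.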

  17. Analysis (b): ISA characteristics. (1) Two-address (e.g. x86) vs. three-address (e.g. ARM): ADD R1, R2 -> R1 idempotent? NO! ADD R1, R2 = R3 idempotent? YES! (sketched in C below) (2) Register-memory (e.g. x86) vs. register-register (e.g. ARM): for register-memory, register spills may be less costly (microarchitecture dependent). (3) Number of available registers: the impact is obvious, but more registers is not always enough (see back-up slide).
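
A C-level sketch of point (1) (my own example; at the register level it is the destructive form that clobbers a live-in value):

    /* Destructive vs. non-destructive updates, mirroring the two-address
     * vs. three-address contrast.  If 'x' is live into the region, the
     * destructive form forces the compiler to insert a preserving copy. */
    int two_address_style(int x, int y) {
        x = x + y;        /* like ADD R1, R2 -> R1: overwrites an input */
        return x;
    }

    int three_address_style(int x, int y) {
        int z = x + y;    /* like ADD R1, R2 = R3: x and y left intact  */
        return z;
    }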

  18. Analysis (c): control flow side-effects. (figure: x’s live interval across region boundaries, with x = ..., ... = f(x), y = ..., and x’s “shadow interval”; the case given no side-effects)

  19. Analysis (c): control flow side-effects. (figure: the same live-interval and “shadow interval” diagram, now given side-effects)
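
A C rendering of the shadow-interval idea as I read the two diagrams (a hypothetical example; the variable names and boundary placement are mine):

    /* x is live into the region, so even after its last use the storage
     * holding x must not be reassigned (e.g. to y) before the next
     * boundary: re-executing the region from its start must still find x
     * intact.  That forced extension of x's lifetime is its "shadow
     * interval". */
    int shadow_example(int x) {
        /* ---- region boundary: x is a live-in of this region ---- */
        int r = x * 3;    /* ... = f(x): the last ordinary use of x     */
        int y = r + 1;    /* y must not reuse x's register in this span */
        return y;
        /* ---- next region boundary: x's shadow interval ends here ---- */
    }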

  20. Presentation Overview: ❶ Introduction ❷ Analysis ❸ Evaluation, revisiting idempotent region size, ISA characteristics, and control flow side-effects

  21. Evaluation methodology. Benchmarks: the SPEC 2006, PARSEC, and Parboil suites. Measurements: performance overhead as dynamic instruction count (for x86 using Pin, for ARM using gem5); region size as instructions between boundaries (path length).
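
For concreteness, the bookkeeping behind numbers like the geometric-mean overhead reported next can be sketched as follows (my own helper functions, not the paper's tooling):

    #include <math.h>

    /* Overhead of the idempotent binary relative to the original, in %. */
    double percent_overhead(double base_insns, double idem_insns) {
        return 100.0 * (idem_insns - base_insns) / base_insns;
    }

    /* Geometric mean of per-benchmark ratios (e.g. 1.13 for +13%). */
    double geomean_ratio(const double *ratios, int n) {
        double log_sum = 0.0;
        for (int i = 0; i < n; ++i)
            log_sum += log(ratios[i]);
        return exp(log_sum / n);
    }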

  22. Evaluation (a): idempotent region size. (plot: overhead vs. region size for regions of 10+ instructions; the baseline, marked “You are HERE”, is typically 10-30 instructions, with 13.1% overhead, geometric mean)

  23. Evaluation (a): idempotent region size. (plot: overhead vs. region size, now with detection latency marked; 13.1% overhead at the baseline)

  24. Evaluation (a): idempotent region size. (plot: overhead vs. region size accounting for detection latency and re-execution time; overheads of 13.1% and 11.1% are marked, with re-execution time contributing about 0.06%)

  25. Evaluation (b): ISA characteristics. (chart: percentage overhead for x86-64 vs. ARMv7) Three-address support matters more for FP benchmarks; register-memory support matters more for integer benchmarks.

  26. Evaluation (c): control flow side-effects. (chart: percentage overhead with and without side-effects) The difference is substantial in only two cases and insubstantial otherwise. Intuition: the compiler typically already spills for control flow divergence.

  27. Presentation Overview: ❶ Introduction ❷ Analysis ❸ Evaluation

  28. Conclusions. (a) Region size matters a lot; large regions are ideal if recovery is infrequent, and overheads approach zero as regions grow. (b) The instruction set matters when region sizes must be small; overheads drop below 10% only with careful co-design. (c) Control flow side-effects generally do not matter; supporting them is not expensive.

  29. Conclusions. Code generation and static analysis algorithms: http://research.cs.wisc.edu/vertical/iCompiler. Applicability is not limited to architecture design; see Zhang et al., ASPLOS ’13: “ConAir: Featherweight Concurrency Bug Recovery [...]”. Thank you!

  30. Back-up Slides

  31. ISA Characteristics: more registers isn’t always enough. C code: x = 0; if (y > 0) x = 1; z = x + y; Assembly: R0 = 0; if (R1 > 0) R0 = 1; R2 = R0 + R1

  32. ISA Characteristics: more registers isn’t always enough. C code: x = 0; if (y > 0) x = 1; z = x + y; Assembly: R0 = 0; R3 = R0; if (R1 > 0) R3 = 1; R2 = R3 + R1 (an extra instruction, the copy R3 = R0, is needed no matter what)
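
The same back-up example written out as compilable C (my reconstruction of the slide's layout; the helper name is mine):

    /* To keep the code idempotent, the conditional redefinition cannot
     * simply overwrite x in place; the value is carried in a second
     * variable, and the copy 't = x' is the extra instruction that more
     * registers alone cannot eliminate. */
    int more_regs_not_enough(int y) {
        int x = 0;          /* R0 = 0                               */
        int t = x;          /* R3 = R0: the unavoidable extra copy  */
        if (y > 0)
            t = 1;          /* R3 = 1                               */
        int z = t + y;      /* R2 = R3 + R1                         */
        return z;
    }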

  33. ISA Characteristics: idempotence vs. fewer registers. (chart: percentage overhead with no idempotence but #GPRs reduced from 16) Data from SPEC INT only (SPEC INT uses general-purpose registers (GPRs) only).

  34. Very Large Regions: how do we get there? Problem #1: aliasing analysis; there is no flow-sensitive analysis in LLVM, which hurts loops. Problem #2: loop optimizations; boundaries in loops are bad for everyone (next slides), while loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help (a small unrolling sketch follows). Problem #3: large array structures; awareness of array access patterns can help (later slides). Problem #4: intra-procedural scope; limited scope aggravates all of the effects listed above.
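
One illustration of the loop-optimization point in Problem #2 (a sketch under the assumption that a region boundary is needed roughly once per loop back-edge; not taken from the paper): unrolling puts more work between boundaries.

    /* Unrolling by 4 leaves one back-edge, and thus one assumed boundary,
     * per four elements of work, i.e. larger idempotent regions. */
    void accumulate_unrolled(const int *a, int n, int *out) {
        int s = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4)              /* main unrolled loop */
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < n; ++i)                      /* remainder loop     */
            s += a[i];
        *out = s;
    }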

  35. Very Large Regions. Re: Problem #2 (cuts in loops are bad). C code: for (i = 0; i < X; i++) { ... } CFG + SSA: i0 = φ(0, i1); i1 = i0 + 1; if (i1 < X)

  36. Very Large Regions. Re: Problem #2 (cuts in loops are bad). C code: for (i = 0; i < X; i++) { ... } Machine code: R0 = 0; R0 = R0 + 1; if (R0 < X). No boundaries = no problem.

  37. Very Large Regions. Re: Problem #2 (cuts in loops are bad). C code: for (i = 0; i < X; i++) { ... } Machine code: R0 = 0; R0 = R0 + 1; if (R0 < X)

  38. Very Large Regions. Re: Problem #2 (cuts in loops are bad). C code: for (i = 0; i < X; i++) { ... } Machine code: R1 = 0; R0 = R1; R1 = R0 + 1; if (R1 < X). Costs: a “redundant” copy and an extra boundary (register pressure).
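
The same effect in C form (my reconstruction of slides 36-38; the split into two variables and the boundary placement reflect my reading of the antidependence constraint):

    /* The loop-carried update reads and then overwrites the same register
     * (R0 = R0 + 1 on the previous slides), and a read followed by an
     * overwrite of the same live value cannot sit inside one idempotent
     * region.  Splitting i across two variables separates the read from
     * the write so a boundary can go between them; the price is the
     * "redundant" copy and the extra boundary from the slide. */
    void loop_with_boundary(int X) {
        int i_new = 0;                 /* R1 = 0                        */
        do {
            int i_old = i_new;         /* R0 = R1: the "redundant" copy */
            /* ---- a region boundary can now be placed here ---- */
            i_new = i_old + 1;         /* R1 = R0 + 1                   */
        } while (i_new < X);           /* if (R1 < X)                   */
    }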

  39. Very Large Regions. Re: Problem #3 (array access patterns). The PLDI ’12 algorithm makes this simplifying assumption: [x] = a; b = [x]; [x] = c; becomes [x] = a; b = a; [x] = c; and the non-clobber antidependences are… GONE! Cheap for scalars, expensive for arrays.
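
In C terms the simplifying assumption is ordinary store-to-load forwarding (my rendering of the slide's snippet; function names are mine):

    /* The load between the two stores is replaced by the value the first
     * store wrote, so the read-then-write pattern on [x] disappears. */
    void before(int *x, int *b, int a, int c) {
        *x = a;
        *b = *x;     /* reads the location that is about to be rewritten */
        *x = c;
    }

    void after(int *x, int *b, int a, int c) {
        *x = a;
        *b = a;      /* forwarded: the antidependence on [x] is gone */
        *x = c;
    }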

  40. Very Large Regions. Re: Problem #3 (array access patterns). That forwarding is not really practical for large arrays, but if we don’t do it, non-clobber antidependences remain: // initialize: int array[100]; memset(&array, 0, 100*4); // accumulate: for (...) array[i] += foo(i); Solution: handle potential non-clobbers in a post-pass (the same way we deal with loop clobbers in the static analysis).
