Rational Apex 4.0 Optimization

Presentation Transcript


  1. Rational Apex 4.0 Optimization “Beware the benchmark!”

  2. Presentation Outline
     • Outline Rational Apex optimization behaviour
     • Demonstrate some of the optimization techniques being used by modern compilers
     • Show how these techniques defeat many of the assumptions made by traditional benchmarking suites

  3. Rational Apex Optimization
     • Optimization with Apex has 3 levels, controlled by the OPTIMIZATION_LEVEL switch
       • Level 0 – No optimization, maximize debuggability
         • This is the default
       • Level 1 – Many optimizations performed, some debuggability maintained
       • Level 2 – All optimizations performed, debugging may be very limited in some code
     • Optimization with Apex can have one of two objectives
       • Time – try to generate code that will execute in minimal time
       • Space – try to generate code that is as compact as possible
       • These two objectives are not mutually exclusive!

  4. Rational Apex Optimization
     • Apex performs optimization in several different places
       • Front End – post semantics
         • Common sub-expression elimination
         • Code in-lining
         • Loop unrolling
         • Remove unused code from local scope
       • Machine independent instruction stream optimizer “optim”
         • Loop invariant hoisting
         • Range propagation
         • Constraint check elimination
         • Reduce memory movement
       • Machine specific code generator
         • Peep-hole optimization
     • All optimization consumes extra CPU during compilation
       • The default is off – OPTIMIZATION_LEVEL: 0

  5. Example Code – Summation of SQRT
     • Simple routine that sums up square roots and prints the result

  6. Example Code – Body of G_E_F.Sqrt

  7. Example Code – Body of G_E_F.Hardware

  8. Optimization Level 0
     • No inlining, no code elimination, no check elimination
     • Disassembly of sum_sqrt.2.ada is 15845 lines long
     • No unused code has been eliminated – all the code for generic_elementary_functions remains

  9. Optimization Level 0 – Disassembly of “for” loop

  10. Optimization Level 0 – Disassembly of sqrt
      • 163 lines of assembly
      • Slightly abridged

  11. Optimization Level 0 – Disassembly of hardware
      • 56 lines of disassembly for SQRT
      • 10 instructions for SQRT_32

  12. Optimization Level 0 – Summary
      • Total of over 220 instructions generated for the code that we are interested in
        • Lots of it will be unused
        • Not to mention the rest of the code for the instantiation
      • Code maps back to source easily
        • Code layout follows source
      • Lots of overhead for this straightforward code
        • Subprogram prolog/epilog code
        • Stack checks
        • Register management
        • Subprogram call/return code (3 levels deep)
        • No delayed branch slots being filled

  13. Optimization Level 2 – Disassembly of “for” loop

  14. Optimization Level 2 – Observations
      • Disassembly of sum_sqrt.2.ada is 85 lines long
      • Entire loop and all the called subprogram code is now 12 instructions long
        • 5 instructions for “for” loop management
          • Includes 2 instructions for branching
        • 4 instructions for integer to float conversion
          • 2 are identical, as one copy is used to fill a delayed branch slot at the bottom of the loop
        • 1 instruction for the Text_Io code is used to fill a branch delay slot
        • 2 instructions to perform the actual Sqrt and summation

  15. Optimization Level 2 – Observations
      • The optimization objective was Time
        • Time is certainly optimized, but Space also benefited enormously
      • Different optimization techniques combined to produce very effective code
        • Inlining of 3 levels of subprogram call eliminated a significant amount of subprogram prolog/epilog code
        • Range propagation determined that the argument to SQRT could never be less than zero, which allowed the argument check to be removed
        • Evaluation of compile-time static expressions resulted in a lot of code not being generated
          • Kind of floating point type – no case statement needed
          • Availability of Hardware SQRT – no call needed to Has_Sqrt
        • Register lifetime analysis on the resulting code meant that the loop control variable and the summation variable could live in registers
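
The range propagation step on this slide can be illustrated with a toy interval check. This is only a sketch of the reasoning, not Apex's implementation: the `Interval` type, the helper names, and the loop bounds 1..100 are all hypothetical, since the original example code is not reproduced in this transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed range [lo, hi] of values a variable can take."""
    lo: float
    hi: float


def to_float(r: Interval) -> Interval:
    """Integer-to-float conversion keeps the same bounds."""
    return Interval(float(r.lo), float(r.hi))


def negative_check_needed(arg_range: Interval) -> bool:
    """The 'argument < 0.0' test can only fire if the lower bound is negative."""
    return arg_range.lo < 0.0


# Hypothetical loop bounds: the loop variable runs over 1 .. 100.
loop_var = Interval(1, 100)
sqrt_arg = to_float(loop_var)

# The check is provably dead, so a compiler may drop it.
print(negative_check_needed(sqrt_arg))  # False
```

If the loop variable could go negative, for example `Interval(-5, 100)`, the same test returns True and the argument check would have to stay.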

  16. Performing Benchmarks
      • Benchmarks usually consist of two distinct loops
        • A “Null Timing” loop to determine the overhead of the loop code itself
        • The Code Under Test loop, which has the same structure as the Null Timing loop with the inside of the loop replaced by the Code Under Test (C.U.T.)
      • Timing equation looks like
        • T_CUT = (T_CUT_loop – T_null_loop) / n
        • Where n is the number of iterations
        • Usually n has to be very high so that the resolution of the system clock is not significant in the result
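
The two-loop scheme and the timing equation can be sketched as a small harness. This is a minimal illustration, not code from any benchmark suite: the function names and the square-root body are assumptions chosen for the example, and a Python interpreter will not show the compiler effects discussed in these slides.

```python
import time


def per_iteration_time(t_cut_loop, t_null_loop, n):
    """T_CUT = (T_CUT_loop - T_null_loop) / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return (t_cut_loop - t_null_loop) / n


def time_loop(body, n):
    """Run body(i) n times and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for i in range(n):
        body(i)
    return time.perf_counter() - start


def null_body(i):
    """Null Timing loop body: same call structure, no work."""
    pass


def code_under_test(i):
    """Hypothetical Code Under Test: one square root."""
    return i ** 0.5


if __name__ == "__main__":
    n = 100_000  # large enough that clock resolution is not significant
    t_null = time_loop(null_body, n)
    t_cut = time_loop(code_under_test, n)
    print("seconds per iteration:", per_iteration_time(t_cut, t_null, n))
```

Both loops go through the same `time_loop` helper so that they share the same structure, which is the assumption the subtraction relies on.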

  17. Performing Benchmarks
      • One effect we notice is that sometimes a benchmark suite reports slower times for code even though we know we have improved our optimizations!
      • What’s happening?
        • The Null Timing loops of benchmark suites attempt to defeat compiler optimizations that would skew their results
        • Compilers have become better at getting rid of unnecessary code, often defeating the smart null loop
        • So now the equation looks like: T_CUT = (T_CUT_loop – 0) / n
        • So the remaining loop overhead time gets included in the time of the Code Under Test, making it look worse than before
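
A worked example with made-up numbers shows how the reported figure can get worse while the generated code gets better. All timings below are hypothetical, picked only to make the arithmetic easy to follow.

```python
def reported_time_ns(cut_ns, overhead_ns, n, null_loop_survives):
    """Per-iteration time a two-loop benchmark would report, in nanoseconds.

    cut_ns:             true cost of the Code Under Test per iteration
    overhead_ns:        true loop overhead per iteration
    null_loop_survives: False if the compiler removed the Null Timing loop
    """
    t_cut_loop = n * (cut_ns + overhead_ns)
    t_null_loop = n * overhead_ns if null_loop_survives else 0
    return (t_cut_loop - t_null_loop) / n


n = 1000

# Older compiler: slower code (10 ns), but the null loop is still timed.
old = reported_time_ns(cut_ns=10, overhead_ns=4, n=n, null_loop_survives=True)

# Newer compiler: faster code (8 ns), but the null loop was optimized away,
# so the 4 ns of loop overhead is charged to the Code Under Test.
new = reported_time_ns(cut_ns=8, overhead_ns=4, n=n, null_loop_survives=False)

print(old, new)  # 10.0 12.0
```

The benchmark reports 12 ns against 10 ns, a 20% regression, although the code under test is actually 2 ns faster per iteration.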

  18. Performing Benchmarks
      • One other effect we observe is that benchmarks often don’t do anything with the results they calculate
      • Compilers can detect this and conclude that running the code has no effect and (very importantly) no side-effects
        • Range propagation concludes that overflow cannot be raised
        • Result is never used
        • Code is thrown away
      • A good example is the Hennessy benchmark in the PIWG suite
        • Large matrix multiplications, using a range of values that will not result in overflow
        • Apex 4.0 reports zero time for that test
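
The reasoning "result is never used, so the code is thrown away" can be modelled with a toy dead-code pass over straight-line code. This is a simplified sketch, not how Apex is implemented: the instruction tuples and names are hypothetical, and the `has_side_effect` flag stands in for what range propagation would establish (for example, that overflow cannot be raised).

```python
def eliminate_dead_code(instructions, live_out=()):
    """Drop instructions whose result is never used and that have no side effect.

    Each instruction is a tuple (target, operands, has_side_effect).
    Works backwards over straight-line code, tracking which names are live.
    """
    live = set(live_out)
    kept = []
    for target, operands, has_side_effect in reversed(instructions):
        if has_side_effect or target in live:
            kept.append((target, operands, has_side_effect))
            live.discard(target)      # this definition satisfies later uses
            live.update(operands)     # its inputs are now needed
    kept.reverse()
    return kept


# A benchmark body that computes a product and then ignores it.
benchmark = [
    ("a", (), False),
    ("b", (), False),
    ("product", ("a", "b"), False),
]

# The same body followed by an instruction that prints the result.
report = (None, ("product",), True)

print(len(eliminate_dead_code(benchmark)))             # 0
print(len(eliminate_dead_code(benchmark + [report])))  # 4
```

Once something observable depends on the result, every instruction feeding it has to be kept, which is why the second call retains all four.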

  19. Performing Benchmarks
      • When trying to compare different compiler technologies you need to look beyond the results printed by a benchmark program
        • Printed numbers can be very misleading
        • Look at absolute times and iteration counts
      • Benchmarks don’t translate well between processor variants and processor types
      • The best benchmark is your application
        • Or a sizable portion of it
