
Profile-driven Inlining for Erlang




  1. Profile-driven Inlining for Erlang Thomas Lindgren thomasl_erlang@yahoo.com

  2. Inlining • Replace a function call f(X1,…,Xn) with the body of f/n • Optimization enabler • Simplify code • Specialize code • Remove the "optimization fence" • Standard tool in the modern compiler toolbox
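As a concrete illustration of the bullet above (the function names here are invented, not from the talk), inlining replaces a call with the callee's body, so the compiler can simplify across the former call boundary:

```erlang
-module(inline_demo).
-export([before_inline/1, after_inline/1]).

%% Callee: a small function, cheap to inline.
area(W, H) -> W * H.

%% Before inlining: the call to area/2 is an optimization fence.
before_inline(W) ->
    area(W, 2 * W).

%% After inlining: the body of area/2 replaces the call, and the
%% compiler may now simplify W * (2 * W) in place.
after_inline(W) ->
    W * (2 * W).
```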

  3. Inlining • Main problem: which calls to inline? • Code growth reduces performance • Estimate code-size growth • Select the best estimated sites subject to a cost budget • Some static estimates: • f/n is small? (= inline cost is small) • Inlining the call to f/n enables optimization • Are we optimizing the important code? • Or just the convenient code?

  4. Inlining • Dynamic estimation • Profile the program • Select the best hot call sites for inlining • Optimize the important code

  5. Our approach • Inlining driven by profiling • Permit cross-module inlining • Computations often span several modules • Code growth measured for whole program • Cross-module optimization enabled by (i) module aggregation and (ii) guarded conversion of remote to local calls • (will not describe this further here) • [Lindgren 98]

  6. The rest of this talk • Overview of method • Performance measurements

  7. Inline forest • Inlinings to be done are represented by a forest • Nodes are inlined call sites • Leaves are call sites to be checked • (Diagram: nested inlinings of f, g and h; some sites are not inlined)

  8. Priority-based inlining • All call sites (leaves in inline forest) are placed in priority queue • Priority = estimated number of calls • When a call site f is inlined, the call sites in f are added to the queue • Priority scaled appropriately
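The priority-queue bookkeeping above can be sketched as follows (a minimal list-based sketch, not the talk's implementation; the entry shape and function names are assumptions):

```erlang
-module(inline_queue).
-export([pop/1, push_children/3]).

%% Queue entries are {EstimatedCalls, CallSite}; pop takes the hottest.
%% (Crashes on an empty queue; the driver loop checks for that first.)
pop(Queue) ->
    [Top | Rest] = lists:reverse(lists:keysort(1, Queue)),
    {Top, Rest}.

%% When a call site f, visited K times, is inlined, the call sites
%% inside f enter the queue with priorities scaled by their ratios.
push_children(K, Children, Queue) ->
    [{K * Ratio, Site} || {Site, Ratio} <- Children] ++ Queue.
```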

  9. Inlining algorithm • Preprocess code • call_site and size maps • Initialize priority queue • Initialize inline forest • While prio queue not empty • Take call site (k, f) • Try to inline it

  10. Preprocessing • For each function visited k times • for each call site visited k' times • set ratio(call_site) = k'/k • Adjust ratio so that it is < 1.0 • Self-recursive call sites := 0.0 • (improves code quality) • Maps (function -> [{call_site, ratio}])
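A hedged sketch of the ratio computation just described, assuming the 0.99 cap generalizes the slide-13 example and that self-recursive sites are zeroed as stated:

```erlang
-module(call_ratio).
-export([ratio/3]).

%% ratio(K, KPrime, SelfRecursive): K visits of the enclosing function,
%% KPrime visits of the call site. Self-recursive sites are set to 0.0;
%% other ratios are capped below 1.0 (0.99 is assumed from the example).
ratio(_K, _KPrime, true) -> 0.0;
ratio(K, KPrime, false) ->
    R = KPrime / K,
    if R >= 1.0 -> 0.99;    %% adjust so that ratio < 1.0
       true     -> R
    end.
```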

  11. Original code marked with number of visits

  12. Special attention to function calls

  13. dec_bearer_capability/2 runs 200,000 times • dec_bearer_capability_6 visited 200,000 times • ratio is (200k/200k) = 1.0 • adjust ratio to 0.99

  14. Inlining a call site • Bookkeeping phase (code gen later) • Call to f(X1,…,Xn), visited k times • k < minimum frequency? stop • tot_size + size(f) > max_size? skip • Otherwise, • tot_size += size(f) • for each call site g of f • add (k * ratio, g) to priority queue • extend node f by call sites g1,…,gn • Iterate until no call sites remain
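The bookkeeping loop above can be sketched end to end as follows (an illustrative reconstruction, not the talk's code; the map shapes and names are assumptions):

```erlang
-module(inline_select).
-export([select/5]).

%% select(Queue, Sizes, Children, MaxSize, MinFreq) -> call sites chosen
%% for inlining. Sizes is a map function -> size; Children is a map
%% function -> [{CallSite, Ratio}] from preprocessing.
select(Queue, Sizes, Children, MaxSize, MinFreq) ->
    loop(lists:reverse(lists:keysort(1, Queue)),
         Sizes, Children, MaxSize, MinFreq, 0, []).

loop([], _Sizes, _Children, _Max, _Min, _Tot, Acc) ->
    lists:reverse(Acc);
loop([{K, _F} | _], _Sizes, _Children, _Max, Min, _Tot, Acc)
  when K < Min ->
    lists:reverse(Acc);                       %% below minimum frequency: stop
loop([{K, F} | Rest], Sizes, Children, Max, Min, Tot, Acc) ->
    Size = maps:get(F, Sizes),
    case Tot + Size > Max of
        true ->                               %% too much code growth: skip
            loop(Rest, Sizes, Children, Max, Min, Tot, Acc);
        false ->                              %% inline F, enqueue its sites
            Kids  = [{K * R, G} || {G, R} <- maps:get(F, Children, [])],
            Rest1 = lists:reverse(lists:keysort(1, Kids ++ Rest)),
            loop(Rest1, Sizes, Children, Max, Min, Tot + Size, [F | Acc])
    end.
```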

  15. Example • Inlining applied to decode1 • Protocol decoding • Single module

  16. decode1 • Inline forest: decode1 • Prio queue: decode_ie_coding_1/3 [800k], decode_action/1 [800k], dec_bearer_capability/2 [200k], dec_bearer_capability_6/2 [198k], decode_ie_heads_setup/5 [198k], … • Call_site mapping (selected parts): dec_bearer_capability/2 -> [(dec_bearer_capability_6, 1.00)] (adjusted to 0.99); decode_ie_heads_setup/5 -> [(decode_action/1, 0.8), (decode_ie_coding/1, 0.8), (dec_bearer_capability, 0.2), (decode_ie_heads_setup/5, 0.2), (decode_ie_heads_setup/5, 0.6)] (self-recursive, so set to 0.0), …

  17. decode1 • Try to inline the queue head • Prio queue: decode_ie_coding_1/3 [800k], decode_action/1 [800k], dec_bearer_capability/2 [200k], dec_bearer_capability_6/2 [198k], decode_ie_heads_setup/5 [198k], … • Call_site mapping: dec_bearer_capability/2 -> [(dec_bearer_capability_6, 0.99)]; decode_ie_heads_setup/5 -> [(decode_action/1, 0.8), (decode_ie_coding/1, 0.8), (dec_bearer_capability, 0.2), (decode_ie_heads_setup/5, 0.0), (decode_ie_heads_setup/5, 0.0)], …

  18. decode1 • (Diagram: inline forest after inlining the previous queue head) • Prio queue: decode_action/1 [800k], dec_bearer_capability/2 [200k], dec_bearer_capability_6/2 [198k], decode_ie_heads_setup/5 [198k], …

  19. decode1 • (Diagram: inline forest after also inlining decode_action/1) • Prio queue: dec_bearer_capability/2 [200k], dec_bearer_capability_6/2 [198k], decode_ie_heads_setup/5 [198k], …

  20. decode1 • (Diagram: final inline forest; prio queue empty) • Final result: • inline dec_bearer_cap_6/2 into dec_bearer_cap/2, yielding (*) • inline dec_ie_coding/1, decode_action/1 and (*) into decode_ie_heads_setup/5 • During inlining, one inline was rejected for too much code growth (not shown) • Now time for code generation

  21. Code generation • Walk each inline tree from leaf to root • Replace inlined calls f(E1,…,En) with • (fun(X1,…,Xn) -> E end)(E1,…,En) • General case: nested inlines • Simplify the resulting function • Apply fun to arguments (above) • Case-of-case • Case-of-if • …
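For instance (illustrative code, not from the talk), an inlined call first appears as an immediately applied fun, and applying the fun to its arguments then leaves the plain body:

```erlang
-module(codegen_demo).
-export([inlined/1, simplified/1]).

%% Step 1: the inlined call f(E) is emitted as (fun(X) -> Body end)(E).
inlined(E) ->
    (fun(X) -> X + 1 end)(E).

%% Step 2: the "apply fun to arguments" simplification substitutes the
%% argument into the body, leaving ordinary straight-line code.
simplified(E) ->
    E + 1.
```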

  22. Measurements • Used five applications • decode1 (small protocol decoder) • ldapv2 (ASN.1 encode/decode) • gen_tcp (send/rcv over socket) • beam (compiler) • mnesia (simulate HLR)

  23. Benchmarks

  24. Benchmarks

  25. Benchmarks

  26. Performance • Very preliminary • Code-generation problems for beam and mnesia => unable to measure • (probably due to a name-capture bug) • Did not use outlining, higher-order specialization, or apply open-coding [EUC'01] • Tried only emulated code • Native-code compilation failed

  27. Speedup vs. baseline • (Chart not captured in transcript) • Native compilation of inlined decode1 provided a net slowdown

  28. Future work • Integrate with other optimizations • Plenty of opportunities for further source-level simplifications • Suggests new approach to module aggregation • (do it after inlining instead of before) • Tuning, measurements • Bugfixing …

  29. Conclusion • Profile-guided inlining speeds up real code • Whole-program, cross-module inlining probably necessary

  30. Backup slides

  31. Case-of-if
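A minimal illustration of the case-of-if simplification named on this slide (the example code is mine, not from the deck): the outer case is pushed into each branch of the inner if, after which every arm can be resolved statically:

```erlang
-module(case_of_if).
-export([before_simplify/1, after_simplify/1]).

%% Before: the scrutinee of the case is itself an if expression.
before_simplify(X) ->
    case (if X > 0 -> pos; true -> nonpos end) of
        pos    -> 1;
        nonpos -> -1
    end.

%% After: the case is pushed into each if branch, and matching the
%% known atoms pos/nonpos resolves each arm at compile time.
after_simplify(X) ->
    if X > 0 -> 1;      %% branch produced pos, so the case yields 1
       true  -> -1      %% branch produced nonpos, so the case yields -1
    end.
```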

  32. Module merging • We want to optimize over several modules at a time • What to do about hot code loading? • Merge modules into aggregates • Convert suitable remote calls into local calls • Guard such calls to preserve code-loading semantics • Annotate code regions with "origin module" to enable precise process purging • Or … extend Erlang appropriately
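A hypothetical sketch of the guarded remote-to-local conversion (the actual guard mechanism is in [Lindgren 98] and not described in this talk; every name below is invented): a remote call m:f(X) inside the aggregate becomes a local call that falls back to the true remote call when newer code has been loaded.

```erlang
-module(merge_demo).
-export([call_f/1]).

%% Hypothetical guard: a real implementation would compare the loaded
%% version of module m against the version baked into the aggregate.
code_is_current(m) -> true.

%% Stand-in for the body of m:f/1 merged into this aggregate.
local_m_f(X) -> X + 1.

call_f(X) ->
    case code_is_current(m) of
        true  -> local_m_f(X);   %% fast local call within the aggregate
        false -> m:f(X)          %% new code loaded: take the remote path
    end.
```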
