Just-In-Time Java Compilation for the Itanium Processor

Tatiana Shpeisman, Guei-Yuan Lueh, Ali-Reza Adl-Tabatabai. Intel Labs.

Presentation Transcript


  1. Just-In-Time Java Compilation for the Itanium Processor. Tatiana Shpeisman, Guei-Yuan Lueh, Ali-Reza Adl-Tabatabai (Intel Labs)

  2. Introduction
     • Itanium processor is a statically scheduled machine
       • Aggressive compiler techniques are needed to extract ILP
     • Just-In-Time (JIT) compiler must be fast
       • Must consider time & space efficiency of optimizations
       • Balance compilation time with code quality
     • Light-weight compilation techniques
       • Use heuristics for modeling the microarchitecture
       • Leverage semantics and metadata of the JVM

  3. Outline
     • Introduction
     • Compiler overview
     • Register allocation
     • Code scheduling
     • Other optimizations
     • Conclusions

  4. Compiler Structure (diagram)
     • Front-end: Prepass, IR construction, Inlining, Global optimizations
     • Back-end: Code Selection, Register Allocation, Predication, Code Scheduling, GC Support, Code Emission

  5. Register Allocation
     • Compilation time vs. code quality tradeoff
     • IPF architecture has large register files
       • 128 integer, 128 floating-point, 64 predicate, 8 branch
       • Register Stack Engine (RSE) provides 96 stack registers to each procedure
     • Use linear scan register allocation
       • “Linear Scan Register Allocation” by Massimiliano Poletto and Vivek Sarkar
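As a rough illustration of the linear scan idea cited on this slide, the sketch below walks live intervals in order of start point and hands out registers from a fixed pool, spilling the interval that ends last when the pool is empty. It is a simplified Python rendering of the general technique, not the JIT's implementation; the function name `linear_scan`, the `(start, end)` tuple encoding, and the example intervals are assumptions made for illustration.

```python
def linear_scan(intervals, num_regs):
    """intervals: dict name -> (start, end). Returns (assignment, spilled)."""
    free = list(range(num_regs))
    active = []        # (end, name) pairs that currently hold a register
    assignment = {}    # name -> register number
    spilled = set()
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        for old_end, old_name in list(active):
            if old_end < start:
                active.remove((old_end, old_name))
                free.append(assignment[old_name])
        if free:
            assignment[name] = free.pop(0)
            active.append((end, name))
            continue
        # No register free: spill whichever candidate ends last.
        last_end, last_name = max(active)
        if last_end > end:
            assignment[name] = assignment.pop(last_name)
            spilled.add(last_name)
            active.remove((last_end, last_name))
            active.append((end, name))
        else:
            spilled.add(name)
    return assignment, spilled


# Hypothetical intervals, numbered by instruction position.
example = {"a": (0, 10), "b": (1, 3), "c": (4, 6)}
two_regs = linear_scan(example, 2)
one_reg = linear_scan(example, 1)
```

With two registers nothing is spilled and `c` reuses the register freed by `b`; with one register the long interval `a` is the one spilled. The single pass over sorted intervals is what keeps compile time low compared with graph coloring.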

  6. Live Range vs. Live Interval (figure: a control-flow graph with blocks B1 to B4, shown twice, once labeled Live Ranges and once labeled Live Intervals; B2 contains t1=... and v=t1, B3 contains t2=... and v=t2, B4 contains ...=v)

  7. Coalescing Algorithm
     • Coalesce v and t in v = t iff
       • the live interval of t ends at v = t, and
       • the live interval of t does not intersect the live range of v
     • Requires one additional reverse pass over the IR
     • O(NINST + NVAR * NBB)
     (figure: blocks B1 to B4; B2 contains t1=... and v=t1, B3 contains t2=... and v=t2, B4 contains ...=v)
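The coalescing test can be sketched in a few lines of Python. The sketch only covers the two-part condition, not the reverse pass that gathers the live ranges. The instruction numbering (B2 = 1..3, B3 = 4..6, B4 = 7..8), the half-open span encoding, and the helper names `hull`, `overlaps`, and `can_coalesce` are all assumptions made for illustration.

```python
def hull(segments):
    """Live interval: the smallest single span covering every live-range segment."""
    return (min(s for s, _ in segments), max(e for _, e in segments))


def overlaps(a, b):
    """True if spans a and b, treated as half-open [start, end), share a point."""
    return a[0] < b[1] and b[0] < a[1]


def can_coalesce(copy_pos, t_interval, v_range):
    """Apply the two conditions for coalescing v and t at the copy v = t."""
    ends_at_copy = t_interval[1] == copy_pos
    disjoint = not any(overlaps(t_interval, seg) for seg in v_range)
    return ends_at_copy and disjoint


# v is defined at 3 (v = t1) and at 6 (v = t2) and used in B4; it is not
# live in the first part of B3, which leaves a hole in its live range.
v_range = [(3, 4), (6, 8)]
t1_interval = (1, 3)
t2_interval = (4, 6)
```

In this encoding `t2` sits in the hole of `v`'s live range, so it can be coalesced, even though it does overlap `v`'s live interval `hull(v_range)`. That difference is why the condition is stated against the live range of `v` and not its live interval.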

  8. Coalescing Speedup

  9. Code Scheduling
     • Forward cycle-based list scheduling
     • Scheduling unit is extended basic block
     • Middle exits are due to run-time exceptions

         (p6,p7) = cmp.eq r35, 0
         (p6) br ThrowNullPointerException
         r10 = r35 + 16
         r11 = ld8 [r10]
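A minimal Python sketch of forward cycle-based list scheduling follows. It advances one cycle at a time and issues, in program order, every instruction whose predecessors have completed, up to an issue width. The priority function (plain program order), the latencies, the issue width, and the dependence edges in the example are illustrative assumptions, not values from the presentation.

```python
def list_schedule(instrs, deps, latency, width):
    """Forward cycle-based list scheduling over a dependence DAG.

    instrs:  instruction names in program order (used as priority)
    deps:    name -> set of predecessor names (must be acyclic)
    latency: name -> cycles until the result is available
    width:   maximum instructions issued per cycle
    Returns name -> issue cycle.
    """
    issue = {}
    cycle = 0
    while len(issue) < len(instrs):
        issued = 0
        for ins in instrs:
            if ins in issue or issued == width:
                continue
            ready = all(
                p in issue and issue[p] + latency[p] <= cycle
                for p in deps.get(ins, ())
            )
            if ready:
                issue[ins] = cycle
                issued += 1
        cycle += 1
    return issue


# The four instructions from the slide; the load is kept below the null check.
instrs = ["cmp", "br", "add", "ld"]
deps = {"br": {"cmp"}, "ld": {"add", "br"}}
latency = {"cmp": 0, "br": 0, "add": 1, "ld": 2}   # assumed values
schedule = list_schedule(instrs, deps, latency, width=3)
```

Here the address computation issues in the same cycle as the check and its branch, and only the load waits. A production scheduler would normally rank ready instructions by critical path length instead of program order.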

  10. Type-based memory disambiguation
     • Use JVM metadata to disambiguate memory locations
     • Type
       • Integer, floating-point, object reference …
     • Kind
       • Object field, array element, virtual table address …
     • Field id
       • putfield #10 vs. putfield #15
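One way to picture the three-level test is as a conservative `may_alias` predicate: two references are proven independent as soon as their type, kind, or field id differ. The `MemRef` record and its string labels below are hypothetical; the presentation does not show the compiler's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemRef:
    type: str                       # e.g. "int", "float", "ref"
    kind: str                       # e.g. "field", "array", "vtable"
    field_id: Optional[int] = None  # constant-pool index for field accesses


def may_alias(a: MemRef, b: MemRef) -> bool:
    """Return False only when JVM metadata proves the locations differ."""
    if a.type != b.type:
        return False
    if a.kind != b.kind:
        return False
    if a.kind == "field" and a.field_id != b.field_id:
        return False
    return True


put10 = MemRef("int", "field", 10)
put15 = MemRef("int", "field", 15)
elem = MemRef("int", "array")
```

Because `putfield #10` and `putfield #15` name different fields, the scheduler is free to reorder them; two accesses to the same field, or two array element accesses of the same type, stay ordered.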

  11. Type-Based Disambiguation

  12. Exception Dependencies
     • Java exceptions are precise
     • Naive approach
       • Exception checks end basic blocks
     • Our approach
       • An instruction depends on an exception check iff
         • its destination is live at the exception handler, or
         • it is an exception check for a different exception type, or
         • it is a memory reference that may be guarded by the check

  13. Exception Dependency Example

         1: (p6, p0) = cmp.eq r16, 0
         2: (p6) br ThrowNullPointerException
         3: r17 = add r16, 8
         4: r18 = ld [r17]    // load field
         5: r21 = movl 0x000F14E32019000
         6: f8 = fld [r21]    // load static
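The three dependency rules can be applied to this example with a small Python predicate. The `Check` and `Instr` records are hypothetical, and "may be guarded by the check" is approximated here as "dereferences the register that the null check tests", which is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Check:
    exc_type: str
    checked_reg: str


@dataclass(frozen=True)
class Instr:
    dest: Optional[str] = None
    check_type: Optional[str] = None  # set if the instruction is itself a check
    object_reg: Optional[str] = None  # object reference a memory access goes through


def depends_on_check(instr, check, live_at_handler):
    """True if instr must stay below the exception check."""
    if instr.dest is not None and instr.dest in live_at_handler:
        return True
    if instr.check_type is not None and instr.check_type != check.exc_type:
        return True
    if instr.object_reg is not None and instr.object_reg == check.checked_reg:
        return True
    return False


null_check = Check("NullPointerException", "r16")   # instructions 1 and 2
add3 = Instr(dest="r17")                            # instruction 3
load_field4 = Instr(dest="r18", object_reg="r16")   # instruction 4
load_static6 = Instr(dest="f8")                     # instruction 6
```

Assuming nothing defined here is live at the handler, the address add and the static load are free to move above the check, while the field load through `r16` is not.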

  14. Exception Dependencies

  15. IPF Architecture
     • Execution (functional) unit type: M, I, F, B
     • Instruction (syllable) type: M, A, I, F, B, IL
     • Bundles, templates
       • .mii .mi;;i .mil .mmi .m;;mi .mfi .mmf .mib .mbb .bbb .mmb .mfb
     • Instruction group: no RAW, WAW with some exceptions

         .mi;;i
         r10 = ld [r15]
         r9 = add r8, 1 ;;    // stop bit
         r16 = shr r9, r32

  16. Template Selection
     • Pack instructions into bundles
     • Choose slot for each instruction
     • Insert NOP instructions
     • Assign instructions to functional units
     Problems:
     • Resource oversubscription
     • Inaccurate bypass latencies

  17. Algorithm
     • Greedy slot assignment
     • Sort instructions by syllable type
       • M < F < IL < I < A < B

         I1: r20 = sxt r14          (I-type)
         I2: r21 = movl ADDR        (IL-type)
         I3: f15 = fadd f10, f11    (F-type)

         Unsorted: NOP I1 NOP NOP I2 NOP I3 NOP
         Sorted:   NOP I3 I1 NOP I2
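The effect of sorting can be reproduced with a small first-fit packer in Python. It is a sketch under several assumptions: only the ten base templates are modeled (stop-bit variants are ignored, and the slide's `.mil` is written `MLX`), each instruction goes into the first bundle of the group that still has a compatible slot, and an IL instruction occupies slots 1 and 2. The names `TEMPLATES`, `SLOT_KINDS`, `place`, and `pack` are illustrative.

```python
TEMPLATES = ["MII", "MLX", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"]
# Slot letters each syllable type may occupy (A-type fits an M or an I slot).
SLOT_KINDS = {"M": "M", "I": "I", "A": "MI", "F": "F", "B": "B", "IL": "L"}
SORT_KEY = {"M": 0, "F": 1, "IL": 2, "I": 3, "A": 4, "B": 5}


def place(bundle, name, syl):
    """Try to put one instruction into a bundle; narrow its template candidates."""
    slots, cands = bundle
    for s in range(3):
        if slots[s] is not None:
            continue
        if syl == "IL" and (s == 2 or slots[s + 1] is not None):
            continue  # a long instruction needs two adjacent free slots
        keep = [t for t in cands if t[s] in SLOT_KINDS[syl]]
        if keep:
            slots[s] = name
            if syl == "IL":
                slots[s + 1] = name
            cands[:] = keep
            return True
    return False


def pack(instrs, sort=True):
    """instrs: list of (name, syllable type). Returns slot lists; None is a NOP."""
    if sort:
        instrs = sorted(instrs, key=lambda i: SORT_KEY[i[1]])
    bundles = []
    for name, syl in instrs:
        if not any(place(b, name, syl) for b in bundles):
            fresh = ([None] * 3, list(TEMPLATES))
            place(fresh, name, syl)
            bundles.append(fresh)
    return [slots for slots, _ in bundles]


group = [("I1", "I"), ("I2", "IL"), ("I3", "F")]
unsorted_bundles = pack(group, sort=False)
sorted_bundles = pack(group, sort=True)
```

For the three instructions on the slide, the unsorted order needs three bundles with five NOPs, while the sorted order fits in two bundles with two NOPs, which agrees with the slot sequences shown. Placing the more constrained syllable types first keeps the I-type instruction from claiming the only slot the F-type instruction could use.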

  18. Template Selection Heuristics

  19. Bypass Latency Accuracy
     (figure: r17 = add r16, 8 feeding r18 = ld [r17]; the add is shown once on an I-unit and once on an M-unit, with the load on an M-unit, and latencies of 1 and 2 marked)
     • Phase ordering of functional unit assignment
       • Code selection time is too early: underutilizes resources
       • Template selection time is too late: inaccurate scheduling latencies
     • Solution: assign to functional unit during scheduling
       • Assign to M-unit if available, else assign to I-unit and increment latency
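The rule on this slide is small enough to sketch directly in Python. The function below is a hypothetical helper a scheduler might call for an A-type address computation that feeds a load. The base latency of 1 and the one-cycle penalty are assumptions read off the "1" and "2" in the figure, and the name `assign_unit` is illustrative.

```python
def assign_unit(free_m_units, free_i_units, latency=1):
    """Pick a unit for an A-type address add and return (unit, latency to the load)."""
    if free_m_units > 0:
        return "M", latency
    if free_i_units > 0:
        # Result must cross from the I-unit to the memory unit: model the extra cycle.
        return "I", latency + 1
    return None, None  # nothing free this cycle; the scheduler would retry later


on_m = assign_unit(free_m_units=1, free_i_units=2)
on_i = assign_unit(free_m_units=0, free_i_units=2)
```

Deciding the unit inside the scheduler means the latency used for the dependent load already reflects the placement, which is the accuracy the slide is after.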

  20. Modeling of Address Computation Latency

  21. Other optimizations
     • Predication
       • Profitability depends on the benchmark
       • Performance variations within 2%
     • Branch hints
       • Up to 50% speedup from using branch hints
     • Sign-extension elimination
       • 1% potential gain for our compiler

  22. Conclusions
     • Light-weight optimization techniques for Itanium
     • Considering the microarchitecture is important
       • Cannot ignore bypass latencies
       • Template selection should be resource sensitive
     • Language semantics help to improve ILP
       • Type-based memory disambiguation
       • Exception dependency elimination
