
Compiling for IA-64





Presentation Transcript


  1. Compiling for IA-64
  Carol Thompson, Optimization Architect, Hewlett Packard

  2. History of ILP Compilers
  • CISC era: no significant ILP
    • Compiler is merely a tool to enable use of a high-level language, at some performance cost
  • RISC era: advent of ILP
    • Compiler-influenced architecture
    • Instruction scheduling becomes important
  • EPIC era: ILP as driving force
    • Compiler-specified ILP

  3. Increasing Scope for ILP Compilation
  • Early RISC compilers
    • Basic block scope (delimited by branches and branch targets)
  • Superscalar RISC and early VLIW compilers
    • Trace scope (single entry, single path)
    • Superblocks and hyperblocks (single entry, multiple path)
  • EPIC compilers
    • Composite regions: multiple entry, multiple path
  [Figure: basic blocks, traces, superblocks, and composite regions compared]
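The basic-block scope described above can be sketched as a simple partitioning pass: a block ends at a branch and begins at a branch target. A minimal sketch in Python over a toy IR of (label, opcode) pairs; all names here are invented for illustration, not taken from any real compiler:

```python
def split_basic_blocks(instructions, branch_targets):
    """Partition a linear instruction list into basic blocks.

    instructions: list of (label, op) pairs, in program order.
    branch_targets: set of labels that are jumped to from somewhere.
    An op beginning with "br" ends a block; a label in branch_targets
    starts a new one.  (Toy IR, for illustration only.)
    """
    blocks, current = [], []
    for label, op in instructions:
        if label in branch_targets and current:
            blocks.append(current)        # a branch target opens a new block
            current = []
        current.append((label, op))
        if op.startswith("br"):           # a branch closes the current block
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks
```

With basic-block scope, the scheduler may only reorder instructions inside each of these small blocks, which is why the slides stress how little ILP it exposes.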

  4. Unbalanced and Unbiased Control Flow
  • Most code is not well balanced
    • Many very small blocks, some very large
    • Then and else clauses are frequently unbalanced in number of instructions and in path length
  • Many branches are highly biased
    • But some are not
  • The compiler can obtain frequency information from profiling or derive it heuristically
  [Figure: control-flow graph annotated with branch frequencies such as 60/40 and 55/5]
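The bias of a profiled branch falls out directly from its edge counts. A small sketch, assuming raw taken/not-taken counts come from a profiling run:

```python
def branch_bias(taken, not_taken):
    """Fraction of executions that follow the branch's more frequent
    outcome: 0.5 means perfectly unbiased, 1.0 perfectly biased."""
    total = taken + not_taken
    if total == 0:
        return 0.5          # no profile data: assume unbiased
    return max(taken, not_taken) / total
```

A 60/40 branch (bias 0.6) is the problematic "unbiased" case the slides describe, while a 55/5 branch (bias about 0.92) is safe to optimize along its frequent path.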

  5. Basic Blocks
  • Basic blocks are simple
    • No issues with executing unnecessary instructions
    • No speculation or predication support required
  • But very limited ILP
    • Short blocks offer very little opportunity for parallelism
    • Long-latency code is unable to take advantage of issue bandwidth in an earlier block
  [Figure: the same control-flow graph, scheduled one basic block at a time]

  6. Traces
  • Traces allow scheduling of multiple blocks together
    • Increases available ILP
    • Long-latency operations can be moved up, as long as they are on the same trace
  • But unbiased branches are a problem
    • Long-latency code on slightly less frequent paths can't move up
    • Issue bandwidth may go unused (not enough concurrent instructions to fill the available execution units)
  [Figure: the most frequent path through the graph selected as a trace]
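Trace selection along these lines can be sketched as repeatedly following the most frequent successor edge until hitting an already-placed block. This is a simplified cousin of the classic most-likely-successor heuristic; the data structures are assumptions for illustration, not the HP compiler's:

```python
def grow_trace(cfg, freq, seed, visited):
    """Grow one trace starting at `seed`.

    cfg:  block -> list of successor blocks.
    freq: (src, dst) -> profiled edge count.
    visited: set of blocks already placed in some trace (mutated).
    Follows the hottest outgoing edge at each step.
    """
    trace, block = [], seed
    while block is not None and block not in visited:
        trace.append(block)
        visited.add(block)
        succs = cfg.get(block, [])
        # pick the most frequent successor, or stop at a trace end
        block = max(succs,
                    key=lambda s: freq.get((trace[-1], s), 0),
                    default=None)
    return trace
```

On the slides' 60/40 diamond, the trace follows the 60-count edge; the 40-count path is exactly the "slightly less frequent path" whose long-latency code cannot move up.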

  7. Superblocks and Hyperblocks
  • Superblocks and hyperblocks allow inclusion of multiple important paths
    • Long-latency code may migrate up from multiple paths
    • Hyperblocks may be fully predicated
    • More effective utilization of issue bandwidth
  • But this requires code duplication
    • Wholesale predication may lengthen important paths
  [Figure: superblock formed by duplicating the rejoin block]

  8. Composite Regions
  • Allow rejoin from non-region code
    • Wholesale code duplication is not required
  • Support full code motion across the region
    • Allow all interesting paths to be scheduled concurrently
  • Nested, less important regions bear the burden of the rejoin
    • Compensation code, as needed
  [Figure: composite region with nested regions handling the rejoins]

  9. Predication Approaches
  • Full predication of the entire region
    • Penalizes short paths
  [Figure: fully predicated region; every path pays for the longest path]

  10. On-Demand Predication
  • Predicate (and speculate) as needed
    • Reduce critical path(s)
    • Fully utilize issue bandwidth
  • Retain control flow to accommodate unbalanced paths
  [Figure: region with predication applied only where it shortens the schedule]

  11. Predicate Analysis
  • The instruction scheduler requires knowledge of predicate relationships
    • For dependence analysis
    • For code motion
    • …
  • Predicate Query System
    • Graph representation of predicate relationships
    • Superset, subset, disjoint, …
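A predicate query system of this flavor can be sketched by modeling each predicate as the set of control-flow paths on which it holds. This encoding is a toy chosen for clarity (real implementations maintain a partition graph over predicate registers), and all names are invented:

```python
class PredicateQuery:
    """Answer subset/disjoint queries over predicates, where each
    predicate is represented by the set of paths on which it is true."""

    def __init__(self, paths):
        # paths: predicate name -> iterable of path ids
        self.paths = {name: frozenset(ps) for name, ps in paths.items()}

    def disjoint(self, a, b):
        """True if a and b can never be true together."""
        return not (self.paths[a] & self.paths[b])

    def subset(self, a, b):
        """True if a true implies b true (b is a superset of a)."""
        return self.paths[a] <= self.paths[b]
```

The scheduler uses exactly these queries: a store guarded by p1 and a load guarded by a disjoint p2 cannot conflict, so no dependence edge is needed between them.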

  12. Predicate Computation
  • Compute all predicates possibly needed
  • Optimize
    • to share predicates where possible
    • to utilize parallel compares
    • to fully utilize dual-target compares

  13. Predication and Branch Counts
  • Predication reduces branches
    • at both moderate and aggressive optimization levels

  14. Predication & Branch Prediction
  • Comparable misprediction rate with predication
    • despite significantly fewer branches
    • increased mean time between mispredicted branches

  15. Register Allocation
  • Modeled as a graph-coloring problem
    • Nodes in the graph represent live ranges of variables
    • Edges represent a temporal overlap of the live ranges
    • Nodes sharing an edge must be assigned different colors (registers)
  • Example: x = ...; y = ...; ... = x; z = ...; ... = y; ... = z
    • x overlaps y, and y overlaps z, but x and z do not overlap
    • Requires two colors
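The graph-coloring model can be illustrated with a simple greedy coloring of the interference graph. Production allocators use Chaitin/Briggs-style simplification and spilling; this sketch exists only to make the model on the slide concrete:

```python
def color_interference_graph(nodes, edges):
    """Greedy coloring: visit nodes in order and give each the lowest
    color not used by an already-colored interfering neighbor.
    Returns node -> color (colors stand in for registers)."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    color = {}
    for n in nodes:
        used = {color[m] for m in neighbors[n] if m in color}
        c = 0
        while c in used:      # lowest color free among neighbors
            c += 1
        color[n] = c
    return color
```

On the slide's example the edges are x–y and y–z (x and z never overlap), so two colors suffice: x and z share a register.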

  16. Register Allocation With Control Flow
  • The same variables split across a branch: each path defines and uses its own subset of x, y, and z
  • Live ranges on mutually exclusive paths do not overlap in time
  • The interference graph still requires two colors
  [Figure: the live ranges distributed across the two branch paths]

  17. Register Allocation With Predication
  • After predication the code is a single linear stream: x = ...; y = ...; z = ...; ... = y; x = ...; ... = z; ... = x
  • Viewed linearly, the live ranges of x, y, and z all appear to overlap
  • Now requires three colors

  18. Predicate Analysis
  • Predicate analysis recovers the control-flow relationships in the predicated stream
  • p1 and p2 are disjoint: if p1 is true, p2 is false, and vice versa

  19. Register Allocation With Predicate Analysis
  • Instructions guarded by disjoint predicates can never both execute, so their live ranges do not interfere
  • Now back to two colors
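The predicate-aware step in slides 17–19 can be sketched as interference-graph construction that drops any edge whose two accesses are guarded by provably disjoint predicates. Encodings and names here are assumptions made for illustration:

```python
def build_interference(overlaps, disjoint_pairs):
    """Build interference edges from linear live-range overlaps.

    overlaps: list of (a, b, pred_a, pred_b), meaning the live ranges
              of a and b overlap textually, under those guarding
              predicates ("p0" = always executed).
    disjoint_pairs: set of frozensets of predicates known disjoint
              (e.g. from a predicate query system).
    """
    edges = []
    for a, b, pa, pb in overlaps:
        if frozenset((pa, pb)) in disjoint_pairs:
            continue          # never live at once: may share a register
        edges.append((a, b))
    return edges
```

With p1 and p2 disjoint, the apparent y–z overlap is discarded, the graph loses an edge, and the allocation drops from three colors back to two, mirroring the slides.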

  20. Effect of Predicate-Aware Register Allocation
  • Reduces register requirements for individual procedures by 0% to 75%
    • Depends upon how aggressively predication is applied
  • Average dynamic reduction in register stack allocation for gcc is 4.7%

  21. Object-Oriented Code
  • Challenges
    • Small procedures, many of them indirect (virtual)
      • Limits size of regions and scope for ILP
    • Exception handling
    • Bounds checking (Java)
      • Inherently serial: must check before executing a load or store
  • Solutions
    • Inlining
      • for non-virtual functions or provably unique virtual functions
      • Speculative inlining for the most common variant
    • Dynamic optimization (e.g., Java)
      • Make use of the dynamic profile
    • Speculative execution
      • Guarantees correct exception behavior
      • Liveness analysis of handlers
      • Architectural support for speculation ensures recoverability

  22. Method Calls
  • A barrier between execution streams
  • Often, the location of the called method must be determined at runtime
  • The costly "identity check" on the object must complete before the method may begin
    • Even if the call nearly always goes to the same place
  • Little ILP
  [Figure: the target method is resolved among several possible targets before any call-dependent code can run]

  23. Speculating Across Method Calls
  • Compiler predicts the target method
    • Profiling
    • Current state of the class hierarchy
  • The predicted method is inlined
    • Full or partial
  • Speculative execution of the called method begins while the actual target is determined
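At the source level, speculative inlining of the dominant target amounts to guarded devirtualization: inline the predicted method behind a cheap identity check, and fall back to full virtual dispatch otherwise. A hand-written Python analogue of what the compiler would generate; the class names are made up:

```python
class Shape:
    def area(self):
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, r):
        self.r = r
    def area(self):
        return 3.14159 * self.r * self.r

class Square(Shape):
    def __init__(self, s):
        self.s = s
    def area(self):
        return self.s * self.s

def area_speculative(obj):
    """Profile data (hypothetically) says most receivers are Circles,
    so the compiler inlines Circle.area behind an identity check."""
    if type(obj) is Circle:                # cheap check on the predicted type
        return 3.14159 * obj.r * obj.r     # inlined body of Circle.area
    return obj.area()                      # fallback: full virtual dispatch
```

On IA-64 the inlined body can additionally start executing speculatively while the identity check is still in flight, which is the point of the next slide.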

  24. Speculation Across Method Calls
  [Figure: the dominant called method is inlined and executes speculatively while the target method is resolved; other target methods are called only if needed]

  25. Bounds & Null Checks
  • Checks inhibit code motion
  • Null checks: x = y.foo; becomes
      if( y == null ) throw NullPointerException;
      x = y.foo;
  • Bounds checks: x = a[i]; becomes
      if( a == null ) throw NullPointerException;
      if( i < 0 || i >= a.length ) throw ArrayIndexOutOfBoundsException;
      x = a[i];

  26. Speculating Across Bounds Checks
  • Bounds checks rarely fail
  • x = a[i]; becomes
      ld.s t = a[i];
      if( a == null ) throw NullPointerException;
      if( i < 0 || i >= a.length ) throw ArrayIndexOutOfBoundsException;
      chk.s t
      x = t;
  • The long-latency load can begin before the checks
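The ld.s/chk.s pattern can be mimicked in a high-level sketch: a speculative load returns a deferred-exception token instead of faulting, and the later check triggers recovery only if that token is seen. This is an analogy to IA-64's NaT-based control speculation, not its full semantics:

```python
NAT = object()   # stands in for IA-64's deferred-exception token

def ld_s(memory, addr):
    """Speculative load (ld.s): on a bad access, defer the fault by
    returning the NAT token instead of raising."""
    try:
        return memory[addr]
    except (KeyError, TypeError):
        return NAT

def chk_s(value, recovery):
    """Speculation check (chk.s): if the speculative load faulted,
    run the recovery code; otherwise the value is safe to use."""
    if value is NAT:
        return recovery()
    return value
```

The load is hoisted above the null and bounds checks; if the checks would have thrown, the chk.s-analogue reaches recovery and the original exception is raised at its architecturally correct point.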

  27. Exception Handling
  • Exception handling inhibits motion of subsequent code
      if( y.foo ) throw MyException;
      x = y.bar + z.baz;

  28. Speculation in the Presence of Exception Handling
  • Execution of subsequent instructions may begin before the exception is resolved
  • The example above becomes
      ld t1 = y.foo
      ld.s t2 = y.bar
      ld.s t3 = z.baz
      add x = t2 + t3
      if( t1 ) throw MyException;
      chk.s x

  29. Dependence Graph for Instruction Scheduling
  • Source:
      if( n < p->count ) { (*log)++; return p->x[n]; } else { return 0; }
  • Instructions:
      add t1 = 8,p
      ld4 count = [t1]
      cmp4.ge p1,p2 = n,count
      (p1) ld4 t3 = [log]
      (p1) add t2 = 1,t3
      (p1) st4 [log] = t2
      mov out0 = 0
      (p1) ld4 t3 = [p]
      shladd t4 = n,4,t3
      (p1) ld4 out0 = [t4]
      br.ret rp

  30. Dependence Graph with Predication & Speculation
  • During dependence-graph construction, potentially control- and data-speculative edges and nodes are identified
  • Check nodes are added where possibly needed (only data-speculation checks are shown here)
  • Instructions:
      add t1 = 8,p
      ld4 count = [t1]
      cmp4.ge p1,p2 = n,count
      (p1) ld4 t3 = [log]
      (p1) add t2 = 1,t3
      (p1) st4 [log] = t2
      mov out0 = 0
      (p1) ld4 t3 = [p]
      chk.a p
      shladd t4 = n,4,t3
      chk.a t4
      (p1) ld4 out0 = [t4]
      br.ret rp

  31. Dependence Graph with Predication & Speculation (continued)
  • Speculative edges may be violated; here the graph is redrawn to show the enhanced parallelism
  • Note that speculating both writes to the out0 register would require insertion of a copy; the scheduler must consider this in its scheduling
  • Nodes with sufficient slack (e.g., the writes to out0) will not be speculated
  • Instructions, reordered:
      (p1) ld4 t3 = [p]
      (p1) ld4 t3 = [log]
      add t1 = 8,p
      (p2) mov out0 = 0
      shladd t4 = n,4,t3
      (p1) add t2 = 1,t3
      ld4 count = [t1]
      (p1) ld4 out0 = [t4]
      cmp4.ge p1,p2 = n,count
      (p1) st4 [log] = t2
      chk.a p
      chk.a t4
      br.ret rp
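The scheduling over a dependence graph shown in slides 29–31 can be illustrated with a minimal cycle-by-cycle list scheduler. Issue width, latencies, and the tiny graph are hypothetical; real schedulers also weigh slack, speculation costs, and resource classes:

```python
def list_schedule(deps, latency, width):
    """Cycle-by-cycle list scheduling.

    deps:    instr -> set of predecessor instrs (dependence edges).
    latency: instr -> cycles until its result is available.
    width:   instructions issued per cycle (issue bandwidth).
    Returns instr -> issue cycle.
    """
    finish, issued = {}, {}
    remaining = set(deps)
    cycle = 0
    while remaining:
        # ready = all predecessors have finished by this cycle
        ready = [i for i in remaining
                 if all(p in finish and finish[p] <= cycle
                        for p in deps[i])]
        for i in sorted(ready)[:width]:   # fill the issue slots
            issued[i] = cycle
            finish[i] = cycle + latency[i]
            remaining.remove(i)
        cycle += 1
    return issued
```

Speculation's payoff in the slides is exactly to enlarge the ready set early: moving loads above checks and branches gives the scheduler more candidates per cycle, so the issue bandwidth is filled instead of idling behind long-latency operations.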

  32. Conclusions
  • IA-64 compilers push the complexity of the compiler
  • However, the technology is a logical progression from today's
  • Today's RISC compilers
    • are more complex
    • are more reliable
    • and deliver more performance than those of the early days
  • The complexity trend is mirrored in both hardware and applications
  • Need a balance to maximize benefits from each
