Code Size Efficiency in Global Scheduling for ILP Processors


  1. Code Size Efficiency in Global Scheduling for ILP Processors Huiyang Zhou, Tom Conte TINKER Research Group Department of Electrical & Computer Engineering North Carolina State University

  2. Outline
     • Introduction
     • Quantitative measure of code size efficiency
     • Best code size efficiency for a given code size limit
     • Optimal code size efficiency for a program
     • Summary
     • Future work

  3. Introduction
     • Instruction-level parallelism (ILP) vs. static code size
       • Region-enlarging optimizations usually enhance ILP
         • Cyclic scheduling: loop unrolling, loop peeling, etc.
         • Acyclic scheduling: tail duplication, recovery code, etc.
     • I-cache and ITLB performance vs. static code size
       • Larger code usually means a larger I-cache footprint
     • Trade-off between the conflicting effects of code size increase
       • Especially in acyclic global scheduling

  4. Background of Treegion Scheduling
     [Figure: example CFG with basic blocks BB1–BB6 partitioned into Tree1 and Tree2]
     • Treegion scheduling
       • An acyclic scheduling technique
       • Two phases
         • Treegion formation
         • Treegion-based instruction scheduling: Tree Traversal Scheduling (TTS) (HPCA-4, LCPC’01)
     • Treegion
       • Basic scheduling unit
       • A single-entry / multiple-exit nonlinear region whose CFG forms a tree (i.e., no merge points and no back edges in a treegion)

  5. Background of Treegion Scheduling
     [Figure: treegion examples — Tree1 and Tree2 from the original CFG, and an enlarged Tree1’ formed by tail duplication (BB4’, BB5’, BB6’)]
     • Treegion examples
     • Natural treegion: treegions formed without tail duplication (i.e., no code size increase during natural treegion formation)

  6. Code Size Effects in Treegion Scheduling
     • Tail duplication increases code size
     • General operation combining reduces code size
     [Figure: operation-combining example across tail-duplicated blocks, involving the operations R1=R3+R4 and R9=R1*4 and their combined form R7=R3+R4; R9=R7*4]

  7. Quantitative Measure of Code Size Efficiency
     • ILP vs. static code size
     • Havanki’s heuristic: a treegion formation heuristic proposed earlier [HPCA-4]

  8. Code Size Efficiency for Any Code-Size-Related Optimizations
     • Use the ratio of IPC change over code size change as an indication of code size efficiency (the two measures are reconstructed below)
     • Average code size efficiency
     • Instantaneous code size efficiency
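The formulas that accompany these two bullets are dropped from this transcript; a plausible reconstruction, with S denoting static code size and the natural-treegion code as the baseline (the symbol names are assumed, not taken from the slide):

```latex
% Average code size efficiency over a whole sequence of transformations,
% measured against the natural-treegion baseline.
\eta_{\mathrm{avg}}
  = \frac{\Delta \mathrm{IPC}_{\mathrm{static}}}{\Delta S}
  = \frac{\mathrm{IPC}_{\mathrm{static}} - \mathrm{IPC}_{\mathrm{static}}^{\mathrm{natural}}}
         {S - S_{\mathrm{natural}}}
\qquad
% Instantaneous code size efficiency of one candidate transformation:
% the local slope of the static-IPC vs. code-size curve.
\eta_{\mathrm{inst}}
  = \frac{\partial\,\mathrm{IPC}_{\mathrm{static}}}{\partial S}
```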

  9. Average and Instantaneous Code Size Efficiency
     [Figure: static IPC vs. code size curve with points A0–A4 illustrating average and instantaneous code size efficiency]

  10. Estimate Static IPC Before Scheduling
      • Use the expected execution time to calculate the static IPC of a multi-path region (formula reconstructed below)
      • IPC changes can then be calculated as the execution time saved by the optimization
      [Example figure: expected execution times of Tree1, Tree2, and the tail-duplicated Tree1’]
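The multi-path formula itself is not preserved in the transcript; a plausible reconstruction, where p_i is the profile probability of leaving the region along exit path i, t_i is that path’s schedule length, and N_ops is the region’s operation count (symbol names are assumptions):

```latex
T_{\mathrm{expected}} \;=\; \sum_{i \,\in\, \mathrm{exit\ paths}} p_i \, t_i ,
\qquad
\mathrm{IPC}_{\mathrm{static}} \;=\; \frac{N_{\mathrm{ops}}}{T_{\mathrm{expected}}}
```

Under this model, the IPC gain of a transformation is driven by the reduction in T_expected, which is what the second bullet uses.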

  11. Optimal Code Size Efficiency for a Given Code Size Limit
      • Fixed code size limit: try to maximize the static IPC, i.e., maximize the average code size efficiency
      [Figure: static IPC vs. code size, showing the natural-treegion point and the code size limit]

  12. Optimal Tail Duplication Under Code Size Constraint
      [Figure: IPC vs. relative code size, with the code size limit marked]
      1. Calculate the instantaneous code size efficiency for all possible tail duplication candidates in the program scope.
      2. Find the candidate with the best code size efficiency.
      3. If the selected candidate satisfies the code size constraint, perform the tail duplication and update the code size efficiencies of the candidates affected by the duplication.
      4. Repeat steps 2–3 until the code size limit is reached (a code sketch of this loop follows).
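A minimal sketch of this greedy loop, assuming a hypothetical Candidate record and stubbed-out helpers (performTailDuplication, updateAffectedCandidates) standing in for the compiler’s real treegion data structures; it illustrates the steps above, not the authors’ implementation:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical summary of one tail-duplication opportunity.
struct Candidate {
    double ipcGain;   // estimated static-IPC gain if this tail is duplicated
    int    sizeCost;  // operations added by the duplication
    double efficiency() const { return ipcGain / sizeCost; }
};

// Stubs standing in for the real CFG transformation and re-estimation.
void performTailDuplication(const Candidate&) { /* apply the duplication (omitted) */ }
void updateAffectedCandidates(std::vector<Candidate>&) { /* re-estimate gains (omitted) */ }

// Greedy tail duplication under a code-size budget (steps 1-4 above).
void duplicateUnderBudget(std::vector<Candidate> candidates, int sizeBudget) {
    while (!candidates.empty()) {
        // Step 2: pick the candidate with the best instantaneous efficiency.
        auto best = std::max_element(candidates.begin(), candidates.end(),
            [](const Candidate& a, const Candidate& b) {
                return a.efficiency() < b.efficiency();
            });
        // Step 3: duplicate only if it still fits under the remaining budget.
        if (best->sizeCost > sizeBudget) break;
        sizeBudget -= best->sizeCost;
        performTailDuplication(*best);
        candidates.erase(best);
        updateAffectedCandidates(candidates);  // efficiencies of nearby candidates may change
    }
}
```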

  13. Processor Model
      Execution:        Dispatch/issue/retire bandwidth: 8; universal function units: 8; operation latencies: ALU, ST, BR: 1 cycle; LD and floating-point (FP) add/subtract: 2 cycles
      I-cache:          Compressed (zero-nop), two banks, each 2-way 16 KB; line size: 16 operations of 4 bytes each; miss latency: 12 cycles
      D-cache:          Size/associativity/replacement: 64 KB / 4-way / LRU; line size: 32 bytes; miss penalty: 14 cycles
      Branch predictor: G-share-style multiway branch prediction [20]; branch prediction table: 2^14 entries; branch target buffer: 2^14 entries / 8-way / LRU; branch misprediction penalty: 10 cycles

  14. Results: ILP vs. Code Size
      [Chart: static IPC results; series labeled 0%, 2%, 5%, 30%, and 80%]

  15. Results: ILP vs. Code Size (cont.)
      [Chart: series labeled 0%, 2%, 5%, 30%, and 80%]
      • Reason: only a very small part of the program is frequently executed.

  16. Optimal Code Size Efficiency
      [Figure: IPC vs. relative code size curve, with points A and A’ and a line l]
      • Definition: the point where the ‘diminishing returns’ start
      • Finding the optimal code size efficiency

  17. Finding the Optimal Code Size Efficiency
      [Figure: slope K of the IPC vs. relative code size curve at A or A’, with values K1 and K2 marked]
      • K is the slope of line l
      • A threshold on the first derivative of the IPC vs. code size curve is simply a threshold on the instantaneous code size efficiency

  18. Finding the Optimal Code Size Efficiency (cont.)
      • Meaning of K1 and K2
        • K1 and K2 are the slopes of the lines l1 and l2.
        • The range (K1 – K2) determines the robustness of the threshold scheme.
        • Point B corresponds to a threshold of K1
        • Point C corresponds to a threshold of K2
      [Figure: IPC vs. relative code size curve with points A, B, C and lines l1, l2]

  19. Algorithm for Finding the Optimal Code Size Efficiency
      1. Set the threshold K anywhere between tan(π/12) and tan(π/6) (numeric band shown below).
      2. Calculate the instantaneous code size efficiency for all possible tail duplication candidates in the program scope.
      3. If there is a candidate whose instantaneous code size efficiency is above the threshold, duplicate it and update the efficiencies of the affected candidates; repeat until no such candidates remain.
      • When the expected execution time is used, the threshold scheme becomes … (derivation details in ref [21])
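Written as the stopping rule the previous slides imply (my paraphrase), with the numeric values of the suggested band (my arithmetic, not from the slide): keep duplicating while the instantaneous efficiency stays at or above the chosen threshold K.

```latex
\text{duplicate while}\quad
\eta_{\mathrm{inst}} \;=\; \frac{d\,\mathrm{IPC}_{\mathrm{static}}}{d\,S_{\mathrm{rel}}} \;\ge\; K,
\qquad
\tan(\pi/12) \approx 0.27 \;\le\; K \;\le\; \tan(\pi/6) \approx 0.58
```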

  20. Results for Optimal Code Size Efficiency
      • Varying the threshold from tan(π/12) to tan(π/6), the threshold scheme finds the optimal efficiency accurately.
      • m88ksim is used as an example.
      [Chart: series labeled 0%, 2%, 5%, 10%, and 20%]

  21. I-Cache Impacts of the Code Size Increase
      [Chart: code size impacts and locality impacts (ref [3])]

  22. I-Cache Impacts of the Code Size Increase (cont.)
      [Chart: the denser schedule of the optimal-efficiency results]

  23. I-Cache Impacts of the Code Size Increase (cont.)
      [Chart: the combined impact]

  24. Processor Performance
      • On average, a significant speedup in dynamic IPC (17% over natural treegion) at the cost of a 2% code size increase.

  25. Conclusions
      • Quantitative measure of code size efficiency: the ratio of IPC change over code size increase
      • Best code size efficiency for a given code size limit
        • Results: significant but varying impact on IPC
      • Optimal efficiency: a simple yet robust threshold scheme to find the ‘knee’ of the curve
        • Results: improved I-cache performance (4%), significant speedup (17%), moderate static code size increase (2%)
      • Future work
        • Combine with other optimizations, e.g., loop unrolling.

  26. Contact Information
      Huiyang Zhou  hzhou@eos.ncsu.edu
      Tom Conte  conte@eos.ncsu.edu
      TINKER Research Group, North Carolina State University
      www.tinker.ncsu.edu
