
Cases 2007 Florida State University Chris Zimmer , Steve Hines, Prasad Kulkarni

Facilitating Compiler Optimizations Through the Dynamic Mapping of Alternate Register Structures. CASES 2007, Florida State University. Chris Zimmer, Steve Hines, Prasad Kulkarni, Gary Tyson, David Whalley.



Presentation Transcript


  1. Facilitating Compiler Optimizations Through the Dynamic Mapping of Alternate Register Structures
     CASES 2007
     Florida State University
     Chris Zimmer, Steve Hines, Prasad Kulkarni, Gary Tyson, David Whalley

  2. Motivation
     • Embedded processors have fewer registers.
     • Compiler optimizations increase register pressure.
     • Aggressive compiler optimizations are therefore difficult to apply on embedded systems.

  3. Vector Multiply Example
     • Even before aggressive optimizations, 60% of available registers are already used.
     • Further optimizations like loop unrolling and software pipelining are inhibited.

     int A[1000], B[1000];
     void vmul() {
         int I;
         for (I=2; I < 1000; I++)
             B[I] = A[I] * B[I-2];
     }

     .L3:
         ldr r1,[r2,r3, lsl #2]
         ldr r12,[r4], #4
         mul r0,r12,r1
         str r0,[r5,r3, lsl #2]
         add r3,r3,#1
         cmp r3, #1000
         blt .L3

  4. Application Configurable Processors
     • Exploit common reference patterns found in code.
     • Small register files mimic these reference behaviors.
     • A map table provides register redirection.
     • The architecture is changed to add more registers with minimal impact on the ISA; in particular, operand size does not increase.

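  5. Architectural Modifications
     [Diagram: register file, map table, and customized register structures]
     Map table: R0 -> R0, R1 -> Q1, R6 -> R6, R15 -> R15
     Customized structures: Queue Q1, Queue Q2, Queue Q3, Stack Q4, Circular Buffer Q5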

  6. Software Pipelining
     • Software pipelining is not often found in embedded compilers.
     • Software pipelining reduces the overall cycle time of a loop.
     • Extracts iterations
     • Consumes stalls
     • Consumes registers!!

  7. Software Pipelining Example

     int A[1000], B[1000];
     void vmul() {
         int I;
         for (I=2; I < 1000; I++)
             B[I] = A[I] * C[I];
     }

     .L3:
         ldr r1,[r2,r3, lsl #2]
         ldr r12,[r4], #4
         mul r0,r12,r1
         str r0,[r5,r3, lsl #2]
         add r3,r3,#1
         cmp r3, #1000
         blt .L3

     Stalls present when the loop is run:

     .L3:
         ldr r1,[r2,r3, lsl #2]
         ldr r12,[r4], #4
         stall
         stall
         stall
         mul r0,r12,r1
         stall
         stall
         stall
         str r0,[r5,r3, lsl #2]
         add r3,r3,#1
         cmp r3, #1000
         blt .L3

  8. Instruction
     • Goal: Minimal modification to the existing instruction set.
     • Single-cycle instruction latency.
     • Method: Add a single instruction to the ISA that is used to map and unmap a common register specifier into a customized register structure.

     qmap <Reg Specifier><Custom reg map information><Custom reg specifier>
     qmap r3,#4,q3

  9. Architectural Modifications
     [Diagram: register file, map table, and customized register structures]
     Map table: R0 -> R0, R1 -> Q1, R6 -> R6, R15 -> R15
     Customized structures: Queue Q1, Queue Q2, Queue Q3, Destructive Queue Q4, Circular Buffer Q5
     An access to R0, which has no mapping in the table, gets its data from the register file. R1 is mapped into Q1 and retrieves its data from there.
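
The redirection described on this slide can be sketched in C as a small simulator. This is only an illustration under stated assumptions, not the authors' hardware design: the names `Cpu`, `Fifo`, `reg_read`, `reg_write`, and the capacity `Q_DEPTH` are hypothetical, every custom structure is modeled as a FIFO whose reads remove the oldest value, `qmap` is assumed to toggle a mapping (the slides only say one instruction both maps and unmaps), and the "custom reg map information" operand is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS   16    /* architected registers r0..r15 */
#define NUM_QUEUES 5     /* custom structures, index 0 stands for q1 */
#define Q_DEPTH    8     /* hypothetical capacity, not given in the slides */
#define UNMAPPED   (-1)

/* Hypothetical FIFO standing in for one customized register structure. */
typedef struct {
    int32_t data[Q_DEPTH];
    int head, count;
} Fifo;

typedef struct {
    int32_t regs[NUM_REGS];      /* architected register file */
    int     map[NUM_REGS];       /* map table: UNMAPPED or a structure index */
    Fifo    queues[NUM_QUEUES];
} Cpu;

void cpu_init(Cpu *cpu) {
    for (int r = 0; r < NUM_REGS; r++) { cpu->regs[r] = 0; cpu->map[r] = UNMAPPED; }
    for (int q = 0; q < NUM_QUEUES; q++) { cpu->queues[q].head = 0; cpu->queues[q].count = 0; }
}

/* Assumed semantics: mapping a specifier that is already mapped to the
 * same structure removes the mapping again. */
void qmap(Cpu *cpu, int reg, int queue) {
    assert(reg >= 0 && reg < NUM_REGS && queue >= 0 && queue < NUM_QUEUES);
    cpu->map[reg] = (cpu->map[reg] == queue) ? UNMAPPED : queue;
}

/* Every write consults the map table first. */
void reg_write(Cpu *cpu, int reg, int32_t value) {
    int q = cpu->map[reg];
    if (q == UNMAPPED) { cpu->regs[reg] = value; return; }
    Fifo *f = &cpu->queues[q];
    assert(f->count < Q_DEPTH);
    f->data[(f->head + f->count) % Q_DEPTH] = value;   /* enqueue at the tail */
    f->count++;
}

/* Every read consults the map table first. */
int32_t reg_read(Cpu *cpu, int reg) {
    int q = cpu->map[reg];
    if (q == UNMAPPED) return cpu->regs[reg];
    Fifo *f = &cpu->queues[q];
    assert(f->count > 0);
    int32_t v = f->data[f->head];                       /* dequeue from the head */
    f->head = (f->head + 1) % Q_DEPTH;
    f->count--;
    return v;
}
```

With r1 mapped, two writes to r1 followed by two reads return the values in the order written, while r0 keeps behaving like an ordinary register.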

  10. Software Pipelining Example

      int A[1000], B[1000];
      void vmul() {
          int I;
          for (I=2; I < 1000; I++)
              B[I] = A[I] * C[I];
      }

      qmap r1,#2,q1
      qmap r12,#2,q2
      qmap r0,#3,q3
      Prolog: 6 loads and 2 multiplies
      Loop:
          ldr r1,[r2,r3, lsl #2]
          ldr r12,[r4], #4
          mul r0,r12,r1
          str r0,[r5,r3, lsl #2]
          add r3,r3,#1
          cmp r3, #1000
          blt .L3
      Epilog: 1 multiply and 3 stores

      Queue contents:
      Q1: 15 30 5 25
      Q2: 4 1 3 2
      Q3: 30 5 75 10
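
The prolog, loop, and epilog counts on this slide can be reproduced with a C sketch in which three software queues stand in for q1, q2, and q3. The `Queue` type, `QCAP`, and the function names are hypothetical, and reading the `#2` and `#3` operands as the number of values in flight is an assumption that happens to match the schedule below. The sketch also assumes `b` does not overlap `a` or `c`.

```c
#include <assert.h>

#define QCAP 4   /* hypothetical capacity; at most 3 values are ever in flight */

typedef struct { int data[QCAP]; int head, count; } Queue;

static void q_put(Queue *q, int v) {
    assert(q->count < QCAP);
    q->data[(q->head + q->count) % QCAP] = v;
    q->count++;
}

static int q_get(Queue *q) {
    assert(q->count > 0);
    int v = q->data[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return v;
}

/* Reference loop: b[i] = a[i] * c[i] for lo <= i < hi. */
void vmul_plain(const int *a, const int *c, int *b, int lo, int hi) {
    for (int i = lo; i < hi; i++) b[i] = a[i] * c[i];
}

/* Software-pipelined version: loads run three iterations ahead of the
 * stores and multiplies run two ahead. Requires hi - lo >= 3. */
void vmul_pipelined(const int *a, const int *c, int *b, int lo, int hi) {
    assert(hi - lo >= 3);
    Queue q1 = {{0}, 0, 0}, q2 = {{0}, 0, 0}, q3 = {{0}, 0, 0};
    int x, y;

    /* Prolog: 6 loads and 2 multiplies. */
    for (int i = lo; i < lo + 3; i++) { q_put(&q1, a[i]); q_put(&q2, c[i]); }
    for (int k = 0; k < 2; k++) { x = q_get(&q1); y = q_get(&q2); q_put(&q3, x * y); }

    /* Loop: load iteration i+3, multiply iteration i+2, store iteration i. */
    int i = lo;
    for (; i + 3 < hi; i++) {
        q_put(&q1, a[i + 3]);                            /* ldr r1  */
        q_put(&q2, c[i + 3]);                            /* ldr r12 */
        x = q_get(&q1); y = q_get(&q2); q_put(&q3, x * y);  /* mul r0  */
        b[i] = q_get(&q3);                               /* str r0  */
    }

    /* Epilog: 1 multiply and 3 stores. */
    x = q_get(&q1); y = q_get(&q2); q_put(&q3, x * y);
    for (; i < hi; i++) b[i] = q_get(&q3);
}
```

Inside the loop, q1 and q2 hold two values each just before the multiply and q3 holds three just before the store, so no extra architected registers are needed to keep the in-flight values apart.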

  11. Register Usage

  12. Results – Varying multiply latency, load latency fixed at four

  13. Results – Varying load latency, multiply latency fixed at four

  14. Conclusions
      • Customized register structures reduce register pressure.
      • Software pipelining is viable in resource-constrained environments.
      • Performance can be improved with minor impact to the ISA.

  15. Extras

  16. Reference Behaviors
      Stack Reference Behavior

      ldr r1,[r6,r4, lsl #4]
      ldr r12,[r6,r4, lsl #8]
      ldr r8,[r6,r4, lsl #12]
      str r8,[r3,r4, lsl #16]
      str r12,[r3,r4, lsl #20]
      str r1,[r3,r4, lsl #24]

  17. Application Configurable Architecture
      • Application configurable processors are designed using a mapping table similar to the register rename table found in many out-of-order implementations.
      • The map table is read during every access to the architected register file.
      • The lookup determines whether a register specifier refers to the original architected register file or to a customized register structure.

  18. Application Configurable Architecture
      • The customized register files are small, but they efficiently manage values that would otherwise require many architected registers.
      • The customized register files can mimic queues, stacks, and circular buffers.
      • These structures are accessed using the same register specifier that is used to access the architected register file.

  19. Remove Reference Behaviors
      Stack Reference Behavior
      [Diagram: stack r1 holding R8, R12, R1]

      Original:
      ldr r1,[r6,r4, lsl #4]
      ldr r12,[r6,r4, lsl #8]
      ldr r8,[r6,r4, lsl #12]
      str r8,[r3,r4, lsl #16]
      str r12,[r3,r4, lsl #20]
      str r1,[r3,r4, lsl #24]

      With r1 mapped to a stack:
      ldr r1,[r6,r4, lsl #4]
      ldr r1,[r6,r4, lsl #8]
      ldr r1,[r6,r4, lsl #12]
      str r1,[r3,r4, lsl #16]
      str r1,[r3,r4, lsl #20]
      str r1,[r3,r4, lsl #24]

      Free up r8 and r12 for use.

  20. Remove Qmap Instruction
      [Diagram: stack q0 holding R8, R12, R1]

      qmap r1,#0,q0
      ldr r1,[r6,r4, lsl #4]
      ldr r1,[r6,r4, lsl #8]
      ldr r1,[r6,r4, lsl #12]
      str r1,[r3,r4, lsl #16]
      str r1,[r3,r4, lsl #20]
      str r1,[r3,r4, lsl #24]
      qmap r1,#0,q0

      Free up r8 and r12 for use.
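
The stack reference behavior on this slide (three loads through one specifier, then three stores that consume the values in reverse order) can be sketched in C as follows. `RegStack`, `STACK_CAP`, and `copy_reversed3` are hypothetical names, and the push-on-write, pop-on-read semantics are an assumption about how a stack structure behaves when mapped.

```c
#include <assert.h>

#define STACK_CAP 8   /* hypothetical capacity */

/* Hypothetical stack structure reached through a single register specifier. */
typedef struct { int data[STACK_CAP]; int top; } RegStack;

/* A write to the mapped specifier pushes a value. */
void stack_write(RegStack *s, int v) {
    assert(s->top < STACK_CAP);
    s->data[s->top++] = v;
}

/* A read from the mapped specifier pops the most recent value. */
int stack_read(RegStack *s) {
    assert(s->top > 0);
    return s->data[--s->top];
}

/* Mirrors the slide: three loads into r1, then three stores from r1.
 * The value loaded last is stored first, as with r8, r12, r1 originally. */
void copy_reversed3(const int *src, int *dst) {
    RegStack r1 = {{0}, 0};
    for (int i = 0; i < 3; i++) stack_write(&r1, src[i]);   /* ldr r1, ... */
    for (int i = 0; i < 3; i++) dst[i] = stack_read(&r1);   /* str r1, ... */
}
```

Because one specifier carries all three values, the two architected registers that previously held the second and third loads are left free.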

  21. Modulo Scheduling
      • For our work we used modulo scheduling. This requires using the dependences and latencies of the loop instructions to generate a modulo scheduled loop.
      • The prolog and epilog are then built based on this schedule.
      • The prolog and epilog require register renaming of loop-carried dependencies to ensure a correct loop.
      • Renaming in embedded processors is often not possible.

  22. Register Renaming Due to Software Pipelining
      • Renaming doesn’t work: not enough registers.
      • Rotating registers would require a significant rewrite of the embedded ISA.
      • The loop-carried values can simply be mapped into a register queue to hold the value across several iterations.

  23. Results – Register Savings
      • As instruction latency grows, more iterations of the loop are extracted to spread out the latency.
      • The extra registers that would be required to perform renaming have been measured at 25% to 200% of the available registers on the ARM.
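
Why the renaming cost grows with latency can be illustrated with a small C helper. It uses the usual bound from modulo variable expansion, which is not stated in the slides: a value that stays live for `lifetime` cycles in a loop that starts a new iteration every `ii` cycles needs about ceil(lifetime / ii) distinct registers. The function name and both parameters are hypothetical.

```c
#include <assert.h>

/* Number of simultaneously live copies of one value when a new iteration
 * starts every `ii` cycles and the value is live for `lifetime` cycles.
 * Computes ceil(lifetime / ii) with integer arithmetic. */
int copies_needed(int lifetime, int ii) {
    assert(lifetime > 0 && ii > 0);
    return (lifetime + ii - 1) / ii;
}
```

Under this bound, doubling a value's lifetime at a fixed initiation interval roughly doubles the registers renaming would need, whereas a mapped register queue absorbs the extra copies behind a single specifier.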
