1 / 14

Development of an efficient DSP Compiler based on Open64

Development of an efficient DSP Compiler based on Open64. Subrato K. De, Anshuman Dasgupta, Sundeep Kushwaha, Tony Linthicum, Susan Brownhill, Sergei Larin, Taylor Simpson Qualcomm Incorporated, San Diego & Austin, USA. Email:

borisj
Download Presentation

Development of an efficient DSP Compiler based on Open64

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Development of an efficient DSP Compiler based on Open64 Subrato K. De, Anshuman Dasgupta, Sundeep Kushwaha, Tony Linthicum, Susan Brownhill, Sergei Larin, Taylor Simpson Qualcomm Incorporated, San Diego & Austin, USA. Email: {sde, adasgupt, sundeepk, tlinth, yzhu, slarin, tsimpson}@qualcomm.com

  2. Introduction • DSP specific extensions to the C/C++ language and support in Open64 for efficient usage of DSP hardware features. • Enhancement of WOPT with impact on DSPs for mobile devices. • Modification of the hyperblock scheduler for DSP with limited predication. • Register pair allocation. • Results at a glance.

  3. DSP specific extensions to C/C++: use of circular load/store intrinsics int *coeffPointer; for(i=0; i < noOfInputSamples; ++i){ sum = 0; coeffPointer = &coeff[0]; for(j =0; j < 4; ++j){ sum += inputSample[i+j] * *coeffPointer++; } outputSample[i] = sum; } int *coeffPointer; int value; for(i=0; i < noOfInputSamples; ++i){ sum = 0; for(j =0; j < 4; ++j){ LOAD_CIRC_INT(value, coeffPointer, 1, 4, &coeff[0]); sum += inputSample[i+j] * value; } outputSample[i] = sum; }

  4. DSP specific extensions to C/C++: use of bit-reversed load/store intrinsics unsigned int ReverseBits (unsigned int index, unsigned int NumBits ){ unsigned int i, rev; for ( i=rev=0; i < NumBits; i++ ){ rev = (rev << 1) | (index & 1); index >>= 1; } return rev; } for ( i=0; i < NumSamples; i++ ){ j = ReverseBits ( i, NumBits ); RealOut[j] = RealIn[i]; ImagOut[j] = ImagIn[i]; } int *realOutPointer; int *imagOutPointer; for ( i=0; i < NumSamples; i++ ){ STORE_BREV_INT(RealIn[i], realOutPointer, 1, NumSamples,&RealOut[0] ). STORE_BREV_INT(ImagIn[i], imagOutPointer, 1, NumSamples, &ImagOut[0] ). }

  5. Design and implementation of circular and bit-reverse load/store intrinsics • Macros expansions. #define LOAD_CIRC_INT(v, p, s, l, a) \ ( (v) = *( p); (p) = (int *) circ_update((void *) (p), (s), (l), (void *) (a)) ) #define STORE_CIRC_INT(v, p, s, l, a) \ ( *( p) = (v); (p) = (int *) circ_update((void *) (p), (s), (l), (void *) (a)) ) • The WHIRL IR: • <result> = load_indirect(pointer); OR store_indirect(pointer) = <source>; • <pointer> = circ_update (pointer, step, CR). Where, CR is the configuration register that defines the circular/bit-reverse buffer based on buffer size and the start address. • The load/store and circular/bit-reverse update combined in the CG phase: • <result, pointer> =load_circular_update(pointer, step, CR), OR • <pointer> =store_circular_update(source, pointer, step, CR) • CRs are considered dedicated temporary names (TNs): • Algorithm for the hoisting of the loop-invariant CR assignment statements, and • Algorithm for CR allocation, when multiple target supports multiple CRs.

  6. DSP specific extensions to the C/C++ : use and optimizations through loop pragmas • Loop pragmas: • #pragma LOOP_TRIP_COUNT_MIN (min) • #pragma LOOP_TRIP_COUNT_MAX(max) • #pragma LOOP_TRIP_COUNT_MODULO(modulo) • #pragma LOOP_UNROLL(N) • loop guard condition can be removed. • loop trip count computation: • can be simplified (e.g., divide replaced by right shifts). • can be replaced with a constant count value. • alternate loop during software pipelining may not be generated. • remainder loops during unrolling (LNO & GC) may not be needed. • Unrolling: the standard benefits • Less loop overhead by reducing the branch/jump/comparison. • Scheduling can be better due to increased number of operations in the loop body.

  7. Register promotion of small structs (structures and unions <= 8 bytes that could fit in register) THE WHIRL IR USING DISTINCT AUXILIARY SYMBOLS FOR EACH FIELD IN THE SMALL STRUCT: e.g., h[1] == st 9; w[0] == st8 typedef union { long long d; int w[2]; short h[4]; char b[8]; } PAIR; PAIR var, *Pi, *Po; var.h[1] = 3; x = var.w[0]; I4INTCONST 3 (0x3) I2STID 2 <st 9> I4I4LDID 0 <st 8> I4STID 0 <st 5> I8I8LDID 0 <st 4> I4INTCONST 3 (0x3) I8CVTL 16 I8COMPOSE_BITS <bofst:16 bsize:16> I8STID 0 <st 4> I8I8LDID 0 <st 4> I8EXTRACT_BITS <bofst:0 bsize:32> I4STID 0 <st 5> EXTRACT_BITS & COMPOSE_BITS TRANSFORMATION REPLACES AUXILIARY SYMBOLS FOR EACH FIELD OF THE SMALL STRUCT MEMBER BY THE UNIQUE FULL-SIZED AUXILIARY REPRESENTING THE FULL STRUCTURE TYPE: e.g., PAIR == st 4

  8. Hyperblock Formation • Hyperblock scheduling can be important on DSPs • Many DSP architectures support predicated instructions • Especially profitable on control code • Restricted form of predication on DSP targets • Not every instruction is predicatable • Architecture decisions mainly due to resource and power savings • Default hyperblock algorithm rejected too many basic blocks due to instructions that aren’t predicatable in the DSP • Most potential hyperblocks on our target contained a few basic blocks • Small hyperblocks exacerbated the problem • Needed a more permissive version of hyperblock formation

  9. B1 B2 B3 B4 B5 Changes to Hyperblock Formation Algorithm • Allowed a basic block (e.g. B3) to contain non-predicatable instructions if it post-dominates the entry block (e.g. B1) of the hyperblock B1: Hyperblock entry blockB3:Block contains non-predicatable instructionsB2, B4, B5: large completely predicatable basic blocks. • Default hyperblock algorithm would reject blocks B1-B5 • Successfully formed hyperblock after our modifications • Modified algorithm led to more basic blocks considered and more hyperblocks formed. Details in paper.

  10. Efficient Register Pair Allocation • Many DSPs support 64-bit register by combining two adjacent 32-bit registers. • Two choices: • Use single 8-byte TN • Handle all complexities in the register allocator. • Direct mapping of data-types from whirl to CGIR. • Simple semantics – 1 result, 2 or more operands. • Use two 4-byte TNs • Need to handle complexities in GRA/LRA to assign adjacent registers. • Need to keep a mapping between the 8-byte WN to its two 4-byte TNs. • Difficult to handle ops producing 8-byte result (two 4-byte TNs). • Difficult to handle ops using 8-byte source (each input is two 4-byte TNs) • Accessing lower / higher part of register pair is trivial.

  11. Motivating example #include <stdio.h> extern long long bar(int a, long long b); extern int baz(long long); int foo(long long a, int c) { long long tmp = c + bar(1, a); int retval = (tmp>>32); if (c > 0) tmp++; return retval + baz(tmp); } • C Example • After Expansion • After EBO [ 11] TN249:8 :- asr_i_p TN242:8 (0x20) ; [ 11] TN252:4 :- tfr TN249:8 ; [ 11] TN1(r0):4 :- add TN248:4 TN252:4 ; [ 11] TN252:4 :- pseudo_pair_high GTN242:8 ; [ 11] GTN1(r0):4 :- add GTN1(r0):4<defopnd> TN252:4 ;

  12. TN252:4 :- pseudo_pair_high GTN242:8 • Pseudo pair instructions are expanded after register allocation • NOP if high of GTN242:8 is allocated the same register as TN252:4. • Copy otherwise. • We want to avoid copy • Make sure both high of GTN242:8 and TN252:4 share the same register. • Problem • LRA and GRA phases work independently. • Live ranges of source TN and result TN can interfere in many ways. • GLOBAL GLOBAL Both handled in GRA • GLOBAL LOCAL Needs GRA LRA interaction • LOCAL GLOBAL Needs GRA LRA interaction • LOCAL LOCAL Both handled in LRA • LRA does not build the interference graph.

  13. The solution strategy • Globalize any local TN if the other TN is global TN252:4 :- pseudo_pair_high GTN242:8 • Need to solve only 2 cases GLOBAL GLOBAL Both handled in GRA LOCAL LOCAL Both handled in LRA • Minor changes in GRA • GRA considers pseudo pair instructions as copy • Attempts to remove copy by preferencing • Some changes in LRA • Do very simple liveness analysis to identify if pseudo pair op TNs can preference i.e. share same register • Maintain a list of preference TN in each TN’s live range • When choosing color for a TN, check if any TN in its preference list is already allocated and get the same register if it is available

  14. Results at a glance • Enhanced Open64 v.s. baseline Open64 improvements • WOPT enhancement: e.g., on modem applications • Cycle count reduced 3% to 40%, • stack reduced as much as 50%, • code size reduced 1% to 2% • Register-pair allocation: • Telecommunication: cycle count reduced 3.91% on average. • kernel codes: cycle count reduced 1.77% on average. • Hyperblock enhancement: (illustrated in terms of the ratio of HBs formed or BBs considered by the “modified algorithm” when compared to the “default algorithm”) • Networking: HBs formed 1.13; BBs considered 1.72; • Telecommunication: HBs formed 1.00; BBs considered 1.49;

More Related