FlexCC2 : An Optimizing Retargetable C Compiler for DSP Applications

V. Bertin, J-M. Daveau, P. Guillaume, D. Pilat, C. Robine, M. Santana, T. Théry
FlexWare Embedded System Technology

Plan
• Context
• Goals
• FlexCC2
  • architecture
  • optimizations
• Results
• Conclusion

Context: Industrial Compiler
• Enabling technology for embedded processors (ASIP / AS-DSP).
• Application domains: digital imaging, MP3, hard disk, mobile.
• Instructions and features specific to certain classes of applications.
• Loop-intensive code.
• Performance concentrated in small portions of critical code.
• Requirements: productivity, time-to-market, retargetability.
[Diagram labels: Embedded System, Embedded software]

Goals
• High-quality generated code
  • best in class for DSP compilers
  • remove any incentive to hand-code assembly
• Irregular target architectures
  • encoding constraints
  • irregular instruction-level parallelism
  • register-set constraints
• Specific instructions and features
  • hardware loops, multiply-accumulate, addressing modes, post-operations, …
• Short retargeting time
  • shorten time-to-market for new processors

FlexCC2 Overall Design
• Flexible compilation framework:
  • Easily add or remove generic and custom optimizations.
  • Re-order optimizations.
  • Retargetable compilation system.
• Multi-level framework
  • Machine-level optimizations.
  • Multi-level optimizations.
• DSP oriented
  • Support for DSP datatypes and operations
  • High added-value DSP optimizations
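The idea of adding, removing, and re-ordering optimizations can be pictured as a table of passes run in sequence. The following is only a minimal sketch under that assumption: `ToyIR`, `Engine`, `toy_remove_one`, `toy_halve`, and `run_flow` are hypothetical names invented for illustration and are not part of FlexCC2, CoSy, or EliXir.

```c
#include <stddef.h>

/* Hypothetical toy IR: only tracks how many operations remain. */
typedef struct { int op_count; } ToyIR;

/* An engine is any function that rewrites the IR in place. */
typedef void (*Engine)(ToyIR *ir);

/* Stand-in optimizations, not real FlexCC2 engines. */
void toy_remove_one(ToyIR *ir) { if (ir->op_count > 0) ir->op_count -= 1; }
void toy_halve(ToyIR *ir)      { ir->op_count /= 2; }

/* Run the engines in the order given by the flow table. */
void run_flow(ToyIR *ir, const Engine *flow, size_t n) {
    for (size_t i = 0; i < n; ++i)
        flow[i](ir);
}
```

Because the flow is plain data, changing the order or dropping an entry needs no change to the engines themselves, and different orders can give different results.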

FlexCC2 Architecture
[Diagram labels: input .c; High Level IR with engines anc0, cse, arT, HW Loops, lower (described by EDL, TDF, CGD); code generation; Low Level IR with register allocator, local scheduler, global scheduler, software pipeliner, post operation, HW Loops (described by EDF, SDF); output .asm]

High-level framework: CoSy®
[Diagram labels: input .c; Front End; CCMIR; engines anc0, cse, strength, chainflow, lowering engine; Back End (BEG): match, gra, sched, emit; output .asm. Description files: EDL (Engine Description List), TDF (Target Description File), CGD (Code Generator Description)]

Specific DSP optimizations
• Array to pointer transformation
  • Group access operations into families
  • Optimize address modes
  • Use index registers
  • Handle loop nesting
• Support for hardware-do loops
• Intrinsic functions recognition and replacement
[Diagram: Loops + arrays → Loops + pointers. Steps shown: loop analysis, Induction Expressions (IEs), partitioning into connivance sets (parallel evolution of references), sets manipulation, pointers generation (1 set = 1 pointer). Addressing resources & operations: ADDRESS { Ax[1..6]; …} OPERATIONS { Ax:++; Ax[1]+=2; …}]
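The "Loops + arrays" to "Loops + pointers" rewrite can be illustrated at source level. This is a hand-written sketch of the kind of result such a transformation aims for, not FlexCC2 output; the function names and the dot-product kernel are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Array form: every access is expressed as base + index. */
long dot_arrays(const short *x, const short *y, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (long)x[i] * y[i];
    return acc;
}

/* Pointer form: one pointer per group of references that evolve together,
   advanced with post-increment (the "Ax:++" style of operation). */
long dot_pointers(const short *x, const short *y, size_t n) {
    const short *px = x;
    const short *py = y;
    long acc = 0;
    /* Count-down trip counter: a shape that suits a hardware loop. */
    for (size_t k = n; k != 0; --k)
        acc += (long)*px++ * *py++;
    return acc;
}
```

Both functions compute the same value; the second form exposes post-incremented address registers and a fixed trip count directly to the back end.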

EliXir Back-end Infrastructure
[Diagram labels: Low Level IR; Machine Description (SDF); microengines: dataflow, liveness, Chaining, dwarf; EliXir API: dataflow API, reg. alloc. API, scheduling API, soft. pipe. API; engines: register allocator, local scheduler, global scheduler, software pipeliner, post operation, HW Loops; EDF (µ-engine flow)]

Engines Flow
[Diagram: Code Generator → pre-allocation optimizations → Register Allocation → post-allocation optimizations → ASMdump → output assembly file. Engine labels: Liveness, Scheduler, Post-Op, Coalesce, Super Blocker, Dataflow Peephole, Hwloop, Software Pipelining, Dominator, Paths, …]

Register Allocation Framework
[Diagram labels: microengines (C++ classes); Conservative Coalescer, Briggs Allocator, Callahan Allocator; allocation API, Briggs API; Spill Manager / Optimizer, Shuffle code Manager; Targeting API; Interference Graph and its low-level API; supporting classes: SSA, StackInfo, RegsetGroup, RegId, Dependencies, LoopTree]
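The slide names a conservative coalescer alongside a Briggs allocator but does not say which coalescing test is used. As an assumption, the sketch below shows the classic Briggs conservative test: two non-interfering nodes may be merged if the merged node would have fewer than K neighbors of significant degree (degree at least K). The `Graph` type and function names are hypothetical and do not reflect the FlexCC2 C++ classes.

```c
#include <stdbool.h>

#define MAX_NODES 8

/* Hypothetical interference graph as a small adjacency matrix. */
typedef struct {
    int n;
    bool adj[MAX_NODES][MAX_NODES];
} Graph;

void add_edge(Graph *g, int u, int v) { g->adj[u][v] = g->adj[v][u] = true; }

/* Degree of node t once a and b have been merged into a single node. */
static int degree_after_merge(const Graph *g, int t, int a, int b) {
    int d = 0;
    for (int u = 0; u < g->n; ++u) {
        if (u == a || u == b) continue;
        if (g->adj[t][u]) ++d;
    }
    /* Edges to a and to b collapse into one edge to the merged node. */
    if (g->adj[t][a] || g->adj[t][b]) ++d;
    return d;
}

/* Briggs test: merge only if fewer than k neighbors have degree >= k. */
bool briggs_can_coalesce(const Graph *g, int a, int b, int k) {
    if (g->adj[a][b]) return false;   /* interfering nodes never merge */
    int significant = 0;
    for (int t = 0; t < g->n; ++t) {
        if (t == a || t == b) continue;
        if (!(g->adj[t][a] || g->adj[t][b])) continue;
        if (degree_after_merge(g, t, a, b) >= k) ++significant;
    }
    return significant < k;
}
```

The test is conservative in the sense that a merge it accepts cannot turn a K-colorable graph into one that is not; it may reject merges that would in fact have been safe.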

Processor-Specific Instructions and Features
• Managed as target-specific or generic.
• Intrinsics recognition and replacement.
• Post operation, post increment.
• Mainly handled by a specific engine or µ-engine.
• Some optimizations require retargeting.
• Make use of various EliXir APIs (dataflow graph, scheduling, …).

if(ab) Then Else max = b max = a C Instruction Patterns Graph Pattern Matching Intrinsics Recognition & Replacement if(ab) max = a; else max = b; Control Flow Graph Expression Trees • Complex expressions • Multi-statements cmp r1,r2 move r2,r3 move if(ge),r1,r3 max r1,r2,r3 max = L_max(a, b); Unoptimized ASM Optimized ASM

Dataflow Peephole
• Graph pattern matching of instruction patterns on the dataflow graph, using def-use and liveness information.
• Loop before optimization:
    rep L14, r5h
    L12: ldx_f ax1,r4h
    ldx_f ax2,r1h
    ldx_f axx1,r0h
    L_fmul r0h,r1h,r3
    dmv r4h,r1h
    fmul r0h,r1h,r0h
    X_deplsp r0h,r0
    L_addsat r3,r0,r3
    …
    mea axx1,++#1
    mea ax2,++#-1
    mea ax1,++#-1
    L14:
• Dataflow graph nodes involved: ldx_f ax1,r4h; dmv r4h,r1h; mea ax1,++#-1
• Result: the load and the address update are merged into ldx_f ax1--,r4h
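The merge of a load with a later address update into a single post-modify load can be sketched on a toy instruction list. This is a simplified illustration under stated assumptions: it works on one straight-line block and scans forward for the next instruction touching the same address register, whereas the slide describes matching on a dataflow graph with def-use and liveness information. The `Insn` layout, opcode names, and `fuse_post_ops` are hypothetical.

```c
#include <stddef.h>

typedef enum { OP_LOAD, OP_ADDR_ADD, OP_OTHER, OP_LOAD_POST, OP_NOP } Opcode;

/* Hypothetical toy instruction. */
typedef struct {
    Opcode op;
    int addr_reg;  /* address register read or updated, -1 if none */
    int data_reg;  /* data register involved, -1 if none */
    int step;      /* address increment for OP_ADDR_ADD / OP_LOAD_POST */
} Insn;

/* Fuse "load via Ax ... Ax += step" into "load via Ax with post-modify".
   Returns the number of fused pairs. */
size_t fuse_post_ops(Insn *code, size_t n) {
    size_t fused = 0;
    for (size_t i = 0; i < n; ++i) {
        if (code[i].op != OP_LOAD) continue;
        for (size_t j = i + 1; j < n; ++j) {
            if (code[j].addr_reg != code[i].addr_reg) continue; /* unrelated */
            if (code[j].op == OP_ADDR_ADD) {
                code[i].op = OP_LOAD_POST;   /* absorb the update */
                code[i].step = code[j].step;
                code[j].op = OP_NOP;         /* update is now redundant */
                code[j].addr_reg = -1;
                ++fused;
            }
            break; /* the first instruction touching the register decides */
        }
    }
    return fused;
}
```

Stopping at the first instruction that touches the address register keeps the rewrite safe: if that instruction is another use rather than the update, moving the update earlier would change the address it sees.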

Retargeting FlexCC2
[Diagram: retargeting inputs: machine description (SDF), code generation rules (CGD, processed by BEG), engine flow (EDL), µ-engine flow (EDF), intrinsic patterns, C++ API. Compilation path: High level IR → Lower → Lowered IR → code generation engines → Low level IR → (µ-)engines]

Results • MMDSP+ single MAC DSP core. • Retargeting time  4 months. • ETSI Enhanced Full Rate benchmark (EFR).

Results
[Chart labels: CoSy; EliXir; + HWLoops + arT; Software pipelining + post-op; Register Allocation]

Original research work
• FlexCC2 includes advanced in-house research work:
  • arT / GarT.
  • flexible back-end infrastructure.
  • retargetable register allocation for irregular architectures.
  • retargetable dataflow peephole optimizer.
  • automatic intrinsic functions recognition.
  • MMX optimization using pattern matching.

Future work
• Interprocedural optimizations.
• Aliasing.
• Memory placement.
• MMX optimization using pattern matching.
• Interaction between scheduling and register allocation.
• Improved retargetability.

Conclusion
• Keystone for embedded software development
  • Synthesizing application code into the processor instruction set
  • Exploiting processor features
  • Optimizing code and resource usage
  • Driving processor architecture evolution
• Modular and extensible compiler framework
  • At high and low level.
  • State-of-the-art optimizations.
  • Advanced DSP optimizations.
  • Target-specific optimizations.
  • Short retargeting time.
• Perspectives: compiler as a CAD tool for SoCs
