hArtes 2010 Final Review Toolchain Overview & Demo

Andrea Michelotti, Atmel, WP2 Leader

Agenda
• Key Achievements
• Developing an Application for a Heterogeneous Platform
• Initialization
• Sharing resources
• Calling a DSP function
• Target Execution
• Expressing parallelism
• Conclusions

Key Achievements (1)
• The hArtes toolchain…
  • dramatically reduces the learning curve
  • hides heterogeneous complexity:
    • no expert knowledge of the platform, tools or new programming languages is required
    • use ‘C’ or tools that produce ‘C’ (Scilab/Nutech)
  • automatically speeds up the application (with respect to GPP-only execution) by exploiting Processing Element (PE) capabilities

Key Achievements (2)
• Easily retarget new platforms:
  • basic OMAP porting took ~1 month
  • other popular platforms like GPUs can be targeted as well
  • the target must conform to the Molen machine
• Quickly evaluate how an application behaves on different architectures
  • important for time to market

Targeted Platforms/Architectures
• Heterogeneous Platforms:
  • Atmel DEB (Diopsis Evaluation Board): ARM9 GPP + mAgicV DSP
  • UniFE’s hArtes HW Platform: ARM9 GPP + mAgicV DSP + Xilinx FPGA
  • Scaleo’s hArtes Emulation Platform: ARM9 GPP + mAgicV DSP + Altera FPGA
  • TI Experimenter OMAPL138: ARM9 GPP + C67 TI DSP

Developing an Application for a Heterogeneous Platform in a nutshell
• Before hArtes, toolchains for heterogeneous hardware (Diopsis and OMAP) consisted of separate subchains: one for the GPP and one for the DSP.
• MAIN PROBLEMS:
  • Steep learning curve: target tools, architecture, APIs…
  • Without knowledge of the underlying software/hardware, it is difficult to use the PEs and to benefit from them;
  • Code maintainability: two distinct projects must be kept aligned;
  • Code portability: GPP code usually contains platform-specific APIs to load, execute and access PE resources, so identical C code cannot produce correct results across all platforms;
  • Debugging: no unified image, no unified debugger.

Writing an application for a heterogeneous platform in a nutshell

Suppose we want to port our legacy code to a powerful but heterogeneous architecture like OMAP or DIOPSIS…

C code running on my host PC

    void main() {
        …
        unsigned *shared_array = malloc(SIZE);
        …
        my_fft(param, shared_array, …);
        another_kernel(…);
        …
    }

Seems easy, but, after a while, becomes a nightmare! …

Initialization: OMAP toolchain (through dsplink)

Use of a specific API to access the DSP. The DSP image is loaded as a file. The DSP has its own main.

GPP code

    int main(int argc, char **argv) {
        …
        if (DSP_SUCCEEDED (status)) {
            status = PROC_setup (NULL) ;
        }
        if (DSP_SUCCEEDED (status)) {
            status = PROC_attach (processorId, NULL) ;
            if (DSP_FAILED (status)) {
                RDWR_1Print ("PROC_attach failed. Status: [0x%x]\n", status) ;
            }
        }
        else {
            RDWR_1Print ("PROC_setup failed. Status: [0x%x]\n", status) ;
        }
        if (DSP_SUCCEEDED (status)) {
            args [0] = strNumIterations ;
            {
                status = PROC_load (processorId, dspExecutable_myfft, NUM_ARGS, args) ;
            }
            if (DSP_FAILED (status)) {
                RDWR_1Print ("PROC_load failed. Status: [0x%x]\n", status) ;
            }
        }
        if (DSP_SUCCEEDED (status)) {
            status = PROC_start (processorId) ;
            if (DSP_FAILED (status)) {
                RDWR_1Print ("PROC_start failed. Status: [0x%x]\n", status) ;
            }
        }
        …
    }

DSP code

    void myfft();

    int main(int argc, char **argv) {
        myfft();
    }

Initialization: Diopsis toolchain case

Very low-level API to access the DSP. The DSP image is loaded as a binary file. The DSP has its own main.

GPP code

    int main (int argc, char *argv[]) {
        ....
        ret = mAgicV_load_PM("myfft.bin", _m_fd_extm);
        if (ret != 0) return ret;
        ret = mAgicV_load_DM("myfft_datamem.bin");
        if (ret != 0) return ret;
        ret = mAgicV_load_XM("myfft_extmem.bin", (0x365890));
        if (ret != 0) return ret;
        ret = mAgicV_init_PMU();
        if (ret != 0) return ret;
        magicV_start();
        ..
        magicV_wait();
        ....
    }

DSP code

    void myfft();

    int main(int argc, char **argv) {
        …
        myfft();
        …
    }

Initialization: hArtes toolchain case

The hArtes Toolchain and hArtes Runtime take care of hiding all the initialization details. JUST one code base.

GPP/Application code

    int main (int argc, char *argv[]) {
        ....
    }

Sharing resources: OMAP toolchain case

Use of a specific API to create a shared area, translate its address for the DSP, and then pass the translated address to the DSP via messages. The memory layout must be configured by recompiling drivers!

GPP code

    int main(int argc, char **argv) {
        SMAPOOL_Attrs poolAttrs ;
        poolAttrs.bufSizes = (Uint32 *) &size ;
        poolAttrs.numBuffers = (Uint32 *) &numBufs ;
        poolAttrs.numBufPools = NUM_BUF_SIZES ;
        poolAttrs.exactMatchReq = TRUE ;
        volatile unsigned* my_shared_array;
        status = POOL_open (POOL_makePoolId(processorId, SAMPLE_POOL_ID), &poolAttrs) ;
        if (DSP_FAILED (status)) {
            MPCSXFER_1Print ("POOL_open () failed. Status = [0x%x]\n", status) ;
        }
        if (DSP_SUCCEEDED (status)) {
            status = POOL_alloc (POOL_makePoolId(processorId, SAMPLE_POOL_ID),
                                 &my_shared_array, SIZE, DSPLINK_BUF_ALIGN) ;
            /* Get the translated DSP address to be sent to the DSP. */
            if (DSP_SUCCEEDED (status)) {
                status = POOL_translateAddr (
                             POOL_makePoolId(processorId, SAMPLE_POOL_ID),
                             &dspCtrlBuf, AddrType_Dsp,
                             (Void *) &my_shared_array_from_dsp, AddrType_Usr) ;
            }
        }
        …
    }

DSP code

    int main(int argc, char **argv) {
        DSPlink_init();
        ….
    }
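
The address translation step above can be pictured as plain offset arithmetic between two views of the same shared pool. The following is a minimal sketch under that assumption, not the dsplink implementation: `shared_pool_t`, its fields, and `pool_usr_to_dsp` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one shared pool as seen from both sides. */
typedef struct {
    uintptr_t usr_base;  /* start of the pool in the GPP user address space */
    uint32_t  dsp_base;  /* start of the same pool in the DSP address space */
    size_t    size;      /* pool size in bytes */
} shared_pool_t;

/* Translate a GPP user-space address into the DSP's view of the pool.
 * Returns 0 on success, -1 if the address lies outside the pool. */
int pool_usr_to_dsp(const shared_pool_t *pool, uintptr_t usr_addr,
                    uint32_t *dsp_addr)
{
    assert(pool != NULL && dsp_addr != NULL);
    if (usr_addr < pool->usr_base || usr_addr - pool->usr_base >= pool->size) {
        return -1;
    }
    /* Same byte offset from the pool start, different base address. */
    *dsp_addr = pool->dsp_base + (uint32_t)(usr_addr - pool->usr_base);
    return 0;
}
```

With a pool mapped at `0x40000000` on the GPP and `0xC0000000` on the DSP, the user address `0x40000010` would translate to `0xC0000010`. Every pointer handed to the DSP has to go through a step like this, which is the bookkeeping the programmer carries in this toolchain.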

Sharing resources: Diopsis toolchain case

Very raw access: manually mapped addresses are shared directly, using a local copy and write-back. Use of specific APIs and specific compiler directives. NO LINKER is used, which causes many problems with source alignment and debugging.

GPP code

    #define MY_SHARED_ARRAY_ADDR 2

    int main(int argc, char **argv) {
        ....
        unsigned local_my_shared_array[];
        mAgicV_read_buff(local_my_shared_array, MY_SHARED_ARRAY_ADDR, sizeof(my_shared_array));
        ....
        // modify local copy, write back
        mAgicV_write_buff(local_my_shared_array, MY_SHARED_ARRAY_ADDR, sizeof(my_shared_array));
        …

DSP code

    #define MY_SHARED_ARRAY_ADDR 2

    volatile long chess_storage(DATA:MY_SHARED_ARRAY_ADDR) my_shared_array;
    volatile long chess_storage(DATA:MY_SHARED_ARRAY_ADDR+SIZEOF(my_shared_array)) my_other_variable;

    int main(int argc, char **argv) {
        // access the variable
    }

Sharing resources: hArtes toolchain case

Intuitive and portable. The DSE automatically turns malloc into hmalloc (hArtes API), which allocates and traces memory in a shared physical space of the target platform. Very natural access.

GPP code

    int main(int argc, char **argv) {
        ....
        unsigned *my_shared_array;
        my_shared_array = malloc(MYSIZE);
        #pragma map call_hw dsp0
        dsp_func(my_shared_array);
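
To make "allocates and traces memory in a shared physical space" more concrete, here is a minimal host-side sketch. It is not the hArtes runtime: the real hmalloc signature and behavior are not given here, so `hmalloc_sketch`, `hmalloc_sketch_is_shared`, the static array standing in for the shared region, and all sizes are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHARED_REGION_BYTES 4096u  /* hypothetical size of the shared region */
#define MAX_TRACED_BLOCKS   16u    /* hypothetical limit on traced blocks */
#define BLOCK_ALIGN         8u     /* hypothetical block alignment */

typedef struct { size_t offset; size_t size; } traced_block_t;

/* A static array stands in for the shared physical space of the platform. */
static uint64_t shared_region[SHARED_REGION_BYTES / 8u];
static traced_block_t traced[MAX_TRACED_BLOCKS];
static size_t traced_count = 0;
static size_t next_free = 0;

/* Bump-allocate from the shared region and record the block.
 * Returns NULL when the request cannot be satisfied. */
void *hmalloc_sketch(size_t size)
{
    size_t rounded;
    if (size == 0 || size > SHARED_REGION_BYTES) return NULL;
    rounded = (size + BLOCK_ALIGN - 1u) & ~(size_t)(BLOCK_ALIGN - 1u);
    if (traced_count == MAX_TRACED_BLOCKS) return NULL;
    if (rounded > SHARED_REGION_BYTES - next_free) return NULL;
    assert(next_free % BLOCK_ALIGN == 0);
    traced[traced_count].offset = next_free;
    traced[traced_count].size = size;
    traced_count++;
    next_free += rounded;
    return (unsigned char *)shared_region + traced[traced_count - 1].offset;
}

/* Returns 1 if [ptr, ptr + len) lies inside one traced block, else 0.
 * A runtime could use such a check before handing a pointer to a PE. */
int hmalloc_sketch_is_shared(const void *ptr, size_t len)
{
    uintptr_t p = (uintptr_t)ptr;
    uintptr_t base = (uintptr_t)shared_region;
    size_t off, i;
    if (p < base || p - base >= SHARED_REGION_BYTES) return 0;
    off = (size_t)(p - base);
    for (i = 0; i < traced_count; i++) {
        if (off >= traced[i].offset && off - traced[i].offset < traced[i].size) {
            size_t remaining = traced[i].size - (off - traced[i].offset);
            return len <= remaining;
        }
    }
    return 0;
}
```

The point of the tracing table is that the application keeps writing ordinary pointer code, while the runtime can still tell which buffers are visible to the DSP.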

Calling a DSP routine: OMAP/Diopsis toolchain cases

There is no concept of a DSP call from the GPP; the GPP can only start a DSP process that executes the desired function. The call can be inefficient, or may not execute correctly on the DSP. The programmer must know the underlying architecture! For example, on the DSP of the DIOPSIS architecture the type int is 16 bits wide, whereas it is typically 32 bits.

GPP code (Diopsis)

    int main(int argc, char **argv) {
        … // initialization, see main
        mAgicV_start()
        …

GPP code (OMAP)

    int main(int argc, char **argv) {
        … // initialization, see main
        if (DSP_SUCCEEDED (status)) {
            status = PROC_start (processorId) ;
            if (DSP_FAILED (status)) {
                RDWR_1Print ("PROC_start failed. Status: [0x%x]\n", status) ;
            }
        }
        …

DSP code

    void my_fft(int pp, float*…);

    int main(int argc, char **argv) {
        … // initialization, see main
        my_fft();
        …
    }
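
The int-width pitfall mentioned above is the kind of detail a programmer has to guard against by hand in these toolchains. A minimal sketch of such a guard, using only the standard `<stdint.h>` types; the function names are hypothetical and introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a 32-bit value survives a trip through a 16-bit signed int. */
int fits_in_int16(int32_t value)
{
    return value >= INT16_MIN && value <= INT16_MAX;
}

/* Models what a 16-bit unsigned quantity keeps of a 32-bit value. */
uint16_t low_16_bits(uint32_t value)
{
    return (uint16_t)(value & 0xFFFFu);
}

/* Narrow a GPP-side value for a DSP whose int is 16 bits wide.
 * The assert documents the precondition instead of silently truncating. */
int16_t to_dsp_int(int32_t value)
{
    assert(fits_in_int16(value));
    return (int16_t)value;
}
```

A parameter such as 70000 does not fit: `low_16_bits(70000u)` keeps only 4464. Declaring everything that crosses the GPP/DSP boundary with fixed-width types makes this kind of mismatch visible in the source instead of at run time.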

Calling a DSP routine: hArtes toolchain case

Intuitive and portable. The DSE checks whether the my_fft function can be executed on the target DSP (it checks parameters and stack memory usage). It also estimates the cost of the call to decide whether it is worth moving the execution to the DSP. To call the DSP function, the DSE adds a pragma to the function call and generates a C source that can be compiled by the DSP toolchain.

GPP code

    void my_fft(…) {
        …
    }

    int main(int argc, char **argv) {
        ..
        #pragma call_hw dsp 0
        my_fft();
        ..
    }
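
The feasibility check and cost estimate described above can be sketched as a simple decision rule. This is not the DSE's actual model: the `call_estimate_t` fields, the linear transfer cost, and `should_offload` are assumptions chosen for illustration, with all costs expressed in the same (hypothetical) cycle unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-call estimate, all costs in the same cycle unit. */
typedef struct {
    uint64_t gpp_cycles;       /* estimated cost of running on the GPP */
    uint64_t dsp_cycles;       /* estimated cost of running on the DSP */
    uint64_t transfer_bytes;   /* data moved to and from the DSP */
    uint64_t cycles_per_byte;  /* assumed linear transfer cost */
    uint64_t call_overhead;    /* fixed cost of starting the DSP call */
    uint32_t stack_needed;     /* stack bytes the function uses */
    uint32_t stack_available;  /* stack bytes available on the DSP */
} call_estimate_t;

/* Returns 1 if the call is feasible on the DSP and estimated to be cheaper
 * there than on the GPP, 0 otherwise. */
int should_offload(const call_estimate_t *e)
{
    uint64_t offload_cost;
    assert(e != NULL);
    if (e->stack_needed > e->stack_available) {
        return 0;  /* feasibility check failed: keep the call on the GPP */
    }
    offload_cost = e->dsp_cycles + e->call_overhead
                 + e->transfer_bytes * e->cycles_per_byte;
    return offload_cost < e->gpp_cycles;
}
```

Under this rule a large kernel whose DSP time plus transfer overhead is well below its GPP time is moved, while a small kernel dominated by call and transfer overhead stays on the GPP.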

Expressing Parallelism: OMAP/Diopsis toolchain

NOT KNOWN/NOT IMPLEMENTED

Expressing Parallelism: hArtes toolchain case

Intuitive and portable. hArtes supports some OpenMP constructs to express parallelism. In some cases the DSE automatically detects kernels that can run in parallel and adds OpenMP annotations to the C source. Parallelism can also be expressed explicitly via POSIX threads.

    void main() {
        …
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                #pragma call_hw dsp 0
                my_fft();
            }
            #pragma omp section
            {
                another_kernel(…);
            }
        }
    }

    #pragma call_hw dsp 0
    void my_fft(…);

    int main(int argc, char **argv) {
        …
        // initialization, see
        hthread_create(my_fft()…);
        another_kernel();
        hthread_join();
    }
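
For readers who want to try the OpenMP sections pattern on a host PC, here is a minimal plain-C sketch without the hArtes pragmas. The two kernels are hypothetical stand-ins for my_fft and another_kernel. The code also compiles without OpenMP support, in which case the pragmas are ignored and the sections simply run one after the other.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a DSP-style kernel: scales a buffer in place. */
void scale_kernel(float *data, size_t n, float factor)
{
    size_t i;
    for (i = 0; i < n; i++) {
        data[i] *= factor;
    }
}

/* Hypothetical stand-in for a second, independent kernel. */
long sum_kernel(const int *data, size_t n)
{
    long total = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* Runs the two independent kernels in separate OpenMP sections.
 * The kernels touch disjoint data, so no synchronization is needed
 * beyond the implicit barrier at the end of the sections construct. */
long run_two_kernels(float *samples, size_t n_samples,
                     const int *values, size_t n_values)
{
    long total = 0;
    assert(samples != NULL && values != NULL);
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            scale_kernel(samples, n_samples, 2.0f);
        }
        #pragma omp section
        {
            total = sum_kernel(values, n_values);
        }
    }
    return total;
}
```

Only one section writes `total`, and it is read after the construct ends, so the result is the same whether the sections run in parallel or sequentially.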

Target Execution (under Linux): OMAP/Diopsis toolchain

Two separate binaries, no common symbols, no unified debugger; I/O messages (printf) often rely on a JTAG connection. RUN and HOPE!

    bash$ ./my_fft_arm.elf <my_fft_dsp.bin>

Target Execution (under Linux): hArtes toolchain

Single ELF binary, common symbols, unified debugger, I/O messages (printf) on the target. RUN!

    bash$ ./my_fft.elf

Conclusions
• Although the original “Brain to Bit” (B2B) objective was very ambitious, the hArtes toolchain fulfilled its original promise: to support software development for heterogeneous hardware without expert knowledge of the target platform, allowing developers to build high-performance applications through fully automated solutions that abstract low-level hardware details.
• Areas for improvement mainly concern data flow analysis, automatic parallelization, AET integration in Eclipse, and debugging capabilities.
