
Compiling High-Level Descriptions on a Heterogeneous System



  1. Compiling High-Level Descriptions on a Heterogeneous System José Gabriel de Figueiredo Coutinho, Department of Computing, Imperial College London. The Programming Challenge of Heterogeneous Architectures Workshop, University of Birmingham, July 2-3, 2009

  2. Overview • 1. hArtes Project • 2. Research • a) Task Transformation • b) Mapping Selection • c) High-Level Synthesis • 3. Harmonic toolchain • 4. Challenges

  3. Why Heterogeneous Systems? Because... • orders-of-magnitude faster than conventional single-core processors • target computation-hungry applications: • financial modeling • pharmaceutical applications • simulation of real-life complex systems • strategy: mix conventional processors with specialised processors However... • how to develop applications? • portability... new system, new application? • design exploration... how to decide the partitioning and mapping? • optimisation... how to exploit specialised processors (FPGAs, DSPs)? • control vs automation... how do developers interact with the compilation process?

  4. 1. hArtes Project - Consortium Atmel Roma (Italy), Scaleo Chip (France), Faital (Italy), Thales Communications (France), Fraunhofer IGD (Germany), Thomson (France), TU Delft (Netherlands), Imperial College (U.K.), Université d'Avignon (France), INRIA (France), Università di Ferrara (Italy), Leaff (Italy), UP delle Marche (Italy), Politecnico di Bari (Italy), Politecnico di Milano (Italy). 15 partners in 5 countries

  5. Algorithm Exploration Tools: .c source code enters the hArtes Tool-Chain, which targets DSP, GPP and FPGA. hArtes: Holistic Approach to Reconfigurable real Time Embedded Systems. www.hartes.org

  6. Applications Audio and Video Applications • Enhanced In-Car audio and video: • Multichannel audio system • Automatic Echo Cancellation (AEC) • Automatic Speech and Speaker Recognition (ASR) • Adaptive filtering • Video Transcoding • Intra-cabin communication Hardware Platforms (multi-purpose hardware)

  7. Hardware Platforms hArtes Hardware Platform (ARM+DSP+FPGA); Atmel Diopsis 940H Evaluation Board (ARM+DSP)

  8. Toolchain The hArtes toolchain is composed of three toolboxes: 1) Algorithm Exploration Toolbox 2) Design Space Exploration Toolbox 3) System Synthesis Toolbox

  9. Algorithm Exploration Toolbox: SciLab • Physical Model and Algorithm developed in SCILAB • passed to SCILAB2C and on to the hArtes Design Exploration Toolbox

  10. Algorithm Exploration Toolbox: Nu-Tech The NU-Tech Graphical Algorithm Exploration (GAE) environment is the hArtes platform for validating complex algorithms, feeding the hArtes Design Exploration Toolbox. Thanks to the plug-in architecture, developers can write their own NUTs (NU-Tech satellites) and immediately plug them into the graphical interface design environment.

  11. Design Space Exploration Toolbox Input Source -> Profiling (TU Delft, Netherlands) -> Annotated C -> Task Partitioning (Politecnico di Milano, Italy) -> Annotated C -> Task Transformation (Imperial College, U.K.) -> Annotated C -> Data Representation Optimisation -> Annotated C

  12. System Synthesis Toolbox Annotated C -> Mapping Selection (Imperial College, U.K.) -> Annotated C -> Code Generation (Atmel Roma, Italy), producing Generic GPP code (C+macros), GPP Molen code, DSP C code and FPGA input. The GPP compiler, Molen, the DSP compiler and C2VHDL produce ELF objects and a bitstream; the Linker (TU Delft, Netherlands) combines them into executable code (ELF) for the Loader.

  13. Accelerating an application

  14. Design Exploration and Synthesis C Description -> Partitioning -> Tasks -> Task Transformation -> Implementations (e.g. T1_gpp1, T1_gpp2, T1_gpp3, T1_dsp1, T2_dsp2, T3_dsp3, T3_fpga2) -> Cost Estimation -> Mapping Selection (e.g. T1 to DSP2, T3 to GPP, T4 to DSP5), guided by a System Description.

  15. 2. a) Task transformation • What are task transformations? • Source-to-source transformations • pattern matching on syntax or dataflow • Why use them? • Compilers cannot include all optimisations • Use knowledge of domain or platform experts • Use to influence task mapping • How to use them? • Write in C++ using ROSE framework: hard • Write in our domain-specific language, CML: easier • Who writes them? • Domain or platform experts • Developers needing design-space exploration

  16. CML for task transformations • Basic CML: 3 parts to a transform • Pattern: syntax to match, label elements • Conditions based on dataflow • Resulting pattern to substitute • Proposed novel aspects of extended CML • Systematic description of dataflow conditions • Parameterised transforms • Features for labelling subpatterns • Probabilities for machine learning • Extension: CML code matching DFGs • s1 -> s2 matches a true dependence arc from s1 to s2 • s1 -/> s2 matches an antidependence arc from s1 to s2 • s1 -@-> s2 matches an output dependence arc from s1 to s2
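The three dependence arcs above can be illustrated with a small sketch (not hArtes code): a toy statement records its read and write sets, and each arc kind is just a test on how those sets overlap.

```cpp
#include <set>
#include <string>

// Toy statement: the variables it reads and writes.
struct Stmt {
    std::set<std::string> reads, writes;
};

static bool intersects(const std::set<std::string>& a,
                       const std::set<std::string>& b) {
    for (const auto& x : a) if (b.count(x)) return true;
    return false;
}

// s1 -> s2   : true (flow) dependence: s1 writes a value that s2 reads
bool true_dep(const Stmt& s1, const Stmt& s2)   { return intersects(s1.writes, s2.reads); }
// s1 -/> s2  : antidependence: s1 reads a value that s2 overwrites
bool anti_dep(const Stmt& s1, const Stmt& s2)   { return intersects(s1.reads, s2.writes); }
// s1 -@-> s2 : output dependence: both statements write the same location
bool output_dep(const Stmt& s1, const Stmt& s2) { return intersects(s1.writes, s2.writes); }
```

For example, with s1 being `x = a + b` and s2 being `a = x`, there is a true dependence (x) and an antidependence (a) but no output dependence.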

  17. Requirements: CML language • Aim: compact transformation description • Describe transformations on • Abstract Syntax Tree (AST)‏ • Data Flow Graph (DFG)‏ • Support transformations specific to • Application domain: embedded media • Target technology: CPU + DSP + FPGA • Allow parameterisable transforms • e.g. unrolling factor • Interpretation • Can change transform without recompilation • Saves time, eases learning curve • Can rapidly explore transform design space • Customize existing transforms • Facilitate cost estimate: e.g. number of registers

  18. CML example: replace multiply-by-n with shift • Replacing multiplies by shifts is usually an optimisation in hardware • lower area, greater speed

  transform times2ToShift {
    pattern { expr(1) * n }
    conditions { n & (n-1) == 0 }
    result { expr(1) << LOG2(n) }
  }

Pattern section: a syntax pattern with labelled parts; expr(1) labels a subexpression, and the whole pattern matches an expression multiplied by n. Conditions section: optional; the match is only replaced if all conditions hold (here, n is a power of two). Result section: what to replace the matched pattern with if the conditions apply; a shift replaces the multiply.
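The power-of-two condition and the rewrite can be checked in plain C++ (an illustrative sketch, not part of the hArtes tools; note that under C precedence `n & (n-1) == 0` would parse as `n & ((n-1) == 0)`, so explicit parentheses are used here):

```cpp
#include <cstdint>

// Condition from the transform: n is a (non-zero) power of two.
bool is_power_of_two(uint32_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Integer LOG2 for a power of two: position of the single set bit.
unsigned log2_pow2(uint32_t n) {
    unsigned k = 0;
    while (n >>= 1) ++k;
    return k;
}

// The rewrite expr * n  ==>  expr << LOG2(n) preserves the result
// whenever the condition holds.
uint32_t mul_via_shift(uint32_t expr, uint32_t n) {
    return expr << log2_pow2(n);
}
```

For instance, `mul_via_shift(7, 8)` equals `7 * 8`, while `is_power_of_two(6)` is false, so the transform would leave `expr * 6` alone.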

  19. Simple CML example • Eliminate addition with zero • expr + 0 => expr • Not always applicable (floating point: e.g. -0.0 + 0 = +0.0)

In CML:

  transform addZero {
    pattern { expr(1) + 0 }
    result { expr(1) }
  }

In C++, using the visitor pattern, the same transform must match the pattern in several stages: match any addition, recurse to the left-hand side (labelled x), check the right-hand side is the literal 0, and if the pattern matches replace the node with x:

  class AddZero : public Avisitor {
    Expr* result;
  public:
    void visit(Add* a) {
      // recurse to left-hand side
      a->getLhs()->accept(this);
      Expr* x = result;
      if (IntLiteral* il = dynamic_cast<IntLiteral*>(a->getRhs())) {
        if (il->getValue() == 0) {
          // pattern matched
          result = x;
        } else {
          result = new Add(x, result);
        }
      } else {
        a->getRhs()->accept(this);
        result = new Add(x, result);
      }
    }
  };

  20. CML Interpreter The CML parser turns CML source into a CML AST, which the interpreter matches against the source AST. • Interpret: • Depth-first visit of the source AST • At each node • If the node matches the root of the CML pattern • Match the pattern depth-first, in postorder • Save labelled nodes ("a" in the example) • Exit at the first mismatch • If the pattern matches and the conditions apply • Visit the result pattern to apply the result

  transform addZero {
    pattern { expr(a) + 0 }
    result { expr(a) }
  }

[diagram: the CML pattern AST (Add, CMLExpr a, IntLiteral 0) matched against ROSE source AST nodes (SgAddOp, SgIntVal)]

  21. Ray tracing: Design Space Exploration [diagram: tree of transform sequences, each labelled with the last transform applied and the resulting time in seconds. Start: simple, sequential loop, 46.0s. Simple parallelisation gives 23.3s; adding loop interchange and loop coalescing before parallelisation gives 23.0s, 22.6s and 22.2s; the best result, 20.1s, comes from pixel-cyclic parallelisation.] Add transforms to aid parallelisation; the best result comes from the pixel-cyclic parallel version.

  22. Loop coalescing • Replace a loop nest with a single loop • Should run in the same order as the original • Declare a new variable to control the replacement loop • Synthesise the old variables in terms of the new variable • This allows the body to be copied unmodified

  transform loopCoalesce {
    pattern {
      for(var(0)=0; var(0)<expr(1); var(0)++) {
        for(var(2)=0; var(2)<expr(3); var(2)++) {
          stmt(4);
        }
      }
    }
    result {
      // single loop with new variable nv, ranging from 0 to the
      // product of the trip counts of the original loops
      for(int nv=0; nv<expr(1)*expr(3); nv++) {
        // generate the original variable values in terms of nv
        // note: not strength-reduced
        var(0) = nv / expr(3);
        var(2) = nv % expr(3);
        // the original body
        stmt(4);
      }
    }
  }
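That the coalesced loop visits the iterations in the same order as the original nest can be checked with a small C++ sketch (illustrative only, with concrete trip counts in place of the CML labels):

```cpp
#include <vector>
#include <utility>

// Iteration order of the original two-level nest.
std::vector<std::pair<int,int>> nest_order(int e1, int e3) {
    std::vector<std::pair<int,int>> out;
    for (int i = 0; i < e1; i++)
        for (int j = 0; j < e3; j++)
            out.push_back({i, j});
    return out;
}

// Iteration order of the coalesced single loop: the old index values
// are synthesised from nv, exactly as in the transform's result.
std::vector<std::pair<int,int>> coalesced_order(int e1, int e3) {
    std::vector<std::pair<int,int>> out;
    for (int nv = 0; nv < e1 * e3; nv++) {
        int i = nv / e3;   // outer loop variable
        int j = nv % e3;   // inner loop variable
        out.push_back({i, j});
    }
    return out;
}
```

Both functions produce the identical sequence of (i, j) pairs, so the body can indeed be copied unmodified.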

  23. Experimental work: combine with model-based transforms • CML transforms are pattern-based • Match syntax or dataflow patterns • Model-based patterns • Map to underlying mathematical model + solution method • Combine pattern-based with model-based • Simplify model-based (transform into preferred input)

  24. Experimental work: combine with verification framework • The design verification flow is based on symbolic simulation and equivalence checking • The symbolic simulation results (outputs) from source and target code are compared using an equivalence checker (Yices) • Limitations • subset of C • integer types only • loop counts must be constant
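The idea of checking that a transformed task computes the same outputs as the source can be sketched with a brute-force stand-in for the SMT-based check (illustrative only; the real flow compares symbolic simulation results with Yices rather than enumerating inputs):

```cpp
#include <cstdint>

// A source task and a transformed version (multiply replaced by shift).
int32_t source_version(int32_t x)      { return x * 4 + 1; }
int32_t transformed_version(int32_t x) { return (x << 2) + 1; }

// Exhaustive check over a small input range: a brute-force stand-in
// for symbolic equivalence checking on integer-typed code.
bool equivalent_on_range(int32_t lo, int32_t hi) {
    for (int32_t x = lo; x <= hi; x++)
        if (source_version(x) != transformed_version(x)) return false;
    return true;
}
```

The exhaustive approach only scales to tiny input spaces, which is exactly why the flow above uses symbolic simulation plus an equivalence checker instead.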

  25. 2. b) Mapping Selection • Overall goal • Given an application, find the best implementation on a heterogeneous computing system such that execution time is minimised • Proposed techniques • Integrated mapping and scheduling technique • Multiple neighbourhood functions • Multi-loop parallelisation

  26. Mapping Selection: Design Flow • Tabu search • Generate neighbouring solutions iteratively • Minimise processing time • Mapping criteria • Implementations and costs associated with each task • Available processing elements • Communication cost • Configuration cost

  27. Integrated technique • Clustering + Mapping + Scheduling • Integrated in one neighbourhood function • Move tasks between processing elements • Extended solution space • Contains good solutions

  28. Multiple neighbourhood functions • Multiple neighbourhood functions • Increase diversification • Search for better solutions • Parallel search • Multi-processor systems
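The core of the search can be sketched as follows (a greatly simplified stand-in for the tabu search above: a hypothetical per-task cost table, a makespan objective, and a single move-task neighbourhood applied greedily rather than with tabu lists or multiple neighbourhoods):

```cpp
#include <vector>
#include <algorithm>

// Hypothetical cost model: cost[t][p] = execution time of task t on
// processing element p. A mapping assigns each task to one PE; tasks
// on the same PE are serialised, so the objective (makespan) is the
// load of the busiest PE.
using Costs = std::vector<std::vector<double>>;

double makespan(const Costs& cost, const std::vector<int>& map, int pes) {
    std::vector<double> load(pes, 0.0);
    for (size_t t = 0; t < map.size(); t++) load[map[t]] += cost[t][map[t]];
    return *std::max_element(load.begin(), load.end());
}

// One neighbourhood function: move a single task to another PE.
// Greedy local search with this move, as an illustrative sketch.
std::vector<int> improve(const Costs& cost, std::vector<int> map, int pes) {
    bool improved = true;
    while (improved) {
        improved = false;
        double best = makespan(cost, map, pes);
        for (size_t t = 0; t < map.size(); t++) {
            int best_pe = map[t];
            for (int p = 0; p < pes; p++) {
                map[t] = p;  // try moving task t to PE p
                double m = makespan(cost, map, pes);
                if (m < best) { best = m; best_pe = p; improved = true; }
            }
            map[t] = best_pe;  // keep the best placement found
        }
    }
    return map;
}
```

With three tasks and two PEs where tasks 0 and 1 run fastest on PE 1 and task 2 on PE 0, starting from everything on PE 0 (makespan 9), the move neighbourhood reaches the optimal makespan of 2.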

  29. Experiments (80 - 112 tasks) Benchmarks: • FIR filtering • Matrix multiplication • Hidden Markov model decoding • BGM interest rate model Compared approaches: INT, TABU [Porto, 1995]; SEP, TABU [Wiangtong, 2005]; INT, TABU, MultNF [this work].

  30. Multi-loop parallelisation • Find the best unrolling factor for each loop • Iterative approach • Unrolling configuration: the unrolling factors of all loops

  31. Loop Results IWR: speech recognition; SUSAN: corner detection for image processing; N-Body: particle modeling

  32. 2. c) High-Level Synthesis Requirements: R1: Rapid Development R2: Design Exploration R3: Extensibility R4: Manual Control [diagram: Haydn spans the behavioural and structural approaches; existing work covers only one or the other]

  33. Haydn interpretation rules The same description has two interpretations. Behavioural interpretation (sequential semantics):

  {
    delta = b*b - ((a*c) << 2);
    if (delta > 0) num_sol = 2;
    else if (delta == 0) num_sol = 1;
    else num_sol = 0;
  }

Structural interpretation (Handel-C): a datapath of two multipliers, a shifter, a subtractor, comparators and MUXes; the multiplies, shift and subtraction execute at cycle 1, and the comparisons and the num_sol selection (2, 1 or 0) at cycle 2.

  34. Rapid development Starting from the behavioural code, scheduling under constraints produces a staged structural version; unscheduling (the behavioural interpretation) recovers the original. Scheduled code, using two pipelined multipliers:

  par {
    // ================== [stage 1]
    pipe_mult[0].in(b,b);
    pipe_mult[1].in(a,c);
    // ================== [stage 8]
    tmp0 = pipe_mult[0].q;
    tmp1 = pipe_mult[1].q << 2;
    // ================== [stage 9]
    tmp2 = tmp0 - tmp1;
    // ================== [stage 10]
    if (tmp2 > 0) num_sol = 2;
    else if (tmp2 == 0) num_sol = 1;
    else num_sol = 0;
    delta = tmp2;
  }

Behavioural code:

  {
    delta = b*b - ((a*c) << 2);
    if (delta > 0) num_sol = 2;
    else if (delta == 0) num_sol = 1;
    else num_sol = 0;
  }

Synthesis applies the structural interpretation; unscheduling applies the behavioural one.
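The equivalence that unscheduling relies on can be illustrated in plain C++ (a sequential sketch only: the real staged Handel-C code executes its stages in parallel across pipeline cycles):

```cpp
// Behavioural interpretation of the discriminant example:
// delta = b*b - 4*a*c, num_sol = number of real solutions.
int num_sol_behavioural(int a, int b, int c) {
    int delta = b * b - ((a * c) << 2);
    if (delta > 0) return 2;
    else if (delta == 0) return 1;
    else return 0;
}

// Hand-"scheduled" version mirroring the staged code: each temporary
// corresponds to one stage's result.
int num_sol_scheduled(int a, int b, int c) {
    int p0 = b * b;          // stages 1-7: pipelined multiplier 0
    int p1 = a * c;          // stages 1-7: pipelined multiplier 1
    int tmp0 = p0;           // stage 8
    int tmp1 = p1 << 2;      // stage 8
    int tmp2 = tmp0 - tmp1;  // stage 9
    if (tmp2 > 0) return 2;  // stage 10
    else if (tmp2 == 0) return 1;
    else return 0;
}
```

Both versions agree on every input, which is what lets Haydn move freely between the two forms.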

  35. Design exploration Different constraints yield different schedules from the same behavioural code. Sharing a single pipelined multiplier serialises the two multiplications:

  par {
    {
      // ================= [stage 1]
      pipe_mult[0].in(b,b);
      pipe_mult[0].in(a,c);
    }
    ....
  }

in contrast with the two-multiplier schedule (stages 1, 8, 9, 10). Unscheduling (the behavioural interpretation) recovers the same source in both cases; synthesis (the structural interpretation) produces the corresponding datapath.

  36. Abstraction The fully staged single-multiplier schedule:

  par {
    {
      // ================= [stage 1]
      pipe_mult[0].in(b,b);
      pipe_mult[0].in(a,c);
    }
    {
      // ================== [stage 4]
      delay;
      tmp0 = pipe_mult[0].q;
    }
    {
      // ================== [stage 5]
      tmp1 = pipe_mult[0].q << 2;
      tmp2 = tmp0 - tmp1;
    }
    {
      // ================== [stage 6]
      if (tmp2 > 0) num_sol = 2;
      else if (tmp2 == 0) num_sol = 1;
      else num_sol = 0;
      delta = tmp2;
    }
  }

abstracts, via unscheduling (the behavioural interpretation), back to:

  {
    delta = b*b - ((a*c) << 2);
    if (delta > 0) num_sol = 2;
    else if (delta == 0) num_sol = 1;
    else num_sol = 0;
  }

  37. Design quality Scheduling the behavioural code under constraints produces the staged par block (stages 1, 8, 9, 10, as on slide 34). For design quality the user can also intervene with manual scheduling, editing the structural description directly; unscheduling recovers the behavioural interpretation to check the result.

  38. Unscheduling [diagram: four-step unscheduling example]

  39. Haydn transformations: interactive mode

  @resources.set(*; UNITS:6);
  {
    @HLS.run(II:1);
    // original code
  }

  @resources.set(*; UNITS:6);
  {
    // transformed code
  }

  40. Haydn-C: GARCH walk kernel [figure: kernel specification and constraints]

  41. Design exploration: batch mode Constraints trade throughput for area. With 5 multiplications: • 1 cycle per result => 5 multipliers • 2 cycles per result => 3 multipliers • 5 cycles per result => 1 multiplier
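The multiplier counts above follow from a simple ceiling division: with one new result every `cycles` cycles, each multiplier can serve `cycles` of the multiplications, so the required count is ceil(5 / cycles). A one-line sketch:

```cpp
// Multipliers needed to sustain `mults_per_result` multiplications
// when a new result is produced every `cycles_per_result` cycles:
// each multiplier handles `cycles_per_result` multiplications, so we
// need the ceiling of the ratio.
int multipliers_needed(int mults_per_result, int cycles_per_result) {
    return (mults_per_result + cycles_per_result - 1) / cycles_per_result;
}
```

This reproduces the slide's three data points: 5, 3 and 1 multipliers for 1, 2 and 5 cycles per result.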

  42. Evaluation: speed vs area

  43. Initiation interval vs area

  44. 3. Harmonic Toolchain: Design Flow C source files and a hardware description enter Task Partitioning, which produces tasks (task A, task B, ...). The Task Transformation Engine, driven by CML descriptions (pattern to match, matching conditions, result pattern, input task parameters), rewrites the tasks. Mapping Selection assigns each task to a processing element (e.g. task A1 FPGA, task A2 FPGA, task A3 DSP, task B1 GPP, task B2 DSP, ...) and can request a new partition. C code specific to each PE then goes to the GPP compiler, to Haydn (HLS, producing a cycle-accurate Handel-C description for FPGA synthesis) and to the DSP compiler. Generic transform libraries hold GPP, FPGA and DSP transforms, ROSE C++ transforms, and application- and domain-specific transformation descriptions. The outputs are the implementations: bitstream, binaries and runtime support.

  45. Tools and Annotations Haydn (High-Level Synthesis) pipelines an annotated block:

  {
    #pragma haydn pipeline II(1)
    s = SQRT(a);
    y = (s + b) * (c + d);
  }

into:

  par {
    sqrt_v4.in(a);
    adder_v4[0].in(sqrt_v4.res, b);
    adder_v4[1].in(c, d);
    mult_v4.in(adder_v4[0].res, adder_v4[1].res);
    y = mult_v4.res;
  }

Task Partitioning and Mapping Selection use OpenMP sections plus mapping pragmas in the source files (tasks/implementations):

  void foo(...) {
    ...
    #pragma omp parallel sections num_threads(2)
    {
      #pragma omp section
      {
        #pragma map call_hw \
          impl(MAGIC, 14) \
          param(x,1000,r) \
          param(h,100, rw)
        filter(x, h);
      }
      #pragma omp section
      {
        #pragma map call_hw \
          impl(ARM, 15) \
          param(y,2000,r) \
          param(i,50, rw)
        filter2(y, i);
      }
    }
  }

Clustering annotation:

  #pragma map cluster
  void d0d2Sci2CMixRealChTmpd2(...) {
    ...
    ssOpStarsa1(a,x,t1);
    ...
    ssOpStarsa2(b,y,t2);
    ...
    ssOpPlusaa1(t1,t2,z);
  }

  46. ROSE source infrastructure • Software analysis and optimization for scientific applications • Tool for building source-to-source translators • Support for C, C++, Fortran, binaries • Loop optimizations • Lab and academic use • Software engineering • Performance analysis • Domain-specific analysis and optimizations • Development of new optimization approaches • http://rosecompiler.org

  47. 4. Challenges • Theoretical • define and meet global constraints (application/platform) • correctness: verify transformation results • effective combination of static and dynamic analysis • Practical • reuse legacy code • incremental approach for using the toolchain • create a modular toolchain that can evolve with new applications and platforms

  48. 5. Summary • 1. hArtes Project • complete toolchain targeting heterogeneous systems • 2. Research • Task Transformations: CML language for describing transformations • Mapping Selection: integrated approach with multiple neighbourhood functions • High-Level Synthesis (Haydn): combined behavioural and structural approach • 3. Harmonic toolchain • modular: enables customisation and technology evolution
