Compilers for embedded systems: Why are compilers an issue?

Compilers for embedded systems:Why are compilers an issue? • Many reports about low efficiency of standard compilers • Special features of embedded processors have to be exploited. • High levels of optimization more important than compilation speed. • Compilers can help to reduce the energy consumption (energy optimization). • Compilers could help to meet real-time constraints. • Less legacy problems than for PCs. • There is a large variety of instruction sets. • Design space exploration for optimized processors makes sense

Use of assembly languages in embedded systems C-Code [Paulin, 1995] (DSP) C-Code (µController) Assembler (DSP) Assembler (µController) Similar situation more recently

Optimizations considered • Energy-aware compilation • Compilation for digital signal processors • Compilation for multimedia processors • Compilation for VLIW (very long instruction word) processors • Compilation for network processors • Compiler generation, retargetable compilers and design space exploration

Efforts for Reducing Energy • Device Level • Development of Low Power Devices • Reducing Power Supply Voltage • Reducing Threshold Voltage • Circuit Level • Gated Clock • Path Transistor Logic • Asynchronous Circuits • System Level • Fabrication level

Optimization for low-energy the same as optimization for high performance? • No ! • High-performance if available memory bandwidth fully used; low-energy consumption if memories are at stand-by mode • Reduced energy if more values are kept in registers ADD r3,r0,r2 MOV r0,#28 MOV r2,r12 MOV r12,r11 MOV r11,rr10 MOV r0,r9 MOV r9,r8 MOV r8,r1 LDR r1, [r4, r0] ADD r0,r3,r1 ADD r4,r4,#4 ADD r5,r5,#1 CMP r5,#100 BLT LL3 LDR r3, [r2, #0] ADD r3,r0,r3 MOV r0,#28 LDR r0, [r2, r0] ADD r0,r3,r0 ADD r2,r2,#4 ADD r1,r1,#1 CMP r1,#100 BLT LL3 int a[1000]; c = a; for (i = 1; i < 100; i++) { b += *c; b += *(c+7); c += 1; } 2231 cycles 16.47 µJ 2096 cycles 19.92 µJ

Energy models • Commercial tools frequently very imprecise • Model of Tiwari (Dissertation, Princeton 1996):Cost of instructions and of transitions between instructions;Does not separate out the cost of memory access • Model of Simunic, de Micheli (DAC 99):Model based on data sheets; does not require measurements.Does not take transitions into account. • Russell, Jacome (ICCD, 1998): based on precise measurement for two fixed configurations;cannot predict effect of changes to memory architecture. • Lee (LCTES 2001): detailed analysis of the effect pipeline stages; does not include multi-cycle operations and stalls • Dedicated energy models.

Energy Model by Knauer and Steinke VDD ARM7 DataMemory DAddr ALU Register File Data RegValue Reg# BarrelShifter IAddr InstructionMemory Imm Instr Instr Opcode Multi-plier Instr. Decoder& Control Logic Etotal = Ecpu_instr + Ecpu_data + Emem_instr + Emem_data

Instruction dependent costs in the CPU • Ecpu_instr = • MinCostCPU(Opcodei) +1 *  w(Immi,j) + ß1 *  h(Immi-1,j, Immi,j) +2 *  w(Regi,k) + ß2 *  h(Regi-1,k, Regi,k) + • 3 *  w(RegVali,k) + ß3 *  h(RegVali-1,k, RegVali,k) + • 4 *  w(IAddri) + ß4 *  h(IAddri-1, IAddri) + • FUCost(Instri-1,Instri) Cost for a sequence of m instructions w: number of ones;h: Hamming distance;FUCost: cost of switching functional units, ß: determined through experiments

Other costs • Ecpu_data =  5 * w(DAddri) + ß5 * h(DAddri-1, DAddri) +6 * w(Datai) + ß6 * h(Datai-1, Datai) Emem_instr = MinCostMem(InstrMem,Word_widthi) +7 * w(IAddri) + ß7 * h(IAddri-1, IAddri) +8 * w(IDatai) + ß8 * h(IDatai-1, IDatai) Emem_data =  MinCostMem (DataMem, Direction, Word_widthi) + 9 * w(DAddri) + ß9 * h(DAddri-1, DAddri) +10 * w(Datai) + ß10 * h(Datai-1, Datai)

Results • It is not important, which address bit is set to ‘1’ • The number of ‚1‘s in the address bus is irrelevant • The cost of flipping a bit on the address bus is independent of the bit position. • It is not important, which data bit is set to ‘1’ • The number of ‚1‘s on the data bus has a minor effect (3%) • The cost of flipping a bit on the data bus is independent of the bit position.

Compiler optimizations for improving energy efficiency • Energy-aware scheduling • Energy-aware instruction selection • Operator strength reduction: e.g. replace * by + and << • Minimize the bitwidth of loads and stores • Standard compiler optimizations with energy as a cost function R2:=a[0];for i:= 1 to 10 do begin R1:= a[i]; C:= 2 * R1 + R2; R2 := R1; end; E.g.: Register pipelining: for i:= 0 to 10 do C:= 2 * a[i] + a[i-1]; • Exploitation of the memory hierarchy

3 key problems for future memory systems • (Average) Speed • Energy/Power • Predictability/Worst Case Execution Time Energy Access times smaller, faster, less energy

1. (Average) Speed • Speed gap between processor and main DRAM increases Speed • early 60ties (Atlas):page fault ~ 2500 instructions • 2002 (2 GHz µP):access to DRAM ~ 500 instructions •  penalty for cache miss soon same as for page fault in Atlas 8 4 CPU (1.5-2 p.a.)  2x every 2 years 2 DRAM (1.07 p.a.) [P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane] 1 years 0 1 2 3 4 5

2. Power/Energy Example (CACTI Model): 2.5 2 1.5 Energy per access [nJ] 1 0.5 0 64 128 256 512 1024 2048 4096 8192 Memory size [bytes] [Steinke et al., Inf 12, UniDo, 2002]

3. Predictability/WCET • Predictability: For satisfying timing constraints in hard real-time systems, predictability is the most important concern;pre run-time scheduling is often the only practical means of providing predictability in a complex system [Xu, Parnas] • Time-triggered, statically scheduled operating systems • What about memory accesses? • Currently available caches don‘t solve the problem: • Improve the average case behavior • Use “non-deterministic“ cache replacement algorithms • Scratch-pad/tightly coupled memory based predictability

main SPM processor 0 scratch pad memory ARM7TDMI cores, well-known for low power consumption FFF.. Hierarchical memoriesusing scratch pad memories (SPM) • Address space Hierarchy Example no tag memory

For i .{ } for j ..{ } while ... Repeat call ... Example: Which segment (array, loop, etc.) to be stored in SPM? Gain gi and size si for each segment i. Maximise gain G = gi, respecting constraint K  si. Static memory allocation: Solution: knapsack algorithm. Dynamic reloading: Finding optimal reloading points. board Main memory (On-board) Array ... ? SPMcapacity K Array Processor Int ... Exploitation of SPM

Reduction in energy and average run-time Cycles Multi_sort (mix of sort algorithms)

Energy consumption per functional unit,as a function of the SPM size • Parameters different from previous slide

Hardware-support for block-copying • The DMA unit was modeled in VHDL, simulated, synthesized. Unit only makes up 4% of the processor chip. • The unit can be put to sleep when it is unused. DMA MEMORY CPU • Code size reductions of up to 23% for a 256 byte SPM were determined using the DMA unit instead of the dynamic approach that uses processor instructions for copying.

Compilers for embedded systems: Why are compilers an issue?

Compilers for embedded systems: Why are compilers an issue?

Presentation Transcript

Peter van Lith

525.415 : Embedded Microprocessor Systems

IBM i Application Development Tools and compilers WDS, RDi, RDi SOA, RTCi

COS 320 Compilers

UBC104 Embedded Systems

Languages and Compilers (SProg og Oversættere) Concurrency and distribution

Languages and Compilers (SProg og Oversættere)

Predictable Integration of Safety-Critical Software on COTS- based Embedded Systems

Fundamentals of Embedded Operating Systems

Introduction

Fall 2012

Embedded System Overview

Catholic University, PUCRS (Brazil) Faculty of Informatics and Faculty of Engineering

Modeling Embedded Systems

Software for embedded multiprocessors

Programming Languages and Compilers (CS 421)

Languages and Compilers (SProg og Oversættere)

Embedded Systems

Storage Systems

Hardware/Software Codesign of Embedded Systems

Modeling Embedded Systems

Languages and Compilers (SProg og Oversættere) Lecture 14 Concurrency and distribution