
Embedded TechCon: Practical Techniques for Embedded System Optimization Processes

This presentation covers practical techniques for optimizing embedded systems: following a process, setting quantitative goals, considering platform architecture, optimizing algorithms, estimation and modeling, helping the compiler, power considerations, and multi-core systems.


Presentation Transcript


  1. Embedded TechCon: Practical Techniques for Embedded System Optimization Processes
Rob Oshana, robert.Oshana@freescale.com

  2. Agenda
• Follow a process
• Define the goals, quantitatively
• The platform architecture makes a big difference
• Don't be naïve about the algorithms
• Do some estimation and modeling
• Help out the compiler if possible
• Power is becoming more important
• What about multiple cores?
• Track what you are doing

  3. There is a right way and a wrong way
Donald Knuth: "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
Discipline and an iterative approach are the keys to effective serial performance tuning:
• measurements and careful analysis to guide decision making
• change one thing at a time
• meticulously re-measure to confirm that changes have been beneficial

  4. Symptoms
• Excessive optimization
• Premature optimization
• Fixation on efficiency
These consume project resources, delay release, and compromise software design without directly improving performance.
Model first before optimizing. There are always tradeoffs.

  5. Follow a process

  6. Functional: "The embedded software shall… (monitor, control, etc.)"
Non-functional: "The embedded software shall be… (fast, reliable, scalable, etc.)"
Functional = what the system should do. Non-functional = how well the system should do it.
Spend time up front understanding your non-functional requirements. "It has to be really fast" and "it has to be able to kick <competitor A>'s butt" are examples of real performance "requirements" as actually stated; quantify them instead, e.g.:
    IPFwd fast path (Kpps): Should: 600, Must: 550

  7. There is a Difference Between Latency and Throughput
"It is not possible to determine both the position and momentum of an object beyond a certain amount of precision." (Heisenberg's uncertainty principle)
Similarly, it is not possible to design a system that provides both the lowest latency and the highest throughput. However, real-world systems (such as media and eNodeB) need both, so tune the system for the right balance of latency and throughput. Example goals:
• Latency: 10 usec average, 50 usec maximum wake-up latency for RT tasks
• Throughput: 50 Mbps UL, 100 Mbps DL for 512B packets
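A latency goal like the one above is only useful if you measure against it. As a minimal sketch of such a measurement (mine, not from the slides, Linux/POSIX-flavored), the following C fragment measures the wake-up latency of a periodic real-time task; the 1 ms period is an arbitrary assumption:

    #include <stdio.h>
    #include <time.h>

    #define PERIOD_NS 1000000L  /* assumed 1 ms period */

    int main(void)
    {
        struct timespec next, now;
        clock_gettime(CLOCK_MONOTONIC, &next);

        for (int i = 0; i < 1000; i++) {
            /* schedule the next absolute wake-up time */
            next.tv_nsec += PERIOD_NS;
            if (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec++;
            }
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            clock_gettime(CLOCK_MONOTONIC, &now);

            /* wake-up latency = actual wake time - requested wake time */
            long lat_ns = (now.tv_sec - next.tv_sec) * 1000000000L
                        + (now.tv_nsec - next.tv_nsec);
            printf("wakeup latency: %ld ns\n", lat_ns);
        }
        return 0;
    }

Collecting the average and maximum over many iterations gives numbers directly comparable to the "10 usec avg, 50 usec max" goal.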

  8. Map the application to the right core, or offload to the cloud:
• CPU: latency-oriented cores
• GPU: throughput-oriented cores

  9. Estimating embedded performance can be done prior to writing the code:
• Maximum CPU performance: "What is the maximum number of times the CPU can execute your algorithm?" (max # channels)
• Maximum I/O performance: "Can the I/O keep up with this maximum # channels?"
• Available high-speed memory: "Is there enough high-speed internal memory?"
The estimation yields: (1) the CPU load as a % of maximum, and (2) at this CPU load, what other functions can be performed.

  10. Example: Performance Calculation
Algorithm: 200-tap (nh) low-pass FIR filter. Frame size: 256 (nx) 16-bit elements. Sampling frequency: 48 KHz. How many channels can the core handle given this algorithm?
CPU:
• FIR benchmark: (nx/2) × (nh+7) = 128 × 207 = 26,496 cycles/frame
• Frames per second: sampling freq / frame size = 48,000 / 256 = 187.5 frames/s
• MIP calculation: (frames/s) × (cycles/frame) = 187.5 × 26,496 = 4.97M cycles/s
• Conclusion: the FIR takes ~5 MIPS on Embedded Core XYZ; max # channels = 60 @ 300 MHz (does not include overhead for interrupts, control code, RTOS, etc.)
I/O (are the I/O and memory capable of handling this many channels?):
• Required I/O rate: 48 Ksamp/s × 16 bits × 60 channels = 46.08 Mbps; DSP serial-port rate (full duplex): 50.00 Mbps ✓
• DMA rate: (2 × 16-bit transfers/cycle) × 300 MHz = 9,600 Mbps ✓
• Required data memory: (60 × 200) + (60 × 4 × 256) + (60 × 2 × 199) = 97K × 16-bit; available internal memory: 32K × 16-bit ✗
(Required memory assumes 60 different filters, a 199-element delay buffer, and double buffering of rcv/xmt.)
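The arithmetic above is mechanical enough to script. A minimal C sketch (mine, not from the slides) that reproduces the slide's numbers; the formulas and constants are taken directly from the slide:

    #include <stdio.h>

    int main(void)
    {
        const double nh = 200, nx = 256;    /* filter taps, frame size */
        const double fs = 48000;            /* sampling frequency (Hz) */
        const double cpu_hz = 300e6;        /* core clock              */

        double cyc_per_frame = (nx / 2) * (nh + 7);          /* 26,496 */
        double frames_per_s  = fs / nx;                      /* 187.5  */
        double cyc_per_s     = frames_per_s * cyc_per_frame; /* ~4.97M */
        int    max_channels  = (int)(cpu_hz / cyc_per_s);    /* 60     */

        double io_mbps   = fs * 16 * max_channels / 1e6;     /* 46.08  */
        double mem_words = max_channels
                         * (nh + 4 * nx + 2 * (nh - 1));     /* ~97K   */

        printf("cycles/frame : %.0f\n", cyc_per_frame);
        printf("MIPS         : %.2f\n", cyc_per_s / 1e6);
        printf("max channels : %d\n", max_channels);
        printf("I/O rate     : %.2f Mbps\n", io_mbps);
        printf("data memory  : %.0fK x 16-bit\n", mem_words / 1e3);
    }

Parameterizing the estimate this way makes it easy to re-run as the algorithm, frame size, or clock rate changes.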

  11. Estimation results drive options
[Figure: CPU load graphs, a simple application at ~20% load on a single GPP, and a complex application at 100%+ spread across core(s), DSP, and accelerators]
Application: simple, low-end (CPU load 5-20%). What do you do with the other 80-95%?
• Additional functions/tasks
• Increase sampling rate (increase accuracy)
• Add more channels
• Decrease voltage/clock speed (lower power)
Application: complex, high-end (CPU load 100%+). How do you split up the tasks wisely?
• GPP/uC (user interface), DSP (all signal processing)
• DSP (user i/f, most signal processing), FPGA (hi-speed tasks)
• GPP (user i/f), DSP (most signal processing), FPGA (hi-speed tasks)

  12. Help out the compiler
• A compiler maps high-level code to a target platform, preserving the defined behavior of the high-level language
• The target may provide functionality that is not directly mapped into the high-level language
• The application may use algorithmic concepts that are not handled by the high-level language
• Understanding how the compiler generates code is important to writing code that will achieve the desired results

  13. Big compiler impact 1: ILP. restrict enables SIMD optimizations.
Without restrict, stores may alias loads, so the compiler must perform the operations sequentially. When the loads and stores are known to be independent, the operations can be performed in parallel.
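A minimal illustration of the point (my example, not the slide's): in the first function the compiler must assume writes through dst may modify what src points to, forcing scalar code; adding restrict promises no aliasing and lets the vectorizer emit SIMD loads and stores.

    /* May alias: the compiler must assume dst[i] could overwrite src[i+1],
     * so each iteration is performed sequentially. */
    void scale_maybe_alias(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = 2.0f * src[i];
    }

    /* restrict promises the arrays do not overlap, so the loop can be
     * vectorized (e.g., several floats per SIMD operation). */
    void scale_restrict(float * restrict dst, const float * restrict src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = 2.0f * src[i];
    }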

  14. Big compiler impact 2: data locality
Original loop nest:

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            A[i][j] = B[j][i];

Unroll the outer loop and fuse the new copies of the inner loop:

    for (i = 0; i < N; i += 2)
        for (j = 0; j < N; j++) {
            A[i][j]   = B[j][i];
            A[i+1][j] = B[j][i+1];
        }

• Spatial locality of B enhanced (B[j][i] and B[j][i+1] are adjacent in memory)
• Increases the size of the loop body and hence the available ILP
• General guideline: align computation and locality

  15. Use cache efficiently
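The transcript carries only the slide title here, so the following is a generic sketch of one standard way to use cache efficiently, not necessarily the slide's own example: blocking (tiling) a matrix transpose so each tile fits in cache and every loaded line is fully reused before eviction. The 32-element tile is an assumption to be tuned per platform.

    #define N    1024
    #define TILE 32   /* assumed tile size; tune to the target's cache */

    /* Blocked transpose: work on TILE x TILE sub-blocks so that the
     * touched rows of src and dst stay resident in the data cache. */
    void transpose_tiled(float dst[N][N], const float src[N][N])
    {
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        dst[j][i] = src[i][j];
    }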

  16. Use the right algorithm: think Big O, e.g. O(n²) vs. O(n log n)
[Figure: cycle counts (40, 100, 200 cycles) comparing the two growth rates as n increases]
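As a generic illustration of the asymptotic difference (my example, not from the slide): detecting a duplicate in an array by comparing every pair is O(n²), while sorting a copy first and scanning adjacent elements is O(n log n).

    #include <stdlib.h>
    #include <string.h>

    /* O(n^2): compare every pair. */
    int has_duplicate_n2(const int *a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = i + 1; j < n; j++)
                if (a[i] == a[j]) return 1;
        return 0;
    }

    static int cmp_int(const void *x, const void *y)
    {
        int a = *(const int *)x, b = *(const int *)y;
        return (a > b) - (a < b);
    }

    /* O(n log n): sort a copy, then scan adjacent elements once.
     * (Sketch only: malloc error handling omitted.) */
    int has_duplicate_nlogn(const int *a, size_t n)
    {
        int *tmp = malloc(n * sizeof *tmp);
        memcpy(tmp, a, n * sizeof *tmp);
        qsort(tmp, n, sizeof *tmp, cmp_int);
        int dup = 0;
        for (size_t i = 1; i < n && !dup; i++)
            dup = (tmp[i] == tmp[i - 1]);
        free(tmp);
        return dup;
    }

For small n the constant factors can still favor the simpler algorithm, which is why measurement matters.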

  17. Understand performance patterns (and anti-patterns)
[Figure legend: green = data, red = control, blue = termination]

  18. Power Optimization: Active vs. Static Power
Power consumption in CMOS circuits: Ptotal = Pactive + Pstatic
Deriving the active term:
• Capacitance is charge per volt: C = q / V, so q = CV
• Work to switch the node once: W = V × q = CV²
• Power is work over time; here the circuit toggles F times per second, so with T = 1/F, P = W / T = WF
• Substituting: Pactive = CV²F
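Because P = CV²F, voltage scaling pays off quadratically while frequency scaling pays off linearly. A quick worked example (my numbers, chosen only for illustration): scaling a core from an assumed 1.2 V / 600 MHz down to 1.0 V / 400 MHz cuts active power to roughly 46% of the original.

    #include <stdio.h>

    /* Active power model from the slide: P = C * V^2 * F.
     * C is held constant, so only the ratio matters. */
    static double p_active(double c, double v, double f)
    {
        return c * v * v * f;
    }

    int main(void)
    {
        const double C = 1e-9;                    /* arbitrary capacitance  */
        double p_hi = p_active(C, 1.2, 600e6);    /* assumed 1.2 V, 600 MHz */
        double p_lo = p_active(C, 1.0, 400e6);    /* assumed 1.0 V, 400 MHz */

        /* (1.0/1.2)^2 * (400/600) ~= 0.46 */
        printf("active power ratio: %.2f\n", p_lo / p_hi);
    }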

  19. Top Ten Power Optimization Techniques
• Architect software to have natural "idle" points (including low-power boot)
• Use interrupt-driven programming: no polling, use the OS to block (see the sketch after this list)
• Place code and data close to the processor to minimize off-chip accesses (and overlay from non-volatile to fast memory)
• Smart placement so frequently accessed code/data is close to the CPU (use hierarchical memory models)
• Size optimizations to reduce footprint, memory, and the corresponding leakage
• Optimize for speed to gain more CPU idle time or allow reduced CPU frequency (benchmark and experiment!)
• Don't over-calculate: use minimum data widths, reduce bus activity, use smaller multipliers
• Use DMA for efficient transfers (not the CPU)
• Use co-processors to efficiently handle/accelerate frequent or specialized processing
• Use more buffering and batch processing to allow more computation at once and more time in low-power modes
• Use the OS to scale V/F, and analyze/benchmark (make it right first!)
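As a hedged illustration of the "no polling, use the OS to block" item above (my sketch, POSIX-flavored, not from the slides): a consumer that blocks on a semaphore lets the core enter a low-power idle state, whereas a polling loop keeps it fully awake doing nothing.

    #include <semaphore.h>

    sem_t data_ready;    /* posted by an ISR or producer thread */
    volatile int flag;

    /* Wasteful: spins the CPU, preventing low-power idle states. */
    void consumer_polling(void)
    {
        while (!flag)
            ;            /* burns power doing nothing */
        /* ... process data ... */
    }

    /* Better: the OS parks the thread; the core can idle
     * until the semaphore is posted. */
    void consumer_blocking(void)
    {
        sem_wait(&data_ready);   /* sleeps until work arrives */
        /* ... process data ... */
    }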

  20. When you have more than one core to optimize (multicore)
Goal: exploit multicore resources.
Step 1: optimize the serial implementation first:
• easier and less time consuming
• less likely to introduce bugs
• reduces the gap, so less parallelization is needed
• allows parallelization to focus on parallel behavior, rather than a mix of serial and parallel issues
Serial optimization is not the end goal:
• apply changes that will facilitate parallelization and the performance improvements that parallelization can bring
• avoid serial optimizations that interfere with, or limit, parallelization: avoid introducing unnecessary data dependencies, and avoid exploiting details of the single-core hardware architecture (such as cache capacity)

  21. There’s Amdahl and then there’s Gustafson (know the difference) Time/Problem Size • Conventional Wisdom • Speedup decreases with increasing portion of serial code (S)- diminishing returns • Imposes fundamental limit (1/S) on speedup • Assumes parallel vs. serial code ratio is fixed for any given application – unrealistic? • Theoretical Max? • Applies to applications without a fixed code ratio – e.g. networking/routing • Speedup becomes proportional to the number of cores in the system • Packet processing provides opportunity for parallelism

  22. Many types of Parallelism (more than one may apply)

  23. Multithreaded programming has some hazards:
• Deadlock
• Livelock
• False sharing (see the sketch after this list)
• Data hazards
• Lock contention
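Of these, false sharing is the least visible in source code, so here is a hedged sketch (my example, not from the deck): two threads increment logically independent counters that happen to share a cache line, so every increment forces the line to ping-pong between cores; padding each counter onto its own line removes the hazard. The 64-byte line size is an assumption.

    #include <pthread.h>

    #define CACHE_LINE 64   /* assumed cache-line size */

    /* BAD: both counters live in the same cache line, so two threads
     * updating them "independently" still invalidate each other's line. */
    struct counters_bad {
        long a;
        long b;
    };

    /* GOOD: pad so each counter occupies its own cache line. */
    struct counters_good {
        long a;
        char pad[CACHE_LINE - sizeof(long)];
        long b;
    };

    struct counters_good g;

    void *worker_a(void *arg) { for (long i = 0; i < 100000000; i++) g.a++; return NULL; }
    void *worker_b(void *arg) { for (long i = 0; i < 100000000; i++) g.b++; return NULL; }

    int main(void)
    {
        pthread_t ta, tb;
        pthread_create(&ta, NULL, worker_a, NULL);
        pthread_create(&tb, NULL, worker_b, NULL);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
    }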

  24. Optimize for the best-case scenario, not the worst case
No lock contention means no system call: design the lock so the uncontended path stays entirely in user space.

  25. Optimize for the best-case scenario, not the worst case (continued)
Since most operations will not require arbitration between processes, the expensive path is not taken in most cases. This is very useful when the number of threads is low.
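This is the idea behind futex-style locks (my characterization, not the slide's): take the uncontended path with one atomic operation in user space, and fall back to the OS only when contention is actually observed. A minimal sketch using C11 atomics, with a yield loop standing in for the kernel wait:

    #include <stdatomic.h>
    #include <sched.h>

    typedef struct { atomic_int locked; } fastlock_t;  /* 0 = free, 1 = held */

    void fastlock_acquire(fastlock_t *l)
    {
        int expected = 0;
        /* Best case: one atomic compare-and-swap, no system call. */
        if (atomic_compare_exchange_strong(&l->locked, &expected, 1))
            return;

        /* Contended (rare) case: back off until the lock frees up.
         * A real futex-based lock would sleep in the kernel here. */
        for (;;) {
            expected = 0;
            if (atomic_compare_exchange_strong(&l->locked, &expected, 1))
                return;
            sched_yield();   /* stand-in for the expensive slow path */
        }
    }

    void fastlock_release(fastlock_t *l)
    {
        atomic_store(&l->locked, 0);
    }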

  26. Top Ten Performance Optimization Techniques for Multicore
• Achieve proper load balancing
• Improve data locality and reduce false sharing
• Apply affinity scheduling if necessary
• Tune lock granularity
• Reduce lock frequency and enforce lock ordering
• Remove unnecessary synchronization barriers
• Weigh asynchronous vs. synchronous communication
• Scheduling: use a worker thread pool
• Manage the thread count
• Use parallel libraries (pthreads, OpenMP, etc.; see the sketch after this list)
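As a hedged illustration of the last item (my example, not from the deck): OpenMP handles the worker thread pool, load balancing, thread count, and the reduction for a data-parallel loop in a single pragma. Compile with an OpenMP-enabled compiler (e.g., -fopenmp).

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static float a[N], b[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

        /* OpenMP creates and pools the worker threads, splits the
         * iteration space, and combines the per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot product = %.0f (max threads: %d)\n",
               sum, omp_get_max_threads());
    }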

  27. Threads, cache, naïve and smart

  28. Threads, cache, naïve and smart

  29. Recommendation: start developing crawl charts
Example performance targets tracked on a crawl chart:
1. DL throughput: 60 Mbps (with MCS=27, DL MIMO)
2. UL throughput: 20 Mbps (with MCS=20 for UL)

  30. Recommendation: form a Performance Engineering Team
[Diagram: the performance engineering team sits between the upstream kernel and the SoC kernel, tracking feature content and configuration settings across the repository/branches/patches as SoC features and NPIs are merged, integrated, and their content is upstreamed]
