Embedded OpenCV Acceleration
Dario Pennisi

Presentation Transcript
Introduction
  • Open-Source Computer Vision Library
  • Over 2500 algorithms and functions
  • Cross platform, portable API
    • Windows, Linux, OS X, Android, iOS
  • Real-time performance
  • BSD license
  • Professionally developed and maintained
History
  • Launched in 1999 by Intel
    • Showcasing Intel Performance Library
  • First Alpha released in 2000
  • 1.0 version released in 2006
  • Corporate support by Willow Garage in 2008
  • 2.0 version released in 2009
    • Improved C++ interfaces
    • Releases every 6 months
  • In 2014 taken over by Itseez
  • 3.0 in beta now
    • Drop C API support
Application structure
  • Building blocks to ease vision applications (mapped to OpenCV calls in the sketch below)

[Diagram: vision pipeline (Image Retrieval → Pre-Processing → Feature Extraction → Object Detection → Recognition / Reconstruction / Analysis / Decision Making) built on OpenCV modules: imgproc, highgui, objdetect, features2d, ml, stitching, calib3d, video]
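The pipeline above maps directly onto OpenCV calls. A minimal sketch, not taken from the slides; the input image, cascade file and parameters are placeholders:

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>      // image retrieval
#include <opencv2/imgproc/imgproc.hpp>      // pre-processing
#include <opencv2/objdetect/objdetect.hpp>  // object detection
#include <vector>

int main()
{
    cv::Mat frame = cv::imread("input.jpg");                 // image retrieval (highgui)

    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);           // pre-processing (imgproc)
    cv::equalizeHist(gray, gray);

    cv::CascadeClassifier detector("haarcascade_frontalface_default.xml");
    std::vector<cv::Rect> objects;
    detector.detectMultiScale(gray, objects);                // object detection (objdetect)

    for (size_t i = 0; i < objects.size(); ++i)              // decision making / output
        cv::rectangle(frame, objects[i], cv::Scalar(0, 255, 0), 2);
    cv::imwrite("result.jpg", frame);
    return 0;
}
```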

Environment

[Diagram: software stack: Application (C++, Java, Python) → OpenCV (cv::parallel_for_, example below) → Threading APIs (Concurrency, CStripes, GCD, OpenMP, TBB) → OS; acceleration back-ends: CUDA, SSE/AVX/NEON, OpenCL]
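A minimal sketch of the cv::parallel_for_ layer shown in the stack: OpenCV splits the range across whichever threading back-end it was built with (TBB, OpenMP, GCD, Concurrency, ...). The row-inversion body is only a placeholder workload:

```cpp
#include <opencv2/core/core.hpp>

// Body object executed in parallel over row ranges by cv::parallel_for_.
class InvertRows : public cv::ParallelLoopBody
{
public:
    explicit InvertRows(cv::Mat& img) : img_(img) {}

    void operator()(const cv::Range& range) const
    {
        for (int r = range.start; r < range.end; ++r) {
            uchar* p = img_.ptr<uchar>(r);
            for (int c = 0; c < img_.cols; ++c)
                p[c] = 255 - p[c];                 // toy per-pixel work
        }
    }

private:
    cv::Mat& img_;
};

// Usage: cv::parallel_for_(cv::Range(0, img.rows), InvertRows(img));
```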

System Engineering
  • Dimensioning the system is fundamental
    • Understand your algorithm
    • Carefully choose your toolbox
    • Embedded means no chance for “one size fits all”
Acceleration Strategies
  • Optimize Algorithms
    • Profile
    • Optimize
    • Partition (CPU/GPU/DSP); see the sketch after this list
  • FPGA acceleration
    • High level synthesis
    • Custom DSP
    • RTL coding
  • Brute Force
    • Increase number of CPUs
    • Increase CPU Frequency
  • Accelerated libraries
    • NEON
    • OpenCL/CUDA
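A hedged sketch of CPU/GPU partitioning using OpenCV's 2.4-era gpu module (renamed cv::cuda in 3.x): keep control flow on the CPU, offload one data-parallel stage when a CUDA device is present, and fall back to the CPU path otherwise. The kernel size and sigma are arbitrary placeholders:

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>   // OpenCV 2.4 CUDA module

// Blur one frame, on the GPU if available, otherwise on the CPU.
cv::Mat blurStage(const cv::Mat& frame)
{
    cv::Mat result;
    if (cv::gpu::getCudaEnabledDeviceCount() > 0) {
        cv::gpu::GpuMat d_src(frame), d_dst;                      // upload
        cv::gpu::GaussianBlur(d_src, d_dst, cv::Size(5, 5), 1.5);
        d_dst.download(result);                                   // download
    } else {
        cv::GaussianBlur(frame, result, cv::Size(5, 5), 1.5);     // CPU fallback
    }
    return result;
}
```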
Bottlenecks

Know your enemy

Memory
  • Access to external memory is expensive
    • CPU load instructions are slow
    • Memory has Latency
    • Memory bandwidth is shared among CPUs
  • Cache
    • Keeps the CPU from accessing external memory
    • Data and instruction
Disordered accesses
  • What happens when we have a cache miss?
    • Fetch data from the same memory row: ~13 clocks
    • Fetch data from a different row: ~23 clocks
  • Cache line is usually 32 bytes
    • 8 clocks to fill a line (32-bit data bus)
  • Memory bandwidth efficiency (see the access-pattern sketch below)
    • 38% on the same row (8 transfer clocks out of 13 + 8 = 21)
    • 26% on a different row (8 transfer clocks out of 23 + 8 = 31)
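A toy illustration (not from the slides) of ordered versus disordered accesses on one image: the row-wise loop walks memory sequentially and reuses every fetched cache line, while the column-wise loop lands on a different image row, and often a different DRAM row, at almost every access:

```cpp
#include <opencv2/core/core.hpp>

// Cache-friendly: consecutive bytes of each row; one cache line serves 32 pixels.
long long sumRowMajor(const cv::Mat& img)   // img assumed CV_8UC1
{
    long long sum = 0;
    for (int r = 0; r < img.rows; ++r) {
        const uchar* p = img.ptr<uchar>(r);
        for (int c = 0; c < img.cols; ++c)
            sum += p[c];
    }
    return sum;
}

// Cache-hostile: every access lands on a different image row (disordered).
long long sumColMajor(const cv::Mat& img)
{
    long long sum = 0;
    for (int c = 0; c < img.cols; ++c)
        for (int r = 0; r < img.rows; ++r)
            sum += img.at<uchar>(r, c);
    return sum;
}
```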
Bottlenecks - Cache
  • 1920×1080 YCbCr 4:2:2 (Full HD) → ~4 MB
    • Double the size of the biggest ARM L2 cache
  • 1280×720 YCbCr 4:2:2 (HD) → ~1.8 MB
    • Just fits in L2 cache… OK if reading and writing the same frame
  • 720×576 YCbCr 4:2:2 (SD) → ~800 KB
    • 2 images in L2 cache… (see the footprint arithmetic below)
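The frame sizes above follow from simple arithmetic, since YCbCr 4:2:2 averages 2 bytes per pixel:

```cpp
#include <cstddef>

// Bytes per YCbCr 4:2:2 frame (2 bytes/pixel):
//   1920 x 1080 x 2 = 4,147,200 bytes  (~4 MB)
//   1280 x  720 x 2 = 1,843,200 bytes  (~1.8 MB)
//    720 x  576 x 2 =   829,440 bytes  (~0.8 MB)
inline std::size_t frameBytes422(int width, int height)
{
    return static_cast<std::size_t>(width) * height * 2;
}
```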
OpenCV Algorithms
  • Mostly designed for PCs
    • Well structured
    • General purpose
    • Optimized functions for SSE/AVX
    • Relatively optimized
    • Small number of accelerated functions
      • NEON
      • CUDA (NVIDIA GPU/Tegra)
      • OpenCL (GPU, Multicore processors)
Multicore ARM/NEON
  • NEON SIMD instructions work on vectors of registers
    • Load-process-store philosophy
    • Load/store costs 1 cycle only if in L1 cache
      • 4-12 cycles if in L2
      • 25 to 35 cycles on L2 cache miss
    • SIMD instructions can take from 1 to 5 clocks
  • A fast clock is useless on big datasets with small per-element computation (see the NEON sketch below)
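A minimal NEON intrinsics sketch (assumes an ARM target compiled with NEON enabled) of the load-process-store pattern: one load, one SIMD instruction over 16 pixels, one store. Unless the data already sits in L1, the loads and stores dominate the loop, which is the point made above:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Invert 16 pixels per iteration; n is assumed to be a multiple of 16.
void invertPixels(const std::uint8_t* src, std::uint8_t* dst, int n)
{
    for (int i = 0; i < n; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);    // load 16 bytes (1 cycle only if in L1)
        v = vmvnq_u8(v);                     // bitwise NOT = 255 - x, one SIMD op
        vst1q_u8(dst + i, v);                // store 16 bytes
    }
}
```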
Generic DSP
  • Very similar to ARM/NEON
    • High speed pipeline impaired by inefficient memory access subsystem
    • When smart DMA is available it is very complex to program
  • When DSP is integrated in SoC it shares ARM’s bandwidth
OpenCL on GPU
  • OpenCL on Vivante GC2000
    • Claimed capability up to 16 GFLOPS
  • Real Applications
    • Only on internal registers: 13.8 GFLOPS
    • Computing a 1000×1000 matrix: 600 MFLOPS
  • Bandwidth and inefficiencies:
    • Only 1 KB of local memory and a 64-byte memory cache
OpenCL on FPGA
  • Same code can run on FPGA and GPU
  • Transforms selected functions into hardware
  • Automated memory access coalescing
  • Each function requires dedicated logic
    • Large FPGAs required
    • Partial reconfiguration may solve this
  • Significant compilation time
HLS on FPGA
  • High Level Synthesis
    • Converts C to hardware
  • HLS requires the code to be heavily modified (illustrative example after this list)
    • Pragmas to instruct compiler
    • Code restructuring
    • Not portable anymore
  • Each function requires dedicated logic
    • Large FPGAs required
    • Partial reconfiguration may solve this
  • Significant compilation time
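An illustrative example (not from the slides) of the kind of pragma-annotated restructuring an HLS flow expects; Vivado-HLS-style pragmas and a fixed 640×480 interface are assumed here, and they are exactly what makes the code non-portable:

```cpp
// Simple binary threshold rewritten for HLS.
void threshold_hls(const unsigned char in[640 * 480],
                   unsigned char out[640 * 480],
                   unsigned char level)
{
#pragma HLS INTERFACE m_axi port=in  offset=slave
#pragma HLS INTERFACE m_axi port=out offset=slave
    for (int i = 0; i < 640 * 480; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = (in[i] > level) ? 255 : 0;
    }
}
```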
A different approach

Demanding algorithms on low-cost / low-power HW

[Diagram: algorithm analysis partitions the work: memory access pattern → DMA, data-intensive processing → DSP / NEON / custom instruction (RTL), decision making → ARM program]

External co-processing

[Diagram: co-processing options: ARM sharing its memory with a GPU or FPGA on the same bus, vs. ARM connected over PCIe to an FPGA with its own dedicated memory]

Co-processor details
  • FPGA Co-Processor
    • Separate memory
      • Adds bandwidth
      • Reduces access conflict
    • Algorithm aware DMA
      • Accesses memory in an ordered way
      • Adds caching through embedded RAM
    • Algorithm specific processors
      • HLS/OpenCL synthesized IP blocks
      • DSP with custom instructions
      • Hardcoded IP blocks

[Diagram: ARM cores and memory feeding DMA processors, block-capture stages, DPRAM buffers and DSP cores / IP blocks inside the FPGA co-processor]

Co-processor details
  • Flex DMA
    • Dedicated processor with DMA custom instruction
    • Software defined memory access pattern
  • Block Capture
    • Extracts data for each tile
  • DPRAM
    • Local, high speed cache
  • DSP Core
    • Dedicated processor with algorithm-specific custom instructions

[Diagram: ARM cores and memory connected through Flex DMA engines, block-capture units and DPRAM buffers to DSP cores / IP blocks]

Environment

[Diagram: the software stack from before (Application: C++ / Java / Python → OpenCV with cv::parallel_for_ → Threading APIs: Concurrency, CStripes, GCD, OpenMP, TBB → OS), now with OpenVX added and FPGA joining the acceleration back-ends (SSE/AVX/NEON, OpenCL, CUDA, FPGA)]

OpenVX Graph Manager
  • Graph Construction
    • Allocates resources
    • Logical representation of algorithm
  • Graph Execution
    • Concatenates nodes, avoiding intermediate memory storage (see the sketch after the diagram)
  • Tiling extensions
    • A single node's execution can be split into multiple tiles
    • Multiple accelerators executing a single task in parallel

[Diagram: without graph execution: Memory → Node1 → Memory → Node2 → Memory; with concatenated nodes: Memory → Node1 → Node2 → Memory]
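A minimal OpenVX 1.x sketch (not from the slides) of the concatenated case in the diagram: two nodes joined by a virtual image, which has no application-visible storage, so the runtime is free to fuse them and skip the intermediate buffer:

```cpp
#include <VX/vx.h>

int main()
{
    vx_context ctx  = vxCreateContext();
    vx_graph  graph = vxCreateGraph(ctx);

    // Graph construction: a logical description of the algorithm.
    vx_image input  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
    vx_image output = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
    vx_image tmp    = vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT);

    vxGaussian3x3Node(graph, input, tmp);   // Node1
    vxMedian3x3Node(graph, tmp, output);    // Node2, fed through the virtual image

    // Verify (the implementation allocates resources), then execute the whole graph.
    if (vxVerifyGraph(graph) == VX_SUCCESS)
        vxProcessGraph(graph);

    vxReleaseImage(&input);
    vxReleaseImage(&output);
    vxReleaseImage(&tmp);
    vxReleaseGraph(&graph);
    vxReleaseContext(&ctx);
    return 0;
}
```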

Summary
  • OpenCV today is mainly PC-oriented
  • ARM, CUDA and OpenCL support is growing
  • Existing acceleration only on selected functions
  • Embedded CV requires good partitioning among resources
  • When ASSPs are not enough, FPGAs are key
  • OpenVX provides a consistent HW acceleration platform, not only for OpenCV

What we learnt