
Embedded OpenCV Acceleration



  1. Embedded OpenCV Acceleration Dario Pennisi

  2. Introduction
  • Open-Source Computer Vision Library
  • Over 2500 algorithms and functions
  • Cross-platform, portable API: Windows, Linux, OS X, Android, iOS
  • Real-time performance
  • BSD license
  • Professionally developed and maintained

  3. History
  • Launched in 1999 by Intel, showcasing the Intel Performance Library
  • First alpha released in 2000
  • Version 1.0 released in 2006
  • Corporate support from Willow Garage starting in 2008
  • Version 2.0 released in 2009, with improved C++ interfaces
  • Releases every 6 months
  • Development taken over by Itseez in 2014
  • Version 3.0 now in beta; drops C API support

  4. Application structure
  • Building blocks to ease vision applications
  • Typical pipeline: Image Retrieval → Pre-Processing → Feature Extraction → Object Detection → Recognition / Reconstruction / Analysis → Decision Making (a minimal sketch follows below)
  • OpenCV modules: imgproc, highgui, objdetect, features2d, ml, stitching, calib3d, video
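
  To make the pipeline concrete, here is a minimal C++ sketch using OpenCV 3-style headers and a Haar cascade for the detection stage; the camera index and cascade file path are assumptions for illustration:

    #include <opencv2/highgui.hpp>   // image retrieval and display
    #include <opencv2/videoio.hpp>   // VideoCapture
    #include <opencv2/imgproc.hpp>   // pre-processing
    #include <opencv2/objdetect.hpp> // object detection
    #include <vector>

    int main() {
        cv::VideoCapture cap(0);       // image retrieval: camera 0 (assumed)
        cv::CascadeClassifier face;
        face.load("haarcascade_frontalface_default.xml"); // path assumed

        cv::Mat frame, gray;
        while (cap.read(frame)) {
            cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY); // pre-processing
            cv::equalizeHist(gray, gray);
            std::vector<cv::Rect> faces;
            face.detectMultiScale(gray, faces);            // detection
            for (const auto& r : faces)                    // decision/annotation
                cv::rectangle(frame, r, cv::Scalar(0, 255, 0));
            cv::imshow("pipeline", frame);
            if (cv::waitKey(1) == 27) break;               // Esc quits
        }
        return 0;
    }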

  5. Environment (software stack, top to bottom; a cv::parallel_for_ usage sketch follows below)
  • Application: C++, Java, Python
  • OpenCV: cv::parallel_for_
  • Threading APIs: Concurrency, CStripes, GCD, OpenMP, TBB
  • OS
  • Acceleration: CUDA, SSE/AVX/NEON, OpenCL
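
  cv::parallel_for_ is OpenCV's portability layer over whichever threading back end the library was built with (TBB, OpenMP, GCD, Concurrency, ...). A minimal sketch using the lambda overload available in recent OpenCV versions; the function name is illustrative:

    #include <opencv2/core.hpp>

    // Invert an 8-bit single-channel image in parallel; parallel_for_
    // splits the row range across the configured threading back end.
    void invertParallel(cv::Mat& img) {
        cv::parallel_for_(cv::Range(0, img.rows), [&](const cv::Range& r) {
            for (int y = r.start; y < r.end; ++y) {
                uchar* p = img.ptr<uchar>(y);
                for (int x = 0; x < img.cols; ++x)
                    p[x] = 255 - p[x];
            }
        });
    }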

  6. Desktop vs Embedded

  7. System Engineering
  • Dimensioning the system is fundamental
  • Understand your algorithm
  • Carefully choose your toolbox
  • Embedded means no chance for a "one size fits all" approach

  8. Acceleration Strategies
  • Optimize algorithms
    • Profile (see the timing sketch after this list)
    • Optimize
    • Partition (CPU/GPU/DSP)
  • FPGA acceleration
    • High-level synthesis
    • Custom DSP
    • RTL coding
  • Brute force
    • Increase the number of CPUs
    • Increase CPU frequency
  • Accelerated libraries
    • NEON
    • OpenCL/CUDA
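
  Profiling comes first: you cannot partition what you have not measured. One way to get first-order numbers is OpenCV's portable tick counters (a sketch; the timed function and kernel size are placeholders):

    #include <opencv2/core.hpp>
    #include <opencv2/imgproc.hpp>

    // Time one candidate hot spot before deciding where to optimize.
    double timeGaussian(const cv::Mat& src, cv::Mat& dst) {
        int64 t0 = cv::getTickCount();
        cv::GaussianBlur(src, dst, cv::Size(5, 5), 1.5);
        return (cv::getTickCount() - t0) / cv::getTickFrequency(); // seconds
    }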

  9. Bottlenecks - Know your enemy

  10. Memory
  • Access to external memory is expensive
  • CPU load instructions are slow
  • Memory has latency
  • Memory bandwidth is shared among CPUs
  • Cache keeps the CPU from having to access external memory, for both data and instructions

  11. Disordered accesses
  • What happens when we have a cache miss? (see the traversal sketch below)
    • Fetch data from the same memory row → 13 clocks
    • Fetch data from a different row → 23 clocks
  • Cache line is usually 32 bytes → 8 clocks to fill a line (32-bit data bus)
  • Memory bandwidth efficiency
    • 38% on the same row (8 useful transfer clocks out of 8 + 13 = 21)
    • 26% on a different row (8 out of 8 + 23 = 31)
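
  A classic way this shows up in image code: traversing a cv::Mat column by column lands each access a full row stride away from the previous one, paying the miss penalty per pixel, while row-by-row traversal consumes each fetched cache line completely. A minimal sketch:

    #include <opencv2/core.hpp>

    // Cache-friendly: walks each image row sequentially, so every
    // fetched 32-byte line is fully used before moving on.
    void sumRowMajor(const cv::Mat& img, long& sum) {
        for (int y = 0; y < img.rows; ++y) {
            const uchar* p = img.ptr<uchar>(y);
            for (int x = 0; x < img.cols; ++x) sum += p[x];
        }
    }

    // Cache-hostile: consecutive accesses are img.step bytes apart,
    // causing a cache miss (and often a DRAM row change) per pixel.
    void sumColMajor(const cv::Mat& img, long& sum) {
        for (int x = 0; x < img.cols; ++x)
            for (int y = 0; y < img.rows; ++y)
                sum += img.at<uchar>(y, x);
    }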

  12. Bottlenecks - Cache
  • A YCbCr 4:2:2 frame takes width × height × 2 bytes
  • 1920x1080 YCbCr 4:2:2 (Full HD) → 4 MB: double the size of the biggest ARM L2 cache
  • 1280x720 YCbCr 4:2:2 (HD) → 1.8 MB: just fits the L2 cache… ok if reading and writing to the same frame
  • 720x576 YCbCr 4:2:2 (SD) → 800 KB: 2 images fit in the L2 cache…

  13. OpenCV Algorithms
  • Mostly designed for PCs
    • Well structured
    • General purpose
    • Optimized functions for SSE/AVX
  • Relatively optimized
  • Small number of accelerated functions (see the T-API sketch below)
    • NEON
    • CUDA (nVidia GPU/Tegra)
    • OpenCL (GPUs, multicore processors)
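
  In OpenCV 3, the OpenCL path is exposed through the transparent API: replace cv::Mat with cv::UMat and supported functions dispatch to an OpenCL device when one is available, silently falling back to the CPU otherwise. A minimal sketch (file names are assumptions):

    #include <opencv2/core.hpp>
    #include <opencv2/core/ocl.hpp>
    #include <opencv2/imgproc.hpp>
    #include <opencv2/imgcodecs.hpp>

    int main() {
        cv::ocl::setUseOpenCL(true);   // request the OpenCL path
        cv::UMat src, gray, edges;
        cv::imread("input.png", cv::IMREAD_COLOR).copyTo(src);
        cv::cvtColor(src, gray, cv::COLOR_BGR2GRAY); // runs on the GPU if an
        cv::Canny(gray, edges, 50, 150);             // OpenCL kernel exists
        cv::imwrite("edges.png", edges);
        return 0;
    }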

  14. Multicore ARM/NEON
  • NEON SIMD instructions work on vectors of registers
  • Load-process-store philosophy (an intrinsics sketch follows below)
    • Load/store costs 1 cycle only if the data is in the L1 cache
    • 4-12 cycles if in L2
    • 25 to 35 cycles on an L2 cache miss
  • SIMD instructions can take from 1 to 5 clocks
  • A fast clock is useless on big datasets with little computation
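
  The load-process-store pattern is visible directly in NEON intrinsics. A minimal sketch adding two 8-bit grayscale buffers with saturation, 16 pixels per iteration (n is assumed to be a multiple of 16 for brevity):

    #include <arm_neon.h>
    #include <stdint.h>

    void addSaturate(const uint8_t* a, const uint8_t* b,
                     uint8_t* dst, int n) {
        for (int i = 0; i < n; i += 16) {
            uint8x16_t va = vld1q_u8(a + i);   // load 16 bytes
            uint8x16_t vb = vld1q_u8(b + i);   // load 16 bytes
            uint8x16_t vr = vqaddq_u8(va, vb); // process: saturating add
            vst1q_u8(dst + i, vr);             // store 16 bytes
        }
    }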

  15. Generic DSP
  • Very similar to ARM/NEON
  • High-speed pipeline impaired by an inefficient memory access subsystem
  • When a smart DMA is available, it is very complex to program
  • When the DSP is integrated in a SoC, it shares the ARM's memory bandwidth

  16. OpenCL on GPU
  • OpenCL on Vivante GC2000
    • Claimed capability up to 16 GFLOPS
  • Real applications
    • Working only on internal registers: 13.8 GFLOPS
    • Computing a 1000x1000 matrix: 600 MFLOPS
  • Bandwidth and inefficiencies: only 1 KB of local memory and a 64-byte memory cache

  17. OpenCL on FPGA
  • The same code can run on FPGA and GPU
  • Transforms selected functions into hardware
  • Automated memory access coalescing
  • Each function requires dedicated logic
    • Large FPGAs required; partial reconfiguration may solve this
  • Significant compilation time

  18. HLS on FPGA
  • High-Level Synthesis: converts C to hardware
  • HLS requires the code to be heavily modified (see the pragma sketch below)
    • Pragmas to instruct the compiler
    • Code restructuring
    • Not portable anymore
  • Each function requires dedicated logic
    • Large FPGAs required; partial reconfiguration may solve this
  • Significant compilation time
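
  A sketch of the kind of rewriting HLS demands, using Xilinx Vivado HLS-style pragmas (the pragma dialect is an assumption; other vendors use different directives), which is exactly what ties the code to one toolchain:

    // Halve each pixel of a fixed-size buffer; the pragmas tell the
    // HLS compiler how to build the hardware, not what to compute.
    void scale(const unsigned char in[1024], unsigned char out[1024]) {
    #pragma HLS INTERFACE ap_memory port=in
    #pragma HLS INTERFACE ap_memory port=out
        for (int i = 0; i < 1024; ++i) {
    #pragma HLS PIPELINE II=1  // one result per clock after ramp-up
            out[i] = in[i] >> 1;
        }
    }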

  19. A different approach
  • Goal: demanding algorithms on low-cost/low-power HW
  • Start from algorithm analysis and split the work:
    • Memory access pattern → DMA
    • Data-intensive processing → DSP / NEON / custom instructions (RTL)
    • Decision making → ARM program

  20. External co-processing
  (Diagram: an ARM sharing its memory with an on-chip GPU/FPGA, versus an ARM connected over PCIe to an FPGA with its own dedicated memory)

  21. Co-processor details
  • FPGA co-processor with separate memory
    • Adds bandwidth
    • Reduces access conflicts
  • Algorithm-aware DMA
    • Accesses memory in an ordered way
    • Adds caching through embedded RAM
  • Algorithm-specific processors
    • HLS/OpenCL-synthesized IP blocks
    • DSPs with custom instructions
    • Hardcoded IP blocks
  (Diagram: ARM and its memory feeding DMA processors, block-capture units and dual-port RAMs in front of DSP cores/IP blocks)

  22. Co-processor details
  • Flex DMA: dedicated processor with a DMA custom instruction; software-defined memory access pattern
  • Block capture: extracts the data for each tile
  • DPRAM: local, high-speed cache
  • DSP core: dedicated processor with algorithm-specific custom instructions
  (Diagram: ARM memory feeding Flex DMA units, block-capture stages and DPRAMs in front of DSP cores/IP blocks)

  23. Environment (the stack from slide 5, with OpenVX added between the OS and the acceleration back ends)
  • Application: C++, Java, Python
  • OpenCV: cv::parallel_for_
  • Threading APIs: Concurrency, CStripes, GCD, OpenMP, TBB
  • OS
  • OpenVX
  • Acceleration: SSE/AVX/NEON, OpenCL, CUDA, FPGA

  24. OpenVX

  25. OpenVX Graph Manager
  • Graph construction (a sketch follows below)
    • Allocates resources
    • Logical representation of the algorithm
  • Graph execution
    • Concatenates nodes, avoiding intermediate memory storage
  • Tiling extensions
    • A single node's execution can be split into multiple tiles
    • Multiple accelerators executing a single task in parallel
  (Diagram: Memory → Node1 → Memory → Node2 → Memory collapses to Memory → Node1 → Node2 → Memory)
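
  A minimal sketch of graph construction and execution with the standard OpenVX 1.x C API; the virtual image is what lets the runtime chain Node1 into Node2 without a full intermediate frame. Image dimensions are assumptions, and release of the images is omitted for brevity:

    #include <VX/vx.h>

    int main() {
        vx_context ctx = vxCreateContext();
        vx_graph graph = vxCreateGraph(ctx);

        vx_image in  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
        vx_image out = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
        // Virtual image: no user-visible storage; the runtime may fuse
        // the two nodes and skip the intermediate buffer entirely.
        vx_image tmp = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_U8);

        vxGaussian3x3Node(graph, in, tmp);  // Node1
        vxBox3x3Node(graph, tmp, out);      // Node2

        if (vxVerifyGraph(graph) == VX_SUCCESS) // resources allocated here
            vxProcessGraph(graph);              // execute the whole graph

        vxReleaseGraph(&graph);
        vxReleaseContext(&ctx);
        return 0;
    }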

  26. Summary - What we learnt
  • OpenCV today is mainly PC-oriented
    • ARM, CUDA and OpenCL support is growing
    • Existing acceleration covers only selected functions
  • Embedded CV requires good partitioning among resources
  • When ASSPs are not enough, FPGAs are key
  • OpenVX provides a consistent HW acceleration platform, not only for OpenCV

  27. Questions

  28. Thank you
