Embedded OpenCV Acceleration
Dario Pennisi

Presentation Transcript
Introduction
  • Open-Source Computer Vision Library
  • Over 2500 algorithms and functions
  • Cross platform, portable API
    • Windows, Linux, OS X, Android, iOS
  • Real-time performance
  • BSD license
  • Professionally developed and maintained
History
  • Launched in 1999 by Intel
    • Showcasing Intel Performance Library
  • First Alpha released in 2000
  • 1.0 version released in 2006
  • Corporate support by Willow Garage in 2008
  • 2.0 version released in 2009
    • Improved C++ interfaces
    • Releases every 6 months
  • In 2014 taken over by Itseez
  • 3.0 in beta now
    • Drop C API support
Application structure
  • Building blocks to ease vision applications (mapped to OpenCV calls in the sketch below)

[Diagram: vision pipeline (Image Retrieval → Pre-Processing → Feature Extraction → Object Detection → Recognition / Reconstruction / Analysis / Decision Making) built on OpenCV modules: imgproc, highgui, objdetect, features2d, ml, stitching, calib3d, video]
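The pipeline above maps directly onto OpenCV calls. A minimal sketch, not taken from the slides; the input image, cascade file and parameters are placeholders:

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>      // image retrieval
#include <opencv2/imgproc/imgproc.hpp>      // pre-processing
#include <opencv2/objdetect/objdetect.hpp>  // object detection
#include <vector>

int main()
{
    cv::Mat frame = cv::imread("input.jpg");                 // image retrieval (highgui)

    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);           // pre-processing (imgproc)
    cv::equalizeHist(gray, gray);

    cv::CascadeClassifier detector("haarcascade_frontalface_default.xml");
    std::vector<cv::Rect> objects;
    detector.detectMultiScale(gray, objects);                // object detection (objdetect)

    for (size_t i = 0; i < objects.size(); ++i)              // decision making / output
        cv::rectangle(frame, objects[i], cv::Scalar(0, 255, 0), 2);
    cv::imwrite("result.jpg", frame);
    return 0;
}
```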

Environment

[Diagram: software stack: Application (C++, Java, Python) → OpenCV (cv::parallel_for_, example below) → Threading APIs (Concurrency, CStripes, GCD, OpenMP, TBB) → OS; acceleration back-ends: CUDA, SSE/AVX/NEON, OpenCL]
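A minimal sketch of the cv::parallel_for_ layer shown in the stack: OpenCV splits the range across whichever threading back-end it was built with (TBB, OpenMP, GCD, Concurrency, ...). The row-inversion body is only a placeholder workload:

```cpp
#include <opencv2/core/core.hpp>

// Body object executed in parallel over row ranges by cv::parallel_for_.
class InvertRows : public cv::ParallelLoopBody
{
public:
    explicit InvertRows(cv::Mat& img) : img_(img) {}

    void operator()(const cv::Range& range) const
    {
        for (int r = range.start; r < range.end; ++r) {
            uchar* p = img_.ptr<uchar>(r);
            for (int c = 0; c < img_.cols; ++c)
                p[c] = 255 - p[c];                 // toy per-pixel work
        }
    }

private:
    cv::Mat& img_;
};

// Usage: cv::parallel_for_(cv::Range(0, img.rows), InvertRows(img));
```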

System Engineering
  • Dimensioning the system is fundamental
    • Understand your algorithm
    • Carefully choose your toolbox
    • Embedded means no chance for “one size fits all”
Acceleration Strategies
  • Optimize Algorithms
    • Profile
    • Optimize
    • Partition (CPU/GPU/DSP); see the sketch after this list
  • FPGA acceleration
    • High level synthesis
    • Custom DSP
    • RTL coding
  • Brute Force
    • Increase number of CPUs
    • Increase CPU Frequency
  • Accelerated libraries
    • NEON
    • OpenCL/CUDA
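A hedged sketch of CPU/GPU partitioning using OpenCV's 2.4-era gpu module (renamed cv::cuda in 3.x): keep control flow on the CPU, offload one data-parallel stage when a CUDA device is present, and fall back to the CPU path otherwise. The kernel size and sigma are arbitrary placeholders:

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>   // OpenCV 2.4 CUDA module

// Blur one frame, on the GPU if available, otherwise on the CPU.
cv::Mat blurStage(const cv::Mat& frame)
{
    cv::Mat result;
    if (cv::gpu::getCudaEnabledDeviceCount() > 0) {
        cv::gpu::GpuMat d_src(frame), d_dst;                      // upload
        cv::gpu::GaussianBlur(d_src, d_dst, cv::Size(5, 5), 1.5);
        d_dst.download(result);                                   // download
    } else {
        cv::GaussianBlur(frame, result, cv::Size(5, 5), 1.5);     // CPU fallback
    }
    return result;
}
```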
Bottlenecks

Know your enemy

Memory
  • Access to external memory is expensive
    • CPU load instructions are slow
    • Memory has Latency
    • Memory bandwidth is shared among CPUs
  • Cache
    • Keeps the CPU from accessing external memory
    • Data and instruction
Disordered accesses
  • What happens when we have a cache miss?
    • Fetch data from the same memory row: ~13 clocks
    • Fetch data from a different row: ~23 clocks
  • Cache line is usually 32 bytes
    • 8 clocks to fill a line (32-bit data bus)
  • Memory bandwidth efficiency (see the access-pattern sketch below)
    • 38% on the same row (8 transfer clocks out of 13 + 8 = 21)
    • 26% on a different row (8 transfer clocks out of 23 + 8 = 31)
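A toy illustration (not from the slides) of ordered versus disordered accesses on one image: the row-wise loop walks memory sequentially and reuses every fetched cache line, while the column-wise loop lands on a different image row, and often a different DRAM row, at almost every access:

```cpp
#include <opencv2/core/core.hpp>

// Cache-friendly: consecutive bytes of each row; one cache line serves 32 pixels.
long long sumRowMajor(const cv::Mat& img)   // img assumed CV_8UC1
{
    long long sum = 0;
    for (int r = 0; r < img.rows; ++r) {
        const uchar* p = img.ptr<uchar>(r);
        for (int c = 0; c < img.cols; ++c)
            sum += p[c];
    }
    return sum;
}

// Cache-hostile: every access lands on a different image row (disordered).
long long sumColMajor(const cv::Mat& img)
{
    long long sum = 0;
    for (int c = 0; c < img.cols; ++c)
        for (int r = 0; r < img.rows; ++r)
            sum += img.at<uchar>(r, c);
    return sum;
}
```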
Bottlenecks - Cache
  • 1920×1080 YCbCr 4:2:2 (Full HD) → ~4 MB
    • Double the size of the biggest ARM L2 cache
  • 1280×720 YCbCr 4:2:2 (HD) → ~1.8 MB
    • Just fits in L2 cache… OK if reading and writing the same frame
  • 720×576 YCbCr 4:2:2 (SD) → ~800 KB
    • 2 images in L2 cache… (see the footprint arithmetic below)
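The frame sizes above follow from simple arithmetic, since YCbCr 4:2:2 averages 2 bytes per pixel:

```cpp
#include <cstddef>

// Bytes per YCbCr 4:2:2 frame (2 bytes/pixel):
//   1920 x 1080 x 2 = 4,147,200 bytes  (~4 MB)
//   1280 x  720 x 2 = 1,843,200 bytes  (~1.8 MB)
//    720 x  576 x 2 =   829,440 bytes  (~0.8 MB)
inline std::size_t frameBytes422(int width, int height)
{
    return static_cast<std::size_t>(width) * height * 2;
}
```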
OpenCV Algorithms
  • Mostly designed for PCs
    • Well structured
    • General purpose
    • Optimized functions for SSE/AVX
    • Relatively optimized
    • Small number of accelerated functions
      • NEON
      • CUDA (NVIDIA GPU/Tegra)
      • OpenCL (GPU, Multicore processors)
Multicore ARM/NEON
  • NEON SIMD instructions work on vectors of registers
    • Load-process-store philosophy
    • Load/store costs 1 cycle only if in L1 cache
      • 4-12 cycles if in L2
      • 25 to 35 cycles on L2 cache miss
    • SIMD instructions can take from 1 to 5 clocks
  • A fast clock is useless on big datasets with small per-element computation (see the NEON sketch below)
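A minimal NEON intrinsics sketch (assumes an ARM target compiled with NEON enabled) of the load-process-store pattern: one load, one SIMD instruction over 16 pixels, one store. Unless the data already sits in L1, the loads and stores dominate the loop, which is the point made above:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Invert 16 pixels per iteration; n is assumed to be a multiple of 16.
void invertPixels(const std::uint8_t* src, std::uint8_t* dst, int n)
{
    for (int i = 0; i < n; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);    // load 16 bytes (1 cycle only if in L1)
        v = vmvnq_u8(v);                     // bitwise NOT = 255 - x, one SIMD op
        vst1q_u8(dst + i, v);                // store 16 bytes
    }
}
```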
Generic DSP
  • Very similar to ARM/NEON
    • High speed pipeline impaired by inefficient memory access subsystem
    • When smart DMA is available it is very complex to program
  • When DSP is integrated in SoC it shares ARM’s bandwidth
OpenCL on GPU
  • OpenCL on Vivante GC2000
    • Claimed capability up to 16 GFLOPS
  • Real Applications
    • Only on internal registers: 13.8 GFLOPS
    • Computing a 1000×1000 matrix: 600 MFLOPS
  • Bandwidth and inefficiencies:
    • Only 1 KB of local memory and a 64-byte memory cache
OpenCL on FPGA
  • Same code can run on FPGA and GPU
  • Transforms selected functions into hardware
  • Automated memory access coalescing
  • Each function requires dedicated logic
    • Large FPGAs required
    • Partial reconfiguration may solve this
  • Significant compilation time
HLS on FPGA
  • High Level Synthesis
    • Converts C to hardware
  • HLS requires the code to be heavily modified (illustrative example after this list)
    • Pragmas to instruct compiler
    • Code restructuring
    • Not portable anymore
  • Each function requires dedicated logic
    • Large FPGAs required
    • Partial reconfiguration may solve this
  • Significant compilation time
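An illustrative example (not from the slides) of the kind of pragma-annotated restructuring an HLS flow expects; Vivado-HLS-style pragmas and a fixed 640×480 interface are assumed here, and they are exactly what makes the code non-portable:

```cpp
// Simple binary threshold rewritten for HLS.
void threshold_hls(const unsigned char in[640 * 480],
                   unsigned char out[640 * 480],
                   unsigned char level)
{
#pragma HLS INTERFACE m_axi port=in  offset=slave
#pragma HLS INTERFACE m_axi port=out offset=slave
    for (int i = 0; i < 640 * 480; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = (in[i] > level) ? 255 : 0;
    }
}
```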
A different approach

Demanding algorithms on low-cost / low-power HW

[Diagram: algorithm analysis partitions the work: memory access pattern → DMA, data-intensive processing → DSP / NEON / custom instruction (RTL), decision making → ARM program]

External co-processing

[Diagram: co-processing options: ARM sharing its memory with a GPU or FPGA on the same bus, vs. ARM connected over PCIe to an FPGA with its own dedicated memory]

Co-processor details
  • FPGA Co-Processor
    • Separate memory
      • Adds bandwidth
      • Reduces access conflict
    • Algorithm aware DMA
      • Accesses memory in an ordered way
      • Adds caching through embedded RAM
    • Algorithm specific processors
      • HLS/OpenCL synthesized IP blocks
      • DSP with custom instructions
      • Hardcoded IP blocks

[Diagram: ARM cores and memory feeding DMA processors, block-capture stages, DPRAM buffers and DSP cores / IP blocks inside the FPGA co-processor]

Co-processor details
  • Flex DMA
    • Dedicated processor with DMA custom instruction
    • Software defined memory access pattern
  • Block Capture
    • Extracts data for each tile
  • DPRAM
    • Local, high speed cache
  • DSP Core
    • Dedicated processor with algorithm-specific custom instructions

[Diagram: ARM cores and memory connected through Flex DMA engines, block-capture units and DPRAM buffers to DSP cores / IP blocks]

Environment

[Diagram: the software stack from before (Application: C++ / Java / Python → OpenCV with cv::parallel_for_ → Threading APIs: Concurrency, CStripes, GCD, OpenMP, TBB → OS), now with OpenVX added and FPGA joining the acceleration back-ends (SSE/AVX/NEON, OpenCL, CUDA, FPGA)]

OpenVX Graph Manager
  • Graph Construction
    • Allocates resources
    • Logical representation of algorithm
  • Graph Execution
    • Concatenates nodes, avoiding intermediate memory storage (see the sketch after the diagram)
  • Tiling extensions
    • A single node's execution can be split into multiple tiles
    • Multiple accelerators executing a single task in parallel

[Diagram: without graph execution: Memory → Node1 → Memory → Node2 → Memory; with concatenated nodes: Memory → Node1 → Node2 → Memory]
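A minimal OpenVX 1.x sketch (not from the slides) of the concatenated case in the diagram: two nodes joined by a virtual image, which has no application-visible storage, so the runtime is free to fuse them and skip the intermediate buffer:

```cpp
#include <VX/vx.h>

int main()
{
    vx_context ctx  = vxCreateContext();
    vx_graph  graph = vxCreateGraph(ctx);

    // Graph construction: a logical description of the algorithm.
    vx_image input  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
    vx_image output = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
    vx_image tmp    = vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT);

    vxGaussian3x3Node(graph, input, tmp);   // Node1
    vxMedian3x3Node(graph, tmp, output);    // Node2, fed through the virtual image

    // Verify (the implementation allocates resources), then execute the whole graph.
    if (vxVerifyGraph(graph) == VX_SUCCESS)
        vxProcessGraph(graph);

    vxReleaseImage(&input);
    vxReleaseImage(&output);
    vxReleaseImage(&tmp);
    vxReleaseGraph(&graph);
    vxReleaseContext(&ctx);
    return 0;
}
```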

Summary
  • OpenCV today is mainly PC-oriented
  • ARM, CUDA and OpenCL support is growing
  • Existing acceleration only on selected functions
  • Embedded CV requires good partitioning among resources
  • When ASSPs are not enough, FPGAs are key
  • OpenVX provides a consistent HW acceleration platform, not only for OpenCV

What we learnt