Towards a Heterogeneous Computer Architecture for CACTuS

Anthony Milton. Supervisors: Assoc. Prof. David Kearney (UniSA), Dr. Sebastien Wong (DSTO). Reconfigurable Computing Lab.

Towards a Heterogeneous Computer Architecture for CACTuS

Anthony Milton


Assoc. Prof. David Kearney (UniSA)

Dr. Sebastien Wong (DSTO)

Reconfigurable Computing Lab

Motivation for Heterogeneous CACTuS
  • CACTuS originally developed and prototyped in MATLAB:
    • Great testbed for algorithm development,
    • BUT poor computational performance
  • As CACTuS is a visual tracking algorithm, real-time operation is desired.
Motivation – Data Parallelism

[Figure: Input Frame, Posterior Position, Observed Image, Posterior Velocity]

Motivation – Task Parallelism

[Figure: SEF #1, SEF #2, ..., SEF #n-1, SEF #n]

Motivation – GPUs & FPGAs
  • It is well known that GPUs and FPGAs are well suited to data-parallel computation
  • GPUs originally used for computer graphics, now used in a huge number of application areas (GPGPU)
  • FPGAs used for specialized applications requiring high performance but low power (Radar processing, TCP/IP packet processing…)
Heterogeneous Computing
  • Each computational resource has strengths and weaknesses
  • Heterogeneous computing uses a mix of different computing resources for computation, drawing on the strengths of each resource.




Heterogeneous Computing Systems
  • Construction of a hardware prototype with disparate compute resources is easy
  • Application development for such a system is hard:
    • Algorithm translation
    • Design partitioning
    • Languages and development environments
    • Models of computation
    • Communication and data transfer
    • Etc.
  • How to create designs that are partitioned across the different computing resources?
Project Goals
  • Develop a heterogeneous computer architecture for CACTuS
  • Maintain tracking metrics comparable to the MATLAB “gold standard”
  • Improve execution performance of the algorithm
Our Research Platform
  • Xenon Systems workstation:
    • Intel X5677 Xeon quad-core CPU @ 3.46GHz
    • 6GB DDR3 DRAM
    • NVIDIA Quadro 4000 GPU (2GB GDDR5 DRAM, OpenCL 1.1 device, CUDA compute capability 2.0 device)
  • Alpha Data ADM-XRC-6T1 FPGA board
    • Xilinx Virtex-6 XC6VLX550T FPGA (549,888 logic cells, 864 DSP slices, 2.844MB BRAM)
    • 2GB off-chip DDR3 DRAM
    • Connects to host via PCIe 2.0 x4
Development Approach
  • Maintain similar high-level abstractions across all versions
  • Use 3rd party libraries and designs, open source where possible
  • Incremental approach to overall development
Design Decision – Common Infrastructure
  • It was necessary to develop the C++/CPU version, as much of its infrastructure code would need to be re-used for the GPU & FPGA versions.
  • This included video, MATLAB and text file I/O, visualisation, timing & unit testing.
  • Third party libraries used for this infrastructure included:
    • Qt – visualisation
    • Boost – non-MATLAB file I/O, timing, unit testing
    • MatIO – MATLAB file I/O
Design Decisions – C++/CPU
  • To reduce development time, and help ensure high-level similarity with MATLAB code, the open source C++ linear algebra library, Armadillo, was utilised.
  • At the start of development (late 2011), Armadillo did not feature any 2D FFT implementations [1], so the industry standard FFTW library was used [2].

[1] 2D FFT support has since been added to Armadillo, in September 2013.
[2] MATLAB itself uses FFTW for computing FFTs.

Design Decisions – OpenCL/GPU
  • Essentially 2 choices for GPU programming framework: CUDA and OpenCL
    • CUDA limited to NVIDIA hardware, mature, has good supporting libraries such as CUBLAS, CUFFT, good development tools
    • OpenCL vendor agnostic, less mature, not limited to just GPUs - multicore, GPU, DSP, FPGA (portable)
  • OpenCL selected to avoid vendor lock-in and with an eye to the future, as OpenCL was considered likely to become dominant.
Design Decisions – OpenCL/GPU
  • To reduce development time, and help ensure high-level similarity with MATLAB code, the open source OpenCL computing library, ViennaCL, was utilised.
  • Provided methods for most linear algebra operations required for CACTuS
  • Did not support complex numbers, but as complex numbers are only required for 2D frequency-domain convolution, workarounds were possible.
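
One possible workaround of this kind, sketched here as an assumption rather than as the approach the project actually took, is to store the real and imaginary parts in separate real-valued arrays and build the complex product from real operations only. The function name `complex_multiply_split` is hypothetical, and plain Python lists stand in for library vectors.

```python
def complex_multiply_split(a_re, a_im, b_re, b_im):
    """Element-wise complex product computed with real arithmetic only.

    (a_re + j*a_im) * (b_re + j*b_im)
        = (a_re*b_re - a_im*b_im) + j*(a_re*b_im + a_im*b_re)
    """
    prod_re = [ar * br - ai * bi
               for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    prod_im = [ar * bi + ai * br
               for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    return prod_re, prod_im


# Illustrative values only: two short complex vectors held as split planes.
a = [1 + 2j, 3 - 1j]
b = [2 - 1j, -1 + 4j]
out_re, out_im = complex_multiply_split(
    [z.real for z in a], [z.imag for z in a],
    [z.real for z in b], [z.imag for z in b],
)
```

Each of the four multiplications and two additions maps onto an ordinary real-valued element-wise operation, which is the kind of primitive a real-only linear algebra library provides.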
Design Decisions – FPGA/Bluespec
  • Traditional HDLs (Verilog & VHDL) are very low level and require the designer to design control logic, implement hardware flow control, etc.
    • Design flexibility but lower productivity
  • Bluespec (BSV) – modern, rule-based, high-level HDL:
    • Rule based approach naturally matches parallel nature of HW
    • Designer freed from (error prone) control logic design
  • Alpha Data hardware infrastructure & SDK
    • Get data to and from FPGA via 4 DMA channels.
    • Drivers & SDK on PC side, support hardware and reference designs on FPGA side.
Design Decisions – Heterogeneous
  • How to best map the algorithm to the heterogeneous platform?
    • Still a work-in-progress and currently being explored
Bottleneck – Observe Velocity
  • The Observe Velocity stage of CACTuS was the primary focus for the FPGA, as it generally accounts for between 40% and 90%+ of the algorithm's total FLOPs
  • To perform Observe Velocity in the frequency domain:
    • 2D-FFT on Xs to give Xs_freq
    • 2D-FFT on Xm to give Xm_freq
    • Per-element multiply between Xs_freq and Xm_freq to give Vm_freq
    • 2D-IFFT on Vm_freq to give Vm
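
The four steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: a naive O(N^4) 2D DFT stands in for the FFT library calls, the function names `dft2` and `observe_velocity` are hypothetical, and the result is a circular convolution because no zero-padding is applied.

```python
import cmath


def dft2(x, inverse=False):
    """Naive 2D DFT of a list-of-lists; placeholder for a real 2D FFT."""
    rows, cols = len(x), len(x[0])
    sign = 1 if inverse else -1
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = sign * 2 * cmath.pi * (u * r / rows + v * c / cols)
                    acc += x[r][c] * cmath.exp(1j * angle)
            # The inverse transform carries the 1/(rows*cols) scaling.
            out[u][v] = acc / (rows * cols) if inverse else acc
    return out


def observe_velocity(xs, xm):
    """Frequency-domain (circular) convolution of two equally sized images."""
    xs_freq = dft2(xs)                       # 2D-FFT on Xs
    xm_freq = dft2(xm)                       # 2D-FFT on Xm
    vm_freq = [[a * b for a, b in zip(row_s, row_m)]
               for row_s, row_m in zip(xs_freq, xm_freq)]  # per-element multiply
    vm = dft2(vm_freq, inverse=True)         # 2D-IFFT on Vm_freq
    return [[value.real for value in row] for row in vm]


# Convolving with a unit impulse at the origin returns the input unchanged.
demo_xs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
demo_xm = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
demo_vm = observe_velocity(demo_xs, demo_xm)
```

The per-element multiply is the only step that differs from plain transforms, which is why the transforms dominate the cost of this stage.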
Evaluation Methods – Performance and Accuracy
  • Need for evaluation of computational performance and tracking accuracy
    • Verification of development
    • Provide a basis for comparison
  • Functionality for evaluating computational performance (timing) and tracking accuracy (tracking metrics) integrated into common infrastructure
    • Allows evaluation for single executions
    • External scripts allow for evaluation of batch jobs
Results – Performance


1. Vm performed on CPU
2. Vm on GPU, padded to nearest power of 2
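
The padding mentioned in note 2 could look like the sketch below. It assumes "nearest power of 2" means rounding each dimension up to the next power of two and filling with zeros; the helper names `next_pow2` and `pad_to_pow2` are hypothetical.

```python
def next_pow2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def pad_to_pow2(image):
    """Zero-pad a list-of-lists image so both dimensions are powers of two."""
    rows, cols = len(image), len(image[0])
    padded_rows, padded_cols = next_pow2(rows), next_pow2(cols)
    padded = [[0.0] * padded_cols for _ in range(padded_rows)]
    for r in range(rows):
        # Copy the original row into the top-left corner of the padded image.
        padded[r][:cols] = image[r]
    return padded


# A 3 x 5 image becomes 4 x 8 (illustrative sizes only).
small = [[1.0] * 5 for _ in range(3)]
padded_small = pad_to_pow2(small)
```

Power-of-two sizes suit radix-2 FFT implementations, at the cost of transforming more samples than the original image contains.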

Limitations & Problems
  • Early phase of exploring algorithm mappings to heterogeneous platform
  • 3rd party libraries not efficient
  • Task parallelism not yet exploited
    • Would require refactoring algorithm flow control, losing the connection with the MATLAB “gold standard” version
  • Software aspect of project is complex
    • Multiple developers, multiple third party libraries
  • FPGA pipeline currently limited to 2D frequency-domain convolution, only relevant to predict and observe stages
    • Also limited in size due to resource utilisation constraints
  • Many issues encountered with FPGA development
Lessons Learnt
  • Developer (in)experience greatly impacts development time and achieved performance.
  • OpenCL difficult to develop with, becoming easier as it matures and associated libraries improve
    • CUDA might have been a better initial choice
  • Use of immature libraries not the best idea (unless frequent code changes are your idea of fun)
  • FPGA functionality takes a lot of time and effort to develop
    • Evaluate exactly what functionality is required to meet performance constraints.
Future Work
  • Continue to improve exploitation of data parallelism
    • Current implementation is likely inefficient due to its use of small kernels; consider combining them
  • Task parallelism not yet exploited
    • Incorporate multi-core threading to fully exploit
    • Investigate problem of scheduling computational resources in system
  • DRAM integration would benefit FPGA performance greatly (images are currently not large enough to amortise DMA overheads) and would open up further application mappings.
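
The amortisation point can be illustrated with a simple transfer-time model: a fixed per-transfer setup cost plus payload size divided by link bandwidth. The sketch below is not a measurement of the platform. The 2 GB/s figure is the theoretical per-direction peak of a PCIe 2.0 x4 link, while the 100 µs setup cost and the 8 bytes per pixel are hypothetical values chosen only to show the trend.

```python
def transfer_efficiency(payload_bytes, setup_s, bandwidth_bytes_per_s):
    """Fraction of total transfer time spent actually moving payload data."""
    payload_time = payload_bytes / bandwidth_bytes_per_s
    return payload_time / (setup_s + payload_time)


PCIE2_X4_PEAK = 2.0e9   # bytes/s, theoretical peak per direction
SETUP_S = 100e-6        # hypothetical fixed cost per DMA transfer
BYTES_PER_PIXEL = 8     # assumption: one single-precision complex value

# Efficiency for square images of increasing side length.
efficiency_by_size = {
    side: transfer_efficiency(side * side * BYTES_PER_PIXEL, SETUP_S, PCIE2_X4_PEAK)
    for side in (128, 256, 512, 1024)
}
```

Under this model the fixed cost dominates for small images and fades as the payload grows, which is the sense in which larger transfers amortise DMA overheads.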


  • CPUs:
    • Single Instruction Single Data (SISD)
    • Excel at sequential code, heavily branched code and task-parallel workloads
    • Easiest to develop for: software languages, environments, strong debugging
    • Easily understood memory architecture – generally transparent to developer
  • GPUs:
    • Single Instruction Multiple Data (SIMD)
    • Excel at data parallel workloads with deterministic memory accesses
    • Best architecture for floating point arithmetic
    • Moderate to develop for: few languages but rapidly maturing ecosystem
    • Moderately complex memory architecture – developer must be aware of structure
  • FPGAs:
    • No fixed model of computation; designer defined
    • Flexible enough to excel at a variety of tasks
    • Best architecture for fixed point arithmetic
    • Excel at bit, integer and logic operations
    • Difficult to develop for: hardware languages (HDLs), simulators
    • Memory architecture required to be defined by designer
Nature of FPGA Design
  • The reconfigurable nature of FPGAs is both the major strength & weakness of the platform:
    • Freedom to create custom hardware structures & circuits for specific purposes = specialised, efficient, high-performance HW
    • No existing microarchitecture, designer needs to create = long development time, hard to debug, huge number of options to consider & choices to be made.
  • Design Space Exploration (DSE) – DS encapsulates all possible variations & permutations of designs that implement a system.
Challenges of HW Design

Debugging FPGA designs is hard and time consuming – combination of simulation and run-time techniques

  • To simulate in software, need to develop testbenches and analyse waveforms.
  • To analyse behaviour at run-time in hardware, need to insert ChipScope cores (modifying the design) and analyse waveforms.
Limitations of Current FPGA Implementation
  • Using 128-point configuration:
    • BRAM utilisation is 24%
    • Timing constraints are just met
  • Using 256-point configuration:
    • BRAM utilisation is 85%
    • Timing constraints are not met: timing paths associated with the larger BRAMs are the main cause of problems.
  • Because of the failing timing constraints of the 256-point configuration, currently restricted to 128-point configuration (2D-FFTs on 128 x 128 images).
  • Moving away from exclusive use of BRAM by incorporating off-chip DRAM will likely allow much larger input images.
Future Work FPGA – DRAM
  • Complete integration of DRAM into infrastructure; unfortunately not a plug-and-play solution:
    • Have a reference design, but additional components to interface between existing Alpha Data infrastructure and system components developed with Bluespec have been required.
    • Also additional clock domains, and many additional constraints to be considered.
  • Close to finalising a design for testing initial integration of DRAM into system.
  • Modules to perform transpose operations in DRAM have already been developed, so once integration is verified, using DRAM with 2D frequency domain convolution design will be straightforward.
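
Transposes matter here because a 2D transform is commonly computed by the row-column method: 1D transforms along every row, a transpose, 1D transforms along every row again, and a final transpose. The sketch below shows that decomposition under the assumption that the FPGA design follows it; a naive 1D DFT stands in for the hardware FFT core, and the function names are hypothetical.

```python
import cmath


def dft1(x):
    """Naive 1D DFT; placeholder for a 1D FFT core."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * u * k / n) for k in range(n))
            for u in range(n)]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def dft2_row_column(image):
    """2D DFT computed as row transforms, transpose, row transforms, transpose."""
    after_rows = [dft1(row) for row in image]          # transform each row
    columns_as_rows = transpose(after_rows)            # columns become rows
    after_columns = [dft1(row) for row in columns_as_rows]
    return transpose(after_columns)                    # restore orientation


# Illustrative 2 x 3 input.
demo_image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
demo_spectrum = dft2_row_column(demo_image)
```

With this structure every 1D pass reads contiguous rows, and only the transpose step needs to reorder data held in memory.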
Future Work FPGA – Further Integration of HW Modules
  • Developed a functional spatial convolution array in VHDL:
  • Not yet used or integrated into system
  • Has transpose linear filtering architecture, essentially systolic array.
  • Highly parallel so exhibits high performance, but high DSP utilisation.
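
Assuming "transpose linear filtering architecture" refers to the transposed-form FIR structure, the idea can be sketched in one dimension: every input sample is broadcast to all tap multipliers at once, and partial sums move along a chain of registers. This is a software model for illustration only, not the VHDL design; `fir_transposed` is a hypothetical name.

```python
def fir_transposed(samples, taps):
    """Transposed-form FIR filter modelled one input sample per 'clock'."""
    n = len(taps)
    regs = [0.0] * (n - 1)      # partial-sum registers between the adders
    outputs = []
    for x in samples:
        # All multipliers see the same input in the same cycle.
        products = [tap * x for tap in taps]
        # Output is the first product plus the first register's old value.
        y = products[0] + (regs[0] if regs else 0.0)
        # Update registers in ascending order so each reads its neighbour's
        # value from the previous cycle.
        for i in range(n - 2):
            regs[i] = products[i + 1] + regs[i + 1]
        if n > 1:
            regs[n - 2] = products[n - 1]
        outputs.append(y)
    return outputs


# An impulse input reproduces the taps in order.
impulse_response = fir_transposed([1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
```

One multiplier per tap working every cycle is consistent with the high parallelism and high DSP usage noted above.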
Misc Information
  • R2013a version of MATLAB used, with Image Processing toolbox