Improving the Performance of Parallel Backprojection on a Reconfigurable Supercomputer

Ben Cordes, Miriam Leeser, Eric Miller
Northeastern University, Boston MA
{bcordes,mel,elmiller}

Richard Linderman
Air Force Research Laboratory, Rome NY
[email protected]

Old Block Diagram

New Block Diagram

Backprojection is an image reconstruction algorithm used in a number of applications, including synthetic aperture radar (SAR). Backprojection for SAR contains a high degree of parallelism, which makes it well-suited for implementation on reconfigurable devices. We have previously reported results of an implementation of backprojection for the AFRL Heterogeneous High Performance Cluster (HHPC), a supercomputer that combines traditional CPUs with reconfigurable computing resources. Using 32 hybrid (CPU+FPGA) nodes, we achieved a 26x performance gain over a single CPU-only node. In this work we achieve significant speedup by eliminating the performance bottlenecks that limited the previous implementation. A number of improvements to the system are projected to provide greater than 500x speedup over single-node software when included in a highly parallelized implementation.
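The core of backprojection is straightforward: for each radar pulse, compute each output pixel's range to the antenna, index the range-compressed echo at that range, and accumulate into the image. A minimal NumPy sketch for illustration only (nearest-neighbor range interpolation, illustrative units; this is not the authors' FPGA implementation):

```python
import numpy as np

def backproject(echoes, platform_pos, pixel_grid, dr, r0):
    """Accumulate range-compressed radar echoes onto an image grid.

    echoes       : (num_projections, num_range_bins) complex samples
    platform_pos : (num_projections, 3) antenna position per pulse
    pixel_grid   : (ny, nx, 3) world coordinates of each output pixel
    dr, r0       : range-bin spacing and range of bin 0
    """
    image = np.zeros(pixel_grid.shape[:2], dtype=complex)
    for echo, pos in zip(echoes, platform_pos):
        # Distance from the antenna to every pixel for this pulse
        dist = np.linalg.norm(pixel_grid - pos, axis=-1)
        # Nearest range bin holding the echo energy for that distance
        bins = np.clip(np.round((dist - r0) / dr).astype(int), 0, echo.size - 1)
        # Every pixel accumulates independently -- the parallelism
        # that makes the algorithm a good fit for FPGAs
        image += echo[bins]
    return image
```

Each pulse/pixel pair is independent, which is the high degree of parallelism the reconfigurable implementation exploits.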

Block diagram components: Input BlockRAMs, Target Memories.

HHPC Cluster Architecture

  • 48-node Beowulf cluster

  • Dual 2.2GHz Xeon CPUs

  • Linux operating system

  • Annapolis Microsystems Wildstar II/PCI FPGA board

  • Gigabit Ethernet interface board

  • Myrinet (MPI) interface board

Block diagram components: Target Output FIFO, Gigabit Ethernet.


Key Improvements to Single-Node Performance

Old Design

  • PIO transfer mode
    • Requires two-step staged transfers because of limited address space
  • Processing controlled by the host PC
    • PIO data transfers require blocking API calls
    • Host issues “run” commands to the FPGA at each processing step
  • FPGA processes four projections per iteration
    • A complete image requires 1024 projections to be processed
    • At 4 projections per iteration, 1024/4 = 256 iterations are required

New Design

  • DMA transfer mode
    • Allows direct memory loading
    • More efficient utilization of the PCI bus
    • Shifts the performance bottleneck onto data processing
    • Improves transfer speeds 15x host-to-FPGA and 45x FPGA-to-host
  • Processing controlled by the FPGA
    • DMA data transfers are mastered by the FPGA
    • Host PC is freed to perform other work during processing
    • 30% speedup in processing time
  • FPGA processes eight projections per iteration
    • Single-step processing time increases slightly even though twice as much data is processed
    • Overall data transfer time is unaffected
    • 32% speedup in overall processing time
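Doubling the projections per iteration helps because each pass pays a fixed host/FPGA control overhead, so halving the pass count saves overhead even when each pass does twice the work. A simple cost-model sketch; the per-pass compute and overhead values are hypothetical, chosen only to illustrate the shape of the trade-off, not to reproduce the measured 32%:

```python
def passes(total_projections, per_pass):
    # Number of FPGA processing passes needed to cover all projections
    return -(-total_projections // per_pass)  # ceiling division

def total_time(n_passes, compute_per_pass, overhead_per_pass):
    # Each pass pays a fixed control/synchronization overhead on top of compute
    return n_passes * (compute_per_pass + overhead_per_pass)

# Hypothetical units: the new design does a bit more than 2x compute per pass,
# but runs half as many passes, so the fixed overhead is paid half as often.
old = total_time(passes(1024, 4), compute_per_pass=1.0, overhead_per_pass=0.5)
new = total_time(passes(1024, 8), compute_per_pass=2.1, overhead_per_pass=0.5)
```

With these illustrative numbers the new design comes out ahead despite the slightly longer single-step processing time, matching the direction of the reported result.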

Transfer pipeline (old design): each data block is first copied from the host into one of two alternating staging RAMs, then moved from the staging RAM into the FPGA input RAM (Data Block 0 → Stage RAM (a) → Input RAM; Data Block 1 → Stage RAM (b) → Input RAM; Data Block 2 → Stage RAM (a) → ...).
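The two staging RAMs form a ping-pong buffer: staging the next block can overlap with loading the previously staged block into the FPGA input RAM. A minimal sketch of that overlap; `stage` and `load` are hypothetical placeholders for the board API's staged-copy and input-RAM-load calls:

```python
from concurrent.futures import ThreadPoolExecutor

def staged_transfer(blocks, stage, load):
    """Ping-pong staging: while block i is loaded into the FPGA input RAM,
    block i+1 is already being staged into the other host-side RAM.

    stage(block, ram) and load(block, ram) are hypothetical stand-ins
    for the actual board API.
    """
    rams = ("a", "b")
    pending = None  # (block, ram) staged but not yet loaded
    with ThreadPoolExecutor(max_workers=1) as pool:
        for i, block in enumerate(blocks):
            ram = rams[i % 2]                        # alternate a/b
            future = pool.submit(stage, block, ram)  # stage this block...
            if pending is not None:
                load(*pending)                       # ...while loading the previous one
            future.result()
            pending = (block, ram)
    if pending is not None:
        load(pending[0], pending[1])                 # drain the last staged block
```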

Transfer pipeline (new design): data blocks move directly from host memory to the FPGA input RAMs via DMA (Block 0, Block 1, Block 2, Block 3, ...).

Parallel Implementation Improvements


  • Removed dependency on clustered filesystem

    • Previous system read input data from disk

    • Significant source of non-deterministic runtime

    • Impact: as much as 10x overall performance improvement

  • Improved data distribution model: Swathbuckler project

    • Previous work distributed projection data from a single node

    • New model allows incoming data to be streamed to each node

    • Non-recurring setup time can be amortized across multiple runs

    • Output images are not collected; they remain on the individual nodes

    • Publish-subscribe system provides consumers with needed images

    • Impact: approximately 2x improvement

  • Improved single-node implementation

    • See detail at right

    • Impact: approximately 2.5x improvement
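Taken at face value, the three impact figures compose multiplicatively. A quick sanity check, under the assumption of full composition (in practice, overlapping bottlenecks would shrink the product) and the further assumption that the factors apply on top of the previously reported 26x parallel speedup:

```python
# Impact factors as reported in the list above
filesystem_fix = 10.0   # removed clustered-filesystem dependency (up to 10x)
streaming_dist = 2.0    # Swathbuckler streaming distribution (~2x)
single_node    = 2.5    # improved single-node implementation (~2.5x)

combined = filesystem_fix * streaming_dist * single_node   # 50.0

# Previously reported: 32 hybrid nodes achieved 26x over one CPU-only node
prior_parallel_speedup = 26.0
projected = prior_parallel_speedup * combined              # 1300.0
```

Even this naive product lands well above 500x, which suggests the greater-than-500x projection is conservative relative to fully multiplicative composition.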


Processing loop (old design): transfer data, then transfer 4 projections and process; repeat 1024/4 = 256 times. Single-node performance: 2.1x over software.

Processing loop (new design): transfer data, then transfer 8 projections and process; repeat 1024/8 = 128 times. Single-node performance: 5.3x over software.

Projected Parallel Performance


  • A. Conti, B. Cordes, M. Leeser, E. Miller, and R. Linderman, “Adapting Parallel Backprojection to an FPGA Enhanced Distributed Computing Environment”. Ninth Annual Workshop on High-Performance Embedded Computing (HPEC), September 2005.

  • S. Tucker, R. Vienneau, J. Corner, and R. Linderman, “Swathbuckler: HPC Processing and Information Exploitation”. Proceedings of the 2006 IEEE Radar Conference, April 2006.