Improving the Performance of Parallel Backprojection on a Reconfigurable Supercomputer

Ben Cordes, Miriam Leeser, Eric Miller
Northeastern University, Boston MA
{bcordes,mel,[email protected]

Richard Linderman
Air Force Research Laboratory, Rome NY
[email protected]

Abstract

Backprojection is an image reconstruction algorithm that is used in a number of applications, including synthetic aperture radar (SAR). Backprojection for SAR contains a high degree of parallelism, which makes it well-suited for implementation on reconfigurable devices. We have previously reported results of an implementation of backprojection for the AFRL Heterogeneous High Performance Cluster (HHPC), a supercomputer that combines traditional CPUs with reconfigurable computing resources. Using 32 hybrid (CPU+FPGA) nodes, we achieved a 26x performance gain over a single CPU-only node. In this work we achieve significant speedup by eliminating the performance bottlenecks that were previously experienced. A number of improvements to the system are projected to provide greater than 500x speedup over single-node software when included in a highly parallelized implementation.
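To make the parallelism mentioned above concrete, the sketch below shows a generic time-domain backprojection inner loop in C. The data layout, names, and linear interpolation are illustrative assumptions, not the HHPC implementation; real SAR backprojection also applies phase correction to complex samples.

```c
/*
 * Minimal time-domain backprojection sketch (illustrative only; names and
 * data layout are assumptions, not the HHPC implementation).  Each output
 * pixel accumulates one interpolated sample from every projection.
 */
#include <math.h>
#include <stddef.h>

typedef struct {
    const float *samples;    /* range-compressed samples for this pulse  */
    size_t       n_samples;
    float        plat_x, plat_y, plat_z;  /* platform position           */
    float        r0, dr;     /* range of first sample, range-bin spacing */
} projection_t;

void backproject(float *image, size_t n_pixels,
                 const float *px, const float *py,   /* pixel coordinates */
                 const projection_t *proj, size_t n_proj)
{
    for (size_t p = 0; p < n_proj; ++p) {            /* every projection  */
        const projection_t *pr = &proj[p];
        for (size_t i = 0; i < n_pixels; ++i) {      /* every pixel       */
            float dx = px[i] - pr->plat_x;
            float dy = py[i] - pr->plat_y;
            float dz = -pr->plat_z;                  /* ground-plane pixel */
            float range = sqrtf(dx * dx + dy * dy + dz * dz);
            float bin   = (range - pr->r0) / pr->dr; /* fractional bin     */
            if (bin < 0.0f)
                continue;
            size_t b = (size_t)bin;
            if (b + 1 >= pr->n_samples)
                continue;
            float frac = bin - (float)b;             /* linear interpolation */
            image[i] += (1.0f - frac) * pr->samples[b]
                      +          frac * pr->samples[b + 1];
        }
    }
}
```

Because every pixel accumulation is independent and projections can be applied in any order, this loop nest maps naturally onto a parallel FPGA datapath and onto distributing different image regions to different cluster nodes.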

[Figure: Old and New Block Diagrams of the FPGA design. Components shown: input BlockRAMs, address generators, swath LUTs, staging BlockRAMs (old design), target memories, target output FIFO, and the PCI bus.]

HHPC Cluster Architecture

  • 48-node Beowulf cluster
  • Dual 2.2GHz Xeon CPUs
  • Linux operating system
  • Annapolis Microsystems Wildstar II/PCI FPGA board
  • Gigabit Ethernet interface board
  • Myrinet (MPI) interface board


Key Improvements to Single-Node Performance

Old Design

  • PIO transfer mode
    • Requires two-step staged transfers because of limited address space
  • Processing controlled by the host PC
    • PIO data transfers require blocking API calls
    • Host issues “run” commands to the FPGA at each processing step
  • FPGA processes four projections per iteration
    • A complete image requires 1024 projections to be processed
    • At 4 projections per iteration, 1024/4 = 256 iterations are required

New Design

  • DMA transfer mode (see the host-side transfer sketch after these lists)
    • Allows direct memory loading
    • More efficient utilization of the PCI bus
    • Shifts the performance bottleneck onto data processing
    • Improves transfer speeds 15x host-to-FPGA and 45x FPGA-to-host
  • Processing controlled by the FPGA
    • DMA data transfers are mastered by the FPGA
    • Host PC is freed to perform other work during processing
    • 30% speedup in processing time
  • FPGA processes eight projections per iteration
    • Single-iteration processing time increases only slightly even though twice as much data is processed
    • Overall data transfer time is unaffected
    • 32% speedup in overall processing time
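The host-side sketch below contrasts the two transfer models described above. The board-API calls and constants (pio_write, fpga_copy, dma_map, fpga_start, and so on) are hypothetical placeholders, not the actual Annapolis Wildstar II API.

```c
#include <stddef.h>

/* --- hypothetical board API and sizes (placeholders, not the real API) --- */
enum { STAGE_RAM = 0, INPUT_RAM = 1 };
#define PROJ_LEN 4096                                 /* samples per projection */
void pio_write(int ram, const float *src, size_t n);  /* blocking PIO write     */
void fpga_copy(int src_ram, int dst_ram, size_t n);   /* on-board staged copy   */
void fpga_run(void);                                  /* run one iteration      */
void dma_map(const float *buf, size_t n);             /* expose buffer to DMA   */
void fpga_start(size_t iterations);                   /* FPGA masters transfers */
void fpga_wait_done(void);

/* Old design: the host stages each block through on-board staging RAM over
 * blocking PIO and issues a "run" command for every iteration. */
void run_old_design(const float *proj_data, size_t n_proj)
{
    for (size_t i = 0; i < n_proj / 4; ++i) {              /* 1024/4 = 256  */
        const float *block = proj_data + 4 * i * PROJ_LEN;
        pio_write(STAGE_RAM, block, 4 * PROJ_LEN);         /* host -> stage  */
        fpga_copy(STAGE_RAM, INPUT_RAM, 4 * PROJ_LEN);     /* stage -> input */
        fpga_run();                                        /* blocking       */
    }
}

/* New design: the FPGA masters DMA transfers itself; the host points it at
 * the data once and is free to do other work until processing finishes. */
void run_new_design(const float *proj_data, size_t n_proj)
{
    dma_map(proj_data, n_proj * PROJ_LEN);
    fpga_start(n_proj / 8);                                /* 1024/8 = 128  */
    /* ... host performs unrelated work here ... */
    fpga_wait_done();
}
```

Moving transfer mastery onto the FPGA removes the blocking API calls from the host's critical path, which is the change behind the reported 30% processing-time speedup.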

[Figure: Data transfer staging. Old design: each data block moves from host memory into a staging RAM (alternating between staging RAMs (a) and (b)) and then into the input RAM. New design: each block is transferred directly from host memory to the input RAM under FPGA control.]

Parallel Implementation Improvements

  • Removed dependency on clustered filesystem
    • Previous system read input data from disk
    • Significant source of non-deterministic runtime
    • Impact: as much as 10x overall performance improvement
  • Improved data distribution model: Swathbuckler project
    • Previous work distributed projection data from a single node
    • New model allows incoming data to be streamed to each node
    • Non-recurring setup time can be amortized across multiple runs
    • Output images are not collected; they remain on the individual nodes
    • Publish-subscribe system provides consumers with needed images (see the per-node sketch after this list)
    • Impact: approximately 2x improvement
  • Improved single-node implementation
    • See “Key Improvements to Single-Node Performance” above
    • Impact: approximately 2.5x improvement
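The per-node sketch below illustrates the streaming distribution model described above: each node receives its own projections, backprojects them locally, and publishes its output image rather than returning it to a central collector. The MPI usage is standard, but how the projections arrive (here, via MPI_Recv) and the helper functions (backproject_on_fpga, publish_image) are hypothetical stand-ins, not the Swathbuckler code.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define PROJ_LEN  4096           /* samples per projection (assumed size)    */
#define N_PROJ    1024           /* projections per output image             */
#define IMAGE_PIX (512 * 512)    /* pixels per node's output image (assumed) */

/* Hypothetical helpers, stubbed so the sketch is self-contained. */
static void backproject_on_fpga(float *image, const float *proj)
{
    image[0] += proj[0];         /* placeholder for the FPGA-accelerated kernel */
}

static void publish_image(int rank, const float *image, size_t n_pixels)
{
    /* placeholder for the publish-subscribe layer */
    printf("node %d: image of %zu pixels available to subscribers\n",
           rank, n_pixels);
    (void)image;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *image = calloc(IMAGE_PIX, sizeof *image);
    float *proj  = malloc(PROJ_LEN * sizeof *proj);

    /* Projections for this node's swath arrive as a stream rather than
     * being read from a shared cluster filesystem. */
    for (int p = 0; p < N_PROJ; ++p) {
        MPI_Recv(proj, PROJ_LEN, MPI_FLOAT, MPI_ANY_SOURCE, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        backproject_on_fpga(image, proj);
    }

    /* Output images stay on the node; consumers fetch them on demand. */
    publish_image(rank, image, IMAGE_PIX);

    free(proj);
    free(image);
    MPI_Finalize();
    return 0;
}
```

Keeping the images on the nodes avoids a gather step; the publish-subscribe layer delivers an image only when a consumer actually requests it.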

[Figure: Processing timelines. Old design: the host and FPGA alternate blocking transfers of 4 projections with run iterations, repeated 1024/4 = 256 times (single-node performance: 2.1x over software). New design: after the host prepares the data and issues a single “Go”, the FPGA masters its own transfers of 8 projections per iteration, repeated 1024/8 = 128 times, while the host is free for other work (single-node performance: 5.3x over software).]

Projected Parallel Performance

Combining the improved single-node design with the parallel implementation improvements above is projected to provide greater than 500x speedup over single-node software in a highly parallelized implementation.
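A quick consistency check on these figures: 5.3x (new design) divided by 2.1x (old design) gives 5.3 / 2.1 ≈ 2.5, matching the approximately 2.5x single-node improvement listed above. As a rough illustration only, assuming the reported factors combine multiplicatively, the previously reported 26x on 32 hybrid nodes together with the ~10x I/O and ~2x data-distribution improvements gives 26 × 10 × 2 = 520x, in line with the projected greater-than-500x speedup over single-node software.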

References

  • A. Conti, B. Cordes, M. Leeser, E. Miller, and R. Linderman, “Adapting Parallel Backprojection to an FPGA Enhanced Distributed Computing Environment”. Ninth Annual Workshop on High-Performance Embedded Computing (HPEC), September 2005.

  • S. Tucker, R. Vienneau, J. Corner, and R. Linderman, “Swathbuckler: HPC Processing and Information Exploitation”. Proceedings of the 2006 IEEE Radar Conference, April 2006.

