
Accelerating and Benchmarking Radix-k Image Compositing at Large Scale



  1. Accelerating and Benchmarking Radix-k Image Compositing at Large Scale Wesley Kendall, Tom Peterka, Jian Huang, Han-Wei Shen, and Robert Ross

  2. Thanks to the Coauthors: Tom Peterka (Argonne National Laboratory), Jian Huang (The University of Tennessee, Knoxville), Han-Wei Shen (The Ohio State University), and Robert Ross (Argonne National Laboratory)

  3. Image Compositing: Direct Send • Every process sends portions of its local image to every other process in a single step (see the sketch below) • Scheduled communication (SLIC) can reduce network contention [Stompel et al. PGV 03] • Non-power-of-two process counts are supported
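As a concrete illustration of the single-step exchange, here is a minimal Python sketch (not the authors' code; the strip-per-process layout is an assumption for the example): process j owns strip j of the final image and receives that strip from everyone else.

```python
# Minimal direct-send sketch (illustration only, not the paper's code).
# Assumed layout: process j owns rows [start, stop) of the final image,
# and every process sends that row range of its local image to j.

def strip_bounds(height, nprocs, j):
    """Row range [start, stop) of the image strip owned by process j."""
    rows = height // nprocs
    start = j * rows
    stop = height if j == nprocs - 1 else start + rows
    return start, stop

def direct_send_schedule(rank, nprocs, height):
    """All messages process `rank` sends: (destination, row range)."""
    return [(j, strip_bounds(height, nprocs, j))
            for j in range(nprocs) if j != rank]

# e.g. process 2 of 4 on a 1080-row image sends three strips in one step
print(direct_send_schedule(rank=2, nprocs=4, height=1080))
```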

  4. Image Compositing: Binary Swap • All processes halve their local images in each step and communicate in pairs (a pairing sketch follows) • Requires a power-of-two process count; 2-3 swap compositing extends binary swap to non-power-of-two counts [Yu et al. SC 08] • Less network contention than direct send

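For reference, a minimal sketch of the binary-swap pairing (my own Python illustration, not the paper's code): in round r each process exchanges half of its current piece with the rank that differs in bit r.

```python
# Binary-swap schedule sketch (illustration only; assumes a
# power-of-two process count, as the algorithm requires).

def binary_swap_schedule(rank, nprocs):
    """Yield (round, partner, keep_front) for one process: in each
    round it swaps half of its piece with `partner` and keeps the
    front or back half for the next round."""
    rounds = nprocs.bit_length() - 1         # log2(nprocs)
    for r in range(rounds):
        partner = rank ^ (1 << r)            # flip bit r to find the partner
        keep_front = (rank & (1 << r)) == 0  # lower rank keeps the front half
        yield r, partner, keep_front

for r, partner, front in binary_swap_schedule(rank=5, nprocs=8):
    print(f"round {r}: swap with {partner}, keep {'front' if front else 'back'} half")
```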

  5. Image Compositing: Radix-k • Group sizes are configurable, so network contention can be managed (a round-structure sketch follows) • No penalty for non-power-of-two process counts • Communication and computation can also be overlapped

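The round structure can be sketched as follows (a Python illustration under the assumption that the chosen k-values exactly factor the process count; binary swap is the special case k = {2, 2, ...}).

```python
# Radix-k grouping sketch (illustration only, not the authors'
# implementation). In round i, processes form groups of size k[i];
# each member splits its current piece into k[i] parts and exchanges
# them within the group, so after all rounds each process owns
# 1/(k[0]*k[1]*...) of the image.

from math import prod

def radix_k_groups(rank, nprocs, ks):
    assert prod(ks) == nprocs, "k-values must factor the process count"
    stride = 1
    for rnd, k in enumerate(ks):
        # Ranks differing only in the digit of weight `stride` form a group.
        base = rank - ((rank // stride) % k) * stride
        group = [base + j * stride for j in range(k)]
        yield rnd, group
        stride *= k

for rnd, group in radix_k_groups(rank=5, nprocs=12, ks=[4, 3]):
    print(f"round {rnd}: group {group}")   # round 0: [4,5,6,7]; round 1: [1,5,9]
```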

  6. Image Compositing at Scale
  • Common algorithms may break at large scale. The real issues for scalability:
  • Network contention – the network must be saturated without being overloaded
  • Compression – local images contain more empty space at larger process counts
  • Load imbalance – compression leaves processes with unequal amounts of pixel data to composite
  • Our improvements to Radix-k:
  • Automatic k-value selection for maximizing computation overlap and minimizing network contention
  • An efficient run-length encoding implementation for compression
  • A new method to minimize the load imbalance introduced by compression
  [Figures: Radix-k Jumpshot visualization with k = {8, 8} [Peterka et al. SC 09]; parallel rendering with direct send [Peterka et al. ICPP 09]]

  7. Testing Environment
  • Target architectures:
  • Intrepid – 40,960 nodes of quad-core 850 MHz IBM PowerPCs
  • Eureka – 100 nodes of two quad-core 2 GHz Intel Xeons
  • Jaguar XT5 – 18,688 nodes of two hex-core 2.6 GHz AMD Opterons
  • Lens – 32 nodes of four quad-core 2.3 GHz AMD Opterons
  • We perform all k-value selection and load-balancing tests with a synthetic benchmark; scalability and the comparison against binary swap are tested with a parallel volume renderer.
  [Figures: synthetic benchmark, regular and zoomed out; parallel volume renderer, zoomed in, regular, and zoomed out; the Jaguar and Intrepid machines]

  8. Compression
  • Drawbacks of typical run-length encoding implementations [Ahrens et al. EGPGV 98]:
  • Byte-alignment issues
  • Taking a subset is more difficult and memory-consuming
  • Many varying pixels cause inflation
  • Our modified run-length encoding implementation (sketched below):
  • Two buffers – one for counts of empty / nonempty pixels and one for pixel data
  • Byte alignment is not an issue
  • Varying pixels are not inflated
  • Taking a subset involves only creating a new header and assigning a pointer into the pixel data
  [Figures: a typical run-length encoding implementation; our modified implementation that uses counts of empty and nonempty pixels]
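A minimal sketch of the two-buffer idea (written for this transcript as an illustration, not the paper's implementation): the counts buffer alternates runs of empty and nonempty pixels, while raw nonempty pixel values go into a separate data buffer, so runs of varying pixels are stored verbatim and never inflated.

```python
# Two-buffer run-length encoding sketch (illustration only).
# `counts` alternates empty/nonempty run lengths, starting with an
# empty run (possibly of length 0); nonempty pixels go raw into `data`.

def rle_encode(pixels, is_empty):
    counts, data = [], []
    expecting_empty, run = True, 0
    for px in pixels:
        if is_empty(px) != expecting_empty:   # run type changed
            counts.append(run)
            expecting_empty, run = not expecting_empty, 0
        run += 1
        if not is_empty(px):
            data.append(px)
    counts.append(run)
    return counts, data

def rle_decode(counts, data, empty_px):
    out, it = [], iter(data)
    for i, run in enumerate(counts):
        if i % 2 == 0:                        # even slots are empty runs
            out.extend([empty_px] * run)
        else:
            out.extend(next(it) for _ in range(run))
    return out

pixels = [0, 0, 7, 9, 0, 3]
counts, data = rle_encode(pixels, is_empty=lambda p: p == 0)
print(counts, data)                           # [2, 2, 1, 1] [7, 9, 3]
assert rle_decode(counts, data, empty_px=0) == pixels
```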

  9. K-value Selection
  • K-value selection depends on many factors:
  • Network topology / bandwidth
  • Process count
  • Image size
  • Additional hardware support – DMA and multiple network links
  • We encode the selected k-values into our Radix-k implementation and use them for the rest of the tests (a toy selection heuristic is sketched below).
  [Figures: k-value benchmarks on 3D torus networks (Intrepid, Jaguar) and switched networks (Eureka, Lens)]
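As a toy illustration of such a selection (the choose_ks helper and its per-round cap max_k are my assumptions, not the paper's actual policy), one could greedily factor the process count into k-values no larger than a cap derived from benchmarks like those above:

```python
# Hypothetical k-value selection sketch: greedily factor the process
# count into factors no larger than `max_k` (a cap that a benchmark
# like the ones above might suggest for a given machine). Not the
# paper's algorithm -- an assumption for illustration.

def choose_ks(nprocs, max_k):
    ks, n = [], nprocs
    for f in range(max_k, 1, -1):      # prefer the largest usable factor
        while n % f == 0:
            ks.append(f)
            n //= f
    assert n == 1, "process count has a prime factor larger than max_k"
    return ks

print(choose_ks(64, 8))    # [8, 8] -- the configuration shown earlier
print(choose_ks(96, 8))    # [8, 6, 2]
```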

  10. Load Balancing • Groups first compute a local division of their images by evenly dividing the prefix sum of their bounding boxes, so the processes in a group have equal amounts of data to operate on (see the sketch below). • The processes then swap their data and apply the compositing operator.
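A minimal sketch of the prefix-sum split (an illustration; here per-scanline nonempty-pixel counts stand in for the bounding-box prefix sum the slide describes): cut points are chosen so that each of the k group members composites roughly the same number of nonempty pixels.

```python
# Prefix-sum load-balancing sketch (illustration only). Given the
# nonempty-pixel count of each scanline, choose k-1 cut points so each
# group member gets roughly total/k nonempty pixels to composite.

from itertools import accumulate

def balanced_splits(nonempty_per_row, k):
    prefix = list(accumulate(nonempty_per_row))
    total = prefix[-1]
    cuts, row = [0], 0
    for i in range(1, k):
        target = total * i / k
        while row < len(prefix) and prefix[row] < target:
            row += 1
        cuts.append(min(row + 1, len(prefix)))  # cut after the row meeting the target
    cuts.append(len(nonempty_per_row))
    return cuts                                 # row boundaries of the k pieces

# Four members splitting an 8-row image whose work sits in the middle:
print(balanced_splits([0, 2, 10, 12, 11, 9, 3, 1], k=4))   # [0, 3, 4, 6, 8]
```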

  11. Load Balancing • New groups are created for the next round of compositing, and each group operates on its respective local part of the image. • Because of the local partition step, groups do not have equal portions of the image to operate on, so a redistribution step must be introduced for correctness.

  12. Load Balancing • A global division of the image is computed by dividing the prefix sum of all bounding boxes. • The processes then send parts of their composited image to the group members that will operate on them in the next round.

  13. Load Balancing • Once the groups have equal portions of the image to operate on, the first step is carried out again: groups compute a local division of their image, swap, and apply the compositing operator.

  14. Load Balancing • Compositing is now complete, and the image remains distributed across all the processes.

  15. Load Balancing
  [Figures: Jumpshot visualizations of Radix-k without and with load balancing; the improvement factor of load balancing compared to no load balancing, where points below the x-axis denote worse performance with load balancing]

  16. Volume Rendering Large Images
  [Figures: scalability tests from the parallel volume renderer on a 64-megapixel image; the improvement factor of optimized Radix-k compared to optimized binary swap]

  17. Conclusions / Future Work
  • The improvements to Radix-k have shown:
  • Scalability on most of the architectures at high process counts
  • Significant performance increases over optimized binary swap
  • Load-balancing improvements at modest process counts
  • Configurability that software packages can use to tune Radix-k to the underlying architecture
  • Currently we are implementing Radix-k in the IceT library that ParaView and VisIt use. We would also like to assess static load-balancing methods [Takeuchi et al. Journal of Parallel Computing 03] and implement accelerations for occlusion culling.
  • Acknowledgments: DOE SciDAC Institute for Ultrascale Visualization, John Blondin, Sean Ahern, ALCF, and NCCS.
  • Contact: Wes Kendall, The University of Tennessee, Knoxville – kendall@eecs.utk.edu
