pFPC : A Parallel Compressor for Floating-Point Data

  1. pFPC: A Parallel Compressor for Floating-Point Data Martin Burtscher (The University of Texas at Austin) and Paruj Ratanaworabhan (Cornell University)

  2. Introduction • Scientific programs • Often produce and transfer lots of floating-point data (e.g., program output, checkpoints, messages) • Large amounts of data • Are expensive and slow to transfer and store • FPC algorithm for IEEE 754 double-precision data • Compresses linear streams of FP values fast and well • Single-pass operation and lossless compression March 2009

  3. Introduction (cont.) • Large-scale high-performance computers • Consist of many networked compute nodes • Compute nodes have multiple CPUs but only one link • Want to speed up data transfer • Need real-time compression to match link throughput • pFPC: a parallel version of the FPC algorithm • Exceeds 10 Gb/s on four Xeon processors

  4. Sequential FPC Algorithm [DCC’07] • Make two predictions • Select closer value • XOR with true value • Count leading zero bytes • Encode value • Update predictors

  5. pFPC: Parallel FPC Algorithm • pFPC operation • Divide data stream into chunks • Logically assign chunks round-robin to threads • Each thread compresses its data with FPC • Key parameters • Chunk size & number of threads

  6. Evaluation Method • Systems • 3.0 GHz Xeon with 4 processors • Others in paper • Datasets • Linear streams of real-world data (18–277 MB) • 3 observations: error, info, spitzer • 3 simulations: brain, comet, plasma • 3 messages: bt, sp, sweep3d

  7. Compression Ratio vs. Thread Count • Configuration • Small predictor • Chunk size = 1 • Compression ratio • Low (FP data) • Other algorithms are worse • Fluctuations • Due to multi-dimensional data

  8. Compression Ratio vs. Chunk Size • Configuration • Small predictor • 1 to 4 threads • Compression ratio • Flat for 1 thread • Steep initial drop • Chunk size • Larger is better for history-based predictors

  9. Throughput on Xeon System (compression and decompression) • Throughput increases with chunk size • Loop overhead, false sharing, TLB performance • Throughput scales with thread count • Limited by load balance and memory bandwidth

  10. Summary • pFPC algorithm • Chunks up data and logically assigns chunks in round-robin fashion to threads • Reaches 10.9 and 13.6 Gb/s throughput with a compression ratio of 1.18 on a 4-core 3 GHz Xeon • Portable C source code is available on-line: http://users.ices.utexas.edu/~burtscher/research/pFPC/

  11. Conclusions • For the best compression ratio, the thread count should equal, or be a small multiple of, the data's dimension • Chunk size should be one • For the highest throughput, the chunk size should at least match the system's page size (and be page aligned) • Larger chunks also yield higher compression ratios with history-based predictors • Parallel scaling is limited by memory bandwidth • Future work should focus on improving the compression ratio without increasing the memory bandwidth
