pFPC : A Parallel Compressor for Floating-Point Data

  1. pFPC: A Parallel Compressor for Floating-Point Data Martin Burtscher (The University of Texas at Austin) and Paruj Ratanaworabhan (Cornell University)

  2. Introduction • Scientific programs • Often produce and transfer lots of floating-point data (e.g., program output, checkpoints, messages) • Large amounts of data • Are expensive and slow to transfer and store • FPC algorithm for IEEE 754 double-precision data • Compresses linear streams of FP values fast and well • Single-pass operation and lossless compression March 2009

  3. Introduction (cont.) • Large-scale high-performance computers • Consist of many networked compute nodes • Compute nodes have multiple CPUs but only one link • Want to speed up data transfer • Need real-time compression to match link throughput • pFPC: a parallel version of the FPC algorithm • Exceeds 10 Gb/s on four Xeon processors

  4. Sequential FPC Algorithm [DCC’07] • Make two predictions • Select closer value • XOR with true value • Count leading zero bytes • Encode value • Update predictors

  5. pFPC: Parallel FPC Algorithm • pFPC operation • Divide data stream into chunks • Logically assign chunks round-robin to threads • Each thread compresses its data with FPC • Key parameters • Chunk size & number of threads

  6. Evaluation Method • Systems • 3.0 GHz Xeon with 4 processors • Others in paper • Datasets • Linear streams of real-world data (18–277 MB) • 3 observations: error, info, spitzer • 3 simulations: brain, comet, plasma • 3 messages: bt, sp, sweep3d

  7. Compression Ratio vs. Thread Count • Configuration • Small predictor • Chunk size = 1 • Compression ratio • Low (FP data) • Other algorithms are worse • Fluctuations • Due to multi-dimensional data

  8. Compression Ratio vs. Chunk Size • Configuration • Small predictor • 1 to 4 threads • Compression ratio • Flat for 1 thread • Steep initial drop • Chunk size • Larger is better for history-based predictors

  9. Throughput on Xeon System (compression and decompression) • Throughput increases with chunk size • Loop overhead, false sharing, TLB performance • Throughput scales with thread count • Limited by load balance and memory bandwidth

  10. Summary • pFPC algorithm • Chunks up data and logically assigns chunks in round-robin fashion to threads • Reaches 10.9 and 13.6 Gb/s throughput with a compression ratio of 1.18 on a 4-core 3 GHz Xeon • Portable C source code is available on-line: http://users.ices.utexas.edu/~burtscher/research/pFPC/

  11. Conclusions • For the best compression ratio, the thread count should equal, or be a small multiple of, the data's dimension • Chunk size should be one • For the highest throughput, the chunk size should at least match the system's page size (and be page aligned) • Larger chunks also yield higher compression ratios with history-based predictors • Parallel scaling is limited by memory bandwidth • Future work should focus on improving the compression ratio without increasing the memory bandwidth
