High Performance Linear Transform Program Generation for the Cell BE Vas Chellappa

Presentation Transcript


  1. High Performance Linear Transform Program Generation for the Cell BE. Vas Chellappa, Franz Franchetti, Markus Püschel. Electrical & Computer Engineering, Carnegie Mellon University. Sponsors: DARPA-DESA, NSF, ARO, and Mercury Inc.

  2. Cell Broadband Engine
     [Figure: Cell BE chip; labels: SPE, LS (local store), EIB, Main Mem]
     • Multicore CPU (8 SPEs + 1 PPE)
     • SPEs: SIMD cores designed for numerical computing
     • 256 KB “local store” per SPE (scratchpad-like)
     • Programmer-driven DMA
     • 204 Gflop/s peak
     How do we harness the Cell’s impressive peak performance?

  3. DFT on the Cell BE
     [Performance plot; series: Spiral generated (this paper), FFTC, FFTW, Numerical Recipes; annotation: 350x]
     • Platform-tuned code is 350x faster, but hard to write!

  4. Overview
     • Background, Spiral Overview
     • Generating DFTs for the Cell
     • Performance Results
     • Concluding Remarks
     Reference: Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005.

  5. “Fitting” Dataflow to Hardware
     [Figure: dataflow stages (Stage 1 to Stage 5) arranged in three ways: iterative algorithm (programming ease), recursive algorithm (memory hierarchy), and parallel execution (multicore) with stages assigned to Core 0 and Core 1]
     • How to map dataflow to architecture automatically?
     To “fit” a DFT to the architecture: various traversals, various factorizations.

  6. “Fitting” Dataflow to Platform (contd.)
     [Figure: numbered dataflow stages (1 to 5) and their assignment to Core 0 and Core 1]
     • Intuition: rewrite formulas to obtain suitable dataflow

  7. Program Generation in Spiral
     Transform (user specified) → fast algorithm in SPL (many choices) → ∑-SPL → C code
     • Optimization at all abstraction levels: parallelization, vectorization, loop optimizations, constant folding, scheduling, ...
     • Iteration of this process to search for the fastest
     But that’s not all ...

  8. Common Abstraction: SPL
     SPL: tensor-product representation
     E.g., the Cooley-Tukey fast Fourier transform (FFT): [formula image not preserved]
     • Tensor products in SPL represent loop structures
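
For reference, the Cooley-Tukey FFT that slide 8 points to is usually written in SPL as shown below. This is the standard form from the Spiral literature, not a copy of the slide; the symbols $T^{mn}_n$ (diagonal matrix of twiddle factors) and $L^{mn}_m$ (stride permutation) are the conventional names and are an assumption here.

```latex
\mathrm{DFT}_{mn} \;=\; (\mathrm{DFT}_m \otimes I_n)\; T^{mn}_n\; (I_m \otimes \mathrm{DFT}_n)\; L^{mn}_m
```

Read right to left: $L^{mn}_m$ reorders the input at stride $m$, $I_m \otimes \mathrm{DFT}_n$ is a loop of $m$ DFTs of size $n$ on contiguous blocks, $T^{mn}_n$ scales pointwise, and $\mathrm{DFT}_m \otimes I_n$ is a loop of $n$ DFTs of size $m$ on data at stride $n$. This is the sense in which tensor products represent loop structures.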

  9. Overview
     • Background, Spiral Overview
     • Generating DFTs for the Cell
     • Performance Results
     • Concluding Remarks

  10. Mapping DFTs to the Cell
     Objective: high-performance transform library for the Cell BE
     [Figure: Cell BE chip (SPEs with LS, EIB, Main Mem) annotated with a DFT]
     Cell’s architectural paradigms:
     • Vectorization: vectorize DFT for vector length ν
     • Parallelization: parallelize DFT across p SPEs, and use a DMA packet size of μ
     • Multibuffering: optimize DFT for throughput (s DFTs required)
     Tags guide formula rewriting.

  11. SPL to Parallel Code
     • Natural parallel construct in SPL: [formula image not preserved]
       [Figure: four copies of A applied to x to produce y, one each on Processor 0 to Processor 3]
       Independent, load-balanced, communication-free operation
     • Parallelizing other constructs in SPL:
       • Permutations require message exchange (on-chip DMA communication)
     Idea: rewrite all SPL constructs to parallel constructs + on-chip DMA
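
The "natural parallel construct" on slide 11 is presumably the tensor product I_p ⊗ A: p independent copies of a kernel A, each working on its own contiguous block. A minimal plain-C sketch of that idea follows. The kernel (a size-2 butterfly on real data), the function names, and the sequential loop standing in for p SPEs are all illustrative assumptions, not Spiral output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel A: a size-2 butterfly (DFT_2) on real inputs. */
static void kernel_dft2(const float *x, float *y) {
    y[0] = x[0] + x[1];
    y[1] = x[0] - x[1];
}

/* Work done by worker `id` for y = (I_p (x) A) x.
 * Each worker touches only its own contiguous block of 2 elements,
 * so the workers need no communication and carry equal load. */
static void tensor_I_A_worker(const float *x, float *y, size_t id) {
    kernel_dft2(x + 2 * id, y + 2 * id);
}

/* Sequential stand-in for launching p workers (one per SPE).
 * The loop iterations are independent and could run concurrently. */
static void tensor_I_A(const float *x, float *y, size_t p) {
    for (size_t id = 0; id < p; id++)
        tensor_I_A_worker(x, y, id);
}
```

Calling `tensor_I_A(x, y, 4)` on an 8-element input applies the kernel to four disjoint blocks. A permutation such as a stride reordering would instead need data owned by other workers, which is why the slide singles permutations out as requiring message exchange.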

  12. SPL to Streaming Code
     • Streaming: overlapping computation with communication
       • On-chip (SPE ↔ SPE) and off-chip (SPE ↔ main memory)
     • Idea: tensor loops become multi-buffered loops
       [Figure: y = (three copies of A) x; in the i-th iteration: write A_{i-1}, compute A_i, read A_{i+1}]
       (Trickier for other SPL constructs)
     • Idea: rewrite algorithm at SPL level to achieve largest DMA packets
     • Useful for:
       • Throughput-optimized code
       • Large, out-of-chip sizes
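
The multi-buffered loop of slide 12 (write A_{i-1}, compute A_i, read A_{i+1}) can be sketched in plain C as below. This is an illustration under stated assumptions: `memcpy` stands in for asynchronous DMA, so nothing actually overlaps here, and `CHUNK`, `compute`, and `stream_double_buffered` are hypothetical names. Only the buffer rotation is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* elements per transfer; hypothetical DMA packet size */

/* Stand-in for the kernel A: doubles each element of one chunk. */
static void compute(const float *in, float *out) {
    for (size_t k = 0; k < CHUNK; k++)
        out[k] = 2.0f * in[k];
}

/* Streams n_chunks chunks from src to dst through two input and two
 * output buffers. On real hardware the two memcpy calls inside the loop
 * would be asynchronous DMA transfers in flight during compute(). */
static void stream_double_buffered(const float *src, float *dst,
                                   size_t n_chunks) {
    float in[2][CHUNK], out[2][CHUNK];
    if (n_chunks == 0)
        return;
    memcpy(in[0], src, sizeof in[0]);              /* prologue: read chunk 0 */
    for (size_t i = 0; i < n_chunks; i++) {
        size_t cur = i & 1, other = cur ^ 1;
        if (i + 1 < n_chunks)                      /* read chunk i+1 */
            memcpy(in[other], src + (i + 1) * CHUNK, sizeof in[other]);
        if (i > 0)                                 /* write chunk i-1 */
            memcpy(dst + (i - 1) * CHUNK, out[other], sizeof out[other]);
        compute(in[cur], out[cur]);                /* compute chunk i */
    }
    /* epilogue: write the last chunk */
    memcpy(dst + (n_chunks - 1) * CHUNK, out[(n_chunks - 1) & 1],
           sizeof out[0]);
}
```

Two buffers per direction are the minimum needed for this overlap; a real implementation would also have to wait for each transfer to complete before reusing its buffer.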

  13. Generating Cell Code
     Transform (user specified) → rewriting → fast algorithm in SPL (tag guided) → loop operations in ∑-SPL → Cell-specific optimized C code (intrinsics, DMA, etc.)
     Annotations on the rewritten algorithm:
     • All-to-all communication (on-chip)
     • SIMD kernel optimized for memory hierarchy
     • Load balanced across p SPEs
     • Streamed from memory for throughput

  14. Generated Code Sample (slide annotations: vectorized, DMA, parallelized)
     • DFT 216: 4,000+ lines of code!

         /* Complex-to-complex DFT size 64 on 2 SPEs */
         dft_c2c_64(float *X, float *Y, int spuid)
         {
             // Block 1 (IxA)L
             for(i:=0; i<=7; i++)   // Right most gather
             { DMA_GATHER(gath_func(X,i), gath_func(T1,i), 4) }   // uses spu_mfcdma()
             spu_mfcstat(MFC_TAG_UPDATE_ALL);   // Wait on gather
             // compute vectorized DFT kernel of size m
             for(i:=0; i<=7; i++)   // Scatter at interface
             { DMA_SCATTER(scat_func(T1,i), scat_func(T2,i), 4) }
             all_to_all_synchronization_barrier();   // uses mailbox msgs

             // Block 2 (AxI)
             /* Gather is a no operation since the scatter above accounted for it */
             // compute vectorized DFT kernel of size n
             for(i:=0; i<=7; i++)   // Left most scatter
             { DMA_SCATTER(scat_func(T1,i), scat_func(Y,i), 4) }
             all_to_all_synchronization_barrier();
         }
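
The two-block shape of the sample on slide 14 (gather, first kernel, interface reordering, second kernel) mirrors the Cooley-Tukey factorization. As a minimal sketch of that structure, here is a size-4 DFT in plain C with m = n = 2, without DMA, SIMD, or multiple SPEs. The function name and the choice of size are illustrative assumptions; this is not Spiral-generated code.

```c
#include <assert.h>
#include <complex.h>

/* DFT_4 = (DFT_2 (x) I_2) T (I_2 (x) DFT_2) L, evaluated right to left.
 * Forward transform with w = exp(-2*pi*i/4) = -i. */
static void dft4_cooley_tukey(const float complex *x, float complex *y) {
    float complex t[4];
    const float complex w = -I;

    /* Block 1: stride-2 gather (L) fused into two DFT_2 kernels that
     * write contiguous blocks (I_2 (x) DFT_2). */
    for (int i = 0; i < 2; i++) {
        float complex a = x[i], b = x[i + 2];
        t[2 * i]     = a + b;
        t[2 * i + 1] = a - b;
    }

    /* Twiddle diagonal T = diag(1, 1, 1, w): only t[3] changes. */
    t[3] *= w;

    /* Block 2: two DFT_2 kernels on data at stride 2 (DFT_2 (x) I_2). */
    for (int j = 0; j < 2; j++) {
        y[j]     = t[j] + t[j + 2];
        y[j + 2] = t[j] - t[j + 2];
    }
}
```

For the input (1, 2, 3, 4) this yields (10, -2+2i, -2, -2-2i), matching the direct definition of the DFT. In the generated Cell code the gather and the stride-2 access would become DMA transfers between SPEs rather than array indexing.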

  15. Problem Space: Options
     [Figure: diagrams of DFTs mapped onto SPEs and main memory, one per option]
     • Base (vectorized): vectorization assumed
     • Parallelization: single DFT parallelized across multiple SPEs
     • Multiple independent DFTs on multiple SPEs
     • Multiple parallelized independent DFTs
     • Main memory operations: latency optimized (default); throughput, multibuffered
     (Only for small DFTs)

  16. Problem Space: Combinations
     [Figure: diagrams of DFTs mapped onto SPEs, grouped into latency-optimized and throughput-optimized usage scenarios; labels: single DFT from main memory; parallel, multibuffered DFT; independent DFTs multibuffered in parallel]
     • Devise rewrite rules for tags. Nestings describe all scenarios.

  17. Overview
     • Background, Spiral Overview
     • Generating DFTs for the Cell
     • Performance Results
     • Concluding Remarks

  18. [Performance plot; series: 8-SPEs, 4-SPEs, 2-SPEs, 1-SPE; inset: one DFT spanning four SPEs]

  19. [Performance plot; series: Spiral: 8-SPEs, FFTW, FFTC, Spiral: 1-SPE; inset: one DFT spanning four SPEs]
     • 4.5x faster than FFTW, 1.63x faster than FFTC

  20. More Performance Results
     • Single-SPE DFT code
     • Split/interleaved complex formats
     • Non-power-of-two sizes
     • Double precision (PowerXCell 8i)
     [Performance plots; series: Mercury, Spiral, Chow, IBM SDK]

  21. Other Linear Transforms
     • Discrete sine and cosine transforms, DFT with real inputs (single-SPE)
     • 2-D DFTs
     • Out-of-core sizes
       • Limited to 2-D DFTs on 1 SPE (for now)
     More performance results: Srinivas Chellappa, Franz Franchetti, and Markus Püschel: Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine. Proceedings of International Conference on Supercomputing (ICS), 2009.

  22. Overview
     • Background, Spiral Overview
     • Generating DFTs for the Cell
     • Performance Results
     • Concluding Remarks

  23. Conclusion
     [Figure labels: architecture space, algorithm space]
     • Automatic generation of transform libraries
       • High performance
       • Variety of scenarios, formats
     • High performance on Cell requires:
       • Vectorization, multi-core parallelization, streaming, DMA code
       • Future processors likely to have similar paradigms, tradeoffs
     • Spiral approach:
       • Common abstraction of transform, algorithm, architecture (SPL)
       • Rewrite rules to go from transform to architecture
