
Orchestrating the Execution of Stream Programs on Multicore Platforms




  1. Orchestrating the Execution of Stream Programs on Multicore Platforms Manjunath Kudlur, Scott Mahlke Advanced Computer Architecture Lab. University of Michigan

  2. Cores are the New Gates (Shekhar Borkar, Intel) [Figure: number of cores per chip vs. year, 1975-2010, on a log scale from 1 to 512, spanning unicore, homogeneous multicore, and heterogeneous multicore designs. Examples range from the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2, PA8800, Power4, and Power6, through Opteron, CoreDuo, Core2Duo, Core2Quad, Xeon, BCM 1480, Xbox 360, Niagara, RAW, Cell, RAZA XLR, and Cavium, up to AMD Fusion, CISCO CSR1, NVIDIA G80, Larrabee, AMBRIC, and PicoChip at 64-512 cores. Courtesy: Gordon'06] Stream programming environments aimed at these platforms include CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, RStream, and RapidMind, alongside C/C++/Java.

  3. Stream Programming • Programming style • Embedded domain • Audio/video (H.264), wireless (WCDMA) • Mainstream • Continuous query processing (IBM SystemS), Search (Google Sawzall) • Stream • Collection of data records • Kernels/Filters • Functions applied to streams • Input/Output are streams • Coarse grain dataflow • Amenable to aggressive compiler optimizations [ASPLOS’02, ’06, PLDI ’03]

  4. Compiling Stream Programs [Figure: a stream program graph mapped onto a multicore system of four cores (Core 1-4), each with its own memory] • Heavy lifting • Equal work distribution • Communication • Synchronization

  5. Stream Graph Modulo Scheduling (SGMS) • Coarse grain software pipelining • Equal work distribution • Communication/computation overlap • Target: Cell processor [Figure: Cell block diagram with the PPE (Power PC), DRAM, the EIB, and SPE0-SPE7, each SPE containing an SPU, a 256 KB local store, and an MFC (DMA)] • Cores with disjoint address spaces • Explicit copy to access remote data • DMA engine independent of PEs • Filters = operations, cores = function units
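
To make the "explicit copy" point concrete, here is a minimal double-buffering sketch for one SPE using the Cell SDK MFC intrinsics (mfc_get plus tag-status polling); the buffer layout, CHUNK, NCHUNKS, and process_chunk are illustrative assumptions, not code from the paper:

/* Double-buffered input streaming on an SPE: fetch the next chunk with DMA
 * while computing on the current one. Assumes spu_mfcio.h from the IBM SDK;
 * ea_in is the effective address of the input stream in main memory. */
#include <spu_mfcio.h>

#define CHUNK   4096
#define NCHUNKS 64

static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(char *data, int n);   /* assumed compute kernel */

void stream_in(unsigned long long ea_in)
{
    int i, cur = 0, nxt;

    /* Prime the pipeline: fetch chunk 0 into buffer 0 with DMA tag 0. */
    mfc_get(buf[0], ea_in, CHUNK, 0, 0, 0);

    for (i = 0; i < NCHUNKS; i++) {
        nxt = cur ^ 1;
        if (i + 1 < NCHUNKS)            /* start fetching the next chunk */
            mfc_get(buf[nxt], ea_in + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);

        mfc_write_tag_mask(1 << cur);   /* wait only for the current buffer */
        mfc_read_tag_status_all();

        process_chunk(buf[cur], CHUNK); /* compute overlaps the in-flight DMA */
        cur = nxt;
    }
}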

  6. Streamroller Software Overview [Figure: tool flow. Front ends (SPEX, a C++ with stream extensions; StreamIt; stylized C) feed a common stream IR. Streamroller performs profiling, analysis, scheduling, and code generation (the focus of this talk). Targets: multicore (Cell, Core2Quad, Niagara), SODA (a low-power multicore processor for SDR), and a custom application engine.]

  7. Preliminaries • Synchronous Data Flow (SDF) [Lee '87] • StreamIt [Thies '02] • Push and pop items from input/output FIFOs

Stateless filter:

int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    push(sum);
    pop();
  }
}

Stateful variant (the weights become filter state, updated across firings):

int wgts[N];
wgts = adapt(wgts);
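
For readers unfamiliar with StreamIt, here is a rough C model of what the FIR work function above does in terms of explicit FIFO buffers; fifo_t and its helpers are invented for illustration and are not part of StreamIt or Streamroller:

/* Illustrative FIFO semantics of peek/pop/push for a filter with
 * pop rate 1 and push rate 1. */
typedef struct { int *data; int head, tail, cap; } fifo_t;

static int  fifo_peek(fifo_t *f, int i) { return f->data[(f->head + i) % f->cap]; }
static int  fifo_pop (fifo_t *f)        { int v = f->data[f->head]; f->head = (f->head + 1) % f->cap; return v; }
static void fifo_push(fifo_t *f, int v) { f->data[f->tail] = v; f->tail = (f->tail + 1) % f->cap; }

/* One firing of FIR: peek N items, push one result, pop one input. */
void fir_work(fifo_t *in, fifo_t *out, int N, const int wgts[])
{
    int i, sum = 0;
    for (i = 0; i < N; i++)
        sum += fifo_peek(in, i) * wgts[i];
    fifo_push(out, sum);
    (void)fifo_pop(in);
}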

  8. SGMS Overview [Figure: a single-PE schedule with period T1 is coarse-grain software pipelined across PE0-PE3 with DMA operations interleaved between the compute stages; after a prologue, the steady state repeats with period T4, where T1 ≈ 4 T4, followed by an epilogue.]

  9. SGMS Phases • Fission + processor assignment (load balance) • Stage assignment (causality, DMA overlap) • Code generation

  10. Processor Assignment • Assign filters to processors • Goal: equal work distribution • Graph partitioning? • Bin packing? (see the sketch below) [Figure: original stream program A(5) → B(40) → C(10) → D(5) and the modified program in which B is fissed into B1 and B2 with a splitter S and joiner J, partitioned across PE0 and PE1] • Original stream program: speedup = 60/40 = 1.5 • Modified stream program: speedup = 60/32 ≈ 2
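
For the "bin packing?" bullet, here is a minimal sketch of a greedy longest-processing-time heuristic that places each filter (heaviest first) on the currently least-loaded PE; this illustrates the alternative the slide alludes to, not the ILP-based method the paper uses, and the filter_t fields are assumed inputs:

/* Greedy LPT assignment sketch: returns the resulting maximal PE load. */
#include <stdlib.h>

typedef struct { int id; int work; } filter_t;

static int by_work_desc(const void *a, const void *b)
{
    return ((const filter_t *)b)->work - ((const filter_t *)a)->work;
}

int greedy_assign(filter_t *filters, int nfilters, int npes, int *assign)
{
    int load[64] = {0};                 /* assumes npes <= 64 */
    int f, p, best, max = 0;

    qsort(filters, nfilters, sizeof *filters, by_work_desc);
    for (f = 0; f < nfilters; f++) {
        best = 0;
        for (p = 1; p < npes; p++)
            if (load[p] < load[best]) best = p;
        assign[filters[f].id] = best;   /* map this filter to the lightest PE */
        load[best] += filters[f].work;
    }
    for (p = 0; p < npes; p++)
        if (load[p] > max) max = load[p];
    return max;
}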

  11. Filter Fission Choices [Figure: alternative ways of fissing a filter and spreading the resulting copies across PE0-PE3] • Speedup ≈ 4?

  12. Integrated Fission + PE Assignment • Exact solution based on Integer Linear Programming (ILP) • Objective function: minimize the maximal load on any PE • Split/join overhead factored in • Result: the number of times to "split" each filter and the filter → processor mapping
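
The paper gives the exact formulation; as a rough sketch of the min-max load objective only (the notation $x_{f,p}$, $w_f$, $L$ is mine, and the real Streamroller ILP additionally encodes the fission-factor selection shown on backup slide 37 and the split/join overhead terms):

$\text{minimize } L$
$\text{subject to } \sum_{p=1}^{P} x_{f,p} = 1$ for every filter copy $f$ (each copy runs on exactly one PE)
$\sum_{f} w_f\, x_{f,p} \le L$ for every PE $p$ (no PE is loaded beyond $L$)
$x_{f,p} \in \{0,1\}$

Here $w_f$ is the estimated work of filter copy $f$, and the solver drives $L$, the maximal PE load, as low as possible.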

  13. SGMS Phases • Fission + processor assignment (load balance) • Stage assignment (causality, DMA overlap) • Code generation

  14. Forming the Software Pipeline • To achieve speedup, all chunks should execute concurrently and communication should be overlapped • Processor assignment alone is insufficient information [Figure: with A on PE0 and B on PE1, simply running A then B each iteration leaves the A→B transfer exposed; the software pipeline instead overlaps A of iteration i+2 with the A→B DMA of iteration i+1 and B of iteration i]

  15. Stage Assignment • Data-flow traversal of the stream graph • Preserve causality (producer-consumer dependence): for a producer i feeding a consumer j, S_j ≥ S_i • Communication-computation overlap: when i on PE 1 feeds j on PE 2, the DMA gets its own stage with S_DMA > S_i, and the consumer is placed at S_j = S_DMA + 1 • Assign stages using the above two rules (a traversal sketch follows below)
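
A minimal sketch of a traversal that applies these two rules; the graph representation, the pe[] mapping, and the topological order are assumptions for illustration rather than Streamroller data structures:

/* Stage assignment sketch: visit filters in topological (data-flow) order.
 * stage[v] is the pipeline stage of filter v, pe[v] its processor.
 * Edges are (src, dst) pairs; dma_stage[e] is set for cross-PE edges. */
typedef struct { int src, dst; } edge_t;

void assign_stages(int nnodes, const edge_t *edges, int nedges,
                   const int *pe, const int *topo_order,
                   int *stage, int *dma_stage)
{
    int k, v, e, u;
    for (v = 0; v < nnodes; v++) stage[v] = 0;

    for (k = 0; k < nnodes; k++) {
        v = topo_order[k];
        for (e = 0; e < nedges; e++) {
            if (edges[e].src != v) continue;
            u = edges[e].dst;
            if (pe[u] == pe[v]) {
                /* Causality: the consumer is no earlier than the producer. */
                if (stage[u] < stage[v]) stage[u] = stage[v];
            } else {
                /* Cross-PE edge: the DMA gets its own stage after the
                 * producer, and the consumer starts one stage later. */
                dma_stage[e] = stage[v] + 1;
                if (stage[u] < dma_stage[e] + 1) stage[u] = dma_stage[e] + 1;
            }
        }
    }
}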

  16. Stage Assignment Example [Figure: the fissed example graph (A, S, B1, B2, C, J, D) partitioned across PE 0 and PE 1 and assigned to stages: A, S, and B1 in stage 0; DMAs in stage 1; C and B2 in stage 2; J and further DMAs in stage 3; D in stage 4]

  17. SGMS Phases • Fission + processor assignment (load balance) • Stage assignment (causality, DMA overlap) • Code generation

  18. Code Generation for Cell • Target the Synergistic Processing Elements (SPEs) • PS3 – up to 6 SPEs • QS20 – up to 16 SPEs • One thread / SPE • Challenge • Making a collection of independent threads implement a software pipeline • Adapt kernel-only code schema of a modulo schedule
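
To make "one thread per SPE" concrete, here is a hypothetical sketch of the PPE-side launch using libspe2 and pthreads; spe_slave_prog and NUM_SPES are invented names, and the generated Streamroller runtime may be organized differently:

/* Hypothetical PPE-side launcher: one pthread per SPE context (libspe2). */
#include <libspe2.h>
#include <pthread.h>

#define NUM_SPES 6                            /* assumed: PS3-class configuration */

extern spe_program_handle_t spe_slave_prog;   /* embedded SPE program (assumed name) */

static void *run_spe(void *arg)
{
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);   /* blocks until the SPE exits */
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_SPES];
    spe_context_ptr_t ctxs[NUM_SPES];
    int i;

    for (i = 0; i < NUM_SPES; i++) {
        ctxs[i] = spe_context_create(0, NULL);
        spe_program_load(ctxs[i], &spe_slave_prog);
        pthread_create(&threads[i], NULL, run_spe, ctxs[i]);
    }
    for (i = 0; i < NUM_SPES; i++) {
        pthread_join(threads[i], NULL);
        spe_context_destroy(ctxs[i]);
    }
    return 0;
}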

  19. Complete Example [Figure: steady-state timeline with columns SPE1, DMA1, SPE2, DMA2. Each iteration, SPE1 runs A, S, B1 (and eventually D); DMA1 carries AtoC, StoB2, B1toJ; SPE2 runs C, B2, J; DMA2 carries JtoD and CtoD] Work function for SPE1 (the stage-mask update and DMA waits follow the SPE code template on slide 38):

void spe1_work() {
  char stage[5] = {0};
  int i;
  stage[0] = 1;
  for (i = 0; i < MAX; i++) {
    if (stage[0]) { A(); S(); B1(); }
    if (stage[1]) { /* nothing scheduled on SPE1 in this stage */ }
    if (stage[2]) { JtoD(); CtoD(); }
    if (stage[3]) { /* nothing scheduled on SPE1 in this stage */ }
    if (stage[4]) { D(); }
    barrier();
  }
}

  20. Experiments • StreamIt benchmarks • Signal processing – DCT, FFT, FMRadio, radar • MPEG2 decoder subset • Encryption – DES • Parallel sort – Bitonic • Range of 30 to 90 filters/benchmark • Platform • QS20 blade server – 2 Cell chips, 16 SPEs • Software • StreamIt to C – gcc 4.1 • IBM Cell SDK 2.1

  21. Evaluation on QS20 [Figure: speedup results on the StreamIt benchmarks] • Split/join overhead reduces the benefit from fission • Barrier synchronization (one per iteration of the stream graph)

  22. SGMS(ILP) vs. Greedy (MIT method, ASPLOS’06) • Solver time < 30 seconds for 16 processors

  23. Conclusions • Streamroller • Efficient mapping of stream programs to multicore • Coarse grain software pipelining • Performance summary • 14.7x speedup on 16 cores • Up to 35% better than greedy solution (11% on average) • Scheduling framework • Tradeoff memory space vs. load balance • Memory constrained (embedded) systems • Cache based system

  24. Questions? http://cccp.eecs.umich.edu http://www.eecs.umich.edu/~sdrg

  25. Backup Slides

  26. Naïve Unfolding [Figure: four iterations of a six-filter stream graph (A-F) unfolded across PE1-PE4. For a completely stateless stream program, each PE runs a full iteration independently, with DMAs only at the boundaries. For a stream program with stateful nodes, state data dependences (in addition to the stream data dependences) chain the same filter across consecutive iterations on different PEs, forcing cross-PE DMAs.]

  27. SGMS (ILP) vs. Greedy (MIT method, ASPLOS’06) • Highly dependent on graph structure • DMA overlapped in both ILP and Greedy • Speedup drops with exposed DMA • Solver time < 30 seconds for 16 processors

  28. Unfolding vs. SGMS

  29. Pros and Cons of Unfolding • Pros • Simple • Good speedups for mostly stateless programs • Cons • Speedup affected by filters with large state • All input data need to be available for an iteration to begin (Sync + DMA before each iteration) • Long latency for one iteration • Not suitable for real-time scheduling [Figure: unfolded schedule across PE 1-PE 3 in which every iteration is preceded by a Sync + DMA step]

  30. SGMS Assumptions • Data independent filter work functions • Static schedule • High bandwidth interconnect • EIB 96 Bytes/cycle • Low overhead DMA, low observed latency • Cores can run MIMD efficiently • Not (directly) suitable for GPUs

  31. Speedup Upper Bounds • 15% of the work in one stateful filter • Max speedup = 100/15 ≈ 6.7
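
Stated generally (my phrasing, not the slide's): if a fraction $f$ of the total work sits in a single filter that cannot be fissed, the steady-state speedup is bounded by

$\text{speedup} \le \dfrac{1}{f}$, e.g. $f = 0.15 \Rightarrow \text{speedup} \le 1/0.15 \approx 6.7$.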

  32. Summary • 14.7x speedup on 16 cores • Preprocessing steps (fission) necessary • Hiding communication important • Extensions • Scaling to more cores • Constrained memory system • Other targets • SODA, FPGA, GPU

  33. Stream Programming • Algorithmic style suitable for • Audio/video processing • Signal processing • Networking (packet processing, encryption/decryption) • Wireless domain, software defined radio • Characteristics • Independent filters • Segregated address space • Independent threads of control • Explicit communication • Coarse grain dataflow • Amenable to aggressive compiler optimizations [ASPLOS'02, '06, PLDI '03, '08] [Figure: MPEG decoder subset as a stream graph with filters: bitstream parser, zigzag unorder, motion decode, inverse quant AC, inverse quant DC, 1D-DCT (row), saturate, 1D-DCT (column), bounded saturate]

  34. Effect of Exposed DMA

  35. Scheduling via Unfolding Evaluation • IBM SDK 2.1, pthreads, gcc 4.1.1, QS20 blade server • StreamIt benchmark characteristics

  36. Absolute Performance

  37. [Figure: an actor and its fission variants: the original actor (node 0_0), the actor fissed 2x (nodes 1_0-1_3), and the actor fissed 3x (nodes 2_0-2_4), annotated with the ILP constraints that select exactly one variant; M is a big-M constant]
Original actor: $\sum_{i=1}^{P} a_{1,0,0,i} - b_{1,0} = 0$
Fissed 2x: $\sum_{i=1}^{P}\sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} \le M\,b_{1,1}$ and $\sum_{i=1}^{P}\sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} - 2 \ge -M + M\,b_{1,1}$
Fissed 3x: $\sum_{i=1}^{P}\sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} \le M\,b_{1,2}$ and $\sum_{i=1}^{P}\sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} - 3 \ge -M + M\,b_{1,2}$
Variant selection: $b_{1,0} + b_{1,1} + b_{1,2} = 1$

  38. SPE Code Template

void spe_work() {
  char stage[N] = {0, 0, ..., 0};          /* bit mask to control active stages */
  stage[0] = 1;                            /* activate stage 0 */
  for (i = 0; i < max_iter + N - 1; i++) { /* go through all input items */
    if (stage[N-1]) {                      /* bit mask controls what gets executed */
    }
    if (stage[N-2]) {
    }
    ...
    if (stage[0]) {
      Start_DMA();                         /* start DMA operation */
      FilterA_work();                      /* call filter work function */
    }
    if (i == max_iter-1)
      stage[0] = 0;
    for (j = N-1; j >= 1; j--)             /* left shift bit mask (activate more stages) */
      stage[j] = stage[j-1];
    wait_for_dma();                        /* poll for completion of all outstanding DMAs */
    barrier();                             /* barrier synchronization */
  }
}
