
Static Translation of Stream Programming to a Parallel System

Presentation Transcript


  1. Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School of Information Technology University of Sydney

  2. Multicores Are Here!
  [Chart: number of cores (2 to 512) versus year (1970 to 20??). Uniprocessors from the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium 1 and Itanium 2 sit at one core, while multicores such as Power4, PExtreme, Power6, Yonah, Opteron, Xeon MP, Xbox360, PA-8800, Tanglewood, Broadcom 1480, Opteron 4P, Niagara, Cell, Raw, Raza XLR, Cavium Octeon, Intel Tflops, Cisco CSR-1, Ambric AM2045 and Picochip PC102 climb toward hundreds of cores.]

  3. Multicores Are Here!
  [Same cores-versus-year chart as the previous slide.]
  • For uniprocessors, C was: portable, high performance, composable, malleable, maintainable
  • Uniprocessors: C is the common machine language

  4. Multicores Are Here!
  [Same cores-versus-year chart as the previous slides.]
  What is the common machine language for multicores?

  5. Common Machine Languages
  • Uniprocessors: von Neumann languages represent the common properties and abstract away the differences; the compiler bridges the rest through register allocation, instruction selection and instruction scheduling
  • Multicores: we need common machine language(s)

  6. Properties of Stream Programs
  • A large (possibly infinite) amount of data
  • Limited lifespan of each data item
  • Little processing of each data item
  • A regular, static computation pattern
  • Stream program structure is relatively constant
  • A lot of opportunities for compiler optimizations

  7. Streaming as a Common Machine Language
  [Figure: FM radio stream graph with filters AtoD, FMDemod, Scatter, LPF1-LPF3, HPF1-HPF3, Gather, Adder, Speaker.]
  • Regular and repeating computation
  • Independent filters with explicit communication
  • Segregated address spaces and multiple program counters
  • Natural expression of parallelism: producer / consumer dependencies
  • Enables powerful, whole-program transformations

  8. Applications of Stream Programming

  9. Model of Computation
  • Synchronous Dataflow [Lee '92]
  • Graph of autonomous filters
  • Communicate via FIFO channels
  • Static I/O rates
  • Compiler decides on an order of execution (schedule)
  • Static estimation of computation
  [Figure: example graph: A/D, Band Pass, Duplicate, four Detect filters feeding four LEDs.]
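  To make "the compiler decides on an order of execution" concrete, here is a minimal Python sketch of solving the SDF balance equations for a repetition vector. The actor names, edge rates and the lack of consistency checking are illustrative assumptions, not part of the talk.

  from fractions import Fraction
  from math import lcm

  def repetition_vector(edges, actors):
      # edges: (producer, consumer, push_rate, pop_rate) with static I/O rates.
      # Returns the smallest positive firing counts so that, per steady state,
      # every FIFO channel gains exactly as many tokens as it loses.
      rate = {actors[0]: Fraction(1)}            # arbitrary reference actor
      changed = True
      while changed:                             # propagate rate ratios along edges
          changed = False
          for src, dst, push, pop in edges:
              if src in rate and dst not in rate:
                  rate[dst] = rate[src] * push / pop
                  changed = True
              elif dst in rate and src not in rate:
                  rate[src] = rate[dst] * pop / push
                  changed = True
      scale = lcm(*(r.denominator for r in rate.values()))
      return {a: int(rate[a] * scale) for a in actors}

  # A 1-in/1-out chain fires every actor once per period; a 2-push/3-pop edge does not.
  print(repetition_vector([("AtoD", "Detect", 1, 1)], ["AtoD", "Detect"]))   # {'AtoD': 1, 'Detect': 1}
  print(repetition_vector([("A", "B", 2, 3)], ["A", "B"]))                   # {'A': 3, 'B': 2}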

  10. FIR Example: StreamIt Filter
  [Figure: input tape (positions 0-11) feeding the filter, output tape (positions 0-1).]
  float->float filter FIR (int N, float[N] weights) {
    work push 1 pop 1 peek N {
      float result = 0;
      for (int i = 0; i < N; i++) {
        result += weights[i] * peek(i);
      }
      pop();
      push(result);
    }
  }
  Stateless
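  For readers unfamiliar with StreamIt's tape operations, the following short Python model (an illustration, not part of StreamIt or the talk) reproduces the peek/pop/push semantics of the stateless FIR work function above on an explicit FIFO; the tap values are made up.

  from collections import deque

  def fir_work(input_fifo, output_fifo, weights):
      # One firing of the stateless FIR filter: peek N, pop 1, push 1.
      result = 0.0
      for i in range(len(weights)):
          result += weights[i] * input_fifo[i]   # peek(i): read without consuming
      input_fifo.popleft()                       # pop(): consume one input token
      output_fifo.append(result)                 # push(result): emit one output token

  # Feed 12 input tokens through a 4-tap filter; each firing consumes one token
  # and produces one token, so 9 firings are possible before the tape runs dry.
  inp, out = deque(range(12)), deque()
  taps = [0.25, 0.25, 0.25, 0.25]
  while len(inp) >= len(taps):
      fir_work(inp, out, taps)
  print(list(out))   # moving average: [1.5, 2.5, ..., 9.5]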

  11. FIR Example: StreamIt Filter
  [Figure: the same input and output tapes as before.]
  float->float filter FIR (int N) {
    float[N] weights;
    work push 1 pop 1 peek N {
      float result = 0;
      weights = adaptChannel(weights);
      for (int i = 0; i < N; i++) {
        result += weights[i] * peek(i);
      }
      pop();
      push(result);
    }
  }
  Stateful (weights is now filter state, updated each firing via adaptChannel)

  12. StreamIt Language Overview
  • StreamIt is a novel language for streaming
  • Exposes parallelism and communication
  • Architecture independent
  • Modular and composable: simple structures are composed to create complex graphs
  • Malleable: change program behavior with small modifications
  [Figure: the hierarchical constructs: a filter; a pipeline, whose stages may be any StreamIt language construct; a splitjoin (splitter, parallel computation, joiner); and a feedback loop (joiner and splitter).]
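  The "simple structures composed to create complex graphs" point can be illustrated with a small Python model of the hierarchy; the class names and the FM-radio-like example graph are assumptions made for illustration, not StreamIt's actual compiler data structures.

  from dataclasses import dataclass
  from typing import List, Union

  @dataclass
  class Filter:                       # leaf node: a single work function
      name: str

  @dataclass
  class Pipeline:                     # sequential composition of child streams
      stages: List["Node"]

  @dataclass
  class SplitJoin:                    # splitter, parallel branches, joiner
      branches: List["Node"]

  Node = Union[Filter, Pipeline, SplitJoin]

  # Roughly the FM radio graph of slide 7: a pipeline whose middle stage is a
  # splitjoin of three band-pass pipelines, each built from an LPF and an HPF.
  fm_radio = Pipeline([
      Filter("AtoD"),
      Filter("FMDemod"),
      SplitJoin([Pipeline([Filter(f"LPF{i}"), Filter(f"HPF{i}")]) for i in (1, 2, 3)]),
      Filter("Adder"),
      Filter("Speaker"),
  ])
  print(fm_radio)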

  13. Mapping to Multicores • Baseline techniques • 3-phase solution

  14. Baseline 1: Task Parallelism
  [Figure: a splitjoin of two identical pipelines (BandPass, Compress, Process, Expand, BandStop) between a Splitter and a Joiner, followed by an Adder.]
  • Inherent task parallelism between the two processing pipelines
  • Task parallel model: only parallelize explicit task parallelism (fork/join parallelism)
  • Executing this on a 2-core machine gives ~2x speedup over a single core
  • What about 4, 16, 1024, ... cores?

  15. Baseline 2: Fine-Grained Data Parallelism
  [Figure: the same graph with every filter (BandPass, Compress, Process, Expand, BandStop, Adder) fissed four ways, each group wrapped in its own splitter and joiner.]
  • Each filter in the example is stateless, so we can introduce data parallelism
  • Fine-grained data parallel model: fiss each stateless filter N ways (N is the number of cores)
  • Remove scatter/gather if possible
  • Example: 4 cores; each fission group occupies the entire machine
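  A minimal sketch of what "fiss each stateless filter N ways" means, using plain Python lists rather than real cores. It assumes a 1-input/1-output filter with no peeking, which is the simplest case (peeking filters need input duplication at the splitter); the example filter is made up.

  def fiss(work_fn, n_ways, input_tokens):
      # Round-robin splitter: firing i of the original filter goes to replica i % n_ways.
      replicas = [input_tokens[i::n_ways] for i in range(n_ways)]
      # The replicas are independent, so they could run on different cores.
      partial = [[work_fn(tok) for tok in r] for r in replicas]
      # Round-robin joiner: reassemble the outputs in the original firing order.
      return [partial[i % n_ways][i // n_ways] for i in range(len(input_tokens))]

  # Fission is only legal for stateless filters; the observable behaviour is unchanged.
  double = lambda x: 2 * x
  assert fiss(double, 4, list(range(8))) == [double(x) for x in range(8)]
  print(fiss(double, 4, list(range(8))))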

  16. 3-Phase Solution [2006]
  [Figure: stream graph with per-filter work estimates. Two AdaptDFT filters (6 each) inside a splitter/joiner; RectPolar (20, data parallel); a splitjoin of two Unwrap, Diff, Amplify, Accum pipelines (2, 1, 1, 1 each; data parallel, but too little work!); PolarRect (20, data parallel). Target a 4-core machine.]

  17. Data Parallelize
  [Figure: the same graph after data parallelization. RectPolar and PolarRect (20 each) are fissed into replicas of work 5 inside their own splitter/joiner pairs; the AdaptDFT filters (6 each) and the Unwrap, Diff, Amplify, Accum pipelines (2, 1, 1, 1) are unchanged. Target a 4-core machine.]

  18. Data + Task Parallel Execution
  [Figure: the parallelized graph mapped onto the cores; the resulting schedule is 21 time units long. Target a 4-core machine.]

  19. We Can Do Better!
  [Figure: the same workload rescheduled so the cores finish in 16 time units. Target a 4-core machine.]

  20. Phase 3: Coarse-Grained Software Pipelining
  [Figure: the fissed RectPolar replicas executed in a software pipeline: a prologue fills the pipeline, after which a new steady state repeats.]
  • The new steady state is free of dependencies
  • Schedule the new steady state using a greedy partitioning
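  A sketch of the coarse-grained software pipelining idea in Python: once each actor has a pipeline stage, iteration i of an actor in stage s runs at global step i + s, so the prologue fills the pipeline and every later step combines work from different iterations, which is why the steady state carries no intra-iteration dependences. The stage numbers and actor names below are illustrative assumptions.

  def software_pipeline(actors, stage, iterations):
      # Yields, for each global step, the (actor, iteration) pairs that fire together.
      # Pairs sharing a step belong to different iterations of the stream graph.
      max_stage = max(stage.values())
      for step in range(iterations + max_stage):
          fired = [(a, step - stage[a]) for a in actors
                   if 0 <= step - stage[a] < iterations]
          yield step, fired

  stages = {"AdaptDFT": 0, "RectPolar": 1, "PolarRect": 2}
  for step, fired in software_pipeline(list(stages), stages, 4):
      phase = "prologue" if step < max(stages.values()) else "steady state / drain"
      print(step, phase, fired)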

  21. Greedy Partitioning
  [Figure: the steady-state work to schedule (16) packed greedily onto the 4 cores. Target a 4-core machine.]
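  The greedy partitioning can be sketched as longest-processing-time-first bin packing; the work estimates below are loosely modeled on the slide-16 graph, and the talk's actual partitioner may differ in detail.

  import heapq

  def greedy_partition(work, n_cores):
      # Place the heaviest remaining filter on the currently least-loaded core.
      loads = [(0, core, []) for core in range(n_cores)]
      heapq.heapify(loads)
      for actor, w in sorted(work.items(), key=lambda kv: -kv[1]):
          load, core, assigned = heapq.heappop(loads)
          heapq.heappush(loads, (load + w, core, assigned + [actor]))
      return sorted(loads, key=lambda t: t[1])

  # Assumed work estimates (RectPolar and PolarRect already fissed four ways).
  work = {"AdaptDFT1": 6, "AdaptDFT2": 6, "Unwrap1": 2, "Unwrap2": 2,
          "Diff1": 1, "Diff2": 1, "Amplify1": 1, "Amplify2": 1, "Accum1": 1, "Accum2": 1}
  work.update({f"RectPolar{i}": 5 for i in range(4)})
  work.update({f"PolarRect{i}": 5 for i in range(4)})
  for load, core, actors in greedy_partition(work, 4):
      print(f"core {core}: load {load:2d} {actors}")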

  22. Orchestrating the Execution of Stream Programs on Multicore Platforms (SGMS) [June 08]
  • An integrated unfolding and partitioning step based on integer linear programming: data parallel actors are unfolded as needed and actors are maximally packed onto cores
  • Next, actors are assigned to pipeline stages so that all communication is maximally overlapped with computation on the cores
  • Experimental results on the Cell architecture
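  As a deliberately simplified stand-in for the integrated ILP (the paper's formulation also decides fission and models communication), here is a sketch that only balances already-fissed actors across cores; it assumes the PuLP package for the solver, and the work numbers come from the slide-26/27 example.

  from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

  def ilp_partition(work, n_cores):
      # Binary x[a, p] = 1 if actor a runs on core p; minimize the most-loaded core.
      prob = LpProblem("partition", LpMinimize)
      x = {(a, p): LpVariable(f"x_{a}_{p}", cat=LpBinary)
           for a in work for p in range(n_cores)}
      makespan = LpVariable("makespan", lowBound=0)
      prob += makespan                                          # objective
      for a in work:                                            # each actor on exactly one core
          prob += lpSum(x[a, p] for p in range(n_cores)) == 1
      for p in range(n_cores):                                  # every core finishes within the makespan
          prob += lpSum(work[a] * x[a, p] for a in work) <= makespan
      prob.solve()
      return {a: p for (a, p), v in x.items() if v.value() > 0.5}

  # The slide-26 graph after fission: B is replaced by S, B1, B2 and J.
  work = {"A": 5, "S": 2, "B1": 20, "B2": 20, "C": 10, "J": 2, "D": 5}
  print(ilp_partition(work, 2))   # an optimal split has makespan 32, matching the 60/32 speedup bound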

  23. Cell Architecture

  24. Motivation
  • It is critical that the compiler leverage a synergistic combination of parallelism while avoiding structural and resource hazards
  • The objective is to maximize concurrent execution of actors across multiple cores
  • While hiding communication overhead to minimize stalls

  25. SGMS: A Phase-Ordered Approach (2 Steps)
  • First, an integrated actor fission and partitioning step assigns actors to processors while ensuring maximum work balance; data parallel actors are selectively replicated and split to increase the opportunities for even work distribution (solved as an integer linear program)
  • Second, stage assignment: each actor is assigned to a pipeline stage so that data dependencies are satisfied and inter-processor communication latency is maximally overlapped with computation

  26. Orchestrating the Execution of Stream Programs on Multicore Platforms: Fission and Processor Assignment
  [Figure: original stream graph with work estimates A = 5, B = 40, C = 10, D = 5; maximum speedup of the unmodified graph = 60/40 = 1.5. After fission, B is replaced by a splitter S (2), two replicas B1 and B2 (20 each), and a joiner J (2).]

  27. Orchestrating the Execution of Stream Programs on Multicore Platforms: Fission and Processor Assignment (continued)
  [Figure: the original graph (A = 5, B = 40, C = 10, D = 5) beside the fissed graph (A, S = 2, B1 = 20, B2 = 20, C = 10, J = 2, D = 5), now assigned to two processors with loads Proc 1 = 27 and Proc 2 = 32. Maximum speedup (unmodified graph) = 60/40 = 1.5; maximum speedup (modified graph) = 60/32 ~ 2.]

  28. Stage Assignment
  [Figure: the fissed graph laid out in pipeline stages. Stage 0: A and S; Stage 1: B1 plus DMA transfers; Stage 2: B2 and C; Stage 3: J; Stage 4: D. Processor loads as before: Proc 1 = 27, Proc 2 = 32.]
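  A rough Python sketch of the stage-assignment rule (my simplification, not the paper's exact formulation): a consumer is placed no earlier than its producers, and a two-stage gap is inserted whenever producer and consumer sit on different processors so the DMA transfer gets its own stage and can overlap with computation. The edges and processor assignment are inferred from the slide-29 DMA names, and the resulting stage numbers differ slightly from the figure, which spreads a few actors further apart.

  def assign_stages(edges, proc):
      # edges must be listed in topological order (producer before consumer).
      stage = {}
      for src, dst in edges:
          stage.setdefault(src, 0)
          gap = 2 if proc[src] != proc[dst] else 0   # reserve a stage for the DMA
          stage[dst] = max(stage.get(dst, 0), stage[src] + gap)
      return stage

  # Assumed fissed graph: A feeds S and C; S scatters to B1 and B2; B1 and B2 join
  # at J; C and J feed D. Processor assignment follows the SPE1/SPE2 split on slide 29.
  proc = {"A": 1, "S": 1, "B1": 1, "B2": 2, "C": 2, "J": 2, "D": 1}
  edges = [("A", "S"), ("A", "C"), ("S", "B1"), ("S", "B2"),
           ("B1", "J"), ("B2", "J"), ("C", "D"), ("J", "D")]
  print(assign_stages(edges, proc))   # A=0, S=0, B1=0, C=2, B2=2, J=2, D=4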

  29. Schedule Running on Cell
  [Figure: prologue and steady-state schedule across SPE1, DMA1, SPE2 and DMA2. SPE1 repeatedly executes A, S, B1; DMA1 carries the StoB2, AtoC and B1toJ transfers; SPE2 executes B2, C, J; DMA2 carries the CtoD and JtoD transfers.]

  30. Naïve Unfolding [1991]
  [Figure: three iterations of the graph (A1-F1, A2-F2, A3-F3) unfolded across PE1, PE2 and PE3 with DMA transfers between them; state data dependences and stream data dependences are marked.]

  31. Benchmark Characteristics

  32. Results

  33. Compare

  34. Effect of Exposed DMA Latency

  35. ILP vs Greedy Partitioning
