  1. Comparison of the Various Stanford Streaming Languages
     Lance Hammond, Stanford University, October 12, 2001

  2. The Basics: Variables & Operators
     • Only DSP-C addresses variable-precision integers
     • “Native” short-vector types
       • Brook introduces short-vector types (like vec2i) to avoid the use of streaming operations with these types
         • The i/f (integer vs. float) distinction alone may not be sufficient!
       • DSP-C just defines these as very short CAAs, which should not cause the generation of kernels
         • Allows the definition of any kind of short vector you may want, using a general language specification
     • Matrix multiply operations (the two readings are contrasted in the sketch after this slide)
       • Brook uses “vector * vector” to mean a matrix multiply
         • So are other “vector OP vector” operations illegal??
       • DSP-C uses “vector OP vector” (including multiply) to mean an element-by-element operation on the vectors
         • A conventional matrix multiply could still be specified as a special operator of its own
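To make the two readings concrete, here is a minimal plain-C sketch; the vec3 type and the function names are illustrative inventions for this example, not syntax from Brook or DSP-C:

#include <stdio.h>

/* Hypothetical short-vector type for this sketch only; none of the
 * languages under comparison define it this way. */
typedef struct { float v[3]; } vec3;

/* The DSP-C reading of "vector * vector": element-by-element multiply. */
vec3 mul_elementwise(vec3 a, vec3 b) {
    vec3 r;
    for (int i = 0; i < 3; i++)
        r.v[i] = a.v[i] * b.v[i];
    return r;
}

/* An inner (dot) product, the building block of the matrix-multiply
 * reading that Brook attaches to "vector * vector". */
float dot(vec3 a, vec3 b) {
    float s = 0.0f;
    for (int i = 0; i < 3; i++)
        s += a.v[i] * b.v[i];
    return s;
}

int main(void) {
    vec3 a = {{1, 2, 3}}, b = {{4, 5, 6}};
    vec3 e = mul_elementwise(a, b);
    printf("element-wise: %g %g %g\n", e.v[0], e.v[1], e.v[2]);
    printf("dot product:  %g\n", dot(a, b));
    return 0;
}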

  3. “Stream” Definition
     • DSP-C uses arrays
       • Sections of the arrays are selected for use in calculations using array indices or array range specifications (sketched after this slide)
     • Stream-C defines 1-D, array-like streams
       • Arbitrary sections of the base arrays can be selected in advance for streaming into execution kernels
     • Brook defines much more abstract streams
       • Dynamic-length streams
       • Could be multidimensional (but only fixed length??)
       • The entire stream is a monolithic object
     • Brook problems, as I see them (currently)
       • No substreams can be defined . . . you have to use the whole thing
       • No in-place read/write stream usage to save memory when possible
       • The implicit loop start/end isn’t well defined (Where do you start if you have ±1 usage on an input? What if input streams differ in size?)
       • How is memory allocated for dynamic-length streams?
       • At the very least, we need to be able to specify when dynamic allocation isn’t actually necessary, so that fixed-length code can be optimized
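A minimal plain-C sketch of the DSP-C-style “array section” view of a stream; the explicit bounds stand in for an array-range specification, and the function name is illustrative:

/* Scale one section [lo, hi) of a base array in place. In DSP-C the
 * section would be picked out with an array-range specification; the
 * explicit bounds here are a plain-C stand-in for that syntax. */
void scale_section(float *a, int lo, int hi, float k) {
    for (int i = lo; i < hi; i++)
        a[i] *= k;
}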

  4. Parallelism Classification
     • Independent thread parallelism
       • Stick with pthreads or another high-level definition
     • Loop-iteration, data-division parallelism (see the sketch after this list)
       • Divide up loop iterations among functional units
       • Loop iterations must be data-independent (no critical dependencies)
     • Pipelining of segments of “serial” code
       • Find places to overlap non-dependent portions of serial code
       • Ex. 1: Start a later loop before an earlier one finishes
       • Ex. 2: Start different functions on different processors
       • Harder than loop-iteration parallelism because of load balancing
     • Pipelining between timesteps
       • Run multiple timesteps in parallel, using a pipeline
       • Doesn’t necessarily require finding overlap of loops or functions; running them on different timesteps makes them data-parallel
       • StreaMIT is the best example of a language designed to optimize for this, and up to now I don’t think any of our proposals have addressed it
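A plain-C illustration of the first two loop-level classes above; the arrays and sizes are made up for the example:

#define N 1024
float x[N], y[N], z[N];

void two_classes(void) {
    /* Loop-iteration, data-division parallelism: every iteration is
     * independent, so the iteration space can be split across units. */
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i];

    /* Pipelining of "serial" code: iteration i of this loop depends only
     * on y[i], so it can start as soon as the first loop has produced
     * y[i]; the two loops can overlap in a pipeline. */
    for (int i = 0; i < N; i++)
        z[i] = y[i] + 1.0f;
}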

  5. Parallelism Classification Graphic
     • Note: Any parallel blocks without dependencies in the figure can be considered unordered with respect to each other, and rescheduled accordingly. As a result, there is technically no need to specify “unordered” explicitly.

  6. Kernel Usage/Parallelism
     • DSP-C extracts kernels from loop nests
       • Standard C loops or “dsp_for” loops
       • Can find loop-iteration and overlap parallelism, but relies on compiler analysis to find possible dependencies
         • Probably OK for some cases, but may be too complex in general
       • Flexible: allows the compiler to pick the “best” loop nest for parallelism, if it is smart enough to make a good choice
         • We can supply “hints” to overcome some compiler stupidity
     • Stream-C explicitly calls function-like kernels (sketched after this slide)
       • Makes most inter-iteration dataflow and parallelism fairly obvious
       • Stream I/O behavior is clearly stated
       • Still allows inter-iteration scalar dependencies
       • Overlapping/combining of kernels requires some Stream-C analysis
       • Kernels are static and cannot be arbitrarily broken up to fit hardware
     • Brook also calls function-like kernels
       • Makes most inter-iteration dataflow and parallelism fairly obvious
       • Stream inputs are simple, but outputs can vary in length
       • Inter-iteration dependencies are completely eliminated
       • Overlapping of kernels can often be determined using “prototypes”
       • Kernels are static and cannot be arbitrarily broken up to fit hardware
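A plain-C approximation of a function-like kernel in the Stream-C/Brook style: the kernel body sees one element of each input stream and produces one element of the output stream. The explicit driver loop stands in for the implicit map over the streams; the names are illustrative:

/* The kernel proper: no inter-iteration state, so every invocation
 * is independent and the dataflow is obvious from the signature. */
static float saxpy_kernel(float a, float xi, float yi) {
    return a * xi + yi;
}

/* Driver loop, implicit in the real languages: apply the kernel to
 * each element of the input streams. */
void run_saxpy(float a, const float *x, const float *y, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = saxpy_kernel(a, x[i], y[i]);
}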

  7. Stream Communication Allowed
     • DSP-C: written CAA entries to read CAA entries
       • The compiler must ensure that all potential dependencies are maintained
       • The pattern of reading and writing within CAAs must be determined
         • A static “footprint” of reads and writes per iteration should be calculated to maximize parallelism
         • Dynamic address calculation limits parallelism for read/write arrays
       • Much of this analysis must be done anyway to optimize memory layout
       • May affect the compiler’s selection of which loop-nest level to parallelize
     • Stream-C: written streams to read streams
       • Kernels normally must be serialized when a kernel reads a stream written by another one
       • Overlap is possible if the compiler checks substream usage
     • Brook: written output to read stream footprint
       • Similar to Stream-C, but the read-footprint specification in the kernel header allows kernel overlap without compiler analysis inside the kernel (see the stencil sketch after this slide)
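A plain-C sketch of a kernel with a static read footprint of ±1 on its input, the case the footprint/“prototype” idea is meant to expose. Because the footprint is declared up front (here, only documented in a comment), a compiler could overlap this kernel with the producer of its input without analyzing the kernel body. The names are illustrative:

/* Read footprint per output element: in[i-1..i+1] (i.e., ±1).
 * Write footprint: out[i] only. */
void smooth(const float *in, float *out, int n) {
    for (int i = 1; i < n - 1; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}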

  8. Scalar Communication Allowed
     • DSP-C: anything allowed
       • A pattern matcher in the compiler looks for reduction operations and turns them into parallel reductions with combining code at the end (sketched after this slide)
       • Other dependencies are left alone and make the loop serial, unless they happen to map to specialized target hardware
         • Short-hop communication would map to the inter-cluster communication hardware on Imagine
         • Use of CAA input or output ports with “if . . . then port++” usage would target Imagine conditional streams
       • Another possibility: allow explicit communication when desired, using an extra “parallelism” dimension on CAAs (Nuwan’s idea)
     • Stream-C: anything allowed
       • Uses tricks like prefix calculation to minimize inter-cluster dependencies, when possible (How effective?? How automatic??)
       • Uses high-bandwidth inter-cluster communication when necessary
       • Supports the high-speed conditional stream pointer hardware in Imagine
     • Brook: only certain reductions allowed
       • A reduction operator can be applied to a single kernel return value
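A plain-C sketch of the reduction rewrite mentioned above: a serial sum is split into per-cluster partial sums plus combining code at the end. NCLUSTERS and the strided chunking scheme are assumptions made for this example:

#define NCLUSTERS 4

float reduce_sum(const float *a, int n) {
    float partial[NCLUSTERS] = {0};

    /* Each "cluster" accumulates an independent partial sum;
     * these outer iterations could run in parallel. */
    for (int c = 0; c < NCLUSTERS; c++)
        for (int i = c; i < n; i += NCLUSTERS)
            partial[c] += a[i];

    /* Combining code at the end: the only serial step. */
    float total = 0.0f;
    for (int c = 0; c < NCLUSTERS; c++)
        total += partial[c];
    return total;
}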

  9. Compiler Analysis in DSP-C
     • Picking parallel loops
       • Could be done automatically, but would probably be best done with programmer hints (“parallel for” loops)
       • Various nesting levels can be considered for parallel splitting
     • Calculation of the memory footprint used by loop array accesses
       • Try to take advantage of DSP-C register/local/main memory layers
       • Used to perform local-memory (SRF, memory mat, etc.) allocation/buffering of streams during execution
       • Optional rearrangement of non-dependent loop iterations to improve memory system usage
     • Analysis of dependencies between loop iterations
       • Elimination of reductions using reduction optimizations
       • Marking of required dependencies for communication
     • Generation of parallel kernels
       • Unroll loops as desired and recalculate memory footprints (see the sketch after this list)
       • Apply SIMD operations to (partially) parallelize one dimension of the loop nest, if possible
       • Generate the “best” parallel kernel based on all previous analysis (and probably based on profiling of several “good” choices)
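A plain-C sketch of the “unroll and recalculate footprints” step: the loop below is unrolled by 4, which widens the per-iteration memory footprint to four contiguous elements of each array and exposes a candidate 4-wide SIMD operation. The unroll factor and names are assumptions for this example:

void scale4(const float *a, float *b, int n) {
    int i;
    /* Unrolled body: footprint per iteration is now a[i..i+3] (read)
     * and b[i..i+3] (write), a candidate for one 4-wide SIMD op. */
    for (i = 0; i + 3 < n; i += 4) {
        b[i]     = 2.0f * a[i];
        b[i + 1] = 2.0f * a[i + 1];
        b[i + 2] = 2.0f * a[i + 2];
        b[i + 3] = 2.0f * a[i + 3];
    }
    /* Scalar cleanup for leftover iterations. */
    for (; i < n; i++)
        b[i] = 2.0f * a[i];
}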

  10. CA: Memory Footprint Graphic

  11. CA: Memory Model Graphic

  12. CA: SIMD Extraction Graphic

  13. Pipeline-in-Time Syntax
     • It might be a good idea to steal some of StreaMIT’s dataflow definition technology to allow pipelining in time
     • In DSP-C, this means an optional “time” dimension that can be added to any CAA (modeled after this list)
       • Acts like a standard dimension, except that it can only be accessed relative to the current time, not absolutely
       • The size of this dimension is unspecified (determined by blocking)
     • A more Stream-C/Brook-style kernel encoding of the dataflow would probably be better, to allow the compiler to determine blocking-in-time automatically
     • We need special syntax if we want to handle out-of-flow messages, like control signals that interrupt the normal flow of time
     • This is my current project!
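A plain-C model of how the proposed “time” dimension could behave: because accesses are only relative to the current timestep, the dimension can be backed by a small circular buffer whose depth is set by the blocking-in-time factor. DEPTH, the helper function, and the filter body are all assumptions for this sketch, not proposed DSP-C syntax:

#define N 256
#define DEPTH 4                     /* hypothetical blocking-in-time depth */

static float hist[DEPTH][N];        /* a CAA with a hidden time dimension */

/* Read element i at time offset dt <= 0 relative to timestep t; only
 * relative access is possible, matching the proposal above. */
static float at_time(int t, int dt, int i) {
    return hist[(t + dt + DEPTH) % DEPTH][i];
}

void step(int t, const float *in) {
    float *cur = hist[t % DEPTH];
    for (int i = 0; i < N; i++)
        /* e.g., mix the current input with state from timestep t-1 */
        cur[i] = 0.5f * in[i] + 0.5f * at_time(t, -1, i);
}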
