‘Stream’-based wireless computing

‘Stream’-based wireless computing. Sridhar Rajagopal Research group meeting December 17, 2002. The figures used in the slides are borrowed from papers at VT and Stanford. Motivation. ‘Stream’-based computing what does it mean? Not a well-defined term


Presentation Transcript


  1. ‘Stream’-based wireless computing • Sridhar Rajagopal • Research group meeting, December 17, 2002 • The figures used in the slides are borrowed from papers at VT and Stanford.

  2. Motivation • ‘Stream’-based computing • what does it mean? • Not a well-defined term • ‘computation’ that uses flow of self-guided info. • ‘sequence of data’ • Related to flow of data through architecture • Application to implementing wireless algorithms

  3. Outline • Stallion • reconfigurable computing at Virginia Tech • ‘stream’-based computing #1 • Custom Configurable Machines (CCM) • Imagine • media processing at Stanford • ‘stream’-based computing #2 • programmable architectures

  4. Stallion at VT • Wormhole Run-Time Reconfiguration (RTR) • coarse-grained structure • reconfiguration using ‘streams’

  5. ‘Stream’ packets • [Figure: a stream packet] • [Figure: stream flow through architecture]

  6. Functional description of PE

  7. Stream module description • 4 states: • IDLE – reconf. in progress • BUSY – doing work • PROGRAM – load reconf. data • PASS – meant for next module • Need to output a packet every cycle • VALID – maintain sync. • set INVALID instead of wait states • strip information off stack
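The four module states on slide 7 can be sketched as a small state machine. This is only a sketch under assumptions: the packet fields (`kind`, `target`, `payload`), the `StreamModule` class, and the use of a single multiplier as the "configuration" are hypothetical and are not taken from the Stallion design. It illustrates two points from the slide: a module emits one packet per cycle, and it marks a packet INVALID rather than inserting a wait state.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    IDLE = auto()     # reconfiguration in progress, no result available
    BUSY = auto()     # doing work
    PROGRAM = auto()  # loading reconfiguration data
    PASS = auto()     # packet is meant for the next module


@dataclass
class Packet:
    kind: str          # "config" or "data" (hypothetical encoding)
    target: int        # id of the module the packet is addressed to
    payload: int = 0
    valid: bool = True


class StreamModule:
    def __init__(self, module_id):
        self.module_id = module_id
        self.state = State.IDLE
        self.scale = None  # stand-in for the loaded configuration

    def step(self, pkt):
        """Consume one packet and always emit exactly one packet."""
        if not pkt.valid:
            return pkt  # filler packets flow through untouched
        if pkt.target != self.module_id:
            self.state = State.PASS
            return pkt
        if pkt.kind == "config":
            self.state = State.PROGRAM
            self.scale = pkt.payload
            # The configuration is consumed here; emit INVALID instead of stalling.
            return Packet("data", pkt.target, 0, valid=False)
        if self.scale is None:
            self.state = State.IDLE
            return Packet("data", pkt.target, 0, valid=False)
        self.state = State.BUSY
        return Packet("data", pkt.target, pkt.payload * self.scale)
```

Feeding a module a packet addressed elsewhere, then a configuration packet, then a data packet walks it through PASS, PROGRAM, and BUSY in turn.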

  8. Processing layer • Static section • configures the reconf. section • buffers data during reconf. & sends ‘IDLE’ packets • Reconf. Section • processing of the data done here • Higher layers convert algorithm to data and configuration patterns

  9. Cart before the horse: Colt before the Stallion • Colt architecture (also at VT) • IFU mesh – mesh of interconnected functional units

  10. Stallion chip • [Figure: Stallion chip diagram; surviving labels: 16-bit data, 4-control]

  11. IFU mesh in Stallion • Dashed lines – skip buses • Can send operands over one or more IFUs

  12. IFU details • Only left input can do barrel shifting • ALU based on LUT • Control register – stores control information for reconfiguration • Optional delay register – provides latency to synchronize path lengths of different pipeline streams • Cond. unit • Output control unit
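The role of the optional delay register can be illustrated with a little arithmetic: when two pipeline paths of different length meet at one IFU, the shorter path needs extra latency so both operands arrive in the same cycle. The function name `balancing_delays` and the path names below are hypothetical; the sketch assumes each path's latency is known as a whole number of cycles.

```python
def balancing_delays(path_latencies):
    """Return the extra delay cycles each path needs to match the slowest path."""
    longest = max(path_latencies.values())
    return {name: longest - latency for name, latency in path_latencies.items()}


# A 3-cycle path joining a 5-cycle path needs 2 cycles of added delay.
delays = balancing_delays({"left": 3, "right": 5})
```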

  13. Radio testbed at VT Stallion

  14. Worm-hole routing • stream = worm, architecture = holes • multiple, independent streams can wind their way through the chip simultaneously • parts of the system can be processing while other parts are reconfiguring • GOAL: Layered Software Radio Architecture

  15. ‘Stream’ processing at Stanford • Speeding up media applications • Need lots of computations per memory reference • Lots of data and sub-word parallelism • Current GPP architectures do not have enough ALUs • ‘Stream’ processors to the rescue

  16. Special-purpose processors • Lots (100s) of ALUs • Fed by dedicated wires/memories

  17. Care and feeding of ALUs • ‘Feeding’ structure dwarfs ALU • [Figure: instruction path (instr. cache, IP, IR; instruction bandwidth) and data path (regs; data bandwidth) feeding an ALU]

  18. Architecture implications • Tremendous opportunities • media problems have lots of parallelism and locality • VLSI technology enables 100s of ALUs/chip (1000s soon) • (in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder) • Challenging problems • locality - global structures won’t work • explicit parallelism - ILP won’t keep 100 ALUs busy • memory - streaming applications don’t cache well • It’s time to try some new approaches

  19. Register file organization • Register file functions: • short term storage for intermediate results • communication between multiple function units • Global register files don’t scale with #ALUs • need more registers to hold more results (grows with #ALUs) • need more ports to connect all of the units (grows with #ALUs²)
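The scaling claim on slide 19 can be made concrete with a short calculation. This sketch takes the slide's growth rates at face value (register count proportional to the number of ALUs, port cost proportional to its square); the function name and the choice of a one-ALU baseline are assumptions for illustration, and no absolute areas are implied.

```python
def global_register_file_growth(n_alus, base_alus=1):
    """Relative growth of a global register file versus a baseline design."""
    ratio = n_alus / base_alus
    return {
        "registers": ratio,        # grows with #ALUs
        "port_cost": ratio ** 2,   # grows with #ALUs squared
    }


# Going from 8 to 64 ALUs: 8x the registers, 64x the port cost.
growth = global_register_file_growth(64, base_alus=8)
```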

  20. Register files dwarf ALUs

  21. Distributed register files • Distributed register files mean: • not all functional units can access all data • each functional unit input/output no longer has a dedicated route from/to all register files

  22. Stream processing • Little data reuse (pixels never revisited) • Highly data parallel (output pixels not dependent on other output pixels) • Compute intensive (60 operations per memory reference) • [Figure: stereo depth pipeline: input data Image 0 and Image 1 each pass through two convolve kernels, then a SAD kernel produces the output Depth Map]
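The convolve-then-SAD pipeline named on slide 22 can be sketched on single image rows. This is a simplified illustration, not the Stanford implementation: it works in one dimension, uses pure Python lists, and the names `convolve`, `sad_disparity`, `depth_row`, the window size, and the disparity search range are all assumptions. SAD is taken to mean sum of absolute differences, and the disparity that minimizes it stands in for the depth value.

```python
def convolve(row, taps):
    """1-D convolution over the positions where the taps fully overlap the row."""
    flipped = list(reversed(taps))
    n_out = len(row) - len(taps) + 1
    return [sum(t * row[i + k] for k, t in enumerate(flipped)) for i in range(n_out)]


def sad_disparity(left, right, window, max_disp):
    """For each window in `left`, find the shift of `right` with the smallest SAD."""
    out = []
    for i in range(max_disp, len(left) - window + 1):
        costs = [
            sum(abs(left[i + k] - right[i - d + k]) for k in range(window))
            for d in range(max_disp + 1)
        ]
        out.append(costs.index(min(costs)))  # first minimum wins on ties
    return out


def depth_row(row0, row1, taps_a, taps_b, window, max_disp):
    """Two convolve kernels per image, then one SAD kernel across both."""
    f0 = convolve(convolve(row0, taps_a), taps_b)
    f1 = convolve(convolve(row1, taps_a), taps_b)
    return sad_disparity(f0, f1, window, max_disp)
```

Each output element depends only on a small neighborhood of the inputs, which is the data-parallel, low-reuse pattern the slide describes.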

  23. Stream programming
      • Streams • Communication
          void main() {
            Stream<int> a(256);
            Stream<int> b(256);
            Stream<int> c(256);
            Stream<int> d(1024);
            ...
            example1(a, b, c);
            example2(c, d);
            ...
          }
      • Kernels • Computation
          KERNEL example1(istream<int> a, istream<int> b, ostream<int> c) {
            loop_stream(a) {
              int ai, bi, ci;
              a >> ai;
              b >> bi;
              ci = ai * 2 + bi * 3;
              c << ci;
            }
          }
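For readers without access to the stream language used on slide 23, the `example1` kernel can be emulated in plain Python. This is a rough analogue, not the Imagine toolchain: a generator stands in for the kernel, ordinary iterables stand in for streams, and `zip` replaces `loop_stream` (so the loop ends at the shorter input, which is an assumption of this sketch).

```python
def example1(a, b):
    """Emulate the slide's kernel: c_i = a_i * 2 + b_i * 3 for each element pair."""
    for ai, bi in zip(a, b):      # one iteration per stream element
        yield ai * 2 + bi * 3     # corresponds to `c << ci`


# The kernel never indexes into its inputs; it only consumes them in order.
c = list(example1([1, 2, 3], [10, 20, 30]))
```

Because the kernel is a generator, its output can be passed straight to another kernel-like function without materializing an intermediate array.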

  24. Stream Processor • Instructions are Load, Store, and Operate • operands are streams • Operate performs a compound stream operation • read elements from input streams • perform a local computation • append elements to output streams • repeat until input stream is consumed • (e.g., triangle transform)
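The Load / Store / Operate model on slide 24 can be sketched as a toy interpreter. Everything here is illustrative: the class `ToyStreamProcessor`, its dictionaries for memory and on-chip stream storage, and the method signatures are assumptions, not the Imagine instruction set. The `operate` method follows the slide's steps: read elements from the input streams, perform a local computation, append to the output stream, and repeat until the input is consumed.

```python
class ToyStreamProcessor:
    def __init__(self, memory):
        self.memory = memory  # name -> list, stands in for off-chip memory
        self.streams = {}     # name -> list, stands in for on-chip stream storage

    def load(self, mem_name, stream_name):
        """Load: copy a whole stream from memory into on-chip storage."""
        self.streams[stream_name] = list(self.memory[mem_name])

    def operate(self, kernel, inputs, output):
        """Operate: apply `kernel` element by element until inputs run out."""
        sources = [self.streams[name] for name in inputs]
        result = []
        for elems in zip(*sources):
            result.append(kernel(*elems))  # local computation, then append
        self.streams[output] = result

    def store(self, stream_name, mem_name):
        """Store: copy a whole stream from on-chip storage back to memory."""
        self.memory[mem_name] = list(self.streams[stream_name])


proc = ToyStreamProcessor({"x": [1, 2, 3], "y": [4, 5, 6]})
proc.load("x", "sx")
proc.load("y", "sy")
proc.operate(lambda a, b: a + b, ["sx", "sy"], "sz")
proc.store("sz", "z")
```

Note that every operand is an entire stream; the per-element work happens only inside `operate`.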

  25. Imagine Stream Processor • [Figure: Imagine block diagram; labels: host processor, stream controller, network interface, network, microcontroller, stream register file, streaming memory system, four SDRAMs, ALU clusters 0–7]

  26. Arithmetic clusters • [Figure: one ALU cluster; labels: intercluster network, local register file, functional units (+ * * + + /), CU, cross point, to SRF, from SRF]

  27. Bandwidth hierarchy • VLIW clusters with shared control • 41.2 32-bit operations per word of memory bandwidth • [Figure: SDRAM, stream register file, and ALU clusters; bandwidth labels 2 GB/s, 32 GB/s, 544 GB/s]
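The three bandwidth figures on slide 27 are easier to compare as ratios. This sketch assumes they belong, in increasing order, to SDRAM, the stream register file (SRF), and the ALU clusters; that assignment and the names `LEVELS` and `bandwidth_ratios` are assumptions of the example, since the transcript lists the numbers without labels.

```python
# (level name, bandwidth in GB/s), ordered from memory toward the ALUs
LEVELS = [("SDRAM", 2), ("SRF", 32), ("clusters", 544)]


def bandwidth_ratios(levels):
    """Express each level's bandwidth as a multiple of the first (slowest) level."""
    base = levels[0][1]
    return {name: bandwidth / base for name, bandwidth in levels}


ratios = bandwidth_ratios(LEVELS)
# Each step inward multiplies the available bandwidth: 16x, then a further 17x.
```

Under these assumptions, a kernel has to reuse each word fetched from memory many times inside the clusters to keep the ALUs busy, which matches the slide's figure of 41.2 operations per word of memory bandwidth.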

  28. Conclusions • ‘Streams’ shown to be promising for reconfigurable computing • wireless may need reconfigurability • ‘Streams’ shown to be promising for media processing • wireless may have similar workloads • Important to understand pros and cons of different methodologies for good wireless architectures • Important to have the right tools
