This December 17, 2002 research group presentation by Sridhar Rajagopal surveys ‘stream’-based computing and its application to wireless algorithms. It covers two projects: Stallion, a wormhole run-time reconfigurable architecture from Virginia Tech, and Imagine, a programmable stream processor for media applications from Stanford. Topics include stream packets, the need for many ALUs to exploit the parallelism in media workloads, the locality and memory challenges that follow, and the opportunities opened by VLSI technology.
‘Stream’-based wireless computing Sridhar Rajagopal Research group meeting December 17, 2002 The figures used in the slides are borrowed from papers at VT and Stanford.
Motivation • ‘Stream’-based computing • what does it mean? • Not a well-defined term • ‘computation’ that uses flow of self-guided info. • ‘sequence of data’ • Related to flow of data through architecture • Application to implementing wireless algorithms
Outline • Stallion • reconfigurable computing at Virginia Tech • ‘stream’-based computing #1 • Custom Configurable Machines (CCM) • Imagine • media processing at Stanford • ‘stream’-based computing #2 • programmable architectures
Stallion at VT • Wormhole Run-Time Reconfiguration (RTR) • coarse-grained structure • reconfiguration using ‘streams’
‘Stream’ packets [Figures: a stream packet; stream flow through the architecture]
Stream module description • 4 states: • IDLE – reconf. in progress • BUSY – doing work • PROGRAM – load reconf. data • PASS – meant for next module • Need to output one packet per cycle • VALID – maintain sync. • set INVALID instead of wait states • strip information off stack
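The four states and the one-packet-per-cycle rule can be sketched as a small state machine. This is only an illustrative Python model, not the Stallion hardware: the `Packet` fields, the header-stack encoding (top of stack at the end of the list), and the "add the loaded configuration value" computation are all assumptions made for the example. Only the state names and the VALID/INVALID idea come from the slide.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    IDLE = auto()      # no usable configuration yet (reconfiguration pending)
    BUSY = auto()      # doing work on a data packet
    PROGRAM = auto()   # loading reconfiguration data
    PASS = auto()      # packet is meant for a later module


@dataclass
class Packet:
    valid: bool
    headers: list            # hypothetical stack of module ids, top at the end
    is_config: bool = False  # hypothetical flag: payload is configuration data
    payload: int = 0


INVALID = Packet(valid=False, headers=[])


class StreamModule:
    def __init__(self, module_id):
        self.module_id = module_id
        self.state = State.IDLE
        self.config = None

    def step(self, pkt):
        """Consume one packet and emit exactly one packet (one per cycle)."""
        if not pkt.valid:
            return pkt                       # forward INVALID, no wait states
        if not pkt.headers or pkt.headers[-1] != self.module_id:
            self.state = State.PASS
            return pkt                       # meant for the next module
        if pkt.is_config:
            self.state = State.PROGRAM
            self.config = pkt.payload        # load reconfiguration data
            return INVALID                   # still must output something
        if self.config is None:
            self.state = State.IDLE
            return INVALID                   # not configured yet
        self.state = State.BUSY
        result = pkt.payload + self.config   # placeholder computation
        # Strip this module's header off the stack before forwarding.
        return Packet(True, pkt.headers[:-1], False, result)


module = StreamModule(module_id=1)
first = module.step(Packet(True, [1], is_config=True, payload=10))
second = module.step(Packet(True, [2, 1], payload=5))
third = module.step(Packet(True, [2], payload=7))
```

Emitting an INVALID packet instead of stalling keeps every module in lock-step, which is the synchronization role the slide assigns to the VALID flag.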
Processing layer • Static section • configures the reconf. section • buffers data during reconf. & sends ‘IDLE’ packets • Reconf. Section • processing of the data done here • Higher layers convert algorithm to data and configuration patterns
Cart before the horse: Colt before the Stallion • Colt architecture (also at VT) • IFU mesh – mesh of interconnected functional units
Stallion chip [Figure: Stallion chip block diagram; labels: 16-bit data, 4-control]
IFU mesh in Stallion • Dashed lines – skip buses • Operands can be sent over one or more IFUs
IFU details • Only the left input can do barrel shifting • ALU based on LUT • Control register – stores control information for reconfiguration • Optional delay register – provides latency to synchronize path lengths of different pipeline streams • Cond. unit • Output control unit
Radio testbed at VT [Figure: radio testbed, with Stallion labeled]
Worm-hole routing • stream = worm, architecture = holes • multiple, independent streams can wind their way through the chip simultaneously • parts of the system can be processing while other parts are reconfiguring • GOAL: Layered Software Radio Architecture
‘Stream’ processing at Stanford • Speeding up media applications • Need lots of computations per memory reference • Lots of data and sub-word parallelism • Current GPP architectures do not have enough ALUs • ‘Stream’ processors to the rescue
Special-purpose processors • Lots (100s) of ALUs • Fed by dedicated wires/memories
Care and feeding of ALUs [Figure labels: Instr. Cache, IP, IR (instruction bandwidth); Regs (data bandwidth)] • ‘Feeding’ structure dwarfs ALU
Architecture implications • Tremendous opportunities • media problems have lots of parallelism and locality • VLSI technology enables 100s of ALUs/chip (1000s soon) • (in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder) • Challenging problems • locality - global structures won’t work • explicit parallelism - ILP won’t keep 100 ALUs busy • memory - streaming applications don’t cache well • It’s time to try some new approaches
Register file organization • Register file functions: • short term storage for intermediate results • communication between multiple function units • Global register files don’t scale with #ALUs • need more registers to hold more results (grows with #ALUs) • need more ports to connect all of the units (grows with (#ALUs)²)
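A rough way to read the two growth rates on this slide: if the ALU count is scaled by some factor, the register count scales by that factor and the port-related cost by its square. The sketch below only restates that arithmetic; the function name and the choice to report relative (unitless) growth are assumptions, and no absolute area figures are implied.

```python
def global_rf_growth(n_alus, base_alus):
    """Relative growth of a global register file when going from
    base_alus to n_alus, using the slide's two scaling rules."""
    scale = n_alus / base_alus
    return {
        "registers": scale,        # grows with #ALUs
        "port_cost": scale ** 2,   # grows with (#ALUs)^2
    }


# Doubling the ALU count from 4 to 8:
growth = global_rf_growth(8, 4)
```

Under these rules, doubling the ALUs doubles the registers but quadruples the port-related cost, which is the reason given for moving to distributed register files.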
Distributed register files • Distributed register files mean: • not all functional units can access all data • each functional unit input/output no longer has a dedicated route from/to all register files
Stream processing [Figure: input data Image 0 and Image 1 each stream through two convolve kernels into a SAD kernel, which outputs the Depth Map] • Little data reuse (pixels never revisited) • Highly data parallel (output pixels not dependent on other output pixels) • Compute intensive (60 operations per memory reference)
Stream programming
• Streams • Communication

    void main() {
      Stream<int> a(256);
      Stream<int> b(256);
      Stream<int> c(256);
      Stream<int> d(1024);
      ...
      example1(a, b, c);
      example2(c, d);
      ...
    }

• Kernels • Computation

    KERNEL example1(istream<int> a, istream<int> b, ostream<int> c) {
      loop_stream(a) {
        int ai, bi, ci;
        a >> ai;
        b >> bi;
        ci = ai * 2 + bi * 3;
        c << ci;
      }
    }
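For readers without the stream toolchain, the `example1` kernel above can be mimicked in plain Python. This is a sketch of the same element-wise computation, with Python iterables standing in for streams; the input values are made up for the illustration, and `example2` is not reproduced because the slide does not show its body.

```python
def example1(a, b):
    """Emulate the example1 kernel: read one element from each input
    stream, compute locally, and append one element to the output."""
    for ai, bi in zip(a, b):
        yield ai * 2 + bi * 3


# Hypothetical input data: two 256-element streams.
a = range(256)
b = [1] * 256

c = list(example1(a, b))
```

Writing the kernel as a generator keeps the streaming flavor: each output element depends only on the current input elements, so nothing outside the loop body needs to be stored.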
Stream Processor • Instructions are Load, Store, and Operate • operands are streams • Operate performs a compound stream operation • read elements from input streams • perform a local computation • append elements to output streams • repeat until input stream is consumed • (e.g., triangle transform)
Imagine [Figure: Imagine stream processor block diagram; four SDRAMs, streaming memory system, stream register file, stream controller, host processor, network interface, microcontroller, and ALU clusters 0 to 7]
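Arithmetic clusters [Figure: one cluster; three adders (+), two multipliers (*), one divider (/) and a CU, each fed by local register files and joined by a cross point, with connections to and from the SRF and to the intercluster network]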
Bandwidth hierarchy [Figure: SDRAM to stream register file at 2GB/s; stream register file to ALU clusters at 32GB/s; within the ALU clusters at 544GB/s] • VLIW clusters with shared control • 41.2 32-bit operations per word of memory bandwidth
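The step-up between levels of the hierarchy is easy to work out from the three bandwidth figures on the slide. The sketch below assumes they belong, in increasing order, to SDRAM, the stream register file, and the ALU clusters; the dictionary keys are names chosen for this example.

```python
# Bandwidths from the slide, in GB/s (level assignment is an assumption).
BANDWIDTH_GBPS = {"sdram": 2, "srf": 32, "clusters": 544}


def ratio(upper, lower):
    """How many times more bandwidth the upper level offers than the lower."""
    return BANDWIDTH_GBPS[upper] / BANDWIDTH_GBPS[lower]


srf_over_sdram = ratio("srf", "sdram")            # 16.0
clusters_over_srf = ratio("clusters", "srf")      # 17.0
clusters_over_sdram = ratio("clusters", "sdram")  # 272.0
```

Each level offers roughly 16 to 17 times the bandwidth of the one below it, so a kernel only keeps the ALUs busy if most of its operands come from the cluster level rather than from memory.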
Conclusions • ‘Streams’ shown to be promising for reconfigurable computing • wireless may need reconfigurability • ‘Streams’ shown to be promising for media processing • wireless may have similar workloads • Important to understand pros and cons of different methodologies for good wireless architectures • Important to have the right tools