
The Raw Architecture: Signal Processing on a Scalable Composable Computation Fabric


Presentation Transcript


  1. The Raw Architecture: Signal Processing on a Scalable Composable Computation Fabric • David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat, Ben Greenwald, Paul Johnson, Walter Lee, Albert Ma, Nathan Shnidman, Henry Hoffmann, Arvind Saraf, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal • http://www.cag.lcs.mit.edu/raw • MIT Laboratory for Computer Science

  2. Outline • Motivation • Architecture • Raw Prototype • Networks • Signal Processing Applications • Status

  3. Wire Delay and Tiled Architectures • Problem: The number of gates we can reach in one cycle is staying constant, but our chips are getting bigger. • Solutions: • Hide wire-delay latency in the micro-architecture (clustering, hidden communication stalls) • Expose the communication at the instruction-set level and allow the software to exploit locality • Fact 1: The number of transistors is growing • Fact 2: Wires are not getting proportionally faster

  4. Wire Delay and Tiled Architectures • Expose the communication at the instruction-set level and allow the software to exploit locality

  5. Wire Delay and Tiled Architectures • Expose the communication at the instruction-set level and allow the software to exploit locality • Make a tile as big as a signal can travel in one clock cycle, and expose longer communication to the programmer

  6. Wire Delay and Tiled Architectures • Expose the communication at the instruction-set level and allow the software to exploit locality • Make a tile as big as a signal can travel in one clock cycle, and expose longer communication to the programmer

  7. What Are We Building? The Raw Prototype • 16 Replicated Tiles (Processors) • What is in a tile? • 8-stage pipelined MIPS-like 32-bit processor • Pipelined Floating Point Unit • 32 KB Data Cache • 32 KB Instruction Memory • Interconnect Routers

  8. Raw’s Networking Resources • 2 Dynamic Networks • Fire and Forget • Header encodes destination • 2-stage router pipeline • 2 Static Networks • Software-configurable crossbar • Interlocked and flow controlled • 5-stage static router pipeline • 3-cycle nearest-neighbor ALU-to-ALU communication latency • No header overhead, but requires knowledge of communication patterns at compile time
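The "header encodes destination" idea on the dynamic networks can be illustrated with a toy model. This is a minimal sketch, not the Raw router: the 4x4 row-major tile numbering, the X-then-Y (dimension-ordered) routing policy, and the names `GRID_WIDTH`, `tile_to_xy`, and `route` are all assumptions made for illustration; the slides state only that the header carries the destination.

```python
# Toy model of header-based routing on a tiled mesh.
# ASSUMPTIONS (not from the slides): 4x4 grid, row-major tile numbering,
# and dimension-ordered routing (move along X first, then along Y).
GRID_WIDTH = 4


def tile_to_xy(tile):
    """Convert a row-major tile number into (x, y) grid coordinates."""
    return tile % GRID_WIDTH, tile // GRID_WIDTH


def route(src, dst):
    """Return the list of tiles a message visits, including src and dst.

    The only routing information used is the destination tile number,
    which stands in for the message header.
    """
    x, y = tile_to_xy(src)
    dst_x, dst_y = tile_to_xy(dst)
    path = [src]
    # Step along the X dimension until the column matches.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * GRID_WIDTH + x)
    # Then step along the Y dimension until the row matches.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * GRID_WIDTH + x)
    return path


if __name__ == "__main__":
    print(route(0, 15))
```

Because each hop is decided from the header alone, the sender needs no compile-time knowledge of the communication pattern, which is the trade-off the slide contrasts with the static networks.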

  9. Memory Mapped Communication is Not a First Class Citizen • To other tiles, through a memory system that happens to go over a network. [Figure: processor pipeline diagram; stage labels as extracted: E, M1, M2, A, TL, TV, IF, D, RF, F, P, U, F4, WB]

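  10. Raw’s First Class Register-Mapped Communication • Ex: add r26, r25, r24 [Figure: processor pipeline diagram with registers r24, r25, r26, r27 connected to the Network Input FIFOs and Network Output FIFOs; stage labels as extracted: E, M1, M2, A, TL, TV, IF, D, RF, F, P, U, F4, WB]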
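The register-mapped example `add r26, r25, r24` can be mimicked in a toy model. This is a minimal sketch under stated assumptions: the `Tile` class, its method names, and the choice to treat r24 through r27 uniformly as FIFO ports are hypothetical, and an empty input FIFO raises an error here where real hardware would presumably stall.

```python
from collections import deque

# ASSUMPTION: registers 24-27 are the network-mapped registers, as the
# slide's figure suggests. Which register maps to which network is not
# modeled here.
NETWORK_REGS = {24, 25, 26, 27}


class Tile:
    """Hypothetical tile model with register-mapped network FIFOs."""

    def __init__(self):
        self.regs = [0] * 32
        self.net_in = {r: deque() for r in NETWORK_REGS}
        self.net_out = {r: deque() for r in NETWORK_REGS}

    def read(self, reg):
        # Reading a network register consumes one word from its input FIFO.
        if reg in NETWORK_REGS:
            return self.net_in[reg].popleft()  # IndexError if FIFO is empty
        return self.regs[reg]

    def write(self, reg, value):
        # Writing a network register sends one word to its output FIFO.
        if reg in NETWORK_REGS:
            self.net_out[reg].append(value)
        else:
            self.regs[reg] = value

    def add(self, rd, rs, rt):
        """Model of 'add rd, rs, rt'."""
        self.write(rd, self.read(rs) + self.read(rt))


tile = Tile()
tile.net_in[25].append(7)   # word arriving from a neighbor
tile.net_in[24].append(5)   # word arriving from another neighbor
tile.add(26, 25, 24)        # add r26, r25, r24
```

A single instruction receives two operands from the network and sends the result back out, with no separate load, store, or send instruction, which is the contrast with the memory-mapped approach on the previous slide.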

  11. Signal Processing Applications • Problem: Increase performance of Signal Processing in a scalable fashion • Solution: Exploit parallelism in Signal Processing Applications at all levels

  12. Types of Parallelism in Signal Processing • DSP Filter Style • Fine Grain Dataflow • Instruction Level Parallelism • Data Parallel • Thread Level Parallelism (MPI) [Figure labels: Raw; Current Architectures]

  13. Instruction Level Parallelism • RawCC • Maps dataflow graphs across tiles • ILP across Multiprocessor • Heavily Latency sensitive • Single cycle reconfigurable communication

  14. Fine Grain Dataflow • Ex: Pipelined FIR Filter [Figure: inputs xn, xn-1, xn-2, xn-3 multiplied by weights W0, W1, W2, W3 and summed] • Computation: mul, add • Input Operands: xi, l • Output Operands: k • Cycle count by class (First / Second): Compute 2 / 2; Communicate 0 / 3; Overall 2 / 5
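The pipelined FIR filter on this slide can be sketched in software. This is a minimal sketch, assuming a direct-form filter with one multiply and one add per stage; the function name `pipelined_fir` and the delay-line representation are illustrative and do not reflect how the filter is mapped onto Raw tiles.

```python
def pipelined_fir(samples, weights):
    """Compute y[n] = sum_i weights[i] * x[n - i] one sample at a time.

    Each loop iteration over (w, x_delayed) stands for one pipeline
    stage: it takes a delayed input sample and the partial sum from the
    previous stage, and produces a new partial sum with one mul and one
    add.
    """
    # delay_line[i] holds x[n - i]; samples before the start are zero.
    delay_line = [0.0] * len(weights)
    outputs = []
    for x in samples:
        # Shift the newest sample in and drop the oldest.
        delay_line = [x] + delay_line[:-1]
        partial = 0.0
        for w, x_delayed in zip(weights, delay_line):
            partial = partial + w * x_delayed  # one mul, one add per stage
        outputs.append(partial)
    return outputs


if __name__ == "__main__":
    # An impulse input reproduces the weights W0..W3 at the output.
    print(pipelined_fir([1, 0, 0, 0, 0], [1, 2, 3, 4]))
```

In this sketch every stage does the same two operations, so the per-stage cost that varies between implementations is how the operands move from one stage to the next.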

  15. Fine Grain Dataflow • Cycle count by class (First / Second): Compute 2 / 2; Communicate 0 / 3; Overall 2 / 5

  16. DSP Filter Style [Figure: block diagram with off-chip input and off-chip output; blocks labeled Down-Sample, Frequency Domain Filter, and several FFT and FFT-1 stages]

  17. Raw is Composable • Mix and match types of parallelism [Figure labels: White balance, Aliasing filter, White balance, mem, mem, 2-way RawCC Application, 4-way Threaded Java Application, httpd, Zzz]

  18. Raw Status • Stats • IBM SA-27E, 0.15 µm, 6-layer copper • 18.2 mm × 18.2 mm die • 0.122 billion transistors • 2048 KB SRAM on-chip • 1657-pin CCGA package • 1080 HSTL signal I/O operating at core speed • 225 MHz • ~25 Watts

  19. The Raw Performance • 16 OPS/FLOPS per cycle (@225MHz = 3.6 GFLOPS) • 230 Gb/s of on-chip “bisection bandwidth” • 201 Gb/s of off-chip I/O bandwidth • 115 Gb/s of on-chip memory bandwidth
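The peak-throughput figure on this slide follows from the tile count and the clock rate. The arithmetic can be checked with a short sketch; the one-operation-per-tile-per-cycle split is an assumption read from "16 OPS/FLOPS per cycle" across 16 tiles, and the constant and function names are illustrative.

```python
TILES = 16                    # from the prototype description
OPS_PER_TILE_PER_CYCLE = 1    # ASSUMPTION: 16 ops/cycle spread evenly over 16 tiles
CLOCK_HZ = 225_000_000        # 225 MHz


def peak_ops_per_second(tiles, ops_per_tile_per_cycle, clock_hz):
    """Peak operations per second if every tile issues every cycle."""
    return tiles * ops_per_tile_per_cycle * clock_hz


peak = peak_ops_per_second(TILES, OPS_PER_TILE_PER_CYCLE, CLOCK_HZ)
print(f"{peak / 1e9:.1f} GFLOPS peak")
```

This is a peak figure only; it says nothing about sustained throughput on a given application.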

  20. Raw Status • Working: • Cycle Accurate Software Simulator • RTL Simulation • Emulation System • RawCC ILP Compiler • Current: • Verification • Backend Completion • Tapeout December 2001 • Chips Back Summer 2002

  21. Summary • Raw’s First Class communication facilitates exploitation of new forms of parallelism in Signal Processing applications

  22. Extra Slides
