1 / 26

The Raw Processor: A Scalable 32-bit Fabric for General Purpose and Embedded Computing

This presentation introduces the Raw Processor, a scalable 32-bit fabric designed for general purpose and embedded computing. It discusses the abstraction layers and resources that Raw exposes, and explores the coordination of these resources. The Raw Processor offers high performance and low network latency, making it suitable for a variety of applications.

mchenault
Download Presentation

The Raw Processor: A Scalable 32-bit Fabric for General Purpose and Embedded Computing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Raw Processor: A Scalable 32 bit Fabric for General Purpose and Embedded ComputingPresented at Hotchips 13On August 21, 2001byMichael Bedford Taylor (http://cag.lcs.mit.edu/~mtaylor)

  2. The Raw Processor Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat, Ben Greenwald, Paul Johnson,Walter Lee, Albert Ma, Nathan Shnidman, Volker Strumpen, David Wentzlaff, Matt Frank, Saman Amarasinghe, and Anant Agarwal A Scalable 32 bit Fabricfor General Purpose andEmbedded Computing MIT Laboratory for ComputerScience http://cag.lcs.mit.edu/raw

  3. Computer Architecture from 10,000 feet foo(int x) { .. } convenient physical phenomenon class of computation … we use abstractions to make this easier

  4. The Abstraction Layers That Make This Easier foo(int x) { .. } Computation Language / API Compiler / OS ISA Micro Architecture Layout Design Style Design Rules Process Materials Science Physics Fortran IBM 360 /RISC/ Transmeta/x Mead & Conway

  5. Abstractions protect us from change-- but must also change as the world changes Language / API Compiler / OS ISA Micro Architecture Layout Design Style Design Rules Process Materials Science More Resources: Wire Gates Delay Wires Pins Changes in physical constraints

  6. Wire delay is crashing throughthe abstraction layers Language / API Compiler / OS ISA Micro Architecture Layout Design Rules Process Materials Science Partitioning(21264) Pipelining (P4) Timing Driven Placement Fatter wires Deeper wires Cu wires

  7. The future of wire delay handled in micro-architecture DRAM (L4) FPUS L3 L1 spec. control L2 execution core 1 cycle wire

  8. The bottom line Raw handles this change by exposing the underlying resources (e.g. wires) with a scalable, parallel ISA. It orchestrates these resources with spatially-aware compilers. Language / API Compiler / OS ISA Micro Architecture Floorplan / Layout Design Rules Process Materials Science More Resources : Wire Gates Delay Wires Pins

  9. How does Raw expose the resources? We started with a blank sheet of silicon.

  10. Expose the gates Cut the silicon up into an array of 16 identical, programmable tiles.

  11. What’s inside a tile? Tile 8 stage MIPS-like processor pipeline 32 KB IMem 32 KB DCACHE FPU Network Interface Network wires and crossbars

  12. How do we expose the wires? (462 Gb/s @ 225 Mhz) Computation Resources A: Through on-chip networks Registered at input  longest wire = length of tile

  13. DRAM PCI x 2 DRAM PCI x 2 DRAM devices DRAM etc DRAM DRAM How do we expose the pins? 14 7.2 Gb/s channels (201 Gb/s @ 225 Mhz) Routes off the edge of the chip appear on the pins. Gives user direct access to pin bandwidth. raw chipset DRAM

  14. The Raw ISA scales 1 cycle .07u .15u • longest wire • Design complexity • Verification complexity • … are all independent of transistor count. • # tiles, network bandwidth and I/O bandwidth scale • Raw is also backwards-compatible.

  15. Raw Chip (ASIC @225 MHz) 16 OPS/FLOPS per cycle 462 Gb/s of on-chip “bisection bandwidth” 201 Gb/s I/O bandwidth 57 GB/s of on-chip memory bandwidth ... but how are the resources going to be coordinated? How well does Raw expose theresources?

  16. Raw: How we want to use the tiles mem mem mem MPI program 4-way threaded JAVA application Custom Datapath Pipeline httpd

  17. The Raw Tile network support Tile processor Computation Resources 4 32-bit mesh networks 2 static, 2 dynamic 5 stage static router Pipeline 2 stage dynamic router pipeline How does the main pipeline interface to the networks? 64 KB SMem

  18. Memory mapped networks are not first class citizens. To other tiles, through memory system that happens to go over a network. E M1 M2 A TL TV IF D RF F P U F4 WB

  19. Instead, Raw’s networks are tightly coupled into the bypass paths r24 r24 Ex: lb r25, 0x341(r26) r25 r25 r26 r26 r27 r27 Network Input FIFOs Network Output FIFOs E M1 M2 A TL TV IF D RF F P U F4 WB

  20. How the static router works. Goal: flow controlled, in order delivery of operands fadd r5, r3, r24 fmul r24, r3, r4 route P->E route W->P software controlled crossbar software controlled crossbar

  21. tmp0 = (seed*3+2)/2 tmp1 = seed*v1+2 tmp2 = seed*v2 + 2 tmp3 = (seed*6+2)/3 v2 = (tmp1 - tmp3)*5 v1 = (tmp1 + tmp2)*3 v0 = tmp0 - v1 v3 = tmp3 - v2 RawCC Operation:Parallelizes C codeonto static network seed.0=seed pval5=seed.0*6.0 pval1=seed.0*3.0 pval4=pval5+2.0 Black arrows = Static Network pval0=pval1+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6 tmp0.1=pval0/2.0 v3.10=tmp3.6-v2.7 tmp0=tmp0.1 v3=v3.10 v1.2=v1 v2.4=v2 seed.0=seed pval2=seed.0*v1.2 pval3=seed.o*v2.4 tmp2.5=pval3+2.0 tmp1.3=pval2+2.0 pval1=seed.0*3.0 v1.2=v1 tmp2=tmp2.5 v2.4=v2 tmp1=tmp1.3 pval5=seed.0*6.0 pval6=tmp1.3-tmp2.5 pval7=tmp1.3+tmp2.5 pval2=seed.0*v1.2 pval0=pval1+2.0 pval3=seed.o*v2.4 v2.7=pval6*5.0 pval4=pval5+2.0 v1.8=pval7*3.0 tmp1.3=pval2+2.0 tmp0.1=pval0/2.0 v2=v2.7 tmp2.5=pval3+2.0 v0.9=tmp0.1-v1.8 tmp3.6=pval4/3.0 tmp1=tmp1.3 v1=v1.8 v0=v0.9 tmp0=tmp0.1 tmp2=tmp2.5 pval7=tmp1.3+tmp2.5 tmp3=tmp3.6 pval6=tmp1.3-tmp2.5 Low Network latency important. v1.8=pval7*3.0 v2.7=pval6*5.0 v0.9=tmp0.1-v1.8 v1=v1.8 v3.10=tmp3.6-v2.7 v0=v0.9 v2=v2.7 v3=v3.10

  22. Raw Stats IBM SA-27E .15u 6L Cu 18.2mm x 18.2mm die. .122 Billion Transistors 16 Tiles 2048 KB SRAM Onchip 1657 Pin CCGA Package (1080 HSTL signal IO) ~225 MHz ~25 Watts For architectural details, see: http://cag.lcs.mit.edu/pub/raw/documents/RawSpec99.pdf

  23. Raw Board with IKOS logic emulation Currently in timing closure and verification. Tape out: Q4 2001

  24. Enabler: The Raw NetworksThe Raw ISA treats the networks as firstclass citizens, just like registers: software managed, bypassed, encoding space in every instructionStatic Network: 1. routes compiled into static router SMEM 2. Messages arrive in known orderLatency: 2 + # hops Throughput: 1 word/cycle per dir. per network

  25. Summary Raw exposes wire delay at the ISA level. This allows the compiler to explicitly manage gates in a scalable fashion. Raw provides a direct, parallel interface to all of the chip resources: gates, wires, and pins. Raw enables the use of these gates by providing tightly coupled network communication mechanisms in the ISA.

More Related