1 / 34

The Raw Tiled Processor Architecture Is the future of architecture in tiles?

The Raw Tiled Processor Architecture Is the future of architecture in tiles?. Anant Agarwal MIT. http://www.cag.csail.mit.edu/raw. A tiled processor architecture prototype: the Raw microprocessor. October 02. Embedded system: 1020 Element Microphone Array. 1020 Node Beamformer Demo.

melita
Download Presentation

The Raw Tiled Processor Architecture Is the future of architecture in tiles?

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Raw Tiled Processor ArchitectureIs the future of architecture in tiles? Anant Agarwal MIT http://www.cag.csail.mit.edu/raw

  2. A tiled processor architecture prototype: the Raw microprocessor October 02

  3. Embedded system:1020 Element Microphone Array

  4. 1020 Node Beamformer Demo • 2 People moving about and talking • Track movement of people using camera (vision group) and display on monitor • Beamformer focuses on speech of one person • Select another person using mouse • Beamformer switches focus to the speech of the other person

  5. The opportunity 20MIPS cpu 1987

  6. The opportunity 2007 The billion transistor chip

  7. Enables seeking out new application domains • Redefine our notion of a “general purpose processor” • Imagine a single-chip handheld that is a speech driven cellphone, camera, PDA, MP3 player, video engine • Imagine a single-chip PC that is also a 10G router, wireless access point, graphics engine • While running the gamut of existing desktop binaries

  8. Other… Encryption Sound Ethernet Wireless X86 Pentium IV Graphics But, where is general purpose computing today? ASICs 0.25TFLOPS

  9. mem mem mem mem How does the ASIC do it? mem • Lots of ALUs, lots of registers, lots of small memories • Hand-routed, short wires • Lower power (everything close by) • Stream data model for high throughput But, not general purpose

  10. Our challenge • How to exploit ASIC-like features • Lots of resources like ALUs and memories • Application-specific routing of short wires • While being “general purpose” • Programmable • And even running ILP-based sequential programs One Approach: Tiled Processor Architecture (TPA)

  11. Tiled Processor Architecture (TPA) IMEM DMEM PC REG FPU ALU • Lots of ALUs and regs SMEM • Short, prog. wires PC SWITCH • Lower power Tile Programmable. Supports ILP and Streams

  12. Raw Switch SMEM A Raw Tile PC IMEM DMEM PC REG FPU ALU SMEM PC SWITCH Packet stream Disk stream Video1 RDRAM A Prototype TPA: The Raw Microprocessor [Billion transistor IEEE Computer Issue ’97] The Raw Chip Tile Software-scheduled interconnects (can use static or dynamic routing – but compiler determines instruction placement and routes)

  13. A Raw Tile IMEM DMEM r24 r24 r25 r25 PC REG r26 0-cycle “local bypass network” r26 r27 r27 FPU Output FIFOs to Static Router Input FIFOs from Static Router ALU E M1 M2 A TL TV IF D RF WB F P U F4 SMEM The Raw Chip PC Point-to-point bypass-integrated compiler-orchestrated on-chip networks SWITCH Tile Packet stream Disk stream Video1 RDRAM Tight integration of interconnect

  14. How to “program the wires” Tile 11 Tile 10 fadd r5, r3, r25 fmul r24, r3, r4 route P->E route W->P software controlled crossbar software controlled crossbar

  15. mem mem mem MPI program C program ILP computation Custom Datapath Pipeline httpd Zzzz The result of orchestrating the wires

  16. Perspective We have replaced Bypass paths, ALU-reg bus, FPU-Int. bus, reg-cache-bus, cache-mem bus, etc. With a general, point-to-point, routed interconnect called: Scalar operand network (SON) Fundamentally new kind of network optimized for both scalar and stream transport

  17. Programming models and software for tiled processor architectures • Conventional scalar programs (C, C++, Java) Or, how to do ILP • Stream programs

  18. Scalar (ILP) program mapping E.g., Start with a C program, and several transformations later: v2.4 = v2 seed.0 = seed v1.2 = v1 pval1 = seed.0 * 3.0 pval0 = pval1 + 2.0 tmp0.1 = pval0 / 2.0 pval2 = seed.0 * v1.2 tmp1.3 = pval2 + 2.0 pval3 = seed.0 * v2.4 tmp2.5 = pval3 + 2.0 pval5 = seed.0 * 6.0 pval4 = pval5 + 2.0 tmp3.6 = pval4 / 3.0 pval6 = tmp1.3 - tmp2.5 v2.7 = pval6 * 5.0 pval7 = tmp1.3 + tmp2.5 v1.8 = pval7 * 3.0 v0.9 = tmp0.1 - v1.8 v3.10 = tmp3.6 - v2.7 tmp2 = tmp2.5 v1 = v1.8; tmp1 = tmp1.3 v0 = v0.9 tmp0 = tmp0.1 v3 = v3.10 tmp3 = tmp3.6 v2 = v2.7 Existing languages will work Lee, Amarasinghe et al, “Space-time scheduling”, ASPLOS ‘98

  19. seed.0=seed pval1=seed.0*3.0 v1.2=v1 v2.4=v2 pval5=seed.0*6.0 pval2=seed.0*v1.2 pval0=pval1+2.0 pval3=seed.o*v2.4 pval4=pval5+2.0 tmp1.3=pval2+2.0 tmp0.1=pval0/2.0 tmp2.5=pval3+2.0 tmp3.6=pval4/3.0 tmp1=tmp1.3 tmp0=tmp0.1 tmp2=tmp2.5 pval7=tmp1.3+tmp2.5 tmp3=tmp3.6 pval6=tmp1.3-tmp2.5 v1.8=pval7*3.0 v2.7=pval6*5.0 v0.9=tmp0.1-v1.8 v1=v1.8 v3.10=tmp3.6-v2.7 v0=v0.9 v2=v2.7 v3=v3.10 Scalar program mapping v2.4 = v2 seed.0 = seed v1.2 = v1 pval1 = seed.0 * 3.0 pval0 = pval1 + 2.0 tmp0.1 = pval0 / 2.0 pval2 = seed.0 * v1.2 tmp1.3 = pval2 + 2.0 pval3 = seed.0 * v2.4 tmp2.5 = pval3 + 2.0 pval5 = seed.0 * 6.0 pval4 = pval5 + 2.0 tmp3.6 = pval4 / 3.0 pval6 = tmp1.3 - tmp2.5 v2.7 = pval6 * 5.0 pval7 = tmp1.3 + tmp2.5 v1.8 = pval7 * 3.0 v0.9 = tmp0.1 - v1.8 v3.10 = tmp3.6 - v2.7 tmp2 = tmp2.5 v1 = v1.8; tmp1 = tmp1.3 v0 = v0.9 tmp0 = tmp0.1 v3 = v3.10 tmp3 = tmp3.6 v2 = v2.7 Graph Program code

  20. seed.0=seed pval1=seed.0*3.0 v1.2=v1 v2.4=v2 pval5=seed.0*6.0 pval2=seed.0*v1.2 pval0=pval1+2.0 pval3=seed.o*v2.4 pval4=pval5+2.0 tmp1.3=pval2+2.0 tmp0.1=pval0/2.0 tmp2.5=pval3+2.0 tmp3.6=pval4/3.0 tmp1=tmp1.3 tmp0=tmp0.1 tmp2=tmp2.5 pval7=tmp1.3+tmp2.5 tmp3=tmp3.6 pval6=tmp1.3-tmp2.5 v1.8=pval7*3.0 v2.7=pval6*5.0 v0.9=tmp0.1-v1.8 v1=v1.8 v3.10=tmp3.6-v2.7 v0=v0.9 v2=v2.7 v3=v3.10 Program graph clustering

  21. seed.0=seed pval5=seed.0*6.0 pval1=seed.0*3.0 pval4=pval5+2.0 tmp3=tmp3.6 pval0=pval1+2.0 tmp3.6=pval4/3.0 tmp0.1=pval0/2.0 tmp0=tmp0.1 v3.10=tmp3.6-v2.7 v3=v3.10 v1.2=v1 v2.4=v2 pval2=seed.0*v1.2 pval3=seed.o*v2.4 tmp1.3=pval2+2.0 tmp2.5=pval3+2.0 pval7=tmp1.3+tmp2.5 tmp1=tmp1.3 pval6=tmp1.3-tmp2.5 v0.9=tmp0.1-v1.8 tmp2=tmp2.5 v1.8=pval7*3.0 v2.7=pval6*5.0 v0=v0.9 v2=v2.7 v1=v1.8 Placement Tile2 Tile1 Tile4 Tile3

  22. Routing route(W,S,t) seed.0=seed seed.0=recv() send(seed.0) route(W,S) pval5=seed.0*6.0 route(t,E,S) Tile2 pval1=seed.0*3.0 pval4=pval5+2.0 Tile1 pval0=pval1+2.0 tmp3.6=pval4/3.0 route(t,E) route(S,t) tmp0.1=pval0/2.0 tmp3=tmp3.6 send(tmp0.1) v2.7=recv() tmp0=tmp0.1 v3.10=tmp3.6-v2.7 v3=v3.10 v1.2=v1 seed.0=recv() route(N,t) v2.4=v2 pval2=seed.0*v1.2 seed.0=recv(0) route(N,t) tmp1.3=pval2+2.0 pval3=seed.o*v2.4 send(tmp1.3) Tile4 tmp2.5=pval3+2.0 tmp1=tmp1.3 route(t,W) tmp2=tmp2.5 Tile3 tmp2.5=recv() route(W,t) send(tmp2.5) pval7=tmp1.3+tmp2.5 route(t,E) v1.8=pval7*3.0 tmp1.3=recv() route(E,t) v1=v1.8 pval6=tmp1.3-tmp2.5 route(N,t) tmp0.1=recv() v2.7=pval6*5.0 route(W,N) v0.9=tmp0.1-v1.8 Send(v2.7) route(t,E) v0=v0.9 v2=v2.7 Processor code Switch code

  23. time Instruction Scheduling v1.2=v1 seed.0=seed v2.4=v2 send(seed.0) pval1=seed.0*3.0 route(t,E) route(t,E) route(N,t) route(W,t) seed.0=recv(0) seed.0=recv() route(W,S) pval3=seed.o*v2.4 pval5=seed.0*6.0 route(N,t) seed.0=recv() pval0=pval1+2.0 pval2=seed.0*v1.2 tmp0.1=pval0/2.0 pval4=pval5+2.0 tmp2.5=pval3+2.0 tmp3.6=pval4/3.0 tmp1.3=pval2+2.0 tmp2=tmp2.5 send(tmp2.5) send(tmp1.3) route(t,E) send(tmp0.1) route(t,W) tmp3=tmp3.6 tmp1=tmp1.3 tmp0=tmp0.1 route(E,t) route(t,E) tmp1.3=recv() route(W,t) route(W,S) tmp2.5=recv() pval6=tmp1.3-tmp2.5 route(W,S) pval7=tmp1.3+tmp2.5 route(N,t) v2.7=pval6*5.0 v1.8=pval7*3.0 Send(v2.7) v1=v1.8 v2=v2.7 route(t,E) tmp0.1=recv() route(W,N) v0.9=tmp0.1-v1.8 route(W,N) route(S,t) v0=v0.9 v2.7=recv() v3.10=tmp3.6-v2.7 v3=v3.10 Tile3 Tile1 Tile4 Tile2

  24. FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter StreamIt: Stream Language and Compiler Amarasinghe et al. e.g., BeamFormer class BeamFormer extends Pipeline { void init(numChannels, numBeams) { add(new SplitJoin() { void init() { setSplitter(Duplicate()); for (int i=0; i<numChannels; i++) { add(new FIR1(N1)); add(new FIR2(N2)); } setJoiner(RoundRobin()); }}); } add(new SplitJoin() { void init() { setSplitter(Duplicate()); for (int i=0; i<numBeams; i++) { add(new VectorMult()); add(new FIR3(N3)); add(new Magnitude()); add(new Detect()); } setJoiner(Null()); }}); } Splitter Joiner Splitter Vector Mult Vector Mult Vector Mult Vector Mult FirFilter FirFilter FirFilter FirFilter Magnitude Magnitude Magnitude Magnitude Detector Detector Detector Detector Joiner

  25. Raw Beamformer Layout (by hand) mem mem mem mem + + + + + + + + + + + + mem Control mem FRM mem mem mem mem mem mem FIR Search

  26. Raw die photo .18 micron process, 16 tiles, 425MHz, 18 Watts (vpenta) Of course, custom IC designed by industrial design team could do much better

  27. Raw motherboard

  28. More Experimental Systems Systems Online or in Pipeline Raw Workstation Raw-based 1020 Microphone Array Raw 802.11a/g wireless system (collab with Engim) Raw Gigabit IP router Raw graphics system Raw supercomputing fabric

  29. Empirical Evaluation Compare to P3 implemented in similar technology Raw #s from cycle-accurate simulator validated against real chip -- FPGA mem controller in Raw -- Raw SW i-caching adds 0-30% ovhd

  30. ~10x parallelism ~ 4x ld/store elim ~ 4x stream mem bw Performance Results Performance Architecture Space [ISCA’04]

  31. Raw IP Router Gb/sec

  32. VersaBench www.cag.csail.mit.edu/versabench Sharing a benchmark set to stress versatility of processors Categories of programs: ILP – Desktop and Scientific Streams Throughput oriented servers Bit-level embedded

  33. Summary Raw: single chip for ILP and streaming Scalar operand network is key to ILP and streamsTiled processor chip demonstrated in 2002 www.cag.csail.mit.edu/raw

More Related