Interconnect-Oriented Architecture and Circuits

William J. Dally, Computer Systems Laboratory, Stanford University. February 12, 1998.

Presentation Transcript


  1. Interconnect-Oriented Architecture and Circuits. William J. Dally, Computer Systems Laboratory, Stanford University. February 12, 1998

  2. On-chip wires
     [Figure: minimum-width wire in a 0.35 µm process, with a length scale marked at 0.0 mm, 2.5 mm, 5.0 mm, 7.5 mm, and 10.0 mm]

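  3. On-chip wires are getting slower
     Scaling linear dimensions by $s$ (factors shown for $s = 0.5$):
     • $x_2 = s\,x_1$ (0.5x)
     • $R_2 = R_1 / s^2$ (4x)
     • $C_2 = C_1$ (1x)
     • $t_{w2} = R_2 C_2 y^2 = t_{w1} / s^2$ (4x)
     • $t_{w2} / t_{g2} = t_{w1} / (t_{g1} s^3)$ (8x)
     • $v = 0.5\,(t_g R C)^{-1/2}$ (m/s), so $v_2 = v_1 s^{1/2}$ (0.7x)
     • $v\,t_g = 0.5\,(t_g / RC)^{1/2}$ (m/gate), so $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35x)
     [Figure: a wire of length $y$ with cross-section dimension $x_1$ scaled to $x_2$ and delay $t_w = RCy^2$; a repeated wire built from segments of delay $RCy^2$ separated by gates of delay $t_g$]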
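The scaling factors on this slide can be checked numerically. The following is a minimal sketch, not part of the original talk: the function name and dictionary keys are made up for illustration, and it assumes that gate delay scales as $t_{g2} = s\,t_{g1}$, which is what the slide's ratio $t_{w2}/t_{g2} = t_{w1}/(t_{g1} s^3)$ implies when combined with $t_{w2} = t_{w1}/s^2$.

```python
def wire_scaling_factors(s: float) -> dict:
    """Return new/old ratios for a wire of fixed length y when linear
    dimensions are scaled by s (hypothetical helper, for illustration)."""
    resistance = 1.0 / s**2      # R2 = R1 / s^2 (per unit length)
    capacitance = 1.0            # C2 = C1 (per unit length)
    gate_delay = s               # assumption: tg2 = s * tg1
    wire_delay = resistance * capacitance  # tw = R C y^2 at fixed y
    # v = 0.5 (tg R C)^(-1/2); the constant 0.5 cancels in the ratio.
    velocity = (gate_delay * resistance * capacitance) ** -0.5
    return {
        "R": resistance,
        "C": capacitance,
        "tw": wire_delay,
        "tw_over_tg": wire_delay / gate_delay,
        "v": velocity,                 # metres per second
        "v_tg": velocity * gate_delay, # metres per gate delay
    }


if __name__ == "__main__":
    for name, factor in wire_scaling_factors(0.5).items():
        print(f"{name}: {factor:.2f}x")
```

Calling the function with other values of `s` shows how quickly the distance reachable in one gate delay shrinks, since it falls as $s^{3/2}$.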

  4. Technology scaling makes communication the scarce resource
     • 1998: 0.35 µm process; 64 Mb DRAM; 16 64b FP processors; 400 MHz; 18 mm chip; 12,000 tracks; 1 clock; repeaters every 3 mm
     • 2008: 0.10 µm process; 4 Gb DRAM; 1K 64b FP processors; 2.5 GHz; 32 mm chip; 90,000 tracks; 20 clocks; repeaters every 0.4 mm
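A little arithmetic on the slide's figures makes the trend concrete. This sketch rests on two assumptions that the transcript does not state outright: that 18 mm and 32 mm are chip edge lengths, and that the track counts are wiring tracks across one such edge. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipGeneration:
    year: int
    chip_edge_mm: float         # assumed to be the chip edge length
    tracks: int                 # assumed to be tracks across one edge
    repeater_spacing_mm: float

    def track_pitch_um(self) -> float:
        """Implied wiring pitch in micrometres."""
        return self.chip_edge_mm * 1000.0 / self.tracks

    def repeaters_per_crossing(self) -> int:
        """Repeater stages on a wire spanning the whole chip edge."""
        return round(self.chip_edge_mm / self.repeater_spacing_mm)


GEN_1998 = ChipGeneration(1998, 18.0, 12_000, 3.0)
GEN_2008 = ChipGeneration(2008, 32.0, 90_000, 0.4)

if __name__ == "__main__":
    for gen in (GEN_1998, GEN_2008):
        print(gen.year,
              f"pitch={gen.track_pitch_um():.3f} um",
              f"repeaters/crossing={gen.repeaters_per_crossing()}")
```

Under these assumptions the number of repeater stages on a cross-chip wire grows by more than an order of magnitude between the two generations, which fits the slide's jump from 1 clock to 20 clocks.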

  5. Architecture Must Evolve to Fit the Landscape
     • Global operations (20 clocks): low bandwidth, high latency, and high power
     • Local, parallel operations (90,000 tracks): high bandwidth, low latency, and low power

  6. Architecture Today Depends on Fast Global Communication
     • All instructions are issued from a single global instruction unit
     • All data passes through a global register file
     • This won’t work when global accesses cost 20 clocks of latency
     [Figure: a single instruction unit (I-Unit) and register file (Regs)]

  7. Tomorrow’s Architectures Must Exploit Locality and Expose Communication
     • Multiple elements (clusters) with
       • local instruction dispatch
       • local register files co-located with arithmetic elements
     • Explicit communication between elements through a switch or network
     • Fast synchronization between instruction units
     [Figure: four clusters, each with a register file (Regs) and an instruction unit (IU), connected by a switch]

  8. Multi-ALU Processor Chip

  9. Crafted-Cell Design
     [Figure: area and performance comparison of full-custom, crafted-cell, and standard-cell design, with labels IRRDP and ADDSUB; values shown are 1.0x, 1.11x, 2.23x; 1.0x, 1.17x, 2.7x; and 1x, 1.64x, 5.25x; cell counts shown are 80, 7, and 17 different cells]
     Results courtesy of Andrew Chang

  10. Interconnect: repeaters with switching
     • Need repeaters every 1 mm or less
     • Easy to insert switching
       • zero-cost reconfiguration
     • Can’t afford decision time
       • static routing
       • fixed or regular pattern
       • source routing
       • on-demand
       • requires arbitration and fanout
     • Queuing and flow-control
     • Pipelining control
     [Figure: repeaters at 1 mm spacing with switching, labeled Arb and LUT]
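One way to picture source routing without per-hop decision time is sketched below. Everything here is an illustrative assumption rather than the design from the talk: switching repeaters are modeled as points on a grid, the sender precomputes the whole path in dimension order (x first, then y), and each repeater merely consumes the next direction from the route header.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # hypothetical (x, y) position of a switching repeater

# Fixed mapping from a header symbol to the neighbouring repeater.
MOVES: Dict[str, Node] = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def source_route(src: Node, dst: Node) -> List[str]:
    """Sender-side path computation: all x hops first, then all y hops."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    x_hops = ["E" if dx > 0 else "W"] * abs(dx)
    y_hops = ["N" if dy > 0 else "S"] * abs(dy)
    return x_hops + y_hops


def traverse(src: Node, route: List[str]) -> List[Node]:
    """Follow a precomputed route. Each hop only reads one header symbol,
    so no routing decision is made inside the network."""
    x, y = src
    visited = [src]
    for direction in route:
        step_x, step_y = MOVES[direction]
        x, y = x + step_x, y + step_y
        visited.append((x, y))
    return visited


if __name__ == "__main__":
    route = source_route((0, 0), (2, 1))
    print(route, traverse((0, 0), route))
```

A statically routed design would go one step further and fix the direction at each repeater ahead of time, so that not even a header needs to be read.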

  11. Bandwidth Hierarchy
     • Provide lots of bandwidth where it’s inexpensive
       • short wires between ALUs
     • Moderate bandwidth with intermediate cost
       • local RAM associated with each ALU cluster
     • Low bandwidth where it’s expensive
       • global RAM with long wires
     • Very low bandwidth off chip
     [Figure: four ALU clusters, each with a local RAM, sharing a global on-chip RAM; wire lengths are labeled local 1 mm, medium 4 mm, and global 30 mm, plus an off-chip connection]

  12. Bandwidth Hierarchy
     • A key problem is to match the demands of an application to the bandwidth available at each level of the hierarchy
     • Casting applications in a streaming model exposes much of the locality necessary to exploit the hierarchy
     [Figure: four ALU clusters, each with a local RAM, sharing a global on-chip RAM]
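The matching problem described on this slide can be sketched as a comparison of demand against supply at each level. The level names and all bandwidth numbers below are invented for illustration; the talk gives no such figures.

```python
from typing import Dict, Tuple


def bottleneck(available: Dict[str, float],
               demand: Dict[str, float]) -> Tuple[str, float]:
    """Return the hierarchy level with the highest demand/available ratio.
    A ratio above 1.0 means the application oversubscribes that level."""
    ratios = {level: demand[level] / available[level] for level in available}
    worst = max(ratios, key=ratios.get)
    return worst, ratios[worst]


# Hypothetical bandwidths in words per cycle, highest near the ALUs.
AVAILABLE = {"local_wires": 64.0, "cluster_ram": 16.0,
             "global_ram": 4.0, "off_chip": 1.0}

# Hypothetical application demand at each level, in the same units.
DEMAND = {"local_wires": 48.0, "cluster_ram": 8.0,
          "global_ram": 6.0, "off_chip": 0.5}

if __name__ == "__main__":
    level, ratio = bottleneck(AVAILABLE, DEMAND)
    print(f"most constrained level: {level} ({ratio:.2f}x of capacity)")
```

In this made-up case the global RAM is oversubscribed, so the application would need restructuring to move more of its traffic down to the cluster RAM or the local wires, which is the kind of locality a streaming formulation is meant to expose.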

  13. Architecture Research Issues
     • Processor architecture
       • configuration of ALUs: clustered vs distributed
       • method for controlling ALUs: distributed control, VLIW, SIMD
       • communication-aware instruction sets: how to hide details while exposing communication
     • Memory architecture
       • methods for exploiting 2D spatial locality
       • communication-aware cache organizations
     • Communication architecture
       • on-chip interconnection networks
       • the use of repeaters with switching
       • the use of hierarchy and selective ‘fat’ wires

  14. Circuit Challenges of Slow Interconnect
     • The clock cycle is dominated by wire delay
       • novel circuits to improve effective signal velocity
     • Power is largely used to drive wires
       • low-swing on-chip signaling methods
       • reject rather than overpower noise
     • It’s difficult to distribute a global clock
       • locally synchronous design methods
       • fast synchronizers
       • no wait for metastable decay

  15. Overdrive gives 3x improvement in RC wire latency

  16. Low-Swing Overdrive Signaling
     [Figure: waveforms labeled 1 V swing at source, 300 mV swing at receiver, and recovered signal]

  17. Conclusion: Exploit, Don’t Fight, the Technology
     • Interconnect is rapidly dominating the delay, power, and area of ICs
     • Traditional architectures rely on global communication
       • they are ill-suited for an interconnect-dominated technology
     • Emerging architectures expose communication and exploit locality
       • distributed register files and instruction dispatch
       • bandwidth hierarchy
     • Novel circuits can mitigate effects of slow wires
       • overdrive, low-swing signaling, locally synchronous design
