Interconnect-Oriented Architecture and Circuits

William J. Dally, Computer Systems Laboratory, Stanford University, February 12, 1998

On-chip wires
[Figure: minimum-width wire in a 0.35µm process, at 0.0mm, 2.5mm, 5.0mm, 7.5mm, and 10.0mm]

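On-chip wires are getting slower
Scaling linear dimensions by s = 0.5, with wire length y held fixed:
• $x_2 = s\,x_1$ (0.5x)
• $R_2 = R_1/s^2$ (4x)
• $C_2 = C_1$ (1x)
• $t_{w2} = R_2 C_2 y^2 = t_{w1}/s^2$ (4x)
• $t_{w2}/t_{g2} = t_{w1}/(t_{g1} s^3)$ (8x)
• $v = 0.5\,(t_g R C)^{-1/2}$ (m/s), so $v_2 = v_1 s^{1/2}$ (0.7x)
• $v\,t_g = 0.5\,(t_g/RC)^{1/2}$ (m/gate), so $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35x)
[Figure labels: wire length y, widths $x_1$ and $x_2$; wire delay $t_w = RCy^2$; three wire segments of delay $RCy^2$ and three gates of delay $t_g$]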
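
The multipliers on this slide can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `scale_wire` is hypothetical, and it assumes gate delay scales as t_g2 = s·t_g1, which is what the slide's ratio t_w2/t_g2 = t_w1/(t_g1 s^3) implies.

```python
def scale_wire(s: float) -> dict:
    """Return new/old ratios when linear dimensions shrink by s.

    Wire length y is held fixed, as on the slide.
    """
    r = 1 / s**2   # resistance per unit length: R2 = R1 / s^2
    c = 1.0        # capacitance per unit length: C2 = C1
    tw = r * c     # wire delay tw = R*C*y^2 with y fixed
    tg = s         # gate delay (assumption implied by the slide's tw/tg ratio)
    return {
        "x": s,
        "R": r,
        "C": c,
        "tw": tw,
        "tw/tg": tw / tg,
        "v": (tg * r * c) ** -0.5,       # v = 0.5 * (tg*R*C)^(-1/2)
        "v*tg": (tg / (r * c)) ** 0.5,   # v*tg = 0.5 * (tg/(R*C))^(1/2)
    }


factors = scale_wire(0.5)
for name, ratio in factors.items():
    print(f"{name:6s} {ratio:.2f}x")
```

The constant 0.5 in the velocity formulas cancels in every ratio, which is why it does not appear in the code.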

Technology scaling makes communication the scarce resource
• 1998: 0.35µm, 64Mb DRAM, 16 64b FP Proc, 400MHz; 18mm, 12,000 tracks, 1 clock, repeaters every 3mm
• 2008: 0.10µm, 4Gb DRAM, 1K 64b FP Proc, 2.5GHz; 32mm, 90,000 tracks, 20 clocks, repeaters every 0.4mm
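
As a rough cross-check, the 2008 figures can be estimated from the 1998 ones using the wire-velocity scaling rule v_2 = v_1 s^{1/2}. The sketch below makes several assumptions that are not stated on the slide: that 18mm and 32mm are the distances a global wire must span, that "1 clock" and "20 clocks" are the times to cross that distance, and that repeaters sit at the spacing sqrt(t_g/RC) that maximizes l/(RC·l^2 + t_g), which scales as s^{3/2}.

```python
import math

# Figures read off the slide (1998, 2008).
feature_um = (0.35, 0.10)   # process feature size
span_mm = (18.0, 32.0)      # assumed: distance a global wire must cross
clock_ghz = (0.4, 2.5)
repeater_mm_1998 = 3.0
clocks_1998 = 1.0

s = feature_um[1] / feature_um[0]   # linear scale factor

# Repeated-wire velocity scales as s**0.5, so crossing time scales as
# (distance ratio) / sqrt(s); counting in clocks multiplies by the clock ratio.
crossing_time_ratio = (span_mm[1] / span_mm[0]) / math.sqrt(s)
clocks_2008 = clocks_1998 * crossing_time_ratio * (clock_ghz[1] / clock_ghz[0])

# Assumed optimal repeater spacing sqrt(tg / RC) scales as s**1.5.
repeater_mm_2008 = repeater_mm_1998 * s ** 1.5

print(f"s = {s:.3f}")
print(f"clocks to cross in 2008: about {clocks_2008:.1f}")
print(f"repeater spacing in 2008: about {repeater_mm_2008:.2f} mm")
```

Under these assumptions the estimates come out near 21 clocks and 0.46mm, close to the 20 clocks and 0.4mm quoted on the slide.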

Architecture Must Evolve to Fit the Landscape
• Global operations (20 clocks): low bandwidth, high latency and high power
• Local, parallel operations (90,000 tracks): high bandwidth, low latency and low power

Architecture Today Depends on Fast Global Communication
• All instructions are issued from a single global instruction unit
• All data passes through a global register file
• This won’t work when global accesses cost 20 clocks of latency
[Figure: a single instruction unit (I-Unit) and register file (Regs)]

Tomorrow’s Architectures Must Exploit Locality and Expose Communication
• Multiple elements (clusters) with
  • local instruction dispatch
  • local register files co-located with arithmetic elements
• Explicit communication between elements through a switch or network
• Fast synchronization between instruction units
[Figure: four clusters, each with its own register file (Regs) and instruction unit (IU), connected by a switch]

Multi-ALU Processor Chip

Crafted-Cell Design
Design styles compared: full-custom, crafted-cell, standard-cell
• Area: 1.0x, 1.11x, 2.23x and 1.0x, 1.17x, 2.7x (blocks: IRRDP, ADDSUB)
• 80 different cells, 7 different cells, 17 different cells
• Performance: 1x, 1.64x, 5.25x
Results courtesy of Andrew Chang

Interconnect: repeaters with switching
• Need repeaters every 1mm or less
• Easy to insert switching
  • zero-cost reconfiguration
• Can’t afford decision time
  • static routing
  • fixed or regular pattern
  • source routing
  • on-demand
  • requires arbitration and fanout
• Queuing and flow-control
• Pipelining control
[Figure: repeaters at 1mm spacing, with an arbiter (Arb) and a LUT]
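
To illustrate why source routing avoids decision time at each switch, here is a minimal sketch in which the sender precomputes the whole route and every switch simply consumes the next port selector. The topology, the port names, and the fixed-priority arbiter are all hypothetical; the slides do not specify a network or an arbitration policy.

```python
from collections import deque


def follow_route(route, start, links):
    """Follow a source route: each switch pops the next port selector."""
    node = start
    hops = deque(route)
    path = [node]
    while hops:
        port = hops.popleft()       # no lookup or decision, just a selector
        node = links[node][port]
        path.append(node)
    return path


def arbitrate(requests):
    """Fixed-priority arbiter (assumed policy): lowest input name wins a port."""
    grants = {}
    for inp in sorted(requests):
        grants.setdefault(requests[inp], inp)
    return grants


# Hypothetical 2x2 mesh: node -> {output port: neighbour}
links = {
    "A": {"E": "B", "S": "C"},
    "B": {"W": "A", "S": "D"},
    "C": {"E": "D", "N": "A"},
    "D": {"W": "C", "N": "B"},
}

path = follow_route(["E", "S"], "A", links)
grants = arbitrate({"in0": "E", "in1": "E", "in2": "S"})
print(path)    # ['A', 'B', 'D']
print(grants)  # {'E': 'in0', 'S': 'in2'}
```

The arbiter is needed only in the on-demand case, where two inputs may request the same output in the same cycle; with a static, fixed pattern the conflict never arises.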

Bandwidth Hierarchy
• Provide lots of bandwidth where it’s inexpensive
  • short wires between ALUs
• Moderate bandwidth with intermediate cost
  • local RAM associated with each ALU cluster
• Low bandwidth where it’s expensive
  • Global RAM with long wires
• Very low bandwidth off chip
[Figure: four ALU clusters, each with local RAM, sharing a global on-chip RAM; wire lengths: local 1mm, medium 4mm, global 30mm, then off chip]

Bandwidth Hierarchy
• A key problem is to match the demands of an application to the bandwidth available at each level of the hierarchy
• Casting applications in a streaming model exposes much of the locality necessary to exploit the hierarchy
[Figure: four ALU clusters, each with local RAM, sharing a global on-chip RAM]
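
One way to see how a streaming model exposes locality is to count traffic to the global RAM for a chain of kernels. The sketch below is a deliberately simplified model, not taken from the slides: it assumes each kernel reads one word and writes one word per element, and the function name `global_words` is hypothetical.

```python
def global_words(n_elements: int, n_kernels: int, streamed: bool) -> int:
    """Words moved to or from global RAM for a chain of kernels.

    Assumed model: one word in and one word out per element per kernel.
    """
    if streamed:
        # Intermediate results stay in local RAM or registers; only the
        # first input and the final output cross the long global wires.
        return 2 * n_elements
    # Kernel-at-a-time: every kernel reads from and writes to global RAM.
    return 2 * n_elements * n_kernels


unstreamed = global_words(1024, 4, streamed=False)
streamed = global_words(1024, 4, streamed=True)
print(unstreamed, streamed, unstreamed // streamed)  # 8192 2048 4
```

In this model the global traffic drops by the number of kernels in the chain, and the demand that remains is shifted to the cheaper local levels of the hierarchy.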

Architecture Research Issues
• Processor architecture
  • configuration of ALUs: clustered vs. distributed
  • method for controlling ALUs: distributed control, VLIW, SIMD
  • communication-aware instruction sets: how to hide details while exposing communication
• Memory architecture
  • methods for exploiting 2D spatial locality
  • communication-aware cache organizations
• Communication architecture
  • on-chip interconnection networks
  • the use of repeaters with switching
  • the use of hierarchy and selective ‘fat’ wires

Circuit Challenges of Slow Interconnect
• The clock cycle is dominated by wire delay
  • novel circuits to improve effective signal velocity
• Power is largely used to drive wires
  • low-swing on-chip signaling methods
  • reject rather than overpower noise
• It’s difficult to distribute a global clock
  • locally synchronous design methods
  • fast synchronizers
  • no wait for metastable decay

Overdrive gives 3x improvement in RC wire latency

Low-Swing Overdrive Signaling
[Figure: waveforms showing 1V swing at source, 300mV swing at receiver, and recovered signal]
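
To give a feel for why a reduced swing saves power, the sketch below computes the energy drawn from the supply per rising transition, E = C · V_swing · V_supply (the charge C · V_swing delivered at V_supply). The 1V swing comes from the slide; the 1 pF wire capacitance and the 3.3V full-swing supply are hypothetical values chosen for illustration, and the sketch does not model the overdrive circuit itself.

```python
def energy_per_transition(c_farads: float, v_swing: float, v_supply: float) -> float:
    """Energy drawn from the supply to raise a capacitance by v_swing."""
    return c_farads * v_swing * v_supply


C_WIRE = 1e-12   # hypothetical wire capacitance (1 pF)
V_DD = 3.3       # hypothetical full-swing supply voltage

full_swing = energy_per_transition(C_WIRE, V_DD, V_DD)
low_from_vdd = energy_per_transition(C_WIRE, 1.0, V_DD)   # 1V swing, charge taken from V_DD
low_from_1v = energy_per_transition(C_WIRE, 1.0, 1.0)     # 1V swing, from a 1V supply

print(f"saving with 1V swing from V_DD: {full_swing / low_from_vdd:.1f}x")
print(f"saving with 1V swing from 1V:   {full_swing / low_from_1v:.1f}x")
```

With these assumed values the saving is about 3.3x when the charge still comes from the full supply, and about 10.9x when the driver runs from a supply equal to the swing.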

Conclusion: Exploit, Don’t Fight, the Technology
• Interconnect is rapidly dominating the delay, power, and area of ICs
• Traditional architectures rely on global communication
  • they are ill-suited for an interconnect-dominated technology
• Emerging architectures expose communication and exploit locality
  • distributed register files and instruction dispatch
  • bandwidth hierarchy
• Novel circuits can mitigate effects of slow wires
  • overdrive, low-swing signaling, locally synchronous design
