Outline

Communication Modeling for System-Level Design
Andrew B. Kahng#,* (abk@cs.ucsd.edu)
Kambiz Samadi* (kambiz@vlsicad.ucsd.edu)
CSE# and ECE* Departments, UCSD
November 24, 2008

Outline
• Motivation
• Communication Synthesis for Network-on-Chip
• Network-on-Chip Architecture Modeling
• Buffered Interconnect Model
• Router Power and Area Model
• Bus Architecture Modeling
• Conclusions

Motivation
• Focus of the design process is shifting from “computation” to “communication”
• Device and interconnect performance scaling mismatches cause breakdown of traditional across-chip communication
• System-level designers require accurate, yet simple models to bridge planning and implementation stages
• Today’s system-level performance and power modeling suffers from:
  • Ad hoc selection of models
  • Poor balance between accuracy and simplicity
  • Lack of model extensibility across future technology nodes
• Due to design performance / power constraints, early-stage design exploration has become crucial
Our Goal: Develop accurate models that are easily usable in system-level design early in the design cycle

Communication Synthesis for Network-on-Chip
Given
• An input specification as a set of communication constraints
• A library of communication components
• An objective function (e.g., power, area, delay)
Find
• A network-on-chip implementation as a composition of library components that
  • Satisfies the specification
  • Minimizes the cost function
Communication Synthesis Infrastructure (COSI)
• Based on the Platform-Based Design methodology
• Takes specification and library descriptions in XML format
• Produces a variety of outputs, including a cycle-accurate SystemC implementation of the optimal network-on-chip
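The Given / Find formulation above can be sketched as a constrained minimization. The following is only a minimal illustration, not COSI's algorithm: it assumes the candidate implementations have already been enumerated and characterized, and the `Candidate` type, its fields, the `synthesize` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical, pre-characterized NoC implementation (illustrative only)."""
    name: str
    latency_cycles: int    # worst-case end-to-end latency
    bandwidth_gbps: float  # sustained bandwidth
    power_mw: float        # value of the objective (cost) function


def synthesize(candidates: Iterable[Candidate],
               max_latency_cycles: int,
               min_bandwidth_gbps: float) -> Optional[Candidate]:
    """Return the cheapest candidate that satisfies the specification, or None."""
    feasible = [
        c for c in candidates
        if c.latency_cycles <= max_latency_cycles
        and c.bandwidth_gbps >= min_bandwidth_gbps
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.power_mw)


# Made-up numbers, purely to exercise the function.
candidates = [
    Candidate("mesh", latency_cycles=6, bandwidth_gbps=12.8, power_mw=90.0),
    Candidate("ring", latency_cycles=10, bandwidth_gbps=6.4, power_mw=40.0),
    Candidate("crossbar", latency_cycles=2, bandwidth_gbps=25.6, power_mw=150.0),
]
best = synthesize(candidates, max_latency_cycles=8, min_bandwidth_gbps=10.0)
```

A real synthesis tool composes library components instead of picking from a fixed list; the sketch only shows the "satisfy the specification, then minimize cost" structure.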

Constraint-Driven Communication Synthesis
[Figure: synthesis flow. Labels: Point-to-Point Specification; On-Chip Communication Library; Synthesis Result; Perf. / Cost Abstractions; Constraints Propagation; Application; Implementation; Synthesis]

Buffered Interconnect Model
Components
• Repeater delay model: separate models for intrinsic delay, output slew, input capacitance
• Wire delay model: accounts for coupling capacitance impact on wire delay
• Repeater power model: accounts for sub-threshold and gate leakages
• Repeater area model: derived from existing cell layouts (can be extrapolated)
• Wire area model: derived from wire width and spacing (can be extrapolated)
[Figure: Technology parameter extraction flows. Labels: Device (min. inverter: Rd, Cin, Ioff, tintrinsic); Interconnect (T, HILD, Wmin, Smin, εILD; tiers: local, intermediate, semi-global, global); sources: .lib, SPICE sim., PTM, MASTAR, LEF/ITF, ITRS Interconnect Chapter; automatic extraction]
• Inputs for repeater delay calculation
  • Delay and slew values for a set of input slew and load capacitance values (obtained from Liberty / SPICE)
  • Input capacitance for different repeater sizes (Liberty, PTM)
• Inputs for wire delay calculation
  • Wire dimensions (ITRS/PTM, LEF, ITF)
  • Inter-wire spacings for global and intermediate layers (ITRS/PTM, LEF, ITF)
• Inputs for power calculation
  • Input capacitance (Liberty, PTM)
  • Wire parasitics (computed in wire delay calculation)

Repeater and Wire Models
• $\text{delay} = i(\text{slew}_{in}) + r(\text{slew}_{in}) \cdot C_L$
• $r(\text{slew}_{in}) = f(\text{size}, \text{slew}_{in})$
• $\text{slew}_{out} = f(\text{slew}_{in}, C_L)$
• wire delay = Elmore
[Figure panels: Intrinsic Delay Model: $i(\text{slew}_{in})$; Drive Res Model: $r(\text{slew}_{in})$]
• Predictions extend down to 16nm
• Delay model is within 15% of PrimeTime
• Repeater area and power are linear with repeater size
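The two delay pieces above (repeater delay as intrinsic delay plus drive resistance times load capacitance, and Elmore delay for the wire) can be sketched as follows. This is a minimal illustration under stated assumptions: the function and parameter names are hypothetical, the intrinsic delay and drive resistance are assumed to be already evaluated at the given input slew, the wire is assumed uniform, and coupling capacitance is assumed to be folded into the per-unit-length capacitance.

```python
def repeater_delay(intrinsic_s: float, drive_res_ohm: float, load_cap_f: float) -> float:
    """Repeater delay: i(slew_in) + r(slew_in) * C_L.

    intrinsic_s and drive_res_ohm are i(.) and r(.) already evaluated
    at the input slew of interest (assumption of this sketch).
    """
    return intrinsic_s + drive_res_ohm * load_cap_f


def elmore_wire_delay(r_per_m: float, c_per_m: float, length_m: float,
                      load_cap_f: float, segments: int = 100) -> float:
    """Elmore delay of a uniform RC wire modeled as lumped R-then-C segments.

    Each segment resistance is multiplied by all capacitance downstream of it
    (its own segment capacitance, the remaining segments, and the load).
    """
    if segments < 1:
        raise ValueError("segments must be >= 1")
    r_seg = r_per_m * length_m / segments
    c_seg = c_per_m * length_m / segments
    delay = 0.0
    for k in range(1, segments + 1):
        downstream_cap = (segments - k + 1) * c_seg + load_cap_f
        delay += r_seg * downstream_cap
    return delay


# Illustrative values only: a 1 mm wire with 100 ohm and 0.2 pF total, 10 fF load.
wire = elmore_wire_delay(r_per_m=1e5, c_per_m=2e-10, length_m=1e-3, load_cap_f=1e-14)
stage = repeater_delay(intrinsic_s=20e-12, drive_res_ohm=1000.0, load_cap_f=10e-15)
```

As the number of segments grows, the wire term approaches the familiar distributed-line value of R·C/2 + R·C_L, where R and C are the total wire resistance and capacitance.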

Impact on System-Level Design
Testcases
• VPROC: video processor with 42 cores and 128-bit datawidth
• dVOPD: dual video object plane decoder with 26 cores and 128-bit datawidth
• Original model (Orig.) underestimates power compared to the proposed model (Prop.)
• Original model is very optimistic in delay → this becomes more critical as technology scales and the chip size becomes larger

ORION2.0: Accurate NoC Router Models
[Figure: ORION2.0 block diagram, with new items marked "NEW!". Labels: FIFO, Arbiter, Crossbar, Clock, Link; Area, Leakage, Dynamic]
• Technology parameters
  • interconnect parameters
  • device parameters
  • scaling factors for future technologies
  • …
• Circuit implementation & buffering scheme
  • SRAM and register FIFO
  • MUX-tree and matrix crossbar
  • different arbitration schemes
  • hybrid buffering scheme
• Architectural parameters
  • # of ports; # of buffers
  • # of xbar ports; # of VCs
  • voltage, frequency
• Built on top of ORION1.0
• Provides previously missing power subcomponents
• Provides significant accuracy improvement vs. ORION1.0
• Uses our automatic flows to obtain technology inputs
• To appear in DATE-2009 (A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi)

Validation and Significance Assessment
• Validation: Two Intel NoC Chips
  • (1) Intel 80-core Teraflops, and (2) Intel SCC
  • ORION2.0 offers significant accuracy improvement
• System-level Impact: COSI-OCC
  • ORION2.0 models lead to better-performing NoC: (1) fewer hops, and (2) fewer routers
  • Relative power due to an additional port is not as high in ORION2.0 as in ORION1.0

AMBA Models
• Signal Bus Modeling:
  • system-level interconnect model (described earlier)
• Logic Modeling (multiplexers, decoders, and arbiter):
  • Block latency based on gate delay model (cf. Carloni et al. ASPDAC’08)
  • Dynamic power is computed after measuring the switching capacitance
  • Leakage power is computed from average device leakages
  • Area is computed from cell areas of logic gates

AMBA Modeling and Bus vs. NoC Study
[Figure: AMBA Model block diagram. Inputs: floorplan, transaction, technology & design style. Outputs: Area, Delay, Leakage, Dynamic]
• Floorplan
  • location of all masters, slaves
  • bit widths of all masters, slaves
  • optionally, locations of arbiter, decoder, and multiplexers
• Transaction
  • read and write
  • length
  • address progression
• Technology & design style
  • min. width, spacing, thickness
  • dielectric thickness, constant
  • device drive res, cap, leakage
  • width/spacing, buffering scheme
• Delay, power, area models within 11% of physical implementation
• Functional forms verified against physical implementation of AMBA-AHB controller
• Bus vs. NoC study enables design space explorations of heterogeneous communication fabrics

Conclusions and Future Directions
• Accurate models can drive effective system-level exploration
• Reproducible methodology for extracting inputs to models
• Modeling at different levels of abstraction
  • protocol encapsulation (e.g., hand-shaking for AMBA bus allocation)
  • buses, pipelined rings (e.g., EIB in IBM Cell)
  • routers, network interfaces
  • FIFOs, queues, crossbar switches (ORION2.0)
• Extending to other technologies
  • 3D IC integration (e.g., TSV modeling, multi-layer router modeling)

Backup Slides

Communication Synthesis Key Elements
• Specification of input constraints
  • Set of IP cores: area and interface
  • End-to-end communication requirements between pairs of IP cores: latency and throughput
• Characterization of library of components
  • Interface types, max number of ports
  • Max capacities: bandwidth, latency, max distance
  • Performance and cost model
• Component instantiation and parallel composition
  • Rename, set parameters of library components
  • Composition based on algebra on quantities (including type compatibility)

Communication Synthesis Example
• Synthesis of optimal network-on-chip
  • Return a valid composition that meets input constraints and
  • Minimizes the objective function (e.g., power dissipation)
[Figure labels: (Original Specification); Platform Instance 1; Platform Instance 2]

COSI: Communication Synthesis Infrastructure
• COSI is a public-domain software package for NoC synthesis
  http://embedded.eecs.berkeley.edu/cosi/

Dynamic and Leakage Power Models
• Dynamic Power: Switching Capacitance
  • Clock power:
    • $P_{clk} = C_{clk} V_{dd}^2 f$
    • $C_{clk} = C_{sram\text{-}fifo} + C_{pipeline\text{-}registers} + C_{register\text{-}fifo} + C_{wiring}$
  • Physical links: due to charging and discharging of capacitive load
    • $P_d = C_{load} V_{dd}^2 f$; $C_{load} = C_{ground} + C_{coupling} + C_{input}$
  • Register-based FIFO: implemented as shift registers
  • Other components: we use ORION 1.0 models
• Leakage Power: Subthreshold and Gate
  • From 65nm and beyond, gate leakage becomes significant
  • $I'_{sub}(i,s)$ and $I'_{gate}(i,s)$ are subthreshold and gate leakage currents per unit transistor width for a specific technology
  • $W_{sub}(i,s)$ and $W_{gate}(i,s)$ are the effective widths of component i at input state s for subthreshold and gate leakage, respectively
  • Key circuit components: INVx1, NAND2x1, NOR2x1, and DFF
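The clock and link power expressions above can be evaluated directly, as in the minimal sketch below. Function and parameter names are hypothetical, and units are assumed to be farads, volts, and hertz. The leakage equation itself did not survive in this transcript, so `leakage_current` is only an assumption: it multiplies each effective width by the matching per-unit-width current for a single component and input state, and may differ from the form on the original slide.

```python
def switching_power(cap_f: float, vdd_v: float, freq_hz: float) -> float:
    """Dynamic power P = C * Vdd^2 * f, as written on the slide (no activity factor)."""
    return cap_f * vdd_v ** 2 * freq_hz


def clock_power(c_sram_fifo: float, c_pipeline_registers: float,
                c_register_fifo: float, c_wiring: float,
                vdd_v: float, freq_hz: float) -> float:
    """P_clk with C_clk = C_sram-fifo + C_pipeline-registers + C_register-fifo + C_wiring."""
    c_clk = c_sram_fifo + c_pipeline_registers + c_register_fifo + c_wiring
    return switching_power(c_clk, vdd_v, freq_hz)


def link_power(c_ground: float, c_coupling: float, c_input: float,
               vdd_v: float, freq_hz: float) -> float:
    """P_d with C_load = C_ground + C_coupling + C_input."""
    return switching_power(c_ground + c_coupling + c_input, vdd_v, freq_hz)


def leakage_current(w_sub: float, i_sub_per_width: float,
                    w_gate: float, i_gate_per_width: float) -> float:
    """ASSUMED form (not from the slide): leakage of one component in one input
    state as effective width times per-unit-width current, summed over the
    subthreshold and gate mechanisms."""
    return w_sub * i_sub_per_width + w_gate * i_gate_per_width


# Illustrative values only.
p_clk = clock_power(1e-12, 2e-12, 3e-12, 4e-12, vdd_v=1.0, freq_hz=1e9)
p_link = link_power(1e-13, 2e-13, 1e-13, vdd_v=1.2, freq_hz=2e9)
```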

Area Model
• As the number of cores increases, the area occupied by communication components becomes significant (19% of total tile area in the Intel 80-core Teraflops chip)
• Gate area model by Yoshida et al. (DAC’04)
• Link area model by Carloni et al. (ASPDAC’08)
• We model FIFO, crossbar switch, and arbiter areas using the adopted gate area model
• Matrix arbiter:
  $Area_{arbiter} = Area_{NOR2x1} \cdot 2(R-1)R + Area_{DFF} \cdot \frac{R(R-1)}{2} + Area_{INVx1} \cdot R$
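The matrix arbiter area expression above translates into a short function. This is a sketch with hypothetical names; R is assumed here to be the number of requesters (the slide does not define it), and the cell areas are whatever the adopted gate area model returns, in any consistent unit.

```python
def matrix_arbiter_area(r: int, area_nor2x1: float,
                        area_dff: float, area_invx1: float) -> float:
    """Area_arbiter = Area_NOR2x1 * 2(R-1)R + Area_DFF * R(R-1)/2 + Area_INVx1 * R.

    r is assumed to be the number of requesters (not defined on the slide).
    """
    if r < 1:
        raise ValueError("r must be >= 1")
    nor_count = 2 * (r - 1) * r
    dff_count = r * (r - 1) // 2  # R(R-1) is always even, so this is exact
    inv_count = r
    return (area_nor2x1 * nor_count
            + area_dff * dff_count
            + area_invx1 * inv_count)


# Illustrative cell areas only (arbitrary units).
area_5 = matrix_arbiter_area(5, area_nor2x1=1.0, area_dff=4.0, area_invx1=0.5)
```

The NOR2 term grows quadratically with R, so arbiter area rises quickly as ports are added.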
