NoC Symposium’07 Panel
Proliferating the Use and Acceptance of NoC Benchmark Standards
Timothy M. Pinkston
National Science Foundation (NSF), tpinksto@nsf.gov
University of Southern California (USC), tpink@usc.edu
Driving Forces
[Figure: App’s–Arch–Tech triangle relating demand for system functions, performance of system functions, demand for functional blocks, hardware functional blocks, workloads, and algorithms & SW.]
• Applications define what system functions should be supported
• Architecture defines how system functions are supported
• (Circuit) Technology defines the extent to which desired system functions can be implemented in hardware
“Trends Towards On-chip Networked Microsystems,” T. Pinkston and J. Shin, IJHPCN. (http://ceng.usc.edu/smart/publications/archives/CENG-2004-17.pdf)
Is There a Need for a NoC Benchmark Suite?
A sampling of benchmark suites already out there:
• Gen-Purpose/PC: SPEC CPU2006, Netperf, Dhry-/Whetstone, BAPCo SYSmark, BYTEmark, LMBench, LLCbench, DMABench
• Embedded/SoC: EEMBC, MiBench, MediaBench, ALPBench, GraalBench, NPCryptBench, CommBench, BioBench
• Sci-Eng/HPC: STREAM, HPL, SPLASH-2, LINPACK, LAPACK, ScaLAPACK, NPB (NAS PB), LFK (Livermore), SparseBench
Do we really need yet another benchmark suite?
December 2006 NSF OCIN Workshop Recommendations (www.ece.ucdavis.edu/~ocin06)
• A set of standard workloads/benchmarks and evaluation methods is needed to enable realistic evaluation and uniform (fair) comparison of various approaches
• Need for cooperation (agreement) between academia and industry
• Need for “qualified” performance metrics: latency and bandwidth under power, energy, thermal, reliability, area, etc., constraints
• Need for standardization of metrics: clear definition of what each metric represents (e.g., network latency, throughput, ...)
• Need for effective alternatives to time-consuming full-system execution-driven simulation, including microbenchmarks, parameterized synthetic traffic/workloads, traces, etc. (see the sketch after this list)
• Need for accurate characterization and modeling of system traffic behavior across various domains: general-purpose & embedded
• Need for analytical methods (complementary to simulation) to explore and quantitatively narrow down the large design space
“Challenges in Computer Architecture Evaluation,” K. Skadron, M. Martonosi, D. August, M. Hill, D. Lilja, and V. Pai, IEEE Computer, pp. 30-36, August 2003.
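To make the “parameterized synthetic traffic” recommendation concrete, here is a minimal Python sketch of a parameterized traffic generator covering a few classic synthetic patterns used in NoC evaluation. The function name, parameters, and pattern set are illustrative assumptions, not part of any proposed standard.

```python
import random

def synthetic_traffic(num_nodes, pattern="uniform", num_packets=10):
    """Yield (src, dst) pairs for classic synthetic NoC traffic patterns.

    Illustrative sketch only: names and parameters are hypothetical,
    not taken from any proposed benchmark standard.
    """
    bits = num_nodes.bit_length() - 1          # assumes num_nodes is a power of 2
    for _ in range(num_packets):
        src = random.randrange(num_nodes)
        if pattern == "uniform":               # uniformly random destination
            dst = random.randrange(num_nodes)
        elif pattern == "bit-complement":      # dst = bitwise complement of src
            dst = src ^ (num_nodes - 1)
        elif pattern == "transpose":           # swap high/low halves of address bits
            half = bits // 2
            lo = src & ((1 << half) - 1)
            hi = src >> half
            dst = (lo << (bits - half)) | hi
        else:
            raise ValueError(f"unknown pattern: {pattern}")
        yield src, dst

# Example: 64-node network, bit-complement permutation
for src, dst in synthetic_traffic(64, pattern="bit-complement", num_packets=3):
    print(f"packet: {src:2d} -> {dst:2d}")
```

A generator of this shape can feed a trace-driven or synthetic-workload simulation directly, sidestepping full-system execution while keeping the traffic pattern a controlled, reportable parameter.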
Meaning of Latency and Throughput
• Latency: fabric only or endnode-to-endnode? average, no-load, or at saturation?
• Throughput: peak, sustained, at saturation, best-case, or worst-case?
Simulation: 3-D torus, 4,096 nodes (16 × 16 × 16), uniform traffic load, virtual cut-through switching, three-phase arbitration, 2 and 4 virtual channels. Bubble flow control is used in dimension order on one virtual channel; the other virtual channel(s) are used either in dimension order (deterministic routing) or along any shortest path to the destination (adaptive routing). A sketch of dimension-order routing follows below.
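As a companion to the simulation setup above, this is a minimal sketch of dimension-order routing on a k-ary 3-cube (3-D torus), assuming shortest-way ring traversal in each dimension. The function name and interface are hypothetical, not taken from the actual simulator.

```python
def dor_next_hop(cur, dst, k=16):
    """One dimension-order routing step on a k-ary 3-cube (3-D torus).

    cur, dst: (x, y, z) coordinates. Resolves dimensions in x, y, z order,
    taking the shorter way around each ring. Sketch only.
    """
    cur = list(cur)
    for dim in range(3):                        # x first, then y, then z
        if cur[dim] != dst[dim]:
            fwd = (dst[dim] - cur[dim]) % k     # hops going "up" the ring
            step = 1 if fwd <= k - fwd else -1  # pick the shorter direction
            cur[dim] = (cur[dim] + step) % k
            return tuple(cur)
    return tuple(cur)                           # already at destination

# Example: hop-by-hop route from (0, 0, 0) to (2, 15, 1) on a 16-ary 3-cube
node = (0, 0, 0)
while node != (2, 15, 1):
    node = dor_next_hop(node, (2, 15, 1))
print("arrived at", node)
```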
Simple (Analytical) Latency and Throughput Models (cut-through switching)
• H&P interconnection networks chapter: ceng.usc.edu/smart/slides/appendixE.html
Latency, lower bound (contention delay not included):
Latency = Sending latency + T_LinkProp × (d + 1) + (T_r + T_a + T_s) × d + (Packet + d × Header) / BW_Network + Receiving latency
Effective bandwidth, upper bound (contention delay not fully included):
Effective bandwidth = min(N × BW_LinkInjection, ρ × BW_Bisection / γ, σ × N × BW_LinkReception)
• Network traffic pattern/load determines σ & γ, the traffic-dependent parameters
• Topology and switch microarchitecture determine d, T_r, T_a, T_s, and BW_Bisection
• Routing, switching, flow control, microarchitecture, etc., influence the network efficiency factor ρ:
  • internal switch speedup & reduction of contention within switches
  • buffer organizations to mitigate HOL blocking in and across switches
  • balancing load across network links & maximally utilizing link bandwidth
• ρ = ρ_L × ρ_R × ρ_A × ρ_S × ρ_μArch × ..., the architecture-dependent parameters
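The two models above translate directly into code. This is a minimal sketch in Python, with variable names mirroring the slide’s symbols (σ → sigma, γ → gamma, ρ → rho); units and parameter values are left to the caller.

```python
def packet_latency(sending, receiving, d, t_link_prop, t_r, t_a, t_s,
                   packet_bits, header_bits, bw_network):
    """Lower-bound cut-through packet latency from the model above
    (contention delay not included). d = hop count."""
    return (sending
            + t_link_prop * (d + 1)
            + (t_r + t_a + t_s) * d
            + (packet_bits + d * header_bits) / bw_network
            + receiving)

def effective_bandwidth(n, bw_inj, bw_rec, bw_bisection, rho, sigma, gamma):
    """Upper-bound effective bandwidth from the model above
    (contention delay not fully included)."""
    return min(n * bw_inj,                    # aggregate injection limit
               rho * bw_bisection / gamma,    # network (bisection) limit
               sigma * n * bw_rec)            # aggregate reception limit
```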
Modeling Throughput of Cell BE EIB (Worst-Case)
• 12 nodes; injection bandwidth: 25.6 GB/s per element; reception bandwidth: 25.6 GB/s per element
• Network injection: 307.2 GB/s; network reception: 307.2 GB/s
• Aggregate link bandwidth: 1,228.8 GB/s (4 rings, each with 12 links)
• Peak BW_Network: 25.6 GB/s × 3 (transfers per ring) × 4 (rings) = 307.2 GB/s
• Command bus bandwidth: 204.8 GB/s
• BW_Bisection = 8 links = 204.8 GB/s
• Traffic pattern determines σ & γ: here σ = 1, γ = 1
• ρ is limited, at best, to only 50% due to ring interference; measured ρ = 38%
• BW_Network = ρ × BW_Bisection / γ = 0.38 × 204.8 / 1 GB/s ≈ 78 GB/s (measured)
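Plugging the slide’s Cell BE EIB numbers into the effective_bandwidth() sketch from the previous slide reproduces the measured worst-case figure:

```python
# Cell BE EIB worst case, using effective_bandwidth() defined above:
# 12 elements, 25.6 GB/s injection/reception per element,
# 204.8 GB/s bisection, measured rho = 0.38, sigma = gamma = 1.
bw = effective_bandwidth(n=12, bw_inj=25.6, bw_rec=25.6,
                         bw_bisection=204.8, rho=0.38, sigma=1.0, gamma=1.0)
print(f"worst-case effective bandwidth: {bw:.1f} GB/s")
# -> 77.8 GB/s, matching the ~78 GB/s measured value
```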
[Figure: integer and floating-point benchmark programs. Ref: Hennessy & Patterson, “Computer Architecture: A Quantitative Approach,” 4th Ed.]
In Conclusion: Answers to Panel Questions
• What are the hallmarks of successful benchmark suites?
  • Fairness: represent the proper workload behavior/characteristics
  • Portability: open, free access, not architecture/vendor-specific
  • Transparency: yield reproducible performance results (reporting)
  • Evolutionary: adaptable over time in composition and reporting
• How can industry and academia facilitate use?
  • Establish the need for and importance of common evaluation “best practices”
  • Make it a cross-cutting effort: architects, circuit designers, CAD researchers
  • Place high value on developing and using evaluation standards
• What are the main obstacles to establishing a de facto NoC standard benchmark suite, and how can they be addressed?
  • Capturing the diversity of NoC applications & computing domains
  • Avoiding red herrings: converge on performance evaluation standards and agree on characteristic traffic loads and/or microbenchmarks
  • Ultimately, system-level performance is what matters, not component-level performance