NoC Symposium’07 Panel
Proliferating the Use and Acceptance of NoC Benchmark Standards
Timothy M. Pinkston
National Science Foundation (NSF), tpinksto@nsf.gov
University of Southern California (USC), tpink@usc.edu
Driving Forces
[Figure: App’s–Arch–Tech triangle relating demand for system functions, performance of system functions, demand for functional blocks, hardware functional blocks, workloads, and algorithms & SW.]
• Applications define what system functions should be supported
• Architecture defines how system functions are supported
• (Circuit) Technology defines the extent to which desired system functions can be implemented in hardware
“Trends Towards On-chip Networked Microsystems,” T. Pinkston and J. Shin, IJHPCN. (http://ceng.usc.edu/smart/publications/archives/CENG-2004-17.pdf)
Is There a Need for a NoC Benchmark Suite?
A sampling of benchmark suites already out there:
• Gen-Purpose/PC: SPEC CPU2006, Netperf, Dhry-/Whetstone, BAPCo SYSmark, BYTEmark, LMBench, LLCbench, DMABench
• Embedded/SoC: EEMBC, MiBench, MediaBench, ALPBench, GraalBench, NPCryptBench, CommBench, BioBench
• Sci-Eng/HPC: STREAM, HPL, SPLASH-2, LINPACK, LAPACK, ScaLAPACK, NPB (NAS PB), LFK (Livermore), SparseBench
Do we really need yet another benchmark suite?
December 2006 NSF OCIN Workshop Recommendations (www.ece.ucdavis.edu/~ocin06)
• A set of standard workloads/benchmarks and evaluation methods is needed to enable realistic evaluation and uniform (fair) comparison of various approaches
• Need for cooperation (agreement) between academia and industry
• Need for “qualified” performance metrics: latency and bandwidth under power, energy, thermal, reliability, area, etc., constraints
• Need for standardization of metrics: clear definition of what each metric represents (e.g., network latency, throughput, ...)
• Need for effective alternatives to time-consuming full-system execution-driven simulation, including microbenchmarks, parameterized synthetic traffic/workloads, traces, etc. (see the sketch after this list)
• Need for accurate characterization and modeling of system traffic behavior across various domains: general-purpose & embedded
• Need for analytical methods (complementary to simulation) to explore and quantitatively narrow down the large design space
“Challenges in Computer Architecture Evaluation,” K. Skadron, M. Martonosi, D. August, M. Hill, D. Lilja, and V. Pai, IEEE Computer, pp. 30-36, August 2003.
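To make the “parameterized synthetic traffic” recommendation concrete, here is a minimal Python sketch of a parameterized traffic generator covering a few classic synthetic patterns used in NoC evaluation. The function name, parameters, and pattern set are illustrative assumptions, not part of any proposed standard.

```python
import random

def synthetic_traffic(num_nodes, pattern="uniform", num_packets=10):
    """Yield (src, dst) pairs for classic synthetic NoC traffic patterns.

    Illustrative sketch only: names and parameters are hypothetical,
    not taken from any proposed benchmark standard.
    """
    bits = num_nodes.bit_length() - 1          # assumes num_nodes is a power of 2
    for _ in range(num_packets):
        src = random.randrange(num_nodes)
        if pattern == "uniform":               # uniformly random destination
            dst = random.randrange(num_nodes)
        elif pattern == "bit-complement":      # dst = bitwise complement of src
            dst = src ^ (num_nodes - 1)
        elif pattern == "transpose":           # swap high/low halves of address bits
            half = bits // 2
            lo = src & ((1 << half) - 1)
            hi = src >> half
            dst = (lo << (bits - half)) | hi
        else:
            raise ValueError(f"unknown pattern: {pattern}")
        yield src, dst

# Example: 64-node network, bit-complement permutation
for src, dst in synthetic_traffic(64, pattern="bit-complement", num_packets=3):
    print(f"packet: {src:2d} -> {dst:2d}")
```

A generator of this shape can feed a trace-driven or synthetic-workload simulation directly, sidestepping full-system execution while keeping the traffic pattern a controlled, reportable parameter.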
Meaning of Latency and Throughput
• Latency: fabric only or endnode-to-endnode? average, no-load, or at saturation?
• Throughput: peak, sustained, at saturation, best-case, or worst-case?
Simulation: 3-D torus, 4,096 nodes (16 × 16 × 16), uniform traffic load, virtual cut-through switching, three-phase arbitration, 2 and 4 virtual channels. Bubble flow control is used in dimension order on one virtual channel; the other virtual channel(s) are used either in dimension order (deterministic routing) or along any shortest path to the destination (adaptive routing). A sketch of dimension-order routing follows below.
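As a companion to the simulation setup above, this is a minimal sketch of dimension-order routing on a k-ary 3-cube (3-D torus), assuming shortest-way ring traversal in each dimension. The function name and interface are hypothetical, not taken from the actual simulator.

```python
def dor_next_hop(cur, dst, k=16):
    """One dimension-order routing step on a k-ary 3-cube (3-D torus).

    cur, dst: (x, y, z) coordinates. Resolves dimensions in x, y, z order,
    taking the shorter way around each ring. Sketch only.
    """
    cur = list(cur)
    for dim in range(3):                        # x first, then y, then z
        if cur[dim] != dst[dim]:
            fwd = (dst[dim] - cur[dim]) % k     # hops going "up" the ring
            step = 1 if fwd <= k - fwd else -1  # pick the shorter direction
            cur[dim] = (cur[dim] + step) % k
            return tuple(cur)
    return tuple(cur)                           # already at destination

# Example: hop-by-hop route from (0, 0, 0) to (2, 15, 1) on a 16-ary 3-cube
node = (0, 0, 0)
while node != (2, 15, 1):
    node = dor_next_hop(node, (2, 15, 1))
print("arrived at", node)
```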
Simple (Analytical) Latency and Throughput Models (cut-through switching)
• H&P interconnection networks chapter: ceng.usc.edu/smart/slides/appendixE.html
Latency, lower bound (contention delay not included):
Latency = Sending latency + T_LinkProp × (d + 1) + (T_r + T_a + T_s) × d + (Packet + d × Header) / BW_Network + Receiving latency
Effective bandwidth, upper bound (contention delay not fully included):
Effective bandwidth = min(N × BW_LinkInjection, ρ × BW_Bisection / γ, σ × N × BW_LinkReception)
• Network traffic pattern/load determines σ & γ, the traffic-dependent parameters
• Topology and switch microarchitecture determine d, T_r, T_a, T_s, and BW_Bisection
• Routing, switching, flow control, microarchitecture, etc., influence the network efficiency factor ρ:
  • internal switch speedup & reduction of contention within switches
  • buffer organizations to mitigate HOL blocking in and across switches
  • balancing load across network links & maximally utilizing link bandwidth
• ρ = ρ_L × ρ_R × ρ_A × ρ_S × ρ_μArch × ..., the architecture-dependent parameters
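The two models above translate directly into code. This is a minimal sketch in Python, with variable names mirroring the slide’s symbols (σ → sigma, γ → gamma, ρ → rho); units and parameter values are left to the caller.

```python
def packet_latency(sending, receiving, d, t_link_prop, t_r, t_a, t_s,
                   packet_bits, header_bits, bw_network):
    """Lower-bound cut-through packet latency from the model above
    (contention delay not included). d = hop count."""
    return (sending
            + t_link_prop * (d + 1)
            + (t_r + t_a + t_s) * d
            + (packet_bits + d * header_bits) / bw_network
            + receiving)

def effective_bandwidth(n, bw_inj, bw_rec, bw_bisection, rho, sigma, gamma):
    """Upper-bound effective bandwidth from the model above
    (contention delay not fully included)."""
    return min(n * bw_inj,                    # aggregate injection limit
               rho * bw_bisection / gamma,    # network (bisection) limit
               sigma * n * bw_rec)            # aggregate reception limit
```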
Modeling Throughput of Cell BE EIB (Worst-Case)
• 12 nodes; injection bandwidth: 25.6 GB/s per element; reception bandwidth: 25.6 GB/s per element
• Network injection: 307.2 GB/s; network reception: 307.2 GB/s
• Aggregate link bandwidth: 1,228.8 GB/s (4 rings, each with 12 links)
• Peak BW_Network: 25.6 GB/s × 3 (transfers per ring) × 4 (rings) = 307.2 GB/s
• Command bus bandwidth: 204.8 GB/s
• BW_Bisection = 8 links = 204.8 GB/s
• Traffic pattern determines σ & γ: here σ = 1, γ = 1
• ρ is limited, at best, to only 50% due to ring interference; measured ρ = 38%
• BW_Network = ρ × BW_Bisection / γ = 0.38 × 204.8 / 1 GB/s ≈ 78 GB/s (measured)
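Plugging the slide’s Cell BE EIB numbers into the effective_bandwidth() sketch from the previous slide reproduces the measured worst-case figure:

```python
# Cell BE EIB worst case, using effective_bandwidth() defined above:
# 12 elements, 25.6 GB/s injection/reception per element,
# 204.8 GB/s bisection, measured rho = 0.38, sigma = gamma = 1.
bw = effective_bandwidth(n=12, bw_inj=25.6, bw_rec=25.6,
                         bw_bisection=204.8, rho=0.38, sigma=1.0, gamma=1.0)
print(f"worst-case effective bandwidth: {bw:.1f} GB/s")
# -> 77.8 GB/s, matching the ~78 GB/s measured value
```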
[Figure: integer and floating-point benchmark programs. Ref: Hennessy & Patterson, “Computer Architecture: A Quantitative Approach,” 4th Ed.]
In Conclusion: Answers to Panel Questions
• What are the hallmarks of successful benchmark suites?
  • Fairness: represent the proper workload behavior/characteristics
  • Portability: open, free access, not architecture/vendor-specific
  • Transparency: yield reproducible performance results (reporting)
  • Evolutionary: adaptable over time in composition and reporting
• How can industry and academia facilitate use?
  • Establish the need for and importance of common evaluation “best practices”
  • Make it a cross-cutting effort: architects, circuit designers, CAD researchers
  • Place high value on developing and using evaluation standards
• What are the main obstacles to establishing a de facto NoC standard benchmark suite, and how can they be addressed?
  • Capturing the diversity of NoC applications & computing domains
  • Avoiding red herrings: converge on performance evaluation standards and agree on characteristic traffic loads and/or microbenchmarks
  • Ultimately, system-level performance is what matters, not component-level performance