Network-on-Chip benchmarking workgroup, status update

Network-on-Chip benchmarking workgroup, status update April 2009, Erno Salminen

Contents • Background, goals, and benefits • Ongoing activities • HDTV decoding benchmark • Other benchmarks • Workload modeling and Transaction Generator • Future activities

Background • Parallel processing is the key to obtaining high performance with moderate power consumption • Multiprocessor System-on-Chip (MP-SoC) • Parallelism implies communication between components • The Network-on-Chip (NoC) paradigm brings the techniques developed for macro-scale, multi-hop networks into a chip • The major goal of communication-centric design methodologies and NoC is to achieve greater design productivity and performance by handling the increasing parallelism, manufacturing complexity, wiring problems, and reliability

Benchmarking NoCs • A multitude of NoCs have been proposed in the literature • OCP-IP NoC benchmarking workgroup formed in 2007 • NoC benchmarking aims to answer two basic questions: • NoC developer: What gain does my novel feature bring? • System integrator: Which NoC should I choose and how should I configure it?

Benefits of common benchmarks • Improved sharing and comparison of R&D results • Most contemporary NoC benchmark cases are proprietary • A set of academic, synthetic benchmarks can be shared and used without these limitations • Increased healthy competition between R&D groups • Standardized metrics and measurement methodologies enable fair comparison • Increased reproducibility of results and commonality for comparative purposes • Accelerated development and analysis • Available input data and hardware models can speed up the initial design and performance estimation phase • Better scalability compared to application benchmarks • Synthetic benchmarks are more suitable for benchmarking purposes since they can exhibit the properties of particular fixed-size application benchmarks, but can scale with system size while still retaining these properties

Workgroup’s outcome so far • White papers • An Initiative Towards Open Network-on-Chip Benchmarks • Survey of Network-on-Chip Proposals • Specifications • OCP Network-on-Chip Benchmarks Specification Part 1: Application Modeling and Hardware Description 1.0 • OCP Network-on-Chip Benchmarks Specification Part 2: Micro-Benchmark Specification 1.0 • Articles • Two articles regarding the specs on Embedded.com • One article accepted for publication in IET CDT journal

Ongoing activities: Video decoding benchmark • Specification document ready for OCP-IP member/GSC review • Describes the traffic patterns in an HDTV decoding chip • Multiprocessing chip, shared memory communication via external DRAM • A few accelerators in addition to CPUs • Specifies bandwidth, burst size, target distribution, etc.

Video decoding benchmark (2) • The majority of traffic flows from the different processing engines converge on the DRAM • The memory subsystem consists of a memory scheduler, a memory controller, and the off-chip DRAM • Total bandwidth requirement: 1.3 GByte/s to 10.5 GByte/s

Video decoding benchmark (3) • Distinct types of processing engines • Processors • Transport engines • Video decoders • Audio processing units • Graphics engines • Display processors • Peripheral display cluster • 1, 2, 4, or 8 DRAMs

Video decoding benchmark (4) • Processing engine types differ e.g. in • Bandwidth requirement (fraction of total) • CPUs 15% • Display Processors 40-50% • Video Decoders 20-30% • Graphics Processors 10-15% • Other Initiators <5% • Temporal and spatial distribution of transfers (uniform vs. bursty/localized) • Burst length, e.g. 16 Bytes – 384 Bytes • Ratio between read and write operations
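As a rough illustration of how the bandwidth fractions above translate into absolute numbers, the following sketch applies them to the two ends of the total bandwidth range given earlier (1.3 and 10.5 GByte/s). The function names are hypothetical, "Other Initiators <5%" is modeled as 0 to 5%, and 1 GByte is taken as 1000 MByte; none of these choices come from the specification.

```python
# Shares copied from the slide, in percent of total bandwidth: (low, high).
# "Other Initiators <5%" is modeled as (0, 5), which is an assumption.
SHARES_PERCENT = {
    "CPUs": (15, 15),
    "Display Processors": (40, 50),
    "Video Decoders": (20, 30),
    "Graphics Processors": (10, 15),
    "Other Initiators": (0, 5),
}


def bandwidth_ranges(total_mbyte_s, shares=SHARES_PERCENT):
    """Return {engine type: (low, high)} bandwidth in MByte/s."""
    return {
        name: (total_mbyte_s * low / 100, total_mbyte_s * high / 100)
        for name, (low, high) in shares.items()
    }


def shares_can_sum_to_100(shares=SHARES_PERCENT):
    """Check that some choice within the ranges adds up to exactly 100%."""
    low_sum = sum(low for low, _ in shares.values())
    high_sum = sum(high for _, high in shares.values())
    return low_sum <= 100 <= high_sum


if __name__ == "__main__":
    for total in (1300, 10500):  # MByte/s, the two ends of the stated range
        print(f"Total {total} MByte/s")
        for name, (low, high) in bandwidth_ranges(total).items():
            print(f"  {name}: {low:.0f} to {high:.0f} MByte/s")
```

The consistency check is only a sanity test on the published ranges; the actual per-initiator values are chosen by the designer when the benchmark is instantiated.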

Video decoding benchmark (5) • Processing engines operate in a fully pipelined fashion and as independent initiators • The designer instantiates an abstract SoC system • Specifies the number of initiators and their type, total bandwidth requirement of the system, latency bounds, details about the DRAM architectures and the OCP interfaces • The designer specifies details about each initiator type • E.g., the type of accesses, locality of addresses, etc. • Finally, the designer specifies the details about each initiator • E.g., the number of outstanding transactions per thread that the initiator can support

Video decoding benchmark (6) • The bandwidth at each initiator thread should be measured over multiple windows of time • A typical window size is 10 000 clock cycles • Root mean square (RMS) error between the requested bandwidth and the serviced bandwidth should also be reported • If the RMS error is low, it means that the system is able to keep up with the bandwidth requirements of the thread • If the RMS error is high, the system is not able to keep up with the bandwidth requirements at all, or only with high jitter • Further details in the specification due Q2’09
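The measurement described above can be sketched as follows. This is a minimal illustration, not the procedure from the specification: the function names, the trace format (a list of `(cycle, bytes)` pairs per thread), and the unnormalized RMS formula are all assumptions.

```python
import math


def windowed_bandwidth(transfers, total_cycles, window_cycles=10_000):
    """Serviced bandwidth per window, in bytes per cycle.

    transfers: iterable of (cycle, num_bytes) completion records for one
    initiator thread (a hypothetical trace format).
    """
    num_windows = math.ceil(total_cycles / window_cycles)
    served_bytes = [0] * num_windows
    for cycle, num_bytes in transfers:
        if not 0 <= cycle < total_cycles:
            raise ValueError(f"cycle {cycle} outside the measured interval")
        served_bytes[cycle // window_cycles] += num_bytes
    return [b / window_cycles for b in served_bytes]


def rms_error(requested, serviced):
    """Root mean square of (requested - serviced) over all windows."""
    if len(requested) != len(serviced) or not requested:
        raise ValueError("need two non-empty series of equal length")
    squared = [(r - s) ** 2 for r, s in zip(requested, serviced)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Two windows of 10 000 cycles; the thread requests 4 bytes/cycle.
    trace = [(0, 20_000), (5_000, 20_000), (12_000, 10_000)]
    serviced = windowed_bandwidth(trace, total_cycles=20_000)
    print(serviced)                          # bytes per cycle in each window
    print(rms_error([4.0, 4.0], serviced))   # larger when windows fall short
```

In this example the first window meets the request and the second does not, so the RMS error is nonzero even though part of the run looks healthy. That is the reason for measuring over several windows rather than over the whole run.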

Ongoing activities: Video encoding and medical imaging • Message-passing video encoder • Multiprocessor system implemented in FPGA • Runs MPEG-4 encoder • Initial measurements have been carried out • Detailed profiling is ongoing • Complements the shared-memory video benchmark • Medical image processing • We will investigate how realistic traffic profiles can be obtained with synthetic images

Workload model for NoC benchmarking • Separated into distinct parts • Application’s workload – process network annotated with operation counts and transferred data amounts • Mapping – where the application tasks are executed • Computation architecture – processing capability (ops/cycle), transfer capability (DMA engine), communication overhead (context switch, transfer initiation delay) • Benchmarked NoC
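The separation above can be illustrated with a small data model. This is a sketch under stated assumptions, not the Transaction Generator's actual model: all class, field, and function names are hypothetical, and the cost model (compute cycles = operations / ops per cycle, plus a fixed initiation delay per transfer that leaves the processing element) is a deliberate simplification.

```python
import math
from dataclasses import dataclass


@dataclass
class Task:                 # application workload: operation counts
    name: str
    op_count: int


@dataclass
class Transfer:             # application workload: transferred data amounts
    src: str
    dst: str
    num_bytes: int


@dataclass
class ProcessingElement:    # computation architecture
    ops_per_cycle: float
    transfer_init_delay: int  # cycles of overhead per outgoing transfer


def pe_load(tasks, transfers, mapping, pes):
    """Return (busy cycles per PE, bytes that must cross the NoC).

    mapping: task name -> PE name. Kept separate from the application
    so the same workload can be re-mapped without being edited.
    """
    busy = {pe_name: 0 for pe_name in pes}
    for task in tasks:
        pe_name = mapping[task.name]
        busy[pe_name] += math.ceil(task.op_count / pes[pe_name].ops_per_cycle)
    noc_bytes = 0
    for t in transfers:
        src_pe, dst_pe = mapping[t.src], mapping[t.dst]
        if src_pe != dst_pe:  # transfers inside one PE never reach the NoC
            busy[src_pe] += pes[src_pe].transfer_init_delay
            noc_bytes += t.num_bytes
    return busy, noc_bytes


if __name__ == "__main__":
    tasks = [Task("A", 1000), Task("B", 500), Task("C", 250)]
    transfers = [Transfer("A", "B", 28), Transfer("B", "C", 20)]
    pes = {"PE0": ProcessingElement(2.0, 10), "PE1": ProcessingElement(1.0, 10)}
    print(pe_load(tasks, transfers, {"A": "PE0", "B": "PE0", "C": "PE1"}, pes))
```

Changing only the `mapping` argument changes how much traffic the benchmarked NoC sees, while the application and the architecture descriptions stay untouched.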

Transaction Generator (TG) • Executes the NoC benchmarks • Reads the XML workload description • Injects/ejects data to/from network • Checks integrity, collects statistics • Implemented in SystemC • Originally developed at Tampere University of Technology (TUT), Finland • Will be published as open source in Q2-Q3 of 2009

Transaction Generator (2) • Application • Process network • Separation of computation and communication • Workload model and XML format were derived from the work done with TG • Workload details given in Specification part 1 • Mapping • Defines where tasks are executed • Computation Architecture • Highly abstracted including characteristic parameters • Communication Architecture • Network model is cycle-accurate or event-accurate with time estimation
[Figure: Transaction Generator layers. A process network of tasks A to F with annotated inter-task transfers (28, 20, 4, 25, 8, 30) is mapped in groups I to V onto processing elements PE0 to PE3, which connect to the network model. Legend: inter-task transfer, mapping/grouping, initialization event, task, group, processing element (PE).]

Transaction Generator (3) • The license will be LGPL • GNU Lesser General Public License • Published changes will be LGPL, can be combined with proprietary SW if unmodified • Open questions • Where will the repository reside? • Detailed schedule • Documentation • Example designs

Workload model delivery format • XML format readable to • humans • computers • An XML schema describes the type of the XML document • typically expressed in terms of constraints on the structure and content of documents, above and beyond the basic constraints imposed by XML itself

Application:
    <application>
      <task_graph>
        <task id="A">...</task>
        ...
        <task id="F">...</task>
      </task_graph>
    </application>

Mapping:
    <mapping>
      <resource id="PE3">
        <group id="IV"> <task id="C"/>...</group>
        <group id="V">...</group>
      </resource>
    </mapping>

Platform:
    <platform>
      <resource_list>
        <resource id="PE3"> <connection terminal="3"/> </resource>
        ...
      </resource_list>
      <noc>
        <router_list>...</router_list>
        <link_list>...</link_list>
      </noc>
    </platform>
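A description in this style can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a small, made-up document. The element and attribute names follow the fragments on the slide, but the `<workload>` wrapper element, the `task_graph` spelling, the quoted attribute values, and the placement of tasks A and F are assumptions made so the example is well-formed; the real format is defined in Specification part 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified workload description (see assumptions above).
WORKLOAD_XML = """
<workload>
  <application>
    <task_graph>
      <task id="A"/>
      <task id="C"/>
      <task id="F"/>
    </task_graph>
  </application>
  <mapping>
    <resource id="PE0">
      <group id="I"><task id="A"/></group>
    </resource>
    <resource id="PE3">
      <group id="IV"><task id="C"/></group>
      <group id="V"><task id="F"/></group>
    </resource>
  </mapping>
</workload>
"""


def task_placement(xml_text):
    """Return {task id: (resource id, group id)} from the <mapping> part."""
    root = ET.fromstring(xml_text.strip())
    placement = {}
    for resource in root.findall("./mapping/resource"):
        for group in resource.findall("group"):
            for task in group.findall("task"):
                placement[task.get("id")] = (resource.get("id"), group.get("id"))
    return placement


def unmapped_tasks(xml_text):
    """Return ids of tasks declared in the application but not mapped."""
    root = ET.fromstring(xml_text.strip())
    declared = {t.get("id") for t in root.findall("./application/task_graph/task")}
    return declared - set(task_placement(xml_text))


if __name__ == "__main__":
    print(task_placement(WORKLOAD_XML))
    print(unmapped_tasks(WORKLOAD_XML))
```

A check such as `unmapped_tasks` is the kind of constraint that goes beyond what an XML schema alone typically expresses, since it relates two separate parts of the document.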

Workload model capture • The workload can also be captured in UML, which is then converted to XML • Figure shows the stereotype extensions in UML

Near-future activities • Finish the ongoing activities in 2009 • Guideline preparation • How to do measurements (higher priority) • How to create traffic models • Finding more benchmarks • Suggestions are welcome! • Contact admin@ocpip.org
