
HW-SW Co-Design Framework for Parallel Distributed Computing on NoC-based MPSoC architectures

MPhil/Master dissertation presented by Jaume Joven Murillo and supervised by Dr. Jordi Carrabina Bordoll.


Presentation Transcript


  1. HW-SW Co-Design Framework for Parallel Distributed Computing on NoC-based MPSoC architectures. MPhil/Master dissertation presented by Jaume Joven Murillo and supervised by Dr. Jordi Carrabina Bordoll.

  2. Presentation outline • Introduction • Basic concepts & state of the art in NoCs and MPSoCs • Design framework and working methodology • HW-SW NoC-based MPSoC implementation • Experimental results • Conclusions & future work

  3. 1. Introduction 1.1 - Introduction & research project analysis 1.2 - Objectives of the research project

  4. Introduction • The continuous evolution of technology (Moore’s law) means that every IC can contain a large number of transistors (until 2020, according to the SIA roadmap) • Productivity gap • Adopted solutions • Component reuse (IP cores) • Soft-core processors • HW-SW co-design • Novel design methodologies • Communication-centric • Novel on-chip paradigms • Networks-on-Chip (NoCs) • System-level languages • SystemC™, UML,… • Develop complex ICs with billions of transistors in the near future • Multiprocessor System-on-Chip (MPSoC) / Multi-cores / Chip-multiprocessors (CMP) • Sea of tiles (IP cores) interconnected by a Network-on-Chip

  5. Objectives of the research project • Develop a HW-SW co-design framework for parallel distributed computing on-chip, applying platform-based design concepts • Follow a co-evolution strategy with two concurrent phases (HW and SW) • Hardware architecture • Scalable distributed-memory NoC-based MPSoC (NUMA) • Software framework • Software drivers • embedded Message Passing Interface (eMPI) • Run benchmarks & test applications • Explore concurrency and parallelism in on-chip environments

  6. 2. Basic concepts and state of the art in NoCs and MPSoCs 2.1 - On-chip communication schemes 2.2 - Basic concepts on NoCs 2.3 - NoC topologies 2.4 - Switching modes & routing schemes 2.5 - Flow control & micro-network stack 2.6 - State of the art in NoCs/MPSoCs

  7. On-chip communication architectures • Point-to-point • Fixed dedicated wires • Not flexible, not shared • No reusability • Bus-based interconnection (OCB) • Shared communication infrastructure • Multi-level, hierarchical or segmented buses • Bus becomes a bottleneck • On-chip network (NoC) • Distributed nature • Maximum flexibility & scalability • Exploits reusability, parallel operations/transactions • Regular geometry • Predictable layout and performance • Best testability & verification time • Must guarantee a certain QoS

  8. Basic concepts on NoCs • Tile • Computational nodes • Router/Switch • Communication nodes • Switching and routing strategy • Network adapter (NA, NI, NIC) • Decouples computation from communication • Adapts network & tile clock domains (GALS) • Links • Dedicated P2P communication channels • Flow control protocol (handshake or credit-based) • NoC-based systems • High degree of composition and traffic diversity • Good floorplanning & minimal buffering are desirable • Conventional/Traditional networks • Homogeneous and coarse-grained

  9. NoC topologies • Typical of multiprocessor systems, but now on a chip • Regular • Predictable in terms of • Power consumption • Performance (bandwidth, latency…) • Area usage • Good floorplanning • Non-regular • Mixing regular topologies • Mesh-Torus, Ring-Mesh, Ring-Hypercube • Direct • At least one tile attached to each node • Indirect • A subset of nodes is not connected to any core • Topology selection is a trade-off between • Network complexity or on-chip area costs • Communication requirements or network performance

  10. Switching modes & routing schemes • Circuit switching • Involves establishing & releasing a circuit between source and destination • Buffer-less switching scheme • Packet switching • Forwards the data to the next hop • Buffering is mandatory • Different packet switching modes • Store-and-forward • Stall at two nodes and the link between them • Wormhole • Combines packet switching + circuit switching • Reduces buffer size • Stall at all nodes and links spanned by the packet • Virtual cut-through • Next hop must store the whole packet • Stall at local node • Buffering • Buffer size: width, depth • Location in the router • Shared or distributed • Affects the power consumption & area usage
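The difference between store-and-forward and wormhole switching can be illustrated with a zero-load latency estimate. This is only a sketch under simplifying assumptions that are not in the slides: one flit crosses one link per cycle, routers add no extra delay, and there is no contention. The function names are hypothetical.

```c
#include <assert.h>

/* Store-and-forward: every hop must receive the whole packet
 * before forwarding it, so the packet time is paid once per hop. */
long saf_latency_cycles(int hops, int flits)
{
    assert(hops > 0 && flits > 0);
    return (long)hops * flits;
}

/* Wormhole: the head flit needs one cycle per hop and the
 * remaining flits follow it in a pipeline. */
long wormhole_latency_cycles(int hops, int flits)
{
    assert(hops > 0 && flits > 0);
    return (long)hops + flits - 1;
}
```

Under these assumptions an 8-flit packet crossing 3 hops takes 24 cycles with store-and-forward and 10 cycles with wormhole switching; over a single hop the two estimates coincide.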

  11. Switching modes & routing schemes • Routing schemes • Deterministic • Path determined by the source & destination addresses • Easy to implement • Not optimal under congestion • Adaptive • Path decided on a per-hop basis • Complex to implement • Must be deadlock- and livelock-free • Offers great benefits under congestion • [Figure: XY routing]
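As an illustration of a deterministic scheme, the following is a minimal sketch of XY routing on a 2D mesh: the packet first travels along X until the column matches, then along Y. The port names and the coordinate convention (X grows eastward, Y grows southward) are assumptions made for this example, not details taken from the dissertation.

```c
#include <assert.h>

/* Hypothetical output ports of a 2D-mesh router. */
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* Deterministic XY routing: correct the X coordinate first, then Y.
 * Assumption: X grows eastward and Y grows southward. */
port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    assert(cur_x >= 0 && cur_y >= 0 && dst_x >= 0 && dst_y >= 0);
    if (dst_x > cur_x) return PORT_EAST;
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_SOUTH;
    if (dst_y < cur_y) return PORT_NORTH;
    return PORT_LOCAL; /* already at the destination tile */
}
```

Because the decision depends only on the current and destination coordinates, every packet between the same pair of tiles follows the same path, which is what makes the scheme easy to implement but unable to route around congestion.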

  12. Flow control & micro-network stack • Flow control protocol (ensures the correct transport of packets) • Handshake • Request/acknowledge signals (req, ack/nAck) • Simpler and cheaper than credit-based • Credit-based • All network components keep counters for the available buffer space • Data received → counter-- | Data sent → counter++ | if counter==0 → buffer full • Network stack layers • Transport • Network adapter has to pack/unpack messages into network-layer packets • Network • Where & how a packet is transmitted • Data-link • Protocol to transmit a flit/phit • Physical • Number & length of wires
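The credit-based counter rule on this slide can be sketched as follows. The type and function names are hypothetical; the counter tracks the free slots of one buffer, decreasing when data is received and increasing when data is sent onward.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Free-slot counter for one buffer (hypothetical type). */
typedef struct {
    int credits;  /* available buffer space, in flits */
    int capacity; /* total buffer depth, in flits */
} credit_counter_t;

void credit_init(credit_counter_t *c, int capacity)
{
    assert(c != NULL && capacity > 0);
    c->credits = capacity;
    c->capacity = capacity;
}

/* counter == 0 means the buffer is full. */
bool credit_buffer_full(const credit_counter_t *c)
{
    return c->credits == 0;
}

/* Data received: counter--. Returns false if the buffer was full. */
bool credit_on_receive(credit_counter_t *c)
{
    if (credit_buffer_full(c)) return false;
    c->credits--;
    return true;
}

/* Data sent to the next hop: counter++. */
void credit_on_send(credit_counter_t *c)
{
    assert(c->credits < c->capacity);
    c->credits++;
}
```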

  13. State of the art in NoCs/MPSoCs • NoCs have been an emerging, hot topic in recent years • Research at all stack levels • System/Application level • Design methodologies, co-exploration • Programming models & OS support • Network adapters • Network architecture • Link level • Research on MPSoC • HW-SW interfaces • Implementation of parallel programming models • Shared memory or message passing • ccNUMA MPSoC architecture using NIOS-II • MPSoC using segmented buses (HIBI)

  14. 3. Design framework and working methodology 3.1 - HW-SW Co-design flow 3.2 - Proposed NoC-based MPSoC architecture 3.3 - Prototyping platform

  15. HW-SW Co-design flow • System specification • Architecture exploration • µP, VLIW, DSP… • NoC routers, buses • NIC interfaces • Architecture design and HW-SW co-design • RTL architecture • IP core integration (soft-cores) • Software design • Benchmarks/Applications • embedded MPI (eMPI) • NIC driver • Integration and system verification • SystemC™ • On-chip co-debugging • Functional prototype • Tools: Quartus II + SOPC; Microsoft Visual Studio & Eclipse IDE for NIOS-II; ModelSim, GTKWave, SignalTap; Synplify or Quartus II

  16. Proposed NoC-based MPSoC architecture • Distributed-memory NoC-based MPSoC components • NoC communication architecture • Soft/Hard IP core processors (Pi) • Distributed memory subsystem (Mi) • Network Interface Controller (NICi) • Driver for Network Interface Controller (NIC driver) • embedded Message Passing Interface (eMPI)

  17. Proposed NoC-based MPSoC architecture • NoC topology • 2D-Mesh (regular, predictable) • XY routing • Deterministic, minimal & deadlock-free • Switching mode • Ephemeral circuit switching • Store & forward • Flow control • 4-phase handshake • Tile composition • NIOS-II soft-core processor • On-chip RAM or SSRAM controller • NIC interface to NoC • Timer (IRQs, multi-threaded) • UART, JTAG, Performance Counter

  18. Prototyping platform • Stratix® EP1S25 DSP prototyping/development board • Altera® FPGA Stratix EP1S25F780C5 • Contains 25,660 LEs • Includes 1,944,576 bits of on-chip memory • 224 M512 RAM blocks (32x18b) • 138 M4K RAM blocks (128x36b) • 2 M-RAM blocks • 6 PLLs • 597 maximum user I/O pins • Off-chip memory • 2 Mbytes of SSRAM configured as two independent banks • 32 Mbits of flash memory • Other I/O • LEDs, RS232, buttons, switches, 7-segment displays

  19. 4. HW-SW NoC-based MPSoC implementation 4.1 - NoC-based MPSoC block diagram 4.2 - Communication channel 4.3 - Design of the Network Interface Controller 4.4 - Router design 4.5 - Software design 4.6 - Applications and benchmarks

  20. NoC-based MPSoC block diagram • Distributed-memory NoC-based MPSoC based on the NIOS-II soft-core processor • Each NIOS-II Avalon-based tile is generated effortlessly through Quartus II + SOPC • Our custom HW design • Implementation of flow control in each communication channel • Design of the Network Interface Controller • Design of the router

  21. Communication channel • Implements a full-duplex 4-phase handshake protocol • Between NIC and router or between routers • The 4-phase protocol is unambiguous • Two independent, synchronous FSMs have been designed • Packet definition • The definition of each subfield • XY address, message id, message length, sequence number, flags, priority… • Size of each subfield • Fixes the router and NIC implementation • [Figure: our packet format for a 2D-Mesh]
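A 4-phase (return-to-zero) handshake can be sketched as a small sender-side state machine: raise req, wait for ack to rise, lower req, wait for ack to fall. The state names and the one-call-per-clock-cycle interface below are assumptions for illustration and do not reproduce the FSMs designed in the dissertation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sender-side states of a 4-phase handshake. */
typedef enum { TX_IDLE, TX_WAIT_ACK_HIGH, TX_WAIT_ACK_LOW } tx_state_t;

typedef struct {
    tx_state_t state;
    bool req; /* request line driven by the sender */
} tx_fsm_t;

void tx_init(tx_fsm_t *f)
{
    assert(f != NULL);
    f->state = TX_IDLE;
    f->req = false;
}

/* Advance one clock cycle; returns true when one transfer completes. */
bool tx_step(tx_fsm_t *f, bool data_valid, bool ack)
{
    assert(f != NULL);
    switch (f->state) {
    case TX_IDLE:
        if (data_valid && !ack) {      /* phase 1: raise req */
            f->req = true;
            f->state = TX_WAIT_ACK_HIGH;
        }
        break;
    case TX_WAIT_ACK_HIGH:
        if (ack) {                     /* phase 2 seen, phase 3: lower req */
            f->req = false;
            f->state = TX_WAIT_ACK_LOW;
        }
        break;
    case TX_WAIT_ACK_LOW:
        if (!ack) {                    /* phase 4: ack released */
            f->state = TX_IDLE;
            return true;
        }
        break;
    }
    return false;
}
```

Since both lines return to zero between transfers, each rising edge of req marks exactly one new data item, which is one way to read the claim that the 4-phase protocol is unambiguous.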

  22. Design of the Network Interface Controller • NIC: interface between the tiles and the routers of our NoC • Decouples the tile’s computation from the NoC’s communication infrastructure • Key to achieving a good packet injection rate into the NoC • Builds flits/packets • Bus peripheral (slave) • Polling or IRQs • Register memory mappings • N+1 bits of addressable bus space • Custom instruction (CI-based NIC) • Attached to the processor datapath • Neither master nor slave
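Building a packet header in the NIC amounts to packing the subfields into a word. The sketch below assumes a purely hypothetical 32-bit layout (4-bit X and Y addresses, 8-bit message id, 8-bit length, 4-bit sequence number, 2-bit flags, 2-bit priority); the slides name the subfields but do not give their sizes, so these widths are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical header fields; the bit widths are assumptions. */
typedef struct {
    uint8_t dst_x;    /* 4 bits */
    uint8_t dst_y;    /* 4 bits */
    uint8_t msg_id;   /* 8 bits */
    uint8_t length;   /* 8 bits */
    uint8_t seq;      /* 4 bits */
    uint8_t flags;    /* 2 bits */
    uint8_t priority; /* 2 bits */
} header_t;

/* Pack the fields into one 32-bit header word, most significant first. */
uint32_t header_pack(const header_t *h)
{
    assert(h->dst_x < 16 && h->dst_y < 16 && h->seq < 16);
    assert(h->flags < 4 && h->priority < 4);
    return ((uint32_t)h->dst_x << 28) | ((uint32_t)h->dst_y << 24) |
           ((uint32_t)h->msg_id << 16) | ((uint32_t)h->length << 8) |
           ((uint32_t)h->seq << 4) | ((uint32_t)h->flags << 2) |
           (uint32_t)h->priority;
}

/* Recover the fields from a header word. */
header_t header_unpack(uint32_t w)
{
    header_t h;
    h.dst_x    = (uint8_t)((w >> 28) & 0xFu);
    h.dst_y    = (uint8_t)((w >> 24) & 0xFu);
    h.msg_id   = (uint8_t)((w >> 16) & 0xFFu);
    h.length   = (uint8_t)((w >> 8) & 0xFFu);
    h.seq      = (uint8_t)((w >> 4) & 0xFu);
    h.flags    = (uint8_t)((w >> 2) & 0x3u);
    h.priority = (uint8_t)(w & 0x3u);
    return h;
}
```

Placing the XY address in the top bits lets a router inspect the destination from the first bits received; that ordering is a design choice of this sketch, not a documented property of the original packet format.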

  23. Router design • Circuit switching • Ephemeral circuit switching • Two latency cycles • One for XY routing • Another for PathSwitchMatrix • XY routing • Generates the signals toNorth, toEast, …, toLocal, indicating where the packet will be forwarded • PathSwitchMatrix • Establishes/releases the output communication channel • Selects the output according to the request received from MeshXYRouting

  24. Router design • Packet switching • Store and forward • Full or shared/unified output queue • Now, the latency to traverse the router depends on: • FIFO capacity (depth) • Output queue policies • RR, CBQ, priority queues… • XY routing • Essentially the same as before, but without taking into account the circuitEstablished signals • PathSwitchMatrix • Saves the incoming packet in the FIFO • Transmits the packet from the FIFO to the next hop • A FIFO controller is needed to perform the 4-phase handshake protocol • FIFOs should be mapped to on-chip RAM or implemented using registers

  25. Software design • HW-SW platform stack view of our distributed-memory NIOS-II-based MPSoC with a 2D-Mesh interconnection strategy • Software components • NIC driver: low-level communication API • eMPI: high-level communication API for message passing • Optionally, an operating system (OS) might be included between the HdS (“drivers”) and the high-level communication APIs

  26. Software design • The NIC software driver contains 3 basic functions: • Interact transparently with a given NIC component, exploiting all HW capabilities • volatile int *NIC = (volatile int *)(NIC_BASE); volatile int *NIC_TX = (volatile int *)(NIC_BASE + 0x4); • Status register masks • 0x1 → dataPending • 0x2 → txBusy
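The status masks above suggest how a polling driver could check the NIC before touching its data registers. The sketch below is not the dissertation's driver: the three-register layout (status, tx, rx) and the two helper functions are hypothetical, and the registers are accessed through a struct pointer so the code can also be exercised on a host with an ordinary variable standing in for the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NIC_STATUS_DATA_PENDING 0x1u /* mask from the slide: dataPending */
#define NIC_STATUS_TX_BUSY      0x2u /* mask from the slide: txBusy */

/* Hypothetical register block; on the target it would sit at NIC_BASE. */
typedef struct {
    volatile uint32_t status;
    volatile uint32_t tx;
    volatile uint32_t rx;
} nic_regs_t;

/* Non-blocking send: 0 on success, -1 while the transmitter is busy. */
int nic_try_send(nic_regs_t *nic, uint32_t word)
{
    assert(nic != NULL);
    if (nic->status & NIC_STATUS_TX_BUSY) return -1;
    nic->tx = word;
    return 0;
}

/* Non-blocking receive: 0 on success, -1 when no data is pending. */
int nic_try_recv(nic_regs_t *nic, uint32_t *word)
{
    assert(nic != NULL && word != NULL);
    if (!(nic->status & NIC_STATUS_DATA_PENDING)) return -1;
    *word = nic->rx;
    return 0;
}
```

A blocking variant would simply call these helpers in a loop until they return 0, which corresponds to the polling option mentioned for the bus-slave NIC.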

  27. Software design • The eMPI software API will be our high-level design language • Implements message passing over our on-chip network • Steps to create our eMPI • Select a minimal working subset of standard MPI functions • MPI_Init(), MPI_Finalize(), MPI_Comm_size(), MPI_Comm_rank() • MPI_Send(), MPI_Recv() • Port the de facto standard MPI to our on-chip network • Lightweight message passing interface with low memory overhead (~15-20 KB)

  28. Applications and benchmarks • The software framework lets us run parallel applications on the hardware architecture • All applications and benchmarks were implemented using the NIC driver instead of the eMPI software API • COMMS1 & COMMS2 • Ping-pong benchmarks

  29. Applications and benchmarks • Parallelization of the Mandelbrot set • Iterative loop using complex numbers • Complex numbers are C=a+bi (a, b are C/C++ float or double) • Well suited to message-passing parallelization • [Figure: Mandelbrot set, eMPI function calls; labels: Processor0, Processor1, Processor2, Processor3]
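The Mandelbrot computation parallelizes well because each point is iterated independently. The sketch below shows the escape-time loop for one point C = a + bi and one possible way to split image rows among processors; the block partitioning and all function names are assumptions for illustration, since the slides do not state how the work was actually distributed.

```c
#include <assert.h>
#include <stddef.h>

/* Escape-time iteration z <- z*z + c for c = a + bi, starting at z = 0.
 * Returns the number of iterations performed before |z| exceeds 2,
 * or max_iter if the point never escapes. */
int mandel_iters(double a, double b, int max_iter)
{
    assert(max_iter > 0);
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + a;
        zi = 2.0 * zr * zi + b;
        zr = t;
        n++;
    }
    return n;
}

/* Hypothetical block partition: which image rows a given rank computes.
 * The first (height % nprocs) ranks receive one extra row. */
void rows_for_rank(int rank, int nprocs, int height, int *first, int *count)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs && height >= 0);
    assert(first != NULL && count != NULL);
    int base = height / nprocs;
    int extra = height % nprocs;
    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

In a message-passing version, each processor would compute only its own rows and send the results to a collecting processor, so the only communication is the final transfer of the computed rows.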

  30. 5. Experimental results 5.1 - Hardware costs: area usage 5.2 - Hardware costs: area and power usage 5.3 - Software framework requirements 5.4 - On-chip network: throughput and bandwidth 5.5 - Application results 5.6 - Comparative results

  31. HW costs: area usage • Router comparison between our Ephemeral Circuit Switching and our Packet Switching with unified/shared output queue • On a 2D-Mesh, each router has between 3 and 5 ports • The Ephemeral Circuit Switching router is between 2.5 and 3.8 times smaller than our Packet Switching router with unified/shared output queue

  32. HW costs: area usage • Evolution of NxN 2D-Mesh NoC-based MPSoC • Ratio of HW resources • CS: 20% comm. / 80% comp. • PS: 45% comm. / 55% comp. • Ephemeral circuit switching is a low-cost architecture • Area resources • On-chip memory requirements • [Figures: logic elements (LEs) vs. NxN 2D-Mesh size, for Ephemeral Circuit Switching and for Packet Switching (store and forward)]

  33. HW costs: area and power usage • 2x2 Mesh NoC-based MPSoC with Ephemeral Circuit Switching • Does not use any on-chip memory • Communication infrastructure (15%) is extremely small compared to the computational components (85%) • HW resources distribution • Running at 20 MHz we can achieve around 60 DMIPS • Overall system metrics • 49.65 mW/MHz • 3 DMIPS/MHz • Dynamic power usage • 993.31 mW • Static: 548.39 mW • Dynamic: 442.92 mW • The NoC accounts for only 0.5%

  34. Software framework requirements • Each processor needs its own RAM • Distributed-memory architecture • At least 64 KB of RAM per processor • To load the software framework • Application data and algorithm • On-chip FPGA memory resources • High throughput (few cycles to access) • Low capacity (~KB) • External SSRAM available on the prototyping board • Low throughput (many cycles to access) • Large capacity (~MB) • Trade-off between capacity and throughput

  35. On-chip network: throughput & bandwidth • 2x2 Mesh NoC-based MPSoC with Ephemeral Circuit Switching • Maximum channel bandwidth is about 168.84 Mbps at 63.24 MHz • Bandwidth decreases with the number of hops (end-to-end flow control) • [Figure label: Speedup ~4x]

  36. Application results • Test of the parallelization of the Mandelbrot set on several architectures • Sequential execution on a simple NIOS-II uniprocessor • Parallel execution on a dual-core NIOS-II architecture • Parallel execution on a 2x2 Mesh NoC-based MPSoC with Ephemeral Circuit Switching • [Figure labels: Speedup ~4x, Speedup ~2x]

  37. Comparative results

  38. 6. Conclusions & future work 6.1 - Conclusions 6.2 - Future work

  39. Conclusions • I have proposed a complete HW-SW framework for distributed-memory NoC-based MPSoC architectures • eMPI is a viable solution for on-chip parallelism using message passing • The methodology has been formalized as a HW-SW co-design flow • Complete system-level design tool chain • Validity tested on a physical platform (FPGA) • Methodology is also valid for ASIC development • This research work lets us perform distributed parallel computing on a chip with little effort • Useful parallel on-chip platform for many high-performance computing and “low power” emerging applications • Multimedia applications • Smart cams • Software-defined radio • There is a lack of verification and support tools for creating complex MPSoCs

  40. Future work • Long term • Extend this architecture to implement heterogeneous systems • Extend this architecture to a hybrid memory model (shared distributed memory system) • Large memory bank as a tile • Cache coherence • Mechanism to access the shared medium • A complete SystemC™ simulation model would be useful • Evolution of the Ephemeral Circuit Switching architecture • Build a wormhole packet-switching router • Include a NIC queue in our Ephemeral Circuit Switching architecture • Change the fixed PriorityEncoder within PathSwitchMatrix • Test our architecture with a bus-slave NIC with IRQs

  41. Future work • Evolution of the software framework • Improve the NIC software driver functions • Extend the eMPI SW API with other useful collective communication functions • broadcast, scatter, gather, scan, reduce, allreduce, alltoall, reducescatter, barrier synchronization,… • Application level • Use real applications • Coarse-grain or fine-grain parallelism • Run a GALS scheme with multiple clock domains

  42. The end… Thank you!
