
Implementing a NoMC on the Gidel platform end-project presentation

Technion – Israel Institute of Technology, Department of Electrical Engineering, High Speed Digital Systems Lab. Spring 2009. Instructor: Evgeny Fiksman. Students: Meir Cohen, Daniel Marcovitch.


Presentation Transcript


  1. Technion – Israel Institute of Technology, Department of Electrical Engineering, High Speed Digital Systems Lab, Spring 2009. Implementing a NoMC on the Gidel platform: end-project presentation. Instructor: Evgeny Fiksman. Students: Meir Cohen, Daniel Marcovitch.

  2. Table of Contents

  3. In the previous semester… • We took the previous “router” and converted it to work on the Altera platform. • In addition, we prepared the system architecture and microarchitecture. Problem definition: • Implementing a parallel processing system which contains several NoCs, each chip containing several sub-networks of processors. The PC forms part of the network via PCI. • Writing an application which utilizes parallel processing. • Measuring system performance.

  4. This semester… • Implemented the various HW modules needed for larger-scale routing: • Added a 5th port to all routers/switches • Fabric router • InterChip GW • PC GW • Implemented asynchronous MPI commands (for both the Nios and the PC) • Wrote an example application which uses the 64 processors to solve a problem (heat transfer) • Measured system performance

  5. Putting it all together – a general view of topology • Each local cluster has 4 processors. • Each chip has 4 clusters (comms). • The Gidel board has 4 chips – altogether 64 processors. • The PC is also part of the network – switching between the 4 FPGAs is done in software, i.e. it forms a “virtual switch”.

  6. New HW modules (1) – Fabric router • In the “Local router”, forwarding is done by rank, i.e. rank = port. • In the “Fabric router”, a forwarding table is implemented.

  7. Routing tables [Figure labels: chip, fabric, local; Address: comm, rank] • Local router: • Same comm – routing by rank. • Other comms – to the 5th port. • Other routers: • Routing by comm/chip only. • The myComm, myChip entry is used for PC routing. • Implemented using VHDL’s “generate” statement to reuse existing modules. • A hex file is created for each router and loaded into ROM using a parameter. • Grouping (i.e. sub-network prefixes) allows us to use a small routing table (only 8 entries).
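
The two forwarding styles on slides 6 and 7 can be sketched in a few lines. This is a minimal Python model and not the project's VHDL: the `Address` tuple layout, the port numbering (ports 0-3 for ranks, port 4 as the 5th port), the table keys, and all function names are assumptions made for illustration.

```python
# Hypothetical model of the two router types. Names, port numbers and the
# table layout are illustrative assumptions, not taken from the project.
from typing import NamedTuple

UPLINK_PORT = 4  # assumed index of the 5th port (ports 0-3 serve ranks 0-3)


class Address(NamedTuple):
    chip: int
    comm: int
    rank: int


def local_route(my_chip: int, my_comm: int, dst: Address) -> int:
    """Local router: same comm -> port equals rank, anything else -> 5th port."""
    if dst.chip == my_chip and dst.comm == my_comm:
        return dst.rank
    return UPLINK_PORT


def fabric_route(table: dict, my_chip: int, dst: Address) -> int:
    """Fabric router: table lookup by comm inside the chip, by chip otherwise."""
    if dst.chip == my_chip:
        return table[("comm", dst.comm)]
    return table[("chip", dst.chip)]


# Example table for an assumed fabric router on chip 0: one entry per comm,
# and every other chip reached through the uplink port.
example_table = {("comm", c): c for c in range(4)}
example_table.update({("chip", ch): UPLINK_PORT for ch in range(1, 4)})
```

Because the lookup keys are prefixes (a comm or a chip) rather than full addresses, the table stays small no matter how many processors sit behind each prefix.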

  8. New HW modules (2) – IC GW [Figure labels: FIFO, local buffer, remote buffer, credit counter (inc/dec), local credit release, remote credit release] • Primary/Secondary indicates connectivity rather than implementation. • The interchip interface has increased latency – we use buffers and credits to ensure no FIFO overrun. • The credit counter is initialized with the FIFO size (i.e. 32) as the initial number of credits. • Since the FIFO size > end-to-end latency, the block gives 100% throughput.
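
The credit scheme on slide 8 can be illustrated with a small cycle-based simulation. This is a sketch under stated assumptions, not the project's gateway: the `CreditLink` class, the one-way `latency` of 4 cycles, and the always-ready sender are all hypothetical. Only the FIFO size of 32 and the inc/dec credit counter come from the slide.

```python
from collections import deque


class CreditLink:
    """Sender with a credit counter feeding a remote FIFO over a slow link."""

    def __init__(self, fifo_size=32, latency=4):
        self.fifo_size = fifo_size
        self.latency = latency      # assumed one-way delay in cycles
        self.credits = fifo_size    # counter starts at the remote FIFO size
        self.remote_fifo = deque()
        self.in_flight = deque()    # arrival cycles of words on the link
        self.releases = deque()     # arrival cycles of returning credits
        self.cycle = 0
        self.max_occupancy = 0

    def step(self, remote_pops):
        """Advance one cycle; the sender always has data. Return True if it sent."""
        while self.in_flight and self.in_flight[0] <= self.cycle:
            self.in_flight.popleft()
            self.remote_fifo.append(self.cycle)
        while self.releases and self.releases[0] <= self.cycle:
            self.releases.popleft()
            self.credits += 1       # remote credit release arrives: inc
        self.max_occupancy = max(self.max_occupancy, len(self.remote_fifo))
        if remote_pops and self.remote_fifo:
            self.remote_fifo.popleft()
            self.releases.append(self.cycle + self.latency)
        sent = self.credits > 0
        if sent:
            self.credits -= 1       # word pushed onto the link: dec
            self.in_flight.append(self.cycle + self.latency)
        self.cycle += 1
        return sent


def words_sent(link, cycles, remote_pops):
    """Run the link for a number of cycles and count the words sent."""
    return sum(link.step(remote_pops) for _ in range(cycles))
```

With the defaults, a receiver that never pops lets exactly 32 words through before the sender stalls, so the FIFO cannot overrun. A receiver that pops every cycle keeps the sender busy every cycle, because the round trip (8 cycles here) is shorter than the 32 credits available.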

  9. New HW modules (2) – IC routing • IC connectivity itself uses Gidel’s fastest busses: (1) neighbour busses between 1-2, 2-3, 3-4; (2) main bus between 1-4. • Both busses are wide enough to support bi-directional traffic. • i/f: 32-bit data, ctrl, credit_release, push/pop [total: 35 bits × 2]

  10. New HW modules (3) – PC GW [Figure labels: ToPCGw, FromPCGw] • Needed for three reasons: (1) FromPCGw adds the start/finish “ctrl” signal (it parses the MPI header for the “size” field). (2) It handles PCI idiosyncrasies (minimum message length). (3) It uses Gidel’s simple FIFO protocol (req/ack) rather than Altera’s FIFO protocol (push/pop).

  11. Testing and debug • Since the project is multi-layered, debug can be split into several types: • HW (component) issues • Connectivity • SW (NIOS/PC) • Component testing • Small testbenches, each encompassing a single block • Connectivity • Before running the main application, we ran a connectivity application to check that all NIOS processors can communicate with each other. • Made a Specman-E simulation emulating the router’s operation while loading and parsing the real hex files.

  12. Testing and debug • SW/NIOS • ModelSim was used for logic simulation. • Since the system is large and debugging is difficult and multi-layered (debugging an application running on NIOS), we added special debug registers. • Each NIOS writes to these registers (PIO – parallel I/O) during the application run, publishing its “state”. • In addition, debug registers were attached to the main FIFOs to indicate traffic flow (performance counters). • When running on the chip itself, these registers are sampled and displayed during the application to give an indication of system state. [Figure labels: PIO, FIFO counters]

  13. Application • Parallel Jacobi algorithm for approximating the solution of the equation [equation not preserved in the transcript]. • The matrix is distributed among the CPUs. • CPUs communicate with their neighbors. • Uses computation-communication overlapping. • Managed by the host PC. • Each iteration: compute interior, send/receive boundary, compute boundary. [Figure: matrix distribution]
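
The per-iteration order on slide 13 (compute interior, send/receive boundary, compute boundary) can be sketched as below. This is an illustration under assumptions, not the project's code. The equation itself is missing from the transcript, so a 4-neighbour averaging stencil (the usual Jacobi update for steady-state heat flow) is assumed. The grid is split into only two row strips, and plain list copies stand in for the asynchronous MPI transfers. All function names are hypothetical.

```python
def relax_row(u, i):
    """Jacobi update of row i (assumed 4-neighbour average); edge columns fixed."""
    row = u[i][:]
    for j in range(1, len(row) - 1):
        row[j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    return row


def serial_step(u):
    """Reference single-CPU iteration over the whole grid."""
    return [u[0][:]] + [relax_row(u, i) for i in range(1, len(u) - 1)] + [u[-1][:]]


def split(u, cut):
    """Two strips with one halo row each: top owns rows 1..cut-1, bottom cut..n-2."""
    top = [r[:] for r in u[:cut + 1]]      # last row is the halo
    bottom = [r[:] for r in u[cut - 1:]]   # first row is the halo
    return top, bottom


def overlapped_step(top, bottom):
    # 1. Post the boundary exchange (copies stand in for asynchronous sends).
    to_bottom, to_top = top[-2][:], bottom[1][:]
    # 2. Compute interior rows, which never read the halo rows.
    top_interior = [relax_row(top, i) for i in range(1, len(top) - 2)]
    bottom_interior = [relax_row(bottom, i) for i in range(2, len(bottom) - 1)]
    # 3. Complete the exchange: received rows overwrite the halos.
    top[-1], bottom[0] = to_top, to_bottom
    # 4. Compute the boundary rows, which need the received halo data.
    top_edge = relax_row(top, len(top) - 2)
    bottom_edge = relax_row(bottom, 1)
    new_top = [top[0]] + top_interior + [top_edge, top[-1]]
    new_bottom = [bottom[0], bottom_edge] + bottom_interior + [bottom[-1]]
    return new_top, new_bottom
```

Steps 2 and 3 are independent of each other, which is what allows a real implementation to hide the transfer time behind the interior computation. Dropping each strip's halo row and concatenating the two strips gives the same grid as `serial_step`.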

  14. Performance – application time vs number of iterations • Measurements were done on a dual-core Pentium processor running at 2.4 GHz. • The constant offset indicates PCI latency. • Running length is #Iterations * (communication + calculation). • Linear equation, as expected: #Iterations * (communication + calculation) + PCI offset

  15. Performance – throughput vs injection rate • For low injection rates, routing isn’t a bottleneck, so the output rate is almost identical to the input rate. • As the injection rate increases, the router becomes the bottleneck. • Once the maximum throughput of the router is reached, throughput is constant.

  16. Performance – simplified model – delay (congestion) • D(p) – delay as a function of the number of packets in the system • R – average router delay • L – system latency • λ – injection rate • D(p) = R∙p + L • p = λ∙D(p) [Little’s law] • Solving the two: D = L/(1 − λ∙R) and p = λ∙L/(1 − λ∙R) • R = 50, L = 80
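
The model on slide 16 can be evaluated directly. Substituting Little's law, p = λ∙D, into D = R∙p + L gives D = L/(1 − λ∙R), and therefore p = λ∙L/(1 − λ∙R). The sketch below assumes only the slide's values R = 50 and L = 80. The function name is hypothetical, and the slide does not state the units of λ, R and L, so none are assumed here.

```python
def congestion_model(lam, R=50.0, L=80.0):
    """Return (delay, packets_in_system) for injection rate lam.

    Solves D = R*p + L together with p = lam*D. The closed form is only
    valid while lam * R < 1; beyond that the model diverges.
    """
    if lam * R >= 1.0:
        raise ValueError("model diverges: lam * R must be below 1")
    delay = L / (1.0 - lam * R)
    return delay, lam * delay
```

For λ = 0 the delay equals the bare system latency L. It grows without bound as λ∙R approaches 1, which in the real system is capped by the finite FIFO sizes.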

  17. Performance – packet delay vs number of injection stubs • Few injection stubs – almost no congestion – constant delay. • As we approach maximum throughput – congestion increases and delay increases. • For very high injection rates, we approach system saturation (since FIFO sizes are finite (32 entries), there is a maximum number of packets in the system at any given moment).

  18. Performance – packet delay vs injection rate • For low injection rates – almost no congestion – constant delay. • We again see an exponential increase which peters out due to system saturation.

  19. Summary/conclusions: • The original router was robust and easily expanded to support a 5th port and routing tables. • Debugging software written on this system posed a serious challenge and required a certain measure of innovation. • Despite being on chip, communication between processors still constitutes a serious factor. Therefore, overall system performance will improve as the calculation/communication ratio increases. • For similar reasons, the network can be better used if locality between nodes is exploited. Next steps: • Compare topologies (mesh / fat tree) • Develop software to automatically create topologies out of building blocks • Simplify the router and increase throughput

  20. Questions
