
The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes



1. The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes
Daniel L. Ly¹, Manuel Saldaña² and Paul Chow¹
¹Department of Electrical and Computer Engineering, University of Toronto
²Arches Computing Systems, Toronto, Canada

2. Outline
• Background and Motivation
• Embedded Processor-Based Optimizations
• Hardware Engine-Based Optimizations
• Conclusions and Future Work

3. Motivation
• Message Passing Interface (MPI) is a programming model for distributed memory systems
• Popular in high performance computing (HPC) and cluster-based systems

4. Motivation
Problem: sum of the numbers from 1 to 100, given two processors, each with its own memory.
    for (i = 1; i <= 100; i++)
        sum += i;

5. Motivation
The same problem split across the two processors, exchanging the partial sum over MPI:
Processor 1:
    sum1 = 0;
    for (i = 1; i <= 50; i++)
        sum1 += i;
    MPI_Recv(sum2, ...);
    sum = sum1 + sum2;
Processor 2:
    sum1 = 0;
    for (i = 51; i <= 100; i++)
        sum1 += i;
    MPI_Send(sum1, ...);
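For reference, a complete, runnable version of this two-rank example might look as follows in standard MPI C. The tag value, the use of MPI_STATUS_IGNORE, and the printf are illustrative choices, not taken from the slides; the program assumes it is launched with at least two ranks.

/* Minimal sketch of the two-rank sum from the slides: rank 0 sums 1..50
 * and receives the partial sum for 51..100 from rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, i, sum1 = 0, sum2 = 0, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 1; i <= 50; i++)            /* local half of the work */
            sum1 += i;
        MPI_Recv(&sum2, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);         /* partial sum from rank 1 */
        sum = sum1 + sum2;
        printf("sum = %d\n", sum);           /* expect 5050 */
    } else if (rank == 1) {
        for (i = 51; i <= 100; i++)          /* other half of the work */
            sum1 += i;
        MPI_Send(&sum1, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}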

9. Motivation
• Strong interest in adapting MPI for embedded designs:
  • Increasingly difficult to interface heterogeneous resources as FPGA chip size increases
• MPI provides key benefits:
  • Unified protocol
  • Low weight and overhead
  • Abstraction of end points (ranks)
  • Easy prototyping

11. Motivation
• Interaction classes arising from heterogeneous designs:
  • Class I: Software-software interactions
    • Collections of embedded processors
    • Thoroughly investigated; will not be discussed
  • Class II: Software-hardware interactions
    • Embedded processors with hardware engines
    • Large variety in processing speed
  • Class III: Hardware-hardware interactions
    • Collections of hardware engines
    • Hardware engines are capable of significant concurrency compared to processors

12. Background
• Work builds on TMD-MPI [1]
  • Subset implementation of the MPI standard
  • Allows hardware engines to be part of the message-passing network
  • Ported to Amirix PCI, BEE2, BEE3, Xilinx ACP
  • Software libraries for MicroBlaze, PowerPC, Intel X86
[1] M. Saldaña et al., “MPI as an abstraction for software-hardware interaction for HPRCs,” HPRCTA, Nov. 2008.

13. Class II: Processor-based Optimizations
• Background
• Direct Memory Access MPI Hardware Engine
• Non-Interrupting, Non-Blocking Functions
• Series of MPI Messages
• Results and Analysis

14. Class II: Processor-based Optimizations (Background)
• Problem 1
  • Standard message paradigm for HPC systems
    • Plentiful memory but high message latency
    • Favours combining data into a few large messages, which are stored in memory and retrieved as needed
  • Embedded designs provide a different trade-off
    • Little memory but short message latency
    • A ‘just-in-time’ paradigm is preferred
    • Sending just enough data for one unit of computation, on demand

15. Class II: Processor-based Optimizations (Background)
• Problem 2
  • Homogeneity of HPC systems
    • Each rank has similar processing capabilities
  • Heterogeneity of FPGA systems
    • Hardware engines are tailored for a specific set of functions – extremely fast processing
    • Embedded processors play the vital role of control and memory distribution – little processing

16. Class II: Processor-based Optimizations (Background)
• ‘Just-in-time’ + heterogeneity = producer-consumer model
  • Processors produce messages for hardware engines to consume
  • Generally, the message production rate of the processor is the limiting factor

17. Class II: Processor-based Optimizations (Direct Memory Access MPI Engine)
• Typical MPI implementations use only software
• A DMA engine offloads the time-consuming part of messaging: memory transfers
  • Frees the processor to continue execution
  • Can implement burst memory transactions
  • Time required to prepare a message is independent of message length
  • Allows messages to be queued

20. Class II: Processor-based Optimizations (Direct Memory Access MPI Engine)
• MPI_Send(...)
  • Processor writes 4 words:
    • destination rank
    • address of data buffer
    • message size
    • message tag
  • PLB_MPE decodes the message header
  • PLB_MPE transfers the data from memory
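As a rough illustration of what the software side of such a send could reduce to, the sketch below writes the four descriptor words to a memory-mapped command FIFO. The register name MPE_CMD_FIFO, its address, the word order, and the 32-bit address assumption are placeholders for illustration only, not the actual TMD-MPI/PLB_MPE interface.

/* Hedged sketch: posting one send descriptor to a hypothetical
 * memory-mapped DMA MPI engine. The engine is assumed to decode the
 * header and burst the payload from memory on its own. */
#include <stdint.h>

#define MPE_CMD_FIFO ((volatile uint32_t *)0x80000000u)  /* assumed MMIO address */

static inline void mpe_send(uint32_t dest_rank, const void *buf,
                            uint32_t len_words, uint32_t tag)
{
    /* Queue one send descriptor; the processor is free to continue
     * immediately after these four writes (barriers omitted for brevity). */
    *MPE_CMD_FIFO = dest_rank;                  /* destination rank        */
    *MPE_CMD_FIFO = (uint32_t)(uintptr_t)buf;   /* address of data buffer  */
    *MPE_CMD_FIFO = len_words;                  /* message size            */
    *MPE_CMD_FIFO = tag;                        /* message tag             */
}

Because only these four words cross the bus per message, the cost of posting a send is independent of the message length, which is what makes queuing messages practical.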

32. Class II: Processor-based Optimizations (Direct Memory Access MPI Engine)
• MPI_Recv(...)
  • Processor writes 4 words:
    • source rank
    • address of data buffer
    • message size
    • message tag
  • PLB_MPE decodes the message header
  • PLB_MPE transfers the data to memory
  • PLB_MPE notifies the processor
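The receive side can be sketched in the same hedged style. Here the engine's completion notification is modelled as polling a hypothetical status register (MPE_RECV_DONE); an interrupt-driven notification is equally plausible. All names and addresses below are placeholders, not the real PLB_MPE register map.

/* Hedged sketch: the receive path mirrors the send path, except the
 * processor must learn when the data has landed in its buffer. */
#include <stdint.h>

#define MPE_CMD_FIFO  ((volatile uint32_t *)0x80000000u)  /* assumed MMIO address */
#define MPE_RECV_DONE ((volatile uint32_t *)0x80000004u)  /* assumed status reg   */

static inline void mpe_recv(uint32_t src_rank, void *buf,
                            uint32_t len_words, uint32_t tag)
{
    /* Post one receive descriptor... */
    *MPE_CMD_FIFO = src_rank;                   /* source rank            */
    *MPE_CMD_FIFO = (uint32_t)(uintptr_t)buf;   /* address of data buffer */
    *MPE_CMD_FIFO = len_words;                  /* message size           */
    *MPE_CMD_FIFO = tag;                        /* message tag            */

    /* ...then wait until the engine reports the transfer to memory is done. */
    while (*MPE_RECV_DONE == 0)
        ;
}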

46. Class II: Processor-based Optimizations (Direct Memory Access MPI Engine)
• The DMA engine is completely transparent to the user
  • Exactly the same MPI functions are called
  • DMA setup is handled by the implementation

47. Class II: Processor-based Optimizations (Non-Interrupting, Non-Blocking Functions)
• Two types of MPI message functions:
  • Blocking functions: return only when the buffer can be safely reused
  • Non-blocking functions: return immediately
    • A request handle is required so the message status can be checked later
• Non-blocking functions are used to overlap communication and computation

48. Class II: Processor-based Optimizations (Non-Interrupting, Non-Blocking Functions)
Typical HPC non-blocking use case:
    MPI_Request request;
    ...
    MPI_Isend(..., &request);
    prepare_computation();
    MPI_Wait(&request, ...);
    finish_computation();

49. Class II: Processor-based Optimizations (Non-Interrupting, Non-Blocking Functions)
• Class II interactions have a different use case
  • Hardware engines are responsible for computation
  • Embedded processors only need to send messages as fast as possible
  • DMA hardware allows messages to be queued
• ‘Fire-and-forget’ message model
  • Message status is not important
  • Request handles are serviced by expensive interrupts

50. Class II: Processor-based Optimizations (Non-Interrupting, Non-Blocking Functions)
Standard MPI provides a mechanism for ‘fire-and-forget’:
    MPI_Request request_dummy;
    ...
    MPI_Isend(..., &request_dummy);
    MPI_Request_free(&request_dummy);
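A minimal sketch of how this might be used in the producer-consumer setting described above: the processor streams fixed-size work units to a hardware-engine rank and frees each request immediately, with no MPI_Wait and no interrupt-driven completion handling. The names (stream_to_engine, hw_rank, DATA_TAG, UNIT_WORDS, NUM_UNITS) are hypothetical. Note that once a request is freed, standard MPI gives no way to learn when its send buffer can be reused, so each unit here occupies its own slot of a larger array.

/* Hedged sketch of a 'fire-and-forget' producer loop feeding a hardware
 * consumer rank, using only standard MPI calls. */
#include <mpi.h>

#define NUM_UNITS  64   /* illustrative number of work units  */
#define UNIT_WORDS 16   /* illustrative size of one work unit */
#define DATA_TAG   1    /* illustrative message tag           */

void stream_to_engine(int hw_rank, int units[NUM_UNITS][UNIT_WORDS])
{
    int i;
    for (i = 0; i < NUM_UNITS; i++) {
        MPI_Request request_dummy;

        /* Post the send and immediately release the handle; the message
         * still completes, but its status is never checked. */
        MPI_Isend(units[i], UNIT_WORDS, MPI_INT, hw_rank, DATA_TAG,
                  MPI_COMM_WORLD, &request_dummy);
        MPI_Request_free(&request_dummy);
    }
}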
