
SCI low-level programming: SISCI / SMI / CML+





  1. SCI low-level programming: SISCI / SMI / CML+
Joachim Worringen, Lehrstuhl für Betriebssysteme, RWTH Aachen
Martin Schulz, Lehrstuhl für Rechnertechnik und Rechnerorganisation, Technische Universität München
SCI Summer School, Oct 1-3, Dublin

  2. Goals of this Tutorial & Lab
• Introduction into low-level programming issues
  • What happens in the SCI hardware
  • What are the special features of SCI
  • How to use it in your own programs
• Present a few available low-level APIs
  • Direct SCI & low-level messaging
  • Discuss example codes
• Enable you to use these APIs in your codes
• Enable you to exploit the power of SCI

  3. Why bother with low-level issues?
• Directly leveraging SCI technology
  • More than just "yet another MPI" machine
  • Use distinct advantages of SCI
  • Can be beneficial in many applications
• Learn about internals
  • Very important to achieve high performance
  • Complex interactions between hardware & software need to be considered
• And of course: it is much more fun!

  4. What you will learn
• Overview of the SCI software stack
  • What is available & where it belongs
• Details about the principle behind SCI
  • How the hardware works
  • How to exploit SCI's special features
• Basics of three common low-level APIs
  • SISCI – THE standard API for SCI systems
  • SMI – more comfortable setup
  • CML+ – low-level messaging

  5. What you will not learn here
• Performance tuning
  • However, low-level knowledge is key for this
• Exact details about the implementation
  • Would go too far for this session
  • If interested, ask during the breaks
• Concrete API definitions
  • Some of this will be covered in the lab session
  • Only the parts necessary for the lab
  • Consult the manuals for more details

  6. Overview
• Why bother with low-level programming?
• The SCI principle
• Typical Software Hierarchy
• The SISCI Interface
• The SMI Interface
• The CML message engine
• Outlook on the Lab-Session

  7. SCI low-level programming: SISCI / SMI / CML+
→ The SCI principle
• Typical SCI Software Hierarchy
• The SISCI API
• The SMI interface
• The CML message engine
• Outlook on the Lab-Session

  8. The Scalable Coherent Interface
• IEEE Standard 1596 (since 1992)
  • Initiated during the FutureBus+ specification
  • Overcome limitations of bus architectures
  • Maintain bus-like services
• Hierarchical ringlets with up to 65536 nodes
• Packet-based, split-transaction protocol
• Remote memory access capabilities
  • Optional cache coherence
  • Synchronization operations
• Connection to the host system not standardized

  9. SCI-based Architectures
• Interconnection fabric for CC-NUMA machines
  • Utilization of the cache coherence protocol
  • NUMA-Liine (Data General/?), NUMA-Q (IBM)
• I/O extension
  • PCI-to-PCI bridges
• Cluster interconnection technology
  • PCI-SCI bridges (from Dolphin ICS)
  • Comparable to other SANs like Myrinet
  • Enables user-level communication

  10. User-level communication
[Figure: traditional protocol stack (applications, e.g. sockets, TCP/UDP/IP stack, Ethernet driver, NIC) compared with user-level communication, where setup routines go through the OS device driver once ("1x setup") and communication goes directly from the user library to the NIC hardware]
• Only the setup goes via the kernel; communication is direct
• Advantages
  • No OS overhead
  • No protocols
  • Direct HW utilization
• Typical performance
  • Latency < 10 µs
  • Bandwidth > 80 MB/s (for 32-bit, 33 MHz PCI)

  11. SCI-based PC clusters
[Figure: PCs with PCI-SCI adapters, connected by the SCI interconnect, forming a global address space]
• Bus-like services
  • Read, write, and synchronization transactions
• Hardware-supported DSM
  • Global address space to access any physical memory

  12. User-level communication
[Figure: communication over an SCI ringlet – physical memory is exported into the SCI physical address space and mapped into the virtual memory of a remote process]
• Setup of communication
  • Export of physical memory into the SCI physical address space
  • Mapping into virtual memory
• Read/write to the mapped segment
  • NUMA characteristics
• High performance
  • No protocol overhead
  • No OS influence
  • Latency: < 2.0 µs
  • Bandwidth: > 300 MB/s

  13. SCI remote memory mappings
[Figure: two-step address translation – on node A, a process's virtual address space is mapped by the CPU/MMU onto physical memory and the PCI address range; the ATTs of the SCI bridge map PCI addresses into the SCI physical address space, and the bridge on node B translates them back into its own PCI and physical addresses, where node B's MMU maps them into a process's virtual address space]

  14. Address Translation Table (ATT)
• Present on each SCI adapter
• Controls outbound communication
• 4096 entries available
  • Each entry controls between 4 KB and 512 KB
  • 256 MB total mapped memory → 64 KB per entry
• Many parameters per ATT entry, e.g.
  • Prefetching & buffering
  • Enable atomic fetch & inc
  • Control ordering

  15. Some more details
• Inbound communication
  • No remapping: the offset part of the SCI physical address = physical address
  • Basic protection mechanism using an access window
• Outstanding transactions
  • Up to 16 write streams
  • Gather write operations to consecutive addresses
• Read/write ordering (optional)
  • Complete write transactions before a read transaction

  16. Summary
• Key feature of SCI: hardware DSM
  • Allows efficient communication
  • No OS involvement
  • No protocol overhead
• Implemented using remote memory maps (illustrated below)
  • Access to remote memory regions
  • Based on physical addresses
• Mapping is done in a two-step process
  • ATTs control the mapping across nodes
  • MMUs control the mapping within a node
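To make the two-step mapping concrete, the sketch below (not from the slides; the pointer is assumed to come from one of the mapping APIs shown later) illustrates that, once a remote segment is mapped into the local virtual address space, communication is just an ordinary CPU load or store:

    #include <stdint.h>

    /* 'remote' is assumed to point into a remote SCI segment that has already
     * been mapped into this process's virtual address space (e.g. via SISCI's
     * SCIMapRemoteSegment, see below). The store is translated by the local
     * MMU to a PCI address, forwarded by the adapter's ATT onto the SCI
     * ringlet, and written into the remote node's physical memory - without
     * any OS involvement. 'volatile' keeps the compiler from optimizing the
     * accesses away or reordering them. */
    static void ping(volatile uint32_t *remote)
    {
        remote[0] = 42;          /* remote write: travels over SCI            */
        uint32_t v = remote[1];  /* remote read: fetches from the remote node */
        (void)v;
    }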

  17. SCI low-level programming: SISCI / SMI / CML+
• The SCI principle
→ Typical SCI Software Hierarchy
• The SISCI API
• The SMI interface
• The CML message engine
• Outlook on the Lab-Session

  18. Software for SCI
• Very diverse software environment available
  • Both major parallel programming paradigms: Shared Memory & Message Passing
  • Most standard APIs and environments
  • All levels of abstraction included
  • Many operating systems involved
• Problems
  • Many sources (industry and academia)
  • Compatibility and completeness
  • Accessibility

  19. SCI software stack (incomplete)
[Figure: layered SCI software stack, from target applications / test suites down to the hardware]
• High-level software: Sockets, Cray shmem, ScaMPI, SCI-MPICH, PVM, distributed JVM, OpenMP, distributed threads, TCP/UDP/IP stack
• Low-level software: ScaFun, SMI, CML+, HAMSTER module, SISCI API, ScaIP, ScaMAC
• Below the user/kernel boundary: SCI drivers (Dolphin & Scali), SCI-VM & SciOS/SciFS
• SCI hardware: Dolphin adapter (as ring/switch or torus)

  20. Summary
• SCI has a comprehensive software stack
  • Can be adapted to individual needs
• Joint effort by the whole SCI community
  • Contributions from academia and industry
  • "Documented" in the SCI-Europe conference series
• Focus in this tutorial: the low-level parts
  • A few low-level APIs
• High-level APIs & environments are covered elsewhere
  • MPI tutorial
  • Conference talks

  21. SCI low-level programming: SISCI / SMI / CML+
• The SCI principle
• Typical SCI Software Hierarchy
→ The SISCI API
• The SMI interface
• The CML message engine
• Outlook on the Lab-Session

  22. The SISCI API
• Standard low-level API for SCI-based systems
  • Developed within the SISCI project
  • Implemented by Dolphin ICS (for many platforms)
  • Adaptation layer for Scali systems available
• Goal: provide the full SCI functionality
  • Secure abstraction of user-level communication
  • Based on the IRM (Interconnect Resource Manager)
    • Global resource management
    • Fault tolerance & configuration mechanisms
  • Comprehensive user-level API

  23. Use of the SISCI API
• Provides raw shared memory functionality
  • Based on individual shared memory segments
  • No task and/or connection management
  • High complexity due to its low-level character
• Application area 1: middleware
  • E.g. the base for efficient message passing layers
  • Hides the low-level character of SISCI
• Application area 2: special-purpose applications
  • Directly use the raw performance of SCI

  24. SISCI availability
• SISCI API directly available from Dolphin
  • http://www.dolphinics.com/
• Multi-platform support
  • Windows NT 4.0 / 2000
  • Linux (for 2.2.x & 2.4.x kernels)
  • Solaris 2.5.1/7/8 (Sparc) & 2.6/7 (x86)
  • LynxOS 3.01 (x86 & PPC), VxWorks
• Documentation also online
  • Complete user-library specification
  • Sample codes

  25. Basic SISCI functionality
• Main functionality: basic shared segment management
  • Creation of segments
  • Sharing/identification of segments
  • Mapping of segments
• Support for memory transfers
  • Sequence control for data integrity (sketched below)
  • Optimized SCI memcpy version
• Direct abstraction of the SCI principle
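As a hedged illustration of the sequence control mentioned above, the following sketch checks a remote write for transmission errors and uses the optimized SCIMemCpy. It follows the public SISCI specification, but the header name, flag values (0 is used as "no flags") and the simplified retry loop are assumptions; consult the Dolphin reference manual for the exact signatures:

    #include "sisci_api.h"   /* assumed SISCI header name */

    /* Write 'len' bytes into an already mapped remote segment and verify the
     * transfer with a SISCI sequence. 'remote_map' is assumed to come from
     * SCIMapRemoteSegment() (see the next slide). Error handling is omitted. */
    static void checked_write(sci_map_t remote_map, void *src, unsigned int len)
    {
        sci_sequence_t seq;
        sci_sequence_status_t status;
        sci_error_t err;

        SCICreateMapSequence(remote_map, &seq, 0, &err);

        do {
            SCIStartSequence(seq, 0, &err);

            /* A plain memcpy() into the mapped region would also work;
             * SCIMemCpy is the SCI-optimized variant mentioned on the slide. */
            SCIMemCpy(seq, src, remote_map, 0 /* offset */, len, 0, &err);

            status = SCICheckSequence(seq, 0, &err);
        } while (status != SCI_SEQ_OK);   /* retry on transmission errors */

        SCIRemoveSequence(seq, 0, &err);
    }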

  26. Advanced SISCI functionality
• Other functionality
  • DMA transfers
  • SCI-based interrupts
  • Store barriers
  • Implicit resource management
• Additional library in the SISCI kit
  • sisci_demolib
  • Itself based on SISCI
  • Easier node/adapter identification

  27. Allocating and Mapping of Segments
[Figure: segment life cycle across virtual addresses, physical memory and PCI addresses on two nodes]
• Node A (exporter): SCICreateSegment → SCIPrepareSegment → SCIMapLocalSegment → SCISetSegmentAvailable
• Node B (importer): SCIConnectSegment → SCIMapRemoteSegment
• Data transfer through the mapped addresses (see the sketch below)
• Teardown on B: SCIUnMapSegment → SCIDisconnectSegment
• Teardown on A: SCIUnMapSegment → SCIRemoveSegment
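A hedged sketch of both sides of this life cycle follows. Parameter order follows the public SISCI specification, but the header name, adapter number, segment ID and size are made-up example values, flags are passed as 0, and the descriptor 'sd' is assumed to come from SCIInitialize()/SCIOpen(); check the SISCI reference manual before relying on any of this:

    #include <stdio.h>
    #include <stdlib.h>
    #include "sisci_api.h"            /* assumed SISCI header name */

    #define ADAPTER_NO   0            /* example values, not prescribed anywhere */
    #define SEGMENT_ID   4711
    #define SEGMENT_SIZE 8192

    static void check(sci_error_t err, const char *what)
    {
        if (err != SCI_ERR_OK) {
            fprintf(stderr, "%s failed: 0x%x\n", what, (unsigned)err);
            exit(1);
        }
    }

    /* Node A: create, prepare, map and publish a local segment. */
    static volatile void *export_segment(sci_desc_t sd, sci_local_segment_t *seg,
                                         sci_map_t *map)
    {
        sci_error_t err;
        volatile void *addr;

        SCICreateSegment(sd, seg, SEGMENT_ID, SEGMENT_SIZE,
                         NULL, NULL, 0, &err);             check(err, "SCICreateSegment");
        SCIPrepareSegment(*seg, ADAPTER_NO, 0, &err);      check(err, "SCIPrepareSegment");
        addr = SCIMapLocalSegment(*seg, map, 0, SEGMENT_SIZE,
                                  NULL, 0, &err);          check(err, "SCIMapLocalSegment");
        SCISetSegmentAvailable(*seg, ADAPTER_NO, 0, &err); check(err, "SCISetSegmentAvailable");
        return addr;
    }

    /* Node B: connect to the remote segment and map it. */
    static volatile void *import_segment(sci_desc_t sd, unsigned int remote_node,
                                         sci_remote_segment_t *seg, sci_map_t *map)
    {
        sci_error_t err;
        volatile void *addr;

        SCIConnectSegment(sd, seg, remote_node, SEGMENT_ID, ADAPTER_NO,
                          NULL, NULL, SCI_INFINITE_TIMEOUT, 0, &err);
                                                           check(err, "SCIConnectSegment");
        addr = SCIMapRemoteSegment(*seg, map, 0, SEGMENT_SIZE,
                                   NULL, 0, &err);         check(err, "SCIMapRemoteSegment");
        return addr;                  /* stores to 'addr' now go to node A */
    }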

  28. SISCI Deficits
• SISCI is based on individual segments
  • Individual segments with their own address space
  • Placed on a single node
  • No global virtual memory
• No task/thread control, minimal synchronization
  • Not a full programming environment
  • Has to be rebuilt over and over again
• No cluster-global abstraction
  • Configuration and ID management missing

  29. Summary
• SISCI allows access to raw SCI
  • Direct access to remote memory mappings
• Availability
  • Many platforms
  • All SISCI versions available on Dolphin's website
• Low-level character
  • Best possible performance
  • Suited for middleware development
  • Deficits for high-level programming

  30. SCI low-level programming: SISCI / SMI / CML+
• The SCI principle
• Typical SCI Software Hierarchy
• The SISCI API
→ The SMI interface
• The CML message engine
• Outlook on the Lab-Session

  31. Agenda
• The SMI Library
  • We have SISCI & SMiLE – why SMI?
• SMI Programming Paradigm
• SMI Functionality

  32. Why SMI?
• Higher abstraction level than SISCI:
  • Providing an application environment
  • Single function calls for complex operations
  • Hiding of node & segment IDs
  • Extension of the SISCI functionality
  • Resource management
• Lower abstraction level than SMiLE:
  • Utilization of multiple PCI-SCI adapters
  • Utilization of DMA & interrupts
  • Full control of the memory layout

  33. History of SMI
• Development started in 1996:
  • SBus-SCI adapters in Sun SPARCstation 20
  • No SISCI available
  • Goal: make SCI usage / NUMA programming less painful
• Marcus Dormanns, until end of 1998:
  • API for the creation of parallel applications on shared memory (SMP/NUMA/CC-NUMA) platforms
  • Ph.D. thesis: grid-based parallelization techniques
• Joachim Worringen, since 1998:
  • Extension of SMI as a basis for other libraries or services on SCI SMP clusters

  34. SMI Availability
• Target architectures:
  • SMP systems
  • NUMA systems (SCI clusters)
• Hardware platforms:
  • Intel IA-32, Alpha & Sparc (64 bit)
• Software platforms:
  • Solaris, Linux, Windows NT/2000
• Uses threads, is partly thread-safe
• Static or shared library

  35. SMI Programming Paradigm
[Figure: several processes (P) on multiple nodes attached to shared memory regions (M)]
• Basic model: SPMD
• Independent processes form an application
• Processes share explicitly created shared memory regions
• Multiple processes per node are possible

  36. SMI Functionality
• Set of ~70 API functions, but:
  • Only 3 function calls are needed to create an application with shared memory
• Collective vs. individual functions:
  • Collective: all processes must call to complete
  • Individual: process-local completion
• Some (intended) similarities to MPI
• C/C++ and Fortran 77 bindings
• Shared library for Solaris, Linux, Windows

  37. Initialization/Shutdown
• Initialization: collective call to
  • SMI_Init(int *argc, char ***argv)
  • Passes references to argc and argv to SMI
  • Do not touch argc/argv before SMI_Init()!
• Finalization: collective call to
  • SMI_Finalize()
• Abort: individual call to
  • SMI_Abort(int error)
  • Implicitly frees all resources allocated by SMI

  38. Information Gathering
• Topology information:
  • Number of processes: SMI_Proc_size(int *sz)
  • Local process rank: SMI_Proc_rank(int *rank)
  • Number of nodes: SMI_Node_size(int *sz)
  • Several more topology functions
• System/state information: SMI_Query(smi_query_t q, int arg, void *result)
  • SCI-, SMI- and system-related information
• Timing functions: SMI_Wtime(), SMI_Wticks()

  39. Basic SMI Program

    #include "smi.h"

    int main (int argc, char *argv[])
    {
        int myrank, numprocs;

        SMI_Init (&argc, &argv);
        SMI_Proc_rank (&myrank);
        SMI_Proc_size (&numprocs);

        /* do parallel shared-memory stuff */

        SMI_Finalize();
        return 0;
    }

  40. Setting up SHM Regions
• Creating a shared memory region: SMI_Create_shreg(int type, smi_shreg_info_t *reginfo, int *id, char **addr)
• Shared region information:
  • Size of the shared memory region
  • PCI-SCI adapter to use (if not the default)
• Information specific to some region types:
  • Owner of the region: the memory is local to the owner
  • Custom distribution information
  • Remote segment information

  41. SHM-Type UNDIVIDED
• Basic region type:
  • One process (the owner) exports a segment, all others import it
• FIXED or NON_FIXED, DELAYED possible
• Collective invocation (see the sketch below)
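To make the call sequence concrete, here is a minimal, hedged sketch of creating an UNDIVIDED region. The SMI_Create_shreg signature is taken from the previous slide, but the region-type constant and the smi_shreg_info_t field names (size, owner) are assumptions made purely for illustration and may differ from the real smi.h:

    #include <string.h>
    #include "smi.h"

    /* Create one UNDIVIDED shared region of 1 MB owned by process 0.
     * NOTE: SMI_SHM_UNDIVIDED and the field names below are hypothetical;
     * check smi.h for the real constants and structure layout. */
    int create_undivided_region(char **addr)
    {
        smi_shreg_info_t info;
        int region_id;

        memset(&info, 0, sizeof(info));
        info.size  = 1024 * 1024;   /* hypothetical field: region size             */
        info.owner = 0;             /* hypothetical field: memory lives on rank 0   */

        /* Collective call: every process must invoke it to complete. */
        SMI_Create_shreg(SMI_SHM_UNDIVIDED, &info, &region_id, addr);
        return region_id;
    }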

  42. SHM-Type BLOCKED
• Each process exports one segment
• All segments get concatenated
  • A contiguous region is created
• Only FIXED, not DELAYED
• Collective invocation

  43. SHM-Type LOCAL
• A single process exports a segment
  • No other process is involved
  • Local completion semantics
• The segment is available for connections
• Only NON_FIXED

  44. SHM-Type REMOTE
• A single process imports an existing remote segment
  • No other process is involved
  • Local completion semantics
• Only NON_FIXED
• Remote segment information is needed!

  45. SHM-Type RDMA
• A single process connects to an existing remote segment
  • No other process is involved
  • Local completion semantics
• No mapping, therefore much less overhead
  • The region is not mapped and thus has no address
  • Only accessible via the RDMA functions (SMI_Put, SMI_Get)
• Remote segment information is needed!

  46. Deleting SHM Regions
• Delete a shared memory region: SMI_Free_shreg(int region_id) – usage sketch below
• Effects:
  • Disconnects / un-imports remote segments
  • Destroys / deallocates local segments
• Same completion semantics as for the creation of the region (collective / individual)
• Accessing an address of a region after it has been freed → SIGSEGV
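A brief hedged usage sketch, reusing the hypothetical create_undivided_region() helper from the UNDIVIDED example above:

    #include "smi.h"

    int create_undivided_region(char **addr);  /* hypothetical helper, defined above */

    void region_lifetime_demo(void)
    {
        char *addr;
        int id = create_undivided_region(&addr);

        addr[0] = 1;              /* valid: the region is mapped               */

        SMI_Free_shreg(id);       /* same collective semantics as the creation */
        /* addr[0] = 2;              would now fault with SIGSEGV              */
    }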

  47. Memory Management
• Dynamic allocation of memory from shared regions (for any contiguous region type)
• A region can be used directly or via the SMI memory manager – not both!
• Initialize the memory management unit: SMI_Init_shregMMU(int region_id)
• The memory manager works with a "buddy" technique
  • Fast, but coarse granularity

  48. Memory Allocation
• Individual allocation: SMI_Imalloc(int size, int id, void **addr)
  • Limited validity of addresses!
• Collective allocation: SMI_Cmalloc(int size, int id, char **addr)
• Freeing allocated memory: SMI_Ifree(char *addr), SMI_Cfree(char *addr) – see the sketch below
  • The freeing mode must match the allocation mode!
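A hedged sketch combining SMI_Init_shregMMU with individual and collective allocation. The signatures are those given on the slides; the region id is assumed to come from a previous SMI_Create_shreg() call:

    #include "smi.h"

    /* Run the SMI memory manager on an existing shared region. Once the MMU
     * manages the region, do not also use its base address directly. */
    void allocate_in_region(int region_id)
    {
        void *ibuf;
        char *cbuf;

        SMI_Init_shregMMU(region_id);            /* enable the buddy allocator      */

        SMI_Imalloc(4096, region_id, &ibuf);     /* individual: local completion    */
        SMI_Cmalloc(4096, region_id, &cbuf);     /* collective: all processes call  */

        /* ... use the buffers ... */

        SMI_Ifree((char *)ibuf);                 /* must match the allocation mode  */
        SMI_Cfree(cbuf);
    }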

  49. Memory Transfers
• Memory transfers are possible via plain load/store operations or memcpy()
• Why does SMI provide its own functions to copy memory?
  • Secure: includes sequence check & store barrier
  • Optimized: twice the performance
  • Asynchronous: no CPU utilization
• Synchronous copying: SMI_Memcpy(void *dst, void *src, int len, int flags), as sketched below
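A minimal hedged sketch of the synchronous copy. The signature is the one on the slide; passing 0 as "no special flags" is an assumption, as the real flag constants are defined in smi.h:

    #include "smi.h"

    /* Synchronous copy into a shared region with SMI_Memcpy. 'dst' is assumed
     * to lie in a mapped (possibly remote) SMI region. Compared to a plain
     * memcpy(), this adds the SCI sequence check and store barrier and uses
     * the optimized copy routine. */
    void copy_to_region(void *dst, void *src, int len)
    {
        SMI_Memcpy(dst, src, len, 0);   /* 0 = assumed "no flags" value */
    }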

  50. Asynchronous copying
• For mapped regions:
  • SMI_Imemcpy(dst, src, len, flags, handle)
  • The source region can be any type of memory
  • All sizes and addresses are possible
  • Copy direction is determined by the addresses supplied
• For non-mapped regions (see the sketch below):
  • SMI_Get(void *dst, int src_id, int offset, int len)
  • SMI_Put(int dst_id, int offset, int len, void *src)
  • The source region needs to be SCI or registered memory
  • Size and offset must be aligned to 8 bytes
  • Copy direction is determined by the function name
• Check for completion: SMI_Memwait(), SMI_Memtest()
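A hedged sketch of the RDMA-style transfers to a non-mapped region (SHM type RDMA). The SMI_Put/SMI_Get signatures are those shown above; 'rdma_id' is assumed to come from SMI_Create_shreg() with the RDMA type, and 'buf' is assumed to be SCI or registered memory:

    #include "smi.h"

    /* Write a buffer into a non-mapped remote region and read it back.
     * Sizes and offsets must be multiples of 8. Completion of the
     * asynchronous transfers would be checked with SMI_Memwait() or
     * SMI_Memtest(); their exact parameters are not shown on the slides
     * and are therefore omitted here. */
    void rdma_roundtrip(int rdma_id, char *buf, int len)
    {
        SMI_Put(rdma_id, 0, len, buf);     /* write buf into the remote segment */
        SMI_Get(buf, rdma_id, 0, len);     /* read the same bytes back          */
    }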
