
SCI low-level programming: SISCI / SMI / CML+





  1. SCI low-level programming: SISCI / SMI / CML+
Joachim Worringen, Lehrstuhl für Betriebssysteme, RWTH Aachen
Martin Schulz, Lehrstuhl für Rechnertechnik und Rechnerorganisation, Technische Universität München
SCI Summer School, Oct 1-3, Dublin

  2. Goals of this Tutorial & Lab
• Introduction into low-level programming issues
  • What happens in the SCI hardware
  • What are the special features of SCI
  • How to use it in your own programs
• Present a few available low-level APIs
  • Direct SCI & low-level messaging
  • Discuss example codes
• Enable you to use these APIs in your codes
• Enable you to exploit the power of SCI

  3. Why bother with low-level issues?
• Directly leveraging SCI technology
  • More than just "yet another MPI" machine
  • Use distinct advantages of SCI
  • Can be beneficial in many applications
• Learn about internals
  • Very important to achieve high performance
  • Complex interactions between hardware & software need to be considered
• And of course: it is much more fun!

  4. What you will learn
• Overview of the SCI software stack
  • What is available & where it belongs
• Details about the principle behind SCI
  • How the hardware works
  • How to exploit SCI's special features
• Basics of three common low-level APIs
  • SISCI – THE standard API for SCI systems
  • SMI – more comfortable setup
  • CML+ – low-level messaging

  5. What you will not learn here
• Performance tuning
  • However, low-level knowledge is key for this
• Exact details about the implementation
  • Would go too far for this session
  • If interested, ask during the breaks
• Concrete API definitions
  • Some of this will be covered in the lab session
  • Only the parts necessary for the lab
  • Consult the manuals for more details

  6. Overview
• Why bother with low-level programming?
• The SCI principle
• Typical Software Hierarchy
• The SISCI Interface
• The SMI Interface
• The CML message engine
• Outlook on the Lab-Session

  7. SCI low-level programming: SISCI / SMI / CML+
→ The SCI principle
• Typical SCI Software Hierarchy
• The SISCI API
• The SMI interface
• The CML message engine
• Outlook on the Lab-Session

  8. The Scalable Coherent Interface
• IEEE Standard 1596 (since 1992)
  • Initiated during the FutureBus+ specification
  • Overcome limitations of bus architectures
  • Maintain bus-like services
• Hierarchical ringlets with up to 65536 nodes
• Packet-based, split-transaction protocol
• Remote memory access capabilities
  • Optional cache coherence
  • Synchronization operations
• Connection to the host system not standardized

  9. SCI-based Architectures
• Interconnection fabric for CC-NUMA machines
  • Utilization of the cache coherence protocol
  • NUMA-Liine (Data General/?), NUMA-Q (IBM)
• I/O extension
  • PCI-to-PCI bridges
• Cluster interconnection technology
  • PCI-SCI bridges (from Dolphin ICS)
  • Comparable to other SANs like Myrinet
  • Enables user-level communication

  10. User-level communication
[Figure: traditional protocol stack (applications, e.g. sockets, TCP/UDP/IP stack, Ethernet driver, NIC) compared with user-level communication, where setup routines go through the OS device driver once ("1x setup") and communication goes directly from the user library to the NIC hardware]
• Only the setup goes via the kernel; communication is direct
• Advantages
  • No OS overhead
  • No protocols
  • Direct HW utilization
• Typical performance
  • Latency < 10 µs
  • Bandwidth > 80 MB/s (for 32-bit, 33 MHz PCI)

  11. SCI-based PC clusters
[Figure: PCs with PCI-SCI adapters, connected by the SCI interconnect, forming a global address space]
• Bus-like services
  • Read, write, and synchronization transactions
• Hardware-supported DSM
  • Global address space to access any physical memory

  12. User-level communication
[Figure: communication over an SCI ringlet – physical memory is exported into the SCI physical address space and mapped into the virtual memory of a remote process]
• Setup of communication
  • Export of physical memory into the SCI physical address space
  • Mapping into virtual memory
• Read/write to the mapped segment
  • NUMA characteristics
• High performance
  • No protocol overhead
  • No OS influence
  • Latency: < 2.0 µs
  • Bandwidth: > 300 MB/s

  13. SCI remote memory mappings
[Figure: two-step address translation – on node A, a process's virtual address space is mapped by the CPU/MMU onto physical memory and the PCI address range; the ATTs of the SCI bridge map PCI addresses into the SCI physical address space, and the bridge on node B translates them back into its own PCI and physical addresses, where node B's MMU maps them into a process's virtual address space]

  14. Address Translation Table (ATT)
• Present on each SCI adapter
• Controls outbound communication
• 4096 entries available
  • Each entry controls between 4 KB and 512 KB
  • 256 MB total mapped memory → 64 KB per entry
• Many parameters per ATT entry, e.g.
  • Prefetching & buffering
  • Enable atomic fetch & inc
  • Control ordering

  15. Some more details
• Inbound communication
  • No remapping: the offset part of the SCI physical address = physical address
  • Basic protection mechanism using an access window
• Outstanding transactions
  • Up to 16 write streams
  • Gather write operations to consecutive addresses
• Read/write ordering (optional)
  • Complete write transactions before a read transaction

  16. Summary
• Key feature of SCI: hardware DSM
  • Allows efficient communication
  • No OS involvement
  • No protocol overhead
• Implemented using remote memory maps (illustrated below)
  • Access to remote memory regions
  • Based on physical addresses
• Mapping is done in a two-step process
  • ATTs control the mapping across nodes
  • MMUs control the mapping within a node
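To make the two-step mapping concrete, the sketch below (not from the slides; the pointer is assumed to come from one of the mapping APIs shown later) illustrates that, once a remote segment is mapped into the local virtual address space, communication is just an ordinary CPU load or store:

    #include <stdint.h>

    /* 'remote' is assumed to point into a remote SCI segment that has already
     * been mapped into this process's virtual address space (e.g. via SISCI's
     * SCIMapRemoteSegment, see below). The store is translated by the local
     * MMU to a PCI address, forwarded by the adapter's ATT onto the SCI
     * ringlet, and written into the remote node's physical memory - without
     * any OS involvement. 'volatile' keeps the compiler from optimizing the
     * accesses away or reordering them. */
    static void ping(volatile uint32_t *remote)
    {
        remote[0] = 42;          /* remote write: travels over SCI            */
        uint32_t v = remote[1];  /* remote read: fetches from the remote node */
        (void)v;
    }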

  17. SCI low-level programming: SISCI / SMI / CML+
• The SCI principle
→ Typical SCI Software Hierarchy
• The SISCI API
• The SMI interface
• The CML message engine
• Outlook on the Lab-Session

  18. Software for SCI
• Very diverse software environment available
  • Both major parallel programming paradigms: Shared Memory & Message Passing
  • Most standard APIs and environments
  • All levels of abstraction included
  • Many operating systems involved
• Problems
  • Many sources (industry and academia)
  • Compatibility and completeness
  • Accessibility

  19. SCI software stack (incomplete)
[Figure: layered SCI software stack, from target applications / test suites down to the hardware]
• High-level software: Sockets, Cray shmem, ScaMPI, SCI-MPICH, PVM, distributed JVM, OpenMP, distributed threads, TCP/UDP/IP stack
• Low-level software: ScaFun, SMI, CML+, HAMSTER module, SISCI API, ScaIP, ScaMAC
• Below the user/kernel boundary: SCI drivers (Dolphin & Scali), SCI-VM & SciOS/SciFS
• SCI hardware: Dolphin adapter (as ring/switch or torus)

  20. Summary
• SCI has a comprehensive software stack
  • Can be adapted to individual needs
• Joint effort by the whole SCI community
  • Contributions from academia and industry
  • "Documented" in the SCI-Europe conference series
• Focus in this tutorial: the low-level parts
  • A few low-level APIs
• High-level APIs & environments are covered elsewhere
  • MPI tutorial
  • Conference talks

  21. SCI low-level programming: SISCI / SMI / CML+
• The SCI principle
• Typical SCI Software Hierarchy
→ The SISCI API
• The SMI interface
• The CML message engine
• Outlook on the Lab-Session

  22. The SISCI API
• Standard low-level API for SCI-based systems
  • Developed within the SISCI project
  • Implemented by Dolphin ICS (for many platforms)
  • Adaptation layer for Scali systems available
• Goal: provide the full SCI functionality
  • Secure abstraction of user-level communication
  • Based on the IRM (Interconnect Resource Manager)
    • Global resource management
    • Fault tolerance & configuration mechanisms
  • Comprehensive user-level API

  23. Use of the SISCI API
• Provides raw shared memory functionality
  • Based on individual shared memory segments
  • No task and/or connection management
  • High complexity due to its low-level character
• Application area 1: middleware
  • E.g. the base for efficient message passing layers
  • Hides the low-level character of SISCI
• Application area 2: special-purpose applications
  • Directly use the raw performance of SCI

  24. SISCI availability
• SISCI API directly available from Dolphin
  • http://www.dolphinics.com/
• Multi-platform support
  • Windows NT 4.0 / 2000
  • Linux (for 2.2.x & 2.4.x kernels)
  • Solaris 2.5.1/7/8 (Sparc) & 2.6/7 (x86)
  • LynxOS 3.01 (x86 & PPC), VxWorks
• Documentation also online
  • Complete user-library specification
  • Sample codes

  25. Basic SISCI functionality
• Main functionality: basic shared segment management
  • Creation of segments
  • Sharing/identification of segments
  • Mapping of segments
• Support for memory transfers
  • Sequence control for data integrity (sketched below)
  • Optimized SCI memcpy version
• Direct abstraction of the SCI principle
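As a hedged illustration of the sequence control mentioned above, the following sketch checks a remote write for transmission errors and uses the optimized SCIMemCpy. It follows the public SISCI specification, but the header name, flag values (0 is used as "no flags") and the simplified retry loop are assumptions; consult the Dolphin reference manual for the exact signatures:

    #include "sisci_api.h"   /* assumed SISCI header name */

    /* Write 'len' bytes into an already mapped remote segment and verify the
     * transfer with a SISCI sequence. 'remote_map' is assumed to come from
     * SCIMapRemoteSegment() (see the next slide). Error handling is omitted. */
    static void checked_write(sci_map_t remote_map, void *src, unsigned int len)
    {
        sci_sequence_t seq;
        sci_sequence_status_t status;
        sci_error_t err;

        SCICreateMapSequence(remote_map, &seq, 0, &err);

        do {
            SCIStartSequence(seq, 0, &err);

            /* A plain memcpy() into the mapped region would also work;
             * SCIMemCpy is the SCI-optimized variant mentioned on the slide. */
            SCIMemCpy(seq, src, remote_map, 0 /* offset */, len, 0, &err);

            status = SCICheckSequence(seq, 0, &err);
        } while (status != SCI_SEQ_OK);   /* retry on transmission errors */

        SCIRemoveSequence(seq, 0, &err);
    }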

  26. Advanced SISCI functionality
• Other functionality
  • DMA transfers
  • SCI-based interrupts
  • Store barriers
  • Implicit resource management
• Additional library in the SISCI kit
  • sisci_demolib
  • Itself based on SISCI
  • Easier node/adapter identification

  27. Allocating and Mapping of Segments
[Figure: segment life cycle across virtual addresses, physical memory and PCI addresses on two nodes]
• Node A (exporter): SCICreateSegment → SCIPrepareSegment → SCIMapLocalSegment → SCISetSegmentAvailable
• Node B (importer): SCIConnectSegment → SCIMapRemoteSegment
• Data transfer through the mapped addresses (see the sketch below)
• Teardown on B: SCIUnMapSegment → SCIDisconnectSegment
• Teardown on A: SCIUnMapSegment → SCIRemoveSegment
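A hedged sketch of both sides of this life cycle follows. Parameter order follows the public SISCI specification, but the header name, adapter number, segment ID and size are made-up example values, flags are passed as 0, and the descriptor 'sd' is assumed to come from SCIInitialize()/SCIOpen(); check the SISCI reference manual before relying on any of this:

    #include <stdio.h>
    #include <stdlib.h>
    #include "sisci_api.h"            /* assumed SISCI header name */

    #define ADAPTER_NO   0            /* example values, not prescribed anywhere */
    #define SEGMENT_ID   4711
    #define SEGMENT_SIZE 8192

    static void check(sci_error_t err, const char *what)
    {
        if (err != SCI_ERR_OK) {
            fprintf(stderr, "%s failed: 0x%x\n", what, (unsigned)err);
            exit(1);
        }
    }

    /* Node A: create, prepare, map and publish a local segment. */
    static volatile void *export_segment(sci_desc_t sd, sci_local_segment_t *seg,
                                         sci_map_t *map)
    {
        sci_error_t err;
        volatile void *addr;

        SCICreateSegment(sd, seg, SEGMENT_ID, SEGMENT_SIZE,
                         NULL, NULL, 0, &err);             check(err, "SCICreateSegment");
        SCIPrepareSegment(*seg, ADAPTER_NO, 0, &err);      check(err, "SCIPrepareSegment");
        addr = SCIMapLocalSegment(*seg, map, 0, SEGMENT_SIZE,
                                  NULL, 0, &err);          check(err, "SCIMapLocalSegment");
        SCISetSegmentAvailable(*seg, ADAPTER_NO, 0, &err); check(err, "SCISetSegmentAvailable");
        return addr;
    }

    /* Node B: connect to the remote segment and map it. */
    static volatile void *import_segment(sci_desc_t sd, unsigned int remote_node,
                                         sci_remote_segment_t *seg, sci_map_t *map)
    {
        sci_error_t err;
        volatile void *addr;

        SCIConnectSegment(sd, seg, remote_node, SEGMENT_ID, ADAPTER_NO,
                          NULL, NULL, SCI_INFINITE_TIMEOUT, 0, &err);
                                                           check(err, "SCIConnectSegment");
        addr = SCIMapRemoteSegment(*seg, map, 0, SEGMENT_SIZE,
                                   NULL, 0, &err);         check(err, "SCIMapRemoteSegment");
        return addr;                  /* stores to 'addr' now go to node A */
    }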

  28. SISCI Deficits
• SISCI is based on individual segments
  • Individual segments with their own address space
  • Placed on a single node
  • No global virtual memory
• No task/thread control, minimal synchronization
  • Not a full programming environment
  • Has to be rebuilt over and over again
• No cluster-global abstraction
  • Configuration and ID management missing

  29. Summary
• SISCI allows access to raw SCI
  • Direct access to remote memory mappings
• Availability
  • Many platforms
  • All SISCI versions available on Dolphin's website
• Low-level character
  • Best possible performance
  • Suited for middleware development
  • Deficits for high-level programming

  30. SCI low-level programming: SISCI / SMI / CML+
• The SCI principle
• Typical SCI Software Hierarchy
• The SISCI API
→ The SMI interface
• The CML message engine
• Outlook on the Lab-Session

  31. Agenda
• The SMI Library
  • We have SISCI & SMiLE – why SMI?
• SMI Programming Paradigm
• SMI Functionality

  32. Why SMI?
• Higher abstraction level than SISCI:
  • Providing an application environment
  • Single function calls for complex operations
  • Hiding of node & segment IDs
  • Extension of the SISCI functionality
  • Resource management
• Lower abstraction level than SMiLE:
  • Utilization of multiple PCI-SCI adapters
  • Utilization of DMA & interrupts
  • Full control of the memory layout

  33. History of SMI
• Development started in 1996:
  • SBus-SCI adapters in Sun SPARCstation 20
  • No SISCI available
  • Goal: make SCI usage / NUMA programming less painful
• Marcus Dormanns, until end of 1998:
  • API for the creation of parallel applications on shared memory (SMP/NUMA/CC-NUMA) platforms
  • Ph.D. thesis: grid-based parallelization techniques
• Joachim Worringen, since 1998:
  • Extension of SMI as a basis for other libraries or services on SCI SMP clusters

  34. SMI Availability
• Target architectures:
  • SMP systems
  • NUMA systems (SCI clusters)
• Hardware platforms:
  • Intel IA-32, Alpha & Sparc (64 bit)
• Software platforms:
  • Solaris, Linux, Windows NT/2000
• Uses threads, is partly thread-safe
• Static or shared library

  35. SMI Programming Paradigm
[Figure: several processes (P) on multiple nodes attached to shared memory regions (M)]
• Basic model: SPMD
• Independent processes form an application
• Processes share explicitly created shared memory regions
• Multiple processes per node are possible

  36. SMI Functionality
• Set of ~70 API functions, but:
  • Only 3 function calls are needed to create an application with shared memory
• Collective vs. individual functions:
  • Collective: all processes must call to complete
  • Individual: process-local completion
• Some (intended) similarities to MPI
• C/C++ and Fortran 77 bindings
• Shared library for Solaris, Linux, Windows

  37. Initialization/Shutdown
• Initialization: collective call to
  • SMI_Init(int *argc, char ***argv)
  • Passes references to argc and argv to SMI
  • Do not touch argc/argv before SMI_Init()!
• Finalization: collective call to
  • SMI_Finalize()
• Abort: individual call to
  • SMI_Abort(int error)
  • Implicitly frees all resources allocated by SMI

  38. Information Gathering
• Topology information:
  • Number of processes: SMI_Proc_size(int *sz)
  • Local process rank: SMI_Proc_rank(int *rank)
  • Number of nodes: SMI_Node_size(int *sz)
  • Several more topology functions
• System/state information: SMI_Query(smi_query_t q, int arg, void *result)
  • SCI-, SMI- and system-related information
• Timing functions: SMI_Wtime(), SMI_Wticks()

  39. Basic SMI Program

    #include "smi.h"

    int main (int argc, char *argv[])
    {
        int myrank, numprocs;

        SMI_Init (&argc, &argv);
        SMI_Proc_rank (&myrank);
        SMI_Proc_size (&numprocs);

        /* do parallel shared-memory stuff */

        SMI_Finalize();
        return 0;
    }

  40. Setting up SHM Regions
• Creating a shared memory region: SMI_Create_shreg(int type, smi_shreg_info_t *reginfo, int *id, char **addr)
• Shared region information:
  • Size of the shared memory region
  • PCI-SCI adapter to use (if not the default)
• Information specific to some region types:
  • Owner of the region: the memory is local to the owner
  • Custom distribution information
  • Remote segment information

  41. SHM-Type UNDIVIDED
• Basic region type:
  • One process (the owner) exports a segment, all others import it
• FIXED or NON_FIXED, DELAYED possible
• Collective invocation (see the sketch below)
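To make the call sequence concrete, here is a minimal, hedged sketch of creating an UNDIVIDED region. The SMI_Create_shreg signature is taken from the previous slide, but the region-type constant and the smi_shreg_info_t field names (size, owner) are assumptions made purely for illustration and may differ from the real smi.h:

    #include <string.h>
    #include "smi.h"

    /* Create one UNDIVIDED shared region of 1 MB owned by process 0.
     * NOTE: SMI_SHM_UNDIVIDED and the field names below are hypothetical;
     * check smi.h for the real constants and structure layout. */
    int create_undivided_region(char **addr)
    {
        smi_shreg_info_t info;
        int region_id;

        memset(&info, 0, sizeof(info));
        info.size  = 1024 * 1024;   /* hypothetical field: region size             */
        info.owner = 0;             /* hypothetical field: memory lives on rank 0   */

        /* Collective call: every process must invoke it to complete. */
        SMI_Create_shreg(SMI_SHM_UNDIVIDED, &info, &region_id, addr);
        return region_id;
    }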

  42. SHM-Type BLOCKED
• Each process exports one segment
• All segments get concatenated
  • A contiguous region is created
• Only FIXED, not DELAYED
• Collective invocation

  43. SHM-Type LOCAL
• A single process exports a segment
  • No other process is involved
  • Local completion semantics
• The segment is available for connections
• Only NON_FIXED

  44. SHM-Type REMOTE
• A single process imports an existing remote segment
  • No other process is involved
  • Local completion semantics
• Only NON_FIXED
• Remote segment information is needed!

  45. SHM-Type RDMA
• A single process connects to an existing remote segment
  • No other process is involved
  • Local completion semantics
• No mapping, therefore much less overhead
  • The region is not mapped and thus has no address
  • Only accessible via the RDMA functions (SMI_Put, SMI_Get)
• Remote segment information is needed!

  46. Deleting SHM Regions
• Delete a shared memory region: SMI_Free_shreg(int region_id) – usage sketch below
• Effects:
  • Disconnects / un-imports remote segments
  • Destroys / deallocates local segments
• Same completion semantics as for the creation of the region (collective / individual)
• Accessing an address of a region after it has been freed → SIGSEGV
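A brief hedged usage sketch, reusing the hypothetical create_undivided_region() helper from the UNDIVIDED example above:

    #include "smi.h"

    int create_undivided_region(char **addr);  /* hypothetical helper, defined above */

    void region_lifetime_demo(void)
    {
        char *addr;
        int id = create_undivided_region(&addr);

        addr[0] = 1;              /* valid: the region is mapped               */

        SMI_Free_shreg(id);       /* same collective semantics as the creation */
        /* addr[0] = 2;              would now fault with SIGSEGV              */
    }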

  47. Memory Management
• Dynamic allocation of memory from shared regions (for any contiguous region type)
• A region can be used directly or via the SMI memory manager – not both!
• Initialize the memory management unit: SMI_Init_shregMMU(int region_id)
• The memory manager works with a "buddy" technique
  • Fast, but coarse granularity

  48. Memory Allocation
• Individual allocation: SMI_Imalloc(int size, int id, void **addr)
  • Limited validity of addresses!
• Collective allocation: SMI_Cmalloc(int size, int id, char **addr)
• Freeing allocated memory: SMI_Ifree(char *addr), SMI_Cfree(char *addr) – see the sketch below
  • The freeing mode must match the allocation mode!
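A hedged sketch combining SMI_Init_shregMMU with individual and collective allocation. The signatures are those given on the slides; the region id is assumed to come from a previous SMI_Create_shreg() call:

    #include "smi.h"

    /* Run the SMI memory manager on an existing shared region. Once the MMU
     * manages the region, do not also use its base address directly. */
    void allocate_in_region(int region_id)
    {
        void *ibuf;
        char *cbuf;

        SMI_Init_shregMMU(region_id);            /* enable the buddy allocator      */

        SMI_Imalloc(4096, region_id, &ibuf);     /* individual: local completion    */
        SMI_Cmalloc(4096, region_id, &cbuf);     /* collective: all processes call  */

        /* ... use the buffers ... */

        SMI_Ifree((char *)ibuf);                 /* must match the allocation mode  */
        SMI_Cfree(cbuf);
    }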

  49. Memory Transfers
• Memory transfers are possible via plain load/store operations or memcpy()
• Why does SMI provide its own functions to copy memory?
  • Secure: includes sequence check & store barrier
  • Optimized: twice the performance
  • Asynchronous: no CPU utilization
• Synchronous copying: SMI_Memcpy(void *dst, void *src, int len, int flags), as sketched below
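A minimal hedged sketch of the synchronous copy. The signature is the one on the slide; passing 0 as "no special flags" is an assumption, as the real flag constants are defined in smi.h:

    #include "smi.h"

    /* Synchronous copy into a shared region with SMI_Memcpy. 'dst' is assumed
     * to lie in a mapped (possibly remote) SMI region. Compared to a plain
     * memcpy(), this adds the SCI sequence check and store barrier and uses
     * the optimized copy routine. */
    void copy_to_region(void *dst, void *src, int len)
    {
        SMI_Memcpy(dst, src, len, 0);   /* 0 = assumed "no flags" value */
    }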

  50. Asynchronous copying
• For mapped regions:
  • SMI_Imemcpy(dst, src, len, flags, handle)
  • The source region can be any type of memory
  • All sizes and addresses are possible
  • Copy direction is determined by the addresses supplied
• For non-mapped regions (see the sketch below):
  • SMI_Get(void *dst, int src_id, int offset, int len)
  • SMI_Put(int dst_id, int offset, int len, void *src)
  • The source region needs to be SCI or registered memory
  • Size and offset must be aligned to 8 bytes
  • Copy direction is determined by the function name
• Check for completion: SMI_Memwait(), SMI_Memtest()
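A hedged sketch of the RDMA-style transfers to a non-mapped region (SHM type RDMA). The SMI_Put/SMI_Get signatures are those shown above; 'rdma_id' is assumed to come from SMI_Create_shreg() with the RDMA type, and 'buf' is assumed to be SCI or registered memory:

    #include "smi.h"

    /* Write a buffer into a non-mapped remote region and read it back.
     * Sizes and offsets must be multiples of 8. Completion of the
     * asynchronous transfers would be checked with SMI_Memwait() or
     * SMI_Memtest(); their exact parameters are not shown on the slides
     * and are therefore omitted here. */
    void rdma_roundtrip(int rdma_id, char *buf, int len)
    {
        SMI_Put(rdma_id, 0, len, buf);     /* write buf into the remote segment */
        SMI_Get(buf, rdma_id, 0, len);     /* read the same bytes back          */
    }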
