SMiLE Shared Memory Programming
Wolfgang Karl & Martin Schulz
Lehrstuhl für Rechnertechnik und Rechnerorganisation, Technische Universität München
1st SCI Summer School, October 2nd – 4th, 2000, Trinity College Dublin
SMiLE software layers [diagram]: the layer stack from the target applications / test suites down to the SCI hardware. True shared memory programming goes through the high-level SMiLE layers (NT protocol stack, SISAL on MuSE, SISCI PVM, SPMD-style model, TreadMarks-compatible API) and the low-level SMiLE layers (AM 2.0, SS-lib, CML, SCI-VM lib, SCI messaging); raw SCI programming sits directly on the components below the user/kernel boundary (NDIS driver, SCI drivers & SISCI API, SCI-VM), all on top of the SCI hardware: SMiLE & Dolphin adapters and the HW monitor.
Overview • SCI hardware DSM principle • Using raw SCI via SISCI • SCI-VM: A global virtual memory for SCI • Shared Memory Programming Models • A sample model: SPMD • Outlook on the Lab-Session
SMiLE Shared Memory Programming SCI hardware DSM principle Using raw SCI via SISCI SCI-VM: A global virtual memory for SCI Shared Memory Programming Models A sample model: SPMD Outlook on the Lab-Session
SCI-based PC clusters [diagram]: PCs with PCI-SCI adapters on an SCI interconnect forming a global address space • Bus-like services • Read-, write-, and synchronization transactions • Hardware-supported DSM • Global address space to access any physical memory
User-level communication • Read/write to mapped segments • NUMA characteristics • High performance: no protocol overhead, no OS influence; latency < 2.0 µs, bandwidth > 80 MB/s • Setup of communication [diagram: SCI ringlet]: export of physical memory into the SCI physical address space, then mapping into virtual memory
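To make this concrete, here is a minimal sketch (not taken from the SMiLE code base) of what user-level communication looks like once a remote segment has been mapped into the local virtual address space: data exchange is a plain load or store through a pointer, with no system call or protocol stack involved. The pointer is assumed to come from the mapping steps sketched in the SISCI part below.

#include <stdint.h>

/* `remote' is assumed to point into a mapped remote SCI segment. */
static void put_value(volatile uint32_t *remote, uint32_t value)
{
    remote[0] = value;        /* becomes an SCI write transaction */
}

static uint32_t get_value(volatile const uint32_t *remote)
{
    return remote[0];         /* becomes an SCI read transaction */
}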
SCI remote memory mappings [diagram]: a process's virtual address space on node A is translated by the CPU/MMU to physical memory and PCI addresses on node A; the ATTs in the SCI bridge translate these into the SCI physical address space and from there to PCI addresses and physical memory on node B, and vice versa.
Address Translation Table (ATT) • Present on each SCI adapter • Controls outbound communication • 4096 entries available • Each entry maps between 4 KB and 512 KB, e.g. 64 KB per entry for 256 MB of total mapped memory • Many parameters per ATT entry, e.g. • Prefetching & buffering • Enable atomic fetch & increment • Control ordering
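As a small worked example of the numbers above (an illustration of the arithmetic, not a driver interface): the chosen ATT entry size determines how much remote memory one adapter can map, since the 4096 entries together cover entries × entry size.

#include <stdio.h>

#define ATT_ENTRIES 4096UL      /* entries available per adapter */

int main(void)
{
    unsigned long entry_kb[] = { 4, 64, 512 };   /* legal entry sizes: 4 KB .. 512 KB */
    for (int i = 0; i < 3; i++) {
        unsigned long total_mb = ATT_ENTRIES * entry_kb[i] / 1024;
        printf("entry size %3lu KB -> %4lu MB mappable\n", entry_kb[i], total_mb);
    }
    return 0;   /* 64 KB per entry yields the 256 MB quoted on the slide */
}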
Some more details • Inbound communication • No remapping: offset part of the SCI physical address = physical address • Basic protection mechanism using an access window • Outstanding transactions • Up to 16 write streams • Gather write operations to consecutive addresses • Read/write ordering • Complete write transactions before a read transaction is issued
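The write gathering and ordering rules above are what make the following producer pattern work. This is an illustrative sketch (the pointers are assumed to reference a mapped remote segment, error checking is omitted), not code from the SMiLE libraries: reading any remote location after a burst of stores forces the outstanding write streams to complete before the flag is raised.

#include <stddef.h>
#include <stdint.h>

/* `data' and `flag' are assumed to point into a mapped remote segment. */
static void publish(volatile uint32_t *data, size_t n, volatile uint32_t *flag)
{
    for (size_t i = 0; i < n; i++)
        data[i] = (uint32_t)i;   /* stores are gathered into the write streams */

    (void)data[n - 1];           /* read transaction: outstanding writes complete first */
    *flag = 1;                   /* only now is it safe to signal the consumer */
}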
SMiLE Shared Memory Programming SCI hardware DSM principle Using raw SCI via SISCI SCI-VM: A global virtual memory for SCI Shared Memory Programming Models A sample model: SPMD Outlook on the Lab-Session
The SISCI API • Standard low-level API for SCI-based systems • Developed within the SISCI project • Standard Software Infrastructure for SCI • Implemented by Dolphin ICS (for many platforms) • Goal: provide the full SCI functionality • Secure abstraction of user-level communication • Based on the IRM (Interconnect Resource Manager) • Global resource management • Fault tolerance & configuration mechanisms • Comprehensive user-level API
SISCI in the overall infrastructure [diagram]: the SMiLE software layer picture from before, with the SISCI API / library highlighted at the user/kernel boundary between the low-level SMiLE layers (AM 2.0, SS-lib, CML, SCI-VM lib, SCI messaging) and the SCI drivers, on top of the SCI hardware.
Use of SISCI API • Provides raw shared memory functionality • Based on individual shared memory segments • No task and/or connection management • High complexity due to its low-level character • Application area 1: middleware • E.g. base for efficient message passing layers • Hides the low-level character of SISCI • Application area 2: special-purpose applications • Directly use the raw performance of SCI
SISCI availability • SISCI API directly available from Dolphin • http://www.dolphinics.no/ • Multi-platform support • Windows NT 4.0 / 2000 • Linux (currently for 2.2.x kernels) • Solaris 2.5.1/7/8 (SPARC) & 2.6/7 (x86) • LynxOS 3.01 (x86 & PPC), VxWorks • Tru64 (first version available) • Documentation also available on-line • Complete user-library specification • Sample codes
Basic SISCI functionality • Main functionality: basic shared segment management • Creation of segments • Sharing/identification of segments • Mapping of segments • Support for memory transfer • Sequence control for data integrity • Optimized SCI memcpy version • Direct abstraction of the SCI principle
Advanced SISCI functionality • Other functionality • DMA transfers • SCI-based interrupts • Store-barriers • Implicit resource management • Additional library in SISCI kit • sisci_demolib • Itself based on SISCI • Easier Node/Adapter identification
Allocating and mapping of segments [diagram]: on node A the segment is taken from virtual address to physical memory to PCI address via SCICreateSegment, SCIPrepareSegment, SCIMapLocalSegment, and SCISetSegmentAvailable; node B then imports it with SCIConnectSegment and SCIMapRemoteSegment and performs the data transfer through its own virtual address space. Teardown: SCIUnMapSegment and SCIDisconnectSegment on node B, SCIUnMapSegment and SCIRemoveSegment on node A.
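The call sequence in the figure, written out as a C sketch. Both the node A and node B steps appear in one program for brevity, error handling is omitted, and the parameter lists are reproduced from memory of the SISCI specification, so the exact signatures, flag values, and constants (NO_CALLBACK, SCI_INFINITE_TIMEOUT) should be checked against Dolphin's on-line documentation; SEGMENT_ID, SEGMENT_SIZE, ADAPTER_NO, and REMOTE_NODE_ID are values chosen by the application.

#include "sisci_api.h"

#define SEGMENT_ID     1234          /* application-chosen identifier */
#define SEGMENT_SIZE   (64 * 1024)
#define ADAPTER_NO     0
#define REMOTE_NODE_ID 4             /* SCI node id of the exporting node */

int main(void)
{
    sci_error_t          err;
    sci_desc_t           sd;
    sci_local_segment_t  localSeg;
    sci_remote_segment_t remoteSeg;
    sci_map_t            localMap, remoteMap;
    volatile void       *localAddr, *remoteAddr;

    SCIInitialize(0, &err);
    SCIOpen(&sd, 0, &err);

    /* Node A: create, prepare, map, and export a local segment */
    SCICreateSegment(sd, &localSeg, SEGMENT_ID, SEGMENT_SIZE,
                     NO_CALLBACK, NULL, 0, &err);
    SCIPrepareSegment(localSeg, ADAPTER_NO, 0, &err);
    localAddr = SCIMapLocalSegment(localSeg, &localMap, 0, SEGMENT_SIZE,
                                   NULL, 0, &err);
    SCISetSegmentAvailable(localSeg, ADAPTER_NO, 0, &err);

    /* Node B: connect to the exported segment and map it for data transfer */
    SCIConnectSegment(sd, &remoteSeg, REMOTE_NODE_ID, SEGMENT_ID, ADAPTER_NO,
                      NO_CALLBACK, NULL, SCI_INFINITE_TIMEOUT, 0, &err);
    remoteAddr = SCIMapRemoteSegment(remoteSeg, &remoteMap, 0, SEGMENT_SIZE,
                                     NULL, 0, &err);

    /* ... read/write through remoteAddr and localAddr ... */

    /* Teardown: node B side first, then node A side */
    SCIUnmapSegment(remoteMap, 0, &err);
    SCIDisconnectSegment(remoteSeg, 0, &err);
    SCIUnmapSegment(localMap, 0, &err);
    SCIRemoveSegment(localSeg, 0, &err);
    SCIClose(sd, 0, &err);
    SCITerminate();
    return 0;
}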
SISCI deficits • SISCI is based on individual segments • Each segment has its own address space • Placed on a single node • No global virtual memory • No task/thread control, minimal synchronization • No full programming environment • Has to be rebuilt over and over again • No cluster-global abstraction • Configuration and ID management missing
SMiLE Shared Memory Programming SCI hardware DSM principle Using raw SCI via SISCI SCI-VM: Global virtual memory for SCI Shared Memory Programming Models A sample model: SPMD Outlook on the Lab-Session
SCI support for shared memory • SCI provides inter-process shared memory • Available through SISCI • Shared memory support not sufficient for true shared memory programming • No support for virtual memory • Each segment placed on a single node only • Intra-process shared memory required • Based on remote memory operations • Software DSM support necessary
Shared memory for clusters • Goal: SCI Virtual Memory • Flexible, general global memory abstraction • One global virtual address space across nodes • Total transparency for the user • Direct utilization of SCI HW-DSM • Hybrid DSM based system • HW-DSM for implicit communication • SW-DSM component for system management • Platform: Linux & Windows NT
SCI-VM design • Global process abstraction • A team on each node hosts the threads • Each team needs an identical virtual memory view • Virtual memory distributed at page granularity • Map local pages conventionally via the MMU • Map remote pages with the help of SCI • Extension of the virtual memory concept • Use memory resources across nodes transparently • Modified VM manager necessary
Design of the SCI-VM [diagram]: global process abstraction with SCI Virtual Memory. A team of threads on node A and a team on node B share one virtual address space (pages 1–8); each page is backed by physical memory on exactly one node, and remote pages are reached through the PCI addresses and the SCI physical address space.
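To illustrate the page-granularity distribution, the sketch below uses a simple round-robin placement; this is an illustrative policy only, not the actual SCI-VM placement code, which also allows locality-driven placement (see the optimization discussion later). A page whose home is the local node is mapped conventionally through the MMU; any other page is mapped through the PCI-SCI adapter into the owner's physical memory, so all threads see one identical address space.

#include <stdint.h>

#define PAGE_SIZE 4096u   /* x86 page size used for the mappings */

/* Illustrative round-robin placement: which node's physical memory backs
 * the page containing `vaddr'? (Not the actual SCI-VM policy.) */
static unsigned page_home_node(uintptr_t vaddr, unsigned node_count)
{
    return (unsigned)((vaddr / PAGE_SIZE) % node_count);
}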
Implementation challenges • Integration into virtual memory management • Virtual memory mappings at page granularity • Utilization of paged memory • Integration into the SCI driver stack • Enable fast, fine-grain mappings of remote pages • Caching of remote memory • SCI over PCI does not support cache consistency • Caching necessary for read performance • Solution: caching & relaxed consistency • Very similar to LRC (lazy release consistency) mechanisms
Implementation design • Modular implementation • Extensibility and Flexibility • VMM extension • Manage MMU mappings directly • Kernel driver required • Synchronization operations • Based on SCI atomic transactions • Consistency enforcing mechanisms • Often in coordination with synchronization points • Flush buffers and caches
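A sketch of how the last two points fit together; every identifier below is a placeholder invented for this illustration (the real SCI-VM interface is not shown on the slides). Consistency is enforced at the synchronization points: buffered remote stores are flushed on release, and cached copies of remote pages are discarded on acquire, in the spirit of release consistency.

/* Placeholder prototypes, NOT the real SCI-VM API. */
void scivm_flush_write_buffers(void);            /* push buffered remote stores out */
void scivm_invalidate_cached_remote_pages(void); /* drop locally cached remote data */
void scivm_atomic_lock(int lock);                /* lock via SCI atomic transaction */
void scivm_atomic_unlock(int lock);

void lock_release(int lock)
{
    scivm_flush_write_buffers();              /* make this node's writes visible */
    scivm_atomic_unlock(lock);
}

void lock_acquire(int lock)
{
    scivm_atomic_lock(lock);
    scivm_invalidate_cached_remote_pages();   /* next reads fetch fresh remote data */
}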
PCI/SCI & OS problems • Missing transparency of the hardware • SCI adapter and PCI host bridge introduce errors • The shared memory model expects full correctness • Integration • SCI integration done via an IRM extension • Missing OS integration leads to instabilities • Real OS integration for Linux is in the works • Locality problems • Often the cause of bad performance • Easy means for locality optimizations are needed
Implementation status • SCI-VM prototype running • Windows NT & Linux • Static version (no remapping) • Mostly user level based • VMM extensions for both OSs • Current work • Experiments with more applications • Includes comparison with SW-DSM systems • Adding dynamic version of SCI-VM • Real extension of virtual memory management
SMiLE Shared Memory Programming SCI hardware DSM principle Using raw SCI via SISCI SCI-VM: A global virtual memory for SCI Shared Memory Programming Models A sample model: SPMD Outlook on the Lab-Session
Multi-Programming Model support • Abundance of shared memory models • In contrast to message passing (MPI & PVM) • Few standardization approaches • OpenMP, HPF, POSIX threads • Many DSM systems with their own APIs (historically) • Negative impact on portability • Frequent retargeting necessary • Required: multi-programming-model support • One system with many faces • Minimize the work needed to support additional models
Flexibility of HAMSTER • HAMSTER: Hybrid-dsm based Adaptive and Modular Shared memory archiTEctuRe • Goal: • Support for arbitrary Shmem models • Standard models -> portability of applications • Domain specific models • SCI-VM as efficient DSM core for clusters • Export of various shared memory services • Strongly modularized • Many parameters for easy tuning
HAMSTER in the infrastructure [diagram]: the SMiLE software layer picture with HAMSTER (Hybrid DSM based Adaptive and Modular Shared Memory archiTEctuRe) highlighted; it spans the high-level programming models (SPMD-style model, TreadMarks-compatible API, SISCI PVM, ...) and the SCI-VM, sitting on top of the SCI drivers and the SCI hardware.
HAMSTER framework [diagram]: a shared memory application runs on top of an arbitrary shared memory model, which is built from the HAMSTER management modules (task mgmt., sync. mgmt., cluster ctrl., consistency mgmt., memory mgmt.) on top of the SCI-VM, the hybrid DSM for SCI clusters; underneath sit a standalone OS (Linux & WinNT) with the NIC driver and a cluster built of commodity PC hardware, whose SAN provides HW-DSM and VI-like communication access.
HAMSTER status • Prototype running • Based on the current SCI-VM • Most modules more or less completed • Available programming models • SPMD (incl. SMP version), Jia-Jia • TreadMarks ™, ANL macros (for SPLASH 2) • POSIX thread API (Linux), Win32 threads (WinNT) • Current work • Further programming models (for comparison) • Evaluation using various applications
SMiLE Shared Memory Programming SCI hardware DSM principle Using raw SCI via SISCI SCI-VM: A global virtual memory for SCI Shared Memory Programming Models A sample model: SPMD Outlook on the Lab-Session
What is SPMD ? • Single Program Multiple Data • The same code is executed on each node • Each node works on separate data • Characteristics of SPMD • Static task model (no dynamic spawn) • Fixed assignment of node IDs • Main synchronization construct: Barriers • All data stored in a globally accessible memory • Release Consistency
SPMD model • Task / node management: spmd_getNodeNumber, spmd_getNodeCount • Memory management: spmd_alloc, spmd_allocOpt • Barrier: spmd_allocBarrier, spmd_barrier • Locks: spmd_allocLock, spmd_lock, spmd_unlock • Synchronization routines: spmd_sync, atomic counters • Misc: spmd_getGlobalMem, spmd_distInitialData • Core routines used today: spmd_getNodeNumber, spmd_getNodeCount, spmd_alloc, spmd_allocOpt, spmd_allocBarrier, spmd_barrier
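A small hedged example of the lock routines listed above: only the names appear on the slide, so the signatures are assumed here by analogy with the barrier routines (spmd_allocLock() returning an integer handle for spmd_lock()/spmd_unlock()), and spmd_alloc() is assumed to be a collective call returning the same globally shared pointer on every node, as in the SOR example below.

#include <stdio.h>

/* A shared accumulator protected by an SPMD lock; my_partial_result is this
 * node's locally computed contribution. */
void accumulate(double my_partial_result)
{
    double *sum     = (double *) spmd_alloc(sizeof(double));
    int     lock    = spmd_allocLock();
    int     barrier = spmd_allocBarrier();

    if (spmd_getNodeNumber() == 0)
        *sum = 0.0;
    spmd_barrier(barrier);              /* accumulator initialised on node 0 */

    spmd_lock(lock);                    /* each node adds its contribution */
    *sum += my_partial_result;
    spmd_unlock(lock);

    spmd_barrier(barrier);              /* all contributions are in */
    if (spmd_getNodeNumber() == 0)
        printf("total = %f\n", *sum);
}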
Sample SPMD application • Successive Over-Relaxation (SOR) • Iterative method for solving PDEs • Advantage: Straightforward parallelization • Main concepts of SOR • Dense matrix • For each iteration • Go through whole matrix • Update points
Task distribution [diagram] • Dense matrix with boundary • Split the matrix, e.g. across 2 nodes • Local boundaries • Overlap area with communication • Possibility for locality optimization
SPMD code for SOR application

int count  = spmd_getNodeCount();
int number = spmd_getNodeNumber();

start_row = (N / count) * number;
end_row   = (N / count) * (number + 1);

int    barrier = spmd_allocBarrier();
float *matr    = (float *) spmd_alloc(size);

for (int it = 0; it < iter; it++) {
    /* work on matrix rows start_row .. end_row */
    spmd_barrier(barrier);
}

if (number == 0)
    /* print the result */ ;
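One possible way to fill in the "work on matrix rows" step; the slides do not show the update kernel, so this is an illustrative assumption: a five-point SOR sweep over this node's rows of the shared N x N matrix matr, with relaxation factor omega (0 < omega < 2) and the outer boundary kept fixed. The M macro and the variables first, last, and avg are introduced only for this sketch.

#define M(i, j) matr[(i) * N + (j)]

int first = (start_row > 0)   ? start_row : 1;       /* keep the global boundary fixed */
int last  = (end_row < N - 1) ? end_row   : N - 1;

for (int i = first; i < last; i++)
    for (int j = 1; j < N - 1; j++) {
        float avg = 0.25f * (M(i - 1, j) + M(i + 1, j) +
                             M(i, j - 1) + M(i, j + 1));
        M(i, j) += omega * (avg - M(i, j));
    }
/* the spmd_barrier() after the sweep keeps the nodes in lockstep, so the
 * overlap rows read from the neighbours in the next iteration are consistent */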
Configuration & Startup • Startup • Install SCI-VM driver and IRM extension • Application execution • Shell on each node to run on • Execute the program with identical parameters • Output on the individual shells • Configuration file • Placed in the home dir. (~/.hamster) • SMP description (for certain models) • Example:
# Name   ID  CPUs
smile1    4   2
smile2   68   2
SMiLE Shared Memory Programming SCI hardware DSM principle Using raw SCI via SISCI SCI-VM: A global virtual memory for SCI Shared Memory Programming Models A sample model: SPMD Outlook on the Lab-Session
A warning for the Lab-Session • SCI-VM is still a prototype • First time out of the lab • Remaining problems • Transparency of PCI bridges • OS integration • Hang-ups may occur. In this case: • Remain calm and do not panic • Follow the instructions of the crew • Emergency exits on both sides • Most of the time, a restart/reboot will do the trick
Outline • Lab Setup • Experiments with the SISCI API • SISCI sample code • Add segment management code • Experiments with the SPMD model • SCI-VM / HAMSTER installation • Given: seq. SOR application • Goal: Parallelization
Before the Lab-Session Any Questions ? http://smile.in.tum.de/ http://hamster.in.tum.de/