Achieving Isolation in Consolidated Environments


Presentation Transcript


  1. Achieving Isolation in Consolidated Environments. Jack Lange, Assistant Professor, University of Pittsburgh

  2. Consolidated HPC Environments • The future is consolidation of commodity and HPC workloads • HPC users are moving onto cloud platforms • Dedicated HPC systems are moving towards in-situ organization • Consolidated with visualization and analytics workloads • Can commodity OS/Rs effectively support HPC consolidation? • Commodity Design Goals • Maximized resource utilization • Fairness • Graceful degradation under load

  3. Hardware Partitioning • Current approaches emphasize hardware space sharing • Current systems do support this, but… • Interference still exists inside the system software • Inherent feature of commodity systems • [Diagram: a two-socket node (cores 1–8, per-socket memory) space-shared into a Commodity Partition and an HPC Partition]

  4. HPC vs. Commodity Systems • Commodity systems have a fundamentally different focus than HPC systems • Amdahl’s vs. Gustafson’s laws • Commodity: Optimized for the common case • HPC: The common case is not good enough • At large (tightly coupled) scales, percentiles lose meaning • Collective operations must wait for the slowest node • 1% of nodes can make 99% suffer • HPC systems must optimize for outliers (the worst case)
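For reference, the two laws contrasted above can be written as follows, with p the parallel fraction of the work and N the number of processors:

```latex
% Amdahl's law: speedup for a fixed problem size.
S_{\mathrm{Amdahl}}(N) = \frac{1}{(1 - p) + p/N}

% Gustafson's law: speedup when the problem size scales with the machine.
S_{\mathrm{Gustafson}}(N) = (1 - p) + p\,N
```

Under Amdahl's view the serial fraction caps speedup no matter how many nodes are added, while under Gustafson's view useful work grows with the machine, which is the regime tightly coupled HPC codes are built for.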

  5. Multi-stack Approach • Dynamic Resource Partitions • Runtime segmentation of underlying hardware resources • Assigned to specific workloads • Dynamic Software Isolation • Prevent interference from other workloads • Execute on separate system software stacks • Remove cross stack dependencies • Implementation • Independent system software running on isolated resources

  6. Least Isolatable Units • Independently managed sets of isolated HW resources • Our Approach: Decompose the system into sets of isolatable components • Independent resources that do not interfere with other components • Workloads execute on dedicated collections of LIUs • Units of allocation • CPU, memory, devices • Each is managed by an independent system software stack
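A minimal sketch of what an LIU descriptor might hold; the type and field names below are illustrative assumptions, not the actual implementation:

```c
/* Illustrative sketch of a Least Isolatable Unit (LIU) descriptor.
 * The struct and field names are hypothetical: an LIU bundles CPU
 * cores, a contiguous physical memory range, and devices that can be
 * managed by one software stack without interfering with other LIUs. */
#include <stddef.h>
#include <stdint.h>

struct liu {
    uint64_t     cpu_mask;      /* cores dedicated to this unit            */
    uint64_t     mem_base;      /* start of the unit's physical memory     */
    uint64_t     mem_len;       /* length of that physical memory region   */
    const char **pci_devs;      /* PCI devices (e.g. "0000:81:00.0")       */
    size_t       num_pci_devs;
    void        *owner_stack;   /* system software stack managing the unit */
};
```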

  7. Linux Memory Management • Demand Paging • Primary goal is to optimize memory utilization – not performance • Reduce overhead of common application behavior (fork/exec) • Support many concurrent processes • Large Pages • Integrated with the overall demand paging architecture • Implications for HPC • Insufficient resource isolation • System noise • Linux large page solutions contribute to these problems • [Brian Kocoloski and Jack Lange, “HPMMAP: Lightweight Memory Management for Commodity Operating Systems,” IPDPS 2014]
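To make the demand-paging point concrete, the sketch below (sizes illustrative) counts the minor page faults taken while touching a freshly mapped anonymous region: the mapping itself is nearly free, and physical pages are only assigned on first touch, a common-case optimization that favors utilization over predictable per-access latency:

```c
/* Demonstrate Linux demand paging: anonymous mappings are not backed
 * by physical memory until each page is first touched, and every first
 * touch takes a minor page fault. The 256 MB size is illustrative. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    size_t len = 256UL << 20;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    long before = minor_faults();
    memset(buf, 1, len);               /* first touch faults pages in */
    long after = minor_faults();

    printf("minor faults while touching 256 MB: %ld\n", after - before);
    munmap(buf, len);
    return 0;
}
```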

  8. Transparent Huge Pages (THP) • Fully automatic large page mechanism – no system administration or application cooperation required • (1) The page fault handler uses large pages when possible • (2) khugepaged, a background kernel thread • Periodically allocates a large page • “Merges” the large page into the address space of any process requesting THP support • Requires a global page table lock • Driven by OS heuristics – no knowledge of the application workload
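For reference, an application can opt a specific region into THP through madvise(2); a minimal sketch, with the mapping size chosen arbitrarily:

```c
/* Request THP backing for one anonymous region via MADV_HUGEPAGE.
 * With THP in "madvise" mode only advised regions are eligible;
 * khugepaged may still merge small pages into huge pages here later. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL << 20;           /* 64 MB, illustrative */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    /* Fails only if THP support is compiled out of the kernel. */
    if (madvise(buf, len, MADV_HUGEPAGE) != 0)
        return 1;

    memset(buf, 0, len);               /* fault the region in */
    munmap(buf, len);
    return 0;
}
```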

  9. Transparent Huge Pages • [Plot: large page faults shown in green, small faults delayed by merges shown in blue] • Generally periodic, but not synchronized • Variability increases dramatically under additional load

  10. HugeTLBfs • RAM-based filesystem supporting large page allocation • Requires pre-allocated memory pools reserved by the system administrator • Access is generally managed through libhugetlbfs • Limitations • Cannot back process stacks and other special regions • VMA permission/alignment constraints • Highly susceptible to overhead from system load
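For comparison, explicit huge pages come from the administrator-reserved pool; the sketch below maps an anonymous MAP_HUGETLB region (hugetlbfs-backed), assuming 2 MB huge pages and a pool reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages:

```c
/* Map an explicit huge-page region with MAP_HUGETLB. This fails unless
 * the system administrator has reserved enough huge pages in the pool
 * (e.g. echo 64 > /proc/sys/vm/nr_hugepages). Assumes 2 MB huge pages. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL << 20)

int main(void)
{
    size_t len = 32 * HPAGE_SIZE;      /* 64 MB, illustrative */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* pool exhausted or not reserved */
        return 1;
    }
    memset(buf, 0, len);
    munmap(buf, len);
    return 0;
}
```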

  11. HugeTLBfs • Overhead of small page faults increases substantially • Due to memory exhaustion • HugeTLBfs memory is removed from the pools available to the small page fault handler

  12. HPMMAP Overview • High Performance Memory Mapping and Allocation Platform • Lightweight memory management for unmodified Linux applications • HPMMAP borrows from the Kitten LWK to impose isolated virtual and physical memory management layers • Provide lightweight versions of memory management system calls • Utilize Linux memory offlining to completely manage large contiguous regions • Memory is made available in regions of no less than 128 MB
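The offlining step builds on Linux's stock memory-hotplug interface, which manages RAM in fixed-size blocks (128 MB on typical x86-64 configurations, matching the granularity above). A minimal sketch of offlining one block through sysfs; the block number is an illustrative assumption, root is required, and this is not HPMMAP's actual code:

```c
/* Offline one Linux memory block via the memory-hotplug sysfs files so
 * its physical range can be managed outside the Linux page allocator.
 * The block number is illustrative; the block size can be read from
 * /sys/devices/system/memory/block_size_bytes (commonly 128 MB). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int set_memory_block_state(int block, const char *state)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/memory/memory%d/state", block);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, state, strlen(state));
    close(fd);
    return (n < 0) ? -1 : 0;
}

int main(void)
{
    /* Block 32 is an arbitrary example; an unused, removable block is needed. */
    if (set_memory_block_state(32, "offline") != 0) {
        perror("offline memory block");
        return 1;
    }
    return 0;
}
```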

  13. HPMMAP Application Integration

  14. Results

  15. Evaluation – Multi-Node Scaling • Sandia cluster (8 nodes, 1 Gb Ethernet) • One co-located 4-core parallel kernel build per node • No over-committed cores • 32-rank improvement: 12% for HPCCG, 9% for miniFE, 2% for LAMMPS • miniFE: network overhead past 4 cores • Single-node variability translates into worse scaling (3% improvement in the single-node experiment)

  16. HPC in the cloud • Clouds are starting to look like supercomputers… • But we’re not there yet • Noise issues • Poor isolation • Resource contention • Lack of control over topology • Very bad for tightly coupled parallel apps • Such apps require specialized environments that solve these problems • Approaching convergence • Vision: Dynamically partition cloud resources into HPC and commodity zones

  17. Multi-stack Clouds • Virtualization overhead is not due to hardware costs • It results from the underlying host OS/VMM architectures and policies • Susceptible to performance overhead and interference • Goal: provide isolated HPC VMs on commodity systems • Each zone optimized for its target applications • [Diagram: Commodity VM(s) on KVM over Linux alongside an isolated VM on the Palacios VMM over the Kitten lightweight kernel, sharing the same hardware] • With Jiannan Ouyang and Brian Kocoloski

  18. Multi-OS Architecture • Goals: • Fully isolated and independent operation • OS-bypass communication • No cross-kernel dependencies • Needed Modifications: • Boot process that initializes a subset of offline resources • Dynamic resource (re)assignment to the Kitten LWK • Cross-stack shared memory communication • Block driver interface
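Stock Linux exposes CPU hotplug through sysfs, which is one way a core can be taken offline before being handed to the Kitten LWK; a minimal sketch, not the project's actual management tooling, with the core number chosen arbitrarily (root required):

```c
/* Take one CPU core offline through the Linux CPU-hotplug sysfs file so
 * it can be reassigned to a second system software stack. A sketch:
 * the core number is illustrative and root privileges are required. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int set_cpu_online(int cpu, int online)
{
    char path[64];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/online", cpu);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, online ? "1" : "0", 1);
    close(fd);
    return (n == 1) ? 0 : -1;
}

int main(void)
{
    if (set_cpu_online(7, 0) != 0) {   /* offline core 7 (example) */
        perror("offline cpu7");
        return 1;
    }
    return 0;
}
```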

  19. Isolatable Hardware • We view system resources as a collection of Isolatable Units • In terms of both performance and management • Some hardware makes this easy • PCI (w/ MSI, MSI-X) • APIC • Some hardware makes this difficult • SATA • IO-APIC • IOMMU • Some hardware makes this impossible • Legacy IDE • PCI (w/ legacy PCI INTx IRQs) • Some hardware cannot be completely isolated • SR-IOV PCI devices • Hyper-Threaded CPU cores

  20–27. [Diagram sequence: a two-socket node (cores 1–8, per-socket memory, PCI with an Infiniband NIC and SATA), with resources labeled Linux, Offline, or Kitten and moving between those partitions from frame to frame]

  28. Multi-stack Architecture • Allow multiple dynamically created enclaves • Based on runtime isolation requirements • Provides the flexibility of fully independent OS/Rs • Isolated performance and resource management • [Diagram: commodity application(s) and VM(s) hosted by Linux/KVM alongside HPC applications and an HPC VM hosted by Kitten LWK enclaves running the Palacios VMM, all on the same hardware]

  29. Performance Evaluation • 8-node Infiniband cluster • Space shared between commodity and HPC workloads • Commodity: Hadoop • HPC: HPCCG • Infiniband passthrough for the HPC VM • 1 Gb Ethernet passthrough for the commodity VM • Compared multi-stack (Kitten + Palacios) vs. a full Linux environment (KVM) • 10 experiment runs for each configuration • CAVEAT: VM disks were all accessed from the commodity partition • This suffers significant interference (current work)

  30. Conclusion • Commodity systems are not designed to support HPC workloads • HPC workloads have different requirements and behaviors than commodity applications • A multi-stack approach can provide HPC environments in commodity systems • HPC requirements can be met without separate physical systems • HPC and commodity workloads can dynamically share resources • Isolated system software environments are necessary

  31. Thank you. Jack Lange, Assistant Professor, University of Pittsburgh • jacklange@cs.pitt.edu • http://www.cs.pitt.edu/~jacklange

  32. Multi-stack Operating Systems • Future exascale systems are moving towards an in-situ organization • Applications have traditionally utilized their own platforms • Visualization, storage, analysis, etc. • Everything must now collapse onto a single platform

  33. Performance Comparison • [Plot comparing Linux memory management and lightweight memory management, annotated “occasional outliers (large page coalescing)” for Linux and “low-level noise” for the lightweight manager]
