

PetaScale Single System Image and other stuff

Collaborators

Peter Braam, CFS

Steve Reinhardt, SGI

Stephen Wheat, Intel

Stephen Scott, ORNL



Outline

  • Project Goals & Approach

  • Details on work areas

  • Collaboration ideas

  • Compromising pictures of Barney



Project goals

  • Evaluate methods for predictive task migration

  • Evaluate methods for dynamic superpages

  • Evaluate Linux at O(100,000) in a Single System Image



Approach

  • Engineering: Build a suite based on existing technologies that will provide a foundation for further studies.

  • Research: Test advanced features built upon that foundation.


The Foundation: PetaSSI 1.0 Software Stack – Release as distro July 2005

[Stack diagram: a Single System Image with process migration spans four OpenSSI instances, each running on XenLinux (Linux 2.6.10) with a Lustre client, on top of the Xen Virtual Machine Monitor and the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE).]

SSI Software: OpenSSI 1.9.1

Filesystem: Lustre 1.4.2

Basic OS: Linux 2.6.10

Virtualization: Xen 2.0.2



Work Areas

  • Simulate ~10K nodes running OpenSSI via virtualization techniques

  • System-wide tools and process management for a ~100K processor environment

  • Testing Linux SMP at higher processor count

  • Scalability studies of a shared root filesystem up to ~10K nodes

  • High Availability - Preemptive task migration.

  • Quantify “OS noise” at scale

  • Dynamic large page sizes (superpages)


OpenSSI Cluster SW Architecture

[Architecture diagram. Kernel-level components behind the kernel interface: Boot and Init, Cluster Membership (CLMS), Interprocess Communication (IPC), Process Mgmt/Vproc, process load leveling, Cluster Filesystem (CFS), Lustre client, DLM, LVS, Remote File Block, and Devices. Above the kernel sit install and sysadmin, application monitoring and restart, HA resource management and job scheduling, and MPI. Subsystems interface to CLMS for nodedown/nodeup notification and use the Internode Communication Subsystem (ICS) for inter-node communication, over Quadrics, Myricom, InfiniBand, or TCP/IP, with RDMA.]

Approach for researching PetaScale SSI

[Diagram: the OpenSSI stack is split between service nodes and compute nodes. Compute nodes carry a reduced stack: Boot, CLMS Lite, Vproc, Lustre client, Remote File Block, MPI, and ICS. Service nodes carry the full stack: Boot and Init, CLMS, Vproc, Lustre client, Remote File Block, ICS, IPC, DLM, LVS, Cluster Filesystem (CFS), process load leveling, Devices, MPI, install and sysadmin, application monitoring and restart, and HA resource management and job scheduling.]

Compute Nodes

single install; network or local boot; not part of the single IP and no connection load balancing; single root with caching (Lustre); single filesystem namespace (Lustre); no single IPC namespace (optional); single process space but no process load leveling; no HA participation; scalable (relaxed) membership; inter-node communication channels on demand only

Service Nodes

single install; local boot (for HA); single IP (LVS); connection load balancing (LVS); single root with HA (Lustre); single filesystem namespace (Lustre); single IPC namespace; single process space and process load leveling; application HA; strong/strict membership



Approach to scaling OpenSSI

  • Two or more “service” nodes + optional “compute” nodes

    • Service nodes provide availability and 2 forms of load balancing

      • Computation can also be done on service nodes

    • Compute nodes allow for even larger scaling

      • No daemons required on compute nodes for job launch or stdio

  • Vproc for cluster-wide monitoring, e.g.,

    • Process space, process launch and process movement

  • Lustre for scaling up filesystem story (including root)

    • Enable diskless node option

  • Integration with other open source components:

    • Lustre, LVS, PBS, Maui, Ganglia, SuperMon, SLURM, …



Simulate 10,000 nodes running OpenSSI

  • Using virtualization techniques to demonstrate basic functionality (booting, etc.):

    • Xen

    • Trying to quantify how many virtual machines we can run per physical machine (see the sketch below).

  • Simulation enables assessment of relative performance:

    • Establish performance characteristics of individual OpenSSI components at scale (e.g., Lustre on 10,000 processors)

  • Exploring hardware testbed for performance characterization:

    • 786 processor IBM Power (would require port to Power)

    • 4096 processor Cray XT3 (Catamount → Linux)

    • OS Testbeds are a major issue for Fast-OS projects
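One way to pack many simulated OpenSSI nodes onto a single physical host is to generate a Xen domU configuration per virtual node and start each one with `xm create`. The sketch below is only illustrative: the kernel path, disk image names, memory size, and output directory are placeholders, and the exact configuration keys accepted vary by Xen release.

# Generate per-node Xen domU configuration files and start them with
# "xm create". Kernel path, disk images, and memory size are placeholders.
import subprocess
from pathlib import Path

CONFIG_TEMPLATE = """\
name   = "ssi-node{node:04d}"
kernel = "/boot/vmlinuz-2.6.10-xenU"    # guest kernel (placeholder path)
memory = 64                             # MB per simulated node
disk   = ['file:/var/xen/ssi-node{node:04d}.img,sda1,w']
root   = "/dev/sda1 ro"
vif    = ['']                           # one virtual NIC on the default bridge
"""

def launch_nodes(count, cfg_dir="ssi-xen-configs", dry_run=True):
    Path(cfg_dir).mkdir(parents=True, exist_ok=True)
    for n in range(count):
        cfg = Path(cfg_dir) / f"ssi-node{n:04d}.cfg"
        cfg.write_text(CONFIG_TEMPLATE.format(node=n))
        cmd = ["xm", "create", str(cfg)]
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Raise the count until the host runs out of memory; that point is the
    # "virtual machines per physical machine" figure being quantified.
    launch_nodes(8)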



Virtualization and HPC

  • For latency tolerant applications it is possible to run multiple virtual machines on a single physical node to simulate large node counts.

  • Xen has little overhead when GuestOSes are not contending for resources. This may provide a path to supporting multiple OSes on a single HPC system.

Overhead to build a Linux kernel on a GuestOS

Xen: 3%

VMWare: 27%

User Mode Linux: 103%
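Numbers like these come from timing the same workload natively and inside a GuestOS and comparing. A minimal harness, assuming a kernel source tree at a placeholder path, might look like the following; the percentages on the slide are measured results, not something this script produces.

# Time an identical workload (e.g. a kernel build) natively and in a
# GuestOS, then report the relative slowdown. The build command and
# source path are placeholders.
import subprocess
import time

def timed_build(src_dir="/usr/src/linux", jobs=2):
    start = time.time()
    subprocess.run(["make", f"-j{jobs}"], cwd=src_dir, check=True)
    return time.time() - start

def overhead_pct(native_seconds, guest_seconds):
    return 100.0 * (guest_seconds - native_seconds) / native_seconds

# A 3% overhead corresponds to a guest build taking 1.03x the native time:
print(overhead_pct(1000.0, 1030.0))   # approximately 3.0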



Pushing nodes to higher processor count, and integration with OpenSSI

What is needed to push kernel scalability further:

  • Continued work to quantify spinlocking bottlenecks in the kernel.

  • Using Open Source LockMeter

    • http://oss.sgi.com/projects/lockmeter/

      Paper about 2K-node Linux kernel at SGIUG next week in Germany

What is needed for SSI integration:

  • Continued SMP spinlock testing

  • Move to 2.6 kernel

  • Application Performance testing

  • Large page integration

Quantifying CC impacts on Fast Multipole Method using 512P Altix

Paper at SGIUG next week


Establish the intersection of OpenSSI clusters and large kernels to get to 100,000+ processors

[Diagram: one axis runs from a stock Linux kernel at 1 CPU toward a single Linux kernel at 2048 CPUs; the other runs from a typical SSI at 1 node toward OpenSSI clusters at 10,000 nodes. Continue SGI's work on single-kernel scalability, continue OpenSSI's work on SSI scalability, and test the intersection of large kernels with OpenSSI software to establish the sweet spot for 100,000-processor Linux environments.]

1) Establish scalability baselines

2) Enhance scalability of both approaches

3) Understand the intersection of both methods



System-wide tools and process management for a 100,000 processor environment

  • Study process creation performance

    • Build a tree-based launch strategy if necessary (see the sketch below)

  • Leverage periodic information collection

  • Study scalability of utilities like top, ls, ps, etc.
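For the tree strategy mentioned above, the idea is that no single node launches all O(100K) processes: each launcher starts a bounded number of children, and each child launches its own subtree, so launch depth grows logarithmically with process count. The sketch below is hypothetical (rank numbering and fan-out are illustrative, and a real launcher would start children on remote nodes rather than print).

# Tree-structured launch: each launcher starts at most FANOUT children,
# and each child launches its own subtree, so the critical path is
# O(log N) launch steps instead of O(N).
FANOUT = 4            # small for the demo; a real launcher might use 32+

def plan_subtree(root_rank, count):
    """Split `count` ranks rooted at `root_rank` into at most FANOUT child
    subtrees of near-equal size; returns (child_rank, child_count) pairs."""
    remaining = count - 1                      # ranks below the root
    if remaining <= 0:
        return []
    branches = min(FANOUT, remaining)
    base, extra = divmod(remaining, branches)
    plan, next_rank = [], root_rank + 1
    for i in range(branches):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        plan.append((next_rank, size))
        next_rank += size
    return plan

def launch(root_rank, count, depth=0):
    # In a real launcher each child would be started on a remote node
    # (e.g. via Vproc-style remote process creation) and would then call
    # launch() for its own slice of ranks.
    for child_rank, child_count in plan_subtree(root_rank, count):
        print("  " * depth +
              f"rank {root_rank} starts rank {child_rank} "
              f"(subtree of {child_count})")
        launch(child_rank, child_count, depth + 1)

launch(0, 20)   # 20 ranks, fan-out 4: the tree is only two levels deep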



The scalability of a shared root filesystem to 10,000 nodes

  • Work started to enable Xen.

    • validation work to date has been with UML

  • Lustre is being tested and enhanced to be a root filesystem.

    • Validated functionality with OpenSSI

    • Currently a bug with tty character device access


Scalable IO Testbed

[Diagram: an I/O testbed spanning IBM Power4 (GPFS), Cray XT3 (Lustre), SGI Altix (XFS), Cray X1 (XFS), and Linux systems, linked through a Lustre gateway, with a Lustre root filesystem, a GPFS archive on IBM Power3, and Cray XD1 and Cray X2 machines.]


High Availability Strategies - Applications

Predictive Failures and Migration

  • Leverage Intel failure predictive work [NDA required]

  • OpenSSI supports process migration… the hard part is MPI rebinding (sketched below).

    • On the next global collective:

      • Don’t return until you have reconnected with the indicated client;

      • The specific client moves, then reconnects and responds to the collective

  • Do first for MPI and then adapt to UPC
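The rebinding rule above can be read as: the collective acts as a barrier that does not complete until every rank, including the one that just moved, has a live connection again. The sketch below is a schematic model of that rule, not OpenSSI or any MPI library's actual code.

# Schematic model of "rebind on the next global collective": the
# collective does not return until every rank has reconnected from its
# current node, so a migrated rank rebinds and then answers.

class Rank:
    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.node = f"node{rank_id}"
        self.connected = True

    def migrate(self, new_node):
        # Process migration (e.g. under OpenSSI) leaves stale connections.
        self.node = new_node
        self.connected = False

    def reconnect(self):
        self.connected = True

def collective(ranks):
    # Hold the collective until every rank has a live connection.
    for r in ranks:
        if not r.connected:
            print(f"rank {r.rank_id}: reconnecting from {r.node}")
            r.reconnect()
    print("collective completes with all ranks rebound")

ranks = [Rank(i) for i in range(4)]
ranks[2].migrate("node7")      # predictive migration off a failing node
collective(ranks)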



“OS noise” (stuff that interrupts computation)

Problem – even small overhead could have a large impact on large-scale applications that co-ordinate often

Investigation:

  • Identify sources and measure overhead (see the probe sketch below)

    • Interrupts, daemons and kernel threads

Solution directions:

  • Eliminate daemons or minimize OS

  • Reduce clock overhead

  • Register noise makers and:

    • Co-ordinate across the cluster;

    • Make noise only when the machine is idle;

  • Tests:

    • Run Linux on ORNL XT3 to evaluate against Catamount

    • Run daemons on a different physical node under SSI

    • Run application and services on different sections of a hypervisor
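A common way to quantify noise of this kind is a fixed-work probe: repeat a constant amount of work many times and record how long each repetition takes, so interrupts and daemon activity show up as outliers above the baseline. A minimal sketch (loop sizes and the spike threshold are arbitrary choices for illustration):

# Fixed-work probe: the same arithmetic loop should take the same time
# every iteration; spikes above the baseline are "OS noise".
import time

def probe(iterations=1000, work=50_000):
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        acc = 0
        for i in range(work):          # constant work per quantum
            acc += i
        samples.append(time.perf_counter() - t0)
    return samples

samples = probe()
baseline = min(samples)
spikes = [s for s in samples if s > 1.5 * baseline]   # arbitrary threshold
print(f"baseline {baseline * 1e6:.1f} us per quantum, "
      f"{len(spikes)} noisy quanta out of {len(samples)}")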



The use of very large page sizes (superpages) for large address spaces

  • Increasing cost in TLB miss overhead

    • growing working sets

    • TLB size does not grow at the same pace

  • Processors now provide superpages

    • one TLB entry can map a large region

    • Most mainstream processors will support superpages in the next few years.

  • OSs have been slow to harness them

    • no transparent superpage support for apps


TLB coverage trend

[Chart: TLB coverage as a percentage of main memory has decreased by a factor of 1000 in 15 years; TLB miss overheads are annotated at roughly 5%, 5-10%, and 30%.]
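The trend in the chart is straightforward arithmetic: TLB coverage is the number of TLB entries times the base page size, taken as a fraction of physical memory, and entry counts have stayed within a small constant factor while memory sizes have grown by orders of magnitude. The numbers below are illustrative round figures, not the data points from the slide.

# TLB coverage = entries * base page size, as a share of physical memory.
# Entry counts and memory sizes below are illustrative round numbers.
def coverage_pct(tlb_entries, page_bytes, mem_bytes):
    return 100.0 * tlb_entries * page_bytes / mem_bytes

KB, MB, GB = 2**10, 2**20, 2**30

# Older machine: ~64 entries, 4 KB pages, ~8 MB of RAM
print(f"{coverage_pct(64, 4 * KB, 8 * MB):.2f}%")    # about 3%

# Circa-2005 machine: ~128 entries, 4 KB pages, ~4 GB of RAM
print(f"{coverage_pct(128, 4 * KB, 4 * GB):.4f}%")   # about 0.01%

# One 4 MB superpage entry maps as much as 1024 base-page entries:
print((4 * MB) // (4 * KB))                          # 1024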



Other approaches

  • Reservations

    • one superpage size only

  • Relocation

    • move pages at promotion time

    • must recover copying costs

  • Eager superpage creation (IRIX, HP-UX)

    • size specified by user: non-transparent

  • Demotion issues not addressed

    • large pages partially dirty/referenced



Approach to be studied under this project: Dynamic superpages

Observation: Once an application touches the first page of a memory object, it is likely to quickly touch every page of that object

Example: array initialization

  • Opportunistic policy

    • Go for the biggest size that is no larger than the memory object (e.g., a file)

    • If that size is not available, try preemption before settling for a smaller size (see the sketch below)

  • Speculative demotions

  • Manage fragmentation

    Current work has been on Itanium and Alpha, both running BSD. This project will focus on Linux, and we are currently investigating other processors.
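A hedged sketch of the opportunistic policy described above: on first touch, reserve the largest supported page size that does not exceed the memory object, try preemption when no region of that size is free, and otherwise fall back one size at a time. The size list, allocator check, and preemption hook are placeholders, not a real kernel interface.

# Sketch of the opportunistic superpage policy: choose the biggest page
# size no larger than the memory object, trying preemption before
# settling for a smaller size. The hooks below are placeholders.

PAGE_SIZES = [4 << 20, 512 << 10, 64 << 10, 4 << 10]   # largest first, bytes

def pick_superpage(object_size, can_allocate, try_preempt):
    """Return the page size chosen for a memory object of object_size bytes.

    can_allocate(size) -> bool: is a contiguous free region of `size` available?
    try_preempt(size)  -> bool: could one be freed by preempting a reservation?
    """
    for size in PAGE_SIZES:
        if size > object_size:
            continue                      # never allocate past the object
        if can_allocate(size):
            return size
        if try_preempt(size) and can_allocate(size):
            return size                   # preemption freed a region
    return PAGE_SIZES[-1]                 # worst case: base pages

# Toy usage: a 1 MB file-backed object when only 512 KB regions are free.
free_sizes = {512 << 10, 64 << 10, 4 << 10}
chosen = pick_superpage(1 << 20,
                        can_allocate=lambda s: s in free_sizes,
                        try_preempt=lambda s: False)
print(chosen)   # 524288 bytes, i.e. a 512 KB superpage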



Best-case benefits on Itanium

  • SPEC CPU2000 integer

    • 12.7% improvement (0 to 37%)

  • Other benchmarks

    • FFT (200³ matrix): 13% improvement

    • 1000x1000 matrix transpose: 415% improvement

  • 25%+ improvement in 6 out of 20 benchmarks

  • 5%+ improvement in 16 out of 20 benchmarks


Why multiple superpage sizes

Improvements with only one superpage size vs. all sizes on Alpha:

            64KB    512KB    4MB     All
  FFT        1%      0%      55%     55%
  galgel    28%     28%       1%     29%
  mcf       24%     31%      22%     68%



Summary

  • Simulate ~10K nodes running OpenSSI via virtualization techniques

  • System-wide tools and process management for a ~100K processor environment

  • Testing Linux SMP at higher processor count

  • Scalability studies of a shared root filesystem up to ~10K nodes

  • High Availability - Preemptive task migration.

  • Quantify “OS noise” at scale

  • Dynamic large page sizes (superpages)



June 2005 Project Status: Work done since funding started

Xen and OpenSSI validation [done]

Xen and Lustre validation [done]

C/R added to OpenSSI [done]

IB port of OpenSSI [done]

Lustre Installation Manual [Book]

Lockmeter at 2048 CPUs [Paper at SGIUG]

CC impacts on apps at 512P [Paper at SGIUG]

Cluster Proc hooks [Paper at OLS]

Scaling study of OpenSSI [Paper at COSET]

HA OpenSSI [Submitted Cluster05]

OpenSSI Socket Migration [Pending]

PetaSSI 1.0 release in July 2005



Collaboration ideas

  • Linux on XT3 to quantify LWK

  • Xen and other virtualization techniques

  • Dynamic vs. Static superpages



Questions

