

PetaScale Single System Image and other stuff

Collaborators

Peter Braam, CFS

Steve Reinhardt, SGI

Stephen Wheat, Intel

Stephen Scott, ORNL



Outline

  • Project Goals & Approach

  • Details on work areas

  • Collaboration ideas

  • Compromising pictures of Barney



Project goals

  • Evaluate methods for predictive task migration

  • Evaluate methods for dynamic superpages

  • Evaluate Linux at O(100,000) in a Single System Image



Approach

  • Engineering: Build a suite based on existing technologies that will provide a foundation for further studies.

  • Research: Test advanced features built upon that foundation.


The Foundation: PetaSSI 1.0 Software Stack – Release as distro July 2005

[Stack diagram: a Single System Image with process migration spans four OpenSSI instances, each running on XenLinux (Linux 2.6.10) with a Lustre client, on top of the Xen Virtual Machine Monitor and the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE).]

SSI Software: OpenSSI 1.9.1

Filesystem: Lustre 1.4.2

Basic OS: Linux 2.6.10

Virtualization: Xen 2.0.2



Work Areas

  • Simulate ~10K nodes running OpenSSI via virtualization techniques

  • System-wide tools and process management for a ~100K processor environment

  • Testing Linux SMP at higher processor count

  • Scalability studies of a shared root filesystem up to ~10K nodes

  • High Availability - Preemptive task migration.

  • Quantify “OS noise” at scale

  • Dynamic large page sizes (superpages)


OpenSSI Cluster SW Architecture

[Architecture diagram. Kernel-level components behind the kernel interface: Boot and Init, Cluster Membership (CLMS), Interprocess Communication (IPC), Process Mgmt/Vproc, process load leveling, Cluster Filesystem (CFS), Lustre client, DLM, LVS, Remote File Block, and Devices. Above the kernel sit install and sysadmin, application monitoring and restart, HA resource management and job scheduling, and MPI. Subsystems interface to CLMS for nodedown/nodeup notification and use the Internode Communication Subsystem (ICS) for inter-node communication, over Quadrics, Myricom, InfiniBand, or TCP/IP, with RDMA.]

Approach for researching PetaScale SSI

[Diagram: the OpenSSI stack is split between service nodes and compute nodes. Compute nodes carry a reduced stack: Boot, CLMS Lite, Vproc, Lustre client, Remote File Block, MPI, and ICS. Service nodes carry the full stack: Boot and Init, CLMS, Vproc, Lustre client, Remote File Block, ICS, IPC, DLM, LVS, Cluster Filesystem (CFS), process load leveling, Devices, MPI, install and sysadmin, application monitoring and restart, and HA resource management and job scheduling.]

Compute Nodes

single install; network or local boot; not part of the single IP and no connection load balancing; single root with caching (Lustre); single filesystem namespace (Lustre); no single IPC namespace (optional); single process space but no process load leveling; no HA participation; scalable (relaxed) membership; inter-node communication channels on demand only

Service Nodes

single install; local boot (for HA); single IP (LVS); connection load balancing (LVS); single root with HA (Lustre); single filesystem namespace (Lustre); single IPC namespace; single process space and process load leveling; application HA; strong/strict membership



Approach to scaling OpenSSI

  • Two or more “service” nodes + optional “compute” nodes

    • Service nodes provide availability and 2 forms of load balancing

      • Computation can also be done on service nodes

    • Compute nodes allow for even larger scaling

      • No daemons required on compute nodes for job launch or stdio

  • Vproc for cluster-wide monitoring, e.g.,

    • Process space, process launch and process movement

  • Lustre for scaling up filesystem story (including root)

    • Enable diskless node option

  • Integration with other open source components:

    • Lustre, LVS, PBS, Maui, Ganglia, SuperMon, SLURM, …



Simulate 10,000 nodes running OpenSSI

  • Using virtualization techniques to demonstrate basic functionality (booting, etc.):

    • Xen

    • Trying to quantify how many virtual machines we can run per physical machine (see the sketch below).

  • Simulation enables assessment of relative performance:

    • Establish performance characteristics of individual OpenSSI components at scale (e.g., Lustre on 10,000 processors)

  • Exploring hardware testbed for performance characterization:

    • 786 processor IBM Power (would require port to Power)

    • 4096 processor Cray XT3 (Catamount → Linux)

    • OS Testbeds are a major issue for Fast-OS projects
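One way to pack many simulated OpenSSI nodes onto a single physical host is to generate a Xen domU configuration per virtual node and start each one with `xm create`. The sketch below is only illustrative: the kernel path, disk image names, memory size, and output directory are placeholders, and the exact configuration keys accepted vary by Xen release.

# Generate per-node Xen domU configuration files and start them with
# "xm create". Kernel path, disk images, and memory size are placeholders.
import subprocess
from pathlib import Path

CONFIG_TEMPLATE = """\
name   = "ssi-node{node:04d}"
kernel = "/boot/vmlinuz-2.6.10-xenU"    # guest kernel (placeholder path)
memory = 64                             # MB per simulated node
disk   = ['file:/var/xen/ssi-node{node:04d}.img,sda1,w']
root   = "/dev/sda1 ro"
vif    = ['']                           # one virtual NIC on the default bridge
"""

def launch_nodes(count, cfg_dir="ssi-xen-configs", dry_run=True):
    Path(cfg_dir).mkdir(parents=True, exist_ok=True)
    for n in range(count):
        cfg = Path(cfg_dir) / f"ssi-node{n:04d}.cfg"
        cfg.write_text(CONFIG_TEMPLATE.format(node=n))
        cmd = ["xm", "create", str(cfg)]
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Raise the count until the host runs out of memory; that point is the
    # "virtual machines per physical machine" figure being quantified.
    launch_nodes(8)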



Virtualization and HPC

  • For latency tolerant applications it is possible to run multiple virtual machines on a single physical node to simulate large node counts.

  • Xen has little overhead when GuestOSes are not contending for resources. This may provide a path to supporting multiple OSes on a single HPC system.

Overhead to build a Linux kernel on a GuestOS

Xen: 3%

VMWare: 27%

User Mode Linux: 103%
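Numbers like these come from timing the same workload natively and inside a GuestOS and comparing. A minimal harness, assuming a kernel source tree at a placeholder path, might look like the following; the percentages on the slide are measured results, not something this script produces.

# Time an identical workload (e.g. a kernel build) natively and in a
# GuestOS, then report the relative slowdown. The build command and
# source path are placeholders.
import subprocess
import time

def timed_build(src_dir="/usr/src/linux", jobs=2):
    start = time.time()
    subprocess.run(["make", f"-j{jobs}"], cwd=src_dir, check=True)
    return time.time() - start

def overhead_pct(native_seconds, guest_seconds):
    return 100.0 * (guest_seconds - native_seconds) / native_seconds

# A 3% overhead corresponds to a guest build taking 1.03x the native time:
print(overhead_pct(1000.0, 1030.0))   # approximately 3.0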



Pushing nodes to higher processor count, and integration with OpenSSI

What is needed to push kernel scalability further:

  • Continued work to quantify spinlocking bottlenecks in the kernel.

  • Using Open Source LockMeter

    • http://oss.sgi.com/projects/lockmeter/

      Paper about 2K-node Linux kernel at SGIUG next week in Germany

What is needed for SSI integration:

  • Continued SMP spinlock testing

  • Move to 2.6 kernel

  • Application Performance testing

  • Large page integration

Quantifying CC impacts on Fast Multipole Method using 512P Altix

Paper at SGIUG next week


Establish the intersection of OpenSSI clusters and large kernels to get to 100,000+ processors

[Diagram: one axis runs from a stock Linux kernel at 1 CPU toward a single Linux kernel at 2048 CPUs; the other runs from a typical SSI at 1 node toward OpenSSI clusters at 10,000 nodes. Continue SGI's work on single-kernel scalability, continue OpenSSI's work on SSI scalability, and test the intersection of large kernels with OpenSSI software to establish the sweet spot for 100,000-processor Linux environments.]

1) Establish scalability baselines

2) Enhance scalability of both approaches

3) Understand the intersection of both methods



System-wide tools and process management for a 100,000 processor environment

  • Study process creation performance

    • Build a tree-based launch strategy if necessary (see the sketch below)

  • Leverage periodic information collection

  • Study scalability of utilities like top, ls, ps, etc.
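For the tree strategy mentioned above, the idea is that no single node launches all O(100K) processes: each launcher starts a bounded number of children, and each child launches its own subtree, so launch depth grows logarithmically with process count. The sketch below is hypothetical (rank numbering and fan-out are illustrative, and a real launcher would start children on remote nodes rather than print).

# Tree-structured launch: each launcher starts at most FANOUT children,
# and each child launches its own subtree, so the critical path is
# O(log N) launch steps instead of O(N).
FANOUT = 4            # small for the demo; a real launcher might use 32+

def plan_subtree(root_rank, count):
    """Split `count` ranks rooted at `root_rank` into at most FANOUT child
    subtrees of near-equal size; returns (child_rank, child_count) pairs."""
    remaining = count - 1                      # ranks below the root
    if remaining <= 0:
        return []
    branches = min(FANOUT, remaining)
    base, extra = divmod(remaining, branches)
    plan, next_rank = [], root_rank + 1
    for i in range(branches):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        plan.append((next_rank, size))
        next_rank += size
    return plan

def launch(root_rank, count, depth=0):
    # In a real launcher each child would be started on a remote node
    # (e.g. via Vproc-style remote process creation) and would then call
    # launch() for its own slice of ranks.
    for child_rank, child_count in plan_subtree(root_rank, count):
        print("  " * depth +
              f"rank {root_rank} starts rank {child_rank} "
              f"(subtree of {child_count})")
        launch(child_rank, child_count, depth + 1)

launch(0, 20)   # 20 ranks, fan-out 4: the tree is only two levels deep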



The scalability of a shared root filesystem to 10,000 nodes

  • Work started to enable Xen.

    • validation work to date has been with UML

  • Lustre is being tested and enhanced to be a root filesystem.

    • Validated functionality with OpenSSI

    • Currently a bug with tty character device access


Scalable IO Testbed

[Diagram: an I/O testbed spanning IBM Power4 (GPFS), Cray XT3 (Lustre), SGI Altix (XFS), Cray X1 (XFS), and Linux systems, linked through a Lustre gateway, with a Lustre root filesystem, a GPFS archive on IBM Power3, and Cray XD1 and Cray X2 machines.]


High Availability Strategies - Applications

Predictive Failures and Migration

  • Leverage Intel failure predictive work [NDA required]

  • OpenSSI supports process migration… the hard part is MPI rebinding (sketched below).

    • On the next global collective:

      • Don’t return until you have reconnected with the indicated client;

      • The specific client moves, then reconnects and responds to the collective

  • Do first for MPI and then adapt to UPC
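The rebinding rule above can be read as: the collective acts as a barrier that does not complete until every rank, including the one that just moved, has a live connection again. The sketch below is a schematic model of that rule, not OpenSSI or any MPI library's actual code.

# Schematic model of "rebind on the next global collective": the
# collective does not return until every rank has reconnected from its
# current node, so a migrated rank rebinds and then answers.

class Rank:
    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.node = f"node{rank_id}"
        self.connected = True

    def migrate(self, new_node):
        # Process migration (e.g. under OpenSSI) leaves stale connections.
        self.node = new_node
        self.connected = False

    def reconnect(self):
        self.connected = True

def collective(ranks):
    # Hold the collective until every rank has a live connection.
    for r in ranks:
        if not r.connected:
            print(f"rank {r.rank_id}: reconnecting from {r.node}")
            r.reconnect()
    print("collective completes with all ranks rebound")

ranks = [Rank(i) for i in range(4)]
ranks[2].migrate("node7")      # predictive migration off a failing node
collective(ranks)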



“OS noise” (stuff that interrupts computation)

Problem – even small overhead could have a large impact on large-scale applications that co-ordinate often

Investigation:

  • Identify sources and measure overhead (see the probe sketch below)

    • Interrupts, daemons and kernel threads

Solution directions:

  • Eliminate daemons or minimize OS

  • Reduce clock overhead

  • Register noise makers and:

    • Co-ordinate across the cluster;

    • Make noise only when the machine is idle;

  • Tests:

    • Run Linux on ORNL XT3 to evaluate against Catamount

    • Run daemons on a different physical node under SSI

    • Run application and services on different sections of a hypervisor
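A common way to quantify noise of this kind is a fixed-work probe: repeat a constant amount of work many times and record how long each repetition takes, so interrupts and daemon activity show up as outliers above the baseline. A minimal sketch (loop sizes and the spike threshold are arbitrary choices for illustration):

# Fixed-work probe: the same arithmetic loop should take the same time
# every iteration; spikes above the baseline are "OS noise".
import time

def probe(iterations=1000, work=50_000):
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        acc = 0
        for i in range(work):          # constant work per quantum
            acc += i
        samples.append(time.perf_counter() - t0)
    return samples

samples = probe()
baseline = min(samples)
spikes = [s for s in samples if s > 1.5 * baseline]   # arbitrary threshold
print(f"baseline {baseline * 1e6:.1f} us per quantum, "
      f"{len(spikes)} noisy quanta out of {len(samples)}")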



The use of very large page sizes (superpages) for large address spaces

  • Increasing cost in TLB miss overhead

    • growing working sets

    • TLB size does not grow at the same pace

  • Processors now provide superpages

    • one TLB entry can map a large region

    • Most mainstream processors will support superpages in the next few years.

  • OSs have been slow to harness them

    • no transparent superpage support for apps


TLB coverage trend

[Chart: TLB coverage as a percentage of main memory has decreased by a factor of 1000 in 15 years; TLB miss overheads are annotated at roughly 5%, 5-10%, and 30%.]
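The trend in the chart is straightforward arithmetic: TLB coverage is the number of TLB entries times the base page size, taken as a fraction of physical memory, and entry counts have stayed within a small constant factor while memory sizes have grown by orders of magnitude. The numbers below are illustrative round figures, not the data points from the slide.

# TLB coverage = entries * base page size, as a share of physical memory.
# Entry counts and memory sizes below are illustrative round numbers.
def coverage_pct(tlb_entries, page_bytes, mem_bytes):
    return 100.0 * tlb_entries * page_bytes / mem_bytes

KB, MB, GB = 2**10, 2**20, 2**30

# Older machine: ~64 entries, 4 KB pages, ~8 MB of RAM
print(f"{coverage_pct(64, 4 * KB, 8 * MB):.2f}%")    # about 3%

# Circa-2005 machine: ~128 entries, 4 KB pages, ~4 GB of RAM
print(f"{coverage_pct(128, 4 * KB, 4 * GB):.4f}%")   # about 0.01%

# One 4 MB superpage entry maps as much as 1024 base-page entries:
print((4 * MB) // (4 * KB))                          # 1024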



Other approaches

  • Reservations

    • one superpage size only

  • Relocation

    • move pages at promotion time

    • must recover copying costs

  • Eager superpage creation (IRIX, HP-UX)

    • size specified by user: non-transparent

  • Demotion issues not addressed

    • large pages partially dirty/referenced



Approach to be studied under this project: Dynamic superpages

Observation: Once an application touches the first page of a memory object, it is likely to quickly touch every page of that object

Example: array initialization

  • Opportunistic policy

    • Go for the biggest size that is no larger than the memory object (e.g., a file)

    • If that size is not available, try preemption before settling for a smaller size (see the sketch below)

  • Speculative demotions

  • Manage fragmentation

    Current work has been on Itanium and Alpha, both running BSD. This project will focus on Linux, and we are currently investigating other processors.
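A hedged sketch of the opportunistic policy described above: on first touch, reserve the largest supported page size that does not exceed the memory object, try preemption when no region of that size is free, and otherwise fall back one size at a time. The size list, allocator check, and preemption hook are placeholders, not a real kernel interface.

# Sketch of the opportunistic superpage policy: choose the biggest page
# size no larger than the memory object, trying preemption before
# settling for a smaller size. The hooks below are placeholders.

PAGE_SIZES = [4 << 20, 512 << 10, 64 << 10, 4 << 10]   # largest first, bytes

def pick_superpage(object_size, can_allocate, try_preempt):
    """Return the page size chosen for a memory object of object_size bytes.

    can_allocate(size) -> bool: is a contiguous free region of `size` available?
    try_preempt(size)  -> bool: could one be freed by preempting a reservation?
    """
    for size in PAGE_SIZES:
        if size > object_size:
            continue                      # never allocate past the object
        if can_allocate(size):
            return size
        if try_preempt(size) and can_allocate(size):
            return size                   # preemption freed a region
    return PAGE_SIZES[-1]                 # worst case: base pages

# Toy usage: a 1 MB file-backed object when only 512 KB regions are free.
free_sizes = {512 << 10, 64 << 10, 4 << 10}
chosen = pick_superpage(1 << 20,
                        can_allocate=lambda s: s in free_sizes,
                        try_preempt=lambda s: False)
print(chosen)   # 524288 bytes, i.e. a 512 KB superpage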



Best-case benefits on Itanium

  • SPEC CPU2000 integer

    • 12.7% improvement (0 to 37%)

  • Other benchmarks

    • FFT (200³ matrix): 13% improvement

    • 1000x1000 matrix transpose: 415% improvement

  • 25%+ improvement in 6 out of 20 benchmarks

  • 5%+ improvement in 16 out of 20 benchmarks


Why multiple superpage sizes

Improvements with only one superpage size vs. all sizes on Alpha:

            64KB    512KB    4MB     All
  FFT        1%      0%      55%     55%
  galgel    28%     28%       1%     29%
  mcf       24%     31%      22%     68%



Summary

  • Simulate ~10K nodes running OpenSSI via virtualization techniques

  • System-wide tools and process management for a ~100K processor environment

  • Testing Linux SMP at higher processor count

  • Scalability studies of a shared root filesystem up to ~10K nodes

  • High Availability - Preemptive task migration.

  • Quantify “OS noise” at scale

  • Dynamic large page sizes (superpages)



June 2005 Project Status: Work done since funding started

Xen and OpenSSI validation [done]

Xen and Lustre validation [done]

C/R added to OpenSSI [done]

IB port of OpenSSI [done]

Lustre Installation Manual [Book]

Lockmeter at 2048 CPUs [Paper at SGIUG]

CC impacts on apps at 512P [Paper at SGIUG]

Cluster Proc hooks [Paper at OLS]

Scaling study of OpenSSI [Paper at COSET]

HA OpenSSI [Submitted Cluster05]

OpenSSI Socket Migration [Pending]

PetaSSI 1.0 release in July 2005



Collaboration ideas

  • Linux on XT3 to quantify LWK

  • Xen and other virtualization techniques

  • Dynamic vs. Static superpages



Questions

