Scalable Fault Tolerance: Xen Virtualization for PGAS Models on High-Performance Networks

Daniele Scarpazza, Oreste Villa, Fabrizio Petrini, Jarek Nieplocha, Vinod Tipparaju, Manoj Krishnan

Pacific Northwest National Laboratory

Radu Teodorescu, Jun Nakano, Josep Torrellas

University of Illinois

Duncan Roweth

Quadrics

In collaboration with

Patrick Mullaney, Novell

Wayne Augsburger, Mellanox

Project Motivation
  • Component count in high-end systems has been growing
  • How do we utilize large (10^3 to 10^5 processor) systems for solving complex science problems?
  • Fundamental problems
    • Scalability to massive processor counts
    • Application performance on a single processor, given the increasingly complex memory hierarchy
    • Hardware and software failures

[Figure: MTBF as a function of system size]
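A rough model behind such a curve (a standard assumption, not stated on the slide): if a system has N components that fail independently at a constant rate, then

    \mathrm{MTBF}_{\mathrm{system}} \approx \frac{\mathrm{MTBF}_{\mathrm{component}}}{N}

so a machine with tens of thousands of processors should expect failures orders of magnitude more often than a single node.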

Multiple FT Techniques
  • Application drivers
    • The multidisciplinary, multiresolution, and multiscale nature of scientific problems drives the demand for high-end systems
    • Applications place increasingly differing demands on the system resources: disk, network, memory, and CPU
    • Some of them have natural fault resiliency and require very little support
  • System drivers
    • Different I/O configurations, programmable or simple/commodity NICs, proprietary/custom/commodity operating systems
  • Tradeoffs between acceptable rates of failure and cost
    • Cost effectiveness is the main constraint in HPC
  • Therefore, it is not cost-effective or practical to rely on a single fault tolerance approach for all applications and systems
Key Elements of SFT
  • BCS: Buffered Coscheduling provides global coordination of system activities, communication, and checkpoint/restart (CR)
  • ReVive (I/O): ReVive and ReVive I/O provide an efficient CR capability for shared-memory servers (cluster nodes)
  • IBA/QsNET: Virtualization of high-performance network interfaces and protocols
  • FT ARMCI: Fault-tolerance module for the ARMCI runtime system
  (Focus of the talk)
  • XEN: Hypervisor to enable virtualization of the compute-node environment, including the OS (external dependency)

Transparent System-level CR of PGAS Applications on Infiniband and QsNET
  • We explored a new approach to cluster fault-tolerance by integrating Xen with the latest generations of Infiniband and Quadrics high-performance networks
  • Focus on Partitioned Global Address Space (PGAS) programming models
    • Most existing work has focused on MPI
  • Design Goals
    • Low overhead
    • Transparent migration
Main Contributions
  • Integration of Xen and Infiniband
    • Enhanced Xen’s kernel modules to fully support user-level Infiniband protocols and IP over IB with minimal overhead
  • Support for Partitioned Global Address Space (PGAS) programming models
    • Emphasis on ARMCI
  • Automatic Detection of a Global Recovery Line and Coordinated Migration
    • Perform a live migration without any change to user applications
  • Experimental evaluation
Xen Hypervisor
  • On each machine, Xen allows the creation of a privileged virtual machine (Dom0) and one or more non-privileged VMs (DomUs)
  • Xen provides the ability to pause, un-pause, checkpoint and resume DomUs (a minimal control-API sketch follows this list)
  • Xen employs para-virtualization
    • Non-privileged domains run a modified operating system featuring guest device drivers
    • Their requests are forwarded to the native device driver in Dom0 using a split driver model
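The pause/un-pause capability mentioned above can be driven from Dom0 through libxenctrl. A minimal sketch, assuming the modern xenctrl.h interface (which differs slightly from the Xen 3.0-era handle type used in this work); error handling is trimmed:

/* Sketch: pause and resume a guest domain (DomU) from Dom0 via libxenctrl.
 * Assumes the Xen 4.x-style xc_interface handle; error handling trimmed. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <xenctrl.h>

int main(int argc, char **argv)
{
    uint32_t domid = (argc > 1) ? (uint32_t)atoi(argv[1]) : 1;

    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    if (!xch)
        return 1;

    /* Freeze the guest: its virtual CPUs stop running until unpaused. */
    if (xc_domain_pause(xch, domid))
        fprintf(stderr, "pause of domain %u failed\n", (unsigned)domid);

    /* ... a checkpoint or live migration would be coordinated here ... */

    if (xc_domain_unpause(xch, domid))
        fprintf(stderr, "unpause of domain %u failed\n", (unsigned)domid);

    xc_interface_close(xch);
    return 0;
}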
Infiniband Device Driver
  • The driver is implemented in two sections (a verbs-level sketch follows this list)
    • A paravirtualized section for slow-path control operations (e.g., queue-pair creation), and
    • A direct-access section for fast-path data operations (transmit/receive)
  • Based on the Ohio State/IBM implementation
  • The driver was extended to support additional CPU architectures and Infiniband adapters
    • Added a proxy layer to allow subnet and connection management from guest VMs
    • Propagated suspend/resume to applications, not only to kernel modules
    • Several stability improvements
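A sketch of where that slow-path/fast-path split falls, expressed with standard libibverbs calls (the device setup shown is assumed, not taken from the driver; error handling and the QP transition to a ready state are trimmed):

/* Sketch: slow-path (paravirtualized) vs. fast-path (direct-access) operations
 * in the split driver model, shown with standard libibverbs calls. */
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0)
        return 1;

    /* Slow path: control operations travel through the split driver to the
     * native driver in Dom0 (device open, protection domain, CQ, QP setup). */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx)
        return 1;
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    if (!pd || !cq)
        return 1;

    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.qp_type = IBV_QPT_RC;
    attr.cap.max_send_wr  = 16;
    attr.cap.max_recv_wr  = 16;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp)
        return 1;

    /* Fast path: data operations are posted by the guest straight to the HCA,
     * bypassing Dom0. (A real transfer needs a registered buffer and a QP in
     * the ready-to-send state, both omitted here, so this post fails cleanly.) */
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad = NULL;
    memset(&sge, 0, sizeof(sge));
    memset(&wr, 0, sizeof(wr));
    wr.opcode  = IBV_WR_SEND;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    if (ibv_post_send(qp, &wr, &bad))
        fprintf(stderr, "post_send rejected (QP not ready in this sketch)\n");

    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}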
Parallel Programming Models
  • Single Threaded
    • Data Parallel, e.g. HPF
  • Multiple Processes
    • Partitioned-Local Data Access
      • MPI
    • Uniform-Global-Shared Data Access
      • OpenMP
    • Partitioned-Global-Shared Data Access
      • Co-Array Fortran
    • Uniform-Global-Shared + Partitioned Data Access
      • UPC, Global Arrays, X10
Fault Tolerance in PGAS Models
  • Implementation considerations
    • 1-sided communication, perhaps some 2-sided and collectives
    • Special considerations in implementation of global recovery line
    • Memory operations need to be synchronized for checkpoint/restart
  • Memory is a combination of local and global (globally visible)
    • Global memory may be shared memory from the OS point of view
    • Pinned and registered with the network adapter (see the registration sketch after the figure)

[Diagram: SMP node 0 ... SMP node n connected by a network; each process (P) has local memory (L) and a partition of the globally visible memory (G)]
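That pinning and registration is what ties a global segment to the network adapter, and it is part of what a transparent checkpoint or migration has to undo and redo. A minimal registration sketch using standard libibverbs calls; the protection domain and segment size are assumed inputs, not code from this work:

/* Sketch: pinning and registering a globally visible segment with the HCA,
 * as a PGAS runtime must do before serving one-sided RMA to it. */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_global_segment(struct ibv_pd *pd, size_t bytes,
                                       void **buf_out)
{
    void *buf = malloc(bytes);          /* the globally addressable segment */
    if (!buf)
        return NULL;

    /* ibv_reg_mr pins the pages and gives the HCA their mapping; the
     * returned lkey/rkey let local code and remote peers access the region. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, bytes,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }
    *buf_out = buf;
    return mr;   /* later: ibv_dereg_mr(mr) un-pins it, e.g. before migration */
}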

Xen-enabled ARMCI

Fundamental Communication Models in HPC

  • Runtime system - one-sided communication
    • Global Arrays, Rice Co-Array Fortran, GPSHMEM, IBM X10 port under way
  • Portable high performance remote memory copy interface
  • Asynchronous remote memory access (RMA); a put/get sketch follows the diagram below
  • Fast Collective Operations
  • Zero-copy protocols, explicit NIC support
  • “Pure” non-blocking communication - 99.9% overlap
  • Data Locality
    • Shared-memory within SMP node and RMA across nodes
  • High performance delivered on wide range of platforms
    • Multi-protocol and multi-method implementation

[Diagram: examples of data transfers optimized in ARMCI. Message passing (2-sided model): P0 sends buffer A, P1 receives into B. Remote memory access, RMA (1-sided model): P0 puts A directly into B on P1. Shared-memory load/stores (0-sided model): A = B within an SMP node.]
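To make the 1-sided model concrete, here is a minimal ARMCI put sketch, assuming the classic ARMCI C API with MPI used only for process management (an illustrative example, not code from this work; error checks trimmed):

/* Sketch: one-sided ARMCI put between two processes. Process 0 writes into
 * process 1's globally addressable segment without process 1 participating. */
#include <stdio.h>
#include <mpi.h>
#include <armci.h>

#define N 1024

int main(int argc, char **argv)
{
    int me, nproc;
    MPI_Init(&argc, &argv);
    ARMCI_Init();
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Collectively allocate one globally addressable segment per process;
     * ptrs[p] is the address of process p's segment, usable in put/get. */
    double *ptrs[nproc];
    ARMCI_Malloc((void **)ptrs, N * sizeof(double));

    if (me == 0 && nproc > 1) {
        double local[N];
        for (int i = 0; i < N; i++) local[i] = (double)i;

        /* One-sided put: process 1 does not post any matching operation. */
        ARMCI_Put(local, ptrs[1], N * sizeof(double), 1);
        ARMCI_Fence(1);               /* wait for remote completion at process 1 */
    }

    MPI_Barrier(MPI_COMM_WORLD);      /* order the put before the read below */

    if (me == 1)
        printf("element 10 on process 1: %f\n", ptrs[1][10]);

    ARMCI_Free(ptrs[me]);
    ARMCI_Finalize();
    MPI_Finalize();
    return 0;
}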

Global Recovery Lines (GRLs)
  • A GRL is required before each checkpoint / migration
  • A GRL is required for Infiniband networks because
    • IBA does not allow location-independent layer 2 and 3 addresses
    • IBA hardware maintains stateful connections not accessible by software
  • The protocol that enforces a GRL has three phases (a minimal sketch follows this list):
    • A drain phase, which completes any ongoing communication
    • Followed by a global silence, where it is possible to perform node migration
    • And a resume phase, where the processing nodes acquire knowledge of the new network topology
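A compact sketch of those three phases, using real ARMCI/MPI calls for the drain and barrier steps and locally defined stubs (hypothetical placeholders, not an API from this work) for the driver and hypervisor actions:

/* Sketch: drain / global silence / resume phases of a global recovery line.
 * The stub functions below are hypothetical placeholders for the Xen and
 * Infiniband driver actions described in the talk. */
#include <mpi.h>
#include <armci.h>

static void suspend_local_ib_resources(void)       { /* tear down QPs, deregister memory */ }
static void wait_for_checkpoint_or_migration(void) { /* Xen saves or moves the DomU      */ }
static void reestablish_ib_resources(void)         { /* re-register, rebuild connections */ }

void global_recovery_line(void)
{
    /* Drain: complete every outstanding one-sided operation, everywhere. */
    ARMCI_AllFence();
    MPI_Barrier(MPI_COMM_WORLD);

    /* Global silence: nothing is in flight, so the stateful Infiniband
     * connection state can be dropped and the node image saved or migrated. */
    suspend_local_ib_resources();
    wait_for_checkpoint_or_migration();

    /* Resume: nodes learn the new network topology, rebuild connections,
     * and the application continues unmodified. */
    reestablish_ib_resources();
    MPI_Barrier(MPI_COMM_WORLD);
}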
Experimental Evaluation
  • Our experimental testbed is a cluster of eight Dell PowerEdge 1950 servers
  • Each node has two dual-core Intel Xeon (Woodcrest) processors running at 3.0 GHz and 8 GB of memory
  • The cluster is interconnected with Mellanox InfiniHost III 4X HCA adapters
  • SUSE Linux Enterprise Server 10.0
  • Xen 3.0.2
Conclusion
  • We have presented a novel software infrastructure that allows completely transparent checkpoint/restart
  • We have implemented a device driver that enhances the existing Xen/Infiniband drivers
  • Support for PGAS programming models
  • Minimal overhead, on the order of tens of milliseconds
    • Most of the time is spent saving/restoring the node image