OpenMOSIX approach to build scalable HPC farms with an easy management infrastructure

Rosario Esposito (1)

Paolo Mastroserio (1)

Francesco Maria Taurino (1,2)

Gennaro Tortone (1)

(1) INFN - Napoli
(2) INFM - UDR Napoli

CHEP 2003 – La Jolla (San Diego)


Index

  • Introduction

  • OpenMosix overview

  • Farm setup

  • Use cases

  • Conclusions

CHEP 2003 – La Jolla

What makes clusters hard?

Setup (administrator)

  • setting up a 16-node farm by hand is error-prone

Maintenance (administrator)

  • ever tried to update a package on every node in the farm?

Running jobs (users)

  • running a parallel program or a set of sequential programs requires users to figure out which hosts are available and manually assign tasks to the nodes, or to use software tools based on static process allocation (queue managers)


What is OpenMosix?


OpenMosix is an open source enhancement to the Linux kernel that provides adaptive (on-line) load balancing between x86 Linux machines. It uses preemptive process migration to assign and reassign processes among the nodes, making the best use of the available resources.

OpenMosix moves processes around the Linux farm to balance the load, using less loaded machines first.



OpenMosix introduction

Execution environment

  • farm of [diskless] x86-based nodes, both uniprocessor (UP) and SMP, connected by a standard or high-speed LAN

Implementation level

  • Linux kernel (no library to link with sources)

System image model

  • virtual machine with a lot of memory and CPU


  • Process


  • improve the overall (cluster-wide) performance and create a convenient multi-user, time-sharing environment for the execution of both sequential and parallel applications


OpenMosix architecture (1/5)

Network transparency

the interactive user and application-level programs are presented with a virtual machine that looks like a single multiprocessor (MP) machine

Preemptive process migration

any user’s process can migrate, transparently and at any time, to any available node.

The migrating process is divided into two contexts:

  • system context (deputy), which may not be migrated away from the “home” workstation, the unique home node (UHN);

  • user context (remote), which can be migrated to a diskless node;


OpenMosix architecture (2/5)

Preemptive process migration

[Figure: process migration between the master node and a diskless node]


OpenMosix architecture (3/5)

Dynamic load balancing

  • initiates process migrations in order to balance the load of the farm

  • responds to variations in the load of the nodes, the runtime characteristics of the processes, and the number of nodes and their speeds

  • makes continuous attempts to reduce the load differences between pairs of nodes by dynamically migrating processes from nodes with a higher load to nodes with a lower load

  • the policy is symmetrical and decentralized; all of the nodes execute the same algorithm, and the reduction of the load differences is performed independently by any pair of nodes
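The pairwise policy described on this slide can be illustrated with a toy simulation. This is a minimal sketch and not the OpenMosix kernel algorithm: it assumes that load is simply a process count, that all nodes have the same speed, and that one process moves per step. The names `balance_step` and `simulate` are hypothetical.

```python
import random


def balance_step(loads, rng):
    """One decentralized step: two randomly chosen nodes compare their loads
    and one process moves from the busier node to the idler one."""
    a, b = rng.sample(range(len(loads)), 2)
    high, low = (a, b) if loads[a] >= loads[b] else (b, a)
    # Migrate only when moving one process really narrows the gap.
    if loads[high] - loads[low] >= 2:
        loads[high] -= 1
        loads[low] += 1


def simulate(loads, steps=2000, seed=0):
    """Run many independent pairwise steps; no node acts as a coordinator."""
    rng = random.Random(seed)
    loads = list(loads)
    for _ in range(steps):
        balance_step(loads, rng)
    return loads


if __name__ == "__main__":
    start = [16, 0, 0, 0]  # all processes start on one node
    print(start, "->", simulate(start))
```

Every pair runs the same rule and no node holds a global view, which is the sense in which the policy is symmetrical and decentralized.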


OpenMosix architecture (4/5)

Memory sharing

  • places the maximal number of processes in the farm main memory, even if it implies an uneven load distribution among the nodes

  • delays swapping out of pages as much as possible

  • bases the decision of which process to migrate, and where to migrate it, on knowledge of the amount of free memory in other nodes

Efficient kernel communication

  • is specifically developed to reduce the overhead of internal kernel communications (e.g. between a process and its home site when it is executing on a remote site)

  • fast and reliable protocol with low startup latency and high throughput


OpenMosix architecture (5/5)

Probabilistic information dissemination algorithms

  • provide each node with sufficient knowledge about available resources in other nodes, without polling

  • measure the amount of available resources on each node

  • receive the resource indices that each node sends at regular intervals to a randomly chosen subset of nodes

  • the randomly chosen subset of nodes supports dynamic configuration and helps to overcome partial node failures
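A rough sketch of this push-style dissemination follows. It is an illustration under stated assumptions, not the OpenMosix protocol: the `Node` class, the `fanout` parameter, and the choice of load and free memory as the resource indices are all hypothetical.

```python
import random


class Node:
    def __init__(self, node_id, free_memory_mb, load):
        self.node_id = node_id
        self.free_memory_mb = free_memory_mb
        self.load = load
        # Latest indices received from other nodes: {node_id: (load, free_memory_mb)}
        self.known = {}

    def measure(self):
        """Resource indices this node advertises about itself."""
        return (self.load, self.free_memory_mb)


def disseminate(nodes, fanout, rng):
    """One interval: every node pushes its indices to `fanout` random peers.
    Nobody polls, and a failed peer only means one lost message."""
    for sender in nodes:
        peers = [n for n in nodes if n is not sender]
        for receiver in rng.sample(peers, min(fanout, len(peers))):
            receiver.known[sender.node_id] = sender.measure()


def run(num_nodes=8, fanout=2, rounds=20, seed=1):
    rng = random.Random(seed)
    nodes = [Node(i, free_memory_mb=512, load=i) for i in range(num_nodes)]
    for _ in range(rounds):
        disseminate(nodes, fanout, rng)
    return nodes


if __name__ == "__main__":
    for node in run():
        print(node.node_id, "knows about", sorted(node.known))
```

Because the receivers are re-drawn at every interval, a node that joins or leaves changes only who can be picked next, with no membership list to repair.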

Decentralized control and autonomy

  • each node makes its own control decisions independently and there is no master-slave relationship between nodes

  • each node is capable of operating as an independent system; this property allows a dynamic configuration, where nodes may join or leave the farm with minimal disruption


Farm setup: PXE & ClusterNFS

  • diskless nodes

    • low cost

    • no hardware or software to install or upgrade on the diskless client side

    • backups are centralized on a single main server

    • zero administration at diskless client side


Diskless farm setup traditional method (1/2)

Traditional method

  • Server

    • BOOTP server

    • NFS server

    • separate root directory for each client

  • Client

    • BOOTP to obtain IP

    • TFTP to load “tagged kernel” image

    • rootNFS to load root filesystem


Diskless farm setup traditional method (2/2)

Traditional method – Problems

separate root directory structure for each node

  • hard to set up

    • lots of directories with slightly different contents

  • difficult to maintain

    • changes must be propagated to each directory




ClusterNFS (cNFS) is a patch to the standard Universal NFS server code that parses each file request to determine an appropriate match on the server.

For example, when client machine foo2 asks for the file /etc/hostname, it gets the contents of /etc/hostname$$HOST=foo2$$.



ClusterNFS features

ClusterNFS allows all machines (including the server) to share the root filesystem

  • all files are shared by default

  • files for all clients are named filename$$CLIENT$$

  • files for a specific client are named filename$$HOST=hostname$$ or filename$$IP=address$$
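The naming scheme can be pictured as a lookup that tries the most specific name first. The sketch below is only a model of that idea, not ClusterNFS code: the function `resolve` is hypothetical, the server's files are represented as a plain set of names, and the precedence order (host-specific, then all clients, then shared) is an assumption.

```python
def resolve(requested, client_host, server_files):
    """Return the name of the file the server would serve for `requested`,
    or None when no candidate exists."""
    candidates = [
        f"{requested}$$HOST={client_host}$$",  # file for one specific client
        f"{requested}$$CLIENT$$",              # file for all clients
        requested,                             # shared by default
    ]
    for name in candidates:
        if name in server_files:
            return name
    return None


if __name__ == "__main__":
    files = {
        "/etc/hostname",
        "/etc/hostname$$HOST=foo2$$",
        "/etc/fstab",
        "/etc/fstab$$CLIENT$$",
    }
    for host in ("foo2", "foo3"):
        print(host, "/etc/hostname ->", resolve("/etc/hostname", host, files))
    print("foo2 /etc/fstab ->", resolve("/etc/fstab", "foo2", files))
```

Only the files that must differ need a tagged copy; every other path falls through to the single shared version.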


Diskless farm setup with ClusterNFS (1/2)

ClusterNFS method

  • Server

    • DHCP and TFTP server

    • ClusterNFS server

    • single root directory for server and clients

  • Clients

    • DHCP to obtain IP

    • TFTP to load PXE boot loader and then kernel image

    • rootNFS to load root filesystem


Diskless farm setup with ClusterNFS (2/2)

ClusterNFS method – Advantages

  • easy to set up

    • just copy (or create) the files that need to be different

  • easy to maintain

    • changes to shared files are global

    • easy to add nodes

      A node can be added to a running farm in 1 minute


VIRGO experiment (Jun 2001) (1/4)

VIRGO is a collaboration between Italian and French research teams for the realization of an interferometric gravitational wave detector.

The main goal of the VIRGO project is the first direct detection of gravitational waves emitted by astrophysical sources.

Interferometric gravitational wave detectors produce a large amount of “raw” data that requires significant computing power to analyse.

To satisfy this requirement, we decided to build a Linux cluster running MOSIX (and now OpenMosix).


VIRGO experiment (Jun 2001) (2/4)


Farm nodes: SuperMicro 6010H

  • Dual Pentium III 1 GHz

  • RAM: 512 Mbyte

  • HD: 18 Gbyte

  • 2 Fast Ethernet interfaces

  • 1 Gbit Ethernet interface (only on master node)

Storage: Alpha Server 4100

  • HD: 144 GB

VIRGO experiment (Jun 2001) (3/4)

The Linux farm has been heavily tested by running intensive data analysis procedures based on the Matched Filter algorithm, one of the best ways to search for known waveforms within a signal affected by background noise.

Matched Filter analysis has a high computational cost because the method consists of an exhaustive comparison between the source signal and a set of known waveforms, called “templates”, to find possible matches. Using a larger number of templates improves the identification of known signals, but requires a great number of floating-point operations.
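The exhaustive comparison can be sketched in a few lines of plain Python. This is a toy time-domain version for illustration and not the VIRGO analysis code: the functions `correlate_at` and `best_match`, the template names, and the sample values are hypothetical.

```python
def correlate_at(signal, template, offset):
    """Inner product of the template with the signal window starting at `offset`."""
    return sum(signal[offset + i] * template[i] for i in range(len(template)))


def best_match(signal, templates):
    """Slide every template over the whole signal and return
    (score, template_name, offset) for the strongest normalized match."""
    best = None
    for name, template in templates.items():
        norm = sum(t * t for t in template) ** 0.5
        for offset in range(len(signal) - len(template) + 1):
            score = correlate_at(signal, template, offset) / norm
            if best is None or score > best[0]:
                best = (score, name, offset)
    return best


if __name__ == "__main__":
    # A made-up waveform hidden in an otherwise flat signal.
    signal = [0] * 5 + [1, -1, 2, -2] + [0] * 5
    templates = {"chirp": [1, -1, 2, -2], "step": [1, 1, 1, 1]}
    print(best_match(signal, templates))
```

The work grows with (number of templates) x (number of offsets) x (template length), and each template can be scanned independently of the others, so the search splits naturally into many separate processes.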

Running Matched Filter test procedures on the OpenMosix cluster has shown a progressive reduction of execution times, due to the high scalability of the computing nodes and an efficient dynamic load distribution.


VIRGO experiment (Jun 2001) (4/4)

[Figure: speed-up of repeated Matched Filter executions]

The increase in computing speed with respect to the number of processors does not follow an exactly linear curve; this is mainly due to the growth of the communication time spent by the computing nodes transmitting data over the local area network.
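A toy model shows why communication time bends the curve away from linear. It is not fitted to the measurements in the figure: the function `speedup`, the assumption that communication time grows linearly with the node count, and the numbers used are all hypothetical.

```python
def speedup(nodes, compute_time, comm_time_per_node):
    """Speed-up of a job whose compute part divides evenly across the nodes
    while communication time grows with each additional node."""
    parallel_time = compute_time / nodes + comm_time_per_node * (nodes - 1)
    return compute_time / parallel_time


if __name__ == "__main__":
    # Hypothetical job: 1000 s of computation, 1 s of extra traffic per added node.
    for n in (1, 2, 4, 8, 16):
        print(n, "nodes ->", round(speedup(n, 1000.0, 1.0), 2))
```

With no communication cost the function returns exactly the node count; any positive cost pushes the result below that line, and the gap widens as nodes are added.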


ARGO experiment (Jan 2002) (1/3)

The aim of the ARGO-YBJ experiment is to study cosmic rays, mainly cosmic gamma-radiation, at an energy threshold of ~100 GeV, by means of the detection of small size air showers.

This goal will be achieved by operating a full coverage array in the Yangbajing Laboratory (Tibet, P.R. China) at 4300 m a.s.l.

As we have seen for the Virgo experiment, the analysis of data produced by Argo requires a significant amount of computing power. To satisfy this requirement we decided to implement an OpenMOSIX cluster.


ARGO experiment (Jan 2002) (2/3)

  • currently Argo researchers are using a small Linux farm, located in Naples, consisting of:

    • 5 machines (dual 1 GHz Pentium III with 1 Gbyte RAM) running RedHat 7.2 + openmosix 2.4.13.

    • 1 file server with 1 Tbyte of disk space


ARGO experiment (Jan 2002) (3/3)

At this time the Argo OpenMOSIX farm is mainly used to run Monte Carlo simulations using “Corsika”, a Fortran application developed to simulate and analyse extensive air showers.

The farm is also used to run other applications such as GEANT to simulate the behaviour of the Argo detector.

The OpenMOSIX farm is responding very well to the researchers’ computing requirements, and we have already decided to upgrade the cluster in the near future by adding more computing nodes and starting the analysis of real data produced by Argo.

So far, ARGO researchers in Naples have produced ~400 Gbytes of simulated data with this OpenMOSIX cluster.


Conclusions (1/2)

  • the most noticeable features of OpenMOSIX are its load-balancing and process migration algorithms, which mean that users need not know the current state of the nodes

  • this is most useful in time-sharing, multi-user environments, where users do not have the means to follow (and usually are not interested in) the status of the nodes (e.g. their load)

  • parallel applications can be executed by forking many processes, just like on an SMP machine, while OpenMOSIX continuously attempts to optimize the resource allocation
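The fork-based style mentioned in the last point might look like the sketch below. It uses only standard POSIX calls and contains nothing cluster-specific; the function names and the workload are hypothetical, and whether any given child is actually migrated is left to the kernel.

```python
import os


def cpu_bound(n):
    """Stand-in for one independent unit of work (e.g. one simulation run)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_workers(count, work_size):
    """Fork `count` children that each do independent CPU-bound work,
    then wait for all of them and return their exit statuses."""
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            cpu_bound(work_size)
            os._exit(0)  # the child stops here and never returns to the loop
        pids.append(pid)
    return [os.waitpid(pid, 0)[1] for pid in pids]


if __name__ == "__main__":
    print(run_workers(count=4, work_size=100_000))
```

The program names no hosts and links against no cluster library; it is written exactly as it would be for a single SMP machine.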


Conclusions (2/2)

  • Building up farms with the “OpenMosix+ClusterNFS” approach requires no more than 2 hours

  • With this approach management of a farm = management of a single server

  • This solution has proven to be scalable in farms up to 32 nodes
