
OpenMOSIX approach to build scalable HPC farms with an easy management infrastructure


Presentation Transcript


  1. OpenMOSIX approach to build scalable HPC farms with an easy management infrastructure Rosario Esposito1 Paolo Mastroserio1 Francesco Maria Taurino1,2 Gennaro Tortone1 INFN - Napoli1 INFM - UDR Napoli2 CHEP 2003 – La Jolla (San Diego)

  2. Index • Introduction • OpenMosix overview • Farm setup • Use cases • Conclusions CHEP 2003 – La Jolla

  3. What makes clusters hard ? Setup (administrator) • setting up a 16 node farm by hand is prone to errors Maintenance (administrator) • ever tried to update a package on every node in the farm? Running jobs (users) • running a parallel program or set of sequential programs requires the users to figure out which hosts are available and manually assign tasks to the nodes, or use software tools based on static process allocation (queue managers) CHEP 2003 – La Jolla

  4. What is OpenMosix ? Description OpenMosix is an open-source enhancement to the Linux kernel providing adaptive (on-line) load balancing between x86 Linux machines. It uses preemptive process migration to assign and reassign processes among the nodes to take the best advantage of the available resources. OpenMosix moves processes around the Linux farm to balance the load, using less loaded machines first. URL http://www.openmosix.org CHEP 2003 – La Jolla

  5. OpenMosix introduction Execution environment • farm of [diskless] x86-based nodes, both UP and SMP, connected by a standard or high-speed LAN Implementation level • Linux kernel (no library to link with sources) System image model • virtual machine with a lot of memory and CPU Granularity • Process Goal • improve the overall (cluster-wide) performance and create a convenient multi-user, time-sharing environment for the execution of both sequential and parallel applications CHEP 2003 – La Jolla

  6. OpenMosix architecture (1/5) Network transparency the interactive user and application-level programs are provided with a virtual machine that looks like a single MP machine Preemptive process migration any user’s process, transparently and at any time, can migrate to any available node. The migrating process is divided into two contexts: • system context (deputy) that may not be migrated from the unique home node (UHN); • user context (remote) that can be migrated, even to a diskless node; CHEP 2003 – La Jolla
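
To make the deputy/remote split concrete, here is a minimal Python sketch (illustrative only, not OpenMosix kernel code; all class, node and file names are invented): user-level work runs wherever the process has migrated, while site-dependent system calls are forwarded back to the deputy that stays on the home node.

# Conceptual sketch only -- not OpenMosix code. It illustrates the
# deputy/remote split described on the slide above.

class Deputy:
    """Stays on the home node; owns site-dependent resources (e.g. open files)."""
    def __init__(self, home_node):
        self.home_node = home_node

    def do_syscall(self, name, *args):
        # In the real system this runs in kernel context on the UHN.
        return f"{name}{args} executed on home node {self.home_node}"

class Remote:
    """User context; may run on any node, possibly diskless."""
    def __init__(self, deputy, current_node):
        self.deputy = deputy
        self.node = current_node

    def compute(self, x):
        return x * x                      # pure user-level work, runs locally

    def open_file(self, path):
        # Site-dependent call: forwarded over the network to the deputy.
        return self.deputy.do_syscall("open", path)

proc = Remote(Deputy("node01"), current_node="node07")
print(proc.compute(21))
print(proc.open_file("/data/run42.dat"))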

  7. OpenMosix architecture (2/5) Preemptive process migration [diagram: a user process migrating from the master node to a diskless node] CHEP 2003 – La Jolla

  8. OpenMosix architecture (3/5) Dynamic load balancing • initiates process migrations in order to balance the load of the farm • responds to variations in the load of the nodes, the runtime characteristics of the processes, and the number of nodes and their speeds • makes continuous attempts to reduce the load differences between pairs of nodes by dynamically migrating processes from nodes with a higher load to nodes with a lower load • the policy is symmetrical and decentralized; all of the nodes execute the same algorithm and the reduction of the load differences is performed independently by each pair of nodes CHEP 2003 – La Jolla
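
A toy Python sketch of the pairwise load-difference reduction described above (an illustration of the idea, not the actual OpenMosix algorithm; node names, loads and the threshold are made up):

# Each node repeatedly compares its load with one randomly chosen partner
# and migrates a process toward the less loaded side when the difference
# exceeds a threshold -- symmetric and decentralized, as on the slide.

import random

def balance_pair(loads, procs, a, b, threshold=1):
    """Reduce the load difference between nodes a and b by migrating one process."""
    if abs(loads[a] - loads[b]) <= threshold:
        return
    src, dst = (a, b) if loads[a] > loads[b] else (b, a)
    if procs[src]:
        pid = procs[src].pop()
        procs[dst].append(pid)
        loads[src] -= 1
        loads[dst] += 1
        print(f"migrate pid {pid}: {src} -> {dst}")

# toy farm: per-node load (runnable processes) and process lists
loads = {"node01": 5, "node02": 1, "node03": 3}
procs = {"node01": [101, 102, 103, 104, 105], "node02": [201], "node03": [301, 302, 303]}

# every node independently repeats this with randomly chosen partners
for _ in range(4):
    a, b = random.sample(list(loads), 2)
    balance_pair(loads, procs, a, b)
print(loads)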

  9. OpenMosix architecture (4/5) Memory sharing • places the maximal number of processes in the farm's main memory, even if this implies an uneven load distribution among the nodes • delays the swapping out of pages as much as possible • bases the decision of which process to migrate, and where to migrate it, on the knowledge of the amount of free memory in other nodes Efficient kernel communication • is specifically developed to reduce the overhead of internal kernel communications (e.g. between the process and its home site, when it is executing on a remote site) • fast and reliable protocol with low startup latency and high throughput CHEP 2003 – La Jolla
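
A minimal sketch of the memory-driven placement idea, assuming a simple "pick the node with the most free memory" rule (illustrative only; the real decision logic combines this with load and other criteria):

# Hedged sketch: when a node is about to swap, prefer migrating a process to
# the node with the most free memory, even if CPU load becomes uneven.
# Node names and sizes are illustrative.

def pick_memory_target(free_mem_mb, needed_mb):
    """Return the node with enough free memory for the process, or None."""
    candidates = {n: m for n, m in free_mem_mb.items() if m >= needed_mb}
    if not candidates:
        return None                      # nowhere to go: the process will swap locally
    return max(candidates, key=candidates.get)

free_mem_mb = {"node01": 40, "node02": 350, "node03": 120}
print(pick_memory_target(free_mem_mb, needed_mb=200))   # -> node02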

  10. OpenMosix architecture (5/5) Probabilistic information dissemination algorithms • provide each node with sufficient knowledge about available resources in other nodes, without polling • measure the amount of available resources on each node • receive the resource indices that each node sends at regular intervals to a randomly chosen subset of nodes • randomly chosen subsets of nodes are used to support dynamic configuration and to overcome partial node failures Decentralized control and autonomy • each node makes its own control decisions independently and there is no master-slave relationship between nodes • each node is capable of operating as an independent system; this property allows a dynamic configuration, where nodes may join or leave the farm with minimal disruption CHEP 2003 – La Jolla
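
The gossip-style dissemination can be sketched as follows (illustrative Python, not the real protocol; node names and the subset size are arbitrary): at regular intervals each node pushes its own resource index to a small random subset of peers, so every node keeps an approximate, regularly refreshed view of the farm without polling anyone.

import random

NODES = ["node01", "node02", "node03", "node04", "node05"]
SUBSET_SIZE = 2

# each node's partial view: node -> {other_node: last received load index}
views = {n: {} for n in NODES}

def disseminate(own_load):
    """One round: every node pushes its load index to a random subset of peers."""
    for sender in NODES:
        for receiver in random.sample([n for n in NODES if n != sender], SUBSET_SIZE):
            views[receiver][sender] = own_load[sender]

own_load = {"node01": 3, "node02": 0, "node03": 5, "node04": 1, "node05": 2}
disseminate(own_load)
print(views["node01"])    # partial, randomly refreshed knowledge of other nodes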

  11. Farm setup: PXE & ClusterNFS • diskless nodes • low cost • eliminates installation/upgrade of hardware and software on the diskless client side • backups are centralized on a single main server • zero administration on the diskless client side CHEP 2003 – La Jolla

  12. Diskless farm setup traditional method (1/2) Traditional method • Server • BOOTP server • NFS server • separate root directory for each client • Client • BOOTP to obtain IP • TFTP to load “tagged kernel” image • rootNFS to load root filesystem CHEP 2003 – La Jolla

  13. Diskless farm setup traditional method (2/2) Traditional method – Problems separate root directory structure for each node • hard to set up • lots of directories with slightly different contents • difficult to maintain • changes must be propagated to each directory CHEP 2003 – La Jolla

  14. ClusterNFS Description cNFS is a patch to the standard Universal-NFS server code that “parses” file requests to determine an appropriate match on the server Example when client machine foo2 asks for file /etc/hostname it gets the contents of /etc/hostname$$HOST=foo2$$ URL https://sourceforge.net/projects/clusternfs CHEP 2003 – La Jolla

  15. ClusterNFS features ClusterNFS allows all machines (including the server) to share the root filesystem • all files are shared by default • files for all clients are named filename$$CLIENT$$ • files for a specific client are named filename$$IP=xxx.xxx.xxx.xxx$$ or filename$$HOST=host.domain.com$$ CHEP 2003 – La Jolla
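
A simplified Python sketch of the filename matching a ClusterNFS server performs (not the cNFS source; the exact precedence between the $$HOST$$, $$IP$$ and $$CLIENT$$ variants is assumed here): for each requested path, a client-specific variant is preferred over the generic one, falling back to the plain shared file.

import os

def resolve(path, client_host, client_ip, exists=os.path.exists):
    # Assumed lookup order: host-specific, IP-specific, generic client, plain file.
    candidates = [
        f"{path}$$HOST={client_host}$$",
        f"{path}$$IP={client_ip}$$",
        f"{path}$$CLIENT$$",
        path,
    ]
    for candidate in candidates:
        if exists(candidate):
            return candidate
    return None

# e.g. foo2 asking for /etc/hostname gets /etc/hostname$$HOST=foo2$$
fake_fs = {"/etc/hostname", "/etc/hostname$$HOST=foo2$$"}
print(resolve("/etc/hostname", "foo2", "192.168.0.2", exists=fake_fs.__contains__))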

  16. Diskless farm setup with ClusterNFS (1/2) ClusterNFS method • Server • DHCP and TFTP server • ClusterNFS server • single root directory for server and clients • Clients • DHCP to obtain IP • TFTP to load PXE boot loader and then kernel image • rootNFS to load root filesystem CHEP 2003 – La Jolla

  17. Diskless farm setup with ClusterNFS (2/2) ClusterNFS method – Advantages • easy to set up • just copy (or create) the files that need to be different • easy to maintain • changes to shared files are global • easy to add nodes A node can be added to a running farm in 1 minute CHEP 2003 – La Jolla

  18. VIRGO experiment (Jun 2001) (1/4) VIRGO is a collaboration between Italian and French research teams for the realization of an interferometric gravitational-wave detector. The main goal of the VIRGO project is the first direct detection of gravitational waves emitted by astrophysical sources. Interferometric gravitational-wave detectors produce a large amount of “raw” data that require significant computing power to be analysed. To satisfy such a strong requirement of computing power we decided to build a Linux cluster running MOSIX (and now OpenMosix) CHEP 2003 – La Jolla

  19. VIRGO experiment (Jun 2001) (2/4) Hardware Farm nodes: SuperMicro 6010H - dual Pentium III 1 GHz - RAM: 512 MB - HD: 18 GB - 2 Fast Ethernet interfaces - 1 Gbit Ethernet interface (only on master node) Storage: AlphaServer 4100 - HD: 144 GB

  20. VIRGO experiment (Jun 2001) (3/4) The Linux farm has been extensively tested by executing intensive data analysis procedures based on the Matched Filter algorithm, one of the best ways to search for known waveforms within a signal affected by background noise. Matched Filter analysis has a high computational cost, as the method consists of an exhaustive comparison between the source signal and a set of known waveforms, called “templates”, to find possible matches. Using a larger number of templates improves the identification of known signals, but a greater amount of floating-point operations has to be performed. Running Matched Filter test procedures on the OpenMosix cluster has shown a progressive reduction of execution times, due to the high scalability of the computing nodes and an efficient dynamic load distribution. CHEP 2003 – La Jolla
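
As an illustration of the method only (not the VIRGO analysis code), a minimal matched-filter sketch in Python/NumPy: the noisy signal is correlated against a bank of templates and the strongest normalized correlation wins. The cost grows with the number of templates, which is the work the farm parallelizes. Template shapes, sizes and noise level below are invented.

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1024)

# toy template bank: sinusoids at different frequencies (stand-ins for waveforms)
templates = [np.sin(2 * np.pi * f * t) for f in (10, 20, 30, 40)]

# toy "raw" signal: one of the templates buried in background noise
signal = templates[2] + rng.normal(scale=2.0, size=t.size)

def matched_filter_score(signal, template):
    template = template / np.linalg.norm(template)
    return np.abs(np.dot(signal, template))

scores = [matched_filter_score(signal, tpl) for tpl in templates]
print("best matching template index:", int(np.argmax(scores)))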

  21. VIRGO experiment (Jun 2001) (4/4) [plot: speed-up of repeated Matched Filter executions] The increase in computing speed with respect to the number of processors does not follow an exactly linear curve; this is mainly due to the growth of communication time spent by the computing nodes to transmit data over the local area network. CHEP 2003 – La Jolla

  22. ARGO experiment (Jan 2002) (1/3) The aim of the ARGO-YBJ experiment is to study cosmic rays, mainly cosmic gamma-radiation, at an energy threshold of ~100 GeV, by means of the detection of small-size air showers. This goal will be achieved by operating a full-coverage array in the Yangbajing Laboratory (Tibet, P.R. China) at 4300 m a.s.l. As we have seen for the Virgo experiment, the analysis of data produced by Argo requires a significant amount of computing power. To satisfy this requirement we decided to implement an OpenMOSIX cluster. CHEP 2003 – La Jolla

  23. ARGO experiment (Jan 2002) (2/3) • currently Argo researchers are using a small Linux farm, located in Naples, consisting of: • 5 machines (dual 1 GHz Pentium III with 1 GB RAM) running RedHat 7.2 + OpenMosix 2.4.13 • 1 file server with 1 TB of disk space CHEP 2003 – La Jolla

  24. ARGO experiment (Jan 2002) (3/3) At this time the Argo OpenMOSIX farm is mainly used to run Monte Carlo simulations using “Corsika”, a Fortran application developed to simulate and analyse extensive air showers. The farm is also used to run other applications, such as GEANT, to simulate the behaviour of the Argo detector. The OpenMOSIX farm is responding very well to the researchers’ computing requirements and we have already decided to upgrade the cluster in the near future, adding more computing nodes and starting the analysis of real data produced by Argo. Currently ARGO researchers in Naples have produced ~400 GB of simulated data with this OpenMOSIX cluster CHEP 2003 – La Jolla

  25. Conclusions (1/2) • the most noticeable features of OpenMOSIX are its load-balancing and process-migration algorithms, which imply that users need not have knowledge of the current state of the nodes • this is most useful in time-sharing, multi-user environments, where users do not have the means (and usually are not interested) to track the status (e.g. the load) of the nodes • parallel applications can be executed by forking many processes, just as on an SMP machine, while OpenMOSIX continuously attempts to optimize the resource allocation CHEP 2003 – La Jolla

  26. Conclusions (2/2) • Building up farms with the “OpenMosix + ClusterNFS” approach requires no more than 2 hours • With this approach, management of a farm = management of a single server • This solution has proven to be scalable in farms of up to 32 nodes CHEP 2003 – La Jolla
