Experiences with running MATLAB jobs on a power-saving Condor Pool

Ian C. Smith

Experiences with running MATLAB jobs on a power-saving Condor Pool. Ian C. Smith. University of Liverpool Condor Pool. Contains around 300 machines running the University’s Managed Windows (XP) Service.


University of Liverpool Condor Pool

  • Contains around 300 machines running the University’s Managed Windows (XP) Service.

  • Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots / machine.

  • Software updates via a weekly re-imaging process.

  • Single combined submit host / central manager running on Sun V440 SMP server.

  • Restricted access to submit host for registered Condor users.

  • Currently running Condor 7.0.2 (moving to 7.2.x soon).

  • Policy during office hours is to run jobs only after at least 10 minutes of inactivity and with a low load average; outside office hours jobs can run at any time.
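
The run policy in the last bullet can be sketched as a small predicate. This is only an illustration of the rule as worded on the slide, not the pool's actual Condor configuration: the `MachineState` type, the `may_run_job` name and the 0.3 load threshold are assumptions (the slide only says "low load average").

```python
from dataclasses import dataclass

MIN_IDLE_MINUTES = 10    # from the slide
MAX_LOAD_AVERAGE = 0.3   # hypothetical value; the slide only says "low"


@dataclass
class MachineState:
    idle_minutes: float   # time since the last keyboard/mouse activity
    load_average: float   # load not caused by Condor jobs
    office_hours: bool    # True while office hours are in effect


def may_run_job(state: MachineState) -> bool:
    """Return True if the slide's policy would allow a job to start."""
    if not state.office_hours:
        # Outside office hours jobs may run at any time.
        return True
    # During office hours both conditions must hold.
    return (state.idle_minutes >= MIN_IDLE_MINUTES
            and state.load_average <= MAX_LOAD_AVERAGE)
```

For example, a machine that has been idle for 12 minutes with a load of 0.1 during office hours would accept a job, while one idle for only 5 minutes would not.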

MATLAB advantages

  • Originally developed for linear algebra algorithm development but now contains many built-in functions geared to different disciplines, divided into toolboxes.

  • Intuitive interactive environment allows rapid code development.

  • Simple but powerful file I/O: save <filename>, load <filename> (useful for checkpointing).

  • Allows users to create their own functions stored as M-files.

  • “Standalone” applications can be built from M-files:

    • can run on platforms without MATLAB installed

    • do not need a licence to be able to run

    • can include all toolbox functions

  • APIs available for FORTRAN and C codes (“MEX files”)

MATLAB disadvantages

  • Even standalone applications can run slower than equivalent C or FORTRAN implementations.

  • Standalone applications aren’t quite what they may seem:

    • more than just an .exe – several files need to be packaged and deployed

    • need access to MATLAB run-time libraries usually via MATLAB Component Runtime (150 MB self-extracting .exe)

    • luckily we have MATLAB pre-installed on all PCs in Condor pool (originally used a network drive)

  • Run-time errors can be difficult to trace when MATLAB jobs are run under Condor:

    • need to run under Condor on local PC

    • configure with USE_VISIBLE_DESKTOP=True to see pop-up messages

  • Jobs submitted in a UNIX environment but code developed under Windows.

Minor MATLAB irritations

  • Output files occasionally go missing:

    • specify all required files using transfer_output_files

    • identify problem jobs with condor_q -held

    • resubmit with condor_release -all

  • Jobs sometimes run “forever”:

    • use condor_vacate to move job to another machine

    • less of a problem during term time as jobs usually get evicted by logins

  • Difficult to reproduce these problems:

    • happen quite rarely (fewer than about 1 in 1000 jobs)

    • many jobs based on stochastic methods

MATLAB Research Applications

  • Predicting the spread of avian influenza outbreaks in poultry flocks (Veterinary Clinical Science).

  • Modelling of E. coli propagation in dairy cattle (Veterinary Clinical Science).

  • Testing of parallel genetic algorithms in a complex classification system (Electrical Engineering and Electronics).

  • Simulation of the infection of a bacterial cell by a virus (Mathematical Sciences).

  • Modelling the effects of radiotherapy on normal tissue using 3D voxel arrays (Medical Imaging and Radiotherapy).

Power-saving at Liverpool

  • Have around 2 000 centrally managed PCs across campus which used to stay powered up overnight, at weekends and during vacations.

  • Original power-saving policy was to power off machines after 30 minutes of inactivity; now they hibernate after 10 minutes of inactivity.

  • Policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a.

  • Makes extensive use of PowerMAN system from Data Synergy comprising:

    • service which forces machines into a low-power state and reports machine activity to Management Reporting Platform

    • Management Reporting Platform - central server from where usage stats can be retrieved and viewed via a web browser
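
The savings figures above can be cross-checked with a little arithmetic. The sketch below only rearranges the numbers quoted on the slide; the per-machine power draw and the electricity price it derives are implied values, not figures stated in the presentation, and the 52-week year is an assumption.

```python
# Figures quoted on the slide (low and high ends of each range).
HOURS_SAVED_PER_WEEK = (200_000, 250_000)
ENERGY_SAVED_MWH_PER_WEEK = (20, 25)
ANNUAL_SAVING_GBP = 125_000
WEEKS_PER_YEAR = 52  # assumption: savings accrue evenly all year


def implied_watts(hours: float, mwh: float) -> float:
    """Average power draw per idle machine implied by the two ranges."""
    return mwh * 1_000_000 / hours  # MWh -> Wh, divided by machine-hours


def implied_price_per_kwh(mwh_per_week: float) -> float:
    """Electricity price (GBP/kWh) implied by the annual saving."""
    kwh_per_year = mwh_per_week * 1_000 * WEEKS_PER_YEAR
    return ANNUAL_SAVING_GBP / kwh_per_year


for hours, mwh in zip(HOURS_SAVED_PER_WEEK, ENERGY_SAVED_MWH_PER_WEEK):
    print(f"{hours} h/week, {mwh} MWh/week: "
          f"{implied_watts(hours, mwh):.0f} W per machine, "
          f"{implied_price_per_kwh(mwh):.3f} GBP/kWh")
```

Both ends of the range imply about 100 W per idle machine, and the annual saving corresponds to an electricity price of roughly 10p to 12p per kWh.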

Adapting Condor for use with power-saving PCs

  • Two main problems:

    • how to ensure Condor jobs are not evicted by hibernating/powered-off PCs

    • how to wake up dormant PCs to run Condor jobs on-demand

  • Originally used Microsoft system service to power-down PCs after 30 min inactivity:

    • runs .bat file which checks if a user is logged in and shuts machine down if not

    • doesn’t detect owner of Condor job as a logged-in user

    • need to check for presence of condor_exe.bat

  • PowerMAN service now prevents job eviction:

    • can provide PowerMAN with a list of “protected programs”

    • ensures that system remains active if a protected program is running

    • include condor_starter process as a protected program (only present while a Condor job is running).

Adapting Condor for use with power-saving PCs

  • Wake-on-LAN (“WoL”) used to bring hibernating machines back to full power:

    • NICs must remain powered up during hibernation/power-off

    • NICs must be capable of waking machines on receipt of a “magic packet”

    • network must be able to route “magic packets”

  • cron runs on the submit host which examines state of queue (condor_q) and pool (condor_status):

    • if more idle jobs in queue than Unclaimed machines then need to wake up hibernating machines

    • find number of powered-up machines in each “teaching centre” (classroom)

    • estimate the number of hibernating machines in each teaching centre from total number of machines in each

    • sort centres from highest number of available machines to lowest

    • wake up centres in turn until sufficient machines woken to meet the demand (or all centres woken up)

    • MAC addresses of machines are stored in files sorted according to teaching centre (needed for Wake-on-LAN)
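
The wake-up steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the cron script used at Liverpool: the function names are hypothetical, "available machines" is read as the estimated number of hibernating machines in a centre, each machine is counted as one job slot, and the broadcast address and UDP port 9 are common Wake-on-LAN conventions rather than details from the slides. The magic packet layout (six 0xFF bytes followed by the MAC address repeated 16 times) is the standard one.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC such as '00:11:22:33:44:55'."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    return b"\xff" * 6 + raw * 16


def send_magic_packet(mac: str, address: str = "255.255.255.255",
                      port: int = 9) -> None:
    """Broadcast the packet over UDP (address and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (address, port))


def centres_to_wake(idle_jobs: int, unclaimed: int,
                    centres: dict[str, tuple[int, int]]) -> list[str]:
    """Pick teaching centres to wake.

    centres maps a centre name to (total machines, powered-up machines).
    """
    shortfall = idle_jobs - unclaimed
    if shortfall <= 0:
        return []  # enough Unclaimed machines already
    # Estimate hibernating machines as total minus powered-up.
    hibernating = {name: max(total - up, 0)
                   for name, (total, up) in centres.items()}
    ranked = sorted(hibernating, key=hibernating.get, reverse=True)
    chosen, woken = [], 0
    for name in ranked:
        if woken >= shortfall:
            break
        if hibernating[name] == 0:
            continue  # nothing to wake in this centre
        chosen.append(name)
        woken += hibernating[name]
    return chosen
```

A caller would then read the MAC address file for each chosen centre and call `send_magic_packet` once per address. Because whole centres are woken at a time, this approach can wake more machines than the shortfall strictly requires.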

Automatic wake up issues

  • Assumes that any job can run on any machine:

    • users cannot choose particular teaching centres or machines in their job Requirements

    • ideally, pool needs to be homogeneous

    • errors in Requirements specification can cause severe problems (machines repeatedly wake up then hibernate)

    • cron now includes a “sanity check” for this

  • Large clusters of jobs can cause condor scheduler to become overloaded:

    • condor_q times out so cron cannot determine queue state

    • only a transient problem – load eventually drops off and condor_q responds again

  • Can only estimate number of hibernating machines in each centre

  • May wake up more machines than needed

Recent and Future Developments

  • Recently moved to a policy of hibernating machines after 10 minutes of inactivity

    • submit host / central manager needs to work harder to get jobs running before recently woken machines go back to hibernation

    • move execute hosts from Owner to Unclaimed state after just 5 minutes idle

    • update activity timer every 1 minute (default is 5 minutes)

    • increase number of scheduler and negotiator cycles using SCHEDD_INTERVAL=60, NEGOTIATOR_INTERVAL=60

    • around 25% of machines still hibernate after first wakeup

    • see a ramp up in machines running Condor jobs over about an hour

    • little impact on Condor users

    • energy wastage offset by savings with user logouts
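
The timing pressure described above can be illustrated with a rough calculation. This is a deliberately simplified model, not a description of Condor's internals: it assumes the 5-minute Owner-to-Unclaimed timer and the 10-minute hibernation timer both count from the same moment of last activity, and the 300-second interval in the comparison is a hypothetical longer setting chosen for contrast.

```python
def negotiation_cycles_in_window(hibernate_after_min: int,
                                 unclaimed_after_min: int,
                                 negotiator_interval_s: int) -> int:
    """Whole negotiation cycles that fit between a machine becoming
    Unclaimed and the same machine hibernating again."""
    window_s = (hibernate_after_min - unclaimed_after_min) * 60
    return max(window_s, 0) // negotiator_interval_s


# Values from the slide: hibernate after 10 min, Unclaimed after 5 min,
# NEGOTIATOR_INTERVAL=60.
tuned = negotiation_cycles_in_window(10, 5, 60)

# Hypothetical longer interval, for comparison only.
slower = negotiation_cycles_in_window(10, 5, 300)

print(f"cycles with a 60 s interval: {tuned}")
print(f"cycles with a 300 s interval: {slower}")
```

Under these assumptions a 60-second interval gives five chances to match a job to a woken machine before it hibernates again, against a single chance with a 300-second interval, which is consistent with some machines still hibernating after the first wakeup.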

Recent and Future Developments

  • Migrating to Condor 7.2 shortly

    • Has some interesting power-management features

    • Automatic power-down on execute hosts could provide a useful “safety net” but PowerMAN likely to remain primary power management tool

    • Can retain records of ClassAds of machines in low-power state

      • could be useful in matchmaking jobs to powered-down machines

      • matchmaking logic already in Condor

      • nice if Condor could use this to provide a list of machines to wake-up on demand

      • ... and wake them up with condor_wakeup ?

      • would like to ensure that powered-down machines are still out there (not broken, permanently turned off, not listening etc)

      • also useful to see powered-off machines represented in condor_status output

  • Couple of extra “wishes”

    • allow jobs to claim all slots on a machine (useful if they have large memory requirements)

    • provide a “logged-in user” machine ClassAd attribute

Further Information


[email protected]