Experiences with running MATLAB jobs on a power-saving Condor Pool
Ian C. Smith

University of Liverpool Condor Pool
  • Contains around 300 machines running the University’s Managed Windows (XP) Service.
  • Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots / machine.
  • Software updates via a weekly re-imaging process.
  • Single combined submit host / central manager running on Sun V440 SMP server.
  • Restricted access to submit host for registered Condor users.
  • Currently running Condor 7.0.2 (moving to 7.2.x soon).
  • Policy is to run jobs during office hours only after at least 10 minutes of inactivity and with a low load average, and at any time outside office hours.
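
The job-start policy in the last bullet can be sketched as a small decision function. This is only an illustration of the rule as stated on the slide, not the pool's actual Condor configuration: the 10-minute idle threshold comes from the slide, while the load-average cut-off and the office hours (09:00 to 17:00, Monday to Friday) are assumed values chosen for the example.

```python
from datetime import datetime, time

IDLE_THRESHOLD_S = 10 * 60    # from the slide: at least 10 minutes of inactivity
MAX_LOAD_AVG = 0.3            # assumption: the slides do not define "low load"
OFFICE_START = time(9, 0)     # assumption: office hours are not given
OFFICE_END = time(17, 0)      # assumption


def in_office_hours(now: datetime) -> bool:
    """True on weekdays between the assumed office start and end times."""
    return now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END


def may_start_job(now: datetime, idle_seconds: float, load_avg: float) -> bool:
    """Outside office hours jobs may always start; during office hours the
    machine must have been idle long enough and have a low load average."""
    if not in_office_hours(now):
        return True
    return idle_seconds >= IDLE_THRESHOLD_S and load_avg <= MAX_LOAD_AVG
```

For example, may_start_job(datetime(2009, 3, 2, 10, 0), 300, 0.1) is False (a Monday morning with only 5 minutes of inactivity), whereas the same machine at 22:00 would accept a job regardless of activity.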
MATLAB advantages
  • Originally developed for linear algebra algorithm development but now contains many built-in functions geared to different disciplines, divided into toolboxes.
  • Intuitive interactive environment allows rapid code development.
  • Simple but powerful file I/O: save <filename>, load <filename> (useful for checkpointing).
  • Allows users to create their own functions stored as M-files.
  • “Standalone” applications can be built from M-files:
    • can run on platforms without MATLAB installed
    • do not need a licence to be able to run
    • can include all toolbox functions
  • APIs available for FORTRAN and C codes (“MEX files”)
MATLAB disadvantages
  • Even standalone applications can run slower than equivalent C or FORTRAN implementations.
  • Standalone applications aren’t quite what they may seem:
    • more than just an .exe – several files need to be packaged and deployed
    • need access to MATLAB run-time libraries usually via MATLAB Component Runtime (150 MB self-extracting .exe)
    • luckily we have MATLAB pre-installed on all PCs in Condor pool (originally used a network drive)
  • Run-time errors can be difficult to trace when MATLAB jobs are run under Condor:
    • need to run under Condor on local PC
    • configure with USE_VISIBLE_DESKTOP=True to see pop-up messages
  • Jobs submitted in a UNIX environment but code developed under Windows.
Minor MATLAB irritations
  • Output files occasionally go missing:
    • specify all required files using transfer_output_files
    • identify problem jobs with condor_q -held
    • resubmit with condor_release -all
  • Jobs sometimes run “forever”:
    • use condor_vacate to move job to another machine
    • less of a problem during term time as jobs usually get evicted by logins
  • Difficult to reproduce these problems:
    • happen quite rarely (< 1 in ~1000 jobs)
    • many jobs based on stochastic methods
MATLAB Research Applications
  • Predicting the spread of avian influenza outbreaks in poultry flocks (Veterinary Clinical Science).
  • Modelling of E. coli propagation in dairy cattle (Veterinary Clinical Science).
  • Testing of parallel genetic algorithms in a complex classification system (Electrical Engineering and Electronics).
  • Simulation of the infection of a bacterial cell by a virus (Mathematical Sciences).
  • Modelling the effects of radiotherapy on normal tissue using 3D voxel arrays (Medical Imaging and Radiotherapy).
Power-saving at Liverpool
  • Have around 2 000 centrally managed PCs across campus, which used to be left powered up overnight, at weekends and during vacations.
  • Original power-saving policy was to power off machines after 30 minutes of inactivity; machines are now hibernated after 10 minutes of inactivity.
  • Policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a.
  • Makes extensive use of PowerMAN system from Data Synergy comprising:
    • service which forces machines into a low-power state and reports machine activity to Management Reporting Platform
    • Management Reporting Platform - central server from where usage stats can be retrieved and viewed via a web browser
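
The savings figures quoted above can be cross-checked with a little arithmetic. The sketch below uses only the three numbers from the slide (hours saved, energy saved, annual saving) and works out what they imply. The per-PC power draw and the electricity price it prints are derived values, not figures stated in the slides, and the calculation assumes the weekly saving applies uniformly across 52 weeks, which is a simplification.

```python
HOURS_SAVED_PER_WEEK = (200_000, 250_000)   # from the slide
ENERGY_SAVED_MWH = (20, 25)                 # from the slide
ANNUAL_SAVING_GBP = 125_000                 # from the slide
WEEKS_PER_YEAR = 52                         # assumption: uniform saving all year


def implied_watts(hours: float, mwh: float) -> float:
    """Average power per idle PC implied by an hours/energy pair."""
    return mwh * 1_000_000 / hours


def implied_pence_per_kwh(mwh_per_week: float) -> float:
    """Electricity price implied by the annual saving for a weekly energy saving."""
    kwh_per_year = mwh_per_week * 1_000 * WEEKS_PER_YEAR
    return ANNUAL_SAVING_GBP * 100 / kwh_per_year


for hours, mwh in zip(HOURS_SAVED_PER_WEEK, ENERGY_SAVED_MWH):
    print(f"{hours} h/week, {mwh} MWh/week: "
          f"{implied_watts(hours, mwh):.0f} W per PC, "
          f"{implied_pence_per_kwh(mwh):.1f} p/kWh")
```

Both ends of the range imply an average draw of 100 W per idle PC, and the £125 000 figure corresponds to an electricity price of roughly 10 to 12 pence per kWh under these assumptions.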
Adapting Condor for use with power-saving PCs
  • Two main problems:
    • how to ensure Condor jobs are not evicted by hibernating/powered-off PCs
    • how to wake up dormant PCs to run Condor jobs on-demand
  • Originally used Microsoft system service to power-down PCs after 30 min inactivity:
    • runs .bat file which checks if a user is logged in and shuts machine down if not
    • doesn’t detect owner of Condor job as a logged-in user
    • need to check for presence of condor_exe.bat
  • PowerMAN service now prevents job eviction:
    • can provide PowerMAN with a list of “protected programs”
    • ensures that system remains active if a protected program is running
    • include condor_starter process as a protected program (only present while a Condor job is running).
Adapting Condor for use with power-saving PCs
  • Wake-on-LAN (“WoL”) used to bring hibernating machines back to full power:
    • NICs must remain powered up during hibernation/power-off
    • NICs must be capable of waking machines on receipt of a “magic packet”
    • network must be able to route “magic packets”
  • A cron job on the submit host examines the state of the queue (condor_q) and pool (condor_status):
    • if there are more idle jobs in the queue than Unclaimed machines, then hibernating machines need to be woken up
    • find number of powered-up machines in each “teaching centre” (classroom)
    • estimate the number of hibernating machines in each teaching centre from total number of machines in each
    • sort centres from highest number of available machines to lowest
    • wake up centres in turn until sufficient machines woken to meet the demand (or all centres woken up)
    • MAC addresses of machines are stored in files sorted according to teaching centre (needed for Wake-on-LAN)
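
The wake-up procedure above can be sketched in a few lines. This is an illustrative reconstruction, not the cron script actually used at Liverpool. The magic packet layout (six 0xFF bytes followed by the target MAC address repeated 16 times) is the standard Wake-on-LAN format; the function names, the UDP port and the broadcast address are choices made for the example. The sketch also makes two simplifying assumptions: it reads "available machines" as the estimated number of hibernating machines in a centre, and it counts one idle job per machine rather than per job slot.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF followed by
    the 6-byte MAC address repeated 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return b"\xff" * 6 + raw * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is a common choice)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))


def centres_to_wake(idle_jobs, unclaimed, powered_up, totals):
    """Return the teaching centres to wake, in the order they should be woken.

    idle_jobs  -- number of idle jobs in the queue (as reported by condor_q)
    unclaimed  -- number of Unclaimed machines (as reported by condor_status)
    powered_up -- {centre: machines currently powered up}
    totals     -- {centre: total machines installed in the centre}
    """
    shortfall = idle_jobs - unclaimed
    if shortfall <= 0:
        return []
    # Hibernating machines can only be estimated: total minus powered-up.
    hibernating = {c: max(totals[c] - powered_up.get(c, 0), 0) for c in totals}
    # Assumed reading of the slide: most hibernating machines first.
    order = sorted(hibernating, key=hibernating.get, reverse=True)
    chosen = []
    for centre in order:
        if shortfall <= 0 or hibernating[centre] == 0:
            break
        chosen.append(centre)
        shortfall -= hibernating[centre]
    return chosen
```

With totals of 30, 20 and 10 machines in hypothetical centres A, B and C, of which 25, 0 and 2 are powered up, 30 idle jobs and 5 Unclaimed machines leave a shortfall of 25; the function returns ["B", "C"], and the script would then call send_wol for every MAC address listed for those two centres. Because whole centres are woken, more machines may be woken than are strictly needed.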
Automatic wake up issues
  • Assumes that any job can run on any machine:
    • users cannot choose particular teaching centres or machines in their job Requirements
    • ideally, pool needs to be homogeneous
    • errors in Requirements specification can cause severe problems (machines repeatedly wake up then hibernate)
    • cron job now includes a “sanity check” for this
  • Large clusters of jobs can cause the Condor scheduler to become overloaded:
    • condor_q times out so cron cannot determine queue state
    • only a transient problem – load eventually drops off and condor_q responds again
  • Can only estimate number of hibernating machines in each centre
  • May wake up more machines than needed
Recent and Future Developments
  • Recently moved to a policy of hibernating machines after 10 minutes of inactivity:
    • submit host / central manager needs to work harder to get jobs running before recently woken machines go back to hibernation
    • move execute hosts from Owner to Unclaimed state after just 5 minutes idle
    • update activity timer every 1 minute (default is 5 minutes)
    • increase number of scheduler and negotiator cycles using SCHEDD_INTERVAL=60, NEGOTIATOR_INTERVAL=60
    • around 25% of machines still hibernate after first wake-up
    • see a ramp up in machines running Condor jobs over about an hour
    • little impact on Condor users
    • energy wastage offset by savings with user logouts
Recent and Future Developments
  • Migrating to Condor 7.2 shortly:
    • has some interesting power-management features
    • automatic power-down on execute hosts could provide a useful “safety net” but PowerMAN likely to remain primary power management tool
    • can retain records of ClassAds of machines in low-power state
      • could be useful in matchmaking jobs to powered-down machines
      • matchmaking logic already in Condor
      • nice if Condor could use this to provide a list of machines to wake-up on demand
      • ... and wake them up with condor_wakeup ?
      • would like to ensure that powered-down machines are still out there (i.e. not broken, permanently turned off, or no longer listening)
      • also useful to see powered-off machines represented in condor_status output
  • Couple of extra “wishes”:
    • allow jobs to claim all slots on a machine (useful if they have large memory requirements)
    • provide a “logged-in user” machine ClassAd attribute
Further Information

http://www.liv.ac.uk/e-science/condor

i.c.smith@liverpool.ac.uk
