The University of Liverpool Condor Pool
Ian C. Smith
Computing Services Dept
University of Liverpool Condor Pool
• contains around 300 machines running the University’s Managed Windows (XP) Service
• most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and 80 GB disk, configured with two job slots per machine
• single submission point for Condor jobs provided by a Sun Solaris V440 SMP server
• access to the submit host is restricted to registered Condor users
• policy is to run jobs only after at least 5 minutes of inactivity and with a low load average during office hours, and at any time outside office hours
• a job will be killed if it is running when a user logs in to a PC
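The run policy above can be read as a simple predicate. The sketch below is illustrative only: the office-hours window (09:00 to 17:00, Monday to Friday) and the load threshold of 0.3 are assumptions, not values given in the slides, and on the real pool the policy is enforced by Condor's own configuration rather than by Python code.

```python
from datetime import datetime

# Assumed values for illustration; the slides do not state them.
OFFICE_START_HOUR = 9
OFFICE_END_HOUR = 17
MAX_LOAD_AVG = 0.3
MIN_IDLE_SECONDS = 5 * 60  # "at least 5 minutes of inactivity" (from the slide)


def in_office_hours(now: datetime) -> bool:
    """True on Monday-Friday between the assumed office hours."""
    return now.weekday() < 5 and OFFICE_START_HOUR <= now.hour < OFFICE_END_HOUR


def may_start_job(now: datetime, idle_seconds: float, load_avg: float) -> bool:
    """Outside office hours jobs may always start; inside them the PC
    must have been idle long enough and be lightly loaded."""
    if not in_office_hours(now):
        return True
    return idle_seconds >= MIN_IDLE_SECONDS and load_avg <= MAX_LOAD_AVG


def must_evict(user_logged_in: bool) -> bool:
    """A running job is killed as soon as a user logs in to the PC."""
    return user_logged_in
```

For example, `may_start_job(datetime(2024, 1, 3, 11, 0), 60, 0.0)` is false (a Wednesday morning with only one minute of inactivity), while the same call at 20:00 is true.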
Condor service caveats
• aimed at high throughput computing, not high performance computing
• only suitable for DOS-based applications running in batch mode
• no communication between processes possible (“pleasantly parallel” applications only)
• statically linked executables work best (although Condor can cope with DLLs)
• all files needed by the application must be present on local disk (cannot access network drives)
• no built-in check-pointing or standard output/error streaming
• shorter jobs are more likely to run to completion (10-20 min seems to work best)
• very long running jobs can be accommodated using Condor DAGMan or user level check-pointing (details available soon on the Condor website)
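User level check-pointing, mentioned in the last caveat, means the application itself saves enough state to resume after being killed. A minimal sketch of the idea follows. Python is used only for illustration (the pool runs Windows executables), the file name `checkpoint.json` and the squared-sum "computation" are hypothetical, and getting the checkpoint file back to the submit host and into the next run is outside this sketch.

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical file name


def load_state(path=CHECKPOINT):
    """Resume from a previous run if a checkpoint exists, else start fresh."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_step": 0, "total": 0}


def save_state(state, path=CHECKPOINT):
    """Write to a temporary file first, then rename, so a job killed
    mid-write cannot leave a corrupt checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(n_steps, path=CHECKPOINT, save_every=100):
    """Carry on from the last saved step and checkpoint periodically."""
    state = load_state(path)
    for step in range(state["next_step"], n_steps):
        state["total"] += step * step  # stand-in for the real computation
        state["next_step"] = step + 1
        if state["next_step"] % save_every == 0:
            save_state(state, path)
    save_state(state, path)
    return state["total"]
```

A job restarted after eviction repeats at most `save_every` steps, which is the same reasoning behind keeping individual jobs to around 10-20 minutes.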
Submitting jobs
• log in to the Condor server (condor.liv.ac.uk) using secure shell (ssh), available in PuTTY on MWS
• upload any input files using secure FTP (sftp), available in CoreFTP Lite on MWS
• create a job description file
• submit job(s) using:
    $ condor_submit <job_description_file>
• Condor jobs should be submitted from the Condor data filesystem, i.e. under /condor_data/<your_username>
  • sits on a large (1.2 TB) fast RAID system
  • filestore is not backed up!
Typical job description file
    universe = vanilla
    transfer_files = always
    executable = example.exe
    output = stdout.out$(PROCESS)
    error = stderr.out$(PROCESS)
    log = mylog.log$(PROCESS)
    transfer_input_files = common.txt, myinput$(PROCESS).txt
    requirements = ( Arch=="Intel" ) && ( OpSys=="WINNT51" )
    notification = never
    queue 10
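In the description file above, `queue 10` creates ten jobs and `$(PROCESS)` takes the values 0 to 9, one per job. The sketch below lists the file names each job would therefore use. It only mimics the macro substitution for illustration and is not Condor's own parser; the helper names `expand` and `files_for_jobs` are made up for this example.

```python
def expand(template: str, process: int) -> str:
    """Substitute $(PROCESS) the way the description file above uses it."""
    return template.replace("$(PROCESS)", str(process))


def files_for_jobs(n_jobs: int):
    """Return, per job, the file names implied by the description file."""
    jobs = []
    for proc in range(n_jobs):  # "queue 10" numbers the jobs 0..9
        jobs.append({
            "input": ["common.txt", expand("myinput$(PROCESS).txt", proc)],
            "output": expand("stdout.out$(PROCESS)", proc),
            "error": expand("stderr.out$(PROCESS)", proc),
            "log": expand("mylog.log$(PROCESS)", proc),
        })
    return jobs


for job in files_for_jobs(10)[:2]:
    print(job)
```

So the input files `myinput0.txt` to `myinput9.txt` must all be present alongside `common.txt` before submitting.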
Simplified job description file
    input_files = common.txt
    indexed_input_files = myinput.txt
    executable = example.exe
    indexed_stdout = stdout.out
    indexed_stderr = stderr.out
    indexed_log = mylog.log
    total_jobs = 10
• submit job(s) using:
    $ mws_submit <job_description_file>
Monitoring progress
• to see the current state of the pool:
    $ condor_status
• to see what jobs are doing:
    $ condor_q                   # all jobs
    $ condor_q <your_username>   # just your own
  (may take several minutes before the first job runs)
• to find out why jobs aren’t running:
    $ condor_q -analyze <jobID>
• to cancel/remove jobs:
    $ condor_rm <jobID>      # single one
    $ condor_rm -all         # all of them!
    $ condor_rm -f <jobID>   # forcibly remove jobs quickly
Power-saving at Liverpool
• have around 2 000 centrally managed PCs across campus which were powered up overnight, at weekends and during vacations
• original power-saving policy was to power off machines after 30 minutes of inactivity, now hibernate them after 10 minutes of inactivity
• policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh), leading to an estimated saving of approx. £125 000 p.a.
• makes extensive use of the PowerMAN system from Data Synergy, comprising:
  • a service which forces machines into a low-power state and reports machine activity to the Management Reporting Platform
  • the Management Reporting Platform: a central server from which usage stats can be retrieved and viewed via a web browser
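As a sanity check on the figures quoted above, the sketch below works out what they imply. The hours, MWh and pounds values come from the slide; the 52-week year and the reading of the saving as purely an electricity cost are assumptions made here, so the derived numbers are rough indications only.

```python
hours_saved_per_week = (200_000, 250_000)   # from the slide
energy_saved_mwh_per_week = (20, 25)        # from the slide
annual_saving_gbp = 125_000                 # from the slide

# Implied average power draw of an idle PC, in watts (1 MWh = 1 000 000 Wh).
implied_watts = [
    mwh * 1_000_000 / hours
    for hours, mwh in zip(hours_saved_per_week, energy_saved_mwh_per_week)
]

# Implied electricity price, assuming the saving applies 52 weeks a year.
annual_mwh = [mwh * 52 for mwh in energy_saved_mwh_per_week]
implied_pence_per_kwh = [
    annual_saving_gbp * 100 / (mwh * 1000) for mwh in annual_mwh
]

print(implied_watts, annual_mwh, implied_pence_per_kwh)
```

Both ends of the range correspond to 100 W per idle machine, and the annual figure works out at very roughly 10 to 12 pence per kWh under the stated assumptions.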
MATLAB advantages
• originally designed for developing linear algebra algorithms, but now contains many built-in functions geared to different disciplines, divided into toolboxes
• intuitive interactive environment allows rapid code development
• simple but powerful file I/O: save <filename>, load <filename> (useful for check-pointing)
• allows users to create their own functions stored as M-files
• “standalone” applications can be built from M-files:
  • can run on platforms without MATLAB installed
  • do not need a licence to be able to run
  • can include all toolbox functions
• APIs available for FORTRAN and C codes (“MEX files”)
MATLAB disadvantages
• even standalone applications can run slower than equivalent C or FORTRAN implementations
• standalone applications aren’t quite what they may seem:
  • more than just an .exe: a “manifest” file is needed to locate run-time libraries
  • need access to MATLAB run-time libraries, usually via the MATLAB Component Runtime (150 MB self-extracting .exe)
  • luckily we have MATLAB pre-installed on all PCs in the Condor pool (originally used a network drive)
• run-time errors can be difficult to trace when MATLAB jobs are run under Condor
Local tools for MATLAB
• to run MATLAB on the Condor server without the GUI:
    $ matlab_run <M-file>
  • should be used sparingly!
• to build a standalone application using the pool:
    $ matlab_build <M-file>
• to run a simple M-file job on the pool:
    $ m_file_submit <simplified_job_description_file>
• to run standalone applications on the pool:
    $ matlab_submit <simplified_job_description_file>
  • ideal for large clusters of jobs
Power-saving at Liverpool
• have around 2 000 centrally managed PCs across campus which were powered up overnight, at weekends and during vacations
• original power-saving policy was to power off machines after 30 minutes of inactivity, now hibernate them after 15 minutes of inactivity
• policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh), leading to an estimated saving of approx. £125 000 p.a.
• great news for sustainability, but not so good for Condor ...
Adapting Condor for use with power-saving PCs
• two main problems:
  • how to ensure Condor jobs are not evicted by hibernating PCs
  • how to wake up dormant PCs to run Condor jobs on demand
• power-saving software is configured not to cause hibernation if a Condor job is running
• PCs are woken up using new features of Condor
  • will only attempt to wake up PCs which match users’ requirements
  • number of machines woken up is ramped up to avoid server overload
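The two wake-up rules above (only wake matching PCs, and ramp up the number woken) can be illustrated with a small sketch. This is not how Condor implements it: the machine records, the equality-only matching, and the doubling schedule with its `first_batch` and `max_batch` parameters are all assumptions chosen to show the idea.

```python
def matches(machine: dict, requirements: dict) -> bool:
    """A machine matches if it has every required attribute value."""
    return all(machine.get(k) == v for k, v in requirements.items())


def wake_batches(machines, requirements, jobs_waiting,
                 first_batch=2, max_batch=16):
    """Yield lists of machine names to wake, doubling the batch size each
    round up to max_batch, and never waking more PCs than there are jobs."""
    candidates = [m["name"] for m in machines
                  if m["offline"] and matches(m, requirements)]
    candidates = candidates[:jobs_waiting]
    size, start = first_batch, 0
    while start < len(candidates):
        yield candidates[start:start + size]
        start += size
        size = min(size * 2, max_batch)
```

With ten matching offline PCs and plenty of waiting jobs, this yields batches of 2, 4 and 4 machines, so the load from waking machines builds up gradually rather than arriving all at once.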
Pool status with power-saving
    smithic(ulgp4)smithic$ condor_status

    Name               OpSys   Arch  State     Activity LoadAv Mem  ActvtyTim
    LSTC-01.livad.liv. WINNT51 INTEL Owner     Idle     0.160  2045 0+00:21:1
    LSTC-01.livad.liv. WINNT51 INTEL Unclaimed Idle     0.000  2045 [Unknown
    LSTC-02.livad.liv. WINNT51 INTEL Unclaimed Idle     0.000  2045 [Unknown
    LSTC-24.livad.liv. WINNT51 INTEL Unclaimed Idle     0.000  2045 [Unknown
    LSTC-25.livad.liv. WINNT51 INTEL Owner     Idle     0.030  2045 0+00:15:1
    LSTC-26.livad.liv. WINNT51 INTEL Owner     Idle     0.040  2045 0+00:35:1
    LSTC-26.livad.liv. WINNT51 INTEL Unclaimed Idle     0.000  2045 [Unknown
    LSTC-27.livad.liv. WINNT51 INTEL Unclaimed Idle     0.000  2045 [Unknown
    slot1@ARC2-12.liva WINNT51 INTEL Unclaimed Idle     0.000  1006 [Unknown
    slot1@ARC2-13.liva WINNT51 INTEL Owner     Idle     0.100  1006 0+00:07:1
    slot1@ARC2-15.liva WINNT51 INTEL Unclaimed Idle     0.010  1006 0+00:08:0
    slot1@ARC2-16.liva WINNT51 INTEL Owner     Idle     0.520  1006 0+00:05:1
    slot1@ARC2-17.liva WINNT51 INTEL Unclaimed Idle     0.000  1006 [Unknown
    slot1@ARC2-17.liva WINNT51 INTEL Owner     Idle     0.010  1006 0+00:03:1
    slot2@MSC2-12.liva WINNT51 INTEL Unclaimed Idle     0.000  1006 0+00:00:0
    ...
                  Total Owner Claimed Unclaimed Matched Preempting Backfill
    INTEL/WINNT51   608   171       1       436       0          0        0
            Total   608   171       1       436       0          0        0

(slide annotations mark two groups of rows as offline machines)
Pool status with power-saving
• to see all offline machines:
    $ condor_status -constraint Offline==True
• to see all powered-up machines:
    $ condor_status -constraint Offline=!=True
• offline machines also appear in condor_q -analyze:
    1238.999: Run analysis summary. Of 613 machines,
        0 are rejected by your job's requirements
      246 reject your job because of their own requirements
       21 match but are serving users with a better priority in the pool
       16 match but reject the job for unknown reasons
        0 match but will not currently preempt their existing job
      330 match but are currently offline
        0 are available to run your job
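The analysis summary above is easy to tally by script. The sketch below parses the summary lines exactly as quoted on the slide and confirms that the categories add up to the 613 machines reported. The function name `parse_summary` is made up for this example, and the parsing assumes the simple "count, then reason" line layout shown here.

```python
import re

SUMMARY = """\
0 are rejected by your job's requirements
246 reject your job because of their own requirements
21 match but are serving users with a better priority in the pool
16 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
330 match but are currently offline
0 are available to run your job
"""


def parse_summary(text: str) -> dict:
    """Map each reason line to its machine count."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(.*\S)", line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts


counts = parse_summary(SUMMARY)
total = sum(counts.values())
offline_share = counts["match but are currently offline"] / total
print(total, f"{offline_share:.0%}")
```

In this snapshot just over half of the pool matches the job but is hibernating, which is why waking machines on demand matters.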
Registering for the Condor Service
• you will need an account on the Sun UNIX service first (contact the CSD helpdesk: helpdesk@liverpool.ac.uk)
• contact Ian Smith in CSD (i.c.smith@liverpool.ac.uk) with your UNIX username and brief details of your project
• your Condor account should go live within a few days
More information
• there is much more background information on the CSD Condor Pool website: http://www.liv.ac.uk/e-science/condor