CamGrid

Presentation Transcript


  1. CamGrid Mark Calleja Cambridge eScience Centre

  2. What is it? • A number of like-minded groups and departments (10), each running their own Condor pool(s) (12 in total), which federate their resources. • Coordinated by the Cambridge eScience Centre (CeSC), but with no overall control. • Been running now for ~2.5 years, with 70+ users. • Currently have ~950 processors/cores available. • “All” Linux (various distributions), mostly x86_64, running 24/7. • Mostly Dell PowerEdge 1950 (like HPCF): four cores with 8 GB. • Around 2M CPU hours to date.

  3. Some details • Pools run the latest stable version of Condor (currently 6.8.6). • All machines get an (extra) IP address in a CUDN-only routeable range for Condor. • Each pool sets its own policies, but these must be visible to other users of CamGrid. • Currently we see vanilla, standard and parallel (MPI) universe jobs. • Users get accounts on a machine in their local pool; jobs are then distributed around the grid by Condor using its flocking mechanism. • MPI jobs on single SMP machines have proved very useful.
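
To make the flocking setup above concrete, here is a hedged sketch (host names and file names are invented, not CamGrid's actual configuration): a pool lists the other pools' central managers in its condor_config, and an ordinary submit file then needs nothing special, since flocking happens at the pool level.

    # condor_config on a pool's central manager (hypothetical host names)
    FLOCK_TO   = condor.chem.private.cam.ac.uk, condor.phys.private.cam.ac.uk
    FLOCK_FROM = condor.chem.private.cam.ac.uk, condor.phys.private.cam.ac.uk

    # A plain vanilla-universe submit file; the user does nothing extra to flock
    universe   = vanilla
    executable = my_sim
    arguments  = input.dat
    output     = sim.out
    error      = sim.err
    log        = sim.log
    queue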

  4. Negative thermal expansion (NTE) of Ag3[Co(CN)6] with SMP/MPI sweep

  5. Monitoring Tools • A number of web-based tools are provided to monitor the state of the grid and of jobs. • CamGrid is based on trust, so we must make sure that machines are fairly configured. • The university gave us £450k (~$950k) to buy new hardware; we need to ensure that it’s online as promised.
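
As an example of the kind of check such tools perform (a minimal sketch using Condor's own command-line tools; the pool name is hypothetical), condor_status can confirm that a pool is advertising the machines that were promised:

    # Total slots currently advertised by one CamGrid pool
    condor_status -pool condor.chem.private.cam.ac.uk -total

    # Machines currently claimed, with their core counts
    condor_status -constraint 'State == "Claimed"' -format "%s " Machine -format "%d\n" TotalCpus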

  6. CamGrid’s file viewer • Standard universe uses RPCs to echo I/O operations back to the submit host. • What about other universes? How can I check the health of my long-running simulation? • We’ve provided our own facility: an agent installed on each execute node and accessed via a web interface. • Works with vanilla and parallel (MPI) jobs. • Requires local sysadmins to install and run it.
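
The slides don't show the agent itself, so purely as an illustration of the idea (an assumption about the design, not CamGrid's actual code), a minimal agent could export a read-only view of Condor's execute directory over HTTP for the web interface to read:

    # Hypothetical minimal file-viewer agent (Python); paths and port are assumptions.
    import http.server
    import socketserver
    import os

    EXECUTE_DIR = "/var/lib/condor/execute"   # assumed location of job sandboxes
    PORT = 8649                                # arbitrary port for the web front-end

    os.chdir(EXECUTE_DIR)
    with socketserver.TCPServer(("", PORT), http.server.SimpleHTTPRequestHandler) as httpd:
        httpd.serve_forever()                  # serves directory listings and files, read-only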

  7. CamGrid’s file viewer

  8. Checkpointable vanilla universe • Standard universe is fine if you can link against Condor’s libraries (Pete Keller – “getting harder”). • Investigating the BLCR (Berkeley Lab Checkpoint/Restart) kernel modules for Linux. • Uses kernel resources, and can thus restore resources that user-level libraries cannot. • Supported by some flavours of MPI (late LAM, OpenMPI). • The idea was to use Parrot’s user-space FS to wrap a vanilla job and save the job’s state on a chirp server. • However, Parrot currently breaks some BLCR functionality.
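
The BLCR workflow being investigated looks roughly like this from the command line (cr_run, cr_checkpoint and cr_restart are BLCR's user commands; the job and file names are illustrative):

    # Start the job under BLCR so its state can be captured later
    cr_run ./my_sim input.dat &

    # Checkpoint the running process to a context file, then terminate it
    cr_checkpoint --term $!

    # Later, possibly after migration to another node, resume from the context file
    cr_restart context.<pid>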

  9. What doesn’t work so well… • Each pool is run by local sysadmin(s), but these are of variable quality/commitment. • We’ve set up mailing lists for users and sysadmins: hardly ever used (don’t want to advertise ignorance?). • Some pools have used SRIF hardware to redeploy machines committed earlier. Naughty… • Don’t get me started on the merger with UCS’s central resource (~400 nodes).

  10. But generally we’re happy bunnies • “CamGrid was an invaluable tool allowing us to reliably sample the large parameter space in a reasonable amount of time. A half-year’s worth of CPU running was collected in a week.” -- Dr. Ben Allanach • “CamGrid was essential in order for us to be able to run the different codes in real time.” -- Prof. Fernando Quevedo • “I needed to run simulations that took a couple of weeks each. Without access to the processors on CamGrid, it would have taken a couple of years to get enough results for a publication.” -- Dr. Karen Lipkow

  11. Current issues • Protecting resources on execute nodes; Condor seems lax at this, e.g. memory, disk space. • Increasingly interested in VMs (i.e. Xen). Some pools run it, but the effort isn’t concerted (effects on SMP MPI jobs?). • Green issues: will we be forced to buy WoL (Wake-on-LAN) cards in the near future? • Altruistic computing: a recent wave of interest in BOINC/backfill jobs for medical research, protein folding, etc., but who runs the jobs? Audit trail? (See the sketch below.) • How do we interact with outsiders? Ideally keep it to Condor (some Globus; we’ve toyed with VPNs). Most CamGrid stakeholders just dish out conventional, ssh-accessible accounts.
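
On the backfill point, Condor's startd has built-in BOINC backfill support; a rough sketch of the configuration involved (values are placeholders, and this is exactly the setup where the ownership and audit-trail questions arise):

    # condor_config fragment for running BOINC as backfill work (illustrative values)
    ENABLE_BACKFILL  = TRUE
    BACKFILL_SYSTEM  = BOINC
    BOINC_Executable = /usr/local/bin/boinc_client
    BOINC_InitialDir = /var/lib/boinc
    BOINC_Owner      = boinc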

  12. Finally… • CamGrid: http://www.escience.cam.ac.uk/projects/camgrid/ • Contact: mc321@cam.ac.uk Questions?
