
Farms Users meeting 4/27/2005



Presentation Transcript


  1. Farms Users Meeting, 4/27/2005
  Steven Timm
  http://www-oss.fnal.gov/scs/farms/farm_users

  2. Agenda
  • Events on the farms in the past two weeks
  • Scheduled downtimes
  • New users
    • M. Kostin, Accelerator Division
    • A. Lebedev, E907/MIPP
  • Existing user reports
  • Special presentation: upcoming transition of the General Purpose Farms to Condor and Grid

  3. Issues in the last two weeks
  • Thermal problems in LCC over the weekend; no nodes went down
  • Down nodes on the CDF farm: 1 of 98 FBSNG nodes, 1 of 72 Condor/CAF nodes
  • Down nodes on the D0 farm: 12 of 444 nodes
  • Down nodes on the GP Farms: 0 of 102 nodes
  • GP Farms networking was upgraded to gigabit on all nodes capable of it

  4. Downtimes
  • GP Farms: none scheduled
  • D0 farms: moving 3 racks of worker nodes to GCC, to be scheduled
  • CDF Farms: upgrade of Condor/CAF nodes to SLF304, in progress

  6. General Purpose Farms Allocations

  10. Grid on the General Purpose Farms
  • Executive summary:
    • A 14-node test cluster is available now for testing Condor and grid jobs
    • Tentative plan to add new nodes to the Condor/grid cluster this summer
    • Hope to complete the transition to the Condor batch system by the end of calendar year 2005
    • Local and grid submissions will both still be allowed on the General Purpose Farms
    • Existing GP Farms users will have the same priority whether submitting via the grid or locally
    • We will make sure appropriate training, documentation, and support are available to help users with the transition
    • Testing is currently ongoing with the first grid-enabled user, SDSS/DES

  11. Outline
  • Why use the Grid?
  • Why use Condor?
  • Virtual Organizations
  • The Open Science Grid
  • GP Farms on the Open Science Grid
  • Fermigrid
  • Access to mass storage

  12. Why the Grid?
  • The General Purpose Farms have limited resources and a limited equipment budget
  • All Fermilab CD resources have a mandate from the division to interoperate
  • Adding a grid interface to the farms enables us to interoperate with the larger clusters at Fermilab (specifically CMS and CDF) and make use of extra resources
  • Negotiation to use resources of the Open Science Grid off-site is in progress as well

  13. Why Condor?
  • Free software (but you can buy support)
  • Supported by a large team at the U. of Wisconsin (not by Fermilab programmers)
  • Widely deployed in multi-hundred-node clusters at Fermilab (CDF, CMS)
  • New versions of Condor allow Kerberos 5 and X.509 authentication
  • Comes with Condor-G, which simplifies submission of grid jobs (see the sketch below)
  • Condor-C components allow independent Condor pools to interoperate
  • Some of our grid-enabled users already take advantage of the extended Condor features, so this is the fastest way to get our users on the grid
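
  Editor's sketch (not from the slides) of the kind of Condor-G submit description file that sends a job through a Globus gatekeeper; the contact string fngp-osg.fnal.gov/jobmanager-condor and the file names are assumptions for illustration only:

      # grid-test.sub -- illustrative Condor-G submit file (names are assumed)
      universe        = globus
      globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
      executable      = /bin/hostname
      output          = grid-test.out
      error           = grid-test.err
      log             = grid-test.log
      queue

  The job would then be submitted with "condor_submit grid-test.sub" and monitored with "condor_q", just like a local Condor job.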

  14. Virtual Organizations
  • Each experiment is a Virtual Organization (VO)
  • Membership is managed by the VOMS software (Virtual Organization Management Service) and the VOMRS software (Virtual Organization Management Registration Service) (proxy example below)
  • Virtual Organizations have already been created for all major user groups on the General Purpose Farms as part of the Fermigrid project
  • We need at least one responsible person from each user group that uses the farms to say who should be a member of their virtual organization
  • Groups we have identified: sdss, ktev, miniboone, hypercp, minos, numi, accelerator, ppd_astro, ppd_theory, patriot (run2mc), auger
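
  Editor's sketch of how a user would assert VO membership once the VOs above are populated, using the VOMS client tools shipped with the VDT; the VO name sdss is taken from the list above, and the rest of the configuration is assumed:

      voms-proxy-init -voms sdss     # create a grid proxy carrying an sdss VO attribute
      voms-proxy-info -all           # inspect the proxy and its VO extensions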

  15. Open Science Grid
  • Continuation of the efforts that began in Grid3
  • Integration testing has been ongoing since February
  • Provisioning and deployment are occurring as we speak
  • The General Purpose Farms and CMS will both be Fermilab presences on the Open Science Grid
  • 10 Virtual Organizations so far, mostly US-based:
    • USATLAS
    • USCMS
    • SDSS
    • fMRI (functional Magnetic Resonance Imaging, based at Dartmouth)
    • GADU (applied genomics, based at Argonne)
    • GRASE (engineering applications, based at SUNY Buffalo)
    • LIGO
    • CDF
    • STAR
    • iVDGL
  • http://www.opensciencegrid.org

  16. Current Fermi GP Farms OSG presence
  • Node fngp-osg as gatekeeper and Condor master (Dell dual Xeon 3.6 GHz)
  • Software comes from the Virtual Data Toolkit: http://www.cs.wisc.edu/vdt
  • 14 worker nodes form the Condor pool (fnpc201-214)
  • Can successfully run batch jobs submitted locally via Condor and across the grid via Condor-G (see the test sketch below)
  • Has passed all validation tests of the Open Science Grid
  • Uses the extended privilege authorization from the VO Privilege Project
    • Each group can define different roles for its users
    • We can map a whole group to one userid, several userids, or a pool of userids
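
  Editor's sketch of a quick gatekeeper check with the Globus client tools from the VDT; the jobmanager name jobmanager-condor is an assumption based on the Condor pool described above:

      globusrun -a -r fngp-osg.fnal.gov                                    # authenticate against the gatekeeper only
      globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/hostname     # run a trivial job through the Condor jobmanager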

  17. Current architecture
  • All home directories and staging areas are served off of FNSFO and will be accessible as before
  • All OSG sites have $app and $data directories for applications and data transfer; these are served off of fngp-osg by NFS (usage sketch below)
  • All VDT-related software (Globus, Condor, etc.) is served off of fngp-osg
  • Grid jobs come in directly to fngp-osg and are farmed out to the 14 Condor nodes
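
  Editor's sketch of how a grid job wrapper might use the shared application and data areas; the environment variable names ($APP and $DATA), the experiment name, and the paths are assumptions, since the slides only name the directories:

      #!/bin/sh
      # hypothetical grid job wrapper: read pre-staged software from the shared
      # application area and write results to the shared data area (both are
      # NFS exports from fngp-osg)
      "$APP"/myexpt/bin/reconstruct input.dat
      cp output.dat "$DATA"/myexpt/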

  18. Goals for GP Farms grid deployment
  • The GP Farms are very busy (utilization above 90%), and two big productions are about to start
  • Need to preserve the lion's share of CPU cycles for existing users
  • Jobs from groups that are not GP Farms users will have only opportunistic use of the farms:
    • They run at the lowest priority (about 10^-6 of regular priority; see the sketch below)
    • They are limited in how many jobs they can start at once
    • At the moment, OSG jobs are confined to the Condor pool of 14 slow nodes that weren't otherwise being used at all
  • GP Farms users will be able to access their allocated share of resources whether they come in via the grid or not
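
  Editor's sketch of one way the "lowest priority" policy could be expressed in Condor, by giving opportunistic users a very large priority factor; the user name and the exact mechanism are assumptions consistent with the 10^-6 figure on the slide:

      # give opportunistic grid users a priority factor of 1,000,000, i.e. roughly
      # 10^-6 of the share of a regular user (the user name is a placeholder)
      condor_userprio -setfactor osg@fnal.gov 1000000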

  19. Current Farms configuration (diagram)
  • Components shown: FNSFO (NFS RAID), FBSNG head node, Enstore (encp access), FBS submit, GP Farms FBSNG worker nodes (102 currently)

  20. Configuration with grid (diagram)
  • Jobs from the OSG enter via the Fermigrid1 site gatekeeper and the FNGP-OSG gatekeeper
  • Jobs from Fermilab enter via FBS submit to FNPCSRV1 (FBSNG head node, NFS RAID) or via Condor submit
  • Condor worker nodes: 14 currently, 40 new nodes coming this summer
  • GP Farms FBSNG worker nodes: 102 currently
  • Enstore mass storage

  21. Fermigrid interface
  • Fermigrid is providing common site services for virtual organization management (VOMS) and user mapping (GUMS)
  • These services are expected to be online in the next month or two
  • All non-Fermi jobs will eventually go through the site Fermigrid gatekeeper and be farmed out to the other clusters

  22. Access to mass storage
  • A study is currently under way
  • Encp access to Enstore will remain available from the head node
  • We want to open the dccp, gridftp, and srmcp interfaces to dCache (command sketches below)
  • Before this is done, more study is needed on:
    • Authentication mechanisms: can we access mass storage from the worker nodes?
    • Resource load: the public dCache would need to expand its disk pool if demand increases significantly
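
  Editor's sketch of the three dCache interfaces named above; the host name, ports, and pnfs paths are placeholders, and whether these work from the worker nodes is exactly what the study still has to settle:

      # dcap copy
      dccp dcap://dcachedoor.fnal.gov:24125/pnfs/fnal.gov/usr/myexpt/file.dat file.dat
      # GridFTP copy (requires a grid proxy)
      globus-url-copy gsiftp://dcachedoor.fnal.gov/pnfs/fnal.gov/usr/myexpt/file.dat file:///`pwd`/file.dat
      # SRM-managed copy
      srmcp srm://dcachedoor.fnal.gov:8443/pnfs/fnal.gov/usr/myexpt/file.dat file:////`pwd`/file.dat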

  23. Support and documentation
  • http://grid.fnal.gov/fermigrid
  • http://www-oss.fnal.gov/scs/public/farms/grid/
  • http://www.ivdgl.org/osg-int/
  • http://plone.opensciencegrid.org/
  • http://www.opensciencegrid.org/
  • http://www.cs.wisc.edu/vdt
  • http://www.cs.wisc.edu/condor

  24. Things to watch and try
  • http://www-oss.fnal.gov/scs/public/farms/grid/ is being continuously updated as we learn more about what works
  • We hope to add sample Condor jobs shortly
  • Those familiar with Condor can log into fngp-osg and try submitting local test jobs now (see the session sketch below)
  • Source /export/osg/grid/setup.csh to get all the software set up
  • Grid job submission won't work until we get the virtual organizations populated (except for SDSS)
  • More presentations are coming at these meetings in the weeks ahead
  • We hope to organize a workshop this summer
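
  Editor's sketch of a minimal local test session on fngp-osg, following the instructions above; the submit file contents are illustrative only, not the promised sample jobs. A minimal vanilla-universe submit file (call it test.sub):

      universe   = vanilla
      executable = /bin/hostname
      output     = test.out
      error      = test.err
      log        = test.log
      queue

  Then, on fngp-osg:

      source /export/osg/grid/setup.csh   # pick up the Condor/VDT environment (csh/tcsh)
      condor_submit test.sub              # submit to the local pool
      condor_q                            # watch the job run on one of fnpc201-214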
