Version 1.0 (meeting edition)
19 March 2009
Rob Kennedy and Adam Lyon
Attending: RDK, AL, ….
D0 Grid Data Production Initiative: Coordination Mtg.
Overview
News and Summary
- System mostly OK
- PBS Upgrade Issues resolved
- All-CAB2 begun Fri 13th March
Not expected to present at meeting
Regarding temporarily using the whole of CAB2 for production: D0 management has decided that, from March 10, we will temporarily expand the d0 farm queue to cover the whole of CAB2. The purpose is to catch up on the data production backlog for the summer conference. This configuration is temporary. We will change it back to the current configuration when one of the following conditions occurs:
- when the backlog has been reduced to less than one week of data; or
- May 1, 2009; or
- when there is an analysis need for more CPUs than CAB1 can provide.
Although the configuration change will be done by FEF (thanks to FEF!), the SamGrid team may need to plan to adjust related parameters to handle a much larger production farm. The current d0 farm queue has 1800 job slots. The new d0 farm queue will temporarily have 1800+1400 job slots.
Thank you,
Qizhong
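The slot arithmetic and the three revert conditions in the message above can be written down as a small check. This is only an illustrative sketch: the function names, the choice to measure the backlog in days, and the boolean flag for analysis demand are assumptions made here, not part of any D0 or SamGrid tool.

```python
from datetime import date

CURRENT_FARM_SLOTS = 1800   # current d0 farm queue size, from the message
EXTRA_CAB2_SLOTS = 1400     # slots added while the queue spans all of CAB2
REVERT_DEADLINE = date(2009, 5, 1)
BACKLOG_THRESHOLD_DAYS = 7  # "less than one week of data" (assumed to be measured in days)


def expanded_slots():
    """Total job slots in the temporarily expanded d0 farm queue."""
    return CURRENT_FARM_SLOTS + EXTRA_CAB2_SLOTS


def should_revert(backlog_days, today, analysis_needs_more_than_cab1):
    """Return True as soon as any one of the three stated conditions holds."""
    return (
        backlog_days < BACKLOG_THRESHOLD_DAYS
        or today >= REVERT_DEADLINE
        or analysis_needs_more_than_cab1
    )
```

With these numbers the expanded queue has 3200 slots, which is the scale the SamGrid parameters would need to handle.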
Configure d0cabosg[1,2] to have the capability to manage jobs on both d0cabsrv[1,2]. (Yes, do in Phase 2, by FGS, at low priority)
Requires a bit of “hackery” in the jobmanager-pbs script, but is doable. Both d0cabosg[1,2] would have jobmanager-d0cabsrv[1,2]. jobmanager-pbs would become a symlink to the appropriate jobmanager-d0cabsrv[1,2]. This would be in addition to work being performed by FermiGrid on high availability gatekeepers.
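A rough sketch of the symlink part of this proposal, for discussion only. The directory layout is hypothetical (the notes do not say where the jobmanager scripts live on d0cabosg[1,2]), the helper names are invented here, and the "hackery" inside the jobmanager-pbs script itself is not shown. The example works in a temporary directory with empty stand-in files.

```python
import os
import tempfile


def link_default_jobmanager(jobmanager_dir, target_cluster):
    """Point jobmanager-pbs at jobmanager-<target_cluster> with a relative symlink."""
    target = "jobmanager-" + target_cluster
    if not os.path.exists(os.path.join(jobmanager_dir, target)):
        raise FileNotFoundError(target)
    link = os.path.join(jobmanager_dir, "jobmanager-pbs")
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)   # build the new link under a temporary name
    os.replace(tmp, link)     # then swap it into place in one rename
    return link


def demo():
    """Create stand-in jobmanager files and repoint the default twice."""
    with tempfile.TemporaryDirectory() as d:
        for name in ("d0cabsrv1", "d0cabsrv2"):
            open(os.path.join(d, "jobmanager-" + name), "w").close()
        link = link_default_jobmanager(d, "d0cabsrv1")
        first = os.readlink(link)
        link_default_jobmanager(d, "d0cabsrv2")
        return first, os.readlink(link)
```

Creating the link under a temporary name and renaming it means there is no moment at which jobmanager-pbs is missing, which may matter on a gatekeeper that is accepting jobs while the change is made.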
Open up additional slots for opportunistic use on both clusters. (cab1 first)
Ideally make all pbs job slots available for opportunistic use by Grid.
Research/develop automatic “eviction” policy for pbs when slot is needed by dzero (as with condor).
(requires D0 CPB input, should go through QL. This is in-kind for D0 opportunistic use)
Review Globus Errors: Adam’s text table of held reasons.
Conduct review(s) covering the following topics, plan for future work:
Consider extra queues to segregate D0 MC production (J. Snow) from other VO opportunistic usage. Consider adding extra roles in dzero VO (such as /Role=mc or /Role=monte-carlo).
Investigate whether special-purpose d0farm nodes are still needed (FEF, FGS, Grid Dev). If so, do the coding that should have been done long ago to report them accurately.
Review the layout of D0 resources advertising to ReSS, in order to see if it can be done in a more uniform way as opposed to the special-case hackery for CAB that is done now. (related to above 2 bullets). Do we want to have VOMS:/dzero/users access rules?
Can d0cabsrv1 worker nodes be increased to have 10GB of scratch (like d0cabsrv2 worker nodes already have) instead of 4GB? Irrelevant due to retirements?
A “higher RAM per node” pool for large memory jobs?
Eliminate specialized queues in favor of priorities to allow greater CPU utilization.
Load-balancing/combining CAB1/CAB2 (avoid users manually load balancing)
Also on cache nodes: 0-bias skim, LCG cache
Initiated by Reco Job
Other data destined for Tape Storage
Initiated by Merge Job, Via gridftp
In2p3 remote uploads
Durable Storage and Stager Space are on separate partitions
Shared w/Analysis Users
No automated failover between ‘63, ‘65