
Upgrade D0 farm



  1. Upgrade D0 farm

  2. Reasons for upgrade
     • RedHat 7 needed for D0 software
     • New versions of
       • ups/upd v4_6
       • fbsng v1_3f+p2_1
       • sam
     • Use of farm for MC and analysis
     • Integration in farm network
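     All of these products are set up through ups/upd; a minimal sketch of the environment setup used throughout this deck (the setups.sh path and product names are taken verbatim from later slides; the explicit version argument is an assumption):

       #!/bin/sh
       . /usr/products/etc/setups.sh   # bootstrap ups/upd
       setup fbsng                     # Farm Batch System (fbs commands)
       setup sam                       # SAM data-handling client
       # a specific version can be requested explicitly, e.g.
       # setup fbsng v1_3f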

  3. MC production on farm
     • Input: requests
     • Request translated into an mc_runjob macro
     • Stages (sketched below):
       • mc_runjob on batch server (hoeve)
       • MC job on node
       • SAM store on file server (schuur)
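     A minimal sketch of how the three stages chain together, using the commands that appear later in the deck (the macro and jdf names are hypothetical, and the farm-side mc_runjob path is an assumption adapted from slides 8 and 30):

       #!/bin/sh
       # Stage 1, on hoeve: mc_runjob turns the request macro into job
       # scripts (same Linker.py invocation as the grid version on slide 30).
       cd /d0gstar/mcc/mcc-dist && . mcc_dist_setup.sh && cd -
       python /d0gstar/mcc/mc_runjob/py_script/Linker.py script=minbias.macro
       # Stages 2 and 3: submit the generated fbs job; its mcc section runs
       # the MC job on a node, its rcp/sam sections run on schuur.
       fbs submit minbias.jdf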

  4. [Diagram: MC production flow — an mcc request reaches the farm server, which submits an fbs job with three sections (1 mcc, 2 rcp, 3 sam); mcc runs on the nodes (100 CPUs, 40 GB data each), rcp copies the output to the 1.2 TB file server, and sam declares the metadata to the SAM DB; mcc input and output move between the file server and the FNAL/SARA datastore; the farm server controls the nodes.]

  5. [Diagram: variant of slide 4 — the fbs job now has two sections (1 mcc, 2 rcp, i.e. fbs(rcp[,sam])), and the SAM store on the file server is driven by cron instead; datastore, SAM DB, and hardware unchanged.]

  6. [Diagram: job flow by host and account — on hoeve, fbsuser runs mc_runjob and cron does fbs submit; on the node, fbsuser runs cp and mcc; a second fbs submit triggers fbsuser's rcp on schuur, where willem runs sam; the legend distinguishes data from control flows.]

  7. SECTION mcc
     EXEC=/d0gstar/curr/minbias-02073214824/batch
     NUMPROC=1
     QUEUE=FastQ
     STDOUT=/d0gstar/curr/minbias-02073214824/stdout
     STDERR=/d0gstar/curr/minbias-02073214824/stdout

     SECTION rcp
     EXEC=/d0gstar/curr/minbias-02073214824/batch_rcp
     NUMPROC=1
     QUEUE=IOQ
     DEPEND=done(mcc)
     STDOUT=/d0gstar/curr/minbias-02073214824/stdout_rcp
     STDERR=/d0gstar/curr/minbias-02073214824/stdout_rcp
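     This job description would be handed to fbs the same way as on slide 18; a minimal usage sketch, with an assumed file name since the deck does not name this jdf:

       fbs submit /d0gstar/curr/minbias-02073214824/mcc.jdf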

  8. batch (runs on node):

     #!/bin/sh
     . /usr/products/etc/setups.sh
     cd /d0gstar/mcc/mcc-dist
     . mcc_dist_setup.sh
     mkdir -p /data/curr/minbias-02073214824
     cd /data/curr/minbias-02073214824
     cp -r /d0gstar/curr/minbias-02073214824/* .
     touch /d0gstar/curr/minbias-02073214824/.`uname -n`
     sh minbias-02073214824.sh `pwd` > log
     touch /d0gstar/curr/minbias-02073214824/`uname -n`
     /d0gstar/bin/check minbias-02073214824

     batch_rcp (runs on schuur):

     #!/bin/sh
     i=minbias-02073214824
     if [ -f /d0gstar/curr/$i/OK ];then
       mkdir -p /data/disk2/sam_cache/$i
       cd /data/disk2/sam_cache/$i
       node=`ls /d0gstar/curr/$i/node*`
       node=`basename $node`
       job=`echo $i | awk '{print substr($0,length-8,9)}'`
       rcp -pr $node:/data/dest/d0reco/reco*${job}* .
       rcp -pr $node:/data/dest/reco_analyze/rAtpl*${job}* .
       rcp -pr $node:/data/curr/$i/Metadata/*.params .
       rcp -pr $node:/data/curr/$i/Metadata/*.py .
       rsh -n $node rm -rf /data/curr/$i
       rsh -n $node rm -rf /data/dest/*/*${job}*
       touch /d0gstar/curr/$i/RCP
     fi
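     The two scripts synchronize through marker files under /d0gstar/curr/<job>: batch touches .<node> when it starts and <node> when it finishes, while batch_rcp copies results only once an OK marker exists. That marker is left by /d0gstar/bin/check, which the deck does not show; a purely hypothetical sketch of such a check (the success test is invented for illustration):

       #!/bin/sh
       # Hypothetical reconstruction of /d0gstar/bin/check.
       i=$1
       # If the MC job left a non-empty log and reco output exists, mark the
       # job OK so that batch_rcp (on schuur) will pick the results up.
       if [ -s /data/curr/$i/log ] && ls /data/dest/d0reco/reco* >/dev/null 2>&1
       then
           touch /d0gstar/curr/$i/OK
       fi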

  9. runs on schuur, called by fbs or cron:

     #!/bin/sh
     locate(){
       file=`grep "import =" import_${1}_${job}.py | awk -F \" '{print $2}'`
       sam locate $file | fgrep -q [
       return $?
     }
     . /usr/products/etc/setups.sh
     setup sam
     SAM_STATION=hoeve
     export SAM_STATION
     tosam=$1
     LIST=`cat $tosam`
     for job in $LIST
     do
       cd /data/disk2/sam_cache/${job}
       # declare gen, d0g, sim
       list='gen d0g sim'
       for i in $list
       do
         until locate $i || (sam declare import_${i}_${job}.py && locate ${i})
         do sleep 60; done
       done
       # store reco, recoanalyze
       list='reco recoanalyze'
       for i in $list
       do
         sam store --descrip=import_${i}_${job}.py --source=`pwd`
         return=$?
         echo Return code sam store $return
       done
     done
     echo Job finished ...
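     A minimal usage sketch: the script takes a file listing finished job names (tosam=$1 above); the list-file and script names here are assumptions:

       echo minbias-02073214824 > tosam
       sh store_jobs.sh tosam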

  10. Filestream
      • Fetch input from sam
      • Read input file from schuur
      • Process data on node
      • Copy output to schuur
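     A minimal sketch of these four steps with the sam and rcp commands used elsewhere in the deck (host names and cache paths are taken from the test setup on slides 15-17; the output destination is an assumption):

       #!/bin/sh
       # 1. Fetch input from SAM into the file server cache.
       sam run project get_file.py --interactive > log
       # 2. Copy a delivered file from the file server to a worker node.
       rcp -r /stage/triviaal/sam_cache/boo node-2:/data/test
       # 3. Process the data on the node (d0exe on slide 11), then
       # 4. copy the output back to the file server (destination assumed).
       rcp -pr /data/test/out triviaal:/stage/triviaal/sam_cache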

  11. [Diagram: filestream job flow — on hoeve, mc_runjob attaches a filestream and cron does fbs submit; the node runs rcp, d0exe, rcp; a second fbs submit triggers sam on schuur; data and control flows as in the legend.]

  12. Analysis on farm
      • Stages:
        • Read files from sam
        • Copy files to node(s)
        • Perform analysis on node
        • Copy files to file server
        • Store files in sam

  13. [Diagram: analysis flow — fbs steps: (1) sam + rcp, (2) analyze, (3) rcp + sam; fbs(1) and fbs(3) run on the 1.2 TB file server, which talks to the SAM DB (metadata) and the FNAL/SARA datastore (data); fbs(2) runs on the nodes (100 CPUs, 40 GB data each); the farm server controls everything through fbs.]

  14. [Diagram: test setup on triviaal and node-2 — willem:sam delivers input on triviaal; fbsuser:rcp copies it to node-2; fbsuser runs the analysis program; fbsuser:rcp copies the output back; willem:sam stores it.]

  15. batch.jdf:
      SECTION sam
      EXEC=/home/willem/batch_sam
      NUMPROC=1
      QUEUE=IOQ
      STDOUT=/home/willem/stdout
      STDERR=/home/willem/stdout

      batch_sam:
      #!/bin/sh
      . /usr/products/etc/setups.sh
      setup sam
      SAM_STATION=triviaal
      export SAM_STATION
      sam run project get_file.py --interactive > log
      /usr/bin/rsh -n -l fbsuser triviaal rcp -r /stage/triviaal/sam_cache/boo node-2:/data/test >> log
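     As a usage note, this jdf is submitted with the same command shown on slide 18 (the exact path of batch.jdf is an assumption):

       rsh -l fbsuser triviaal fbs submit ~willem/batch.jdf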

  16. [Diagram: variant of slide 13 — the fbs steps are now (1) sam, (2) rcp + analyze + rcp, (3) rcp + sam, i.e. the node copies its own input and output in step 2; datastore, SAM DB, file server, and nodes as before.]

  17. [Diagram: as slide 14, but after willem:sam on triviaal, fbsuser issues fbs submit; on node-2, fbsuser runs rcp, the analysis program, and rcp for input and output; willem:sam stores the result.]

  18. rsh -l fbsuser triviaal fbs submit ~willem/batch_node.jdf

      batch_node.jdf:
      SECTION sam
      EXEC=/d0gstar/batch_node
      NUMPROC=1
      QUEUE=FastQ
      STDOUT=/d0gstar/stdout
      STDERR=/d0gstar/stdout

      /d0gstar/batch_node:
      #!/bin/sh
      uname -a
      date

  19. SECTION ana
      EXEC=/d0gstar/batch_node
      NUMPROC=1
      QUEUE=FastQ
      STDOUT=/d0gstar/stdout
      STDERR=/d0gstar/stdout

      SECTION sam
      EXEC=/home/willem/batch
      NUMPROC=1
      QUEUE=IOQ
      STDOUT=/home/willem/stdout
      STDERR=/home/willem/stdout

      /home/willem/batch:
      #!/bin/sh
      . /usr/products/etc/setups.sh
      setup fbsng
      setup sam
      SAM_STATION=triviaal
      export SAM_STATION
      sam run project get_file.py --interactive > log
      /usr/bin/rsh -n -l fbsuser triviaal fbs submit /home/willem/batch_node.jdf

      /d0gstar/batch_node:
      #!/bin/sh
      rcp -pr server:/stage/triviaal/sam_cache/boo /data/test
      . /d0/fnal/ups/etc/setups.sh
      setup root -q KCC_4_0:exception:opt:thread
      setup kailib
      root -b -q /d0gstar/test.C

      /d0gstar/test.C:
      {
        gSystem->cd("/data/test/boo");
        gSystem->Exec("pwd");
        gSystem->Exec("ls -l");
      }

  20. get_file.py:

      #
      # This file sets up and runs a SAM project.
      #
      import os, sys, string, time, signal
      from re import *
      from globals import *
      import run_project
      from commands import *
      #########################################
      #
      # Set the following variables to appropriate values
      # Consult database for valid choices
      sam_station = "triviaal"
      # Consult database for valid choices
      project_definition = "op_moriond_p1014"
      # A particular snapshot version, last or new
      snapshot_version = 'new'
      # Consult database for valid choices
      appname = "test"
      version = "1"
      group = "test"
      # The maximum number of files to get from sam
      max_file_amt = 5
      # for additional debug info use "--verbose"
      #verbosity = "--verbose"
      verbosity = ""
      # Give up on all exceptions
      give_up = 1

      def file_ready(filename):
          # Replace this python subroutine with whatever you want to do
          # to process the file that was retrieved.
          # This function will only be called in the event of
          # a successful delivery.
          print "File ", filename, " has been delivered!"
          # os.system('cp '+filename+' /stage/triviaal/sam')
          return
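     This script is not executed directly; it parameterizes the SAM project that the sam client drives, exactly as invoked in batch_sam on slide 15:

       sam run project get_file.py --interactive > log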

  21. Disk partitioning (hoeve)

      /d0
        /mcc
          /mcc-dist
          /mc_runjob
          /curr
        /fnal
          /fbsng
          /d0usr
          /d0dist
          /ups (/db, /etc, /prd)

      Symbolic links:
      /fnal         -> /d0/fnal
      /d0usr        -> /fnal/d0usr
      /d0dist       -> /fnal/d0dist
      /usr/products -> /fnal/ups

  22. ana_runjob
      • Is analogous to mc_runjob
      • Creates and submits analysis jobs
      • Input:
        • get_file.py with SAM project name
          (the project defines the files to be processed)
        • analysis script
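     The deck shows no ana_runjob code; a purely hypothetical sketch of the intended usage, by analogy with the mc_runjob/Linker.py invocation on slide 30 (every name below is an assumption; the project definition is the example from slide 20):

       # inputs: a get_file.py naming the SAM project, plus the analysis
       # script to run on each delivered file; ana_runjob creates and
       # submits the corresponding batch job
       python ana_runjob.py project=op_moriond_p1014 script=analyze.sh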

  23. Integration with grid (1)
      • At present separate clusters:
        • D0, LHCb, Alice, DAS cluster
      • hoeve and schuur in farm network

  24. [Diagram: present network layout — ajax and hefnet connect through a router to surfnet; a switch links hoeve, schuur, and the nodes, which share data over NFS.]

  25. [Diagram: new network layout — hefnet (ajax) and a lambda link reach the farmrouter; switches connect booder, hoeve, schuur and the LHCb, D0, and Alice nodes, with NFS between file server and nodes.]

  26. [Diagram: same as slide 25, with the das-2 cluster also attached to the farmrouter.]

  27. Server tasks
      • hoeve
        • software server
        • farm server
      • schuur
        • file server
        • sam node
      • booder
        • home directory server
        • in backup scheme

  28. Integration with grid (2)
      • Replace fbs with pbs or condor
        • pbs on Alice and LHCb nodes
        • condor on das cluster
      • Use EDG installation tool LCFG
      • Install d0 software with rpm
      • Problem with sam (uses ups/upd)
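     A minimal sketch of the rpm-based install step, with a hypothetical package name since the deck does not name one (slide 29 proposes packaging mcc as an rpm); sam remains the exception because it is distributed through ups/upd:

       rpm -ivh d0-mcc-1.0-1.i386.rpm        # hypothetical package
       # sam still needs the ups/upd route:
       . /usr/products/etc/setups.sh && setup sam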

  29. Integration with grid (3)
      • Package mcc in rpm
      • Separate programs from working space
      • Use cfg commands to steer mc_runjob
      • Find better place for card files
      • Input structure now created on node

  30. Grid job

      submit:
      #!/bin/sh
      macro=$1
      pwd=`pwd`
      cd /opt/fnal/d0/mcc/mcc-dist
      . mcc_dist_setup.sh
      cd $pwd
      dir=/opt/fnal/d0/mcc/mc_runjob/py_script
      python $dir/Linker.py script=$macro

      PBS job:
      [willem@tbn09 willem]$ cat test.pbs
      # PBS batch job script
      #PBS -o /home/willem/out
      #PBS -e /home/willem/err
      #PBS -l nodes=1
      # Changing to directory as requested by user
      cd /home/willem
      # Executing job as requested by user
      ./submit minbias.macro
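     The PBS script would be queued with the standard PBS client; the deck only shows the script itself, so the submission command below is an assumption:

       qsub test.pbs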

  31. RunJob class for grid

      class RunJob_farm(RunJob_batch) :
          def __init__(self,name=None) :
              RunJob_batch.__init__(self,name)
              self.myType="runjob_farm"
          def Run(self) :
              self.jobname = self.linker.CurrentJob()
              self.jobnaam = string.splitfields(self.jobname,'/')[-1]
              comm = 'chmod +x ' + self.jobname
              commands.getoutput(comm)
              if self.tdconf['RunOption'] == 'RunInBackground' :
                  RunJob_batch.Run(self)
              else :
                  bq = self.tdconf['BatchQueue']
                  dirn = os.path.dirname(self.jobname)
                  print dirn
                  comm = 'cd ' + dirn + '; sh ' + self.jobnaam + ' `pwd` >& stdout'
                  print comm
                  runcommand(comm)

  32. To be decided
      • Location of minimum bias files
      • Location of MC output

  33. Job status
      • Job status is recorded in:
        • fbs
        • /d0/mcc/curr/<job_name>
        • /data/mcc/curr/<job_name>

  34. SAM servers
      • On master node:
        • station
        • fss
      • On master and worker nodes:
        • stager
        • bbftp
