PBS Professional Administration training Rajiv Jaisankar Technical Specialist Altair APAC
Chapter One: Understanding PBS Professional
• What is PBS Professional?
• History of PBS Professional
• PBS Works Online Store
• PBS Professional Documentation
• Altair Global Offices & Technical Support
• Broad Hardware and Operating System Support
• Supported MPI Libraries
• PBS Professional Components & Roles
What is PBS Professional?
• Workload management solution that maximizes the efficiency and utilization of high-performance computing (HPC) resources and improves job turnaround.
• Advanced Scheduling Algorithms
  • Resource-based scheduling
  • Preemptive scheduling
  • Optimized node sorting
  • Enhanced job placement
  • Advance & standing reservations
  • Cycle harvesting across workstations
  • Scheduling across multiple complexes
  • Network topology scheduling
  • Manages both batch and interactive work
• Robust Workload Management
  • Floating flex-based licenses
  • Scalability, with flexible queues
  • Job arrays
  • User and administrator interface
  • Job suspend/resume
  • Application checkpoint/restart
  • Automatic file staging
  • Accounting logs
  • Access control lists
• Reliability, Availability and Scalability
  • Server failover feature
  • Automatic job recovery
  • Provides system monitoring
  • Provides integration with MPI solutions
  • Tested to manage 1,000,000+ jobs per day
History of PBS Professional
• 1993-97: Developed for NASA to replace NQS
• 2000: Veridian formed commercial version of PBS; Released PBS Professional 5.0
• 2003: Altair acquired PBS Professional technology and engineering; Released PBS Professional 5.3
• 2004: Released PBS Professional 5.4
• 2005: Released PBS Professional 7.0 and 7.1
• 2006: Released PBS Professional 8.0
• 2007: Released PBS Professional 9.0 and 9.1
• 2008: Released PBS Professional 9.2
• 2008: Released PBS Professional 10.0
• 2009: Released PBS Professional 10.1
• 2009: Released PBS Professional 10.2
• 2010: Released PBS Professional 10.4
• 2010: Released PBS Professional 11.0
• 2011: Released PBS Professional 11.1
Broad Hardware & Operating System Support
• AMD-Linux and Windows
• Intel-Linux and Windows
• IBM AIX on Power
• IBM Linux on Power
• HP-UX on Itanium 2
• Cray X2, XT, XT3, XT4, XT5, and XT6
• SGI Altix ICE, XE, and UV
• SUN Solaris on SPARC
• Windows 7, XP, Vista, Server 2003, and Server 2008
• Red Hat Enterprise 4, 5, and 6
• SLES 9, 10, and 11
Note: For a detailed list of supported systems & OS, please refer to the latest release notes
Supported MPI Libraries
• Currently supported MPI libraries integrated with PBS:
  • MPICH 1.2.5, 1.2.6 on Linux 2.4 on x86, AMD64, EM64T, Itanium 2
  • MPICH 1.2.5, 1.2.6 on Linux 2.6 on x86, AMD64, EM64T
  • MPICH 1.2.7 on x86 Linux
  • MPICH-GM on Linux
  • Intel MPI 2.0.22 on Linux
  • MPICH2 1.0.3, 1.0.5, 1.0.7 on Linux
  • IBM POE on AIX 5.x and 6.x, including HPS support
  • HP MPI 1.08.03 on HP-UX 11 on Itanium 2
  • HP MPI 2.0.0 on Linux 2.4 & 2.6 on x86, AMD64, EM64T, Itanium 2
  • LAM/MPI 6.5.9/7.0.6/7.1.1 on Linux 2.4/2.6 on x86, AMD64, EM64T, Itanium 2
  • SGI MPI (MPT) on Linux on Altix / Itanium 2 / x86_64 and XE
  • SGI MPI (MPT) over Infiniband
  • MVAPICH 1.2.7/2.0 on Linux
  • OpenMPI 1.4.2 on Linux
PBS Professional Components & Roles
• Batch Server
  • referred to as the PBS Server
  • central focus for a PBS complex
  • routes job to compute host *
  • processes all PBS related commands *
  • provides the basic batch services *
  • server maintains its own server and queue settings *
  • daemon executes as pbs_server.bin
• Scheduler
  • referred to as the PBS Scheduler
  • queries list of running and queued jobs from the PBS Server *
  • queries queue, server, and node properties *
  • queries resource consumption and availability from the PBS MOM *
  • sorts available jobs according to local scheduling policies
  • determines which job is eligible to run next
  • daemon executes as pbs_sched
• MOM
  • referred to as the PBS MOM
  • executes jobs at request of PBS Scheduler
  • monitors resource usage of running jobs
  • enforces resource limits on jobs
  • reports system resource limits, configuration *
  • daemon executes as pbs_mom
* This information is for debugging purposes only. It may change in future releases and should not be relied upon.
Complex Configurations
• Single Execution System: all 3 PBS components (Server, Scheduler, MOM) on a single host.
Complex Configurations, cont.
• Multiple Execution System: Server and Scheduler on a front end system, with a MOM on each execution host.
Note: The PBS Server machine may be a different architecture (UNIX/Linux) from the execution hosts. A PBS complex can be either UNIX/Linux or Windows, but not both.
Chapter Two - Installation of PBS Professional
• Pre-Installation
• Basic Installation
• Post-Installation
• PBS Installed Directory Structure
Post Installation – PBS Configuration File
• How does the PBS init script determine which services to invoke?
  • The init script reads the configuration file: “/etc/pbs.conf”
• Format of a pbs.conf file:
    PBS_EXEC=/opt/pbs/default
    PBS_HOME=/var/spool/PBS
    PBS_START_SERVER=1
    PBS_START_MOM=1
    PBS_START_SCHED=1
    PBS_SERVER=traintb16
    PBS_DATA_SERVICE_USER=pbsuser01
• For each PBS_START_<daemon> variable:
  • 0 will prevent init from starting or stopping the daemon
  • 1 will have init start or stop the daemon
File System and File Transfer • Sites will need to determine how users will access data files • Most common file sharing methods used by PBS customers: • NFS Network File System (most widely used) • GFS Global File System • What method of file copy will be used? • rcp remote copy (default used by PBS) • scp secure copy • cp Linux/Unix copy
User’s PBS Environment • Delivery of STDOUT/STDERR files • PBS should be able to copy user’s STDOUT and STDERR files to the appropriate directory without password challenge • Stage input/output files • Users may need to import/export files related to the job before/after execution • Users’ Data Transfer • Users should be able to transfer data without having to supply password, (e.g. rcp/scp) • Users must have a valid account • Users should be able to log onto execution host(s) and should have a valid username and group
Altair LM-X License Management • PBS Professional 11.0 is now licensed by Altair License Management System (ALM) based on X-Formation’s LM-X license management system • Altair’s ALM package for PBS can be downloaded from: • https://secure.altair.com/UserArea/ • We recommend that Altair’s ALM be installed and configured before installing PBS Professional v11.0 • For additional information on Altair’s ALM refer to the Altair License Manager System 11.0 Installation and Operations Guide
Chapter Three - PBS Administration
• Process flow of a PBS job
• PBS installed directory structure
• Directory structure of $PBS_HOME
• Directory structure of $PBS_EXEC
• Understanding the PBS configuration file
• Manually starting and stopping PBS daemons
• Impact of PBS daemon restarts on running jobs
• Network ports used by PBS
• Status of PBS complex
Process Flow of a PBS Job – User Level
[Diagram: job 6.traintb16, PBS Server, PBS Scheduler, and Hosts A, B, and C with resources ncpus, mem, host]
1. User submits job
2. PBS Server returns a job ID
3. PBS Scheduler requests a list of resources from the Server *
4. PBS Scheduler sorts all the resources and jobs *
5. PBS Scheduler informs PBS Server which host(s) the job can run on *
6. PBS Server pushes job script to execution host(s)
7. PBS MOM executes job script
8. PBS MOM periodically reports resource usage back to PBS Server *
9. When job is completed PBS MOM kills the job script
10. PBS Server de-queues job from PBS complex
Note: * This information is for debugging purposes only. It may change in future releases.
PBS Installed Directory Structure
• PBS Professional software is installed in two separate directories
• $PBS_EXEC “/opt/pbs/default” contains:
  • PBS daemons
  • Libraries
  • Man pages
  • Support tools
  • Administrator and user PBS commands
• $PBS_HOME “/var/spool/PBS” contains:
  • PBS daemon configurations
  • PBS daemon logs
  • Other various file-related directories
PBS Directory Structure - PBS_HOME
• Directory structure of $PBS_HOME
  • Daemon configuration directories: server_priv, mom_priv, sched_priv
  • Daemon log directories: server_logs, mom_logs, sched_logs
  • Misc directories/files: spool, undelivered, checkpoint, aux, pbs_environment, pbs_version, datastore
This information is for debugging purposes only. It may change in future releases.
PBS Directory Structure - PBS_EXEC
• Directory structure of $PBS_EXEC
  • bin, sbin: binaries of PBS daemons and user/admin PBS commands
  • lib, man, include: libraries, manual pages, and header files
  • Other directories: etc, tcltk, unsupported, python, pgsql
This information is for debugging purposes only. It may change in future releases.
Directory Structure of $PBS_HOME/server_priv
• Detailed structure of $PBS_HOME/server_priv *
    accounting      directory containing daily accounting logs
    db_password     database password - encrypted
    hooks           directory containing custom hook definitions
    jobs            directory containing users’ job scripts
    prov_tracking   OS provisioning directory
    server.lock     PBS server PID lock file
    svrlive         used for failover configuration
    tracking        PBS license related file
    usedlic         PBS license related file
* This information is for debugging purposes only. It may change in future releases.
PBS Configuration File – pbs.conf
• PBS installs a configuration file “pbs.conf” in the “/etc/” directory. PBS uses this file to determine:
  • Which daemons to start/stop
  • What PBS server to communicate with
  • What file copy mechanism to use
• Each server/scheduler, execution, and client host has a pbs.conf file installed
• Refer to Administrator’s Guide; Chapter 13; Section 13.1.3; pages 715-716 for a complete listing of configuration file variables
Default contents of pbs.conf:
    PBS_EXEC=/opt/pbs/default
    PBS_HOME=/var/spool/PBS
    PBS_START_SERVER=1
    PBS_START_MOM=1
    PBS_START_SCHED=1
    PBS_SERVER=hostname.domain
    PBS_DATA_SERVICE_USER=pbsuser01
PBS Configuration File – pbs.conf, cont.
• How pbs.conf differs between the PBS Server and PBS MOM hosts:
PBS EXECUTION HOST
    PBS_EXEC=/opt/pbs/default
    PBS_HOME=/var/spool/PBS
    PBS_START_SERVER=0
    PBS_START_MOM=1
    PBS_START_SCHED=0
    PBS_SERVER=traintb16
PBS SERVER HOST
    PBS_EXEC=/opt/pbs/default
    PBS_HOME=/var/spool/PBS
    PBS_START_SERVER=1
    PBS_START_MOM=0
    PBS_START_SCHED=1
    PBS_SERVER=traintb16
    PBS_DATA_SERVICE_USER=pbsuser01
Note: Only 1 active instance of a PBS Server and PBS Scheduler can be running within a PBS complex
PBS Configuration File – pbs.conf, cont.
• The variable PBS_START_<daemon> sets which daemons are allowed to start when the “/etc/init.d/pbs” script runs. For example:
/etc/pbs.conf
    PBS_EXEC=/opt/pbs/default
    PBS_HOME=/var/spool/PBS
    PBS_START_SERVER=1
    PBS_START_MOM=0
    PBS_START_SCHED=1
    PBS_SERVER=traintb16
    PBS_DATA_SERVICE_USER=pbsuser01
This is the expected behavior when executing “/etc/init.d/pbs start”:
  • pbs_server daemon will be invoked
  • pbs_mom daemon will not be invoked
  • pbs_sched daemon will be invoked
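The start decision described above can be sketched in a few lines of shell. This is only an illustration, not the actual /etc/init.d/pbs script: the helper name daemons_to_start and the temporary file are hypothetical, and the only behavior assumed is the one stated on the slide (a value of 1 means the daemon is started).

```shell
#!/bin/bash
# Sketch only: report which daemons a pbs.conf-style file enables.
# daemons_to_start is a hypothetical helper, not part of PBS.
daemons_to_start() (
    while IFS='=' read -r key value; do
        if [ "$value" = "1" ]; then
            case "$key" in
                PBS_START_SERVER) echo "pbs_server" ;;
                PBS_START_MOM)    echo "pbs_mom" ;;
                PBS_START_SCHED)  echo "pbs_sched" ;;
            esac
        fi
    done < "$1"
)

# Demo against a temporary copy of the example file (not the real /etc/pbs.conf).
conf=$(mktemp)
cat > "$conf" <<'EOF'
PBS_EXEC=/opt/pbs/default
PBS_HOME=/var/spool/PBS
PBS_START_SERVER=1
PBS_START_MOM=0
PBS_START_SCHED=1
PBS_SERVER=traintb16
PBS_DATA_SERVICE_USER=pbsuser01
EOF

daemons_to_start "$conf"    # lists pbs_server and pbs_sched, one per line
rm -f "$conf"
```

Running the same helper against an execution host's file (PBS_START_MOM=1, the others 0) would list only pbs_mom.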
Starting/Stopping PBS Using Start/Stop Script • Starting/stopping PBS • Why use start/stop script? • Vnode definitions are created only when the start script is used; they are not created when the daemons are started manually • Vnode definitions are required if PBS is to manage cpusets on a machine • The pbs_mom daemon on the Altix and the Cray must be started via the start script • Using the pbs start/stop script to stop PBS will preserve jobs (the server gets a ‘qterm -t quick’) • Location of start/stop script (Linux) /etc/init.d/pbs start
Status of PBS Complex
• Use qstat -Bf to view the status of a PBS complex
Server: traintb16
    server_state = Active
    server_host = traintb16.prog.altair.com
    scheduling = True
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
    default_queue = workq
    log_events = 511
    mail_from = adm
    query_other_jobs = True
    resources_default.ncpus = 1
    default_chunk.ncpus = 1
    scheduler_iteration = 600
    FLicenses = 33
    resv_enable = True
    node_fail_requeue = 310
    max_array_size = 10000
    pbs_license_info = 7788@localhost
    pbs_license_min = 1
    pbs_license_max = 2147483647
    pbs_license_linger_time = 3600
    license_count = Avail_Global:32 Avail_Local:1 Used:0 High_Use:0
    pbs_version = PBSPro_22.214.171.124450
    eligible_time_enable = False
    max_concurrent_provision = 5
Manually Starting/Stopping PBS Daemons
• PBS Server
    Start: $PBS_EXEC/sbin/pbs_server
    Stop:  $PBS_EXEC/bin/qterm -t [quick|delay|immediate]
• PBS Scheduler
    Start: $PBS_EXEC/sbin/pbs_sched
    Stop:  $PBS_EXEC/bin/qterm -s
           kill -INT <pbs_sched_pid>
• PBS MOM
    Start: $PBS_EXEC/sbin/pbs_mom
    Stop:  $PBS_EXEC/bin/qterm -m   (this will shut down all the MOMs)
           kill -INT <pbs_mom_pid>
Network Ports Used By PBS Daemons • UNIX/Linux network ports
Chapter Four - Job Management
• Defining a Job Script
• Types of Jobs
• Submitting Jobs
• Process Flow of a PBS Job
• Querying PBS Jobs
• Setting Job Attributes
• Requesting Job Resources
• Default Job Attributes
• Order of Default Resources Assigned to Jobs
• Job Exit Codes
Defining a Job Script
• What is a job script?
  • A file that contains a set of instructions to execute a series of commands. Also known as a “batch job”.
Example of a job script:
    #!/bin/bash
    sleep 5
    /home/altair/scripts/optistruct -cpu 2 handlebar.fem
The first line names the shell interpreter; the remaining lines are the commands to run.
Submitting Jobs - Using “qsub”
• Submitting a job script to PBS
  • Using “qsub” command
    Usage: qsub <job_attributes/resources> <job_script>
    Example: qsub -l select=1:ncpus=1 test_script
• If the job is accepted by PBS, a job identifier is returned. This job identifier consists of the job number and the name of the server host the job was submitted to:
    0.traintb16
• Note:
  - If a job is rejected, no job identifier is returned, but the job ID counter is still incremented
  - The largest possible job ID is 7 digits: 9,999,999. Once reached, the counter resets to zero
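Because the identifier has the form <job number>.<server host>, a script can split it with plain shell parameter expansion. A minimal sketch, using the identifier from the process-flow slide as a fixed sample value; in a real session it would be captured from qsub's output.

```shell
#!/bin/bash
# Sketch: split a job identifier of the form <job number>.<server host>.
jobid="6.traintb16"       # sample value; in practice: jobid=$(qsub ... test_script)
job_number=${jobid%%.*}   # everything before the first dot
server=${jobid#*.}        # everything after the first dot
echo "job number: $job_number, server: $server"
```

Splitting at the first dot keeps a fully qualified server name (for example rhel5.lab.altair.com) intact.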
Requesting Job Resources – Built in Resources Note: For complete listing refer to PBS Reference Documentation Guide pages 336-340
Types of Jobs • There are two types of PBS jobs • Batch Job • A script that contains commands or tasks to execute site specific applications • Interactive Job • Runs like a batch job, but when it runs, the user’s terminal input and output are connected to the execution host; similar to a login session. • Allows users to debug a job script • Verify a new application properly runs
Setting Job Attributes – Using PBS Directives
• Job attributes can be set in 2 different ways:
  • Method 1: on the qsub command line
      qsub -N <job_name> <job_script>
  • Method 2: within a job script as a PBS directive
      #!/bin/bash
      #PBS -N test_run_01
      #PBS -l select=4:ncpus=4:mem=16GB
      #PBS -l place=scatter
      #PBS -j oe
      #PBS -o /home/pbsuser01/OUTPUTS
      optistruct -ncpu 2 handlebar.fem
Note:
  - PBS expects the directives to begin on the second line and to be on consecutive lines thereafter. qsub stops processing directives at the first executable line; comment lines are ignored.
  - Command line arguments will override PBS directives.
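The scanning rule in the note can be illustrated with a short shell sketch. This is not how qsub is implemented; list_directives is a hypothetical helper that mimics the stated rule, and treating blank lines like comments is an assumption of the sketch.

```shell
#!/bin/bash
# Sketch: print the #PBS directives that the scanning rule above would pick up.
# list_directives is a hypothetical helper, not a PBS command.
list_directives() (
    while IFS= read -r line; do
        case "$line" in
            '#PBS'*) echo "$line" ;;  # directive line
            '#'*|'') ;;               # shebang, ordinary comment, or blank line (assumed skipped)
            *) break ;;               # first executable line: stop scanning
        esac
    done < "$1"
)

script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#PBS -N test_run_01
# an ordinary comment
#PBS -j oe
sleep 5
#PBS -o /tmp/too_late
EOF

list_directives "$script"   # prints the -N and -j directives only
rm -f "$script"
```

The last directive in the sample is not listed because it comes after the first executable line (sleep 5).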
Requesting Job Resources – Understanding Resources
• What are job resources?
  • Applications sometimes need certain types and amounts of system resources such as:
    • memory
    • ncpus
    • scratch space
  • During job submission, required resources can be requested
• How can these resources be requested within PBS?
  • PBS defines these resources as chunks or as job-wide resources
• What are “chunks”?
  • set of resources that are allocated as a unit to a job
  • smallest set of resources that are allocated to a job
  • for example: ncpus, mem
  • requested in a “select” statement
      qsub -l select=<#>:ncpus=<#>:mem=<#>
• What are “job-wide resources”?
  • resources that are associated with the entire job
  • for example: placement of jobs, walltime
Requesting Job Resources – Using Chunks & Select
• Requesting resources in chunks
  • Resources which are to be allocated as a unit to a job
  • Smallest set of resources to be allocated to a single job
  • Host/Vnode level request
Syntax: qsub -l select=[N:]chunk[+[N:]chunk ...]
For example:
• Job requesting 3 chunks with 2 CPUs per chunk:
    qsub -l select=3:ncpus=2
• Job requesting 2 chunks with 1 CPU and 10GB of memory each, plus another set of 3 chunks with 2 CPUs and 8GB of memory each:
    qsub -l select=2:ncpus=1:mem=10gb+3:ncpus=2:mem=8gb
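To check what a select statement adds up to, the arithmetic (number of chunks times CPUs per chunk, summed over the "+"-separated parts) can be sketched in shell. total_ncpus is a hypothetical helper for illustration; it only reads the text of the statement and does not apply server-side defaults such as default_chunk.ncpus.

```shell
#!/bin/bash
# Sketch: total CPUs requested by a select statement.
# total_ncpus is a hypothetical helper; chunks without ncpus count as 0 here.
total_ncpus() (
    set -f                      # no filename expansion while splitting
    total=0
    IFS='+'
    for chunk in $1; do         # split the statement into chunk specifications
        count=1
        ncpus=0
        IFS=':'
        for field in $chunk; do # split one specification into its fields
            case "$field" in
                ncpus=*) ncpus=${field#ncpus=} ;;
                *=*)     ;;                   # other resources (mem, ...) are ignored
                *)       count=$field ;;      # leading N: number of chunks
            esac
        done
        total=$(( total + count * ncpus ))
    done
    echo "$total"
)

total_ncpus "3:ncpus=2"                             # 3 x 2 = 6
total_ncpus "2:ncpus=1:mem=10gb+3:ncpus=2:mem=8gb"  # 2 x 1 + 3 x 2 = 8
```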
Requesting Job Resources – Job Placement
• Placing jobs on hosts/vnodes
  • Users can specify how their multi-node job is placed within a PBS complex based on the resources requested
  • The place statement controls how the job is placed on the hosts/vnodes from which resources may be allocated for the job
• Using the “place” statement:
    Usage: qsub -l place=[<type>][:<sharing>][:<group>]
    Example: qsub -l select=1:ncpus=2:mem=100MB -l place=pack script
Requesting Job Resources – Job Wide Resources
• Requesting job-wide limits
  • Resources that are requested outside a select statement
  • Such as walltime or cput
• Requesting resources at server or queue level
  • Resources that are not tied to specific host(s)/vnode(s)
For example:
    qsub -l select=1:ncpus=1:mem=100MB -l walltime=01:00:00 myscript
Requesting Job Resources – SMP Jobs
• SMP jobs are meant to run on a single execution host
• Submitting an SMP PBS job
    qsub -l select=x:ncpus=x -l place=pack
  Note: all chunks will be placed on a single host
• Additional options
  • Place a job on a host that already has a job running on it
      qsub -l select=1:ncpus=2 -l place=pack:shared
  • Place a job on a host on which no other jobs are running and make that host exclusive to it
      qsub -l select=1:ncpus=2 -l place=pack:excl
Requesting Job Resources – MPI Jobs • MPI jobs run on multiple hosts, using an MPI application • PBS has tightly integrated wrapper scripts for various MPI implementations • Allows PBS to track spawned MPI processes • More accurate tracking of all resources being consumed across all the hosts • Accurately record CPU accounting utilization on all nodes • Accurately enforce requested job limits • Automatically "clean up" stray MPI processes on all nodes • Require no changes other than wrapping
Example process tree of an MPI job running under pbs_mom (long lines are truncated in the source):
    1777 ? Ss 0:00 /opt/pbs/default/sbin/pbs_mom
    1779 ? Ss 0:00  \_ -bash
    1810 ? S  0:00      \_ /bin/sh /var/spool/PBS_10.4.0.101257/mom_priv/jobs/1746.rhel5.lab.altair.com.SC
    1812 ? S  0:00          \_ /opt/mpich2-install/bin/mpirun -f /var/spool/PBS_10.4.0.101257/aux/1746.rhel5.lab.altair.com /usr/local/gromacs_mpich2-1.3.2p1/bi
    1813 ? S  0:00              \_ /opt/mpich2-install/bin/hydra_pmi_proxy --control-port rhel54:37470 --demux poll --pgid 0 --proxy-id 0
    1814 ? R  0:14                  \_ /usr/local/gromacs_mpich2-1.3.2p1/bin/mdrun -f /test/bench/d.dppc/grompp.mdp -c /test/bench/d.dppc/conf.gro -p /test/benc
Requesting Job Resources – Submitting MPI Jobs
• Method 1
  • Request: 4-way MPI job with 2 CPUs and 2GB memory per MPI task, with one MPI task per host, where each host has 2 CPUs and 2 GB memory
      qsub -l select=4:ncpus=2:mem=2GB -l place=scatter
  • Variable $PBS_NODEFILE contains list of vnodes
      VnodeA
      VnodeB
      VnodeC
      VnodeD
  • Sample of an MPI job script
      #!/bin/bash
      #PBS -l select=4:mem=2GB:mpiprocs=2
      #PBS -l place=scatter
      mpirun -np 8 -mem 8GB file
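A job script does not have to hard-code the process count; it can be derived from the node file. The sketch below assumes the file holds one entry per line, as in the vnode list above, and uses a temporary stand-in file so it can run outside a PBS job. Inside a real job, PBS sets $PBS_NODEFILE itself.

```shell
#!/bin/bash
# Sketch: count the entries in the node file instead of hard-coding -np.
# The stand-in file below replaces the real $PBS_NODEFILE for this demo only.
nodefile=$(mktemp)
printf '%s\n' VnodeA VnodeB VnodeC VnodeD > "$nodefile"
PBS_NODEFILE=$nodefile

nprocs=$(wc -l < "$PBS_NODEFILE")
nprocs=$(( nprocs ))   # normalizes padding that some wc implementations add
echo "entries in node file: $nprocs"
# A job script could then call, for example: mpirun -np "$nprocs" <application>
rm -f "$nodefile"
```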
Requesting Job Resources – Submitting MPI Jobs, cont.
• Method 2
  • Request: 4-way MPI job with 2 CPUs and 2GB memory per MPI task; request up to 4 hosts, where each host has 4 CPUs and 4 GB memory
      qsub -l select=4:ncpus=2:mem=2GB -l place=free
  • Variable $PBS_NODEFILE contains list of vnodes
      VnodeA
      VnodeB
  • Sample of an MPI job script
      #!/bin/bash
      #PBS -l select=4:mem=2GB:mpiprocs=2
      #PBS -l place=free
      mpirun -np 8 -mem 8GB file
Requesting Job Resources - Boolean Resources
• A resource that can be requested as true or false
• To request chunks that have the resource ‘optistruct’, the qsub request line would be:
    qsub -l select=1:ncpus=1:optistruct=true
  The scheduler will only place this job on vnodes that have the resource “optistruct” set to “true”
• If a boolean resource is requested as job-wide, e.g.:
    qsub -l select=1:ncpus=1 -l optistruct=true
  PBS will check if it is available at the server or queue level, not at the vnode/host level
Default Job Attributes • PBS includes default values for resources that the user doesn’t specify during job submission • The following are resource defaults assigned to a job: • default_chunk.ncpus=1 • resources_default.ncpus=1 • resources_default.walltime=<5 years> Note: Root and managers can specify additional default resources
Querying Jobs – Using “qstat”
• To show a list of current PBS jobs’ status
  • Using “qstat” command
    Usage: qstat <-a, -n, -s, -1, -w>
    Example: qstat
    Job id            Name             User         Time Use S Queue
    ----------------  ---------------- -----------  -------- - -----
    6.traintb16       test_script      pbsuser01    00:00:00 R workq
    7.traintb16       jobA             pbsuser02    00:00:00 R workq
    8.traintb16       test_2           pbsuser04           0 Q workq
    9.traintb16       test_script      pbsuser01    00:00:00 R workq
Note: If a job was deleted or completed then it can no longer be listed via qstat unless the PBS complex has enabled the job history functionality
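For a quick summary, the listing can be reduced to a count of jobs per state. The sketch below assumes the default column layout shown above (two header lines, state in the fifth whitespace-separated field) and job names without spaces; count_states is a hypothetical helper, and the sample text is copied from the example rather than taken from a live server.

```shell
#!/bin/bash
# Sketch: count jobs per state from default qstat output read on stdin.
# Assumes two header lines and the state letter in field 5.
count_states() {
    awk 'NR > 2 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample text from the example above; in practice: qstat | count_states
count_states <<'EOF'
Job id            Name             User         Time Use S Queue
----------------  ---------------- -----------  -------- - -----
6.traintb16       test_script      pbsuser01    00:00:00 R workq
7.traintb16       jobA             pbsuser02    00:00:00 R workq
8.traintb16       test_2           pbsuser04           0 Q workq
9.traintb16       test_script      pbsuser01    00:00:00 R workq
EOF
```

For the sample listing this reports one queued job (Q 1) and three running jobs (R 3).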
Querying Jobs – Additional qstat Options
-a  job name, session id, # nodes req, # ncpus req, req’d mem, req’d time, and elapsed time
                                                           Req'd  Req'd   Elap
    Job ID         Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    -------------  -------- -------- ---------- ------ --- --- ------ ----- - -----
    8.traintb16    pbsuser0 workq    test_scrip   6556   1   8    --    --  R 00:07
-s  same as option -a, but with comments
                                                           Req'd  Req'd   Elap
    Job ID         Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    -------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    8.traintb16    pbsuser0 workq    test_scrip   5556   1   8    --    --  R 00:07
       Job run at Wed Jul 05 at 14:48 on (traintb16:ncpus=8)
-n  same as option -a, but indicates which execution vnode(s) the job is running on
                                                           Req'd  Req'd   Elap
    Job ID         Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    -------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    8.traintb16    pbsuser0 workq    test_scrip   5556   1   8    --    --  R 00:07
       traintb16/0
Note:
  - Adding the option “-1” outputs each entry on a single line instead of wrapping around
  - Adding “-w” shows the full output of individual fields
Job Attributes - Viewing Job Attributes
• To view job attributes that were assigned to a particular job, use the qstat command.
    Usage: qstat -f <job_id>
    Example: qstat -f 1.traintb16
    Job Id: 1.traintb16
        Job_Name = sleep_job
        Job_Owner = email@example.com
        resources_used.cpupercent = 0
        resources_used.cput = 00:00:00
        resources_used.mem = 1028kb
        resources_used.ncpus = 1
        resources_used.vmem = 18440kb
        resources_used.walltime = 00:00:00
        job_state = R
        queue = workq
        server = traintb16
        Checkpoint = u
        ctime = Tue May 5 17:49:09 2010
        Error_Path = traintb16.prog.altair.com:/home/pbsuser01/boo/sleep_job.e1
        exec_host = traintb16/0
        exec_vnode = (traintb16:ncpus=1)
        Hold_Types = n
        Join_Path = n
        Keep_Files = n
        Mail_Points = a
        mtime = Tue May 5 17:49:09 2010
        Output_Path = traintb16.prog.altair.com:/home/pbsuser01/boo/sleep_job.o1
        Priority = 0
        qtime = Tue May 5 17:49:09 2010
        Rerunable = True
Job Attributes - Viewing Job Attributes, cont.
        Resource_List.ncpus = 1
        Resource_List.nodect = 1
        Resource_List.place = pack
        Resource_List.select = 1:ncpus=1
        stime = Tue May 5 17:49:11 2010
        session_id = 11535
        jobdir = /home/pbsuser01
        substate = 42
        Variable_List = PBS_O_HOME=/home/pbsuser01,PBS_O_LANG=en_US.UTF-8,
            PBS_O_LOGNAME=pbsuser01,
            PBS_O_PATH=/home/pbsuser01/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/pbs/default/bin:/opt/pbs/default/sbin,
            PBS_O_MAIL=/var/spool/mail/pbsuser01,PBS_O_SHELL=/bin/bash,
            PBS_O_HOST=traintb16.prog.altair.com,
            PBS_O_WORKDIR=/home/pbsuser01/boo,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq
        comment = Job run at Tue May 05 at 17:49 on (traintb16:ncpus=1)
        etime = Tue May 5 17:49:09 2010
        Submit_arguments = -l select=1:ncpus=1 my_script
• Note: Running as root or PBS Manager will output additional information
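When only one attribute is needed, the "name = value" layout of this output lends itself to a small filter. A minimal sketch: job_attr is a hypothetical helper, it handles only attributes whose value fits on a single line (not wrapped values such as Variable_List), and the sample lines are taken from the output above instead of a live qstat call.

```shell
#!/bin/bash
# Sketch: pull a single attribute out of "qstat -f" style output on stdin.
# job_attr is a hypothetical helper; wrapped multi-line values are not handled.
job_attr() {
    awk -v name="$1" -F ' = ' '
        { key = $1; sub(/^[ \t]+/, "", key) }   # strip leading indentation
        key == name { print $2; exit }
    '
}

# In practice: qstat -f 1.traintb16 | job_attr job_state
printf '%s\n' \
    'Job Id: 1.traintb16' \
    '    Job_Name = sleep_job' \
    '    job_state = R' \
    '    queue = workq' | job_attr job_state
```

For the sample lines the filter prints R, the job state.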
Querying Jobs – Using “tracejob”
• Using tracejob to obtain comprehensive information about a job
  • Using “tracejob” command
    Usage: tracejob -n<days> <job id>
    Example: tracejob -n4 0.traintb16
    Job: 0.traintb16
    05/05/2010 17:43:35 S enqueuing into workq, state 1 hop 1
    05/05/2010 17:43:35 S Job Queued at request of firstname.lastname@example.org, owner = email@example.com, job name = sleep_job, queue = workq
    05/05/2010 17:45:08 L Considering job to run
    05/05/2010 17:45:08 S Job Run at request of Scheduler@traintb16.prog.altair.com on exec_vnode (traintb16:ncpus=1)
    05/05/2010 17:45:08 M Started, pid = 11491
    05/05/2010 17:45:10 S Job Modified at request of Scheduler@traintb16.prog.altair.com
    05/05/2010 17:45:10 L Job run
    05/05/2010 17:45:14 M task 00000001 terminated
    05/05/2010 17:45:14 M Terminated
    05/05/2010 17:45:15 S Obit received momhop:1 serverhop:1 state:4 substate:42
    05/05/2010 17:45:15 S Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=3056kb resources_used.ncpus=1 resources_used.vmem=39392kb resources_used.walltime=00:00:07
    05/05/2010 17:45:15 M task 00000001 cput= 0:00:00
    05/05/2010 17:45:15 M traintb16 cput= 0:00:00 mem=3056kb
    05/05/2010 17:45:15 M Obit sent
    05/05/2010 17:45:15 M copy file request received
    05/05/2010 17:45:15 M staged 2 items out over 0:00:00
    05/05/2010 17:45:15 M delete job request received
    05/05/2010 17:45:15 S dequeuing from workq, state 5
    05/05/2010 17:45:15 M kill_job
    05/05/2010 17:45:15 M work proc outstanding
Key: S = Server, L = Scheduler, M = MOM
Note: Information is taken from the server, scheduler, and MOM logs local to the machine where tracejob runs. By default only the past 24 hours are searched; -n<days> extends the search.