Running job monitor
(under construction)
A running job can be monitored/interrogated with a local utility called
u.job-monitor
There are 2 ways to activate this utility:
- at job submission time:
  ord_soumet .... -prolog jobmonitor ....
- with an explicit command in the job itself:
  u.job-monitor &
caveat: in the case of an MPI job the only node that will be monitored is node 0 (primary node)
the job monitor uses 3 files found in directory $HOME/top_in_batch for each monitored job
- jobname_node_jobid.top
  refreshed every 10 seconds with the output of a top command for processes belonging to the user
- jobname_node_jobid.cmd
  if the user writes a line in this file, then
  - this line is executed on the primary node
  - the output (stdout and stderr) of said command is appended to the jobname_node_jobid.out file
  - the jobname_node_jobid.cmd file is erased and re-created
- jobname_node_jobid.out
  the output of the commands from the jobname_node_jobid.cmd file
node will be replaced by the host name of the primary node of the job
jobid will be replaced by the PBS job id of said job
jobname will be replaced by the job name
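The naming scheme and the .cmd/.out exchange described above can be sketched in shell. This is only an illustration and not part of u.job-monitor: the helper names monitor_file and monitor_send and the MONITOR_DIR variable are hypothetical, and appending to the .cmd file (rather than overwriting it) is an assumption.

```shell
#!/bin/sh
# Sketch only: monitor_file, monitor_send and MONITOR_DIR are hypothetical
# names used for illustration, they are not provided by u.job-monitor.

# Directory where the job monitor keeps its files (from the description above).
MONITOR_DIR="${MONITOR_DIR:-$HOME/top_in_batch}"

# Print the path of one of the three monitor files.
# usage: monitor_file jobname node jobid top|cmd|out
monitor_file() {
    printf '%s/%s_%s_%s.%s\n' "$MONITOR_DIR" "$1" "$2" "$3" "$4"
}

# Write one command line into the .cmd file of a monitored job.
# Assumption: appending a line is an acceptable way to "write a line".
# usage: monitor_send jobname node jobid 'command line'
monitor_send() {
    printf '%s\n' "$4" >> "$(monitor_file "$1" "$2" "$3" cmd)"
}

# Example use, with the job from the sample listing:
#   monitor_send myjob sw-2r13-n21 94568 'df -h .'
#   cat "$(monitor_file myjob sw-2r13-n21 94568 out)"
```

The output of the command shows up in the .out file only after the monitor has picked up the line, so the cat may have to be repeated after a short wait.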
Sample output from jobname_node_jobid.top (guillimin job)
file: myjob_sw-2r13-n21_94568.top (job number 94568, primary host is sw-2r13-n21)

top - 13:10:31 up 11 days, 10:51, 0 users, load average: 9.36, 6.19, 7.64
Tasks: 240 total, 3 running, 237 sleeping, 0 stopped, 0 zombie
Cpu(s): 10.6%us, 2.0%sy, 46.5%ni, 40.8%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 37020880k total, 2229884k used, 34790996k free, 202236k buffers
Swap: 25165780k total, 999100k used, 24166680k free, 134044k cached

  PID USER   PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+ COMMAND
17809 winger 25  0 24.3g 138m  21m S 299.7  0.4 4:17.49 ATM_MOD.Abs
17812 winger 25  0 24.2g 136m  14m R 299.7  0.4 4:18.64 ATM_MOD.Abs
17810 winger 25  0 24.2g 139m  16m R 297.7  0.4 4:18.72 ATM_MOD.Abs
17811 winger 25  0 24.2g 135m  15m S 297.7  0.4 4:18.49 ATM_MOD.Abs
16674 winger 21  0 84112 2528 1856 S   0.0  0.0 0:00.04 bash
16811 winger 15  0 13316  780  412 S   0.0  0.0 0:00.00 pbs_demux
17360 winger 21  0 85136 1468  792 S   0.0  0.0 0:00.00 bash
17369 winger 20  0 85136 1276  596 S   0.0  0.0 0:00.02 bash
17377 winger 18  0 63896 1172  980 S   0.0  0.0 0:00.00 u.job-monitor
17402 winger 18  0  3808  492  420 S   0.0  0.0 0:00.00 repeat_command
17707 winger 18  0 65572 2024 1160 S   0.0  0.0 0:00.00 Um_runmod.ksh
17756 winger 18  0 65572 2020 1180 S   0.0  0.0 0:00.00 Um_model.ksh
17791 winger 18  0  3680  180  100 S   0.0  0.0 0:00.00 Climat_r.monito
17792 winger 19  0 65572 1980 1156 S   0.0  0.0 0:00.00 r.mpirun
17802 winger 15  0 47772 4112 2520 S   0.0  0.0 0:00.01 mpiexec
17805 winger 18  0 65444 1872 1080 S   0.0  0.0 0:00.00 POE_SCRIPT_1779
17806 winger 18  0 65444 1876 1080 S   0.0  0.0 0:00.00 POE_SCRIPT_1779
17807 winger 18  0 65444 1872 1080 S   0.0  0.0 0:00.00 POE_SCRIPT_1779
17808 winger 18  0 65444 1876 1080 S   0.0  0.0 0:00.00 POE_SCRIPT_1779
17876 winger 19  0 63896 1120  928 S   0.0  0.0 0:00.00 sh
17877 winger 15  0 30892 2176 1460 R   0.0  0.0 0:00.00 top
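A snapshot like the sample above can be filtered to show only the processes that are actually consuming CPU. The following is a minimal sketch: the function name busy_processes is hypothetical, and it assumes the .top file has the same 12-column layout as the sample (%CPU in column 9, COMMAND in column 12).

```shell
#!/bin/sh
# Sketch only: busy_processes is a hypothetical helper, not part of u.job-monitor.
# Assumes the column layout of the sample: PID is field 1, %CPU is field 9,
# COMMAND is field 12.

# Print "PID %CPU COMMAND" for every process at or above a %CPU threshold.
# usage: busy_processes file.top [min_cpu]    (min_cpu defaults to 50)
busy_processes() {
    awk -v min="${2:-50}" '
        seen && $9 + 0 >= min { print $1, $9, $12 }  # process rows after the header
        $1 == "PID"           { seen = 1 }           # header row marks start of table
    ' "$1"
}

# Example use:
#   busy_processes "$HOME/top_in_batch/myjob_sw-2r13-n21_94568.top" 100
```

Applied to the sample listing with a threshold of 100, this would keep only the four ATM_MOD.Abs processes.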