Running job monitor

An article from Informaticiens département des sciences de la Terre et l'atmosphère

Current revision as of 15 December 2011, 14:02

under construction

A running job can be monitored/interrogated with a local utility called

u.job-monitor


There are 2 ways to activate this utility:

  • at job submission time:
    ord_soumet ....  -prolog jobmonitor ....
  • with an explicit command in the job itself
    u.job-monitor &


Caveat: for an MPI job, only node 0 (the primary node) is monitored.


The job monitor uses 3 files per monitored job, all found in the directory $HOME/top_in_batch:

  • jobname_node_jobid.top
    refreshed every 10 seconds with the output of a top command for processes belonging to the user
  • jobname_node_jobid.cmd
    if the user writes a line in this file then
    • this line is executed on the primary node
    • the output (stdout and stderr) of said command is appended to the jobname_node_jobid.out file
    • the jobname_node_jobid.cmd file is erased and re-created
  • jobname_node_jobid.out
    • the output of the command from the jobname_node_jobid.cmd file

node will be replaced by the host name of the primary node of the job

jobid will be replaced by the PBS job id of said job

jobname will be replaced by the job name
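
As an illustration of the naming scheme and of the .cmd/.out mechanism described above, here is a minimal shell sketch. The helper names monitor_prefix and send_job_command and the MONITOR_DIR override are hypothetical and are not part of u.job-monitor. The values myjob, sw-2r13-n21 and 94568 are example values only. How often the monitor checks the .cmd file is not documented here, so the waiting time in the commented usage is a guess.

```shell
#!/bin/bash
# Directory used by the job monitor. The MONITOR_DIR override is a
# hypothetical convenience for testing, not a feature of u.job-monitor.
MONITOR_DIR="${MONITOR_DIR:-$HOME/top_in_batch}"

# Build the common path prefix <dir>/jobname_node_jobid of a monitored job.
monitor_prefix() {
    # $1 = job name, $2 = host name of the primary node, $3 = PBS job id
    printf '%s/%s_%s_%s' "$MONITOR_DIR" "$1" "$2" "$3"
}

# Write one command line into the .cmd file of a monitored job.
send_job_command() {
    # $1 = prefix returned by monitor_prefix, $2 = command line to execute
    printf '%s\n' "$2" >> "$1.cmd"
}

prefix=$(monitor_prefix myjob sw-2r13-n21 94568)
echo "top snapshot : $prefix.top"
echo "command file : $prefix.cmd"
echo "output file  : $prefix.out"

# With a monitored job actually running, one could then try:
#   send_job_command "$prefix" "df -h /tmp"
#   sleep 15                 # assumed delay, not a documented value
#   cat "$prefix.out"
```

Keeping the prefix in one variable avoids mistyping the three file names, which differ only by their extension.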


Sample output from jobname_node_jobid.top (guillimin job)

file: myjob_sw-2r13-n21_94568.top
(job number 94568, primary host is sw-2r13-n21)

top - 13:10:31 up 11 days, 10:51,  0 users,  load average: 9.36, 6.19, 7.64
Tasks: 240 total,   3 running, 237 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.6%us,  2.0%sy, 46.5%ni, 40.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  37020880k total,  2229884k used, 34790996k free,   202236k buffers
Swap: 25165780k total,   999100k used, 24166680k free,   134044k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
17809 winger    25   0 24.3g 138m  21m S 299.7  0.4   4:17.49 ATM_MOD.Abs       
17812 winger    25   0 24.2g 136m  14m R 299.7  0.4   4:18.64 ATM_MOD.Abs       
17810 winger    25   0 24.2g 139m  16m R 297.7  0.4   4:18.72 ATM_MOD.Abs       
17811 winger    25   0 24.2g 135m  15m S 297.7  0.4   4:18.49 ATM_MOD.Abs       
16674 winger    21   0 84112 2528 1856 S  0.0  0.0   0:00.04 bash               
16811 winger    15   0 13316  780  412 S  0.0  0.0   0:00.00 pbs_demux          
17360 winger    21   0 85136 1468  792 S  0.0  0.0   0:00.00 bash               
17369 winger    20   0 85136 1276  596 S  0.0  0.0   0:00.02 bash               
17377 winger    18   0 63896 1172  980 S  0.0  0.0   0:00.00 u.job-monitor      
17402 winger    18   0  3808  492  420 S  0.0  0.0   0:00.00 repeat_command     
17707 winger    18   0 65572 2024 1160 S  0.0  0.0   0:00.00 Um_runmod.ksh      
17756 winger    18   0 65572 2020 1180 S  0.0  0.0   0:00.00 Um_model.ksh       
17791 winger    18   0  3680  180  100 S  0.0  0.0   0:00.00 Climat_r.monito    
17792 winger    19   0 65572 1980 1156 S  0.0  0.0   0:00.00 r.mpirun           
17802 winger    15   0 47772 4112 2520 S  0.0  0.0   0:00.01 mpiexec            
17805 winger    18   0 65444 1872 1080 S  0.0  0.0   0:00.00 POE_SCRIPT_1779    
17806 winger    18   0 65444 1876 1080 S  0.0  0.0   0:00.00 POE_SCRIPT_1779    
17807 winger    18   0 65444 1872 1080 S  0.0  0.0   0:00.00 POE_SCRIPT_1779    
17808 winger    18   0 65444 1876 1080 S  0.0  0.0   0:00.00 POE_SCRIPT_1779    
17876 winger    19   0 63896 1120  928 S  0.0  0.0   0:00.00 sh                 
17877 winger    15   0 30892 2176 1460 R  0.0  0.0   0:00.00 top
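
A snapshot like the one above can be summarized with a few lines of shell. The following is a minimal sketch, not part of u.job-monitor: the function name cpu_by_command is hypothetical, and it assumes that the .top file holds a single top snapshot in the column layout shown above (%CPU in column 9, COMMAND in column 12). It adds up %CPU per command name and skips commands whose total is 0.

```shell
#!/bin/bash
# Sum the %CPU column per command name in a top snapshot file.
cpu_by_command() {
    # $1 = path to a jobname_node_jobid.top file
    awk '
        $1 == "PID" { in_table = 1; next }   # header row of the process table
        in_table && NF >= 12 && $1 ~ /^[0-9]+$/ {
            cpu[$12] += $9                   # field 9 = %CPU, field 12 = COMMAND
        }
        END {
            for (c in cpu)
                if (cpu[c] > 0) printf "%s %.1f\n", c, cpu[c]
        }
    ' "$1"
}

# Small demonstration on three process lines copied from the sample output.
sample=$(mktemp)
cat > "$sample" <<'EOF'
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
17809 winger    25   0 24.3g 138m  21m S 299.7  0.4   4:17.49 ATM_MOD.Abs
17812 winger    25   0 24.2g 136m  14m R 299.7  0.4   4:18.64 ATM_MOD.Abs
16674 winger    21   0 84112 2528 1856 S  0.0  0.0   0:00.04 bash
EOF
cpu_by_command "$sample"   # prints: ATM_MOD.Abs 599.4
rm -f "$sample"
```

On a real job the function would be called with the path of the .top file in $HOME/top_in_batch.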