Running job monitor: Difference between revisions

An article from Informaticiens département des sciences de la Terre et l'atmosphère

Revision as of 22 November 2011 at 18:01

under construction

A running job can be monitored/interrogated with a local utility called

u.job-monitor


There are two ways to activate this utility:

  • at job submission time:
    ord_soumet ....  -prolog jobmonitor ....
  • with an explicit command in the job itself
    u.job-monitor &
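
As a minimal sketch of the second approach, a job script could start the monitor in the background before launching its main work. The `run_model` function is a hypothetical placeholder for the job's real workload, and the guard around `u.job-monitor` is an assumption added so that the script still runs on a machine where the utility is not installed.

```shell
#!/bin/sh
# Sketch of a job script that activates the monitor itself.

# Hypothetical stand-in for the real workload of the job.
run_model() {
    echo "model running"
}

# Start the monitor in the background (the "u.job-monitor &" form above)
# so that it keeps running while the rest of the job executes.
if command -v u.job-monitor >/dev/null 2>&1; then
    u.job-monitor &
    monitor_status="started"
else
    # Assumption: carry on without monitoring when the utility is not on PATH.
    monitor_status="unavailable"
fi
echo "job monitor: $monitor_status"

run_model
```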


Caveat: for an MPI job, only node 0 (the primary node) is monitored.


For each monitored job, the job monitor uses three files in the directory $HOME/top_in_batch:

  • node_jobid.top
    refreshed every 10 seconds with the output of a top command for processes belonging to the user
  • node_jobid.cmd
    if the user writes a line in this file then
    • this line is executed on the primary node
    • the output (stdout and stderr) of said command is appended to the node_jobid.out file
    • the node_jobid.cmd file is erased and re-created
  • node_jobid.out
    • accumulates the output (stdout and stderr) of each command read from the node_jobid.cmd file

In these file names, node stands for the host name of the job's primary node and jobid stands for the PBS job id of that job.
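
The file naming and the command file mechanism described above can be sketched with two small shell helpers. Both `monitor_file` and `send_command` are hypothetical names introduced for illustration and are not part of u.job-monitor; the `TOP_DIR` variable is likewise an assumption that simply defaults to the directory named above.

```shell
#!/bin/sh
# Directory used by the job monitor; TOP_DIR is a hypothetical override.
TOP_DIR="${TOP_DIR:-$HOME/top_in_batch}"

# Hypothetical helper: build the path of one monitor file.
#   $1 = host name of the primary node
#   $2 = PBS job id
#   $3 = file type: top, cmd or out
monitor_file() {
    printf '%s/%s_%s.%s\n' "$TOP_DIR" "$1" "$2" "$3"
}

# Hypothetical helper: request a command on the primary node by
# writing one line into the node_jobid.cmd file.
#   $1 = node, $2 = jobid, remaining arguments = the command line
send_command() {
    node=$1
    jobid=$2
    shift 2
    printf '%s\n' "$*" >> "$(monitor_file "$node" "$jobid" cmd)"
}
```

With the node and job id from the sample on this page, `send_command sw-2r13-n21 94568 df -h /tmp` would write the line `df -h /tmp` into sw-2r13-n21_94568.cmd. Once the monitor has picked the line up (this page does not state how long that takes), the result should appear in the file given by `monitor_file sw-2r13-n21 94568 out`.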


Sample output from node_jobid.top (guillimin job)

file: sw-2r13-n21_94568.top
(job number 94568, primary host is sw-2r13-n21)

top - 13:10:31 up 11 days, 10:51,  0 users,  load average: 9.36, 6.19, 7.64
Tasks: 240 total,   3 running, 237 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.6%us,  2.0%sy, 46.5%ni, 40.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  37020880k total,  2229884k used, 34790996k free,   202236k buffers
Swap: 25165780k total,   999100k used, 24166680k free,   134044k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
17809 winger    25   0 24.3g 138m  21m S 299.7  0.4   4:17.49 ATM_MOD.Abs       
17812 winger    25   0 24.2g 136m  14m R 299.7  0.4   4:18.64 ATM_MOD.Abs       
17810 winger    25   0 24.2g 139m  16m R 297.7  0.4   4:18.72 ATM_MOD.Abs       
17811 winger    25   0 24.2g 135m  15m S 297.7  0.4   4:18.49 ATM_MOD.Abs       
16674 winger    21   0 84112 2528 1856 S  0.0  0.0   0:00.04 bash               
16811 winger    15   0 13316  780  412 S  0.0  0.0   0:00.00 pbs_demux          
17360 winger    21   0 85136 1468  792 S  0.0  0.0   0:00.00 bash               
17369 winger    20   0 85136 1276  596 S  0.0  0.0   0:00.02 bash               
17377 winger    18   0 63896 1172  980 S  0.0  0.0   0:00.00 u.job-monitor      
17402 winger    18   0  3808  492  420 S  0.0  0.0   0:00.00 repeat_command     
17707 winger    18   0 65572 2024 1160 S  0.0  0.0   0:00.00 Um_runmod.ksh      
17756 winger    18   0 65572 2020 1180 S  0.0  0.0   0:00.00 Um_model.ksh       
17791 winger    18   0  3680  180  100 S  0.0  0.0   0:00.00 Climat_r.monito    
17792 winger    19   0 65572 1980 1156 S  0.0  0.0   0:00.00 r.mpirun           
17802 winger    15   0 47772 4112 2520 S  0.0  0.0   0:00.01 mpiexec            
17805 winger    18   0 65444 1872 1080 S  0.0  0.0   0:00.00 POE_SCRIPT_1779    
17806 winger    18   0 65444 1876 1080 S  0.0  0.0   0:00.00 POE_SCRIPT_1779    
17807 winger    18   0 65444 1872 1080 S  0.0  0.0   0:00.00 POE_SCRIPT_1779    
17808 winger    18   0 65444 1876 1080 S  0.0  0.0   0:00.00 POE_SCRIPT_1779    
17876 winger    19   0 63896 1120  928 S  0.0  0.0   0:00.00 sh                 
17877 winger    15   0 30892 2176 1460 R  0.0  0.0   0:00.00 top
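
Because the .top file is plain text, it can be summarized with standard tools. The sketch below adds up the %CPU column for one command name; `cpu_total` is a hypothetical helper, and it assumes the column layout shown in the sample above (%CPU in column 9, COMMAND in column 12).

```shell
#!/bin/sh
# Hypothetical helper: sum the %CPU column for one command name.
#   $1 = path to a node_jobid.top file
#   $2 = command name as shown in the COMMAND column
cpu_total() {
    awk -v cmd="$2" '
        # Only process lines start with a numeric PID; this skips
        # the summary lines and the column header.
        $1 ~ /^[0-9]+$/ && $12 == cmd { total += $9 }
        END { printf "%.1f\n", total }
    ' "$1"
}
```

Applied to the sample above, `cpu_total sw-2r13-n21_94568.top ATM_MOD.Abs` adds 299.7 + 299.7 + 297.7 + 297.7 and prints 1194.8.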