Bypass guillimin problems

An article from Informaticiens département des sciences de la Terre et l'atmosphère

Current version as of 3 April 2013, 20:43

Good things to know about Guillimin

General

  • 'xrec' does not work on guillimin.
  • 'xxdiff' does not work on guillimin. You can use 'tkdiff' or 'tkdiff+' instead.

Quota

You can check the quota in your HOME with 'mmlsquota', for example:

  mmlsquota -j --block-size M sb_home_${USER}

This will give you your quota in MB. You have a total of 10 GB.

Queues

Jobs using less than 12 cores will go to the 'sw' queue.

Jobs using 12 cores or more use full nodes only (12 cores per node) and go to the 'hb' queue. This means that a job submitted to run on 16 cores will actually run on 2 nodes = 24 cores.
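
The queue rule above can be sketched as a small shell helper. The function names 'queue_for_cores' and 'reserved_cores' are made up for this illustration and are not part of the guillimin environment. It is also only an assumption that jobs in the 'sw' queue reserve exactly the number of cores they ask for.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the queue rule described above.
CORES_PER_NODE=12

# Print the queue a job with the given number of cores would go to.
queue_for_cores() {
    if [ "$1" -lt "$CORES_PER_NODE" ]; then
        echo sw
    else
        echo hb
    fi
}

# Print the number of cores that end up reserved for the job.
# Jobs in 'hb' get whole nodes, so round up to a multiple of 12.
# Assumption: 'sw' jobs reserve exactly what they request.
reserved_cores() {
    if [ "$1" -lt "$CORES_PER_NODE" ]; then
        echo "$1"
    else
        nodes=$(( ($1 + CORES_PER_NODE - 1) / CORES_PER_NODE ))
        echo $(( nodes * CORES_PER_NODE ))
    fi
}

queue_for_cores 16    # prints: hb
reserved_cores 16     # prints: 24
```

This can help when choosing a core count: any request between 13 and 24 cores occupies the same two nodes.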

You can force all jobs to go to a certain queue by exporting SOUMET_EXTRAS, for example:

  export SOUMET_EXTRAS="-q hb"

SOUMET_EXTRAS can be exported in the batch profile '~/.profile.d/.batch_profile'.

To check your jobs use 'qs'.

To kill a job use 'qdel Job-ID'.

To check out the "busyness" of the queues use 'nodes'.
As written above jobs with 12 or more cores should go on the 'hb' queue and smaller jobs to the 'sw' queue.
For short tests you can of course use the 'debug' queue, or also the 'sw', 'hb' and 'lm' queues.


When the model gets stuck

Unfortunately guillimin is not a very "stable" machine and jobs often crash or get stuck.

It is not enough to check the status of a simulation with 'showq' or 'qs'. I recommend checking the time and the 'tail' of the listing as well.
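
One way to check the time of a listing is to look at how long ago the file was last written. The sketch below is only an illustration: the function names are made up, and it assumes GNU 'stat' is available (as on most Linux systems).

```shell
#!/bin/sh
# listing_age FILE
# Print the number of seconds since FILE was last modified.
# Uses GNU 'stat -c %Y' (modification time in seconds since the epoch).
listing_age() {
    now=$(date +%s)
    mtime=$(stat -c %Y "$1") || return 1
    echo $(( now - mtime ))
}

# listing_is_stale FILE MAX_AGE
# Succeed if FILE has not changed for more than MAX_AGE seconds.
listing_is_stale() {
    age=$(listing_age "$1") || return 2
    [ "$age" -gt "$2" ]
}
```

For example, 'listing_is_stale my_listing 600 && tail my_listing' would show the end of a listing (here the placeholder name 'my_listing') only if nothing was written to it for more than 10 minutes. A stale listing for a job that still shows as running is a hint that the model is stuck.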

For longer simulations, you should be using 'Chunk_lance' instead of 'Um_lance' to run your simulation.
When using 'Chunk_lance', each model job (month) can automatically get executed up to 5 times before the whole chunk-job stops.
Therefore, in case the model gets stuck, there is no need to delete the whole chunk-job with 'qdel'; it is better to kill only the mpiexec.

To kill the mpiexec, log in to the main node on which the model is running (MasterHost). You find the name of this node in the last column when executing 'qs'. Log in to this node with 'ssh':

  ssh $MasterHost

Once on the node check which jobs are running with:

  ps -fu $USER

Look for the 'mpiexec' job, something like:
  mpiexec -npernode 4 --bind-to-core --cpus-per-proc 3 -n 36 ././POE_SCRIPT_25215

Then kill this job with 'kill -9 ' followed by the 'PID' of the mpiexec job:

  kill -9 $PID

Then the month which got stuck should automatically restart from the beginning.

Only if this does not work either, kill the whole job with the usual 'qdel'.
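
Picking the PID out of the 'ps' output can also be done with a small filter. This is a sketch only: 'mpiexec_pids' is a made-up name, and it assumes the usual 'ps -f' column layout (PID in the second column, command in the eighth).

```shell
#!/bin/sh
# mpiexec_pids
# Read 'ps -f' style output on stdin and print the PID (second column)
# of every process whose command (eighth column) is mpiexec.
# The header line is skipped.
mpiexec_pids() {
    awk 'NR > 1 && $8 ~ /(^|\/)mpiexec$/ { print $2 }'
}

# Usage on the MasterHost (check the printed PID before killing anything):
#   ps -fu "$USER" | mpiexec_pids
```

It is safer to look at the printed PID first and then run 'kill -9' by hand than to pipe the result directly into 'kill'.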


Avoiding trouble on guillimin

Here are a few tricks to bypass some of the machine's problems.

1) Have $TMPDIR under /localscratch

With the following links, $TMPDIR will be under /localscratch instead of /tmp on nodes where this file system exists (compute nodes):
  mkdir ~/tmp
  cd ~/tmp
  ln -s /localscratch guillimin
  ln -s /localscratch localhost

2) Make model kill itself when it gets stuck

Go into the directory in which you create your executables.

Copy the following file:
  cp /home/winger/gem/v_3.3.3/Abs/CORDEX/dead_process_timer.c .

Create the corresponding *.o:

  333

  r.make_exp

(Ignore the warnings:
WARNING: file clib_interface.cdk not found
WARNING: file pthread.h not found
WARNING: file stdio.h not found
WARNING: file stdlib.h not found
WARNING: file unistd.h not found)

  make dead_process_timer.o


Edit the routine 'gem_run.ftn'.
If you do not have it yet in your directory get it from the environment:
  333
  omd_exp gem_run.ftn

In 'gem_run.ftn', before the beginning of the time step loop:
        do istep = step0, stepf
add the line:
      call start_dead_process_timer(60)
At the beginning of each time step, just after the line:
        do istep = step0, stepf
add the line:
         call I_am_alive()

So you will end up having something like this:
:
      call itf_cpl_fillatm
*
      call start_dead_process_timer(60)
*
      do istep = step0, stepf
*
         call I_am_alive()
*
         Lctl_step = istep
:

Create the object file:
  make gem_run.o
and the model executable:
  make gemclimdm

Once you have done this, the model will kill itself when no new time step has been started within the last 60 seconds.
If you think your model will usually take more than 60 seconds to compute one time step, increase the time in "call start_dead_process_timer(60)" from 60 to whatever you think is adequate.
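
To illustrate the idea behind the timer (not the actual contents of 'dead_process_timer.c', which are not shown here), the same mechanism can be sketched in shell. Everything in this sketch is an assumption made for illustration: a heartbeat file stands in for the call to I_am_alive(), and a background loop stands in for the timer. It assumes GNU 'stat'.

```shell
#!/bin/sh
# Illustration only: the "dead process timer" idea in shell form.
# The modification time of a heartbeat file plays the role of the
# last call to I_am_alive().

# i_am_alive FILE
# Record that a new step has just started.
i_am_alive() {
    touch "$1"
}

# timer_expired FILE TIMEOUT
# Succeed if the last heartbeat is older than TIMEOUT seconds.
timer_expired() {
    now=$(date +%s)
    last=$(stat -c %Y "$1") || return 2
    [ $(( now - last )) -gt "$2" ]
}

# start_dead_process_timer PID FILE TIMEOUT
# Poll once per second in the background while PID exists and
# kill PID as soon as the heartbeat is older than TIMEOUT seconds.
start_dead_process_timer() {
    (
        while kill -0 "$1" 2>/dev/null; do
            if timer_expired "$2" "$3"; then
                kill -9 "$1"
                break
            fi
            sleep 1
        done
    ) &
}
```

The point of the design is that the watched process only has to report that it is alive at the start of every step. If it hangs anywhere inside a step, the reports stop and the timer ends the process, which lets 'Chunk_lance' restart the month.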