Bypass Guillimin problems
Good things to know about Guillimin
- 'xrec' does not work on Guillimin.
- 'xxdiff' does not work on Guillimin. You can use 'tkdiff' or 'tkdiff+' instead.
- Jobs using fewer than 12 cores go to the 'sw' queue.
- Jobs using 12 cores or more go to the 'hb' queue and are allocated full nodes only (12 cores per node). For example, a job submitted for 16 cores occupies 2 nodes, i.e. 24 cores.
- To force all jobs into a particular queue, set SOUMET_EXTRAS, for example:
  export SOUMET_EXTRAS="-q hb"
  SOUMET_EXTRAS can be exported in the batch profile '~/.profile.d/.batch_profile'.
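The queue and node-rounding rules above can be written out as a small calculation. The following is a minimal sketch only: the function name 'guillimin_allocation' is hypothetical and not part of soumet, and it assumes that 'sw' jobs receive exactly the number of cores they request.

```shell
#!/bin/sh
# Hypothetical helper, not part of soumet: print the queue and the number
# of cores actually occupied for a requested core count, following the
# rules stated above (12 cores per node, full nodes only in 'hb').
CORES_PER_NODE=12

guillimin_allocation() {
    requested=$1
    if [ "$requested" -lt "$CORES_PER_NODE" ]; then
        # Fewer than 12 cores: 'sw' queue (assumed: no rounding)
        echo "sw $requested"
    else
        # 12 cores or more: 'hb' queue, rounded up to whole nodes
        nodes=$(( (requested + CORES_PER_NODE - 1) / CORES_PER_NODE ))
        echo "hb $(( nodes * CORES_PER_NODE ))"
    fi
}

guillimin_allocation 8    # prints "sw 8"
guillimin_allocation 16   # prints "hb 24"
```

Requesting a multiple of 12 cores therefore avoids occupying cores that the job does not use.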
Avoiding trouble on Guillimin
Unfortunately, Guillimin is not a very "stable" machine, and jobs often crash or get stuck.
Here are a few tricks to bypass some of the machine's problems.
1) Have $TMPDIR under /localscratch
With the following links in place, $TMPDIR will be under /localscratch instead of /tmp on nodes where this file system exists (the compute nodes):
mkdir ~/tmp
cd ~/tmp
ln -s /localscratch guillimin
ln -s /localscratch localhost
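If the commands above are run a second time, 'ln -s' fails because the links already exist. A re-runnable variant could look like the sketch below. The function name 'link_tmp_to_localscratch' and its two optional arguments are hypothetical, added only so the paths can be overridden, and the '-n' option assumes GNU ln.

```shell
#!/bin/sh
# Re-runnable variant of the commands above (hypothetical helper).
# Arguments: [directory holding the links] [link target]
link_tmp_to_localscratch() {
    tmp_base=${1:-"$HOME/tmp"}
    target=${2:-/localscratch}
    mkdir -p "$tmp_base" || return 1
    for name in guillimin localhost; do
        # -f replaces an existing link; -n keeps ln from following an
        # existing link to a directory and creating the new link inside it
        ln -sfn "$target" "$tmp_base/$name" || return 1
    done
}
```

Calling 'link_tmp_to_localscratch' without arguments creates the same two links as the commands above; 'ls -l ~/tmp' then shows where they point.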
2) Make the model kill itself when it gets stuck
Go into the directory in which you create your executables.
Copy the following file:
cp /home/winger/gem/v_3.3.3/Abs/CORDEX/dead_process_timer.c .
Create the corresponding *.o:
r.compile -src dead_process_timer.c
mv dead_process_timer.o malibLinux_x86-64_pgi11xx
You can ignore the following warnings from the compilation:
WARNING: file clib_interface.cdk not found
WARNING: file pthread.h not found
WARNING: file stdio.h not found
WARNING: file stdlib.h not found
WARNING: file unistd.h not found
Alternatively, you can copy the object file instead of compiling it:
cp /home/winger/gem/v_3.3.3/Abs/CORDEX/malibLinux_x86-64_pgi11xx/dead_process_timer.o malibLinux_x86-64_pgi11xx
Edit the routine 'gem_run.ftn'.
If you do not yet have it in your directory, get it from the environment:
omd_exp gem_run.ftn
In 'gem_run.ftn', before the beginning of the time step loop:
do istep = step0, stepf
add the line:
call start_dead_process_timer(60)
At the beginning of each time step, just after the line:
do istep = step0, stepf
add the line:
call I_am_alive()
So you will end up having something like this:
call itf_cpl_fillatm
call start_dead_process_timer(60)
do istep = step0, stepf
call I_am_alive()
Lctl_step = istep
Create the object file:
make gem_run.o
and the model executable:
make gemclimdm
Once this is done, the model will kill itself whenever no new time step has been calculated within the last 60 seconds.
If your model usually needs more than 60 seconds to compute one time step, increase the value in "call start_dead_process_timer(60)" from 60 to whatever you consider adequate.
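The idea behind the timer can be illustrated with a small shell analogue. This is a conceptual sketch only and is not the content of dead_process_timer.c: a heartbeat file stands in for the call to I_am_alive(), and a polling loop stands in for the timer started by start_dead_process_timer. The function names are hypothetical, and 'stat -c %Y' assumes GNU coreutils.

```shell
#!/bin/sh
# Conceptual analogue only; NOT the contents of dead_process_timer.c.

# Return 0 (true) if FILE was last modified more than TIMEOUT seconds ago.
heartbeat_is_stale() {
    file=$1
    timeout=$2
    now=$(date +%s)
    # A missing heartbeat file counts as stale
    last=$(stat -c %Y "$file" 2>/dev/null) || return 0
    [ $(( now - last )) -gt "$timeout" ]
}

# Poll every INTERVAL seconds while PID is running; kill PID and
# return 1 as soon as its heartbeat file has gone stale.
watch_heartbeat() {
    pid=$1; file=$2; timeout=$3; interval=${4:-5}
    while kill -0 "$pid" 2>/dev/null; do
        if heartbeat_is_stale "$file" "$timeout"; then
            kill -KILL "$pid"
            return 1
        fi
        sleep "$interval"
    done
}
```

In such a setup the watched program would update the heartbeat file (for example with 'touch') once per step, and 'watch_heartbeat <pid> <file> 60 &' would be started alongside it. As with the timer in the model, the timeout has to be longer than the slowest expected step.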