Bypass guillimin problems: Difference between revisions

From Informaticiens département des sciences de la Terre et l'atmosphère

Revision as of 20:49, 30 January 2012

How to avoid trouble on guillimin


Unfortunately guillimin is not a very "stable" machine and jobs often crash.

Here are a few tricks to bypass some of the machine's problems.


Have $TMPDIR under /localscratch

Run the following commands so that $TMPDIR is placed under /localscratch instead of /tmp on nodes where this file system exists (the compute nodes):
  mkdir ~/tmp
  cd ~/tmp
  ln -s /localscratch guillimin
  ln -s /localscratch localhost

Make the model kill itself when it gets stuck

Go into the directory in which you create your executables.

Copy the following file:
  cp /home/winger/gem/v_3.3.3/Abs/CORDEX/dead_process_timer.c .

Create the corresponding *.o:

  333

  r.make_exp

(Ignore the warnings:
WARNING: file clib_interface.cdk not found
WARNING: file pthread.h not found
WARNING: file stdio.h not found
WARNING: file stdlib.h not found
WARNING: file unistd.h not found)

  r.compile -src dead_process_timer.c
  mv dead_process_timer.o malibLinux_x86-64_pgi11xx
Alternatively, you can copy the precompiled object file instead of compiling it yourself:

  cp /home/winger/gem/v_3.3.3/Abs/CORDEX/malibLinux_x86-64_pgi11xx/dead_process_timer.o malibLinux_x86-64_pgi11xx 

Edit the routine 'gem_run.ftn'.
If you do not yet have it in your directory, get it from the environment:
  333
  omd_exp gem_run.ftn

In 'gem_run.ftn', before the beginning of the time step loop:
        do istep = step0, stepf
add the line:
      call start_dead_process_timer(60)
At the beginning of each time step, just after the line:
        do istep = step0, stepf
add the line:
         call I_am_alive()

You will end up with something like this:
:
      call itf_cpl_fillatm
*
      call start_dead_process_timer(60)
*
      do istep = step0, stepf
*
         call I_am_alive()
*
         Lctl_step = istep
:

Create the object file:
  make gem_run.o
and the model executable:
  make gemclimdm

Once you have done this, the model will kill itself if a new time step has
not been completed within the last 60 seconds.
If you expect your model to need more than 60 seconds to compute
one time step, increase the value in "call start_dead_process_timer(60)"
from 60 to whatever you think is adequate.
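
The contents of dead_process_timer.c are not shown on this page, so the following is only a minimal sketch of how such a timer could work, not the actual implementation. It assumes a background POSIX thread that aborts the process when no heartbeat has arrived within the timeout. The names start_watchdog, heartbeat and heartbeat_expired are hypothetical; the real routines are start_dead_process_timer and I_am_alive, and a Fortran-callable version would typically also need compiler-specific name mangling (often a trailing underscore) and arguments passed by reference.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical names throughout: this is an illustration, not the
   contents of dead_process_timer.c. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static time_t last_heartbeat;  /* time of the most recent heartbeat */
static int timeout_seconds;    /* longest allowed gap between heartbeats */

/* Record that the main loop is still making progress. */
void heartbeat(void)
{
    pthread_mutex_lock(&lock);
    last_heartbeat = time(NULL);
    pthread_mutex_unlock(&lock);
}

/* Return 1 if no heartbeat arrived within the timeout, otherwise 0. */
int heartbeat_expired(void)
{
    pthread_mutex_lock(&lock);
    double idle = difftime(time(NULL), last_heartbeat);
    int limit = timeout_seconds;
    pthread_mutex_unlock(&lock);
    return idle > (double)limit;
}

/* Background thread: check once per second, abort when stuck. */
static void *watchdog(void *arg)
{
    (void)arg;
    for (;;) {
        sleep(1);
        if (heartbeat_expired()) {
            fprintf(stderr, "watchdog: no heartbeat within the timeout, aborting\n");
            abort();
        }
    }
    return NULL;
}

/* Start the watchdog thread; returns 0 on success or a pthread error code. */
int start_watchdog(int seconds)
{
    pthread_t tid;
    assert(seconds > 0);

    pthread_mutex_lock(&lock);
    timeout_seconds = seconds;
    last_heartbeat = time(NULL);  /* the clock starts now */
    pthread_mutex_unlock(&lock);

    int rc = pthread_create(&tid, NULL, watchdog, NULL);
    if (rc == 0)
        pthread_detach(tid);      /* the thread is never joined */
    return rc;
}
```

In this sketch the caller would invoke start_watchdog(60) once before the time step loop and heartbeat() at the top of every iteration, mirroring the two calls added to 'gem_run.ftn'. With gcc or clang it would be compiled with the -pthread option.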