Chunk lance: Difference between revisions

From Informaticiens département des sciences de la Terre et l'atmosphère

Revision as of 20 December 2011, 16:09

Chunk_lance

Chunk_lance makes it possible to run a sequence of monthly model jobs as one big job.

A GEM/GEMCLIM/CRCM5 simulation usually consists of a sequence of monthly jobs.

Each monthly job consists of 3 parts:

  1. auto_launch (copies restart files from previous month, prepares config files for new month)
  2. entry (only in LAM mode, prepares driving data)
  3. model (main model job)

Chunk_lance runs a series of these 3 jobs and checks at the end of each model job whether there is still enough time to calculate another month. If so, the 3 jobs are executed for the next month; if not, a new chunk_job is submitted.

If the model job fails, it is automatically re-executed up to 4 times.

Since the running job is now always called "chunk_job", one can no longer tell from the job name how far the simulation has progressed. However, one can always look at the listings directory, and a log file called "chunk_job.log" is kept in the config file directory.
This file is essential for the whole chunk_job procedure. The chunk_job itself checks this file to determine which job to execute next. Therefore this log file must only be removed if one wants to restart a simulation from the beginning.
However, to rerun part of a simulation one can alter the log file by hand. Just make sure there is never a blank line at the end of the log file, since the chunk_job only checks the very last line of the log file!
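Because only the very last line of the log file matters, it can help to inspect that line after editing the file by hand. The following is a minimal sketch, not part of Chunk_lance: the function name check_chunk_log is hypothetical, and only the file name "chunk_job.log" comes from the text above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Chunk_lance): print the last line of the
# log file and complain if that line is blank.
check_chunk_log() {
    log="${1:-chunk_job.log}"
    if [ ! -f "$log" ]; then
        echo "no log file: $log" >&2
        return 2
    fi
    last=$(tail -n 1 "$log")
    # A last line that contains only whitespace counts as blank.
    if [ -z "$(printf '%s' "$last" | tr -d '[:space:]')" ]; then
        echo "blank last line in $log" >&2
        return 1
    fi
    printf '%s\n' "$last"
}
```

Calling check_chunk_log in the config file directory prints the line that the chunk_job would read next, or returns a non-zero status if the file is missing or ends with a blank line.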

To start a simulation with Chunk_lance, one only has to set the model environment (for example with '333') and execute "Chunk_lance" in the config file directory.

The maximum time one chunk_job may run is set in the file 'configexp.dot.cfg' with the parameter 'BACKEND_time_mod'.
On guillimin, one job is allowed to run up to 30 days (2592000 sec).
On colosse, one job is allowed to run up to 2 days (172800 sec).
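The limits above are given in seconds. As a small worked example, the conversion from days can be done with shell arithmetic. The function name days_to_seconds is hypothetical, and the exact syntax for setting 'BACKEND_time_mod' in 'configexp.dot.cfg' is not shown here because the text does not give it.

```shell
#!/bin/sh
# Hypothetical helper: convert a time limit in days to seconds.
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

days_to_seconds 30   # guillimin limit: prints 2592000
days_to_seconds 2    # colosse limit: prints 172800
```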


Restart using Chunk_lance

If a simulation stops, one first has to find out which job (auto_launch, entry or model) was the last one that finished properly.

The best way is to look at the listing, but the log file "chunk_job.log" can also give an indication.

entry or model job crashed

In this case the last line in the log file should be:

   ... entry ..._E starting at ...
or
   ... model ..._M starting at ...

If the entry or the model job crashed it is enough to restart the simulation by executing "Chunk_lance" again in the config file directory.

auto_launch crashed

In this case the last line in the log file should be:

   ... continue_with_next_job ... starting at ...

If there is NO file called "continue_with_next_job" in your config file directory, take a copy of the file "last_continue_with_next_job":

  cp last_continue_with_next_job continue_with_next_job

Then restart the simulation by executing "Chunk_lance" again in the config file directory.
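The two cases above can be told apart from the last line of "chunk_job.log". The following sketch matches only the fragments shown above ("entry ..._E starting at", "model ..._M starting at", "continue_with_next_job ... starting at") and assumes nothing else about the line format. The function names last_job_kind and restore_continue_file are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: classify the last line of the log file.
# Only the fragments quoted in the text are matched; the rest of the
# line format is unknown and therefore not assumed.
last_job_kind() {
    last=$(tail -n 1 "${1:-chunk_job.log}")
    case "$last" in
        *continue_with_next_job*"starting at"*) echo auto_launch ;;
        *entry*"_E starting at"*)               echo entry ;;
        *model*"_M starting at"*)               echo model ;;
        *)                                      echo unknown ;;
    esac
}

# Hypothetical helper: recreate "continue_with_next_job" from its saved
# copy, but only if it is missing (the auto_launch case above).
restore_continue_file() {
    if [ ! -e continue_with_next_job ] && [ -e last_continue_with_next_job ]; then
        cp last_continue_with_next_job continue_with_next_job
    fi
}
```

If last_job_kind prints "entry" or "model", executing "Chunk_lance" again is enough. If it prints "auto_launch", run restore_continue_file first and then execute "Chunk_lance". A result of "unknown" means the listing should be checked by hand.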

Restart from previous restart file

If a crash was so bad that the restart files were destroyed, it is not enough to just restart any of the 3 jobs.
One has to restart the simulation from the previous intact restart files.

  1. Copy the (set of) restart files from the archive (${CLIMAT_archdir}/Restarts) back into the execution directory "~/MODEL_EXEC_RUN/$TRUEHOST"
  2. Gunzip and unarchive (cmcarc -x -f ...) the restart files in ~/MODEL_EXEC_RUN/$TRUEHOST
  3. Go into the config file directory
  4. Copy: cp last_continue_with_next_job continue_with_next_job

Edit 'continue_with_next_job':
prevjob=name_of_previous_month
nextname=name_of_month_to_be_rerun
Replace the 'Um_lance' command, which spans 4 lines, with the Um_lance command found in the file:
~/Climat_log/name_of_previous_month.log
Remove the last 4 lines:
      export EXECDIR=...
      export CLIMAT_model=3.3.3
      export CLIMAT_version=3.3.3
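The first two edits to 'continue_with_next_job' can also be scripted. The sketch below assumes that the file contains lines that start exactly with "prevjob=" and "nextname="; the text does not confirm this layout, so check the file before relying on it. The function name set_restart_months is hypothetical. Replacing the Um_lance command and removing the trailing export lines are left to a manual edit.

```shell
#!/bin/sh
# Hypothetical helper: set prevjob and nextname in the given file.
# Assumes the lines start exactly with "prevjob=" and "nextname=".
# Month names must not contain '|' or '&' (special in the sed replacement).
set_restart_months() {
    file=$1
    prev=$2
    next=$3
    tmp="$file.tmp.$$"
    sed -e "s|^prevjob=.*|prevjob=$prev|" \
        -e "s|^nextname=.*|nextname=$next|" "$file" > "$tmp" &&
        mv "$tmp" "$file"
}
```

Usage, run from the config file directory with the real month names substituted for the placeholders: set_restart_months continue_with_next_job name_of_previous_month name_of_month_to_be_rerun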