Version as of 11 December 2013, 20:41
Chunk_lance
Chunk_lance allows running a sequence of monthly model jobs within one big job.
A GEM/GEMCLIM/CRCM5 simulation usually consists of a sequence of monthly jobs.
Each monthly job is made up of 3 parts:
- auto_launch (copies restart files from the previous month, prepares config files for the new month)
- entry (only in LAM mode, prepares driving data)
- model (main model job)
Chunk_lance runs a series of these 3 jobs, checking at the end of each model job whether there is still enough time to calculate another month. If so, the 3 jobs are executed for another month; if not, a new chunk_job is submitted.
If the model job fails, it is automatically re-executed up to 3 times.
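The loop just described can be sketched as follows (a hypothetical shell sketch: `run_month` and `enough_time_left` are placeholder names standing in for the actual Chunk_lance internals):

```shell
# Hypothetical sketch of the Chunk_lance loop; function names and the
# time check are placeholders, not the real implementation.
run_month() {
    # In reality: auto_launch, then entry (LAM mode only), then model
    echo "running month $1"
}
enough_time_left() {
    # In reality: compares elapsed wall time against BACKEND_time_mod;
    # here a fixed 2-month budget stands in for the time check.
    [ "$1" -le 2 ]
}

month=1
while enough_time_left "$month"; do
    run_month "$month"
    month=$((month + 1))
done
# Not enough time for another month: a new chunk_job would be submitted here
echo "submitting new chunk_job starting at month $month"
```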
But sometimes part of the restart files gets overwritten before a month is finished. In that case one cannot simply relaunch the month with Chunk_lance; the original restart files (from the previous month) have to be retrieved and the simulation restarted from there.
When the model job stops for whatever reason, Chunk_lance automatically checks whether the restart files are still the original ones. If they have already been modified, the following message appears in the "chunk_job listing" (!!!), not in the model listing:
At least one of the restart files got already rewritten
Therefore the model could not get restarted automatically
You have to restart your simulation starting from the previous restart files
----- ABORT -----
If you see this message, please do not simply restart the simulation with Chunk_lance but, as said, restart from the previous restart file.
Since the running job is not always called "chunk_job", one can no longer tell from the job name how far the simulation has progressed. But one can always look at the listings directory; in addition, a log file called 'chunk_job.log' is kept in the config file directory.
This file 'chunk_job.log' is essential for the whole chunk_job procedure. The chunk_job itself checks this file to determine which job to execute next. Therefore this log file should not be touched and must only be removed if one wants to restart a simulation from the beginning.
However, to rerun part of a simulation one can edit the log file by hand. Just make sure there is never a blank line at the end of the file, since the chunk_job only checks its very last line!
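Since only the very last line of 'chunk_job.log' matters, the line the chunk_job will act on can be checked with standard tools (shown here on a throwaway demo file rather than a real log):

```shell
# Demo on a throwaway file; in practice point "log" at the real chunk_job.log
log=demo_chunk_job.log
printf '%s\n' "... model mymonth_M starting at ..." > "$log"

# The chunk_job only reads the very last line of the log:
last=$(tail -n 1 "$log")
echo "last line: $last"

# A blank trailing line would make this check come back empty -- exactly the
# situation to avoid when editing the log by hand:
if [ -z "$last" ]; then
    echo "WARNING: chunk_job.log ends with a blank line" >&2
fi

rm -f "$log"
```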
To start a simulation using Chunk_lance one only has to set the model environment (for example with '333') and execute "Chunk_lance" in the config file directory.
The time up to which one chunk_job will be running can be set in the file 'configexp.dot.cfg' with the parameter 'BACKEND_time_mod'.
On guillimin one job is allowed to run up to 30 days (2592000 sec).
On colosse one job is allowed to run up to 2 days (172800 sec).
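For example, the line in 'configexp.dot.cfg' could look like this (the value shown is for illustration only; it matches the colosse limit quoted above):

```shell
# In configexp.dot.cfg: maximum wall-clock time for one chunk_job, in seconds
BACKEND_time_mod=172800    # 2 days (colosse); up to 2592000 allowed on guillimin
```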
Restart using Chunk_lance
In case a simulation stops, one first has to find out which job (auto_launch, entry or model) was the last one that finished properly, or in other words, which job crashed.
The best way is to look at the listings, but the log file "chunk_job.log" can also give indications.
entry or model job crashed
In this case the last line in the log file should be:
... entry ..._E starting at ...
or
... model ..._M starting at ...
If the entry or the model job crashed, it is enough to restart the simulation by executing "Chunk_lance" again in the config file directory.
auto_launch crashed
In this case the last line in the log file should be:
... continue_with_next_job ... starting at ...
If there is NO file called "continue_with_next_job" in your config file directory take a copy of the file "last_continue_with_next_job":
cp last_continue_with_next_job continue_with_next_job
Then restart the simulation by executing "Chunk_lance" again in the config file directory.
Restart from previous restart file
Continue simulation from a restart file:
- Copy restart file of previous month (if there is more than 1 part, copy all parts!) from the archive (${CLIMAT_archdir}/Restarts) back into the execution directory "~/MODEL_EXEC_RUN/$TRUE_HOST"
- Gunzip and unarchive (cmcarc -x -f ...) the restart file(s) in ~/MODEL_EXEC_RUN/$TRUE_HOST
- Go into the config file directory
- Copy: cp last_continue_with_next_job continue_with_next_job
- Edit 'continue_with_next_job':
prevjob=name_of_previous_month
nextname=name_of_month_to_be_rerun
Replace the "Um_lance ..." command, which spans 4 lines, with the Um_lance command found in the file:
~/Climat_log/name_of_previous_month.log
Remove the last 4 lines:
export EXECDIR=...
export CLIMAT_model=3.3.3
export CLIMAT_version=3.3.3
Climat_save_restarts ...
- Edit the log file 'chunk_job.log':
First, I suggest making a backup copy of the log file.
Then remove all lines below:
... continue_with_next_job name_of_month_to_be_rerun starting at ...
- Execute "Chunk_lance" again
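The log-editing step above can be done with standard tools. A sketch (the month name is a placeholder, and a small demo log stands in for a real 'chunk_job.log') that keeps everything up to and including the 'continue_with_next_job ... starting at' line:

```shell
rerun_month=name_of_month_to_be_rerun   # placeholder for your month name

# Build a small demo log; in practice operate on the real chunk_job.log
printf '%s\n' \
  "... model previous_M starting at ..." \
  "... continue_with_next_job ${rerun_month} starting at ..." \
  "... entry ${rerun_month}_E starting at ..." > chunk_job.log

# Make a backup copy first, as suggested above
cp chunk_job.log chunk_job.log.bak

# Keep line 1 up to and including the continue_with_next_job line,
# dropping everything below it (this also avoids a trailing blank line)
sed -n "1,/continue_with_next_job ${rerun_month} starting/p" \
    chunk_job.log.bak > chunk_job.log
```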