Chunk_lance

Chunk_lance allows you to run a sequence of monthly (or sub-monthly) model jobs in one big job.

A GEM/GEMCLIM/CRCM climate simulation usually consists of a sequence of monthly jobs. But even a weather forecast simulation can consist of a sequence of, for example, n-day jobs (set with the parameter 'Fcst_rstrt_S').
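
For illustration only, such a restart/split frequency might be set in the model settings file; the exact file, namelist, and value syntax depend on your GEM version and are assumed here:

    # hypothetical entry in the model settings (e.g. gem_settings.nml)
    Fcst_rstrt_S = '24h'     # assumed syntax: write a restart and split the run every 24 hours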

Each job consists of 2 parts:

  1. scripts (copies the restart files from the previous month and prepares the config files for the new month)
  2. model (main model job)

Chunk_lance will run a series of these two-part jobs, checking at the end of each job whether there is still enough wallclock time to compute another one. If there is, another job is executed; if not, a new chunk_job is submitted automatically(!).
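
Conceptually, the chunk_job behaves roughly like the following shell sketch. This is NOT the real Chunk_lance code; scripts_part.sh and model_part.sh are hypothetical placeholders for the two parts described above:

    #!/bin/bash
    # Conceptual sketch only, not the actual chunk_job script.
    limit=${BACKEND_time_mod:-86400}   # wallclock allowed for this chunk (seconds)
    start=$(date +%s)
    longest=0
    while true; do
        t0=$(date +%s)
        ./scripts_part.sh              # hypothetical: copy restarts, prepare config files
        ./model_part.sh                # hypothetical: run the model for one month
        t1=$(date +%s)
        (( t1 - t0 > longest )) && longest=$(( t1 - t0 ))
        # stop when there is no room left for another job of similar length
        if (( t1 - start + longest > limit )); then
            echo "Not enough time left in this chunk; a new chunk_job would be submitted here"
            break
        fi
    done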


As with Um_lance, Chunk_lance must be executed from the config file directory, and the model environment must already be set. To start a simulation from the beginning, add the option '-start':

    Chunk_lance -start

To continue a simulation, just execute the command without any option:

    Chunk_lance
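
A typical session might therefore look like this (the experiment directory path is hypothetical):

    cd ~/my_experiment/Configs     # hypothetical config file directory of the experiment
    Chunk_lance -start             # first launch of the simulation
    # ... later, to continue after a crash or after fixing a problem:
    Chunk_lance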


The wallclock time for which one chunk_job will run can be set in the file 'configexp.dot.cfg' with the parameter 'BACKEND_time_mod'.
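
For example, to give each chunk_job a wallclock limit of 24 hours one could set the following (the value is an illustration only and assumes the limit is given in seconds):

    # in configexp.dot.cfg
    BACKEND_time_mod=86400     # 24 h * 3600 s/h = 86400 seconds per chunk_job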


If the model job fails, it will automatically be re-executed a second time.

If the whole simulation crashes, it can easily be resubmitted by executing 'Chunk_lance' again.
To know where the simulation stopped, Chunk_lance uses a log file called 'chunk_job.log'. The chunk_job itself checks this file to determine which job to execute next, so this log file should normally not be touched.
However, to rerun part of a simulation you can edit the log file by hand. Just make sure there is never a blank line at the end of the log file, since the chunk_job only checks the very last line!
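
If you do edit 'chunk_job.log', a quick way to verify what the chunk_job will actually read is to print the very last line, for example:

    tail -1 chunk_job.log     # shows exactly the line the chunk_job will act on
    # if this prints an empty line, remove the trailing blank line before relaunching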


Sometimes, however, part of the restart files gets overwritten before a month is finished, or the restart files become corrupted. In that case you cannot just relaunch the month with Chunk_lance; you have to retrieve the last uncorrupted restart files (from the previous month) and restart the simulation from there.

When the model job stops for whatever reason, Chunk_lance automatically checks whether the restart files are still the original ones. If they have already been modified, the following message will appear in the chunk_job listing (not in the model listing!):

   At least one of the restart files got already rewritten
   Therefore the model could not get restarted automatically
   You have to restart your simulation starting from the previous restart files
            ----- ABORT -----
 
If you see this message, do not simply restart the simulation with Chunk_lance; instead, restart from the previous restart file as described below.


Since the running job is now always called "cjob_${exp}_...", you can no longer tell from the job name how far the simulation has progressed. But you can always have a look at the listings directory, and a log file called 'chunk_job.log' is also kept in the config file directory.


Restart using Chunk_lance

If a simulation stops and you want to find out which job (scripts, entry or model) crashed, have a look in the listings directory (~/listings/${TRUE_HOST}). Check which of the following jobs crashed:

    ${exp}_S     (Scripts)
    ${exp}_M     (Model)

Or you can have a look at the log file "chunk_job.log".
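
For example, the most recent listings and the last log entries can be checked with:

    ls -lrt ~/listings/${TRUE_HOST} | tail     # most recent listings appear last
    tail chunk_job.log                         # run from the config file directory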

In any case, you can restart your simulation by simply executing
    Chunk_lance
again in the config file directory, of course AFTER you have fixed the problem. If it was only a machine problem, just restart the simulation with 'Chunk_lance' right away.


Restart from previous restart file

To continue a simulation from a previous restart file, follow these steps (a consolidated command sketch follows the list):

    * Copy the restart file of the previous month (if there is more than 1 part, copy all parts!) from the archive (${CLIMAT_archdir}/Restarts) back into the execution directory "~/MODEL_EXEC_RUN/$TRUE_HOST"

    * Gunzip and unarchive (cmcarc -x -f ...) the restart file(s) in ~/MODEL_EXEC_RUN/$TRUE_HOST
      The cmcarc command will create a new directory; the *.ca file will remain in the directory and can be removed afterwards.

    * Go into the config file directory

    * Edit the log file 'chunk_job.log':
        (First, it is a good idea to make a backup copy of the log file.)
        Then remove all lines concerning the month you want to rerun and all following lines.

    * If you are running the entry in parallel,
       you will also have to remove all the ${exp}_entry_finished flags in your config file directory for all the months you want to rerun.
       Otherwise the entries for these months will not get rerun!

    * Execute "Chunk_lance" again
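
Putting these steps together, a session could look roughly like this. The restart file name, experiment name and config directory are hypothetical; adapt them (and the exact flag file names) to your own setup:

    # 1. copy the restart archive of the previous month back into the execution directory
    cd ~/MODEL_EXEC_RUN/$TRUE_HOST
    cp ${CLIMAT_archdir}/Restarts/my_exp_restart_200001.ca.gz .   # hypothetical file name
    # 2. gunzip and unarchive it; the extracted *.ca file can be removed afterwards
    gunzip my_exp_restart_200001.ca.gz
    cmcarc -x -f my_exp_restart_200001.ca
    rm my_exp_restart_200001.ca
    # 3. in the config file directory, back up and trim the chunk log
    cd ~/my_experiment/Configs                  # hypothetical config file directory
    cp chunk_job.log chunk_job.log.bak
    #    (remove, with an editor, all lines for the month to rerun and everything after them)
    # 4. if the entry runs in parallel, remove the corresponding 'finished' flags
    #    (exact flag file names depend on your setup)
    rm -f ${exp}_entry_finished*
    # 5. relaunch
    Chunk_lance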