Superjobs guillimin

An article from Informaticiens département des sciences de la Terre et l'atmosphère

Superjobs

A "superjob" is a job that runs on one of the normal queues and executes other jobs, submitted to a faked queue, one after the other.
It runs until the requested wallclock time is used up or until it has found no job to execute for a certain time.

  NEVER KILL A SUPERJOB!!! See below for more information.

A superjob is a very useful tool for executing post-processing jobs. It makes the automatic submission of post-processing jobs by the model independent of guillimin's "moods". No jobs will get lost or have to be resubmitted by hand.


How to start a "superjob"

The command to submit a superjob is "u.run_work_stream":

  u.run_work_stream [-instances n] -t mseconds -cpus number_of_cpus -name stream_name -maxidle nseconds -queues q1 q2 ... qn [--] "arguments_for_ord_soumet"

  Arguments_for_ord_soumet (anything found after -- will be passed verbatim to ord_soumet) may include -q, -jn, and any other relevant argument

Submission example:

  u.run_work_stream -t 2592000 -cpus 1 -name superjob_1a -maxidle 3600 -queues sj1  --  -jn superjob_1a

In this case a superjob with the name superjob_1a will get submitted.
'-name' is the internal name of the superjob, '-jn' the name of the listing.
For simplicity I suggest keeping the two names the same.
Make sure to NEVER HAVE TWO SUPERJOBS WITH THE SAME NAME running. Once a superjob has finished, however, you can submit a new one with the same name.

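The superjob will get submitted for '-t 2592000' seconds (30 days) on '-cpus 1' cpu. The example above does not pass '-q'; to choose the real queue the superjob runs on, add it after '--', for example '-q sw'.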

If it does not find a job to execute for '-maxidle 3600' seconds (1 hour) it will terminate itself.

The superjob will execute jobs which got submitted to the faked queue '-queues sj1'. You can name the faked queue any way you want.
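
The arguments described above can be put together with a small helper. This is only a sketch: the function name 'superjob_cmd' is made up for illustration, and it just prints the command instead of running it. It computes '-t' from a number of days and uses one name for both '-name' and '-jn', as suggested above.

```shell
# Sketch: print (do not run) a superjob submission command.
# 'superjob_cmd' is a hypothetical helper, not part of u.run_work_stream.
# Arguments: name, faked queue, wallclock time in days, max idle time in seconds.
superjob_cmd() {
    name=$1
    queue=$2
    days=$3
    maxidle=$4
    walltime=$(( days * 24 * 60 * 60 ))   # '-t' is given in seconds
    # The same name is used for '-name' and '-jn'.
    echo "u.run_work_stream -t $walltime -cpus 1 -name $name -maxidle $maxidle -queues $queue -- -jn $name"
}

superjob_cmd superjob_1a sj1 30 3600
```

With the values shown, the printed command is the submission example above. Check it, then copy it to the command line to actually submit the superjob.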


How to send jobs to the "faked" queue

At the moment only jobs running on 1-4 cores can be executed by a superjob. This can easily be changed, though. Just let Katja or Michel know.

For example, to have all jobs submitted to run on 1 core executed by the superjob submitted above, instead of actually being submitted, set the environment variable:

  QUEUE_1CPU=sj1@

You can export this variable in your ~/.profile.d/.batch_profile:

  export QUEUE_1CPU=sj1@

The '@' at the end is very important. This tells 'soumet' that this is a faked queue and not a real one.
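
Since a missing '@' is easy to overlook, a quick check can help. The following is a minimal sketch; the function name 'is_faked_queue' is made up for illustration and only tests the form of the value, not whether a superjob is actually serving that queue.

```shell
# Sketch: return 0 if a queue name is marked as a faked queue (ends in '@').
# 'is_faked_queue' is a hypothetical helper, not part of 'soumet'.
is_faked_queue() {
    case "$1" in
        ?*@) return 0 ;;   # at least one character followed by a trailing '@'
        *)   return 1 ;;
    esac
}

QUEUE_1CPU=sj1@
if is_faked_queue "$QUEUE_1CPU"; then
    # ${QUEUE_1CPU%@} strips the trailing '@' to show the queue name itself
    echo "QUEUE_1CPU points to faked queue: ${QUEUE_1CPU%@}"
else
    echo "QUEUE_1CPU is missing the trailing '@'" >&2
fi
```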


What will happen

Once the environment variable QUEUE_1CPU is set to 'sj1@', jobs submitted to run on 1 cpu are no longer actually submitted. Instead, a link to each of them is created in the directory:

  ~/.job_queues/sj1-1

A superjob "picking" from queue 'sj1' will check if there is a link in this directory. If yes, it will execute the corresponding job.

If you see the links in this directory piling up you can submit a second, third, ... superjob, executing jobs from the same faked queue. Just make sure to use a different name for each superjob you submit!
It does make sense to submit "extra" superjobs with a very short '-maxidle' time, so that they terminate soon after the backlog is cleared.
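
To see whether links are piling up, you can count the entries in the queue directory. The sketch below assumes that every non-directory entry in the directory is one waiting job; the function name 'count_pending_jobs' is made up for illustration.

```shell
# Sketch: count the entries waiting in a faked-queue directory.
# 'count_pending_jobs' is a hypothetical helper.
# Assumption: each non-directory entry (link) in the directory is one waiting job.
count_pending_jobs() {
    dir=$1
    if [ -d "$dir" ]; then
        # Look only at the top level of the directory; skip subdirectories.
        find "$dir" -mindepth 1 -maxdepth 1 ! -type d | wc -l | tr -d ' '
    else
        echo 0
    fi
}

count_pending_jobs "$HOME/.job_queues/sj1-1"
```

If the number keeps growing between two calls, that is a sign that an extra superjob may be useful.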


How to elegantly terminate a superjob

As mentioned above: Never kill a superjob with 'qdel'!

Every superjob has a config file:

  ~/.job_queues/.active_superjob_name_*.1

You can edit this file and set for example:

  MaxIdle=0

As soon as there are no more jobs to be executed, the superjob will gracefully terminate itself.
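
Instead of opening an editor, the same change can be scripted. This is a sketch under assumptions: the function name 'set_maxidle_zero' is made up, and the config file is assumed to hold the setting as a line of the form 'MaxIdle=<number>'. If no such line is found, the sketch appends one. Pass the config file of your own superjob as the argument.

```shell
# Sketch: set MaxIdle=0 in a superjob config file.
# 'set_maxidle_zero' is a hypothetical helper.
# Assumption: the setting is stored as a line 'MaxIdle=<number>'.
set_maxidle_zero() {
    cfg=$1
    [ -f "$cfg" ] || { echo "no such config file: $cfg" >&2; return 1; }
    if grep -q '^MaxIdle=' "$cfg"; then
        tmp=$(mktemp) || return 1
        if sed 's/^MaxIdle=.*/MaxIdle=0/' "$cfg" > "$tmp"; then
            # Write back in place so the file keeps its name and permissions.
            cat "$tmp" > "$cfg"
            status=$?
        else
            status=1
        fi
        rm -f "$tmp"
        return $status
    else
        # No MaxIdle line yet: append one.
        echo 'MaxIdle=0' >> "$cfg"
    fi
}

# Usage: set_maxidle_zero <config file of your superjob>
```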

Detailed technical information on Superjob technology

Queued work streams page