Queued Work Stream

An article from Informaticiens département des sciences de la Terre et l'atmosphère

Revision as of 28 November 2011, 14:48

under construction

Work Streams

What is a work stream

A work stream is a series of "jobs" with a similar resource profile. To avoid overtaxing the system job scheduler with a myriad of relatively "small" work items, these items are inserted into "pseudo queues" and processed by a "master job".

  • A user's work stream(s) will be found in directory $HOME/.job_queues
    This directory in turn contains subdirectories, one for each "pseudo queue".
  • More than one master job can go "fishing" into a "pseudo queue".
  • Job monitoring will be started by the master job using u.job-monitor
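
As a minimal sketch of the layout described above, the pseudo queues of a user can be listed by enumerating the subdirectories of $HOME/.job_queues. The function name list_pseudo_queues is hypothetical and not part of the work stream tools; it only assumes that each pseudo queue is a non-hidden subdirectory.

```shell
# Hypothetical helper (not part of the work stream tools): print the names of
# the pseudo queues found in a queue directory, one per line.
# Assumes each pseudo queue is a non-hidden subdirectory of that directory.
list_pseudo_queues() {
    local queue_root="${1:-$HOME/.job_queues}"
    local entry
    for entry in "$queue_root"/*/; do
        # When nothing matches, the glob stays literal; skip it.
        [ -d "$entry" ] || continue
        basename "$entry"
    done
}
```

Called without an argument, the function looks in $HOME/.job_queues; a different directory can be passed as the first argument.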

The main characteristics of a work stream are

  • a name (arbitrary)
  • a set of pseudo queues (may be used to implement some sort of priority scheme)
  • a computing surface (number of nodes)
  • a duration (number of hours, days, weeks...)
  • a maximum idle time (if a stream is using a large number of nodes, its maximum idle time should be very short)

Inserting work into a work queue

The ord_soumet utility is used to insert work into a "pseudo queue". The syntax is almost the same as for submitting a job to the system's batch scheduler. The "-q pseudo_queue_name@" parameter to ord_soumet is used to indicate that instead of being submitted directly, the piece of work (job) should rather be inserted into the "pseudo_queue_name" work queue.

To activate "queue" inheritance (a job/piece of work will then automatically submit to its own queue):

  • use "-q" when calling ord_soumet
  • export SOUMET_EXTRAS="-q" (may be done using ~/.profile.d/.batch_profile)
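
The following sketch illustrates the "-q pseudo_queue_name@" convention. The wrapper insert_work and the DRY_RUN variable are hypothetical, and passing the job script as the first argument of ord_soumet is an assumption that this page does not confirm; only the "-q name@" form comes from the text above.

```shell
# Hypothetical wrapper around ord_soumet (not part of the work stream tools).
# usage: insert_work pseudo_queue_name job_script [other ord_soumet options]
# The trailing "@" marks the name as a pseudo queue, so the job is inserted
# into that work queue instead of being submitted directly.
insert_work() {
    local queue="$1"
    shift
    if [ -n "${DRY_RUN:-}" ]; then
        # Only show the command that would be run.
        echo ord_soumet "$@" -q "${queue}@"
    else
        ord_soumet "$@" -q "${queue}@"
    fi
}
```

With DRY_RUN set to a non-empty value, "insert_work p01 my_job.sh" (my_job.sh being a placeholder script name) prints "ord_soumet my_job.sh -q p01@" without submitting anything.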

Submitting a master job for a work stream

A work stream is started by submitting a master job with the u.run_work_stream command.

A stream master job will terminate automatically if

  • no work was found for maxidle seconds
  • there is less than one minute left in the master job

A piece of work will be left in the queue if

  • there are not enough cpus in the master job to do the work
  • there is not enough time left in the master job to run the piece of work (including the safety margin of 1 minute)
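
The two sets of rules above can be sketched as simple shell predicates. This is only an illustration under stated assumptions: the function names, the argument order and the exact behaviour at the boundaries (for example, whether a piece of work that needs exactly the remaining time is accepted) are hypothetical, and the real master job may implement these checks differently.

```shell
# Hypothetical sketch of the decision rules listed above; all names are
# illustrative and the real master job may implement them differently.
SAFETY_MARGIN=60   # safety margin of 1 minute, in seconds

# Succeeds (returns 0) if a piece of work fits in the master job.
# usage: work_fits work_cpus work_seconds master_cpus seconds_left
work_fits() {
    local work_cpus="$1" work_seconds="$2" master_cpus="$3" seconds_left="$4"
    [ "$work_cpus" -le "$master_cpus" ] &&
        [ $((work_seconds + SAFETY_MARGIN)) -le "$seconds_left" ]
}

# Succeeds (returns 0) if the master job should terminate.
# usage: should_terminate idle_seconds maxidle seconds_left
should_terminate() {
    local idle_seconds="$1" maxidle="$2" seconds_left="$3"
    [ "$idle_seconds" -ge "$maxidle" ] || [ "$seconds_left" -lt "$SAFETY_MARGIN" ]
}
```

For a master job with 144 cpus and 7200 seconds left, "work_fits 144 7100 144 7200" succeeds, while "work_fits 144 7150 144 7200" fails because 7150 seconds plus the 60 second margin exceeds the time left.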

Example:
u.run_work_stream -name stream01 -maxidle 120 -queues p01 p02 p03 -t 7200 -mpi -cpus 144x1 -jn my_stream
A master job named stream01 will be started for 7200 seconds on 144 cpus. The batch scheduler job name will be my_stream, and pieces of work will be fetched from pseudo queues p01, p02 and p03. If no suitable work is found for more than 120 seconds, the master job will terminate.

Controlling a work stream

The work stream can be controlled via its control file

$HOME/.job_queues/.active_name_jobid

  • removing the file will terminate the stream after the current piece of work is done
  • writing
    MaxIdle=new_value
    in the control file will set a new value for the maximum idle time
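
The two control operations can be wrapped in small shell helpers, as in the sketch below. The helper names are hypothetical, and appending a MaxIdle line (rather than rewriting the file) is an assumption, since this page does not describe the full contents of the control file. The caller supplies the actual path of the control file.

```shell
# Hypothetical helpers for the control file; the names are illustrative.
# The control file path ($HOME/.job_queues/.active_name_jobid) must be
# supplied by the caller.

# Request a new maximum idle time (in seconds).
# Assumption: appending a MaxIdle=value line is an acceptable way of "writing".
set_max_idle() {
    local control_file="$1" new_value="$2"
    if [ ! -f "$control_file" ]; then
        echo "no such control file: $control_file" >&2
        return 1
    fi
    printf 'MaxIdle=%s\n' "$new_value" >> "$control_file"
}

# Ask the stream to terminate once the current piece of work is done.
stop_stream() {
    rm -f -- "$1"
}
```

set_max_idle refuses to create a missing file, so a mistyped path does not leave a stray control file behind.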

Aborting and rerunning a piece of work

A piece of work may abort and signal to the master job that it should be rerun (up to N times) with the following command:

. exit_and_rerun_work.dot N

This command also ensures that the post-work cleanup code inserted by ord_soumet is not performed.
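
As an illustration, a piece of work could request a rerun whenever one of its steps fails. The wrapper run_or_rerun below is hypothetical and not part of the work stream tools; it only assumes that exit_and_rerun_work.dot can be found on the PATH and is sourced as shown above.

```shell
# Hypothetical wrapper (not part of the work stream tools): run a command
# and, if it fails, abort the piece of work and ask the master job to rerun it.
# usage: run_or_rerun max_reruns command [args...]
run_or_rerun() {
    local max_reruns="$1"
    shift
    if ! "$@"; then
        echo "command failed, requesting a rerun (up to $max_reruns times)" >&2
        # Documented command: it is sourced, not executed.
        . exit_and_rerun_work.dot "$max_reruns"
    fi
}
```

Inside a job script this might be used as "run_or_rerun 3 ./my_model_step", where my_model_step is a placeholder for the actual work.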