Queued Work Stream: Difference between revisions

Revision as of 15 December 2011 at 12:58

under construction

Contents

1 Work Streams

Work Streams

What is a work stream

A work stream is a series of "jobs" having a similar resource profile (number of cpus). To avoid overtaxing the system job scheduler with a myriad of relatively "small" work items, these items are inserted into "pseudo queues" and processed by a "master job".

A user's work stream(s) will be found in directory $HOME/.job_queues
This directory in turn contains subdirectories, one for each "pseudo queue".
More than one master job can go "fishing" in the same "pseudo queue".
Job monitoring is started by the master job using u.job-monitor.
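
To illustrate the directory layout described above, here is a minimal sketch that lists the pseudo queues of a user by enumerating the subdirectories of $HOME/.job_queues. The helper name list_pseudo_queues is hypothetical and is not part of the work stream tools.

```shell
# List the pseudo queues (subdirectories) found in a work stream directory.
# Defaults to $HOME/.job_queues, the location given above.
list_pseudo_queues() {
    root=${1:-$HOME/.job_queues}
    for dir in "$root"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories.
        [ -d "$dir" ] || continue
        basename "$dir"
    done
}

list_pseudo_queues
```

Hidden files in that directory are not matched by the pattern, so only the queue subdirectories are printed.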

The main characteristics of a work stream are

a name (arbitrary)
a set of pseudo queues (may be used to implement a priority scheme)
a computing surface (number of cpus/cores)
a duration (number of hours, days, weeks...)
a maximum idle time (if a stream is using a large number of nodes, its maximum idle time should be very short)

Inserting work into a work queue

The ord_soumet utility is used to insert work into a "pseudo queue". The syntax is almost the same as for submitting a job to the system's batch scheduler. The "-q pseudo_queue_name@" parameter to ord_soumet is used to indicate that instead of being submitted directly, the piece of work (job) should rather be inserted into the "pseudo_queue_name" work queue.

To activate "queue" inheritance (a job/piece of work will automatically submit to the queue it came from):

use "-q" when calling ord_soumet
export SOUMET_EXTRAS="-q" (may be done using ~/.profile.d/.batch_profile)
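
As a minimal sketch of the insertion syntax described above, the snippet below builds the "-q pseudo_queue_name@" argument and prints the resulting ord_soumet command instead of running it. The job script name my_job.sh, its position as the first argument, and the helper pseudo_queue_arg are assumptions made for illustration; p01 is only a sample queue name.

```shell
# Build the value passed to -q: the trailing "@" asks ord_soumet to insert
# the work into the pseudo queue instead of submitting it directly.
pseudo_queue_arg() {
    printf '%s@' "$1"
}

# Dry run: print the command rather than execute it.
# "my_job.sh" is a hypothetical job script (assumption, not from the source).
echo ord_soumet my_job.sh -q "$(pseudo_queue_arg p01)"
```

Printing the command first makes it easy to check the queue name before any work is actually queued.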

Submitting the master job for a work stream

The master job will be launched with the u.run_work_stream command

A stream master job will terminate automatically if

no work was found for maxidle seconds
there is less than one minute left in the master job

A piece of work will be left in the queue if

there are not enough cpus in the master job to do the work
there is not enough time left in the master job to run the piece of work (including the safety margin of 1 minute)
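
The termination and skip rules above can be sketched as two small predicates. This is only an illustration of the stated conditions, not the code used by u.run_work_stream: the function names are hypothetical, and treating the boundary cases as inclusive (idle time equal to maxidle, work that fits exactly) is an assumption.

```shell
SAFETY_MARGIN=60   # the 1 minute safety margin, in seconds

# Succeeds when the master job should terminate.
# $1 = seconds spent without work, $2 = maxidle, $3 = seconds left in the master job
master_should_terminate() {
    [ "$1" -ge "$2" ] || [ "$3" -lt "$SAFETY_MARGIN" ]
}

# Succeeds when a piece of work can be taken from the queue.
# $1 = cpus needed, $2 = seconds needed, $3 = cpus in the master job, $4 = seconds left
work_fits() {
    [ "$1" -le "$3" ] && [ $(( $2 + SAFETY_MARGIN )) -le "$4" ]
}

# Example: a 16 cpu, 600 second piece of work in a 144 cpu master job with 7200 seconds left.
if work_fits 16 600 144 7200; then
    echo "work can run"
else
    echo "work stays in the queue"
fi
```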

example:
u.run_work_stream -instances 1 -name stream01 -maxidle 120 -queues p01 p02 p03 -t 7200 -mpi -cpus 144x1 -jn my_stream
A single-instance master job named stream01 will be started for 7200 seconds (2 hours) on 144 cpus. The batch scheduler job name will be my_stream, and pieces of work will be fetched from pseudo queues p01, p02 and p03. If no suitable work is found for more than 120 seconds, the master job will terminate.

Controlling a work stream

The work stream can be controlled via its control file

$HOME/.job_queues/.active_name_jobid

Removing the file will terminate the stream after the current piece of work is done.
Writing MaxIdle=new_value in the control file will set the new value for the maximum idle time.
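
The two control operations above could be wrapped as follows. This is a sketch under assumptions: the function names are hypothetical, and appending the MaxIdle line (rather than overwriting the file) is a guess at how the control file is meant to be edited.

```shell
# Append a new maximum idle time (in seconds) to a stream's control file.
# $1 = path to the control file, $2 = new value
set_max_idle() {
    control_file=$1
    new_value=$2
    if [ ! -f "$control_file" ]; then
        echo "no such control file: $control_file" >&2
        return 1
    fi
    echo "MaxIdle=$new_value" >> "$control_file"
}

# Remove the control file; the stream terminates after the current piece of work.
stop_stream() {
    rm -f "$1"
}

# Usage, with the control file path as given above:
# set_max_idle "$HOME/.job_queues/.active_name_jobid" 30
# stop_stream "$HOME/.job_queues/.active_name_jobid"
```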

Aborting and rerunning a piece of work

A piece of work may abort and signal to the master job that it should be rerun (up to N times) with the following command:

. exit_and_rerun_work.dot N

This command also makes sure that the post-work cleanup code inserted by ord_soumet is not performed.
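
As a hedged sketch of how this might look inside a piece of work, the wrapper below runs a command and, only if it fails, sources exit_and_rerun_work.dot with the rerun limit. The wrapper name run_or_rerun and the step ./my_model_step are hypothetical, and passing arguments to a sourced file assumes a shell such as bash or ksh.

```shell
# Run a command; if it fails, ask the master job to rerun this piece of work.
# $1 = N, the maximum number of reruns; remaining arguments = command to run
run_or_rerun() {
    max_reruns=$1
    shift
    if "$@"; then
        return 0
    fi
    # Sourced with a leading dot, exactly as in the command shown above.
    . exit_and_rerun_work.dot "$max_reruns"
}

# Hypothetical usage inside a piece of work ("./my_model_step" is a placeholder):
# run_or_rerun 3 ./my_model_step
```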

@@ Line 13: / Line 13: @@
 === What is a work stream  ===
-A work stream is a series of "jobs" having a similar resource profile. In order not to overtax the system job scheduler with a myriad of relatively "small" work items, said items are inserted into "pseudo queues" and processed by a "master job".
+A work stream is a series of "jobs" having a similar resource profile (number of cpus). In order not to overtax the system job scheduler with a myriad of relatively "small" work items, said items are inserted into "pseudo queues" and processed by a "master job".
 *A user's work stream(s) will be found in directory '''$HOME/.job_queues''' <br>This directory in turn contains subdirectories, one for each "pseudo queue".
@@ Line 22: / Line 22: @@
 *a name (arbitrary)
-*a set of pseudo queues (may be used to implement some sort of priority scheme)
+*a set of pseudo queues (may be used to implement a priority scheme)
-*a computing surface (number of nodes)
+*a computing surface (number of cpus/cores)
 *a duration (number of hours, days, weeks...)
 *a maximum idle time (if a stream is using a large number of nodes, its maximum idle time should be very short)
@@ Line 31: / Line 31: @@
 The [[Soumet : travaux par lots / batch jobs|ord_soumet]] utility is used to insert work into a "pseudo queue". The syntax is almost the same as for submitting a job to the system's batch scheduler. The "'''-q pseudo_queue_name@'''" parameter to ord_soumet is used to indicate that instead of being submitted directly, the piece of work (job) should rather be inserted into the "pseudo_queue_name" work queue.<br>
-In order to activate "queue" inheritance (a job/piece of work will automagically submit to its own queue)
+In order to activate "queue" inheritance (a job/piece of work will automagically submit to the queue it is coming from)
-*use "'''-q'''" when calling [[Soumet : travaux par lots / batch jobs|ord_soumet]]
+#use "'''-q'''" when calling [[Soumet : travaux par lots / batch jobs|ord_soumet]]
-*export SOUMET_EXTRAS="-q" (may be done using ~/.profile.d/.batch_profile)
+#export SOUMET_EXTRAS="-q" (may be done using ~/.profile.d/.batch_profile)
-=== Submitting a master job for a work stream  ===
+=== Submitting the master job for a work stream  ===
-By submitting a master job with the '''u.run_work_stream''' command
+The master job will be launched with the '''u.run_work_stream''' command
 A stream master job will terminate automatically if
@@ Line 50: / Line 50: @@
 *there is not enough time left in the master job to run the piece of work (including the safety margin of 1 minute)
-example: <br><tt>u.run_work_stream -name stream01 -maxidle 120 -queues p01 p02 p03 -t 7200 -mpi -cpus 144x1 -jn my_stream</tt><br> A master job stream01 will be started for 7200 seconds on 144 cpus, the batch scheduler job name will be my_stream, pieces of work will be fetched from pseudo queues p01, p02 and p03. If no suitable work if found for more than 120 seconds, the master job will terminate.
+example: <br><tt>u.run_work_stream -instances 1 -name stream01 -maxidle 120 -queues p01 p02 p03 -t 7200 -mpi -cpus 144x1 -jn my_stream</tt><br> A single instance master job named stream01 will be started for 7200 seconds on 144 cpus, the batch scheduler job name will be my_stream, pieces of work will be fetched from pseudo queues p01, p02 and p03. If no suitable work if found for more than 120 seconds, the master job will terminate.
 === Controlling a work stream  ===
