Queued Work Stream: difference between revisions
Revision as of 15 December 2011, 12:58
under construction
Work Streams
What is a work stream
A work stream is a series of "jobs" having a similar resource profile (number of cpus). In order not to overtax the system job scheduler with a myriad of relatively "small" work items, said items are inserted into "pseudo queues" and processed by a "master job".
- A user's work stream(s) will be found in directory $HOME/.job_queues. This directory in turn contains subdirectories, one for each "pseudo queue".
- More than one master job can go "fishing" into a "pseudo queue".
- Job monitoring will be started by the master job using u.job-monitor
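The directory layout described above can be inspected with a few lines of shell. This is a minimal sketch, assuming only what the text states (one subdirectory per pseudo queue under $HOME/.job_queues); the function name list_pseudo_queues and its optional root argument are hypothetical and not part of the work stream tools.

```shell
# List pseudo queues: one subdirectory per queue under the work stream root.
# The default root comes from the text; the optional argument is for illustration only.
list_pseudo_queues() {
  root="${1:-$HOME/.job_queues}"
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue   # skip the unexpanded pattern when no subdirectory exists
    basename "$dir"
  done
}
```

Calling list_pseudo_queues with no argument prints one pseudo queue name per line for the current user. Hidden entries in the root directory are not matched by the pattern, so they do not show up as queues.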
The main characteristics of a work stream are
- a name (arbitrary)
- a set of pseudo queues (may be used to implement a priority scheme)
- a computing surface (number of cpus/cores)
- a duration (number of hours, days, weeks...)
- a maximum idle time (if a stream is using a large number of nodes, its maximum idle time should be very short)
Inserting work into a work queue
The ord_soumet utility is used to insert work into a "pseudo queue". The syntax is almost the same as for submitting a job to the system's batch scheduler. The "-q pseudo_queue_name@" parameter to ord_soumet is used to indicate that instead of being submitted directly, the piece of work (job) should rather be inserted into the "pseudo_queue_name" work queue.
To activate "queue" inheritance (a job or piece of work will automatically submit to the queue it came from):
- use "-q" when calling ord_soumet
- export SOUMET_EXTRAS="-q" (may be done using ~/.profile.d/.batch_profile)
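As an illustration, an insertion into a pseudo queue might look like the sketch below. Only the "-q pseudo_queue_name@" form and the SOUMET_EXTRAS setting come from the text; the job script name my_job.sh, the queue name p01, and the assumption that the job file is the first argument to ord_soumet are placeholders. The command is printed rather than executed, so the sketch can be tried on a machine without ord_soumet.

```shell
# Hypothetical job script and pseudo queue names, for illustration only.
job_file="my_job.sh"
pseudo_queue="p01"

# "-q name@" asks for insertion into the pseudo queue instead of direct submission.
# Assumption: the job file is passed as the first argument.
submit_cmd="ord_soumet ${job_file} -q ${pseudo_queue}@"
echo "${submit_cmd}"      # dry run: print the command instead of running it

# Queue inheritance setting, as described in the list above
export SOUMET_EXTRAS="-q"
```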
Submitting the master job for a work stream
The master job will be launched with the u.run_work_stream command
A stream master job will terminate automatically if
- no work was found for maxidle seconds
- there is less than one minute left in the master job
A piece of work will be left in the queue if
- there are not enough cpus in the master job to do the work
- there is not enough time left in the master job to run the piece of work (including the safety margin of 1 minute)
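The two conditions above amount to a simple admission test. The following is a sketch of that reasoning only, not the code used by the master job; the function name work_fits, its argument order, and the choice of inclusive comparisons are assumptions.

```shell
SAFETY_MARGIN=60   # seconds; the one-minute safety margin mentioned above

# work_fits MASTER_CPUS TIME_LEFT WORK_CPUS WORK_TIME   (times in seconds)
# Succeeds (exit status 0) when the piece of work could run in the master job.
work_fits() {
  master_cpus=$1; time_left=$2; work_cpus=$3; work_time=$4
  # Not enough cpus in the master job: leave the work in the queue
  [ "$work_cpus" -le "$master_cpus" ] || return 1
  # Not enough time left once the safety margin is added: leave it in the queue
  [ $((work_time + SAFETY_MARGIN)) -le "$time_left" ] || return 1
  return 0
}
```

For example, with 144 cpus and 7200 seconds left, a 16-cpu piece of work needing 3600 seconds passes the test, while one needing 7141 seconds does not.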
example:
u.run_work_stream -instances 1 -name stream01 -maxidle 120 -queues p01 p02 p03 -t 7200 -mpi -cpus 144x1 -jn my_stream
A single-instance master job named stream01 will be started for 7200 seconds on 144 cpus. The batch scheduler job name will be my_stream, and pieces of work will be fetched from pseudo queues p01, p02 and p03. If no suitable work is found for more than 120 seconds, the master job will terminate.
Controlling a work stream
The work stream can be controlled via its control file
$HOME/.job_queues/.active_name_jobid
- removing the file will terminate the stream after the current piece of work is done
- writing MaxIdle=new_value in the control file will implement the new value for max idle time
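Both control actions can be wrapped in small helpers. This is a sketch under several assumptions: "name" and "jobid" in the control file name are taken to be placeholders for the stream name and the batch job id, the new value is appended to the file rather than replacing its contents, and the helper names and the JOB_QUEUES_ROOT variable are hypothetical.

```shell
# control_file_path NAME JOBID: path of the control file, following the pattern above.
# JOB_QUEUES_ROOT is a hypothetical override; the default is the documented directory.
control_file_path() {
  echo "${JOB_QUEUES_ROOT:-$HOME/.job_queues}/.active_${1}_${2}"
}

# set_max_idle FILE SECONDS: request a new maximum idle time.
# Assumption: appending the line is enough for the master job to pick it up.
set_max_idle() {
  echo "MaxIdle=$2" >> "$1"
}

# stop_stream FILE: the stream terminates after the current piece of work is done.
stop_stream() {
  rm -f "$1"
}
```

A typical use would be set_max_idle "$(control_file_path stream01 12345)" 30, where stream01 and 12345 stand in for a real stream name and job id.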
Aborting and rerunning a piece of work
A piece of work may abort and signal to the master job that it should be rerun (up to N times) with the following command:
. exit_and_rerun_work.dot N
This command also makes sure that the post-work cleanup code inserted by ord_soumet is not performed.