[GE users] (another) slotwise preemption question
reuti at staff.uni-marburg.de
Fri Aug 27 10:28:16 BST 2010
On 27.08.2010 at 10:51, spow_ wrote:
> reuti wrote:
> > <snip>
> >> So, question is, why is SGE trying to push a 5th job onto
> >> a machine that has only 4 slots, and all 4 are "busy" ? And, is
> >> there a way around this ?
> > What about using a checkpointing environment for the jobs in the secondary queue, where suspending the job will kill and requeue it (check-transparent will do already)? You wouldn't need any special script like the one you are using for the suspension right now.
> Could you explain this further? I am also using a co-scheduler to qmod -rj jobs
> that have 'S' in their state, which means their slots got preempted, and I am
> also concerned about the example the OP gave.
> Does the check-transparent environment automatically requeue jobs that got
> suspended?
Yep. Setting "when x" in the checkpointing environment will do it.
> Can it be used _without_ any end-user code/script modification? (just specify
> parameters in SGE)
Yep (this way the jobs will restart from the beginning; real checkpointing takes more effort).
There are some nice state diagrams in:
and a Howto is also available: http://gridengine.sunsource.net/howto/checkpointing.html
If you just want a reschedule in case a job gets suspended (either automatically or by a `qmod -sj <jobid>`), the checkpointing environment can look like:
$ qconf -sckpt check_transparent
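The output of that command is not preserved here. As a rough sketch only, a configuration for this purpose might look like the file written below. The field names are the ones described in `man checkpoint`; the values (in particular `ckpt_dir /tmp` and `signal NONE`) and the file name are assumptions, not the original output. The part that matters for this thread is `when x`, which triggers on suspension of the job.

```shell
# Write a hypothetical definition of the checkpointing environment to a file.
# All values are assumptions for illustration; adjust them for your cluster.
cat > check_transparent.ckpt <<'EOF'
ckpt_name          check_transparent
interface          transparent
ckpt_command       NONE
migr_command       NONE
restart_command    NONE
clean_command      NONE
ckpt_dir           /tmp
signal             NONE
when               x
EOF

# To load it into SGE (needs manager rights), something like:
#   qconf -Ackpt check_transparent.ckpt
```

With no checkpoint or restart command defined, nothing is actually saved: the job is simply killed and requeued when it gets suspended.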
You will have to attach this to a queue and then either request it on the command line:
$ qsub -ckpt check_transparent ...
or have a JSV add it automatically for some queues. Note that when you request the checkpointing environment in `qsub` and it is attached to only one queue, you no longer need to request a particular queue: because of the request, the job can only run in that one.
PS: `man sge_ckpt` and `man checkpoint` have additional information.
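Put together, attaching and requesting the environment could be sketched as follows. The queue name `secondary.q` and the job script `job.sh` are hypothetical placeholders. The small `run` wrapper only prints each command unless `RUN=1` is set, so the sketch can be tried on a machine without SGE.

```shell
QUEUE="secondary.q"        # hypothetical queue name
CKPT="check_transparent"   # environment name used in this thread
JOB="job.sh"               # hypothetical job script

# Print each command; execute it only when RUN=1 is set in the environment.
run() {
    echo "$*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# Attach the checkpointing environment to the queue (adds to its ckpt_list).
run qconf -aattr queue ckpt_list "$CKPT" "$QUEUE"

# Submit a job that requests the environment; no queue request is needed
# if the environment is attached to this one queue only.
run qsub -ckpt "$CKPT" "$JOB"
```

Suspending such a job, for example with `qmod -sj <jobid>`, should then lead to it being killed and requeued instead of staying in state 'S'.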