[GE users] (another) slotwise preemption question

reuti reuti at staff.uni-marburg.de
Fri Aug 27 10:28:16 BST 2010



Hi,

On 27.08.2010 at 10:51, spow_ wrote:

> reuti wrote:
> > <snip>
> >> So, question is, why is SGE trying to push a 5th job onto
> >> a machine that has only 4 slots, and all 4 are "busy" ? And, is
> >> there a way around this ?
> >>     
> >
> > What about using a checkpointing environment for the jobs in the secondary queue, where suspending a job kills and requeues it (check-transparent is already enough for this)? You wouldn't need any special script like the one you are using for the suspension right now.
> >   
> Could you explain this further? I am also using a co-scheduler to qmod -rj jobs
> that have 'S' in their state, which means their slots got preempted, and I am
> also concerned about the example the OP described.
> Does the check-transparent environment automatically requeue jobs that got
> suspended?

Yes. Setting "when x" in the checkpointing environment does this.


> Can it be used _without_ any end-user code/script modification? (just specify
> parameters in SGE)

Yes (this way the jobs restart from the very beginning; real checkpointing takes more effort).


There are some nice state diagrams in:

http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf

A Howto is also available: http://gridengine.sunsource.net/howto/checkpointing.html

If you just want a reschedule in case a job gets suspended (either automatically or by a `qmod -sj <jobid>`), the checkpointing environment can look like:

$ qconf -sckpt check_transparent
ckpt_name          check_transparent
interface          transparent
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             usr1
when               x
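
A minimal sketch of how an environment like the one above could be created and attached to a queue non-interactively. The queue name `secondary.q` and the file location are assumptions for illustration only; `qconf -Ackpt` and `qconf -aattr` are the standard ways to add a checkpointing environment from a file and to extend a queue's `ckpt_list`. If `qconf` is not in the PATH, the script only prints what it would run.

```shell
#!/bin/sh
# Sketch: define the checkpointing environment from a file and attach it to a queue.
# Assumptions (not from the original post): queue name and file path are hypothetical.
CKPT_FILE="${TMPDIR:-/tmp}/check_transparent.ckpt"
QUEUE="secondary.q"

# Same settings as shown by `qconf -sckpt check_transparent` above.
cat > "$CKPT_FILE" <<'EOF'
ckpt_name          check_transparent
interface          transparent
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             usr1
when               x
EOF

if command -v qconf >/dev/null 2>&1; then
    # Add the checkpointing environment from the file ...
    qconf -Ackpt "$CKPT_FILE"
    # ... and append it to the ckpt_list of the queue.
    qconf -aattr queue ckpt_list check_transparent "$QUEUE"
else
    echo "qconf not found; would run:"
    echo "  qconf -Ackpt $CKPT_FILE"
    echo "  qconf -aattr queue ckpt_list check_transparent $QUEUE"
fi
```

Instead of `qconf -aattr` you can also edit the `ckpt_list` entry by hand with `qconf -mq <queue>`.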

The checkpointing environment has to be attached to a queue, and then either requested on the command line:

$ qsub -ckpt check_transparent ...

or added automatically for some queues by a JSV. Note that when you request the checkpointing environment in `qsub` and it is attached to only one queue, you no longer need to request a particular queue: because of the request, the job can only run in that queue.
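
For the JSV variant, a rough sketch could look like the following. It writes a small shell JSV that adds `-ckpt check_transparent` to jobs which hard-request one queue and have no checkpointing environment yet. The file name `jsv_ckpt.sh` and the queue name `secondary.q` are hypothetical; the helper functions come from the `jsv_include.sh` shipped under `$SGE_ROOT/util/resources/jsv/`, so this needs a Grid Engine version with JSV support. Treat it as a starting point, not a tested production script.

```shell
#!/bin/sh
# Sketch: generate a JSV script that requests the checkpointing environment
# automatically. File name and queue name are assumptions for illustration.
JSV_FILE="./jsv_ckpt.sh"

cat > "$JSV_FILE" <<'EOF'
#!/bin/sh
# Load the JSV helper functions shipped with Grid Engine.
. "${SGE_ROOT}/util/resources/jsv/jsv_include.sh"

jsv_on_start()
{
   # No job environment is needed for this check.
   return
}

jsv_on_verify()
{
   # Only touch jobs that hard-request the (hypothetical) secondary queue
   # and do not already request a checkpointing environment.
   if [ "$(jsv_get_param q_hard)" = "secondary.q" ] &&
      [ "$(jsv_get_param ckpt)" = "" ]; then
      jsv_set_param ckpt check_transparent
      jsv_correct "Added checkpointing environment check_transparent"
   else
      jsv_accept "Job is accepted"
   fi
   return
}

jsv_main
EOF
chmod +x "$JSV_FILE"
```

Such a script can be tried per submission with `qsub -jsv /path/to/jsv_ckpt.sh ...`, or enabled for all submissions via `jsv_url` in the global configuration (`qconf -mconf`).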

-- Reuti

PS: `man sge_ckpt` and `man checkpoint` have additional information.
