[GE users] powered off nodes and SGE

reuti reuti at staff.uni-marburg.de
Fri Aug 13 12:54:58 BST 2010

On 12.08.2010 at 20:04, kisielk wrote:

>>> 2) When a node is powered off, does the scheduler ignore that node or still schedule jobs on it?
>> No, it won't schedule any job to it. OTOH, when a node shuts down while it's running a job, SGE can be configured to reschedule the job to a different node to restart from the beginning.
>> -- Reuti
> Can you give an example of how to set this up? 

$ qconf -sconf
max_unheard                  00:05:00
reschedule_unknown           00:01:00

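With these values a host is set to the unknown state after five minutes without contact, and its jobs are rescheduled one minute later. This only applies to jobs that are marked rerunnable, either per job (`qsub -r y ...`) or per queue ("rerun TRUE").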

For details please see `man sge_conf`.
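
As a rough sketch of the per-queue and per-job settings mentioned above: the helper names below are made up for illustration, and the queue name and job script are placeholders you pass in. The `qconf -mattr` call is assumed to be acceptable for your setup; editing the queue with `qconf -mq` and setting "rerun TRUE" by hand should have the same effect.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, arguments are placeholders.

# Mark a cluster queue as rerunnable, i.e. set "rerun TRUE" in its configuration.
enable_queue_rerun() {
    qconf -mattr queue rerun TRUE "$1"
}

# Submit a single job as rerunnable, independent of the queue default.
submit_rerunnable() {
    qsub -r y "$@"
}

# Print only the two global settings shown above.
show_reschedule_settings() {
    qconf -sconf | grep -E '^(max_unheard|reschedule_unknown)[[:space:]]'
}
```

For example, `enable_queue_rerun all.q` followed by `submit_rerunnable job.sh`. The global values themselves can be changed with `qconf -mconf`, which opens the global configuration in an editor.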

> I've had problems getting SGE to handle jobs on nodes that go down. The queues go into the "au" state but the job remains there. Sometimes after the node reboots, SGE still reports the jobs as running in those queues even though they clearly are not.
> My nodes use tmpfs mounted spool directories, so the spool is empty after they reboot. Perhaps this has something to do with it?

Mmh, in this case the restarted execd can't know anything about the jobs that were running before the crash. So the exechost neither sends an email about the failed job nor informs the qmaster.

AFAIK, there is nothing implemented that will synchronize the list of jobs the qmaster expects to be on a node, and the list the execd knows about.
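
Since nothing built in does this comparison, one could do it by hand. The following is only a sketch of the idea, not an SGE feature: `stale_jobs` is a hypothetical helper, and both input files (one job ID per line) are assumptions. How you collect them, e.g. from `qstat` output for the host and from whatever is still found on the node, is left to you.

```shell
#!/bin/sh
# Sketch only: both input files are hypothetical, one job ID per line.
#   $1: job IDs the qmaster believes are running on the node
#   $2: job IDs actually found on the node after the reboot (may be empty)
# Prints the IDs that the qmaster expects but the node no longer has.
stale_jobs() {
    expected=$(mktemp) || return 1
    actual=$(mktemp) || { rm -f "$expected"; return 1; }
    sort -u "$1" > "$expected"
    sort -u "$2" > "$actual"
    # comm -23 keeps only the lines unique to the first (sorted) file.
    comm -23 "$expected" "$actual"
    rm -f "$expected" "$actual"
}
```

The IDs printed this way are candidates for manual cleanup, for instance with `qdel -f` to force their removal from the qmaster, or with `qmod -rj` to reschedule them if they are rerunnable.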

-- Reuti

> In any case, what I would like is that, when a node becomes unavailable, the jobs on it either fail or are restarted on another node if they are eligible.
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=274052
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

