[GE users] Using qdel leaves queues in error status

Andreas.Haas at Sun.COM
Mon May 26 15:51:02 BST 2008


Hi Filipe,

do you keep your execd spool directories on a shared volume?
If so, you should consider moving them to local volumes:

    http://gridengine.sunsource.net/howto/nfsreduce.html

Note that this is always worthwhile in terms of overall throughput,
and there is a chance that the error you observe will no longer
occur.
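If you are unsure which filesystem a spool directory actually lives on,
a quick Linux-only check is to scan /proc/mounts (a standalone sketch,
not an SGE tool; the spool path below is only a placeholder, adjust it
to your installation):

```python
import os

def filesystem_type(path):
    """Return the filesystem type of the mount that holds `path`
    by scanning /proc/mounts (Linux only)."""
    path = os.path.realpath(path)
    best, fstype = "", "unknown"
    with open("/proc/mounts") as mounts:
        for line in mounts:
            _device, mnt, fs = line.split()[:3]
            # Pick the longest mount point that is a prefix of `path`.
            if path == mnt or path.startswith(mnt.rstrip("/") + "/"):
                if len(mnt) > len(best):
                    best, fstype = mnt, fs
    return fstype

# Placeholder path: substitute your execd spool directory here.
spool = "/var/spool/sge"
fs = filesystem_type(spool if os.path.exists(spool) else "/")
print("spool filesystem:", fs)
if fs.startswith("nfs"):
    print("-> consider moving the execd spool to a local volume")
```

If the reported type starts with "nfs", the spool is on a shared
volume and the howto above applies.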

Another possibility is to upgrade to 6.1u4. That way you would get
the fix for

    752      6288953   scalability issue with qdel and very large array jobs

which we actually fixed already during the 6.1 beta:

    http://gridengine.sunsource.net/project/gridengine/61patches.txt
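Incidentally, the fix Filipe guesses at below (deferring signal
delivery across the shepherd's startup) is the standard block/unblock
pattern. A minimal sketch in plain Python, not SGE code, of how
deferral closes the window:

```python
import os
import signal
import time

# A signal that arrives before startup is finished is held pending
# while blocked, and is only handled once the critical section is done.
received = []

def on_term(signum, frame):
    received.append(signum)

signal.signal(signal.SIGTERM, on_term)

# Block SIGTERM while doing work that must complete before any kill
# is acted upon (for the shepherd: writing its state files).
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
os.kill(os.getpid(), signal.SIGTERM)  # the "too early" qdel
startup_done = True                   # critical section runs undisturbed
assert received == []                 # delivery is deferred, not lost

# Unblock: the pending SIGTERM is delivered only now, after startup.
signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGTERM})
time.sleep(0)  # let CPython run the Python-level handler
print("signal handled after startup:", received == [signal.SIGTERM])
```

The C equivalent in the shepherd would use sigprocmask() around the
section that creates the "exit_status" and related files.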

Regards,
Andreas


On Fri, 23 May 2008, Filipe Brandenburger wrote:

> Hello,
>
> I'm quite new to SGE, but I'm managing a fairly large installation. We
> are using SGE 6.0 on Linux CentOS 4.
>
> I'm having a problem (it has happened twice this week) when users
> submit a very large number of jobs and then use qdel to kill all of
> them. Both times it left me with queues in (E) error state.
>
> The problem appears to happen when the kill signal is delivered before
> the job has started. sge_shepherd quits with a message saying that the
> "exit_status" file did not exist, returns code 7 (problem before
> prolog), and that leaves the queue in error state.
>
> I don't think this should leave the queue in error state: because of
> it, no new jobs will run on that node until an administrator realises
> the problem happened and re-enables the queue, and we lose CPU time in
> the meantime. There is nothing actually wrong with the queue; to me
> this looks like a "race condition" bug. Properly blocking and
> re-enabling signals would probably fix it, if I remember my Unix
> programming correctly.
>
> Below I'm attaching the logs for both times the problem happened.
>
> The questions I have are:
>
> Has anyone else seen this problem before? How do you deal with it?
>
> Is there a fix for it? Does version 6.1 address this problem? Is there a
> patch that I could apply to my installation of 6.0?
>
> Why did the jobs die with signal HUP in the first case but with signal
> KILL in the second?
>
> TIA,
> Filipe
>
>
>
>
>
> These are the logs for the first time:
>
>> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: endpoint is not unique error (endpoint "submitnode1.mydomain.com/qrsh/63548" is already connected)
>> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: endpoint is not unique error (endpoint "submitnode1.mydomain.com/qrsh/63546" is already connected)
>> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: got read error (closing "submitnode1.mydomain.com/qrsh/63548")
>> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: got read error (closing "submitnode1.mydomain.com/qrsh/63546")
>> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: got read error (closing "submitnode1.mydomain.com/qrsh/63631")
> ...
>> 05/20/2008 09:29:38|qmaster|sgemaster|E|commlib error: got read error (closing "submitnode1.mydomain.com/qrsh/63593")
>> 05/20/2008 09:29:38|qmaster|sgemaster|E|commlib error: got read error (closing "submitnode1.mydomain.com/qrsh/63568")
>> 05/20/2008 09:29:38|qmaster|sgemaster|W|job 7972205.1 failed on host d03.mydomain.com assumedly after job because: job 7972205.1 died through signal HUP (1)
>> 05/20/2008 09:29:38|qmaster|sgemaster|W|job 7972216.1 failed on host s04.mydomain.com assumedly after job because: job 7972216.1 died through signal HUP (1)
>> 05/20/2008 09:29:38|qmaster|sgemaster|W|job 7972221.1 failed on host b13.mydomain.com assumedly after job because: job 7972221.1 died through signal HUP (1)
>> 05/20/2008 09:29:39|qmaster|sgemaster|W|job 7972233.1 failed on host d03.mydomain.com assumedly after job because: job 7972233.1 died through signal HUP (1)
>> 05/20/2008 09:29:39|qmaster|sgemaster|W|job 7972314.1 failed on host l02.mydomain.com assumedly after job because: job 7972314.1 died through signal HUP (1)
> ...
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972298.1 failed on host s12.mydomain.com assumedly after job because: job 7972298.1 died through signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972208.1 failed on host d04.mydomain.com assumedly after job because: job 7972208.1 died through signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972239.1 failed on host d04.mydomain.com assumedly after job because: job 7972239.1 died through signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972258.1 failed on host d04.mydomain.com general before prolog because: shepherd exited with exit status 7
>> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972258's failure at host d04.mydomain.com
>> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972258's failure at host d04.mydomain.com
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972212.1 failed on host s01.mydomain.com assumedly after job because: job 7972212.1 died through signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972240.1 failed on host s01.mydomain.com assumedly after job because: job 7972240.1 died through signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972260.1 failed on host s01.mydomain.com general before prolog because: shepherd exited with exit status 7
>> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972260's failure at host s01.mydomain.com
>> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972260's failure at host s01.mydomain.com
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972283.1 failed on host j12.mydomain.com assumedly after job because: job 7972283.1 died through signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972311.1 failed on host j12.mydomain.com assumedly after job because: job 7972311.1 died through signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972327.1 failed on host j12.mydomain.com general before prolog because: shepherd exited with exit status 7
>> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972327's failure at host j12.mydomain.com
>> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972327's failure at host j12.mydomain.com
>> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972300.1 failed on host s14.mydomain.com assumedly after job because: job 7972300.1 died through signal HUP (1)
>> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972318.1 failed on host s14.mydomain.com assumedly after job because: job 7972318.1 died through signal HUP (1)
>> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972334.1 failed on host s14.mydomain.com general before prolog because: shepherd exited with exit status 7
>> 05/20/2008 09:29:44|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972334's failure at host s14.mydomain.com
>> 05/20/2008 09:29:44|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972334's failure at host s14.mydomain.com
>> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972209.1 failed on host j14.mydomain.com assumedly after job because: job 7972209.1 died through signal HUP (1)
>> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972238.1 failed on host j14.mydomain.com assumedly after job because: job 7972238.1 died through signal HUP (1)
> ...
>> 05/20/2008 09:29:47|qmaster|sgemaster|W|job 7972371.1 failed on host k01.mydomain.com assumedly after job because: job 7972371.1 died through signal HUP (1)
>> 05/20/2008 09:29:47|qmaster|sgemaster|W|job 7972335.1 failed on host s02.mydomain.com assumedly after job because: job 7972335.1 died through signal HUP (1)
>> 05/20/2008 09:29:48|qmaster|sgemaster|W|job 7972372.1 failed on host k01.mydomain.com assumedly after job because: job 7972372.1 died through signal HUP (1)
>> 05/20/2008 09:29:48|qmaster|sgemaster|E|ack event for unknown job 7972361
>> 05/20/2008 09:29:48|qmaster|sgemaster|E|ack event for unknown job 7972365
>> 05/20/2008 09:29:49|qmaster|sgemaster|W|job 7972373.1 failed on host k01.mydomain.com assumedly after job because: job 7972373.1 died through signal HUP (1)
>> 05/20/2008 09:29:49|qmaster|sgemaster|W|job 7972374.1 failed on host k01.mydomain.com assumedly after job because: job 7972374.1 died through signal HUP (1)
>> 05/20/2008 09:29:49|qmaster|sgemaster|W|job 7972375.1 failed on host k01.mydomain.com assumedly after job because: job 7972375.1 died through signal HUP (1)
>
> Logs of the nodes:
>
>> 05/20/2008 09:29:42|execd|d04|E|abnormal termination of shepherd for job 7972258.1: no "exit_status" file
>> 05/20/2008 09:29:42|execd|s01|E|abnormal termination of shepherd for job 7972260.1: no "exit_status" file
>> 05/20/2008 09:29:42|execd|j12|E|abnormal termination of shepherd for job 7972327.1: no "exit_status" file
>> 05/20/2008 09:29:42|execd|s14|E|abnormal termination of shepherd for job 7972334.1: no "exit_status" file
>
>
>
> Second time the problem happened:
>
>> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011604.1 failed on host d05.mydomain.com assumedly after job because: job 8011604.1 died through signal KILL (9)
>> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011572.1 failed on host j12.mydomain.com assumedly after job because: job 8011572.1 died through signal KILL (9)
>> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011707.1 failed on host j05.mydomain.com assumedly after job because: job 8011707.1 died through signal KILL (9)
>> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011708.1 failed on host j05.mydomain.com assumedly after job because: job 8011708.1 died through signal KILL (9)
>> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8012402.1 failed on host j05.mydomain.com general before prolog because: shepherd exited with exit status 7
>> 05/21/2008 18:42:10|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 8012402's failure at host j05.mydomain.com
>> 05/21/2008 18:42:10|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 8012402's failure at host j05.mydomain.com
>> 05/21/2008 18:42:11|qmaster|sgemaster|W|job 8012368.1 failed on host j14.mydomain.com assumedly after job because: job 8012368.1 died through signal KILL (9)
>> 05/21/2008 18:42:11|qmaster|sgemaster|W|job 8011730.1 failed on host j14.mydomain.com assumedly after job because: job 8011730.1 died through signal KILL (9)
>> 05/21/2008 18:42:11|qmaster|sgemaster|W|job 8012332.1 failed on host j14.mydomain.com assumedly after job because: job 8012332.1 died through signal KILL (9)
>> 05/21/2008 18:42:11|qmaster|sgemaster|E|commlib error: got read error (closing "submitnode2.mydomain.com/qdel/30540")
>
> Log of the node:
>
>> 05/21/2008 18:42:07|execd|j05|E|abnormal termination of shepherd for job 8012402.1: no "exit_status" file
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>

http://gridengine.info/

Registered office: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
District court Munich: HRB 161028
Managing directors: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Chairman of the supervisory board: Martin Haering




