[GE users] Using qdel leaves queues in error status

Filipe Brandenburger filipe.brandenburger at idilia.com
Fri May 23 14:28:31 BST 2008



Hello,

I'm fairly new to SGE, but I manage a fairly large installation. We
are running SGE 6.0 on CentOS 4 Linux.

I have a problem that happened twice this week: when users submit a very
large number of jobs and then use qdel to kill all of them, some queues
are left in the (E) error state.

The problem appears to happen when the kill signal is delivered before
the job has started. sge_shepherd quits with a message that says that
the "exit_status" file did not exist, then it returns code 7 (problem
before prolog), and this leaves the queue in error state.

I don't think this should leave the queue in error state. Because of it,
no new jobs will run on that node until an administrator notices the
problem and re-enables the queue, and we lose CPU time in the meantime.
There is nothing really wrong with the queue; to me this looks like a
race condition bug. Properly disabling and re-enabling signals would
probably fix it, if I remember Unix programming correctly.
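
To illustrate what I mean by disabling and re-enabling signals, here is
a minimal sketch in Python. It is not sge_shepherd's code, and the
function name and the default signal list are my own assumptions. It
holds signals pending while a startup step runs and delivers them only
after that step has finished. Note that SIGKILL can never be blocked, so
something like this could only help in the HUP case.

```python
import os
import signal


def run_with_signals_blocked(critical_section,
                             signals=(signal.SIGHUP, signal.SIGTERM)):
    """Run critical_section() while the given signals are held pending.

    Hypothetical helper for illustration only. A signal that arrives
    during the critical section is not lost: the kernel keeps it pending
    and delivers it when the previous mask is restored.
    """
    # Block the signals and remember the mask that was in effect before.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, signals)
    try:
        return critical_section()
    finally:
        # Restoring the old mask lets any pending signal be delivered now,
        # after the critical section has completed.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

The idea is that whatever has to be written before the job starts would
be written inside the critical section, so a kill request arriving
early would be handled only once that state is consistent.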

Below I'm attaching the logs for both times the problem happened.

The questions I have are:

Has anyone else seen this problem before? How do you deal with it?

Is there a fix for it? Does version 6.1 address this problem? Is there a
patch that I could apply to my installation of 6.0?

Why did the jobs die with signal HUP in the first case and with signal
KILL in the second?

TIA,
Filipe





These are the qmaster logs from the first time:

> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: endpoint is not unique error (endpoint "submitnode1.mydomain.com/qrsh/63548" is already connected)
> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: endpoint is not unique error (endpoint "submitnode1.mydomain.com/qrsh/63546" is already connected)
> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: got read error (closing "submitnode1.mydomain.com/qrsh/63548")
> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: got read error (closing "submitnode1.mydomain.com/qrsh/63546")
> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: got read error (closing "submitnode1.mydomain.com/qrsh/63631")
...
> 05/20/2008 09:29:38|qmaster|sgemaster|E|commlib error: got read error (closing "submitnode1.mydomain.com/qrsh/63593")
> 05/20/2008 09:29:38|qmaster|sgemaster|E|commlib error: got read error (closing "submitnode1.mydomain.com/qrsh/63568")
> 05/20/2008 09:29:38|qmaster|sgemaster|W|job 7972205.1 failed on host d03.mydomain.com assumedly after job because: job 7972205.1 died through signal HUP (1)
> 05/20/2008 09:29:38|qmaster|sgemaster|W|job 7972216.1 failed on host s04.mydomain.com assumedly after job because: job 7972216.1 died through signal HUP (1)
> 05/20/2008 09:29:38|qmaster|sgemaster|W|job 7972221.1 failed on host b13.mydomain.com assumedly after job because: job 7972221.1 died through signal HUP (1)
> 05/20/2008 09:29:39|qmaster|sgemaster|W|job 7972233.1 failed on host d03.mydomain.com assumedly after job because: job 7972233.1 died through signal HUP (1)
> 05/20/2008 09:29:39|qmaster|sgemaster|W|job 7972314.1 failed on host l02.mydomain.com assumedly after job because: job 7972314.1 died through signal HUP (1)
...
> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972298.1 failed on host s12.mydomain.com assumedly after job because: job 7972298.1 died through signal HUP (1)
> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972208.1 failed on host d04.mydomain.com assumedly after job because: job 7972208.1 died through signal HUP (1)
> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972239.1 failed on host d04.mydomain.com assumedly after job because: job 7972239.1 died through signal HUP (1)
> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972258.1 failed on host d04.mydomain.com general before prolog because: shepherd exited with exit status 7
> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972258's failure at host d04.mydomain.com
> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972258's failure at host d04.mydomain.com
> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972212.1 failed on host s01.mydomain.com assumedly after job because: job 7972212.1 died through signal HUP (1)
> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972240.1 failed on host s01.mydomain.com assumedly after job because: job 7972240.1 died through signal HUP (1)
> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972260.1 failed on host s01.mydomain.com general before prolog because: shepherd exited with exit status 7
> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972260's failure at host s01.mydomain.com
> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972260's failure at host s01.mydomain.com
> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972283.1 failed on host j12.mydomain.com assumedly after job because: job 7972283.1 died through signal HUP (1)
> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972311.1 failed on host j12.mydomain.com assumedly after job because: job 7972311.1 died through signal HUP (1)
> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972327.1 failed on host j12.mydomain.com general before prolog because: shepherd exited with exit status 7
> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972327's failure at host j12.mydomain.com
> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972327's failure at host j12.mydomain.com
> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972300.1 failed on host s14.mydomain.com assumedly after job because: job 7972300.1 died through signal HUP (1)
> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972318.1 failed on host s14.mydomain.com assumedly after job because: job 7972318.1 died through signal HUP (1)
> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972334.1 failed on host s14.mydomain.com general before prolog because: shepherd exited with exit status 7
> 05/20/2008 09:29:44|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972334's failure at host s14.mydomain.com
> 05/20/2008 09:29:44|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 7972334's failure at host s14.mydomain.com
> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972209.1 failed on host j14.mydomain.com assumedly after job because: job 7972209.1 died through signal HUP (1)
> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972238.1 failed on host j14.mydomain.com assumedly after job because: job 7972238.1 died through signal HUP (1)
...
> 05/20/2008 09:29:47|qmaster|sgemaster|W|job 7972371.1 failed on host k01.mydomain.com assumedly after job because: job 7972371.1 died through signal HUP (1)
> 05/20/2008 09:29:47|qmaster|sgemaster|W|job 7972335.1 failed on host s02.mydomain.com assumedly after job because: job 7972335.1 died through signal HUP (1)
> 05/20/2008 09:29:48|qmaster|sgemaster|W|job 7972372.1 failed on host k01.mydomain.com assumedly after job because: job 7972372.1 died through signal HUP (1)
> 05/20/2008 09:29:48|qmaster|sgemaster|E|ack event for unknown job 7972361
> 05/20/2008 09:29:48|qmaster|sgemaster|E|ack event for unknown job 7972365
> 05/20/2008 09:29:49|qmaster|sgemaster|W|job 7972373.1 failed on host k01.mydomain.com assumedly after job because: job 7972373.1 died through signal HUP (1)
> 05/20/2008 09:29:49|qmaster|sgemaster|W|job 7972374.1 failed on host k01.mydomain.com assumedly after job because: job 7972374.1 died through signal HUP (1)
> 05/20/2008 09:29:49|qmaster|sgemaster|W|job 7972375.1 failed on host k01.mydomain.com assumedly after job because: job 7972375.1 died through signal HUP (1)

Logs of the nodes:

> 05/20/2008 09:29:42|execd|d04|E|abnormal termination of shepherd for job 7972258.1: no "exit_status" file
> 05/20/2008 09:29:42|execd|s01|E|abnormal termination of shepherd for job 7972260.1: no "exit_status" file
> 05/20/2008 09:29:42|execd|j12|E|abnormal termination of shepherd for job 7972327.1: no "exit_status" file
> 05/20/2008 09:29:42|execd|s14|E|abnormal termination of shepherd for job 7972334.1: no "exit_status" file



Second time the problem happened:

> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011604.1 failed on host d05.mydomain.com assumedly after job because: job 8011604.1 died through signal KILL (9)
> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011572.1 failed on host j12.mydomain.com assumedly after job because: job 8011572.1 died through signal KILL (9)
> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011707.1 failed on host j05.mydomain.com assumedly after job because: job 8011707.1 died through signal KILL (9)
> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011708.1 failed on host j05.mydomain.com assumedly after job because: job 8011708.1 died through signal KILL (9)
> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8012402.1 failed on host j05.mydomain.com general before prolog because: shepherd exited with exit status 7
> 05/21/2008 18:42:10|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 8012402's failure at host j05.mydomain.com
> 05/21/2008 18:42:10|qmaster|sgemaster|E|queue all.q marked QERROR as result of job 8012402's failure at host j05.mydomain.com
> 05/21/2008 18:42:11|qmaster|sgemaster|W|job 8012368.1 failed on host j14.mydomain.com assumedly after job because: job 8012368.1 died through signal KILL (9)
> 05/21/2008 18:42:11|qmaster|sgemaster|W|job 8011730.1 failed on host j14.mydomain.com assumedly after job because: job 8011730.1 died through signal KILL (9)
> 05/21/2008 18:42:11|qmaster|sgemaster|W|job 8012332.1 failed on host j14.mydomain.com assumedly after job because: job 8012332.1 died through signal KILL (9)
> 05/21/2008 18:42:11|qmaster|sgemaster|E|commlib error: got read error (closing "submitnode2.mydomain.com/qdel/30540")

Log of the node:

> 05/21/2008 18:42:07|execd|j05|E|abnormal termination of shepherd for job 8012402.1: no "exit_status" file
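
Since the only way to notice the problem is to look for it, here is a
rough Python sketch that scans qmaster log lines like the ones above and
lists which queue instances were marked QERROR. The regular expression
is derived only from the lines quoted in this message, so treat the
format as an assumption; the function name is also mine. It accepts the
lines with or without the leading "> " quote marker.

```python
import re

# Pattern inferred from the qmaster lines quoted above (an assumption,
# not a documented log format).
QERROR_RE = re.compile(
    r"^(?:>\s*)?\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\|qmaster\|[^|]+\|E\|"
    r"queue (?P<queue>\S+) marked QERROR as result of job (?P<job>\d+)'s "
    r"failure at host (?P<host>\S+)$"
)


def queues_in_error(lines):
    """Return sorted, de-duplicated (queue, host, job_id) tuples."""
    found = set()
    for line in lines:
        match = QERROR_RE.match(line.strip())
        if match:
            found.add((match["queue"], match["host"], int(match["job"])))
    return sorted(found)
```

Each QERROR message is logged twice in my excerpts, which is why the
results are collected in a set before sorting.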

