[GE users] Using qdel leaves queues in error status

Andrew Preece Apreece at nextwave.com
Fri May 23 16:46:56 BST 2008


Filipe, 
I had the same issue with one of my users.
We ended up working around it by first putting a user hold on that user's
jobs with qalter -h u <jid>, and then deleting the jobs.
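
As a rough illustration, the hold-then-delete sequence could be scripted along these lines. This is only a sketch: the helper name hold_then_delete and its dry-run default are assumptions made for the example, and the job IDs are placeholders borrowed from the log excerpt quoted further down. My reading is that the hold keeps pending jobs from being dispatched while the deletes go through, but I have not verified that in the SGE source.

```python
import subprocess

def hold_then_delete(job_ids, dry_run=True):
    """Put a user hold on every job, then delete them.

    Hypothetical helper: returns the command lines it built, and only
    executes them when dry_run is False.
    """
    job_ids = [str(j) for j in job_ids]
    # First pass: user hold on each job (qalter -h u <jid>).
    commands = [["qalter", "-h", "u", jid] for jid in job_ids]
    # Second pass: delete the jobs only after all holds are in place.
    commands += [["qdel", jid] for jid in job_ids]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands

# Dry run: print the commands instead of executing them.
for cmd in hold_then_delete(["7972258", "7972260"]):
    print(" ".join(cmd))
```

With dry_run left at True nothing is executed, so the output can be reviewed before running it for real.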

-Andrew. 


On 23/05/08 7:28 AM, "Filipe Brandenburger"
<filipe.brandenburger at idilia.com> wrote:

> Hello,
> 
> I'm fairly new to SGE, but I manage a fairly large installation. We
> are running SGE 6.0 on CentOS 4 Linux.
> 
> I'm seeing a problem (it happened twice this week) when users submit a
> very large number of jobs and then use qdel to kill all of them. Both
> times it left me with queues in the (E) error state.
> 
> The problem appears to happen when the kill signal is delivered before
> the job has started. sge_shepherd quits with a message that says that
> the "exit_status" file did not exist, then it returns code 7 (problem
> before prolog), and this leaves the queue in error state.
> 
> I don't think this should leave the queue in error state: no new jobs
> will run on that node until an administrator notices the problem and
> re-enables the queue, and we lose CPU time in the meantime. There is
> nothing really wrong with the queue; to me this looks like a race
> condition bug. Properly blocking and unblocking signals would probably
> fix it, if I remember Unix programming correctly.
> 
> Below I'm attaching the logs for both times the problem happened.
> 
> The questions I have are:
> 
> Has anyone else seen this problem before? How do you deal with it?
> 
> Is there a fix for it? Does version 6.1 address this problem? Is there a
> patch that I could apply to my installation of 6.0?
> 
> Why did the jobs die with signal HUP in the first case and with signal
> KILL in the second?
> 
> TIA,
> Filipe
> 
> 
> 
> 
> 
> These are the logs for the first time:
> 
>> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: endpoint is not unique
>> error (endpoint "submitnode1.mydomain.com/qrsh/63548" is already connected)
>> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: endpoint is not unique
>> error (endpoint "submitnode1.mydomain.com/qrsh/63546" is already connected)
>> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: got read error
>> (closing "submitnode1.mydomain.com/qrsh/63548")
>> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: got read error
>> (closing "submitnode1.mydomain.com/qrsh/63546")
>> 05/20/2008 09:29:32|qmaster|sgemaster|E|commlib error: got read error
>> (closing "submitnode1.mydomain.com/qrsh/63631")
> ...
>> 05/20/2008 09:29:38|qmaster|sgemaster|E|commlib error: got read error
>> (closing "submitnode1.mydomain.com/qrsh/63593")
>> 05/20/2008 09:29:38|qmaster|sgemaster|E|commlib error: got read error
>> (closing "submitnode1.mydomain.com/qrsh/63568")
>> 05/20/2008 09:29:38|qmaster|sgemaster|W|job 7972205.1 failed on host
>> d03.mydomain.com assumedly after job because: job 7972205.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:38|qmaster|sgemaster|W|job 7972216.1 failed on host
>> s04.mydomain.com assumedly after job because: job 7972216.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:38|qmaster|sgemaster|W|job 7972221.1 failed on host
>> b13.mydomain.com assumedly after job because: job 7972221.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:39|qmaster|sgemaster|W|job 7972233.1 failed on host
>> d03.mydomain.com assumedly after job because: job 7972233.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:39|qmaster|sgemaster|W|job 7972314.1 failed on host
>> l02.mydomain.com assumedly after job because: job 7972314.1 died through
>> signal HUP (1)
> ...
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972298.1 failed on host
>> s12.mydomain.com assumedly after job because: job 7972298.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972208.1 failed on host
>> d04.mydomain.com assumedly after job because: job 7972208.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972239.1 failed on host
>> d04.mydomain.com assumedly after job because: job 7972239.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972258.1 failed on host
>> d04.mydomain.com general before prolog because: shepherd exited with exit
>> status 7
>> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result
>> of job 7972258's failure at host d04.mydomain.com
>> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result
>> of job 7972258's failure at host d04.mydomain.com
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972212.1 failed on host
>> s01.mydomain.com assumedly after job because: job 7972212.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972240.1 failed on host
>> s01.mydomain.com assumedly after job because: job 7972240.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972260.1 failed on host
>> s01.mydomain.com general before prolog because: shepherd exited with exit
>> status 7
>> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result
>> of job 7972260's failure at host s01.mydomain.com
>> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result
>> of job 7972260's failure at host s01.mydomain.com
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972283.1 failed on host
>> j12.mydomain.com assumedly after job because: job 7972283.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972311.1 failed on host
>> j12.mydomain.com assumedly after job because: job 7972311.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972327.1 failed on host
>> j12.mydomain.com general before prolog because: shepherd exited with exit
>> status 7
>> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result
>> of job 7972327's failure at host j12.mydomain.com
>> 05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as result
>> of job 7972327's failure at host j12.mydomain.com
>> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972300.1 failed on host
>> s14.mydomain.com assumedly after job because: job 7972300.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972318.1 failed on host
>> s14.mydomain.com assumedly after job because: job 7972318.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972334.1 failed on host
>> s14.mydomain.com general before prolog because: shepherd exited with exit
>> status 7
>> 05/20/2008 09:29:44|qmaster|sgemaster|E|queue all.q marked QERROR as result
>> of job 7972334's failure at host s14.mydomain.com
>> 05/20/2008 09:29:44|qmaster|sgemaster|E|queue all.q marked QERROR as result
>> of job 7972334's failure at host s14.mydomain.com
>> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972209.1 failed on host
>> j14.mydomain.com assumedly after job because: job 7972209.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972238.1 failed on host
>> j14.mydomain.com assumedly after job because: job 7972238.1 died through
>> signal HUP (1)
> ...
>> 05/20/2008 09:29:47|qmaster|sgemaster|W|job 7972371.1 failed on host
>> k01.mydomain.com assumedly after job because: job 7972371.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:47|qmaster|sgemaster|W|job 7972335.1 failed on host
>> s02.mydomain.com assumedly after job because: job 7972335.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:48|qmaster|sgemaster|W|job 7972372.1 failed on host
>> k01.mydomain.com assumedly after job because: job 7972372.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:48|qmaster|sgemaster|E|ack event for unknown job 7972361
>> 05/20/2008 09:29:48|qmaster|sgemaster|E|ack event for unknown job 7972365
>> 05/20/2008 09:29:49|qmaster|sgemaster|W|job 7972373.1 failed on host
>> k01.mydomain.com assumedly after job because: job 7972373.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:49|qmaster|sgemaster|W|job 7972374.1 failed on host
>> k01.mydomain.com assumedly after job because: job 7972374.1 died through
>> signal HUP (1)
>> 05/20/2008 09:29:49|qmaster|sgemaster|W|job 7972375.1 failed on host
>> k01.mydomain.com assumedly after job because: job 7972375.1 died through
>> signal HUP (1)
> 
> Logs of the nodes:
> 
>> 05/20/2008 09:29:42|execd|d04|E|abnormal termination of shepherd for job
>> 7972258.1: no "exit_status" file
>> 05/20/2008 09:29:42|execd|s01|E|abnormal termination of shepherd for job
>> 7972260.1: no "exit_status" file
>> 05/20/2008 09:29:42|execd|j12|E|abnormal termination of shepherd for job
>> 7972327.1: no "exit_status" file
>> 05/20/2008 09:29:42|execd|s14|E|abnormal termination of shepherd for job
>> 7972334.1: no "exit_status" file
> 
> 
> 
> Second time the problem happened:
> 
>> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011604.1 failed on host
>> d05.mydomain.com assumedly after job because: job 8011604.1 died through
>> signal KILL (9)
>> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011572.1 failed on host
>> j12.mydomain.com assumedly after job because: job 8011572.1 died through
>> signal KILL (9)
>> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011707.1 failed on host
>> j05.mydomain.com assumedly after job because: job 8011707.1 died through
>> signal KILL (9)
>> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011708.1 failed on host
>> j05.mydomain.com assumedly after job because: job 8011708.1 died through
>> signal KILL (9)
>> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8012402.1 failed on host
>> j05.mydomain.com general before prolog because: shepherd exited with exit
>> status 7
>> 05/21/2008 18:42:10|qmaster|sgemaster|E|queue all.q marked QERROR as result
>> of job 8012402's failure at host j05.mydomain.com
>> 05/21/2008 18:42:10|qmaster|sgemaster|E|queue all.q marked QERROR as result
>> of job 8012402's failure at host j05.mydomain.com
>> 05/21/2008 18:42:11|qmaster|sgemaster|W|job 8012368.1 failed on host
>> j14.mydomain.com assumedly after job because: job 8012368.1 died through
>> signal KILL (9)
>> 05/21/2008 18:42:11|qmaster|sgemaster|W|job 8011730.1 failed on host
>> j14.mydomain.com assumedly after job because: job 8011730.1 died through
>> signal KILL (9)
>> 05/21/2008 18:42:11|qmaster|sgemaster|W|job 8012332.1 failed on host
>> j14.mydomain.com assumedly after job because: job 8012332.1 died through
>> signal KILL (9)
>> 05/21/2008 18:42:11|qmaster|sgemaster|E|commlib error: got read error
>> (closing "submitnode2.mydomain.com/qdel/30540")
> 
> Log of the node:
> 
>> 05/21/2008 18:42:07|execd|j05|E|abnormal termination of shepherd for job
>> 8012402.1: no "exit_status" file
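
To find the affected queue instances quickly, the qmaster messages could be scanned for the QERROR lines. The following is a minimal sketch, assuming each log entry sits on one line in the pipe-separated layout seen above (timestamp|daemon|host|level|message), which means the wrapped lines in this mail would need to be rejoined first. The function name queues_in_error is hypothetical. The resulting queue@host names could then be passed to qmod -c to clear the error state, though the exact option should be checked against the qmod man page for your version.

```python
import re

QERROR_RE = re.compile(
    r"queue (\S+) marked QERROR as result of job (\d+)'s failure at host (\S+)"
)

def queues_in_error(log_lines):
    """Return unique 'queue@host' names in first-seen order."""
    seen = []
    for line in log_lines:
        # Fields: timestamp|daemon|host|level|message
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5 or fields[3] != "E":
            continue
        match = QERROR_RE.search(fields[4])
        if match:
            name = f"{match.group(1)}@{match.group(3)}"
            if name not in seen:  # the same event can be logged twice
                seen.append(name)
    return seen

sample = [
    "05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as "
    "result of job 7972258's failure at host d04.mydomain.com",
    "05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as "
    "result of job 7972258's failure at host d04.mydomain.com",
    "05/20/2008 09:29:43|qmaster|sgemaster|W|job 7972212.1 failed on host "
    "s01.mydomain.com assumedly after job because: job 7972212.1 died "
    "through signal HUP (1)",
    "05/20/2008 09:29:43|qmaster|sgemaster|E|queue all.q marked QERROR as "
    "result of job 7972260's failure at host s01.mydomain.com",
]
print(queues_in_error(sample))
```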
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 

