[GE users] Jobs get queued but not execute

aja alena.plestilova at sun.com
Mon Dec 7 22:04:36 GMT 2009


Hi Allan,

You should find out why the scheduler daemon (schedd) was killed. You 
can reproduce this error message simply by killing the schedd manually.
Restarting the qmaster also starts the schedd, which is why the stuck 
jobs then started running.
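
To see when the scheduler dropped out, it can help to pull the event 
client entries out of the qmaster messages file. The following is only 
a rough sketch in Python, not an SGE tool: the line layout is inferred 
from the entries quoted below, and the function name 
find_event_client_problems and the list of search strings are my own 
assumptions. The messages file is commonly found under 
$SGE_ROOT/$SGE_CELL/spool/qmaster/messages, but check your own setup.

```python
import re
from datetime import datetime

# Assumed layout of a qmaster "messages" line, inferred from the quoted log:
#   MM/DD/YYYY HH:MM:SS|<source field(s)>|<one-letter level>|<message text>
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|(.*?)\|([A-Z])\|(.*)$"
)

# Substrings taken from the quoted log entries (an assumption, not a full list)
SCHEDULER_HINTS = ("acknowledge timeout", "deregistered", "no event client known")


def find_event_client_problems(lines):
    """Return (timestamp, level, message) for entries about lost event clients."""
    hits = []
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip wrapped lines or anything in another layout
        stamp, _source, level, message = match.groups()
        if "event client" in message and any(h in message for h in SCHEDULER_HINTS):
            when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
            hits.append((when, level, message))
    return hits


if __name__ == "__main__":
    # Two entries copied from the quoted log, joined back onto single lines
    sample = [
        '12/03/2009 14:57:41|event_|master|I|event client "scheduler" with id 1 deregistered',
        '12/03/2009 15:05:05|worker|master|I|exiting job "14562.1": job does not exist',
    ]
    for when, level, message in find_event_client_problems(sample):
        print(when.isoformat(), level, message)
```

The earliest "acknowledge timeout" or "deregistered" entry gives a 
time window in which to look for whatever stopped the schedd.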

Regards,
aja


allantran wrote:
> Hi group,
> Sometime yesterday I noticed some strange behaviour on our SGE 
> cluster. Multiple jobs got stuck in the qw state, even though there 
> were available resources.
> So I restarted sgemaster and those stuck jobs started running.
> I looked at the SGE messages file on the master and saw lots of 
> entries like these:
>
> 12/03/2009 14:57:41|event_master|E|acknowledge timeout after 600 
> seconds for event client (schedd:0) on host "master"
> 12/03/2009 14:57:41|event_|master|I|event client "scheduler" with id 1 
> deregistered
> .....
> 12/03/2009 15:35:12|event_|master|E|no event client known with id 1 to 
> modify
> ......
> 12/03/2009 15:05:05|worker|master|I|exiting job "14562.1": job does 
> not exist
> 12/03/2009 15:05:05|worker|master|I|exiting job "14560.1 task 
> 1.node2": job does not exist
> 12/03/2009 15:05:05|worker|master|I|exiting job "14562.1": job does 
> not exist
> 12/03/2009 15:05:05|worker|master|I|exiting job "14562.1 task 
> 1.node3": job does not exist
> 12/03/2009 15:05:05|worker|master|I|exiting job "14562.1 task 
> 1.node2": job does not exist
> 12/03/2009 15:05:05|worker|master|I|exiting job "14560.1": job does 
> not exist
>
> Does anyone know what these mean and where I should start looking?
> Thanks
> Allan


-- 
aja && sun
e-mail: aja at sun.com

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=232107
