[GE users] Jobs get queued but do not execute

allantran tran.v.allan at gmail.com
Fri Dec 4 22:27:27 GMT 2009



Hi group,
Sometime yesterday I noticed some strange behaviour on our SGE cluster. Multiple jobs got stuck in the qw state even though resources were available.
I restarted sgemaster, and the stuck jobs then started running.
I looked at the SGE messages file on the master and saw lots of entries like these:

12/03/2009 14:57:41|event_master|E|acknowledge timeout after 600 seconds for event client (schedd:0) on host "master"
12/03/2009 14:57:41|event_|master|I|event client "scheduler" with id 1 deregistered
.....
12/03/2009 15:35:12|event_|master|E|no event client known with id 1 to modify
......
12/03/2009 15:05:05|worker|master|I|exiting job "14562.1": job does not exist
12/03/2009 15:05:05|worker|master|I|exiting job "14560.1 task 1.node2": job does not exist
12/03/2009 15:05:05|worker|master|I|exiting job "14562.1": job does not exist
12/03/2009 15:05:05|worker|master|I|exiting job "14562.1 task 1.node3": job does not exist
12/03/2009 15:05:05|worker|master|I|exiting job "14562.1 task 1.node2": job does not exist
12/03/2009 15:05:05|worker|master|I|exiting job "14560.1": job does not exist
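
To get an overview of which entries dominate a messages file like the one quoted above, a small tally script can help. The following is only a sketch: it assumes each entry is a timestamp, one or more pipe-separated source fields, a single-letter level (E, I, ...), and a free-text message, which is inferred from the excerpt rather than from any SGE documentation. The names `parse_line` and `summarize` are made up for this illustration, and the file path is whatever you pass on the command line.

```python
import re
import sys
from collections import Counter

# Assumed layout, inferred from the excerpt:
#   MM/DD/YYYY HH:MM:SS|source fields...|LEVEL|message
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|"
    r"(?P<source>.*?)\|(?P<level>[A-Z])\|(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict with ts, source, level, msg, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def summarize(lines):
    """Count entries per (level, normalized message)."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip elisions such as "....." and blank lines
        # Collapse quoted strings and numbers so repeated messages group together.
        key = re.sub(r'"[^"]*"', '"..."', rec["msg"])
        key = re.sub(r"\d+", "N", key)
        counts[(rec["level"], key)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # The path to the messages file is supplied by the caller.
    with open(sys.argv[1], errors="replace") as fh:
        for (level, msg), n in summarize(fh).most_common(20):
            print(f"{n:6d}  {level}  {msg}")
```

Run against the master's messages file, this prints the most frequent entries first, which makes it easier to see whether the acknowledge timeouts or the "job does not exist" entries are the bulk of the noise.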

Does anyone know what these mean and where I should start looking?
Thanks
Allan


