[GE users] No scheduler registered at qmaster
sofia.bassil at fra.se
Tue Oct 21 12:34:02 BST 2008
> Hi Sofia,
> Am 20.10.2008 um 11:29 schrieb Sofia Bassil:
>> I am having some problems with my SGE environment, among other things
>> the "can't unpack gdi request" error popped up, which prompted me to
>> run 'qconf -tsm'. The output was "no scheduler registered at
>> qmaster". What exactly does that mean? The scheduler is running on
>> the qmaster, although I restarted it earlier today as part of my
>> other troubleshooting. The first time I restarted qmaster the
>> scheduler process timed out and didn't start, but then I stopped
>> qmaster process and restarted and both processes came up fine. No
>> configuration is changed. I haven't restarted the execds on the
>> nodes. I think I have seen this output before from 'qconf -tsm', but
>> then it fixed itself overnight. I prefer not to wait if there is
>> anything I can do to fix it quicker.
> Were there two schedulers running by accident? Did you check "ps -e f"?
> Other error messages about problems with the daemons you will find in
Thank you for the reply, Reuti.
I did do a ps check (ps -ef|grep sge) and there were 2 processes in the
result, one for sge_qmaster and one for sge_schedd.
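The duplicate-daemon check above can also be done without grep, so that the grep process itself is never counted. This is only a minimal sketch: the helper names (`ps_snapshot`, `count_daemons`, `duplicated`) are made up for illustration, and it assumes the usual 8-column `ps -ef` layout, which may differ on some platforms.

```python
import subprocess

# Daemon names taken from the message above.
DAEMONS = ("sge_qmaster", "sge_schedd")


def ps_snapshot():
    """Return the text of `ps -ef` on this host (never called automatically)."""
    return subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    ).stdout


def count_daemons(ps_output, names=DAEMONS):
    """Count process lines whose executable is one of `names`.

    Assumes the 8-column `ps -ef` layout
    (UID PID PPID C STIME TTY TIME CMD); lines that do not fit are skipped.
    """
    counts = {name: 0 for name in names}
    for line in ps_output.splitlines():
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        # Compare only the base name of the executable, so that lines such
        # as "grep sge" are not mistaken for a daemon.
        executable = fields[7].split()[0].rsplit("/", 1)[-1]
        if executable in counts:
            counts[executable] += 1
    return counts


def duplicated(counts):
    """Return the daemon names that appear more than once."""
    return sorted(name for name, n in counts.items() if n > 1)
```

Calling `duplicated(count_daemons(ps_snapshot()))` on the qmaster host should give an empty list when exactly one sge_qmaster and one sge_schedd are running, as was the case here.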
In /tmp there is an sge_messages log with the last entry at 9.27 a.m.
yesterday, which I don't think is related. It says it can't start up
qmaster due to communication errors (can't bind to socket), but I have
restarted qmaster several times since without anything new being written
to the file.
However, after troubleshooting this problem all day yesterday and today,
I have noticed that when I restart sge_qmaster and sge_schedd it seems to
work well for a while. 'qconf -tsm' returns the name of the qmaster node
and the number of used cores goes up. Then the number of used cores slowly
goes down and the number of available cores goes up. At this point qstat
starts returning "error: failed receiving gdi request" and after a while
'qconf -tsm' again reports that there is no scheduler registered. If I
restart the master, this process starts over from the beginning. The peak
number of used cores yesterday was just above half the total number of
cores available, but I have jobs in qw state (the original reason I
started troubleshooting) that don't get scheduled even though there are
cores available.
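To pin down how long after a restart the scheduler drops out, the 'qconf -tsm' output could be sampled and timestamped. The sketch below is an assumption-laden illustration, not a tested tool: the helper names are hypothetical, the only message text it relies on is the one quoted in this thread, and anything else (including empty output) is treated as "message absent". Note that every 'qconf -tsm' call also asks the scheduler for a monitoring run, so it should not be polled too often.

```python
import subprocess

# Message text as quoted in this thread.
NO_SCHEDULER = "no scheduler registered at qmaster"


def run_tsm():
    """Run `qconf -tsm` and return stdout plus stderr (never called automatically)."""
    result = subprocess.run(["qconf", "-tsm"], capture_output=True, text=True)
    return result.stdout + result.stderr


def scheduler_registered(tsm_output):
    """True unless the output contains the "no scheduler" message.

    This only tests for the absence of that one message; it does not
    prove that the scheduler is healthy.
    """
    return NO_SCHEDULER not in tsm_output.lower()


def first_failure(samples):
    """Return the timestamp of the first sample reporting no scheduler.

    `samples` is an iterable of (timestamp, tsm_output) pairs in time
    order; the result is None if no sample contains the message.
    """
    for timestamp, output in samples:
        if not scheduler_registered(output):
            return timestamp
    return None
```

Collecting `(time.time(), run_tsm())` pairs every few minutes after a restart and passing them to `first_failure` would give a rough time-to-failure that can be compared with the timestamps in the qmaster and scheduler logs.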
The jobs in qw state are array jobs, although no new jobs are accepted,
whether array or non-array. The non-array jobs that are already running
seem to work just fine, however. They keep running through all my restarts
of qmaster and those cores remain allocated. When I restart the qmaster,
the top array job in qw state starts allocating cores, but there always
remains one line in the output of 'qstat -u \*' in qw state for that
job, and there is no change in the status of the other waiting jobs.
When the cores are unallocated again, all array jobs are in qw state.
When I stop and start the qmaster and scheduler, I get a "commlib error:
got read error (closing "sgemaster/qconf/2")" or "sgemaster/schedd/1" or
"sgemaster/qstat/2", but I find no other errors in connection with the
restart (nothing in the schedd_runlog or /tmp or on the nodes). After a
while I get "acknowledge timeout after 1200 seconds for event client
(schedd:1) on host "sgemaster"".
'qstat -j <jobid>' shows nothing in particular at first, and later it
shows the "error: can't unpack gdi request", "error: error unpacking gdi
request: bad argument" and "failed receiving gdi request" messages.
'qalter -w v <jobid>' shows nothing but exits with status 1.
'qstat -f' shows no state E, u or au.
By this time I have eliminated a bunch of stuff that I thought might
have been involved, so the behaviour described above is basically what
the problem has boiled down to. I am about to look into various NFS
mounts and the network connections, but any pointers would be helpful.
In any case, I gather the error "No scheduler registered at qmaster"
simply means there is a problem with the scheduler, that the logs are
supposed to tell me what exactly is wrong, and that I am probably too new
at this to figure it out right away.