[GE users] No scheduler registered at qmaster

Sofia Bassil sofia.bassil at fra.se
Tue Oct 21 12:34:02 BST 2008


Reuti wrote:
> Hi Sofia,
>
> On 20.10.2008 at 11:29, Sofia Bassil wrote:
>
>> I am having some problems with my SGE environment, among other things 
>> the "can't unpack gdi request" error popped up, which prompted me to 
>> run 'qconf -tsm'. The output was "no scheduler registered at 
>> qmaster". What exactly does that mean? The scheduler is running on 
>> the qmaster, although I restarted it earlier today as part of my 
>> other troubleshooting. The first time I restarted qmaster the 
>> scheduler process timed out and didn't start, but then I stopped 
>> qmaster process and restarted and both processes came up fine. No 
>> configuration is changed. I haven't restarted the execds on the 
>> nodes. I think I have seen this output before from 'qconf -tsm', but 
>> then it fixed itself overnight. I prefer not to wait if there is 
>> anything I can do to fix it quicker.
>
> Were there two schedulers running by accident? Did you check "ps -e f"? 
> You will find other error messages about problems with the daemons in 
> /tmp.
Thank you for the reply Reuti.

I did run a ps check (ps -ef|grep sge) and there were 2 processes in the 
result, one for sge_qmaster and one for sge_schedd.
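
A plain grep for "sge" also matches any other command line that happens 
to contain that string, so counting by exact command name is a less 
ambiguous way to rule out a duplicate scheduler. A minimal sketch, 
assuming a ps that supports the POSIX options `-e -o comm=`; the helper 
name `count_daemon` is made up for illustration and is not part of SGE:

```shell
#!/bin/sh
# count_daemon NAME: read one command name per line on stdin and print
# how many of them match NAME exactly. Only the part after the last "/"
# is compared, so a full path ending in /NAME is counted as well.
count_daemon() {
    awk -v name="$1" '
        { n = split($1, parts, "/"); if (parts[n] == name) c++ }
        END { print c + 0 }
    '
}

# Possible use on the qmaster host (illustrative invocation):
#   ps -e -o comm= | count_daemon sge_schedd
# Any result other than 1 would point to a missing or duplicate scheduler.
```

The function only reads stdin, so it can be tried on saved ps output 
from the time of the problem as well as on a live process list.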

In /tmp there is an sge_messages log with the last entry at 9:27 a.m. 
yesterday, which I don't think is related. It says it can't start up 
qmaster due to communication errors (can't bind to socket), but I have 
restarted qmaster several times since without anything new being written 
to the file.

However, after troubleshooting this problem all day yesterday and 
today, I have noticed that when I restart sge_qmaster and sge_schedd it 
seems to work well for a while. 'qconf -tsm' returns the name of the 
qmaster node and the number of used cores goes up. Then the number of 
used cores slowly goes down and the number of available cores goes up. 
At this point qstat starts returning "error: failed receiving gdi 
request" and after a while 'qconf -tsm' again reports that there is no 
scheduler registered. If I restart the master, this process starts over 
from the beginning. The highest number of used cores I saw yesterday was 
just above half the total number of cores available, but I have jobs in 
qw state (the original reason I started troubleshooting) that don't get 
scheduled even though there are cores available.
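
To pin down how long after a restart the scheduler drops out, the same 
checks could be run at a fixed interval and logged with timestamps. A 
rough sketch, not a tested monitoring setup: the helper name `probe`, 
the 60-second interval and the log file name are arbitrary choices for 
illustration, and the commands in the usage comment are simply the ones 
mentioned in this mail.

```shell
#!/bin/sh
# probe LABEL CMD [ARGS...]: run CMD once and print a single line holding
# a UTC timestamp, the label, the exit status of CMD, and the first line
# of its combined stdout/stderr.
probe() {
    label=$1
    shift
    out=$("$@" 2>&1)
    status=$?
    first=$(printf '%s\n' "$out" | head -n 1)
    printf '%s %s status=%s %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$label" "$status" "$first"
}

# Possible use after restarting the daemons (illustrative, interval and
# file name are arbitrary):
#   while :; do
#       probe schedd qconf -tsm
#       probe qstat  qstat -u '*'
#       sleep 60
#   done >> sge_probe.log
```

Comparing the timestamp of the first "failed receiving gdi request" line 
with the restart time would show whether the failure follows a fixed 
delay or varies with load.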

The jobs in qw state are array jobs, although no new jobs are accepted, 
whether array or non-array. There are non-array jobs running that seem 
to work just fine, however. They keep running through all my restarts 
of qmaster and their cores remain allocated. When I restart the qmaster, 
the top array job in qw state starts allocating cores, but there always 
remains one line in the output of 'qstat -u \*' in qw state for that 
job, and there is no change in the status of the other waiting jobs. 
When the cores are unallocated again, all array jobs are in qw state.

When I stop and start the qmaster and scheduler, I get a "commlib error: 
got read error (closing "sgemaster/qconf/2")" or "sgemaster/schedd/1" or 
"sgemaster/qstat/2", but I find no other errors in connection with the 
restart (nothing in the schedd_runlog or /tmp or on the nodes). After a 
while I get "acknowledge timeout after 1200 seconds for event client 
(schedd:1) on host "sgemaster"".
'qstat -j <jobid>' shows nothing in particular at first and later it 
shows the "error: can't unpack gdi request" "error: error unpacking gdi 
request: bad argument" "failed receiving gdi request" messages plus the 
job_number...script_file information.
'qalter -w v <jobid>' shows nothing but exits with status 1.
'qstat -f' shows no state E, u or au.

By this time I have eliminated a number of things that I thought might 
have been involved, so the behaviour described above is basically what 
the problem has boiled down to. I am about to look into various NFS 
mounts and the network connections, but any pointers would be helpful. 
In any case, I gather the error "No scheduler registered at qmaster" 
simply means there is a problem with the scheduler, and that the logs 
are supposed to tell me what exactly is wrong. I am probably too new at 
this to figure it out right away.

Sincerely,
Sofia
