[GE users] No scheduler registered at qmaster
reuti at staff.uni-marburg.de
Wed Oct 22 19:10:31 BST 2008
On 21.10.2008 at 13:34, Sofia Bassil wrote:
> Reuti wrote:
>> Hi Sofia,
>> On 20.10.2008 at 11:29, Sofia Bassil wrote:
>>> I am having some problems with my SGE environment, among other
>>> things the "can't unpack gdi request" error popped up, which
>>> prompted me to run 'qconf -tsm'. The output was "no scheduler
>>> registered at qmaster". What exactly does that mean? The
>>> scheduler is running on the qmaster, although I restarted it
>>> earlier today as part of my other troubleshooting. The first time
>>> I restarted qmaster the scheduler process timed out and didn't
>>> start, but then I stopped the qmaster process and restarted it, and
>>> both processes came up fine. No configuration has changed. I haven't
>>> restarted the execds on the nodes. I think I have seen this
>>> output before from 'qconf -tsm', but then it fixed itself
>>> overnight. I would prefer not to wait if there is anything I can do
>>> to fix it sooner.
>> Were there two schedulers running by accident? Did you check "ps -e f"?
>> You will find other error messages about problems with the daemons
>> in /tmp.
> Thank you for the reply Reuti.
> I did do a ps check (ps -ef|grep sge) and there were 2 processes
> in the result, one for sge_qmaster and one for sge_schedd.
> In /tmp there is an sge_messages log with the last entry at 9:27
> a.m. yesterday, which I don't think is related. It says it can't
> start up qmaster due to communication errors (can't bind to
> socket), but I have restarted qmaster several times since without
> anything new being written to the file.
> However, after having been troubleshooting this problem all day
> yesterday and today I have noticed that when I restart sge_qmaster
> and sge_schedd it seems to work well for a while. 'qconf -tsm'
> returns the name of the qmaster node and the number of used cores
> goes up. Then slowly the number of used cores goes down and the
> number of available cores goes up. At this point qstat starts
> returning "error: failed receiving gdi request" and after a while
> 'qconf -tsm' again reports that there is no scheduler registered. If
> I restart the master this process starts over from the beginning.
> The peak number of used cores yesterday was just above half the total
> number of cores available, but I have jobs in qw state (the
> original reason I started troubleshooting) that don't get scheduled
> even though there are cores available.
Did you mention which version of SGE you are using? There is a bug
in one version that keeps the -tsm running forever.
Are you using the same version of SGE on all machines?
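
A minimal sketch of such a version check in Python. It assumes that
passwordless ssh to each host works, that qstat is on the remote PATH, and
that the first line of the `qstat -help` output carries the version string.
The function names are hypothetical and not part of SGE.

```python
import subprocess
from collections import defaultdict


def read_version(host: str) -> str:
    """Return the first line of `qstat -help` as run on `host` via ssh.

    Assumption: that first line is the SGE version banner. Returns
    "unknown" if the command fails, times out, or prints nothing.
    """
    try:
        result = subprocess.run(
            ["ssh", host, "qstat", "-help"],
            capture_output=True, text=True, timeout=30, check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        return "unknown"
    lines = result.stdout.splitlines()
    return lines[0].strip() if lines else "unknown"


def group_by_version(versions: dict[str, str]) -> dict[str, list[str]]:
    """Group host names by the version string each one reported."""
    groups = defaultdict(list)
    for host, version in versions.items():
        groups[version].append(host)
    # Sort host names so the report is stable between runs.
    return {version: sorted(hosts) for version, hosts in groups.items()}
```

Usage would be along the lines of
`group_by_version({h: read_version(h) for h in hosts})`, where `hosts` is
the list of master and execution hosts. More than one key in the result
points to a version mix.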
> The jobs in qw state are array jobs, although no new jobs are
> accepted, whether array or non-array.
Do you mean you get an error when issuing qsub?
> There are non-array jobs running that seem to work just fine
> however. They are running through all my restarts of qmaster and
> those cores remain allocated. When I restart the qmaster, the top
> array job in qw state starts allocating cores, but there always
> remains one line in the output of 'qstat -u \*' in qw state for
> that job, and there is no change in the status of the other waiting
> jobs. When the cores are unallocated again, all array jobs are in
> qw state.
As long as at least one task of the array job is not running, this
line will be there.
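
To illustrate: the tasks that have not started yet are collapsed into a
single qw line with a task range in the task-ID column. A minimal Python
sketch for expanding such a range, assuming it uses the same n[-m[:s]]
notation as `qsub -t`, possibly comma-separated. The function name is
hypothetical.

```python
def expand_task_range(spec: str) -> list[int]:
    """Expand 'n', 'n-m' or 'n-m:s' (comma-separated allowed) into task IDs."""
    tasks = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # Split off the optional ":step" suffix, then the optional "-last".
        rng, _, step = part.partition(":")
        first, _, last = rng.partition("-")
        start = int(first)
        stop = int(last) if last else start
        stride = int(step) if step else 1
        tasks.extend(range(start, stop + 1, stride))
    return tasks
```

The length of the returned list is the number of tasks still waiting. The
qw line disappears only once that list is empty.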
> When I stop and start the qmaster and scheduler, I get a "commlib
> error: got read error (closing "sgemaster/qconf/2")" or
> "sgemaster/schedd/1" or "sgemaster/qstat/2", but I find no other errors
> in connection with the restart (nothing in the schedd_runlog or /tmp
> or on the nodes). After a while I get "acknowledge timeout after
> 1200 seconds for event client (schedd:1) on host "sgemaster"".
> 'qstat -j <jobid>' shows nothing in particular at first and later
> it shows the "error: can't unpack gdi request" "error: error
> unpacking gdi request: bad argument" "failed receiving gdi request"
> messages plus the job_number...script_file information.
One of the reasons could be a version mix in the cluster.
> 'qalter -w v <jobid>' shows nothing but exits with status 1.
> 'qstat -f' shows no state E, u or au.
> By this time I have eliminated a bunch of stuff that I thought
> might have been involved, so the behaviour described above is
> basically what the problem has boiled down to. I am about to look
> into various NFS mounts and the network connections, but any
> pointers would be helpful. In any case, I gather the error "No
> scheduler registered at qmaster" simply means there is a problem
> with the scheduler and the logs are supposed to tell me what exactly is
> wrong, but I am probably too new at this to figure it out right away.
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net