[GE users] No scheduler registered at qmaster

Reuti reuti at staff.uni-marburg.de
Wed Oct 22 19:10:31 BST 2008


On 21.10.2008 at 13:34, Sofia Bassil wrote:

> Reuti wrote:
>> Hi Sofia,
>>
>> On 20.10.2008 at 11:29, Sofia Bassil wrote:
>>
>>> I am having some problems with my SGE environment, among other  
>>> things the "can't unpack gdi request" error popped up, which  
>>> prompted me to run 'qconf -tsm'. The output was "no scheduler  
>>> registered at qmaster". What exactly does that mean? The  
>>> scheduler is running on the qmaster, although I restarted it  
>>> earlier today as part of my other troubleshooting. The first time  
>>> I restarted qmaster, the scheduler process timed out and didn't
>>> start, but then I stopped the qmaster process and restarted it, and
>>> both processes came up fine. No configuration has been changed. I
>>> haven't restarted the execds on the nodes. I think I have seen this
>>> output before from 'qconf -tsm', but then it fixed itself  
>>> overnight. I prefer not to wait if there is anything I can do to  
>>> fix it quicker.
>>
>> Were there two schedulers running by accident? Did you check
>> "ps -e f"? You will find other error messages about problems with
>> the daemons in /tmp.
> Thank you for the reply Reuti.
>
> I did do a ps check  (ps -ef|grep sge) and there were 2 processes  
> in the result, one for sge_qmaster and one for sge_schedd.
>
> In /tmp there is an sge_messages log with the last entry at 9.27
> a.m. yesterday, which I don't think is related. It says it can't
> start up qmaster due to communication errors (can't bind to  
> socket), but I have restarted qmaster several times since without  
> anything new being written to the file.
>
> However, after troubleshooting this problem all day yesterday and
> today, I have noticed that when I restart sge_qmaster and sge_schedd
> it seems to work well for a while. 'qconf -tsm' returns the name of
> the qmaster node and the number of used cores goes up. Then slowly
> the number of used cores goes down and the number of available cores
> goes up. At this point qstat starts returning "error: failed
> receiving gdi request" and after a while 'qconf -tsm' again reports
> that there is no scheduler registered. If I restart the master, this
> process starts over from the beginning. The peak number of used
> cores yesterday was just above half the total number of cores
> available, but I have jobs in qw state (the original reason I
> started troubleshooting) that don't get scheduled even though there
> are cores available.

Did you mention which version of SGE you are using? There is a bug
in one version that keeps 'qconf -tsm' running forever.

Are you using the same version of SGE on all machines?
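
A minimal sketch of how such a version comparison could be done once a version string has been collected from each machine. How the strings are gathered is left open here (one assumption to verify on your installation: the first line of the help output of the SGE commands usually names the release). The host names and version strings in the example are hypothetical.

```python
from collections import defaultdict


def group_hosts_by_version(versions):
    """Map each reported version string to the sorted list of hosts reporting it.

    `versions` is a dict of host name -> version string, collected by
    whatever means is available on the cluster (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for host, version in versions.items():
        # Strip stray whitespace so "GE 6.1u4 " and "GE 6.1u4" compare equal.
        groups[version.strip()].append(host)
    return {version: sorted(hosts) for version, hosts in groups.items()}


def has_version_mix(versions):
    """True if the hosts report more than one distinct version string."""
    return len(group_hosts_by_version(versions)) > 1


if __name__ == "__main__":
    # Hypothetical hosts and version strings, for illustration only.
    reported = {
        "sgemaster": "GE 6.1u4",
        "node01": "GE 6.1u4",
        "node02": "GE 6.0u8",
    }
    for version, hosts in group_hosts_by_version(reported).items():
        print(version, "->", ", ".join(hosts))
    print("version mix:", has_version_mix(reported))
```

If more than one group comes back, the hosts in the smaller groups are the first candidates to check.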

> The jobs in qw state are array jobs, although no new jobs are
> accepted, whether array or non-array jobs.

Do you mean you get an error when issuing qsub?

> There are non-array jobs running that seem to work just fine  
> however. They are running through all my restarts of qmaster and  
> those cores remain allocated. When I restart the qmaster, the top  
> array job in qw state starts allocating cores, but there always  
> remains one line in the output of 'qstat -u \*' in qw state for  
> that job, and there is no change in the status of the other waiting  
> jobs. When the cores are unallocated again, all array jobs are in  
> qw state.

When at least one task of the array job is not running, you will have  
this line there.
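
To illustrate the rule, here is a small model of it (a sketch only, not SGE code; the function names and the task numbers are made up for the example). The tasks of an array job that are neither running nor finished are still pending, and as long as that set is not empty the job keeps a qw line.

```python
def pending_tasks(first, last, running, finished=(), step=1):
    """Return the task IDs of an array job that are neither running nor finished."""
    running_or_finished = set(running) | set(finished)
    return [t for t in range(first, last + 1, step) if t not in running_or_finished]


def has_qw_line(first, last, running, finished=(), step=1):
    """True while at least one task is still waiting to be dispatched."""
    return len(pending_tasks(first, last, running, finished, step)) > 0


if __name__ == "__main__":
    # Hypothetical array job with tasks 1-10, of which tasks 1-6 are running.
    print(pending_tasks(1, 10, running=range(1, 7)))
    print(has_qw_line(1, 10, running=range(1, 7)))
```

Only when every task has been dispatched does the pending list become empty and the qw line disappear.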

> When I stop and start the qmaster and scheduler, I get a "commlib
> error: got read error (closing "sgemaster/qconf/2")" or
> "sgemaster/schedd/1" or "sgemaster/qstat/2", but I find no other
> errors in connection with the restart (nothing in the schedd_runlog
> or /tmp or on the nodes). After a while I get "acknowledge timeout
> after 1200 seconds for event client (schedd:1) on host "sgemaster"".
> 'qstat -j <jobid>' shows nothing in particular at first and later  
> it shows the "error: can't unpack gdi request" "error: error  
> unpacking gdi request: bad argument" "failed receiving gdi request"  
> messages plus the job_number...script_file information.

One of the reasons could be a version mix in the cluster.

-- Reuti


> 'qalter -w v <jobid>' shows nothing but exits with status 1.
> 'qstat -f' shows no state E, u or au.
>
> By this time I have eliminated a bunch of stuff that I thought  
> might have been involved, so the behaviour described above is  
> basically what the problem has boiled down to. I am about to look  
> into various NFS mounts and the network connections, but any  
> pointers would be helpful. In any case, I gather the error "No
> scheduler registered at qmaster" simply means there is a problem
> with the scheduler, and the logs are supposed to tell me what exactly
> is wrong; I am probably too new at this to figure it out right away.
>
> Sincerely,
> Sofia
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>

