[GE users] No scheduler registered at qmaster

Sofia Bassil sofia.bassil at fra.se
Thu Oct 23 15:50:47 BST 2008



Reuti wrote:
> On 21.10.2008 at 13:34, Sofia Bassil wrote:
>
>> Reuti wrote:
>>> Hi Sofia,
>>>
>>> On 20.10.2008 at 11:29, Sofia Bassil wrote:
>>>
>>>> I am having some problems with my SGE environment, among other 
>>>> things the "can't unpack gdi request" error popped up, which 
>>>> prompted me to run 'qconf -tsm'. The output was "no scheduler 
>>>> registered at qmaster". What exactly does that mean? The scheduler 
>>>> is running on the qmaster, although I restarted it earlier today as 
>>>> part of my other troubleshooting. The first time I restarted
>>>> qmaster the scheduler process timed out and didn't start, but then
>>>> I stopped the qmaster process and restarted it, and both processes
>>>> came up fine. No configuration has been changed. I haven't restarted
>>>> the execds on the nodes. I think I have seen this output before from
>>>> 'qconf -tsm', but then it fixed itself overnight. I prefer not to
>>>> wait if there is anything I can do to fix it quicker.
>>>
>>> Were there two schedulers running by accident? Did you check "ps -e
>>> f"? You will find other error messages about problems with the
>>> daemons in /tmp.
>> Thank you for the reply, Reuti.
>>
>> I did do a ps check (ps -ef|grep sge) and there were 2 processes in
>> the result, one for sge_qmaster and one for sge_schedd.
>>
>> In /tmp there is an sge_messages log with the last entry at 9.27 a.m.
>> yesterday, which I don't think is related. It says it can't start up
>> qmaster due to communication errors (can't bind to socket), but I
>> have restarted qmaster several times since without anything new being
>> written to the file.
>>
>> However, after troubleshooting this problem all day yesterday and
>> today, I have noticed that when I restart sge_qmaster and sge_schedd
>> it seems to work well for a while. 'qconf -tsm' returns the name of
>> the qmaster node and the number of used cores goes up. Then slowly
>> the number of used cores goes down and the number of available cores
>> goes up. At this point qstat starts returning "error: failed
>> receiving gdi request" and after a while 'qconf -tsm' again reports
>> that there is no scheduler registered. If I restart the master this
>> process starts over from the beginning. The peak number of used cores
>> yesterday was just above half the total number of cores available,
>> but I have jobs in qw state (the original reason I started
>> troubleshooting) that don't get scheduled even though there are cores
>> available.
>
> Did you mention which version of SGE you are using? There is a bug in
> one version that keeps -tsm running forever.
>
> Are you using the same version of SGE on all machines?
It's the same version in the entire cluster. It's GE 6.1u4.
>
>> The jobs in qw state are array jobs, although no new jobs are
>> accepted, whether array or non-array.
>
> Do you mean you get an error when issuing qsub?
No error. Non-array jobs seem to be submittable and runnable as long as
no array jobs are ahead in the queue. Array jobs end up in qw state, or
one starts but not all of its tasks, even though there are cores
available. Eventually the cores are unallocated even for the tasks that
started and the entire array job ends up in qw state.
>
>> There are non-array jobs running that seem to work just fine however. 
>> They are running through all my restarts of qmaster and those cores 
>> remain allocated. When I restart the qmaster, the top array job in qw 
>> state starts allocating cores, but there always remains one line in 
>> the output of 'qstat -u \*' in qw state for that job, and there is no 
>> change in the status of the other waiting jobs. When the cores are 
>> unallocated again, all array jobs are in qw state.
>
> When at least one task of the array job is not running, you will have 
> this line there.
>
>> When I stop and start the qmaster and scheduler, I get a "commlib 
>> error: got read error (closing "sgemaster/qconf/2")" or 
>> "sgemaster/schedd/1" or "sgemaster/qstat/2", but I find no other 
>> errors in connection with the restart (nothing in the schedd_runlog 
>> or /tmp or on the nodes). After a while I get "acknowledge timeout 
>> after 1200 seconds for event client (schedd:1) on host "sgemaster"".
>> 'qstat -j <jobid>' shows nothing in particular at first and later it 
>> shows the "error: can't unpack gdi request" "error: error unpacking 
>> gdi request: bad argument" "failed receiving gdi request" messages 
>> plus the job_number...script_file information.
>
> One of the reasons could be a version mix in the cluster.
The night before last, sge_schedd died with a segfault. As far as I
can see there is nothing wrong with any hardware. I have been stracing
the process and have turned on sar, but so far I can't say I have
seen anything particularly strange. I have removed the offending array 
jobs and now things seem much more stable. I have a reboot of the 
machine scheduled for tomorrow though.

//Sofia
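
For anyone repeating the duplicate-scheduler check discussed above, here is
a minimal sketch in Python. It counts sge_qmaster and sge_schedd processes
from the output of "ps -e -o comm=" rather than "ps -ef|grep sge", so the
grep process itself cannot be miscounted. The function names count_daemons
and duplicated are made up for this illustration and are not part of Grid
Engine.

```python
import subprocess
from collections import Counter

# Daemon names taken from the thread; adjust if your install differs.
DAEMONS = ("sge_qmaster", "sge_schedd")


def count_daemons(ps_output: str) -> Counter:
    """Count how many lines of `ps -e -o comm=` output name each daemon."""
    counts = Counter({name: 0 for name in DAEMONS})
    for line in ps_output.splitlines():
        # Some ps implementations print a full path; keep only the basename.
        name = line.strip().rsplit("/", 1)[-1]
        if name in counts:
            counts[name] += 1
    return counts


def duplicated(counts: Counter) -> list:
    """Return the daemons that appear more than once."""
    return [name for name, n in counts.items() if n > 1]


if __name__ == "__main__":
    # `ps -e -o comm=` lists one command name per line with no header.
    out = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = count_daemons(out)
    for name in DAEMONS:
        print(f"{name}: {counts[name]} process(es)")
    if duplicated(counts):
        print("more than one instance of:", ", ".join(duplicated(counts)))
```

Run on the qmaster host, one process for each daemon is the expected
result. This only covers the "two schedulers by accident" question; it does
not explain the gdi request errors.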
