[GE users] No scheduler registered at qmaster

Reuti reuti at staff.uni-marburg.de
Fri Oct 24 11:10:32 BST 2008


On 23.10.2008 at 16:50, Sofia Bassil wrote:
<snip>
>>> The jobs in qw state are array jobs, although no new jobs are
>>> accepted, whether array or non-array jobs.
>>
>> Do you mean you get an error when issuing qsub?
> No error. Non-array jobs seem to be submittable and runnable as long
> as no array jobs are ahead in the queue. Array jobs end up in qw  
> state, or one starts but not all tasks of it, even though there are  
> cores available. Eventually the cores are unallocated even for the  
> tasks that started and the entire array job ends up in qw state.

How many tasks are waiting? There is a setting in the scheduler
configuration, "max_pending_tasks_per_job", for the maximum number of
tasks of an array job which will be scheduled at once. Another entry,
"max_aj_instances" in SGE's cluster configuration, limits the number of
tasks of an array job that can be running at the same time.
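
To look at both limits, something along these lines should work. It is
only a sketch: it assumes qconf and qstat are on the PATH of an admin or
submit host, and get_param is a small helper made up for this mail, not
part of SGE.

```shell
# Hypothetical helper: print the value of one parameter from
# "name value" configuration text read on stdin.
get_param() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Only query the cluster when the SGE tools are actually available.
if command -v qconf >/dev/null 2>&1; then
    # Scheduler configuration: array tasks considered per scheduling run.
    qconf -ssconf | get_param max_pending_tasks_per_job
    # Global cluster configuration: concurrently running tasks per array job.
    qconf -sconf global | get_param max_aj_instances
    # Pending jobs of all users; waiting array tasks show up as a range
    # in the task ID column.
    qstat -u '*' -s p
fi
```

The values can be changed with 'qconf -msconf' and 'qconf -mconf'
respectively, which open the configuration in an editor.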

-- Reuti


>>> There are non-array jobs running that seem to work just fine  
>>> however. They are running through all my restarts of qmaster and  
>>> those cores remain allocated. When I restart the qmaster, the top  
>>> array job in qw state starts allocating cores, but there always  
>>> remains one line in the output of 'qstat -u \*' in qw state for  
>>> that job, and there is no change in the status of the other  
>>> waiting jobs. When the cores are unallocated again, all array  
>>> jobs are in qw state.
>>
>> When at least one task of the array job is not running, you will  
>> have this line there.
>>
>>> When I stop and start the qmaster and scheduler, I get a "commlib  
>>> error: got read error (closing "sgemaster/qconf/2")" or  
>>> "sgemaster/schedd/1" or "sgemaster/qstat/2", but I find no other  
>>> errors in connection with the restart (nothing in the  
>>> schedd_runlog or /tmp or on the nodes). After a while I get  
>>> "acknowledge timeout after 1200 seconds for event client (schedd: 
>>> 1) on host "sgemaster"".
>>> 'qstat -j <jobid>' shows nothing in particular at first and later  
>>> it shows the "error: can't unpack gdi request" "error: error  
>>> unpacking gdi request: bad argument" "failed receiving gdi  
>>> request" messages plus the job_number...script_file information.
>>
>> One of the reasons could be a version mix in the cluster.
> The night before last the sge_schedd died with a segfault error.  
> There is nothing wrong with any hardware as far as can be seen. I  
> have been stracing the process and turned on sar, but so far I  
> can't say I have seen anything particularly strange. I have removed  
> the offending array jobs and now things seem much more stable. I  
> have a reboot of the machine scheduled for tomorrow though.
>
> //Sofia
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>

