[GE users] No scheduler registered at qmaster

Reuti reuti at staff.uni-marburg.de
Mon Oct 27 17:04:23 GMT 2008


Am 27.10.2008 um 16:03 schrieb Sofia Bassil:

> Reuti skrev:
>> Am 23.10.2008 um 16:50 schrieb Sofia Bassil:
>> <snip>
>>>>> The jobs in qw state are array jobs, but no new jobs are  
>>>>> accepted, whether array or non-array.
>>>>
>>>> Do you mean you get an error when issuing qsub?
>>> No error. Non-array jobs seem to be submittable and runnable as  
>>> long as no array jobs are ahead of them in the queue. Array jobs  
>>> end up in qw state, or one starts but not all of its tasks do,  
>>> even though there are cores available. Eventually the cores are  
>>> deallocated even for the tasks that started, and the entire array  
>>> job ends up in qw state.
>>
>> How many tasks are waiting? There is a setting in the scheduler  
>> configuration, "max_pending_tasks_per_job", for the maximum number  
>> of tasks of an array job that will be scheduled at once. Another  
>> entry, "max_aj_instances", in SGE's configuration controls the  
>> number of running tasks of an array job.
>>
>> -- Reuti
> I removed the pending tasks to get the waiting non-array jobs  
> running, so I don't really know. I changed max_aj_instances to 0  
> for no limit. The only error output I found from qstat -j when I  
> first started troubleshooting was that max_aj_instances was set too  
> low. That's why I removed the limit entirely, assuming the cluster  
> would show if it got overloaded, but it seems to handle it so far,  
> at least load-wise on the machines. max_pending_tasks_per_job was  
> set to 50, but I just increased it to 75. How can I see the effect  
> of max_pending_tasks_per_job in qstat output?
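As an aside, the two limits discussed above live in different places: max_pending_tasks_per_job is part of the scheduler configuration, while max_aj_instances sits in the global cluster configuration. A sketch of how to inspect and change them with the standard SGE qconf commands (the grep patterns are only for filtering and require a live SGE installation):

```shell
# Scheduler configuration: holds max_pending_tasks_per_job
# and schedule_interval (how often a new bunch is dispatched).
qconf -ssconf | grep -E 'max_pending_tasks_per_job|schedule_interval'

# Global cluster configuration: holds max_aj_instances
# (0 means no limit on concurrently running array tasks).
qconf -sconf | grep max_aj_instances

# Edit them interactively in $EDITOR:
qconf -msconf   # scheduler configuration
qconf -mconf    # global cluster configuration
```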

Usually you can't see it directly, only in the fact that an array  
job gets at most that many tasks scheduled at once. You can try  
"qstat -g d" to give each task a line of its own, or use the  
"status" script below with the option "status -acl" to see the  
bunches of jobs being scheduled. That means the number of waiting  
tasks should decrease by 75 every schedule interval:

http://gridengine.sunsource.net/files/documents/7/8/status-1.2.tgz
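To watch the pending count shrink interval by interval, it's enough to tally the "qw" lines per job-ID in "qstat -g d" output. A minimal sketch; the sample below is a stand-in for real output, which on a live cluster would come from piping "qstat -g d -u '*'" into the awk instead:

```shell
# Stand-in for "qstat -g d -u '*'" output (format assumed for illustration).
qstat_sample() {
cat <<'EOF'
job-ID  prior   name       user   state submit/start at     queue  slots ja-task-ID
-------------------------------------------------------------------------------
    101 0.55500 myarray    sofia  r     10/27/2008 16:00:00 all.q      1 1
    101 0.55500 myarray    sofia  r     10/27/2008 16:00:00 all.q      1 2
    101 0.00000 myarray    sofia  qw    10/27/2008 15:59:00            1 3
    101 0.00000 myarray    sofia  qw    10/27/2008 15:59:00            1 4
    102 0.00000 other      sofia  qw    10/27/2008 16:01:00            1
EOF
}

# Column 5 is the state; count the qw lines per job-ID.
qstat_sample | awk 'NR > 2 && $5 == "qw" { n[$1]++ }
                    END { for (j in n) print j, n[j] }' | sort
# → 101 2
#   102 1
```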

-- Reuti


> //Sofia
>>
>>
>>>>> There are non-array jobs running that seem to work just fine  
>>>>> however. They are running through all my restarts of qmaster  
>>>>> and those cores remain allocated. When I restart the qmaster,  
>>>>> the top array job in qw state starts allocating cores, but  
>>>>> there always remains one line in the output of 'qstat -u \*' in  
>>>>> qw state for that job, and there is no change in the status of  
>>>>> the other waiting jobs. When the cores are unallocated again,  
>>>>> all array jobs are in qw state.
>>>>
>>>> When at least one task of the array job is not running, you will  
>>>> have this line there.
>>>>
>>>>> When I stop and start the qmaster and scheduler, I get a  
>>>>> "commlib error: got read error (closing "sgemaster/qconf/2")"  
>>>>> or "sgemaster/schedd/1" or "sgemaster/qstat/2", but I find no  
>>>>> other errors in connection with the restart (nothing in the  
>>>>> schedd_runlog or /tmp or on the nodes). After a while I get  
>>>>> "acknowledge timeout after 1200 seconds for event client  
>>>>> (schedd:1) on host "sgemaster"".
>>>>> 'qstat -j <jobid>' shows nothing in particular at first and  
>>>>> later it shows the "error: can't unpack gdi request" "error:  
>>>>> error unpacking gdi request: bad argument" "failed receiving  
>>>>> gdi request" messages plus the job_number...script_file  
>>>>> information.
>>>>
>>>> One of the reasons could be a version mix in the cluster.
>>> The night before last the sge_schedd died with a segfault error.  
>>> There is nothing wrong with any hardware as far as can be seen. I  
>>> have been stracing the process and turned on sar, but so far I  
>>> can't say I have seen anything particularly strange. I have  
>>> removed the offending array jobs and now things seem much more  
>>> stable. I have a reboot of the machine scheduled for tomorrow  
>>> though.
>>>
>>> //Sofia
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>
>>
>


