[GE users] No scheduler registered at qmaster
sofia.bassil at fra.se
Fri Oct 31 13:11:04 GMT 2008
> On 27.10.2008 at 16:03, Sofia Bassil wrote:
>> Reuti wrote:
>>> On 23.10.2008 at 16:50, Sofia Bassil wrote:
>>>>>> The jobs in qw state are array jobs, although no new jobs are
>>>>>> accepted, whether array or non-array.
>>>>> Do you mean you get an error when issuing qsub?
>>>> No error. Non-array jobs seem to be submittable and runnable as long
>>>> as no array jobs are ahead of them in the queue. Array jobs end up in
>>>> qw state, or one starts but not all of its tasks do, even though
>>>> there are cores available. Eventually the cores are deallocated even
>>>> for the tasks that started, and the entire array job ends up in qw
>>>> state.
>>> How many tasks are waiting? There is a setting in the scheduler
>>> configuration, "max_pending_tasks_per_job", for the maximum number of
>>> tasks of an array job that will be scheduled at once. Another entry,
>>> "max_aj_instances", in SGE's configuration controls the number of
>>> running tasks of an array job.
>>> -- Reuti
>> I removed the pending tasks to get the waiting non-array jobs
>> running, so I don't really know. I changed max_aj_instances to 0 for
>> no limit. The only error output I found from qstat -j when I first
>> started troubleshooting was that max_aj_instances was set too low.
>> That's why I removed the limit entirely, assuming the cluster would
>> show if it got overloaded, but it seems to handle it so far, at least
>> load-wise on the machines. max_pending_tasks_per_job was set to 50,
>> but I just increased it to 75. How can I see the effect of
>> max_pending_tasks_per_job in qstat output?
> Usually you can't see it directly, only that array jobs get a maximum
> of 50 tasks scheduled at once. You can try "qstat -g d" to give each
> task a line of its own, or use the "status" script below with the
> option "status -acl" to see the bunches of jobs scheduled. That means
> the number of waiting tasks should decrease by 75 every schedule interval:
> -- Reuti
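For reference, an illustrative sketch of where the two limits Reuti mentions live (command names are the standard SGE 6.x ones; verify against your installation, as output formats may differ):

```sh
# max_pending_tasks_per_job is part of the scheduler configuration:
qconf -ssconf | grep max_pending_tasks_per_job
# max_aj_instances (and max_aj_tasks) are in the global cluster configuration:
qconf -sconf | grep max_aj
# Edit them interactively with:
qconf -msconf   # scheduler configuration
qconf -mconf    # global cluster configuration
```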
We have done some further testing on this issue, more controlled this
time. What the user does is submit an array job with 75000 tasks. I
have read somewhere that this is the default setting for max_aj_tasks;
if so, you would think that the scheduler should be able to handle it.
However, sge_schedd dies within minutes after the job has been
submitted. If restarted, it works for a while, even though it's not
responding to requests such as qstat and qconf, but eventually it dies
again. Is 75000 the default setting in GE 6.1u4? Has anyone successfully
run an array job with 75000 tasks in GE 6.1u4?
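One way to check what this installation actually uses, and to narrow down the failure (a sketch; the grep assumes the attribute name max_aj_tasks appears in the global configuration as in stock SGE, and job.sh is a placeholder job script):

```sh
# Show the per-job array task limit; stock SGE reportedly ships with 75000:
qconf -sconf | grep max_aj_tasks
# A smaller reproduction, e.g. 10000 tasks, could show whether sge_schedd
# dies as a function of task count:
qsub -t 1-10000 job.sh
```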
>>>>>> There are non-array jobs running that seem to work just fine
>>>>>> however. They are running through all my restarts of qmaster and
>>>>>> those cores remain allocated. When I restart the qmaster, the top
>>>>>> array job in qw state starts allocating cores, but there always
>>>>>> remains one line in the output of 'qstat -u \*' in qw state for
>>>>>> that job, and there is no change in the status of the other
>>>>>> waiting jobs. When the cores are unallocated again, all array
>>>>>> jobs are in qw state.
>>>>> When at least one task of the array job is not running, you will
>>>>> have this line there.
>>>>>> When I stop and start the qmaster and scheduler, I get a "commlib
>>>>>> error: got read error (closing "sgemaster/qconf/2")" or
>>>>>> "sgemaster/schedd/1" or "sgemaster/qstat/2", but I find no other
>>>>>> errors in connection with the restart (nothing in the
>>>>>> schedd_runlog or /tmp or on the nodes). After a while I get
>>>>>> "acknowledge timeout after 1200 seconds for event client
>>>>>> (schedd:1) on host "sgemaster"".
>>>>>> 'qstat -j <jobid>' shows nothing in particular at first and later
>>>>>> it shows the "error: can't unpack gdi request" "error: error
>>>>>> unpacking gdi request: bad argument" "failed receiving gdi
>>>>>> request" messages plus the job_number...script_file information.
>>>>> One of the reasons could be a version mix in the cluster.
>>>> The night before last, sge_schedd died with a segfault. There is
>>>> nothing wrong with any hardware as far as can be seen. I have been
>>>> stracing the process and turned on sar, but so far I can't say I
>>>> have seen anything particularly strange. I have removed the
>>>> offending array jobs and now things seem much more stable. I have a
>>>> reboot of the machine scheduled for tomorrow, though.