[GE users] Strange behavior with sge_qmaster

Andy Schwierskott andy.schwierskott at sun.com
Fri Jul 9 15:18:20 BST 2004


Sean,

I separated the problems into two issues (#1141 and #1142).

Andy

> On Tue, 2004-07-06 at 09:45, Andy Schwierskott wrote:
>
>>
>>> Something else I've had happen several times this weekend is that SGE
>>> will stop scheduling jobs.  There will be several jobs submitted to SGE,
>>> for which there are resources, yet SGE will not launch the jobs.  If I
>>> shut down sge_qmaster, then start it up again, those jobs are launched
>>> immediately.  I have a feeling that the scheduling loop may be
>>> stopping.  I have schedd_job_info set to false.  However when this
>>> occurs, I change it to true, yet no matter how long I wait, scheduling
>>> info for the jobs never shows up.  Originally I had flush_submit_sec and
>>> flush_finish_sec set to '1'.  However when this started I changed them
>>> back to '0', but the problem didn't go away.
>>
>> --> Ditto. Please provide more information, e.g. what does
>>
>>     qconf -sss
>>
>> show? If qmaster doesn't get orders from the scheduler, you will get a
>> "no scheduling host defined" answer.
>>
>>     Is the scheduler busy (at least from time to time)?
>
> Just noticed the problem happening again.  'qconf -sss' gave 'no
> scheduling host defined'.
>
> In the messages file for qmaster, I found this:
> 07/08/2004 01:56:57|qmaster|head4|E|acknowledge timeout after 600
> seconds for event client (schedd:1) on host "head4"
> 07/08/2004 01:56:57|qmaster|head4|I|event client "scheduler" with id 1
> deregistered
>
> In the schedd messages file, I saw this:
> 07/08/2004 01:48:53|schedd|head4|W|qmaster alive timeout expired
> 07/08/2004 01:50:30|schedd|head4|E|unable to send message to qmaster
> using port 535 on host "head4": got send error
> 07/08/2004 01:50:31|schedd|head4|W|qmaster alive timeout expired
>
> Another interesting thing I noticed.. the messages file for schedd seems
> to be full of messages like this:
> 07/08/2004 01:45:16|schedd|head4|E|can't find parallel task 21384.1 task
> 1.node10 for update in function pe_task_update_master_list_usage
> 07/08/2004 01:45:16|schedd|head4|E|callback function for event "565298.
> EVENT JOB 21384.1 task 1.node10 USAGE" failed
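
For anyone who wants to sift through these messages files, here is a minimal Python sketch of reading the pipe-delimited lines quoted above. It assumes each record has the form date time|daemon|host|level|text, as in the excerpts, and that a line not starting with a timestamp is a wrapped continuation of the previous record (which is how the mail client broke them here). The names LogEntry and parse_messages are made up for illustration and are not part of Grid Engine, and reading the level letters E, W and I as error, warning and info is likewise an assumption.

```python
import re
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    # Field names are illustrative; they are not defined by Grid Engine.
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


# A record is assumed to start with "MM/DD/YYYY HH:MM:SS|".
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\|")


def parse_messages(text: str) -> List[LogEntry]:
    """Parse pipe-delimited log text, re-joining wrapped continuation lines."""
    entries: List[LogEntry] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            fields = line.split("|", 4)  # keep any "|" inside the message text
            if len(fields) == 5:
                entries.append(LogEntry(*fields))
        elif entries:
            # No timestamp: treat as the wrapped tail of the previous record.
            last = entries[-1]
            entries[-1] = last._replace(message=last.message + " " + line)
    return entries


sample = """\
07/08/2004 01:56:57|qmaster|head4|E|acknowledge timeout after 600
seconds for event client (schedd:1) on host "head4"
07/08/2004 01:56:57|qmaster|head4|I|event client "scheduler" with id 1
deregistered
07/08/2004 01:48:53|schedd|head4|W|qmaster alive timeout expired
"""

if __name__ == "__main__":
    # Show only the records whose level field is "E".
    for entry in parse_messages(sample):
        if entry.level == "E":
            print(entry.timestamp, entry.daemon, entry.message)
```

Filtering on the level field this way makes it easier to line up the qmaster and schedd timestamps around the moment the scheduler was deregistered.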
