[GE users] Strange behavior with sge_qmaster

christian reissmann christian.reissmann at sun.com
Wed Jul 14 09:20:17 BST 2004


Hi Sean,

issue #1141 may result from issue #1126, "qmaster clients may not
reconnect after qmaster outage". The scheduler is one such client and
could be affected by this bug. Did you shut down the qmaster with
qconf -km (or with SIGKILL)?

If the qmaster connection is broken for any reason (timeouts, high NFS
traffic, ...), the scheduler may also get disconnected from the qmaster
and will not try to reconnect.
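
A quick way to see whether the scheduler has dropped out is to look at
what "qconf -sss" reports. The following is only a minimal sketch: it
assumes qconf is on the PATH with the SGE environment set, and it relies
on the "no scheduling host defined" text quoted further down in this
thread. The function names are made up for illustration.

```python
import subprocess

# Text reported by "qconf -sss" in this thread when no scheduler is registered.
NO_SCHEDULER_MARKER = "no scheduling host defined"


def scheduler_registered(qconf_output: str) -> bool:
    """Return True if the given "qconf -sss" output names a scheduling host."""
    text = qconf_output.strip().lower()
    # Empty output is treated as "not registered" (an assumption of this sketch).
    return bool(text) and NO_SCHEDULER_MARKER not in text


def check_scheduler() -> bool:
    """Run "qconf -sss" and classify its output (assumes qconf is on PATH)."""
    result = subprocess.run(["qconf", "-sss"], capture_output=True, text=True)
    # The message may arrive on stdout or stderr, so both are inspected.
    return scheduler_registered(result.stdout + result.stderr)
```

If check_scheduler() returns False while jobs are pending, the scheduler
has probably lost its registration with the qmaster.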

Issue #1126 is fixed in CVS maintrunk, V60_FCS_fixes_BRANCH and
V60_BRANCH.

I guess that issue #1141 results from #1126. Could you please update
your sources and check whether the problem still occurs?
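
To compare behaviour before and after the update, it may help to scan
the qmaster messages file for the point where the scheduler was
deregistered. Below is a rough sketch, assuming each entry sits on one
line in the "date time|daemon|host|level|message" layout seen in the
quoted logs, and that the deregistration message reads exactly as quoted
there. The function name and the regular expression are illustrative
only.

```python
import re
from typing import Iterable, List, Tuple

# Matches the message quoted in this thread:
#   event client "scheduler" with id 1 deregistered
DEREGISTERED = re.compile(r'event client "([^"]+)" with id (\d+)\s+deregistered')


def find_deregistrations(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (timestamp, client name, client id) for each deregistration."""
    hits = []
    for line in lines:
        # Split into at most five fields so "|" inside the message survives.
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) != 5:
            continue  # not in the assumed layout, skip it
        timestamp, daemon, _host, _level, message = parts
        match = DEREGISTERED.search(message)
        if daemon == "qmaster" and match:
            hits.append((timestamp, match.group(1), match.group(2)))
    return hits
```

Called with an open file handle on the qmaster messages file, this
returns one tuple per deregistration, so an empty list after the update
would suggest the scheduler stayed connected.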

Best Regards,

Christian



Andy Schwierskott wrote:
> Sean,
> 
> I separated the problems into two issues (#1141 and #1142)
> 
> Andy
> 
>> On Tue, 2004-07-06 at 09:45, Andy Schwierskott wrote:
>>
>>>
>>>> Something else I've had happen several times this weekend is that SGE
>>>> will stop scheduling jobs.  There will be several jobs submitted to
>>>> SGE, for which there are resources, yet SGE will not launch the jobs.
>>>> If I shut down sge_qmaster, then start it up again, those jobs are
>>>> launched immediately.  I have a feeling that the scheduling loop may
>>>> be stopping.  I have schedd_job_info set to false.  However when this
>>>> occurs, I change it to true, yet no matter how long I wait, scheduling
>>>> info for the jobs never shows up.  Originally I had flush_submit_sec
>>>> and flush_finish_sec set to '1'.  However when this started I changed
>>>> them back to '0', but the problem didn't go away.
>>>
>>>
>>> --> Ditto. Please provide more information, e.g. what does
>>>
>>>     qconf -sss
>>>
>>> show? If the qmaster doesn't get orders from the scheduler you will
>>> get a "no scheduling host defined" answer.
>>>
>>>     Is the scheduler busy (at least from time to time?)
>>
>>
>> Just noticed the problem happening again.  'qconf -sss' gave 'no
>> scheduling host defined'.
>>
>> In the messages file for qmaster, I found this:
>> 07/08/2004 01:56:57|qmaster|head4|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "head4"
>> 07/08/2004 01:56:57|qmaster|head4|I|event client "scheduler" with id 1 deregistered
>>
>> In the schedd messages file, I saw this:
>> 07/08/2004 01:48:53|schedd|head4|W|qmaster alive timeout expired
>> 07/08/2004 01:50:30|schedd|head4|E|unable to send message to qmaster using port 535 on host "head4": got send error
>> 07/08/2004 01:50:31|schedd|head4|W|qmaster alive timeout expired
>>
>> Another interesting thing I noticed.. the messages file for schedd seems
>> to be full of messages like this:
>> 07/08/2004 01:45:16|schedd|head4|E|can't find parallel task 21384.1 task 1.node10 for update in function pe_task_update_master_list_usage
>> 07/08/2004 01:45:16|schedd|head4|E|callback function for event "565298. EVENT JOB 21384.1 task 1.node10 USAGE" failed
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 


-- 
Christian Reissmann    Tel: +49 (0)941 3075 112  mailto:crei at sun.com
Software Engineer      Fax: +49 (0)941 3075 222 
http://www.sun.com/gridengine
Sun Microsystems GmbH, Dr.-Leo-Ritter-Str. 7,
D-93049 Regensburg,    Tel: +49 (0)941 3075 0

