[GE users] Strange behavior with sge_qmaster

christian reissmann christian.reissmann at sun.com
Thu Jul 15 09:06:18 BST 2004


Hi Sean,

It seems to be a problem when an event client (schedd is an event client)
tries to re-register at qmaster. So the fix for issue #1126 may not solve
the problem.

Still working on debugging ....
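
To check whether a cluster shows the same symptom, the qmaster and schedd
messages files can be scanned for the timeout and deregistration entries
quoted further down in this thread. Below is a minimal sketch in Python. It
assumes the messages format is "date time|daemon|host|level|text", which is
inferred only from those excerpts. The names parse_messages, find_symptoms
and SYMPTOMS are made up for illustration and are not part of Grid Engine.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple


class LogEntry(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


# Assumed timestamp layout, taken from the excerpts in this thread.
_TIMESTAMP = re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}$")

# Substrings copied from the log lines quoted in this thread.
SYMPTOMS = (
    "acknowledge timeout",
    "deregistered",
    "qmaster alive timeout expired",
    "unable to send message to qmaster",
)


def parse_messages(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield one LogEntry per log record.

    A line that does not start a new record is treated as a continuation
    and appended to the message of the previous record.
    """
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        parts = line.split("|", 4)
        if len(parts) == 5 and _TIMESTAMP.match(parts[0]):
            if current is not None:
                yield current
            current = LogEntry(*parts)
        elif current is not None and line.strip():
            current = current._replace(
                message=current.message + " " + line.strip()
            )
    if current is not None:
        yield current


def find_symptoms(lines: Iterable[str]) -> List[LogEntry]:
    """Return the records whose message contains one of SYMPTOMS."""
    return [
        entry
        for entry in parse_messages(lines)
        if any(s in entry.message for s in SYMPTOMS)
    ]
```

Usage would be along the lines of find_symptoms(open(path)), where path is
the messages file of qmaster or schedd at your site. An "acknowledge
timeout" entry followed by a "deregistered" entry with no later sign of
the scheduler coming back would match what Sean reported.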

Best regards,

Christian

christian reissmann wrote:
> Hi Sean,
> 
> issue #1141 may result from issue #1126 "qmaster clients may not
> reconnect after qmaster outage".
> The scheduler is such a client which could be affected by this bug. Did
> you shut down the qmaster with qconf -km (or with SIGKILL)?
> 
> If (for any reason like timeouts, high NFS traffic, ...) the qmaster
> connection is broken, the scheduler may also get disconnected from the
> qmaster and would not try to reconnect.
> 
> Issue #1126 is fixed in CVS maintrunk, V60_FCS_fixes_BRANCH and
> V60_BRANCH.
> 
> I guess issue #1141 results from #1126. Can you please update your
> sources and check this?
> 
> Best Regards,
> 
> Christian
> 
> 
> 
> Andy Schwierskott wrote:
> 
>> Sean,
>>
>> I separated the problems into two issues (#1141 and #1142)
>>
>> Andy
>>
>>> On Tue, 2004-07-06 at 09:45, Andy Schwierskott wrote:
>>>
>>>>
>>>>> Something else I've had happen several times this weekend is that SGE
>>>>> will stop scheduling jobs.  There will be several jobs submitted to SGE,
>>>>> for which there are resources, yet SGE will not launch the jobs.  If I
>>>>> shut down sge_qmaster, then start it up again, those jobs are launched
>>>>> immediately.  I have a feeling that the scheduling loop may be
>>>>> stopping.  I have schedd_job_info set to false.  However when this
>>>>> occurs, I change it to true, yet no matter how long I wait, scheduling
>>>>> info for the jobs never shows up.  Originally I had flush_submit_sec and
>>>>> flush_finish_sec set to '1'.  However when this started I changed them
>>>>> back to '0', but the problem didn't go away.
>>>>
>>>>
>>>>
>>>> --> Ditto. Please provide more information, e.g. what does
>>>>
>>>>     qconf -sss
>>>>
>>>> show? If qmaster doesn't get orders from the scheduler you will get a
>>>> "no scheduling host defined" answer.
>>>>
>>>>     Is the scheduler busy (at least from time to time?)
>>>
>>>
>>>
>>> Just noticed the problem happening again.  'qconf -sss' gave 'no
>>> scheduling host defined'.
>>>
>>> In the messages file for qmaster, I found this:
>>> 07/08/2004 01:56:57|qmaster|head4|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "head4"
>>> 07/08/2004 01:56:57|qmaster|head4|I|event client "scheduler" with id 1 deregistered
>>>
>>> In the schedd messages file, I saw this:
>>> 07/08/2004 01:48:53|schedd|head4|W|qmaster alive timeout expired
>>> 07/08/2004 01:50:30|schedd|head4|E|unable to send message to qmaster using port 535 on host "head4": got send error
>>> 07/08/2004 01:50:31|schedd|head4|W|qmaster alive timeout expired
>>>
>>> Another interesting thing I noticed.. the messages file for schedd seems
>>> to be full of messages like this:
>>> 07/08/2004 01:45:16|schedd|head4|E|can't find parallel task 21384.1 task 1.node10 for update in function pe_task_update_master_list_usage
>>> 07/08/2004 01:45:16|schedd|head4|E|callback function for event "565298. EVENT JOB 21384.1 task 1.node10 USAGE" failed
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
> 
> 


-- 
Christian Reissmann    Tel: +49 (0)941 3075 112  mailto:crei at sun.com
Software Engineer      Fax: +49 (0)941 3075 222 
http://www.sun.com/gridengine
Sun Microsystems GmbH, Dr.-Leo-Ritter-Str. 7,
D-93049 Regensburg,    Tel: +49 (0)941 3075 0

