[GE users] Qmaster Problem

Justin Ottley ottley at coredp.com
Wed Dec 31 16:47:15 GMT 2008


I thought I'd comment on the qping output from the start of this thread.
In my investigations with 6.2 and 6.2u1 on Linux so far, the output of
qping always(?) shows something like what you describe:

[snip]

info:                     MAIN: E (510335.92) | signaler000: E
(510333.48) | event_master000: E (0.01) | timer000: E (3.00) |
worker000: E (57078.01) | worker001: E (56779.02) | listener000: E
(0.25) | listener001: E (0.10) | scheduler000: E (56751.01) | ERROR

[/snip]
 
I have an issue filed:
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2767

In my original thread (entitled "6.2 qping and deadlocks"), we were
speculating whether revision 1.4471
(http://gridengine.sunsource.net/source/browse/gridengine/Changelog?view=log)
fixed the issue, but I'm not sure whether this revision was included in
6.2u1. I presume not: I was unable to correlate revision 1.4471 with the
6.2u1 changelog at
http://gridengine.sunsource.net/project/gridengine/62patches.txt. In any
case, I've found that the output of qping is always(?) what you
describe in 6.2 and 6.2u1. I have yet to try 6.2u2beta.
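
For anyone who wants to check this output from a script rather than by eye, here is a minimal Python sketch that splits the "info:" line into per-thread entries and lists the ones with a large value in parentheses. It rests on assumptions: that the number in parentheses is the time in seconds since the thread last reported in, and that 600 seconds is a reasonable cut-off (an arbitrary value picked for illustration). The names parse_info and stalled_threads are made up for this example and are not part of any SGE tool.

```python
import re

# Sample taken from the qping output quoted in this thread.
SAMPLE = (
    "MAIN: E (510335.92) | signaler000: E (510333.48) | "
    "event_master000: E (0.01) | timer000: E (3.00) | "
    "worker000: E (57078.01) | worker001: E (56779.02) | "
    "listener000: E (0.25) | listener001: E (0.10) | "
    "scheduler000: E (56751.01) | ERROR"
)

# One entry looks like "worker000: E (57078.01)". Using \s+ lets the
# pattern also match when the mail client wrapped the line between
# the state letter and the parenthesised number.
ENTRY = re.compile(r"(\w+):\s+(\w)\s+\(([\d.]+)\)")


def parse_info(line):
    """Return a list of (thread_name, state, seconds) tuples."""
    return [(name, state, float(secs))
            for name, state, secs in ENTRY.findall(line)]


def stalled_threads(line, threshold=600.0):
    """Return the entries whose value exceeds threshold.

    Assumption: the value is seconds since the thread last reported.
    The default threshold is arbitrary, not an SGE setting.
    """
    return [entry for entry in parse_info(line) if entry[2] > threshold]


if __name__ == "__main__":
    for name, state, secs in stalled_threads(SAMPLE):
        print(f"{name}: state {state}, {secs:.0f} s")
```

On the sample above this lists MAIN, signaler000, both workers and the scheduler, while the listeners, timer and event master stay below the cut-off. Whether the large MAIN and signaler values mean anything is exactly the open question in the issue I filed, so treat the result as a hint only.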

-justin

craffi wrote:
> { reopening this thread ... }
>
> I'm working on a 6.2 based install and just ran into this same error:
>
> error: failed receiving gdi request response for mid=1 (got syncron  
> message receive timeout error).
>
> ... was there any discovery of what lies behind this error or what  
> causes it?
>
> Regards,
> Chris
>
>
>
>
> On Nov 7, 2008, at 5:52 AM, Bradford, Matthew wrote:
>
>   
>> Andy,
>>
>> Thanks for that. We have now happily restarted sge_master and all is
>> now well.
>>
>> Do you have any idea what could have caused the problem in the first
>> place?
>>
>> Cheers,
>>
>> Mat
>>
>>     
>>> -----Original Message-----
>>> From: andy [mailto:andy.schwierskott at sun.com]
>>> Sent: 07 November 2008 09:47
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] Qmaster Problem
>>>
>>> Hi,
>>>
>>> stopping and restarting qmaster has no impact on running and
>>> pending jobs.
>>> There's only a short interruption in qmaster's responsiveness.
>>>
>>> There is a bug in SGE 6.2 FCS regarding queues (and their jobs)
>>> which are suspended due to suspend_on_subordinate: This state
>>> is lost at qmaster restart, however the jobs remain
>>> suspended. This needs to be fixed manually by sending a
>>> SIGCONT to the processes of the job.
>>>
>>> Andy
>>>
>>>
>>>
>>> On Fri, 7 Nov 2008, Bradford, Matthew wrote:
>>>
>>>       
>>>> We have been running SGE 6.2 happily for the last few weeks, and all
>>>> of a sudden we are seeing a problem.
>>>>
>>>> From a client, attempting any SGE command returns this message:
>>>> error: failed receiving gdi request response for mid=1 (got syncron
>>>> message receive timeout error).
>>>>
>>>> And running the qping command returns this information:
>>>>
>>>> qping -info sge_master_host 801 qmaster 1
>>>> 11/07/2008 09:07:55:
>>>> SIRM version:             0.1
>>>> SIRM message id:          1
>>>> start time:               11/01/2008 11:22:19 (1225538539)
>>>> run time [s]:             510336
>>>> messages in read buffer:  0
>>>> messages in write buffer: 0
>>>> nr. of connected clients: 411
>>>> status:                   2
>>>> info:                     MAIN: E (510335.92) | signaler000: E
>>>> (510333.48) | event_master000: E (0.01) | timer000: E (3.00) |
>>>> worker000: E (57078.01) | worker001: E (56779.02) | listener000: E
>>>> (0.25) | listener001: E (0.10) | scheduler000: E (56751.01) | ERROR
>>>> malloc:                   arena(451719168) |ordblks(14) | smblks(52) |
>>>> hblksr(2) | hblhkd(2105344) usmblks(0) | fsmblks(1904) |
>>>> uordblks(451578592) | fordblks(140576) | keepcost(126736)
>>>> Monitor:                  disabled
>>>>
>>>>
>>>> The sge_master process is still running on the master host, and
>>>> contains about 12 child sge_master processes.
>>>>
>>>> Would stopping and starting the sge_master service kill any running
>>>> jobs, or should they happily communicate with the new master
>>>> process?
>>>>
>>>> Any help would be much appreciated.
>>>>
>>>> Cheers,
>>>>
>>>> Mat Bradford
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88269
>>>>
>>>> To unsubscribe from this discussion, e-mail:
>>>> [users-unsubscribe at gridengine.sunsource.net].
>>>       
>>>
>>>       
>>     
>
>
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=95016

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


