[GE users] Qmaster Problem

craffi dag at sonsorol.org
Wed Dec 24 22:42:32 GMT 2008


{ reopening this thread ... }

I'm working on a 6.2-based install and just ran into this same error:

error: failed receiving gdi request response for mid=1 (got syncron  
message receive timeout error).

... was there any discovery of what lies behind this error or what  
causes it?

Regards,
Chris
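
For reference, the manual workaround quoted below (sending SIGCONT to the
processes of a job that stayed suspended after a qmaster restart) could be
scripted roughly as follows. This is only a sketch under assumptions: the
function name resume_processes is hypothetical, the PIDs have to be looked
up by hand on the execution host (for example with ps), and the script must
run there with enough privileges to signal the job's processes.

```python
import os
import signal


def resume_processes(pids):
    """Send SIGCONT to each PID; return the PIDs that could not be signalled.

    Hypothetical helper: the caller is expected to supply the PIDs of the
    suspended job's processes, found manually on the execution host.
    """
    failed = []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGCONT)
        except (ProcessLookupError, PermissionError):
            # The process has already exited, or we lack the rights to signal it.
            failed.append(pid)
    return failed
```

Sending SIGCONT to a process that is not stopped has no effect, so running
this against an already resumed job should be harmless.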




On Nov 7, 2008, at 5:52 AM, Bradford, Matthew wrote:

> Andy,
>
> Thanks for that. We have now happily restarted sge_master and all is now
> well.
>
> Do you have any idea what could have caused the problem in the first
> place?
>
> Cheers,
>
> Mat
>
>> -----Original Message-----
>> From: andy [mailto:andy.schwierskott at sun.com]
>> Sent: 07 November 2008 09:47
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Qmaster Problem
>>
>> Hi,
>>
>> stopping and restarting qmaster has no impact on running and
>> pending jobs.
>> There's only a short interruption in qmaster's responsiveness.
>>
>> There is a bug in SGE 6.2 FCS regarding queues (and their jobs)
>> which are suspended due to suspend_on_subordinate: this state
>> is lost at qmaster restart, but the jobs remain
>> suspended. This needs to be fixed manually by sending a
>> SIGCONT to the processes of the job.
>>
>> Andy
>>
>>
>>
>> On Fri, 7 Nov 2008, Bradford, Matthew wrote:
>>
>>> We have been running SGE 6.2 happily for the last few weeks, and all
>>> of a sudden we are seeing a problem.
>>>
>>> From a client, attempting any SGE command returns this message:
>>> error: failed receiving gdi request response for mid=1 (got syncron
>>> message receive timeout error).
>>>
>>> And running the qping command returns this information:
>>>
>>> qping -info sge_master_host 801 qmaster 1
>>> 11/07/2008 09:07:55:
>>> SIRM version:             0.1
>>> SIRM message id:          1
>>> start time:               11/01/2008 11:22:19 (1225538539)
>>> run time [s]:             510336
>>> messages in read buffer:  0
>>> messages in write buffer: 0
>>> nr. of connected clients: 411
>>> status:                   2
>>> info:                     MAIN: E (510335.92) | signaler000: E
>>> (510333.48) | event_master000: E (0.01) | timer000: E (3.00) |
>>> worker000: E (57078.01) | worker001: E (56779.02) | listener000: E
>>> (0.25) | listener001: E (0.10) | scheduler000: E (56751.01) | ERROR
>>> malloc:                   arena(451719168) |ordblks(14) | smblks(52) |
>>> hblksr(2) | hblhkd(2105344) usmblks(0) | fsmblks(1904) |
>>> uordblks(451578592) | fordblks(140576) | keepcost(126736)
>>> Monitor:                  disabled
>>>
>>>
>>> The sge_master process is still running on the master host, and
>>> contains about 12 child sge_master processes.
>>>
>>> Would stopping and starting the sge_master service kill any running
>>> jobs, or should they happily communicate with the new master process?
>>>
>>> Any help would be much appreciated
>>>
>>> Cheers,
>>>
>>> Mat Bradford
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88269
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88270
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88276
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=94346

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
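
The "info:" line in the qping output quoted in this thread lists one entry
per qmaster thread. A small sketch for picking out the threads with large
timer values is shown here. It assumes, without confirmation from the
thread, that the number in parentheses is the time in seconds since that
thread last reported in; the function names, the threshold, and the ignore
list are all hypothetical. In the quoted output the MAIN and signaler000
values simply track the run time (510336 s), so they are excluded in the
usage example, which leaves the two workers and the scheduler at roughly
57000 s.

```python
import re

# One entry in the "info:" line looks like "worker000: E (57078.01)".
THREAD_RE = re.compile(r"(\w+):\s*([A-Z])\s*\(\s*([\d.]+)\s*\)")


def parse_thread_info(info_text):
    """Return a list of (thread_name, state, seconds) tuples."""
    # Collapse line wraps introduced by mail clients before matching.
    flat = " ".join(info_text.split())
    return [(name, state, float(secs))
            for name, state, secs in THREAD_RE.findall(flat)]


def stalled_threads(info_text, threshold_s=600.0, ignore=()):
    """Return entries whose timer is at or above threshold_s (an assumed cutoff)."""
    return [(name, state, secs)
            for name, state, secs in parse_thread_info(info_text)
            if secs >= threshold_s and name not in ignore]


# The "info:" value copied from the qping output quoted in this thread.
SAMPLE = """MAIN: E (510335.92) | signaler000: E
(510333.48) | event_master000: E (0.01) | timer000: E (3.00) |
worker000: E (57078.01) | worker001: E (56779.02) | listener000: E
(0.25) | listener001: E (0.10) | scheduler000: E (56751.01) | ERROR"""

if __name__ == "__main__":
    for name, state, secs in stalled_threads(
            SAMPLE, ignore=("MAIN", "signaler000")):
        print(f"{name}: state {state}, {secs:.0f} s")
```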


