[GE users] Qmaster Problem

Bradford, Matthew matthew.bradford at eds.com
Fri Nov 7 10:52:03 GMT 2008


Andy,

Thanks for that. We have restarted sge_master and all is now
well.

Do you have any idea what could have caused the problem in the first
place? 

Cheers,

Mat

>-----Original Message-----
>From: andy [mailto:andy.schwierskott at sun.com] 
>Sent: 07 November 2008 09:47
>To: users at gridengine.sunsource.net
>Subject: Re: [GE users] Qmaster Problem
>
>Hi,
>
>Stopping and restarting qmaster has no impact on running and 
>pending jobs.
>There's only a short interruption in qmaster's responsiveness.
>
>There is a bug in SGE 6.2 FCS regarding queues (and their jobs) 
>which are suspended due to suspend_on_subordinate: this state 
>is lost at qmaster restart, but the jobs remain 
>suspended. This needs to be fixed manually by sending a 
>SIGCONT to the processes of the job.
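
A minimal sketch of the manual fix described above, assuming the PIDs of the suspended job's processes are already known on the execution host (how to look them up is not covered in this thread). The function name and the PIDs in the usage note are hypothetical; os.kill and signal.SIGCONT are standard Python on POSIX systems.

```python
import os
import signal
from typing import Callable, Iterable, List


def resume_job_processes(
    pids: Iterable[int],
    kill: Callable[[int, int], None] = os.kill,
) -> List[int]:
    """Send SIGCONT to each PID and return the PIDs that were signalled.

    `kill` is injectable so the function can be exercised without
    touching real processes; by default it is os.kill.
    """
    resumed = []
    for pid in pids:
        try:
            kill(pid, signal.SIGCONT)
        except ProcessLookupError:
            # The process has already exited, so there is nothing to resume.
            continue
        resumed.append(pid)
    return resumed
```

Run on the execution host as the job owner or root, a call such as resume_job_processes([12345, 12346]) would resume those two (made-up) PIDs. A PermissionError is deliberately not caught, so a wrong user shows up as an error instead of being silently skipped.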
>
>Andy
>
>
>
>On Fri, 7 Nov 2008, Bradford, Matthew wrote:
>
>> We have been running SGE 6.2 happily for the last few weeks, and all 
>> of a sudden we are seeing a problem.
>>
>> From a client, attempting any SGE command returns this message:
>> error: failed receiving gdi request response for mid=1 (got syncron 
>> message receive timeout error).
>>
>> And running the qping command returns this information:
>>
>> qping -info sge_master_host 801 qmaster 1
>> 11/07/2008 09:07:55:
>> SIRM version:             0.1
>> SIRM message id:          1
>> start time:               11/01/2008 11:22:19 (1225538539)
>> run time [s]:             510336
>> messages in read buffer:  0
>> messages in write buffer: 0
>> nr. of connected clients: 411
>> status:                   2
>> info:                     MAIN: E (510335.92) | signaler000: E
>> (510333.48) | event_master000: E (0.01) | timer000: E (3.00) |
>> worker000: E (57078.01) | worker001: E (56779.02) | listener000: E
>> (0.25) | listener001: E (0.10) | scheduler000: E (56751.01) | ERROR
>> malloc:                   arena(451719168) |ordblks(14) | smblks(52) |
>> hblksr(2) | hblhkd(2105344) usmblks(0) | fsmblks(1904) |
>> uordblks(451578592) | fordblks(140576) | keepcost(126736)
>> Monitor:                  disabled
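
To make the "info:" line above easier to read, here is a rough sketch that splits it into per-thread entries and lists those whose counter exceeds a threshold. It assumes the number in parentheses is the time in seconds since the thread last reported in; that reading is an assumption and is not confirmed anywhere in this thread. The function names and the 60 second threshold are hypothetical.

```python
import re
from typing import List, Tuple

# The "info:" line from the qping output above, joined into one string.
SAMPLE_INFO = (
    "MAIN: E (510335.92) | signaler000: E (510333.48) | "
    "event_master000: E (0.01) | timer000: E (3.00) | "
    "worker000: E (57078.01) | worker001: E (56779.02) | "
    "listener000: E (0.25) | listener001: E (0.10) | "
    "scheduler000: E (56751.01) | ERROR"
)

# thread name, one-letter state, number in parentheses
THREAD_RE = re.compile(r"(\w+):\s*([A-Z])\s*\(([0-9.]+)\)")


def parse_thread_info(info: str) -> List[Tuple[str, str, float]]:
    """Return (name, state, seconds) for every thread entry in the line."""
    return [(name, state, float(secs))
            for name, state, secs in THREAD_RE.findall(info)]


def threads_over(info: str, threshold_s: float) -> List[str]:
    """Names of threads whose counter is above threshold_s (assumed seconds)."""
    return [name for name, _state, secs in parse_thread_info(info)
            if secs > threshold_s]


if __name__ == "__main__":
    for name in threads_over(SAMPLE_INFO, 60.0):
        print(name)
```

With a threshold of 60, the entries picked out are MAIN, signaler000, worker000, worker001 and scheduler000. Whether a large value is abnormal for a given thread is not something this sketch can decide.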
>>
>>
>> The sge_master process is still running on the master host, and 
>> contains about 12 child sge_master processes.
>>
>> Would stopping and starting the sge_master service kill any running 
>> jobs, or should they happily communicate with the new master process?
>>
>> Any help would be much appreciated.
>>
>> Cheers,
>>
>> Mat Bradford
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88269
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>
>
>------------------------------------------------------
>http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88270
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88276