[GE users] Qmaster Problem

andy andy.schwierskott at sun.com
Fri Nov 7 09:47:20 GMT 2008


Hi,

Stopping and restarting qmaster has no impact on running or pending jobs.
There is only a short interruption in qmaster's responsiveness.

There is a bug in SGE 6.2 FCS regarding queues (and their jobs) which are
suspended due to suspend_on_subordinate: this state is lost at qmaster
restart, but the jobs remain suspended. This needs to be fixed
manually by sending a SIGCONT to the processes of the job.
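
A rough sketch of that manual fix, not an official SGE tool: it assumes you
have already looked up the PIDs of the job's processes on the execution host
(for example with ps) and that you run it as a user allowed to signal them.
The function name resume_processes is made up for illustration.

```python
import os
import signal


def resume_processes(pids):
    """Send SIGCONT to each PID and return the PIDs that were signalled.

    PIDs that no longer exist are skipped. A PermissionError (signalling
    a process owned by another user) is deliberately not caught, so the
    caller sees it.
    """
    resumed = []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGCONT)
        except ProcessLookupError:
            # The process exited in the meantime; nothing to resume.
            continue
        resumed.append(pid)
    return resumed
```

Calling resume_processes([12345, 12346]) with the job's real PIDs in place of
these placeholder values would wake up the processes that stayed stopped
after the qmaster restart.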

Andy



On Fri, 7 Nov 2008, Bradford, Matthew wrote:

> We have been running SGE 6.2 happily for the last few weeks, and all of
> a sudden we are seeing a problem.
>
> From a client, attempting any SGE command returns this message:
> error: failed receiving gdi request response for mid=1 (got syncron
> message receive timeout error).
>
> And running the qping command returns this information:
>
> qping -info sge_master_host 801 qmaster 1
> 11/07/2008 09:07:55:
> SIRM version:             0.1
> SIRM message id:          1
> start time:               11/01/2008 11:22:19 (1225538539)
> run time [s]:             510336
> messages in read buffer:  0
> messages in write buffer: 0
> nr. of connected clients: 411
> status:                   2
> info:                     MAIN: E (510335.92) | signaler000: E
> (510333.48) | event_master000: E (0.01) | timer000: E (3.00) |
> worker000: E (57078.01) | worker001: E (56779.02) | listener000: E
> (0.25) | listener001: E (0.10) | scheduler000: E (56751.01) | ERROR
> malloc:                   arena(451719168) |ordblks(14) | smblks(52) |
> hblksr(2) | hblhkd(2105344) usmblks(0) | fsmblks(1904) |
> uordblks(451578592) | fordblks(140576) | keepcost(126736)
> Monitor:                  disabled
>
>
> The sge_master process is still running on the master host, and contains
> about 12 child sge_master processes.
>
> Would stopping and starting the sge_master service kill any running
> jobs, or should they happily communicate with the new master process.
>
> Any help would be much appreciated
>
> Cheers,
>
> Mat Bradford
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88269
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=88270
