[GE users] Qmaster hangs frequently (v6.2_u2)

parimi Venkateswara.Rao.Parimi at deshaw.com
Mon Jun 22 23:53:04 BST 2009


-> We spool to an NFS server, and there are no file server issues.
-> The spooling method is flat files.
-> Qmaster logs no messages. If we raise the log level from log_warning
to log_info, we see a few event client messages immediately after the
qmaster is restarted, and then it goes dead silent.
-> qping works.

<<>>
$ qping -info desans6 5450 qmaster 1
06/11/2009 19:13:20:
SIRM version: 0.1
SIRM message id: 1
start time: 06/11/2009 19:09:44 (1244761784)
run time [s]: 216
messages in read buffer: 0
messages in write buffer: 0
nr. of connected clients: 589
status: 2
info: MAIN: W (215.77) | signaler000: W (204.78) | event_master000: W
(202.24) | timer000: W (190.35) | worker000: W (202.52) | worker001: W
(200.16) | listener000: W (202.56) | listener001: W (0.04) |
scheduler000: W (3.34) | ERROR
Monitor:
06/11/2009 19:09:44 | MAIN: no monitoring data available
06/11/2009 19:09:55 | signaler000: no monitoring data available
06/11/2009 19:09:55 | event_master000: no monitoring data available
06/11/2009 19:09:55 | timer000: no monitoring data available
06/11/2009 19:09:55 | worker000: no monitoring data available
06/11/2009 19:09:55 | worker001: no monitoring data available
06/11/2009 19:09:55 | listener000: no monitoring data available
06/11/2009 19:12:59 | listener001: runs: 7.32r/s (in (g:7.28 a:0.00
e:0.00 r:0.00)/s) out: 0.00m/s APT: 0.0000s/m idle: 99.98% wait: 0.00%
time: 29.52s
06/11/2009 19:09:55 | scheduler000: no monitoring data available
$
<</>>
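
To make the "info:" line above easier to read at a glance, here is a minimal Python sketch that splits it into per-thread entries and lists the threads whose timer exceeds a threshold. It assumes the letter after each thread name is a state flag and the number in parentheses is the seconds since that thread last reported in; that interpretation is an assumption, not something stated in the output itself. The names parse_thread_info and stale_threads and the 60-second threshold are hypothetical, chosen only for illustration.

```python
import re

# The "info:" value copied from the qping output above, with the mail
# line wraps joined back into a single string.
SAMPLE_INFO = (
    "MAIN: W (215.77) | signaler000: W (204.78) | event_master000: W (202.24) | "
    "timer000: W (190.35) | worker000: W (202.52) | worker001: W (200.16) | "
    "listener000: W (202.56) | listener001: W (0.04) | scheduler000: W (3.34) | ERROR"
)

# One entry looks like "worker000: W (202.52)".
THREAD_RE = re.compile(r"(\w+):\s+(\w)\s+\(([\d.]+)\)")


def parse_thread_info(info):
    """Return a list of (thread_name, state_flag, seconds) tuples."""
    return [(name, state, float(secs))
            for name, state, secs in THREAD_RE.findall(info)]


def stale_threads(info, threshold_s=60.0):
    """Names of threads whose timer is above threshold_s.

    Assumption: the number in parentheses is the time in seconds since
    the thread last reported in. The default threshold is arbitrary.
    """
    return [name for name, _state, secs in parse_thread_info(info)
            if secs > threshold_s]


if __name__ == "__main__":
    for name in stale_threads(SAMPLE_INFO):
        print(name)
```

Run against the output above, this lists every thread except listener001 and scheduler000, which matches the picture of a qmaster that still answers qping while most of its threads sit idle for roughly the whole run time.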

-> The cluster has about 600 nodes.
-> Job turnaround is about 30k to 50k jobs per day.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203011
