[GE users] Scheduler hangs or crashes sporadically

parimi Venkateswara.Rao.Parimi at deshaw.com
Fri Mar 20 00:57:44 GMT 2009


Hello.

We migrated to SGE v6.2u1 this week. Over the last 4 days the
scheduler/qmaster has sporadically become unresponsive. Jobs that were
already scheduled keep running; however, new batch jobs stay in the qw
state and interactive jobs fail with the error below.

<<>>
$ qrsh
error: getting configuration: failed receiving gdi request response for
mid=1 (got syncron message receive timeout error).
error:
Cannot get configuration from qmaster.
<</>>

Restarting qmaster or switching it to a different node helps.

Diagnostic info:

qstat returns the message below:

<<>>
$ qstat -j 7604
Can not get job info messages, scheduler is not available
==============================================================
job_number:                 7604
exec_file:                  job_scripts/7604
submission_time:            Sun Mar 15 03:40:08 2009
<</>>

qping info:

<<>>
$ qping -info server port qmaster 1
03/19/2009 14:56:40:
SIRM version:             0.1
SIRM message id:          1
start time:               03/17/2009 17:21:51 (1237324911)
run time [s]:             164089
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 903
status:                   1
info:                     MAIN: E (164089.09) | signaler000: E
(164082.16) | event_master000: E (0.08) | timer000: E (4.71) |
worker000: E (0.08) | worker001: E (0.59) | listener000: E (0.71) |
listener001: E (0.08) | scheduler000: E (6.71) | WARNING
Monitor:
03/17/2009 17:21:51 | MAIN: no monitoring data available
03/17/2009 17:21:58 | signaler000: no monitoring data available
03/19/2009 14:56:28 | event_master000: runs: 48.22r/s (clients: 166.00
mod: 0.00/s ack: 5.56/s blocked: 0.00 busy: 0.00 | events: 59.68/s
added: 0.07/s skipt: 59.62/s) out: 5.56m/s APT: 0.0001s/m idle: 99.38%
wait: 0.00% time: 30.01s
03/19/2009 14:56:29 | timer000: runs: 0.43r/s (pending: 10.00 executed:
0.43/s) out: 0.00m/s APT: 0.0135s/m idle: 99.41% wait: 0.00% time:
30.01s
03/19/2009 14:56:31 | worker000: runs: 11.11r/s (EXECD
(l:9.51,j:8.67,c:9.51,p:0.00,a:0.00)/s GDI
(a:0.27,g:3.64,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
11.11m/s APT: 0.0032s/m idle: 96.50% wait: 0.66% time: 29.97s
03/19/2009 14:56:33 | worker001: runs: 10.98r/s (EXECD
(l:9.54,j:8.45,c:9.54,p:0.00,a:0.00)/s GDI
(a:0.16,g:3.55,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
10.98m/s APT: 0.0026s/m idle: 97.20% wait: 0.62% time: 30.41s
03/19/2009 14:56:12 | listener000: runs: 14.41r/s (in (g:1.16 a:3.30
e:0.00 r:9.95)/s) out: 0.00m/s APT: 0.0002s/m idle: 99.77% wait: 0.04%
time: 30.26s
03/19/2009 14:56:40 | listener001: runs: 14.67r/s (in (g:1.65 a:2.89
e:0.00 r:10.14)/s) out: 0.00m/s APT: 0.0002s/m idle: 99.76% wait: 0.03%
time: 29.78s
03/19/2009 13:41:14 | scheduler000: runs: 0.00r/s () out: 0.00m/s APT:
0.0000s/m idle: 0.52% wait: 0.00% time: 1648.96s
<</>>

The qping thread states always show 'E', regardless of whether qmaster
is responsive or not.
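
One thing that does differ in the qping output above is the timestamp on
the scheduler000 Monitor line (13:41:14), while the other reporting
threads are all within about 30 seconds of the qping call (14:56:40).
Below is a rough Python sketch of how that gap could be checked
automatically. The function name stale_threads, the 5-minute threshold,
and the reading of the leading timestamp as "when the thread last
reported monitoring data" are assumptions for illustration, not anything
documented by SGE. MAIN and signaler000 are left out because they never
report monitoring data.

```python
from datetime import datetime, timedelta
import re

# Monitor lines copied from the qping output above. Only the leading
# "timestamp | thread:" part matters here, so the rest is truncated.
MONITOR = """\
03/19/2009 14:56:28 | event_master000: runs: 48.22r/s
03/19/2009 14:56:29 | timer000: runs: 0.43r/s
03/19/2009 14:56:31 | worker000: runs: 11.11r/s
03/19/2009 14:56:33 | worker001: runs: 10.98r/s
03/19/2009 14:56:12 | listener000: runs: 14.41r/s
03/19/2009 14:56:40 | listener001: runs: 14.67r/s
03/19/2009 13:41:14 | scheduler000: runs: 0.00r/s
"""

LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \| (\w+):")


def stale_threads(monitor_text, now, max_age=timedelta(minutes=5)):
    """Return {thread: age} for threads whose last report is older than max_age.

    max_age is an arbitrary threshold chosen for this sketch.
    """
    stale = {}
    for line in monitor_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # wrapped continuation lines carry no timestamp
        stamp = datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S")
        age = now - stamp
        if age > max_age:
            stale[m.group(2)] = age
    return stale


# Time of the qping call, taken from the first line of its output above.
QPING_TIME = datetime(2009, 3, 19, 14, 56, 40)

for name, age in stale_threads(MONITOR, QPING_TIME).items():
    print(f"{name}: last report {age} ago")
```

With the data above, only scheduler000 is flagged: its last Monitor
entry is 1:15:26 older than the qping call.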

Is this a known bug, or a symptom of the scheduler crashing because of
a memory leak or something similar?

Thanks, Parimi V.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=137089
