[GE users] Scheduler hangs or crashes sporadically

crei crei at sun.com
Wed Mar 25 10:36:22 GMT 2009


Please take a look at the following issues:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=2890
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2767


If your qmaster is overloaded (indicated by GDI response errors), it
might help to set the qmaster_params cl_ping, gdi_timeout, and
gdi_retries (see man sge_conf).
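
As a rough sketch only: these parameters would go on the qmaster_params
line of the global configuration, which can be edited with
"qconf -mconf". The values below are illustrative assumptions, not
recommendations, and the exact value syntax should be checked against
sge_conf(5) for your version.

```
# Edit the global configuration with: qconf -mconf
# Values are placeholders chosen for illustration only.
qmaster_params   gdi_timeout=120,gdi_retries=4,cl_ping=true
```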

Regards,

Christian


On 03/20/09 01:57, parimi wrote:
> Hello.
> 
> We migrated to SGE v6.2u1 this week. In the last 4 days the
> scheduler/qmaster has gone unresponsive. Already scheduled jobs keep
> running; however, new batch jobs go into the qw state and interactive
> jobs fail with the error below.
> 
> <<>>
> $ qrsh
> error: getting configuration: failed receiving gdi request response for
> mid=1 (got syncron message receive timeout error).
> error:
> Cannot get configuration from qmaster.
> <</>>
> 
> Restarting qmaster or switching it to a different node helps.
> 
> Diagnostic info:
> 
> qstat returns the message below:
> 
> <<>>
> $ qstat -j 7604
> Can not get job info messages, scheduler is not available
> ==============================================================
> job_number:                 7604
> exec_file:                  job_scripts/7604
> submission_time:            Sun Mar 15 03:40:08 2009
> <</>>
> 
> Qping info:
> 
> <<>>
> $ qping -info server port qmaster 1
> 03/19/2009 14:56:40:
> SIRM version:             0.1
> SIRM message id:          1
> start time:               03/17/2009 17:21:51 (1237324911)
> run time [s]:             164089
> messages in read buffer:  0
> messages in write buffer: 0
> nr. of connected clients: 903
> status:                   1
> info:                     MAIN: E (164089.09) | signaler000: E
> (164082.16) | event_master000: E (0.08) | timer000: E (4.71) |
> worker000: E (0.08) | worker001: E (0.59) | listener000: E (0.71) |
> listener001: E (0.08) | scheduler000: E (6.71) | WARNING
> Monitor:
> 03/17/2009 17:21:51 | MAIN: no monitoring data available
> 03/17/2009 17:21:58 | signaler000: no monitoring data available
> 03/19/2009 14:56:28 | event_master000: runs: 48.22r/s (clients: 166.00
> mod: 0.00/s ack: 5.56/s blocked: 0.00 busy: 0.00 | events: 59.68/s
> added: 0.07/s skipt: 59.62/s) out: 5.56m/s APT: 0.0001s/m idle: 99.38%
> wait: 0.00% time: 30.01s
> 03/19/2009 14:56:29 | timer000: runs: 0.43r/s (pending: 10.00 executed:
> 0.43/s) out: 0.00m/s APT: 0.0135s/m idle: 99.41% wait: 0.00% time:
> 30.01s
> 03/19/2009 14:56:31 | worker000: runs: 11.11r/s (EXECD
> (l:9.51,j:8.67,c:9.51,p:0.00,a:0.00)/s GDI
> (a:0.27,g:3.64,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 11.11m/s APT: 0.0032s/m idle: 96.50% wait: 0.66% time: 29.97s
> 03/19/2009 14:56:33 | worker001: runs: 10.98r/s (EXECD
> (l:9.54,j:8.45,c:9.54,p:0.00,a:0.00)/s GDI
> (a:0.16,g:3.55,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 10.98m/s APT: 0.0026s/m idle: 97.20% wait: 0.62% time: 30.41s
> 03/19/2009 14:56:12 | listener000: runs: 14.41r/s (in (g:1.16 a:3.30
> e:0.00 r:9.95)/s) out: 0.00m/s APT: 0.0002s/m idle: 99.77% wait: 0.04%
> time: 30.26s
> 03/19/2009 14:56:40 | listener001: runs: 14.67r/s (in (g:1.65 a:2.89
> e:0.00 r:10.14)/s) out: 0.00m/s APT: 0.0002s/m idle: 99.76% wait: 0.03%
> time: 29.78s
> 03/19/2009 13:41:14 | scheduler000: runs: 0.00r/s () out: 0.00m/s APT:
> 0.0000s/m idle: 0.52% wait: 0.00% time: 1648.96s
> <</>>
> 
> The qping thread states always show 'E', irrespective of whether the
> qmaster is responsive or not.
> 
> Is this a known bug, or some manifestation of the scheduler crashing
> due to a memory leak or similar?
> 
> Thanks, Parimi V.
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=137089
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
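
In the qping monitor output quoted above, the last scheduler000 record
is from 13:41:14, while the other threads reported around 14:56. That
suggests the scheduler thread stopped reporting even though its state
is still shown as 'E'. A minimal sketch for spotting this automatically
is below. The function name stale_threads and the 300 second threshold
are assumptions made for illustration; only the record layout is taken
from the output above.

```python
import re
from datetime import datetime

# Start of a qping monitor record, e.g.
# "03/19/2009 14:56:31 | worker000: runs: 11.11r/s ..."
RECORD = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \| (\w+): (.*)$")


def stale_threads(monitor_text, max_age_s=300):
    """Return {thread: age_in_seconds} for threads whose last monitor
    record is more than max_age_s older than the newest record."""
    last_seen = {}
    for line in monitor_text.splitlines():
        # Tolerate mail quoting ("> ") in front of the record.
        match = RECORD.match(line.lstrip("> ").rstrip())
        if not match:
            continue  # wrapped continuation line or unrelated output
        stamp, thread, rest = match.groups()
        if "no monitoring data available" in rest:
            continue  # thread never reports, so it cannot go stale
        last_seen[thread] = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    if not last_seen:
        return {}
    newest = max(last_seen.values())
    ages = {t: int((newest - ts).total_seconds()) for t, ts in last_seen.items()}
    return {t: age for t, age in ages.items() if age > max_age_s}


# A few record start lines copied from the output above.
sample = """\
03/17/2009 17:21:51 | MAIN: no monitoring data available
03/19/2009 14:56:31 | worker000: runs: 11.11r/s (EXECD
03/19/2009 14:56:40 | listener001: runs: 14.67r/s (in (g:1.65 a:2.89
03/19/2009 13:41:14 | scheduler000: runs: 0.00r/s () out: 0.00m/s APT:
"""
print(stale_threads(sample))  # {'scheduler000': 4526}
```

Ages are measured against the newest record rather than the local
clock, so the check also works on saved output.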

-- 
Sun Microsystems GmbH             Christian Reissmann
Dr.-Leo-Ritter-Str. 7             Software Engineer
D-93049 Regensburg                Phone: +49 (0)941 3075 112
Germany                           Fax:   +49 (0)941 3075 222
http://www.sun.de                 mailto: Christian.Reissmann at sun.com
                                   http://www.sun.com/gridengine
Registered office:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Munich District Court: HRB 161028
Managing directors: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Chairman of the supervisory board: Martin Haering

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=142577