[GE users] Qmaster hangs frequently (v6.2_u2)

crei crei at sun.com
Tue Jun 23 08:55:00 BST 2009


Thanks,

Your qping output shows monitoring data. What is your monitoring time
(MONITOR_TIME)?
(qconf -sconf | grep qmaster_params)

After some time every thread should report monitoring data. In your
output only listener001 does. Please check whether the other threads
also start reporting monitoring data; if they do not, there might be a
locking problem.
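
A quick way to see which threads are still silent is to filter the
qping output. This is only a rough sketch: silent_threads is a helper
name I made up, and it assumes the monitor lines look exactly like the
ones in your quoted output ("<date> <time> | <thread>: no monitoring
data available").

```shell
# Print the name of every thread that has not reported monitoring data.
# Reads "qping -info" output on stdin (line format assumed, see above).
silent_threads() {
    awk -F'|' '/no monitoring data available/ {
        name = $2
        sub(/^[ \t]+/, "", name)   # strip the blank after the "|"
        sub(/:.*/, "", name)       # keep only the thread name
        print name
    }'
}

# Usage, with host, port and component taken from your command:
#   qping -info desans6 5450 qmaster 1 | silent_threads
```

If the same threads are still listed after several monitoring
intervals, that would support the locking theory.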

Have you turned on schedd_job_info?
(qconf -ssconf | grep schedd_job_info)
If it is enabled, does it help to turn it off?
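
To read just the current value, something like the following should do.
It is a small sketch: job_info_setting is a hypothetical helper name,
and it assumes the usual "name value" layout of the qconf -ssconf
output.

```shell
# Print the value of schedd_job_info from scheduler configuration text.
# Reads "qconf -ssconf" output on stdin ("name value" lines assumed).
job_info_setting() {
    awk '$1 == "schedd_job_info" {
        $1 = ""            # drop the parameter name
        sub(/^ +/, "")     # and the separator left behind
        print
    }'
}

# Usage:
#   qconf -ssconf | job_info_setting
```

To turn it off, qconf -msconf opens the scheduler configuration in an
editor, where schedd_job_info can be set to false.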

See also: http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464

Are there enough resources (memory/CPU) available on your qmaster host?
(Please check the size and CPU usage of the sge_ daemons when the
problem occurs.)
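
For example, the relevant rows can be picked out of a ps listing. This
is a sketch only: sge_rows is a made-up helper name, and the ps column
names in the usage line (vsz, rss, pcpu, comm) are an assumption about
your platform's ps.

```shell
# Keep the header line plus every row whose command name contains "sge_".
# Reads ps output on stdin; the command is assumed to be the last column.
sge_rows() {
    awk 'NR == 1 || $NF ~ /sge_/'
}

# Usage (run on the qmaster host while the hang is happening):
#   ps -e -o pid,vsz,rss,pcpu,comm | sge_rows
```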


Regards,

Christian


On 06/23/09 00:53, parimi wrote:
> -> We do spool on NFS server and there are no file server issues.
> -> Spooling method is flat files
> -> Qmaster has no messages. If we enable log level to log_info from
> log_warning we notice few event client messages immediately after the
> qmaster is restarted and gets dead silent.
> -> qping works
> 
> <<>>
> $ qping -info desans6 5450 qmaster 1
> 06/11/2009 19:13:20:
> SIRM version: 0.1
> SIRM message id: 1
> start time: 06/11/2009 19:09:44 (1244761784)
> run time [s]: 216
> messages in read buffer: 0
> messages in write buffer: 0
> nr. of connected clients: 589
> status: 2
> info: MAIN: W (215.77) | signaler000: W (204.78) | event_master000: W
> (202.24) | timer000: W (190.35) | worker000: W (202.52) | worker001: W
> (200.16) | listener000: W (202.56) | listener001: W (0.04) |
> scheduler000: W (3.34) | ERROR
> Monitor:
> 06/11/2009 19:09:44 | MAIN: no monitoring data available
> 06/11/2009 19:09:55 | signaler000: no monitoring data available
> 06/11/2009 19:09:55 | event_master000: no monitoring data available
> 06/11/2009 19:09:55 | timer000: no monitoring data available
> 06/11/2009 19:09:55 | worker000: no monitoring data available
> 06/11/2009 19:09:55 | worker001: no monitoring data available
> 06/11/2009 19:09:55 | listener000: no monitoring data available
> 06/11/2009 19:12:59 | listener001: runs: 7.32r/s (in (g:7.28 a:0.00
> e:0.00 r:0.00)/s) out: 0.00m/s APT: 0.0000s/m idle: 99.98% wait: 0.00%
> time: 29.52s
> 06/11/2009 19:09:55 | scheduler000: no monitoring data available
> $
> <</>>
> 
> -> Cluster is about 600 nodes
> -> We have about 30k to 50k job turnaround per day.
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203011
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

-- 
Sun Microsystems GmbH             Christian Reissmann
Dr.-Leo-Ritter-Str. 7             Software Engineer
D-93049 Regensburg                Phone: +49 (0)941 3075 112
Germany                           Fax:   +49 (0)941 3075 222
http://www.sun.de                 mailto: Christian.Reissmann at sun.com
                                   http://www.sun.com/gridengine
Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Haering

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=203098
