[GE users] sge_qmaster use 99.9% CPU

Christian Reissmann Christian.Reissmann at Sun.COM
Mon May 26 10:09:43 BST 2008



Hi Simon,

The problem seems to be related to the read error from the earlier qstat
command. Did you kill it with SIGKILL?

Did you shut down and restart the qmaster? If so, was the qmaster still
using 100% CPU afterwards?

Regards,

Christian


Simon Gao wrote:
> Ravi Chandra Nallan wrote:
>> Hi,
>> What version of SGE are you using? Any hints in the qmaster messages 
>> file?
>>
>> regards,
>> ~Ravi
>> Simon Gao wrote:
>>> Hi,
>>>
>>> I just noticed that sge_qmaster has been constantly using 99.9% of 
>>> CPU time. What factors can contribute to such high CPU usage by 
>>> sge_qmaster? Where should I look to find out what is going on?
>>>
> SGE 6.0u6.
> 
> I am not sure whether the following messages are related:
> 
> 05/20/2008 15:17:01|qmaster|cluster|I|qmaster hard descriptor limit is set to 1024
> 05/20/2008 15:17:01|qmaster|cluster|I|qmaster soft descriptor limit is set to 1024
> 05/20/2008 15:17:01|qmaster|cluster|I|qmaster will use max. 1004 file descriptors for communication
> 05/20/2008 15:17:01|qmaster|cluster|I|qmaster will accept max. 99 dynamic event clients
> 05/20/2008 15:17:01|qmaster|cluster|I|starting up 6.0u6
> 05/20/2008 15:17:01|qmaster|cluster|W|FD_SETSIZE is limited to 1024 file descriptors on this system.
> 05/20/2008 15:17:01|qmaster|cluster|W|If you want to support more than 1004 qmaster clients you have to
> 05/20/2008 15:17:01|qmaster|cluster|W|recompile the source code with a higher FD_SETSIZE setting.
> 05/20/2008 15:17:01|qmaster|cluster|W|Bug Link: http://gridengine.sunsource.net/issues/show_bug.cgi?id=1502
> 
> 05/20/2008 16:20:01|qmaster|cluster|E|commlib error: got send timeout (closing "clustersub.company.com/qstat/2065")
> 05/20/2008 16:20:01|qmaster|cluster|E|commlib error: got send timeout (closing "clustersub.company.com/qstat/2064")
> 
> 05/20/2008 19:14:49|qmaster|cluster|E|can't send asynchronous message to commproc (qstat:2398) on host "clustersub.company.com": can't send response for this message id - protocol error
> 05/20/2008 19:14:49|qmaster|cluster|E|can't send asynchronous message to commproc (qstat:2397) on host "clustersub.company.com": can't send response for this message id - protocol error
> 
> 
> Besides a main head node, cluster, we also have a submission node, 
> clustersub, from which users can submit jobs.
> 
> Simon
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
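
For anyone who wants to sift a qmaster messages file for entries like the
ones quoted above, here is a rough Python sketch. It assumes each entry is
on a single line with five pipe-separated fields (timestamp, daemon, host,
level, text), which is what the quoted excerpt looks like once the mail
line wrapping is undone. Reading the level letters I/W/E as info, warning
and error is my assumption, and the names LogEntry, parse_line and
summarize are made up for this example; they are not part of Grid Engine.

```python
from collections import Counter
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str    # assumed: I = info, W = warning, E = error
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one unwrapped messages line into its five pipe-separated fields."""
    # Split at most four times so a '|' inside the message text is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the assumed format, e.g. a wrapped continuation
    return LogEntry(*parts)


def summarize(lines):
    """Count entries per level and collect the entries with level 'E'."""
    counts = Counter()
    errors = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        counts[entry.level] += 1
        if entry.level == "E":
            errors.append(entry)
    return counts, errors


# Three entries taken from the excerpt quoted in this thread.
SAMPLE = [
    '05/20/2008 15:17:01|qmaster|cluster|I|starting up 6.0u6',
    '05/20/2008 15:17:01|qmaster|cluster|W|FD_SETSIZE is limited to 1024 '
    'file descriptors on this system.',
    '05/20/2008 16:20:01|qmaster|cluster|E|commlib error: got send timeout '
    '(closing "clustersub.company.com/qstat/2065")',
]

if __name__ == "__main__":
    counts, errors = summarize(SAMPLE)
    print(dict(counts))
    for entry in errors:
        print(entry.timestamp, entry.message)
```

To run it against a real file, pass the open file object to summarize()
instead of SAMPLE. The location of the messages file depends on the
installation, so it is not hard-coded here.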

-- 
Sun Microsystems GmbH             Christian Reissmann
Dr.-Leo-Ritter-Str. 7             Software Engineer
D-93049 Regensburg                Phone: +49 (0)941 3075 112
Germany                           Fax:   +49 (0)941 3075 222
http://www.sun.de                 mailto: Christian.Reissmann at sun.com
                                   http://www.sun.com/gridengine
Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering
