[GE users] sge_qmaster use 99.9% CPU

Christian Reissmann Christian.Reissmann at Sun.COM
Wed May 28 08:08:09 BST 2008




Hi Simon,

A segfaulting qstat should not cause permanent 100% CPU usage by qmaster
as long as qmaster is not working on other requests. Perhaps the qmaster
was busy with requests from other nodes. Let's see if you get the error
again; if you do, you can check the qmaster traffic by using qping -dump
on the qmaster node.
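
In the meantime, it may help to tally what the qmaster messages file is
reporting. The following is only a minimal sketch: it assumes the
pipe-separated layout seen in the messages quoted below
(date time|daemon|host|level|text), and the names summarize, LINE_RE and
CLIENT_RE are made up for illustration and are not part of Grid Engine.
It counts entries per log level and, for error entries, counts which
client program (for example qstat) is named.

```python
import re
from collections import Counter

# Assumed layout, based only on the messages quoted in this thread:
#   MM/DD/YYYY HH:MM:SS|daemon|host|level|text
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|([^|]*)\|([^|]*)\|([A-Z])\|(.*)$"
)

# In the quoted errors, clients show up either as "host/name/id"
# or as (name:id). Both forms are assumptions drawn from those lines.
CLIENT_RE = re.compile(r'"[^"/]+/([A-Za-z_]+)/\d+"|\(([A-Za-z_]+):\d+\)')


def summarize(lines):
    """Return (entries per level, error entries per client name)."""
    levels = Counter()
    clients = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # skip anything that does not fit the assumed layout
        level, text = match.group(4), match.group(5)
        levels[level] += 1
        if level == "E":
            client = CLIENT_RE.search(text)
            if client:
                clients[client.group(1) or client.group(2)] += 1
    return levels, clients


# Sample input copied from the quoted messages (unwrapped to one line each).
sample = [
    "05/20/2008 15:17:01|qmaster|cluster|I|starting up 6.0u6",
    '05/20/2008 16:20:01|qmaster|cluster|E|commlib error: got send timeout '
    '(closing "clustersub.company.com/qstat/2065")',
    "05/20/2008 19:14:49|qmaster|cluster|E|can't send asynchronous message "
    'to commproc (qstat:2398) on host "clustersub.company.com": can\'t send '
    "response for this message id - protocol error",
]

levels, clients = summarize(sample)
print(dict(levels))
print(dict(clients))
```

To run it against a real file, pass an open file object to summarize();
the location of the messages file depends on your installation. If one
client accounts for most of the error entries, that is a hint about
where to look first.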

Regards,

Christian

Simon Gao wrote:
> I did see at one time the head node crashed due to qstat segfault. I am 
> not sure how it's related to qmaster high CPU usage. Even after I 
> rebooted the head node, the qmaster usage quickly got up to 99.9% and 
> stayed there.
> 
> Would multiple submission nodes cause such a problem?
> 
> Next time it happens, I will try restarting qmaster. Now it's back to 
> normal.
> 
> Simon
> 
> 
> Christian Reissmann wrote:
>> Hi Simon,
>>
>> the problem seems to be related to the read error of the previous qstat
>> command. Did you kill it with SIGKILL?
>>
>> Did you shut down/restart the qmaster? If yes, was qmaster still using
>> 100% CPU?
>>
>> Regards,
>>
>> Christian
>>
>>
>> Simon Gao wrote:
>>> Ravi Chandra Nallan wrote:
>>>> Hi,
>>>> What version of SGE are you using? Any hints in the qmaster messages 
>>>> file?
>>>>
>>>> regards,
>>>> ~Ravi
>>>> Simon Gao wrote:
>>>>> Hi,
>>>>>
>>>>> I just noticed that sge_qmaster has been constantly using 99.9% of 
>>>>> CPU time. What factors may contribute to such high CPU usage by 
>>>>> sge_qmaster? Where should I look to find out what's going on?
>>>>>
>>> SGE 6.0u6.
>>>
>>> I am not sure if the following messages are related:
>>>
>>> 05/20/2008 15:17:01|qmaster|cluster|I|qmaster hard descriptor limit is set to 1024
>>> 05/20/2008 15:17:01|qmaster|cluster|I|qmaster soft descriptor limit is set to 1024
>>> 05/20/2008 15:17:01|qmaster|cluster|I|qmaster will use max. 1004 file descriptors for communication
>>> 05/20/2008 15:17:01|qmaster|cluster|I|qmaster will accept max. 99 dynamic event clients
>>> 05/20/2008 15:17:01|qmaster|cluster|I|starting up 6.0u6
>>> 05/20/2008 15:17:01|qmaster|cluster|W|FD_SETSIZE is limited to 1024 file descriptors on this system.
>>> 05/20/2008 15:17:01|qmaster|cluster|W|If you want to support more than 1004 qmaster clients you have to
>>> 05/20/2008 15:17:01|qmaster|cluster|W|recompile the source code with a higher FD_SETSIZE setting.
>>> 05/20/2008 15:17:01|qmaster|cluster|W|Bug Link: http://gridengine.sunsource.net/issues/show_bug.cgi?id=1502
>>>
>>> 05/20/2008 16:20:01|qmaster|cluster|E|commlib error: got send timeout (closing "clustersub.company.com/qstat/2065")
>>> 05/20/2008 16:20:01|qmaster|cluster|E|commlib error: got send timeout (closing "clustersub.company.com/qstat/2064")
>>>
>>> 05/20/2008 19:14:49|qmaster|cluster|E|can't send asynchronous message to commproc (qstat:2398) on host "clustersub.company.com": can't send response for this message id - protocol error
>>> 05/20/2008 19:14:49|qmaster|cluster|E|can't send asynchronous message to commproc (qstat:2397) on host "clustersub.company.com": can't send response for this message id - protocol error
>>>
>>>
>>> Besides a main head node, cluster, we also have a submission node, 
>>> clustersub, from which users can submit jobs.
>>>
>>> Simon
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>
> 
> 
> 

-- 
Sun Microsystems GmbH             Christian Reissmann
Dr.-Leo-Ritter-Str. 7             Software Engineer
D-93049 Regensburg                Phone: +49 (0)941 3075 112
Germany                           Fax:   +49 (0)941 3075 222
http://www.sun.de                 mailto: Christian.Reissmann at sun.com
                                   http://www.sun.com/gridengine
Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering




