[GE users] sge_qmaster use 99.9% CPU

Simon Gao gao at schrodinger.com
Tue May 27 20:14:11 BST 2008



I did see the head node crash once because of a qstat segfault. I am 
not sure how that is related to the high qmaster CPU usage. Even after I 
rebooted the head node, the qmaster CPU usage quickly climbed back to 
99.9% and stayed there.

Would multiple submission nodes cause such a problem?

Next time it happens, I will try restarting qmaster. For now it's back 
to normal.

Simon


Christian Reissmann wrote:
> Hi Simon,
>
> the problem seems to be related to the read error of the previous qstat
> command. Did you kill it with SIGKILL?
>
> Did you shut down and restart the qmaster? If yes, was qmaster still
> using 100% CPU?
>
> Regards,
>
> Christian
>
>
> Simon Gao wrote:
>> Ravi Chandra Nallan wrote:
>>> Hi,
>>> What version of SGE are you using? Any hints in the qmaster messages 
>>> file?
>>>
>>> regards,
>>> ~Ravi
>>> Simon Gao wrote:
>>>> Hi,
>>>>
>>>> I just noticed that sge_qmaster has been constantly using 99.9% of 
>>>> CPU time. What factors may contribute to such high CPU usage by 
>>>> sge_qmaster? Where should I look to find out what's going on?
>>>>
>> SGE 6.0u6.
>>
>> I am not sure if the following messages are related:
>>
>> 05/20/2008 15:17:01|qmaster|cluster|I|qmaster hard descriptor limit 
>> is set to 1024
>> 05/20/2008 15:17:01|qmaster|cluster|I|qmaster soft descriptor limit 
>> is set to 1024
>> 05/20/2008 15:17:01|qmaster|cluster|I|qmaster will use max. 1004 file 
>> descriptors for communication
>> 05/20/2008 15:17:01|qmaster|cluster|I|qmaster will accept max. 99 
>> dynamic event clients
>> 05/20/2008 15:17:01|qmaster|cluster|I|starting up 6.0u6
>> 05/20/2008 15:17:01|qmaster|cluster|W|FD_SETSIZE is limited to 1024 
>> file descriptors on this system.
>> 05/20/2008 15:17:01|qmaster|cluster|W|If you want to support more 
>> than 1004 qmaster clients you have to
>> 05/20/2008 15:17:01|qmaster|cluster|W|recompile the source code with 
>> a higher FD_SETSIZE setting.
>> 05/20/2008 15:17:01|qmaster|cluster|W|Bug Link: 
>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1502
>>
>> 05/20/2008 16:20:01|qmaster|cluster|E|commlib error: got send timeout 
>> (closing "clustersub.company.com/qstat/2065")
>> 05/20/2008 16:20:01|qmaster|cluster|E|commlib error: got send timeout 
>> (closing "clustersub.company.com/qstat/2064")
>>
>> 05/20/2008 19:14:49|qmaster|cluster|E|can't send asynchronous message 
>> to commproc (qstat:2398) on host "clustersub.company.com": can't send 
>> response for this message id - protocol error
>> 05/20/2008 19:14:49|qmaster|cluster|E|can't send asynchronous message 
>> to commproc (qstat:2397) on host "clustersub.company.com": can't send 
>> response for this message id - protocol error
>>
>>
>> Besides a main head node, cluster, we also have a submission node, 
>> clustersub, from which users can submit jobs.
>>
>> Simon
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>
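
For anyone triaging similar output: the qmaster messages quoted above look 
pipe-delimited (timestamp, component, host, severity, text). The sketch 
below is one possible way to tally such lines by severity and to see which 
client host the error lines mention. The field names, the helper names 
(parse_line, summarize) and the host-matching regex are assumptions made 
for illustration, not part of SGE; the sample lines are copied from this 
thread with the mail wrapping undone.

```python
from collections import Counter
from datetime import datetime
import re

# Sample lines copied from the thread, with the mail wrapping undone.
SAMPLE = """\
05/20/2008 15:17:01|qmaster|cluster|I|starting up 6.0u6
05/20/2008 15:17:01|qmaster|cluster|W|FD_SETSIZE is limited to 1024 file descriptors on this system.
05/20/2008 16:20:01|qmaster|cluster|E|commlib error: got send timeout (closing "clustersub.company.com/qstat/2065")
05/20/2008 16:20:01|qmaster|cluster|E|commlib error: got send timeout (closing "clustersub.company.com/qstat/2064")
05/20/2008 19:14:49|qmaster|cluster|E|can't send asynchronous message to commproc (qstat:2398) on host "clustersub.company.com": can't send response for this message id - protocol error
"""

# Assumption: the client host is the first double-quoted token in the
# message text, optionally followed by "/component/id".
HOST_RE = re.compile(r'"([^"/]+)')


def parse_line(line):
    """Split one log line into (when, component, host, level, message).

    Returns None for lines that do not match the assumed five-field layout.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, component, host, level, message = parts
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return when, component, host, level, message


def summarize(lines):
    """Count lines per severity and error lines per quoted client host."""
    levels = Counter()
    error_hosts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip blank or wrapped continuation lines
        _, _, _, level, message = record
        levels[level] += 1
        if level == "E":
            match = HOST_RE.search(message)
            if match:
                error_hosts[match.group(1)] += 1
    return levels, error_hosts


if __name__ == "__main__":
    levels, error_hosts = summarize(SAMPLE.splitlines())
    print("lines per severity:", dict(levels))
    print("error lines per client host:", dict(error_hosts))
```

To run it against a real installation, pass the lines of the qmaster 
messages file to summarize() instead of SAMPLE; where that file lives 
depends on how the cluster was set up. If most error lines point at a 
single submit host, that host's qstat clients are a reasonable place to 
start looking.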

