[GE users] high CPU load for sge_qmaster

christian reissmann Christian.Reissmann at Sun.COM
Mon May 2 10:37:14 BST 2005


Hi Sean, Stephan,

To check the current message traffic of a running qmaster,
qping -dump should be set up as follows:

qping -dump on 60u3 systems
===========================

- log in to your qmaster host as root
- source $SGE_ROOT/default/common/settings.csh
- setenv SGE_QPING_OUTPUT_FORMAT "h:13 h:12"
- qping -dump gridware $SGE_QMASTER_PORT qmaster 1

The output shows one message per line.
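
For Bourne-type shells (sh, bash) the same steps might look roughly like
the sketch below. It assumes the "default" cell and that settings.sh sits
next to settings.csh; QMASTER_HOST and build_qping_cmd are hypothetical
names used only for illustration, and "gridware" is the example host name
from the steps above.

```shell
#!/bin/sh
# Sketch only: Bourne-shell version of the csh steps above.
# QMASTER_HOST and build_qping_cmd are hypothetical names for illustration.

QMASTER_HOST="${QMASTER_HOST:-gridware}"   # "gridware" is the example host

# Load the cell settings if they are readable (assumes the "default" cell).
if [ -r "${SGE_ROOT:-}/default/common/settings.sh" ]; then
    . "$SGE_ROOT/default/common/settings.sh"
fi

# Same output format as in the setenv line above.
SGE_QPING_OUTPUT_FORMAT="h:13 h:12"
export SGE_QPING_OUTPUT_FORMAT

# Print the qping command line for a given host and port.
build_qping_cmd() {
    printf 'qping -dump %s %s qmaster 1\n' "$1" "$2"
}

# Run qping only if it is installed and the qmaster port is known;
# otherwise just show the command that would be used.
if command -v qping >/dev/null 2>&1 && [ -n "${SGE_QMASTER_PORT:-}" ]; then
    qping -dump "$QMASTER_HOST" "$SGE_QMASTER_PORT" qmaster 1
else
    echo "not running qping; the command would be:"
    build_qping_cmd "$QMASTER_HOST" "${SGE_QMASTER_PORT:-PORT}"
fi
```

As with the csh steps, this should be run as root on the qmaster host.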

Best regards,

Christian


Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
> Hi,
> this could mean two things:
> 
> - somehow the wait for a message to process does not work and the MT
> thread goes wild
> 
> - you have incoming messages.
> 
> You can check for incoming messages via qping -dump
> 
> Could you please post that output? The jumping EDT time suggests that
> there are at least some messages going back and forth.
> 
> You can also test whether the behavior changes when you shut down
> the execds.
> 
> Stephan
> 
> Sean Dilda wrote:
> 
> 
>>Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>> 
>>
>>
>>>You are right, there are no jobs in the system. Could you monitor the 
>>>qping output? Is the MT: always that low?
>>>If there is nothing to do, I would expect higher times than 0.4.
>>>When the system is idle, as yours is, the numbers should be similar to:
>>>
>>>EDT:R(x) ~0.9
>>>TET:R(x) > 1
>>>MT:R(x) > 1
>>>
>>>Do you know what triggers this behavior?
>>>What operating system are you using?
>>>   
>>>
>>
>>I ran qping with '-i 10 -f' for a while.  EDT seemed to bounce around, 
>>always > 0.00 and < 1.00.   TET bounced around, just as likely to be 
>>above 1 as below it.  And MT stayed at 0.04 the whole time.  This system 
>>is running CentOS 3, which is essentially RHEL3.
>>
>>I have a much smaller test cluster running the same OS and the same SGE 
>>binaries.  Although at one point I spent a good amount of time trying to 
>>reproduce this there, I've been unable to reproduce the problem on the 
>>test cluster.  I've tried all the configuration options I could think 
>>of.  The same qping command on that box tended to have a similar EDT to 
>>my big cluster.  The TET bounced around a bit, but was almost always 
>>above 1.  It had an MT that bounced around as well, but tended to stay 
>>under 1 the whole time.
>>
>>It looks like some jobs did exit on my big cluster while I was doing 
>>this.   I know for certain that no jobs were submitted or even running 
>>on my test cluster during this.
>>
>>I really have no idea what triggers this.  My big cluster has been in 
>>this state for most of a month.  I tried to restart sge_qmaster a couple 
>>of times to see if it would go away, but that never worked.
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>> 
>>
> 
> 
> 
> 

-- 
Christian Reissmann    Tel: +49 (0)941 3075 112  mailto:crei at sun.com
Software Engineer      Fax: +49 (0)941 3075 222
http://www.sun.com/gridengine
Sun Microsystems GmbH, Dr.-Leo-Ritter-Str. 7,
D-93049 Regensburg,    Tel: +49 (0)941 3075 0





