[GE users] high CPU load for sge_qmaster

christian reissmann Christian.Reissmann at Sun.COM
Tue May 3 07:52:28 BST 2005


Sean,

the TAG_REPORT_REQUEST messages in the qping -dump output are load reports
from the execds. Load reports are also sent when a job finishes on an
execd.

The high CPU load of the qmaster may have one of two causes:

1) your load_report_time is set to a (too) short value
   (check it with qconf -sconf)

or

2) a qmaster thread does not wait when there is nothing to do (this was
already discussed earlier in this mail thread)
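
To get a feeling for point 1), here is a rough back-of-the-envelope
sketch. It assumes each execd sends exactly one load report per
load_report_time interval and ignores the extra reports sent when jobs
finish. The function name and the host count of 400 are made up for
illustration and are not taken from your cluster.

```python
def load_reports_per_second(num_execds, load_report_time_s):
    """Rough steady-state rate of load reports arriving at the qmaster.

    Assumption: one report per execd per interval, nothing else.
    """
    if load_report_time_s <= 0:
        raise ValueError("load_report_time must be positive")
    return num_execds / load_report_time_s


# Hypothetical cluster of 400 execds at three different intervals (seconds).
for interval in (5, 40, 120):
    rate = load_reports_per_second(400, interval)
    print(f"load_report_time={interval:>3}s -> about {rate:.1f} reports/s")
```

If the rate you see in qping -dump is far above such an estimate, the
interval alone probably does not explain the traffic.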

To find out whether point 2) applies, wait for the qmaster to get
into this high CPU usage condition, then shut down all execds and
run qping -dump (as you already hinted).

Alternatively, disable all queues (qmod -d "*"), set the load_report_time
to a high value (qconf -mconf), wait until no jobs are running, and then
check with qping -dump what is going on in your cluster.
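
For a long dump it may help to summarize the traffic instead of reading
it line by line. The following is only a minimal sketch: it assumes the
pipe-separated column layout of the dump quoted below (time, local, d.,
remote, format, ack type, msg tag, ...), counts only messages received
by the qmaster ("<-"), and uses made-up function names. Other qping
versions may print a different layout.

```python
import re
from collections import Counter

# A data line starts with a timestamp such as 09:41:52.479253
TIME_RE = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d+$")


def to_seconds(stamp):
    """Convert HH:MM:SS.ffffff to seconds since midnight."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def summarize_dump(lines):
    """Return (count per msg tag, count per sender, inbound messages/s)."""
    tags = Counter()
    senders = Counter()
    times = []
    for line in lines:
        # Drop mail quoting ("> ") and split the columns.
        fields = [f.strip() for f in line.lstrip("> ").split("|")]
        if len(fields) < 11 or not TIME_RE.match(fields[0]):
            continue  # header, separator, or status line
        direction, remote, tag = fields[2], fields[3], fields[6]
        if direction != "<-":
            continue  # only messages received by the qmaster
        tags[tag] += 1
        senders[remote] += 1
        times.append(to_seconds(fields[0]))
    span = max(times) - min(times) if len(times) > 1 else 0.0
    rate = len(times) / span if span > 0 else 0.0
    return tags, senders, rate


# A few lines copied from the dump in this thread.
SAMPLE = [
    "           time|local          |d.|remote                   |format|ack type|               msg tag|msg id|msg rid|msg len|       msg time|",
    "09:41:52.479253|head4/qmaster/1|->|head4/debug_client/23927 |   crm|     nak|                     0|     0|      0|    218|09:41:52.479253|",
    "09:41:52.479999|head4/qmaster/1|<-|bio-n094/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5027|      0|   2146|09:41:52.479999|",
    "09:41:53.269252|head4/qmaster/1|<-|cod-n011/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5026|      0|   1993|09:41:53.269252|",
    "09:41:54.632977|head4/qmaster/1|<-|stat-n26/execd/1         |   sim|     nak|                     0|  2595|      0|     25|09:41:54.632977|",
]

tags, senders, rate = summarize_dump(SAMPLE)
print("messages per tag:", dict(tags))
print("distinct senders:", len(senders))
print(f"inbound rate: {rate:.2f} msg/s")
```

To use it on a saved dump, pass the open file to summarize_dump(). If
TAG_REPORT_REQUEST still dominates after the execds are shut down or
the load_report_time is raised, that would be worth reporting here.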


Best regards,

Christian


Sean Dilda wrote:
> Using Christian's suggestions for how to use qping -dump.  There are 
> definitely messages going through.  I've attached two dumps.  The first 
> one is 'normal', the other one is while shutting down the execd for 
> core-n60.  My cluster is heavily loaded right now, so I only have a few 
> execds that I can take down right now.
> 
> I also have another dump I can send you.  I ran it for 5 minutes and it 
> contains just over 3,000 lines.
> 
> 
> h:13
> h:12
> open connection to "head4/qmaster/1" ... no error happened
>            time|local          |d.|remote                   |format|ack type|               msg tag|msg id|msg rid|msg len|       msg time|
> ---------------|---------------|--|-------------------------|------|--------|----------------------|------|-------|-------|---------------|
> 09:41:52.479253|head4/qmaster/1|->|head4/debug_client/23927 |   crm|     nak|                     0|     0|      0|    218|09:41:52.479253|
> 09:41:52.479999|head4/qmaster/1|<-|bio-n094/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5027|      0|   2146|09:41:52.479999|
> 09:41:52.502021|head4/qmaster/1|<-|cod-n035/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5026|      0|   1993|09:41:52.502021|
> 09:41:52.523750|head4/qmaster/1|<-|bio-n078/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5023|      0|   2412|09:41:52.523750|
> 09:41:52.561002|head4/qmaster/1|<-|stat-n06/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5058|      0|   2146|09:41:52.561002|
> 09:41:52.606856|head4/qmaster/1|<-|bio-n025/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5026|      0|   2260|09:41:52.606856|
> 09:41:52.640461|head4/qmaster/1|<-|core-n35/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5035|      0|   1892|09:41:52.640461|
> 09:41:53.269252|head4/qmaster/1|<-|cod-n011/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5026|      0|   1993|09:41:53.269252|
> 09:41:53.284838|head4/qmaster/1|<-|bio-n082/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5027|      0|   2145|09:41:53.284838|
> 09:41:53.338851|head4/qmaster/1|<-|bio-n066/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5032|      0|   2251|09:41:53.338851|
> 09:41:53.355783|head4/qmaster/1|<-|cbcb-n30/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5030|      0|   2146|09:41:53.355783|
> 09:41:53.458714|head4/qmaster/1|<-|bio-n098/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5023|      0|   2514|09:41:53.458714|
> 09:41:53.586266|head4/qmaster/1|<-|bio-n003/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5069|      0|   2145|09:41:53.586266|
> 09:41:54.068528|head4/qmaster/1|<-|cbcb-n09/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5032|      0|   2251|09:41:54.068528|
> 09:41:54.074935|head4/qmaster/1|<-|stat-n32/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5022|      0|   2251|09:41:54.074935|
> 09:41:54.098854|head4/qmaster/1|<-|bio-n064/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5022|      0|   2412|09:41:54.098854|
> 09:41:54.289125|head4/qmaster/1|<-|nsoe-n07/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5032|      0|   2146|09:41:54.289125|
> 09:41:54.373374|head4/qmaster/1|<-|bio-n033/execd/1         |   bin|     nak|    TAG_REPORT_REQUEST|  5023|      0|   2513|09:41:54.373374|
> 09:41:54.632977|head4/qmaster/1|<-|stat-n26/execd/1         |   sim|     nak|                     0|  2595|      0|     25|09:41:54.632977|
> 09:41:54.636304|head4/qmaster/1|->|stat-n26/execd/1         |  sirm|     nak|                     0|   775|      0|    317|09:41:54.636304|

-- 
Christian Reissmann    Tel: +49 (0)941 3075 112  mailto:crei at sun.com
Software Engineer      Fax: +49 (0)941 3075 222
http://www.sun.com/gridengine
Sun Microsystems GmbH, Dr.-Leo-Ritter-Str. 7,
D-93049 Regensburg,    Tel: +49 (0)941 3075 0

