Fwd: [GE users] sge_qmaster memory spike

Ravi Chandra Nallan Ravichandra.Nallan at Sun.COM
Thu May 17 09:44:01 BST 2007



The data can be interpreted as follows. For each module:
              wc - the wall clock time spent while running in this module
              utime, stime - the user and system CPU time spent
              utilization - the CPU utilization, i.e. (utime+stime)/wallclock
The data shows how the CPU time is spent over the different modules in GE.
From the data, it seems the system is busy spooling.
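
As a rough check of the utilization figure, the sketch below recomputes it as (utime+stime)/wc from one line of the profiling output quoted further down. The regular expression and the function name `utilization` are my own illustration, not part of GE, and the line format is assumed to match the output in this thread.

```python
import re

# Pattern for one profiling line as quoted in this thread (assumed format).
# An optional leading "> " quote marker is tolerated.
LINE_RE = re.compile(
    r"^\s*(?:>\s*)?(?P<module>[\w-]+)\s*:\s*"
    r"wc\s*=\s*(?P<wc>[\d.]+)s,\s*"
    r"utime\s*=\s*(?P<utime>[\d.]+)s,\s*"
    r"stime\s*=\s*(?P<stime>[\d.]+)s"
)


def utilization(line):
    """Return (module, percent) for a profiling line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    wc = float(m["wc"])
    cpu = float(m["utime"]) + float(m["stime"])
    # Utilization is CPU time divided by wall clock time; avoid dividing by zero.
    pct = 100.0 * cpu / wc if wc > 0 else 0.0
    return m["module"], pct


if __name__ == "__main__":
    sample = ("> spooling-io    : wc =    219.240s, utime =     43.740s, "
              "stime =      8.320s, utilization =  24%")
    print(utilization(sample))
```

Rounded to whole percent, this reproduces the utilization column of the quoted output, including the 103% for the spooling module, where the CPU time slightly exceeds the measured wall clock time.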

It seems the problem has been reported before
(http://gridengine.sunsource.net/issues/show_bug.cgi?id=2062), and a
possible workaround is to rotate the accounting file once it exceeds
some threshold size.
(See
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=17465
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=18923 )
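
A minimal sketch of such a size-based rotation, assuming the accounting file is an ordinary file that can simply be renamed. The function name `rotate_if_large`, the example path, and the threshold are hypothetical; whether qmaster needs any further action after the rename is not covered here, so please check the threads above before using something like this.

```python
import os
import time


def rotate_if_large(path, max_bytes):
    """Rename path to path.<timestamp> if it is larger than max_bytes.

    Returns the new name, or None if nothing was rotated.
    """
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        return None
    if size <= max_bytes:
        return None
    rotated = "{}.{}".format(path, time.strftime("%Y%m%d-%H%M%S"))
    os.rename(path, rotated)
    return rotated


if __name__ == "__main__":
    # Hypothetical location and threshold; adjust both to the local installation.
    accounting = os.path.join(
        os.environ.get("SGE_ROOT", "/opt/sge"), "default", "common", "accounting"
    )
    print(rotate_if_large(accounting, 500 * 1024 * 1024))
```

Run from cron, this keeps the live accounting file below the chosen size while preserving the older records under timestamped names.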
Hope that helps,
-Ravi

Kirk Patton wrote:
> Can anyone point me to any reference on what the values reported mean when profiling is turned on?
>
> other          : wc =  21219.550s, utime =   3960.600s, stime =    776.310s, utilization =  22%
> communication  : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> packing        : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> eventclient    : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> eventmaster    : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> mirror         : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> spooling       : wc =      0.350s, utime =      0.020s, stime =      0.340s, utilization = 103%
> spooling-io    : wc =    219.240s, utime =     43.740s, stime =      8.320s, utilization =  24%
> spooling-script: wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> gdi            : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> gdi_request    : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> ht-resize      : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> total          : wc =  21439.140s, utime =   4004.360s, stime =    784.970s, utilization =  22%
>
> My sge_qmaster stopped scheduling once again and had to be restarted.  I am trying to get some idea of where
> to look for the cause.  I changed my execd_spool_dir to use local disk rather than NFS, but that did
> not fix the problem.  sge_qmaster and sge_execd on the master both continue to grow in memory use.
>
> 8275 sgeadmin 20 0 5889m 4.1g 1748 R 98 52.4 3291:30 sge_schedd
>                          ^^^^
> 8259 sgeadmin 16 0 4893m 3.2g 7372 S 5 40.6 1542:00 sge_qmaster
>                          ^^^^
>
> Thanks
> Kirk
>
> ----- "Kirk Patton" <kpatton at montalvosystems.com> wrote:
>   
>> Hello,
>>
>> We are running SGE 6.0u10.  We have been noticing that sge_qmaster's
>> memory consumption steadily grows for about two days and then spikes
>> up quickly.  Then, after about 45 minutes, the memory gets released
>> and the cycle starts over again.  
>>
>> During the peaks, the system becomes sluggish and unresponsive to user
>> queries.  Our execd_spool_dir has been on NFS and I have been moving
>> it to local disk on each exec host in the hopes of alleviating the
>> problem.  Looking at the utilization graphs we keep to track host
>> performance, the issue still seems to be present.
>>
>> I am wondering what steps I can take to track down what is causing the
>> high memory utilization.  The SGE master has 8 GB of system RAM and
>> during the peak of the cycle, memory is maxed out and the system
>> begins swapping.  
>>
>> Profiling is enabled for the scheduler.  I am wondering if there is a
>> how-to or primer for interpreting the profiler metrics.  
>>
>> I have attached a graph illustrating what I am seeing.
>>
>> Thanks for any suggestions.
>> Kirk
>>
>> -- 
>> Kirk Patton x5585
>> Sr. systems Administrator
>> Montalvo Systems
>>     
>
>
>   
