[GE users] sge_qmaster memory spike

Kirk Patton kpatton at montalvosystems.com
Thu May 17 17:05:35 BST 2007



Regarding bug ID 2062: as I noted in the original post, we are running 6.0u10.  The bug references a fix in 6.0u8.  Does anyone know whether this fix was included in 6.0u10?

I will try adjusting our log rotation to start a new file more frequently and see if that makes any difference to the memory consumption of sge_qmaster.
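
For reference, a minimal sketch of the size-threshold rotation suggested in the quoted reply below. The function name rotate_if_large and its parameters are hypothetical, not part of SGE, and the sketch assumes that renaming the file is acceptable to whatever process writes it, which I have not verified for sge_qmaster.

```python
import os
import time


def rotate_if_large(path, max_bytes):
    """Rename `path` to a timestamped name once it exceeds max_bytes.

    Hypothetical helper for illustration only. Returns the new file
    name, or None if nothing was rotated.
    """
    try:
        size = os.path.getsize(path)
    except OSError:
        return None  # file missing or unreadable: nothing to rotate
    if size <= max_bytes:
        return None
    # Append a timestamp so earlier rotations are not overwritten.
    rotated = "%s.%s" % (path, time.strftime("%Y%m%d-%H%M%S"))
    os.rename(path, rotated)
    return rotated
```

Something like this could be run from cron against the accounting file, with the threshold chosen by experiment.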

Thanks
Kirk

----- "Ravi Chandra Nallan" <Ravichandra.Nallan at Sun.COM> wrote:
> The data that is seen can be interpreted as follows.
> Per module:  wc - wall clock time spent while running in this module
>              utime, stime - the user and system time the CPU spent
>              utilization - the total utilization, i.e. (utime+stime)/wc
> The data shows how the CPU time is spent over the different modules in
> GE.
> From the data, it seems the system is busy spooling.
> 
> It seems the problem has been reported before
> (http://gridengine.sunsource.net/issues/show_bug.cgi?id=2062), and a
> possible workaround is rotating the accounting file once it exceeds
> some threshold size.  Refer to:
> http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=17465
> http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=18923
> Hope that helps,
> -Ravi
> 
> Kirk Patton wrote:
> > Can anyone point me to any reference on what the values reported
> > mean when profiling is turned on?
> >
> > other          : wc =  21219.550s, utime =   3960.600s, stime =    776.310s, utilization =  22%
> > communication  : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > packing        : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > eventclient    : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > eventmaster    : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > mirror         : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > spooling       : wc =      0.350s, utime =      0.020s, stime =      0.340s, utilization = 103%
> > spooling-io    : wc =    219.240s, utime =     43.740s, stime =      8.320s, utilization =  24%
> > spooling-script: wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > gdi            : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > gdi_request    : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > ht-resize      : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > total          : wc =  21439.140s, utime =   4004.360s, stime =    784.970s, utilization =  22%
> >
> > My sge_qmaster stopped scheduling once again and had to be restarted.
> > I am trying to get some idea of where to look for the cause.  I changed
> > my execd_spool_dir to use local disk rather than NFS, but that did
> > not fix the problem.  Sge_qmaster and sge_execd on the master both
> > continue to grow in memory use.
> >
> > 8275 sgeadmin 20 0 5889m 4.1g 1748 R 98 52.4 3291:30 sge_schedd
> >                          ^^^^
> > 8259 sgeadmin 16 0 4893m 3.2g 7372 S 5 40.6 1542:00 sge_qmaster
> >                          ^^^^
> >
> > Thanks
> > Kirk
> >
> > ----- "Kirk Patton" <kpatton at montalvosystems.com> wrote:
> >   
> >> Hello,
> >>
> >> We are running SGE 6.0u10.  We have been noticing that sge_qmaster's
> >> memory consumption steadily grows for about two days and then spikes
> >> up quickly.  Then, after about 45 minutes, the memory gets released
> >> and the cycle starts over again.
> >>
> >> During the peaks, the system becomes sluggish and unresponsive to user
> >> queries.  Our execd_spool_dir has been on NFS and I have been moving
> >> it to local disk on each exec host in the hopes of alleviating the
> >> problem.  Looking at the utilization graphs we keep to track host
> >> performance, the issue still seems to be present.
> >>
> >> I am wondering what steps I can take to track down what is causing the
> >> high memory utilization.  The SGE master has 8 GB of system RAM and
> >> during the peak of the cycle, memory is maxed out and the system
> >> begins swapping.
> >>
> >> Profiling is enabled for the scheduler.  I am wondering if there is a
> >> how-to or primer for interpreting the profiler metrics.
> >>
> >> I have attached a graph illustrating what I am seeing.
> >>
> >> Thanks for any suggestions.
> >> Kirk
> >>
> >> -- 
> >> Kirk Patton x5585
> >> Sr. systems Administrator
> >> Montalvo Systems
> >>     
> >
> >
> >   
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
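
As a check on the utilization column in the profiling output quoted above: the figures are consistent with (utime + stime) / wc, for example (3960.600 + 776.310) / 21219.550 is about 22% for "other". A minimal sketch that recomputes it from one profiling line; the regular expression and the function name utilization are my own assumptions, written against the lines shown in this thread rather than any documented output format.

```python
import re

# Matches one profiling line as it appears in this thread, e.g.
# "spooling-io    : wc =    219.240s, utime =     43.740s, stime =      8.320s, ..."
LINE_RE = re.compile(
    r"^\s*(?P<module>[\w-]+)\s*:\s*"
    r"wc\s*=\s*(?P<wc>[\d.]+)s,\s*"
    r"utime\s*=\s*(?P<utime>[\d.]+)s,\s*"
    r"stime\s*=\s*(?P<stime>[\d.]+)s"
)


def utilization(line):
    """Return (module, percent) for one profiling line, or None if it does not parse."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    wc = float(m.group("wc"))
    cpu = float(m.group("utime")) + float(m.group("stime"))
    # CPU time as a percentage of wall clock time; 0 when no time was recorded.
    percent = 100.0 * cpu / wc if wc > 0 else 0.0
    return m.group("module"), percent
```

A value slightly over 100%, as reported for "spooling", is possible with this formula when wc is very small and timer rounding dominates.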


-- 
Kirk Patton x5585
Sr. systems Administrator
Montalvo Systems
