[GE users] sge_qmaster memory spike

Kirk Patton kpatton at montalvosystems.com
Fri May 25 15:55:07 BST 2007



I tried restarting the daemons on the master host at fixed intervals to see whether the outages were related to the memory growth.  We had another outage while the memory usage was well below where it has been in the past.

How can I force the SGE daemons to dump core?  I tried sending SIGQUIT and SIGSEGV, but the sge_qmaster daemon seems to ignore both signals.
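
One approach that does not depend on how the daemon handles signals is gdb's gcore utility, which attaches to a running process and writes a core image of it. The following is only a sketch: it assumes gcore is installed on the master host and that the caller is allowed to attach to the process. The helper names and the output prefix are hypothetical.

```python
import shutil
import subprocess


def gcore_command(pid, prefix="/tmp/sge_qmaster.core"):
    """Build the gcore command line; gcore writes its output to <prefix>.<pid>."""
    return ["gcore", "-o", prefix, str(pid)]


def dump_core(pid, prefix="/tmp/sge_qmaster.core"):
    """Write a core image of a running process without sending it a signal."""
    if shutil.which("gcore") is None:
        raise RuntimeError("gcore (shipped with gdb) was not found in PATH")
    subprocess.run(gcore_command(pid, prefix), check=True)
    return f"{prefix}.{pid}"


# Example (hypothetical pid): dump_core(8259)
```

The process is paused while the image is written, so with a multi-gigabyte sge_qmaster this may take a while.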

Kirk

----- "Kirk Patton" <kpatton at montalvosystems.com> wrote:
> Regarding bug id 2062.  As I noted in the original post, we are
> running 6.0u10.  This bug references a fix in 6.0u8.  Might anyone
> know if this fix was included in 6.0u10?
> 
> I will try adjusting our log rotation to start a new file more
> frequently and see if that makes any difference with the memory
> consumption of sge_qmaster.
> 
> Thanks
> Kirk
> 
> ----- "Ravi Chandra Nallan" <Ravichandra.Nallan at Sun.COM> wrote:
> > The data can be interpreted as follows, per module:
> >   wc           - wall clock time spent while running in this module
> >   utime, stime - the user and system CPU time spent
> >   utilization  - the total utilization, i.e. (utime+stime)/wallclock
> > The data shows how the CPU time is spent over the different modules in
> > GE.
> > From the data, it seems the system is busy spooling.
> > 
> > It seems the problem has been reported before
> > (http://gridengine.sunsource.net/issues/show_bug.cgi?id=2062), and a
> > possible workaround is to rotate the accounting file once it exceeds
> > some threshold size (refer to
> > http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=17465
> > and
> > http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=18923).
> > Hope that helps,
> > -Ravi
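
The size-based rotation suggested above could be sketched roughly as follows. This is an illustration under assumptions, not a tested procedure for SGE: the function name, the threshold, and the timestamp suffix are hypothetical, and the path of the accounting file is left as a parameter.

```python
import os
import time


def rotate_if_large(path, threshold_bytes):
    """Rename path to path.<timestamp> once it reaches threshold_bytes.

    Returns the new name, or None if nothing was rotated.
    """
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        return None
    if size < threshold_bytes:
        return None
    rotated = f"{path}.{time.strftime('%Y%m%d-%H%M%S')}"
    os.rename(path, rotated)
    return rotated
```

A cron job could call this with the accounting file path, for example `rotate_if_large(accounting_path, 500 * 1024 * 1024)`. How sge_qmaster behaves when the file is renamed underneath it is not covered in this thread, so this should be tried on a test cell first.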
> > 
> > Kirk Patton wrote:
> > > Can anyone point me to any reference on what the values reported mean when profiling is turned on?
> > >
> > > other          : wc =  21219.550s, utime =   3960.600s, stime =    776.310s, utilization =  22%
> > > communication  : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > > packing        : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > > eventclient    : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > > eventmaster    : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > > mirror         : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > > spooling       : wc =      0.350s, utime =      0.020s, stime =      0.340s, utilization = 103%
> > > spooling-io    : wc =    219.240s, utime =     43.740s, stime =      8.320s, utilization =  24%
> > > spooling-script: wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > > gdi            : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > > gdi_request    : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > > ht-resize      : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
> > > total          : wc =  21439.140s, utime =   4004.360s, stime =    784.970s, utilization =  22%
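
The percentages in this output are consistent with utilization = (utime + stime) / wc; for spooling-io, (43.740 + 8.320) / 219.240 is about 24%. A small sketch that recomputes the figure from a profile line is given here. The regular expression is an assumption based only on the lines shown in this post, and the function name is hypothetical.

```python
import re

# Pattern inferred from the profiling lines quoted in this thread.
PROFILE_LINE = re.compile(
    r"^\s*(?P<module>[\w-]+)\s*:\s*wc\s*=\s*(?P<wc>[\d.]+)s,"
    r"\s*utime\s*=\s*(?P<utime>[\d.]+)s,"
    r"\s*stime\s*=\s*(?P<stime>[\d.]+)s"
)


def utilization(line):
    """Return (module, percent), with percent = 100 * (utime + stime) / wc."""
    m = PROFILE_LINE.match(line)
    if m is None:
        return None
    wc = float(m["wc"])
    cpu = float(m["utime"]) + float(m["stime"])
    percent = 100.0 * cpu / wc if wc > 0 else 0.0
    return m["module"], percent


sample = "spooling-io    : wc =    219.240s, utime =     43.740s, stime =      8.320s, utilization =  24%"
module, percent = utilization(sample)
print(f"{module}: {percent:.1f}%")
```

A value above 100%, as on the spooling line, can occur when the measured interval is very short or when CPU time from more than one thread is counted.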
> > >
> > > My sge_qmaster stopped scheduling once again and had to be restarted.
> > > I am trying to get some idea of where to look for the cause.  I changed
> > > my execd_spool_dir to use local disk rather than NFS, but that did not
> > > fix the problem.  sge_qmaster and sge_execd on the master both continue
> > > to grow in memory use.
> > >
> > > 8275 sgeadmin 20 0 5889m 4.1g 1748 R 98 52.4 3291:30 sge_schedd
> > >                          ^^^^
> > > 8259 sgeadmin 16 0 4893m 3.2g 7372 S 5 40.6 1542:00 sge_qmaster
> > >                          ^^^^
> > >
> > > Thanks
> > > Kirk
> > >
> > > ----- "Kirk Patton" <kpatton at montalvosystems.com> wrote:
> > >   
> > >> Hello,
> > >>
> > >> We are running SGE 6.0u10.  We have been noticing that sge_qmaster's
> > >> memory consumption steadily grows for about two days and then spikes
> > >> up quickly.  Then, after about 45 minutes, the memory gets released
> > >> and the cycle starts over again.
> > >>
> > >> During the peaks, the system becomes sluggish and unresponsive to user
> > >> queries.  Our execd_spool_dir has been on NFS and I have been moving
> > >> it to local disk on each exec host in the hopes of alleviating the
> > >> problem.  Looking at the utilization graphs we keep to track host
> > >> performance, the issue still seems to be present.
> > >>
> > >> I am wondering what steps I can take to track down what is causing the
> > >> high memory utilization.  The SGE master has 8 GB of system RAM, and
> > >> during the peak of the cycle, memory is maxed out and the system
> > >> begins swapping.
> > >>
> > >> Profiling is enabled for the scheduler.  I am wondering if there is a
> > >> how-to or primer for interpreting the profiler metrics.
> > >>
> > >> I have attached a graph illustrating what I am seeing.
> > >>
> > >> Thanks for any suggestions.
> > >> Kirk
> > >>
> > >> -- 
> > >> Kirk Patton x5585
> > >> Sr. systems Administrator
> > >> Montalvo Systems
> > >>     
> > >
> > >
> > >   
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 


-- 
Kirk Patton x5585
Sr. systems Administrator
Montalvo Systems
