[GE users] qmaster using HUGE memory

reuti reuti at staff.uni-marburg.de
Fri Aug 27 16:07:06 BST 2010



Hi,

On 27.08.2010, at 16:50, seandavi wrote:

> On Mon, Jun 14, 2010 at 7:39 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> 
> 
> On Wed, Jun 9, 2010 at 1:08 PM, kjpursley <kevin.pursley at bp.com> wrote:
> As a first pass, turn off schedd_job_info. We have seen this use a bunch
> of memory.
> 
> 
> Thanks, Kevin.
> 
> I THINK this fixed things for us.  In any case, we have not had any problems recently.  Thanks for the help.
> 
> 
> This post is a bit dated, but I wanted to follow up.  Turning off schedd_job_info did fix this issue (6.2u5).  However, users miss the job scheduling information.  Any other hints that we could try that would get us the scheduling info back?  This is not a big system and I wouldn't expect scheduling for a few dozen jobs to require more than 32GB of RAM (what our qmaster has).
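
Instead of switching schedd_job_info completely on or off, it can (if I
recall the sge_sched_conf(5) format correctly) also be limited to selected
jobs, which keeps the memory usage low while still giving scheduling info
for the jobs of interest, e.g. (the job ids below are just examples):

schedd_job_info                   job_list 1001,1002

changed with qconf -msconf, or non-interactively with qconf -Msconf <file>.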

The users can try:

$ qalter -w p <jobid>

This may give hints as to why a job is waiting; it is most meaningful for the topmost jobs in the pending list.
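
For example, with a purely hypothetical job id:

$ qalter -w p 4711

In contrast to -w v (which verifies against an otherwise empty cluster),
-w p ("poke") checks the job against the current load and utilization, so
it should report the reasons the scheduler sees right now.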

-- Reuti


> Thanks,
> Sean
> 
>  
> From: seandavi [mailto:seandavi at gmail.com] 
> Sent: Wednesday, June 09, 2010 12:02 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] qmaster using HUGE memory
> 
> 
> 
> On Wed, Jun 9, 2010 at 11:19 AM, rems0 <Richard.Ems at cape-horn-eng.com> wrote:
> Hi,
> 
> Is your schedd_job_info set to false (qconf -ssconf) ?
> 
> Just for fun, the whole output.  To answer directly, schedd_job_info=true in our setup. 
> 
> algorithm                         default
> schedule_interval                 0:0:10
> maxujobs                          0
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50
> load_adjustment_decay_time        0:7:30
> load_formula                      np_load_avg
> schedd_job_info                   true
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> 
> halftime                          168
> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> compensation_factor               5.000000
> weight_user                       0.250000
> weight_project                    0.250000
> weight_department                 0.250000
> weight_job                        0.250000
> weight_tickets_functional         10000
> weight_tickets_share              0
> share_override_tickets            TRUE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   200
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         50
> halflife_decay_list               none
> policy_hierarchy                  OFS
> weight_ticket                     0.500000
> weight_waiting_time               0.278000
> weight_deadline                   3600000.000000
> weight_urgency                    0.050000
> weight_priority                   0.050000
> max_reservation                   10
> default_duration                  INFINITY
>  
> Are you using parallel environments ?
> 
> Yes.  All jobs are simply SMP jobs, though, so no MPI integration.
> 
> Thanks,
> Sean
>  
> 
> Richard
> 
> On 06/09/2010 12:35 PM, seandavi wrote:
> > Just a followup....
> >
> > The qmaster did finally come back down (after about 20 minutes) to a
> > "normal" size of 40m or so.  So, I suppose this is expected behavior
> > that I just happened to be around to observe, and it may have happened
> > before.  I'm still curious as to why it might happen.  I had just
> > submitted 5 new jobs, but they were not array jobs or anything else
> > complicated.
> >
> > Thanks again,
> > Sean
> >
> >
> > On Wed, Jun 9, 2010 at 6:24 AM, Sean Davis <seandavi at gmail.com
> > <mailto:seandavi at gmail.com>> wrote:
> >
> >     Using 6.2u5, I found this AM that jobs were not being scheduled.  I
> >     checked around a bit and it turns out that the qmaster was using
> >     30GB of RAM and the machine was thrashing.  This is with no array
> >     jobs scheduled or running, 10 jobs in the queue, and a very small
> >     cluster with only about 10 nodes.  The messages file is bland, I
> >     think, but I can post an excerpt since last restarting the qmaster
> >     (I have done that a couple of times).  Any suggestions?
> >
> >     Thanks,
> >     Sean
> >
> >
> >     The config looks like:
> >     execd_spool_dir              /import/cluster/sge6_2u5/default/spool
> >     mailer                       /bin/mail
> >     xterm                        /usr/bin/X11/xterm
> >     load_sensor                  none
> >     prolog                       none
> >     epilog                       none
> >     shell_start_mode             posix_compliant
> >     login_shells                 sh,ksh,csh,tcsh
> >     min_uid                      0
> >     min_gid                      0
> >     user_lists                   none
> >     xuser_lists                  none
> >     projects                     none
> >     xprojects                    none
> >     enforce_project              false
> >     enforce_user                 auto
> >     load_report_time             00:00:40
> >     max_unheard                  00:05:00
> >     reschedule_unknown           00:00:00
> >     loglevel                     log_warning
> >     administrator_mail           sdavis2 at mail.nih.gov
> >     <mailto:sdavis2 at mail.nih.gov>
> >     set_token_cmd                none
> >     pag_cmd                      none
> >     token_extend_time            none
> >     shepherd_cmd                 none
> >     qmaster_params               none
> >     execd_params                 none
> >     reporting_params             accounting=true reporting=true \
> >                                  flush_time=00:00:10 joblog=true \
> >                                  sharelog=00:00:00
> >     finished_jobs                100
> >     gid_range                    20200-20300
> >     qlogin_command               builtin
> >     qlogin_daemon                builtin
> >     rlogin_command               builtin
> >     rlogin_daemon                builtin
> >     rsh_command                  builtin
> >     rsh_daemon                   builtin
> >     max_aj_instances             2000
> >     max_aj_tasks                 5000
> >     max_u_jobs                   0
> >     max_jobs                     0
> >     max_advance_reservations     0
> >     auto_user_oticket            0
> >     auto_user_fshare             100
> >     auto_user_default_project    none
> >     auto_user_delete_time        86400
> >     delegated_file_staging       false
> >     reprioritize                 false
> >     libjvm_path                  /usr/lib64/jvm/java/jre/lib/amd64/server/libjvm.so
> >     additional_jvm_args          -Xmx2g
> >     jsv_url                      none
> >     jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
> >
> >
> 
> 
> --
> Richard Ems       mail: Richard.Ems at Cape-Horn-Eng.com
> 
> Cape Horn Engineering S.L.
> C/ Dr. J.J. Dómine 1, 5º piso
> 46011 Valencia
> Tel : +34 96 3242923 / Fax 924
> http://www.cape-horn-eng.com
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277455

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


