[GE users] qmaster using HUGE memory

seandavi seandavi at gmail.com
Fri Aug 27 15:50:45 BST 2010





On Mon, Jun 14, 2010 at 7:39 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:


On Wed, Jun 9, 2010 at 1:08 PM, kjpursley <kevin.pursley at bp.com> wrote:
As a first pass, turn off schedd_job_info. We have seen this use a bunch
of memory.
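
(For anyone reading this in the archive, disabling it is just a scheduler-configuration change. A rough sketch, assuming the usual qconf workflow:

    qconf -ssconf | grep schedd_job_info   # confirm the current value
    qconf -msconf                          # opens an editor; set schedd_job_info to false

The scheduler should pick the new value up on the fly, so no qmaster restart ought to be needed.)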


Thanks, Kevin.

I THINK this fixed things for us.  In any case, we have not had any problems recently.  Thanks for the help.


This post is a bit dated, but I wanted to follow up.  Turning off schedd_job_info did fix this issue (6.2u5).  However, users miss the job scheduling information.  Any other hints that we could try that would get us the scheduling info back?  This is not a big system and I wouldn't expect scheduling for a few dozen jobs to require more than 32GB of RAM (what our qmaster has).
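
(One thing that might be worth testing, assuming the job_list form documented in sge_sched_conf(5) is available in 6.2u5: schedd_job_info also accepts an explicit list of job IDs, so the scheduler only keeps messages for the jobs you name instead of for everything, e.g.

    schedd_job_info                   job_list 101,102

in qconf -msconf, with the IDs above being purely hypothetical. That should keep the memory cost bounded while still letting you see "scheduling info:" in qstat -j for the jobs you care about.)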

Thanks,
Sean


________________________________
From: seandavi [mailto:seandavi at gmail.com]
Sent: Wednesday, June 09, 2010 12:02 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] qmaster using HUGE memory



On Wed, Jun 9, 2010 at 11:19 AM, rems0 <Richard.Ems at cape-horn-eng.com> wrote:
Hi,

Is your schedd_job_info set to false (qconf -ssconf) ?

Just for fun, here's the whole output.  To answer directly, schedd_job_info=true in our setup.

algorithm                         default
schedule_interval                 0:0:10
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            none

halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         10000
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     0.500000
weight_waiting_time               0.278000
weight_deadline                   3600000.000000
weight_urgency                    0.050000
weight_priority                   0.050000
max_reservation                   10
default_duration                  INFINITY

Are you using parallel environments ?

Yes.  All jobs are simply SMP jobs, though, so no MPI integration.
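
(For context, an SMP-only PE of the kind described here usually looks roughly like the following; the name and slot count below are placeholders, not our actual configuration:

    pe_name            smp
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $pe_slots
    control_slaves     FALSE
    job_is_first_task  TRUE
    urgency_slots      min
    accounting_summary FALSE

i.e. all slots of a job land on one host and there is no tight integration.)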

Thanks,
Sean


Richard

On 06/09/2010 12:35 PM, seandavi wrote:
> Just a followup....
>
> The qmaster did finally come back down to a "normal" size of 40m or so
> after about 20 minutes.  So, I suppose this is expected behavior that I
> just happened to be around to observe, and it may have happened before.
>  I'm still curious as to why it might happen.  I had just submitted 5
> new jobs, but they were not array jobs or anything else complicated.
>
> Thanks again,
> Sean
>
>
> On Wed, Jun 9, 2010 at 6:24 AM, Sean Davis <seandavi at gmail.com> wrote:
>
>     Using 6.2u5, I found this AM that jobs were not being scheduled.  I
>     checked around a bit and it turns out that the qmaster was using
>     30GB of RAM and the machine was thrashing.  This is with no array
>     jobs scheduled or running, 10 jobs in the queue, and a very small
>     cluster with only about 10 nodes.  The messages file is bland, I
>     think, but I can post an excerpt since last restarting the qmaster
>     (I have done that a couple of times).  Any suggestions?
>
>     Thanks,
>     Sean
>
>
>     The config looks like:
>     execd_spool_dir              /import/cluster/sge6_2u5/default/spool
>     mailer                       /bin/mail
>     xterm                        /usr/bin/X11/xterm
>     load_sensor                  none
>     prolog                       none
>     epilog                       none
>     shell_start_mode             posix_compliant
>     login_shells                 sh,ksh,csh,tcsh
>     min_uid                      0
>     min_gid                      0
>     user_lists                   none
>     xuser_lists                  none
>     projects                     none
>     xprojects                    none
>     enforce_project              false
>     enforce_user                 auto
>     load_report_time             00:00:40
>     max_unheard                  00:05:00
>     reschedule_unknown           00:00:00
>     loglevel                     log_warning
>     administrator_mail           sdavis2 at mail.nih.gov
>     set_token_cmd                none
>     pag_cmd                      none
>     token_extend_time            none
>     shepherd_cmd                 none
>     qmaster_params               none
>     execd_params                 none
>     reporting_params             accounting=true reporting=true \
>                                  flush_time=00:00:10 joblog=true \
>                                  sharelog=00:00:00
>     finished_jobs                100
>     gid_range                    20200-20300
>     qlogin_command               builtin
>     qlogin_daemon                builtin
>     rlogin_command               builtin
>     rlogin_daemon                builtin
>     rsh_command                  builtin
>     rsh_daemon                   builtin
>     max_aj_instances             2000
>     max_aj_tasks                 5000
>     max_u_jobs                   0
>     max_jobs                     0
>     max_advance_reservations     0
>     auto_user_oticket            0
>     auto_user_fshare             100
>     auto_user_default_project    none
>     auto_user_delete_time        86400
>     delegated_file_staging       false
>     reprioritize                 false
>     libjvm_path                  /usr/lib64/jvm/java/jre/lib/amd64/server/libjvm.so
>     additional_jvm_args          -Xmx2g
>     jsv_url                      none
>     jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
>
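
(As an aside, for keeping an eye on the qmaster footprint discussed above, something as simple as

    ps -C sge_qmaster -o pid,rss,vsz,cmd

on the master host, assuming a Linux-style ps, is enough to catch the resident-size growth before the machine starts thrashing.)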


--
Richard Ems       mail: Richard.Ems at Cape-Horn-Eng.com

Cape Horn Engineering S.L.
C/ Dr. J.J. Dómine 1, 5º piso
46011 Valencia
Tel : +34 96 3242923 / Fax 924
http://www.cape-horn-eng.com
