Opened 8 years ago

Closed 3 years ago

#682 closed defect (fixed)

IZ3050: 6.2u2_1 qmaster large memory leak

Reported by: steelah1 Owned by:
Priority: high Milestone:
Component: sge Version: 6.2u2
Severity: minor Keywords: Linux qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3050]

        Issue #:      3050             Platform:     Other    Reporter: steelah1 (steelah1)
       Component:     gridengine          OS:        Linux
     Subcomponent:    qmaster          Version:      6.2u2       CC:    None defined
        Status:       NEW              Priority:     P2
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    ernst (ernst)
      QA Contact:     ernst
          URL:
       * Summary:     6.2u2_1 qmaster large memory leak
   Status whiteboard:
      Attachments:

     Issue 3050 blocks:
   Votes for issue 3050:


   Opened: Mon Jun 15 15:24:00 -0700 2009 
------------------------


I upgraded from 6.0u8 to 6.2u2_1, and I keep getting large memory leaks from sge_qmaster. It approaches 100% of memory within a few
minutes, and submitted jobs just sit in queue wait state. I don't see anything in the messages file
(/local/sge/default/common/spool/qmaster/messages), and simply restarting the daemon doesn't fix the problem. I have to kill the sge_execd
on the execution hosts/compute nodes and then restart them one at a time, every few seconds. For jobs that are still running I can leave
their sge_execd alone, but I have to restart all the others; after that the memory leak goes away. Any ideas or info would be greatly
appreciated, as this was working fine before with 6.0u8.

   ------- Additional comments from joga Wed Jun 24 05:24:56 -0700 2009 -------
*** Issue 3051 has been marked as a duplicate of this issue. ***

   ------- Additional comments from joga Wed Jun 24 05:34:19 -0700 2009 -------
please provide some more information:
- architecture (as delivered by $SGE_ROOT/util/arch)
- the scheduler configuration (qconf -ssconf) and global configuration (qconf -sconf)
- what type of jobs are you running, e.g. array jobs, parallel jobs, jobs with special resource requests, etc.
- are you using some special configuration options like access lists, a sharetree, etc.
- how big is your cluster (number of exec hosts), and how many jobs are in the cluster when this happens?

Just guessing: if you have schedd_job_info enabled in the scheduler configuration, try disabling it.
If you need to know why a job cannot be scheduled and currently take that information from qstat -j <job_id>,
try qalter -w p <job_id> instead.
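
Concretely, that amounts to something like the following (a minimal sketch: the temporary file and the sed edit are just one way to flip the setting; qconf -msconf edits the scheduler configuration interactively instead):

# dump the scheduler configuration, disable schedd_job_info, load it back
qconf -ssconf > /tmp/sconf
sed -i 's/^schedd_job_info.*/schedd_job_info                   false/' /tmp/sconf
qconf -Msconf /tmp/sconf

# instead of reading the "cannot run" reasons from qstat -j <job_id>,
# ask the scheduler for a one-off validation of the job:
qalter -w p <job_id>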

   ------- Additional comments from steelah1 Wed Jun 24 07:52:31 -0700 2009 -------
/local/sge/util/arch
lx24-amd64

qconf -ssconf
algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 seqno
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   0
default_duration                  0:10:0

qconf -sconf
#global:
execd_spool_dir              /local/sge/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           hpcauth@inl.gov,sheljk@inl.gov
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                100
gid_range                    20000-20100
qlogin_command               /usr/local/bin/ssh_qlogin
qlogin_daemon                /usr/sbin/sshd -i
rlogin_daemon                /usr/sbin/sshd -i
rlogin_command               /usr/bin/ssh
rsh_command                  /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd -i
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 false
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

We run mostly parallel and serial jobs, no array jobs, no special requests.

We have some access lists for users for a couple of specific queues, but for the main queue it's wide open, so anyone who can get on the
machine can run jobs.

Our cluster is a combination of 166 dell 1950 dual core and quad core compute nodes running opensuse 11.1, with one login/head node (dell
1950, quadcore, opensuse 11.1)

   ------- Additional comments from steelah1 Wed Jun 24 13:49:01 -0700 2009 -------
Also, I recently changed schedd_job_info from true to false using qconf -msconf

   ------- Additional comments from joga Thu Jun 25 09:10:13 -0700 2009 -------
In the given scheduler config, you have the schedd_job_info enabled.
If you disable it, do you still see the problem?

If you can reproduce the issue,
we could try to get a core dump from qmaster when this problem occurs - this might help us to understand where exactly the problem is.
It is possible to get a core dump from a running process via the gcore command usually available on Linux.

So you could either manually call gcore <qmaster_pid> when you see the problem, or use a script I prepared for this purpose:
it monitors the qmaster size, calls gcore when qmaster reaches a certain size,
and repeats calling gcore a configurable number of times at further steps of growth.
When to call gcore, and how often, can be configured at the beginning of the script.
You can download it from the following URL:
http://gridengine.sunsource.net/files/documents/7/202/monitor_qmaster.sh
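
For readers who cannot fetch that script, the idea is roughly the following (only a sketch, not the original monitor_qmaster.sh; the threshold, step, and dump-count values are made-up examples, and reading the RSS from /proc assumes Linux):

#!/bin/sh
# Sketch of the monitoring idea: watch qmaster's RSS and take core dumps
# with gcore as it grows. Not the original monitor_qmaster.sh.
QMASTER_PID=$(pgrep -o sge_qmaster)   # assumes one qmaster on this host
THRESHOLD_KB=2000000                  # first gcore at ~2 GB RSS (example value)
STEP_KB=1000000                       # another gcore per further ~1 GB (example)
MAX_DUMPS=3                           # stop after this many dumps (example)

dumps=0
next=$THRESHOLD_KB
while [ "$dumps" -lt "$MAX_DUMPS" ]; do
    rss=$(awk '/^VmRSS:/ {print $2}' /proc/$QMASTER_PID/status) || exit 1
    if [ "$rss" -ge "$next" ]; then
        gcore -o qmaster_core.$dumps $QMASTER_PID
        dumps=$((dumps + 1))
        next=$((next + STEP_KB))
    fi
    sleep 10
done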

Change History (9)

comment:1 Changed 4 years ago by ppoilbarbe

  • Severity set to minor

Same issue with Son of Grid Engine 8.1.3.
Whenever a PE job is submitted (or is already in qw state when qmaster is restarted), a memory leak consumes all the available memory (2 GB of RAM plus 7 GB of swap) in about an hour; during this time qmaster does nothing (no scheduling).
Another thing which may be linked: when submitting a PE job with -pe peenv n- (whatever n is), SGE sets the upper limit to 9999999 and the job is submitted, but after that qmaster does not respond to any command (qsub, qstat, ...).
I tried many configurations to work around this, but setting schedd_job_info=false seems to work.

environment/remarks:

  • master: arch is lx-amd64
  • nodes: arch is lx-x86 or lx-amd64
  • Linux distribution: Ubuntu 12.04
  • about 100 nodes
  • there is one shadow master
  • SGE: Son of Grid Engine, 8.1.3 locally compiled with options `aimk -spool-classic -no-java -no-gui-inst -no-herd -no-jni -no-intl`
  • With the configuration below, the behaviour is the same with or without the JSV script or the load sensor script.
  • It is the same with or without the share tree (which puts all users at the same level)
  • Submitting non-PE jobs works fine
  • /data/SGE is an NFS share
  • jobs on desktop nodes are suspended by calendar between 8:00 and 19:00, Monday to Friday

qconf -ssconf:

algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   false
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          24
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              20000000
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     1.000000
weight_waiting_time               0.000000
weight_deadline                   0.000000
weight_urgency                    0.100000
weight_priority                   10.000000
max_reservation                   0
default_duration                  INFINITY

qconf -sconf:

#global:
execd_spool_dir              /var/spool/sge
mailer                       /usr/bin/mail
xterm                        /usr/bin/xterm
load_sensor                  /data/SGE/CLS/bin/sensors.sh
prolog                       /data/SGE/CLS/bin/Prolog.sh
epilog                       /data/SGE/CLS/bin/Epilog.sh
shell_start_mode             unix_behavior
login_shells                 sh,bash,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           none
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false \
                             sharelog=00:00:00 accounting_flush_time=00:00:00
finished_jobs                100
gid_range                    20000-20100
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
max_advance_reservations     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 false
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
jsv_url                      /data/SGE/CLS/bin/jsv_script
Last edited 4 years ago by ppoilbarbe

comment:2 follow-up: Changed 4 years ago by markdixon

On Wed, 14 Aug 2013, SGE wrote:
...

Same issue with version Son of Grid Engine 8.1.3.
Whenever a pe job is submitted a memory leak consumes all the available
memory (2gb of ram 7Gb of swap) in about 1 hour, during this qmaster
does nothing (no scheduling).

...

Nothing constructive to add, except a "me too" with some more symptoms...

We've seen something similar on 8.1.1, which was also fixed by setting schedd_job_info=false.

In our case, we had been running happily for about 6 months; thousands of parallel jobs had gone through the system (using the "-pe <pe> <num>" form of submission, where "<pe>" is typically wildcarded). Then, all at once, gridengine started consuming silly amounts of memory, e.g. 30Gb. We didn't manage to track down the cause, but a restart of gridengine didn't fix it. schedd_job_info=false resolved it. No evidence of unusual "-pe" options in the accounting file.

We've also seen excessive memory and CPU usage with schedd_job_info=true, when several very large task arrays have been working through the system.

Mark

comment:3 in reply to: ↑ 2 Changed 4 years ago by ppoilbarbe

Replying to markdixon:

...
In our case, we had been running happily for about 6 months; thousands of parallel jobs had gone through the system (using the "-pe <pe> <num>" form of submission, where "<pe>" is typically wildcarded). Then, all at once, gridengine started consuming silly amounts of memory, e.g. 30Gb. We didn't manage to track down the cause, but a restart of gridengine didn't fix it. schedd_job_info=false resolved it. No evidence of unusual "-pe" options in the accounting file.
...

Tested a few minutes ago: even with schedd_job_info=false, submitting with -pe <pe> <num>- (with m missing from the n-m slot range form, SGE replaces the upper limit with 9999999) locks up qmaster (doing nothing visible), and we cannot correct this with a JSV script because qmaster is already locked before the JSV can fix it. So any user can lock the whole system with a single qsub. The only way to unlock it is to restart qmaster.
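
For reference, a JSV that tries to cap such open-ended PE ranges would look roughly like the sketch below (using the shell helpers from $SGE_ROOT/util/resources/jsv/jsv_include.sh; the 9999999 check and the cap-to-pe_min policy are assumptions). As noted above, it does not help here, because qmaster is already locked before the correction can take effect:

#!/bin/sh
# Sketch of a JSV that caps an open-ended "-pe <pe> <n>-" request.
# Uses the standard shell JSV helpers; pe_name/pe_min/pe_max as per jsv(1).

jsv_on_start()
{
    return
}

jsv_on_verify()
{
    pe_name=$(jsv_get_param pe_name)
    if [ -n "$pe_name" ]; then
        pe_max=$(jsv_get_param pe_max)
        # an omitted upper bound appears as 9999999 (assumption based on the report)
        if [ "$pe_max" = "9999999" ]; then
            pe_min=$(jsv_get_param pe_min)
            jsv_set_param pe_max "$pe_min"   # cap the range (example policy)
            jsv_correct "Job was modified: open-ended PE range capped"
            return
        fi
    fi
    jsv_accept "Job is accepted"
    return
}

. ${SGE_ROOT}/util/resources/jsv/jsv_include.sh
jsv_main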

comment:4 follow-up: Changed 4 years ago by markdixon

On Wed, 14 Aug 2013, SGE wrote:
...

Tested a few minutes ago: even with schedd_job_info=false, submitting
with -pe <pe> <num>- (with m missing in the n-m form for range of slots,
SGE replaces the upper limit with 9999999) locks qmaster (doing nothing
visible) and we cannot correct this with a JSV script because it is
locked before. So any user can lock the whole system with single qsub.
The only way to unlock is to restart qmaster.

...

Am I reading correctly, you are reporting two issues?

1) Massive qmaster memory usage which is resolved by schedd_job_info=false

2) qmaster hangs if a job is submitted with "-pe <pe> <num>-"

They sound pretty separate to me :)

Mark
--


Mark Dixon Email : m.c.dixon@…
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK


comment:5 in reply to: ↑ 4 Changed 4 years ago by ppoilbarbe

Replying to markdixon:

On Wed, 14 Aug 2013, SGE wrote:
...

Tested a few minutes ago: even with schedd_job_info=false, submitting
with -pe <pe> <num>- (with m missing in the n-m form for range of slots,
SGE replaces the upper limit with 9999999) locks qmaster (doing nothing
visible) and we cannot correct this with a JSV script because it is
locked before. So any user can lock the whole system with single qsub.
The only way to unlock is to restart qmaster.

...

Am I reading correctly, you are reporting two issues?

1) Massive qmaster memory usage which is resolved by schedd_job_info=false

2) qmaster hangs if a job is submitted with "-pe <pe> <num>-"

They sound pretty separate to me :)
...

Oops... my fault.
Yes, it is a different problem, but as I mentioned it in my first message and didn't know whether it was linked, I reported it here once verified... May I file a new issue?

comment:6 Changed 4 years ago by markdixon

On Wed, 14 Aug 2013, SGE wrote:
...

Yes, it was different but as I mentioned it in my first message and didn't
know if it was linked, I reported it as verified... May I file a new issue
?

...

I would open another ticket for it, if I were you.

All the best,

Mark

comment:7 Changed 4 years ago by dlove

Just to point out that this isn't a general problem. (I've occasionally
known the qmaster to get to a couple of GB or so for no apparent reason,
and then contract, but not exhaust VM.) Our load is mostly tightly
integrated parallel and array jobs with up to ~10k tasks, and I've
always had schedd_job_info true.

Sometime I'll see if I can find some sensible way to diagnose the leak.
I suspect what I'd do with a Lisp-like system won't help :-(. I don't
know off-hand if monitoring (see MONITOR_TIME in sge_conf(5)) will be
any use, especially as qping, or what it gets sent, is partly broken.
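
For completeness, monitoring is switched on through qmaster_params in the global configuration; something like the following (the interval is just an example, and the qping invocation uses host/port placeholders):

# enable periodic qmaster monitoring output (interval is an example value)
qconf -mconf    # set:  qmaster_params   MONITOR_TIME=0:0:10

# the monitoring lines go to the qmaster messages file; they can also be
# fetched over the wire with qping:
qping -f <qmaster_host> $SGE_QMASTER_PORT qmaster 1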

Thanks for the fairly comprehensive configuration info, by the way.
There should be a cut-down version of the backup script to generate a
useful tarball for convenience...

comment:8 Changed 3 years ago by Dave Love <d.love@…>

In 4735/sge:

Remove CCT_job_messages element and dependent code
The lists are set but never got. This fixed an instance of an occasional
qmaster space leak, probably responsible for both open issues.
Refs #360, #682.

comment:9 Changed 3 years ago by dlove

  • Resolution set to fixed
  • Status changed from new to closed

Assuming this is fixed by [4735]. The Leeds case is very likely to be the same, anyway.

A symptom is an obvious leak of "job ids" elements in a core dump when running with a reasonable ulimit -v:

# strings core.27981 | sort | uniq -c | sort -r -n | head
 320735 job ids
  13381 INFINITY
   7161 slots
   5116 cannot run in queue "serial" because PE "openmpi-12" is not in PE list
   3683 allel@no
   3600 CCCC
   3278 pi-ch2" 
   3220 cannot run in queue "serial" because PE "openmpi-ch2" is not in PE list
   3218 cannot run in PE "openmpi-12" because it only offers 0 slots
   3186 np_load_avg

It may show up while creating CCT_job_messages or, for instance, as a segv due to unchecked
malloc return values (see [4685]) when the VM is exhausted.
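
For anyone trying to catch this again, a sketch of how such a core can be obtained with a capped address space (the limit value and startup-script path are examples, not from this ticket):

# cap qmaster's virtual memory so the leak ends in an abort/core instead of
# exhausting the machine (the limit and the startup path are examples)
ulimit -c unlimited
ulimit -v 4000000              # ~4 GB address-space limit, in kB
$SGE_ROOT/default/common/sgemaster start

# then look for leaked elements in the resulting core, as above:
strings core.<pid> | sort | uniq -c | sort -r -n | head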

I'm not too sure what was going on; it took a long time to catch it again.

Note: See TracTickets for help on using tickets.