Opened 10 years ago

Closed 7 years ago

#681 closed defect (fixed)

IZ3049: memory leak in qmaster when submitting special jobs

Reported by: ah_sunsource
Owned by:
Priority:    low
Milestone:
Component:   sge
Version:     6.2u2
Severity:    minor
Keywords:    Linux qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3049]

Issue #:              3049
Platform:             All
Reporter:             ah_sunsource (ah_sunsource)
Component:            gridengine
OS:                   Linux
Subcomponent:         qmaster
Version:              6.2u2
CC:                   None defined
Status:               NEW
Priority:             P4
Resolution:
Issue type:           DEFECT
Target milestone:     ---
Assigned to:          ernst (ernst)
QA Contact:           ernst
URL:
Summary:              memory leak in qmaster when submitting special jobs
Status whiteboard:
Attachments:
Issue 3049 blocks:
Votes for issue 3049:


   Opened: Thu Jun 11 23:54:00 -0700 2009 
------------------------


Hi,

qmaster leaks memory when submitting a parallel array job that requests resource reservation. Example:

qsub -t 1-100 -pe mpi 16-32 -R y -l h_cpu=1:00:00,h_vmem=1.2G mpi-job.sh

Within seconds, qmaster's memory consumption grows to several gigabytes and eventually crashes the system.

Cheers,
Andreas

   ------- Additional comments from joga Wed Jun 24 05:21:22 -0700 2009 -------
Hi Andreas,

I cannot reproduce the issue in my test cluster.
Can you please share some more information, e.g.
- definition of the mpi pe (qconf -sp mpi)
- scheduler and global configuration (qconf -ssconf, qconf -sconf)
- definition of the h_cpu and h_vmem complex variables - is h_vmem consumable?
- does mpi-job.sh contain any special comments with additional submit options?
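
If it is easier, something along these lines could collect all of the requested output in one file (the output file name below is only an illustration):

# Collect the configuration details requested above into a single file.
{
    echo "== qconf -sp mpi ==";  qconf -sp mpi
    echo "== qconf -ssconf ==";  qconf -ssconf
    echo "== qconf -sconf ==";   qconf -sconf
    echo "== qconf -sc ==";      qconf -sc
} > /tmp/sge_config_report.txt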

Thanks,

  Joachim

   ------- Additional comments from joga Thu Jun 25 09:04:15 -0700 2009 -------
If you can reproduce the issue, we could try to get a core dump from qmaster when the problem occurs - this might help us understand where exactly the problem is.
It is possible to get a core dump from a running process via the gcore command.
I have prepared a small script which monitors the qmaster size, calls gcore when qmaster reaches a certain size,
and then repeats the gcore call a configurable number of times after further steps of growth.
When to call gcore, and how often, can be configured at the beginning of the script.
You can download it from the following URL:
http://gridengine.sunsource.net/files/documents/7/202/monitor_qmaster.sh
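
For reference, a minimal sketch of what such a monitor might look like - the thresholds, paths, and step logic below are assumptions, and the actual monitor_qmaster.sh may differ:

#!/bin/sh
# Sketch of a qmaster memory monitor: poll the resident size of sge_qmaster
# and take core dumps with gcore as it grows.  All values below are examples.

THRESHOLD_KB=2097152   # take the first core once RSS exceeds ~2 GB
STEP_KB=1048576        # take another core after each further ~1 GB of growth
MAX_DUMPS=3            # stop after this many core dumps
INTERVAL=5             # polling interval in seconds
DUMP_DIR=/var/tmp/qmaster_cores

mkdir -p "$DUMP_DIR"
pid=$(pgrep -o sge_qmaster)
[ -z "$pid" ] && { echo "sge_qmaster is not running" >&2; exit 1; }

dumps=0
next_kb=$THRESHOLD_KB
while [ "$dumps" -lt "$MAX_DUMPS" ]; do
    rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
    [ -z "$rss_kb" ] && break                      # qmaster exited (or crashed)
    if [ "$rss_kb" -ge "$next_kb" ]; then
        gcore -o "$DUMP_DIR/qmaster.$dumps" "$pid" # writes qmaster.<n>.<pid>
        dumps=$((dumps + 1))
        next_kb=$((next_kb + STEP_KB))
    fi
    sleep "$INTERVAL"
done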

   ------- Additional comments from ah_sunsource Wed Jul 1 05:51:45 -0700 2009 -------
Hi,

I've upgraded to 6.2u3 and cannot reproduce the problem right now. But I will answer your questions anyway - the configuration did not change. Maybe it helps.

[oreade38] ~ % qconf -sp mpi
pe_name            mpi
slots              512
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
[oreade38] ~ % qconf -ssconf
algorithm                         default
schedule_interval                 0:0:10
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=1.0
load_adjustment_decay_time        00:07:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         1000
weight_tickets_share              3000
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   0
default_duration                  INFINITY
[oreade38] ~ % qconf -sconf
#global:
execd_spool_dir              /usr/gridengine/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/xterm
load_sensor                  none
prolog                       root@/usr/gridengine/util/prolog
epilog                       root@/usr/gridengine/util/epilog
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:30
max_unheard                  00:2:30
reschedule_unknown           00:05:00
loglevel                     log_info
administrator_mail           ahaupt@ifh.de
set_token_cmd                /usr/gridengine/util/set_token_cmd
pag_cmd                      /usr/heimdal/bin/pagsh
token_extend_time            24:0:0
shepherd_cmd                 none
qmaster_params               none
execd_params                 SHARETREE_RESERVED_USAGE ENABLE_ADDGRP_KILL
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                100
gid_range                    50000-50100
qlogin_command               ssh -tt -o GSSAPIDelegateCredentials=no
qlogin_daemon                /usr/gridengine/util/rshd-wrapper
rlogin_command               ssh -tt -o GSSAPIDelegateCredentials=no
rlogin_daemon                /usr/gridengine/util/rshd-wrapper
rsh_command                  ssh -tt -o GSSAPIDelegateCredentials=no
rsh_daemon                   /usr/gridengine/util/rshd-wrapper
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
max_advance_reservations     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 false
jsv_url                      /usr/gridengine/util/job_verifier.pl
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

[oreade38] ~ % qconf -sc |egrep '(#name|h_cpu|h_vmem)'
#name               shortcut   type        relop requestable consumable default  urgency
h_cpu               h_cpu      TIME        <=    YES         NO         0:0:0    0
h_vmem              h_vmem     MEMORY      <=    YES         YES        512M     0

No other SGE flags have been set within the job script.

Cheers,
Andreas

Change History (1)

comment:1 Changed 7 years ago by dlove

  • Resolution set to fixed
  • Severity set to minor
  • Status changed from new to closed

Assume fixed; doesn't show up here anyhow.
