Opened 6 years ago

Last modified 6 years ago

#1470 new defect

qmaster hangs when submitting a PE job with the default maximum number of processors

Reported by: ppoilbarbe Owned by:
Priority: normal Milestone:
Component: sge Version: 8.1.3
Severity: minor Keywords: Linux qmaster
Cc:

Description

Submitting a job with -pe <pe> <num>- (i.e. with the upper bound m missing from the n-m slot range, which SGE replaces with 9999999) locks up qmaster: it does nothing visible but consumes 100% of one CPU. We cannot catch this with a JSV script because qmaster locks up before the JSV runs, so any user can lock the whole system with a single qsub. The only way to unlock it is to restart qmaster.
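For illustration, the difference between the two submission forms can be shown with a minimal submit-side guard. This is a hypothetical sketch, not part of SGE or of the reporter's setup; the PE name smp and the job script job.sh are assumed:

```shell
#!/bin/sh
# Hypothetical guard: reject "n-" slot ranges whose upper bound is
# missing, since qmaster reportedly substitutes 9999999 for the
# missing bound and then spins at 100% CPU.
check_pe_range() {
    case "$1" in
        *-) echo "open-ended slot range '$1' rejected" ; return 1 ;;
        *)  echo "slot range '$1' accepted"            ; return 0 ;;
    esac
}

check_pe_range "4-16"        # bounded form: qsub -pe smp 4-16 job.sh
check_pe_range "4-" || true  # the form that triggers the hang: qsub -pe smp 4- job.sh
```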

environment/remarks:

  • master: arch is lx-amd64, 2 cores, 2 GB RAM, as a VMware guest
  • nodes: arch is lx-x86 or lx-amd64
  • Linux distribution: Ubuntu 12.04
  • about 100 nodes
  • there is one shadow master
  • SGE: Son of Grid Engine, 8.1.3 locally compiled with options aimk -spool-classic -no-java -no-gui-inst -no-herd -no-jni -no-intl
  • With the configuration below, the behaviour is the same with or without the JSV script or the sensor script.
  • It is the same with or without the share tree (which puts all users at the same level).
  • Submitting jobs without a PE works fine.
  • /data/SGE is a nfs share
  • jobs on desktop nodes (about 70 nodes) are suspended by calendar between 8:00 and 19:00, Monday to Friday

qconf -ssconf:

algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   false
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          24
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              20000000
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     1.000000
weight_waiting_time               0.000000
weight_deadline                   0.000000
weight_urgency                    0.100000
weight_priority                   10.000000
max_reservation                   0
default_duration                  INFINITY

qconf -sconf:

#global:
execd_spool_dir              /var/spool/sge
mailer                       /usr/bin/mail
xterm                        /usr/bin/xterm
load_sensor                  /data/SGE/CLS/bin/sensors.sh
prolog                       /data/SGE/CLS/bin/Prolog.sh
epilog                       /data/SGE/CLS/bin/Epilog.sh
shell_start_mode             unix_behavior
login_shells                 sh,bash,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           none
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false \
                             sharelog=00:00:00 accounting_flush_time=00:00:00
finished_jobs                100
gid_range                    20000-20100
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
max_advance_reservations     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 false
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
jsv_url                      /data/SGE/CLS/bin/jsv_script

Change History (3)

comment:1 Changed 6 years ago by dlove

I can't immediately reproduce this. In case it makes a difference, what
is the definition of the PE?

Could you prevent the problem with a default client-side JSV, assuming
users won't override it?

comment:2 Changed 6 years ago by ppoilbarbe

I circumvented this behaviour by modifying the script used to submit jobs (it generates the qsub command).

Since users are not supposed to run qsub directly, I did not test with a local (client-side) JSV; it works fine now.

I created this ticket to record the issue after finding the source of the problem (which was the hard part) and a workaround.
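As an alternative to a submit wrapper, a client-side JSV (it runs before the request reaches qmaster, unlike the server-side jsv_url) could cap the substituted upper bound. The following is only a sketch, assuming the shell JSV API shipped in util/resources/jsv/jsv_include.sh (jsv_get_param, jsv_set_param, jsv_accept, jsv_correct, jsv_main); the 1000-slot cap is an arbitrary example value, not taken from the ticket:

```shell
#!/bin/sh
# Sketch of a client-side JSV that caps an unbounded pe_max.
# MAX_SLOTS is an arbitrary example cap.
MAX_SLOTS=1000

clamp_pe_max() {
    # SGE substitutes 9999999 when the upper bound of an "n-" range is missing.
    if [ "$1" = "9999999" ]; then
        echo "$MAX_SLOTS"
    else
        echo "$1"
    fi
}

jsv_on_verify() {
    pe_max=$(jsv_get_param pe_max)
    capped=$(clamp_pe_max "$pe_max")
    if [ -n "$pe_max" ] && [ "$capped" != "$pe_max" ]; then
        jsv_set_param pe_max "$capped"
        jsv_correct "pe_max capped at $capped"
    else
        jsv_accept ""
    fi
}

# In a real JSV, the script would end with:
#   . $SGE_ROOT/util/resources/jsv/jsv_include.sh
#   jsv_main
```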

Here is the PE definition:

pe_name            smp
slots              999999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

comment:3 Changed 6 years ago by dlove

SGE <sge-bugs@…> writes:

I created this ticket to record the issue after finding the source
of the problem (which was the hard part) and a workaround.

Yes, thank you.

Unfortunately I can't fix it without being able to reproduce it. I
tried again with that PE definition (I wondered whether urgency_slots was
the cause), but it still works for me.
