[GE users] sge6.2u3 - scheduler dying intermittently

rpatterson patterso at mail.nih.gov
Mon Aug 10 18:12:35 BST 2009


Recently, I have been having trouble with the scheduler thread dying on
our master. I assume that is what's happening because the sge_qmaster
process is still running and running jobs continue without a problem,
but client requests (qsub/qstat) can no longer make a connection and no
new jobs are dispatched. This has been happening about once a week.

I started seeing this with SGE 6.2u1 and still have the same problem
after recently upgrading to 6.2u3. Our cluster has about 250 nodes, with
a large number of fairly short jobs running or queued at all times
(about 1200 running jobs and 20K-30K queued). Performance *seems* OK
other than this issue. All server hosts are running SUSE SLES9 x86_64.

Included below is a snippet of the qmaster log from just before the
outage begins (nothing is logged after this until I restart qmaster).
The errors looked similar to an issue I saw on this list, so I updated
"qmaster_params" to include SCHEDULER_TIMEOUT=1200, but that did not
seem to help. I'm curious whether a schedule_interval of 0:0:15 is too
long in our environment, or whether it makes sense to adjust it at all.
Right now I'm running sge_qmaster with "SGE_ND=true" and logging the
output. Any other debugging tips would be appreciated.
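
To pin down when each outage starts, a small probe that times a client
command could run from cron or a shell on a submit host. This is only a
minimal sketch: the function names, the 30-second timeout and the
60-second interval are my own placeholders, and it assumes qstat is on
PATH with the SGE environment already sourced.

```python
import subprocess
import sys
import time


def probe(cmd, timeout_s=30.0):
    """Run cmd once and return (ok, elapsed_seconds).

    ok is False if the command times out, exits non-zero,
    or cannot be started at all.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        ok = False
    return ok, time.monotonic() - start


def watch(cmd, interval_s=60.0, timeout_s=30.0, max_probes=None):
    """Probe repeatedly, print a timestamped line per failure,
    and return the number of failed probes.

    max_probes=None (hypothetical default) means run until interrupted.
    """
    failures = 0
    count = 0
    while max_probes is None or count < max_probes:
        ok, elapsed = probe(cmd, timeout_s)
        count += 1
        if not ok:
            failures += 1
            stamp = time.strftime("%m/%d/%Y %H:%M:%S")
            print(f"{stamp} probe failed after {elapsed:.1f}s: {' '.join(cmd)}")
            sys.stdout.flush()
        # Sleep only if another probe will follow.
        if max_probes is None or count < max_probes:
            time.sleep(interval_s)
    return failures
```

Calling watch(["qstat"]) would then log the first moment a client can no
longer get an answer, which can be lined up against the last entries in
the qmaster log.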


08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
08/05/2009 23:52:22|event_|sgemaster02|E|no event client known with id 105 to process acknowledgements
08/05/2009 23:53:32|event_|sgemaster02|E|no event client known with id 150 to process acknowledgements
08/05/2009 23:53:38|event_|sgemaster02|E|no event client known with id 167 to process acknowledgements
08/05/2009 23:54:07|event_|sgemaster02|E|no event client known with id 159 to process acknowledgements
08/05/2009 23:54:08|event_|sgemaster02|E|no event client known with id 420 to process acknowledgements
08/05/2009 23:54:17|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49551) reregistered - it will need a total update
08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49555) reregistered - it will need a total update
08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49559) reregistered - it will need a total update
08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49561) reregistered - it will need a total update
08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49569) reregistered - it will need a total update
08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49573) reregistered - it will need a total update
08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49574) reregistered - it will need a total update
08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49576) reregistered - it will need a total update
08/05/2009 23:54:19|event_|sgemaster02|E|no event client known with id 172 to process acknowledgements
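
Since the same few messages repeat many times, a per-thread count may be
easier to read than the raw log. The following is a rough sketch, not an
SGE tool: it assumes each record has the pipe-separated layout shown
above (timestamp|thread|host|level|message) and treats any line that does
not start with a timestamp as a wrapped continuation of the previous
record. The names parse_records and summarize are my own.

```python
import re
from collections import Counter

# A record starts with "MM/DD/YYYY HH:MM:SS|"; any other non-blank line is
# treated as a continuation of the previous record (wrapped by the mailer).
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\|")


def parse_records(lines):
    """Yield (timestamp, thread, host, level, message) tuples."""
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            if current is not None:
                yield tuple(current)
            # Split on the first four pipes only; the message may contain "|".
            parts = line.split("|", 4)
            current = parts if len(parts) == 5 else None
        elif current is not None:
            current[4] += " " + line
    if current is not None:
        yield tuple(current)


def summarize(lines):
    """Count records per (thread, message), with digit runs masked as N
    so that messages differing only in ids or pids group together."""
    counts = Counter()
    for _, thread, _, _, message in parse_records(lines):
        counts[(thread, re.sub(r"\d+", "N", message))] += 1
    return counts
```

For example, summarize(open(path)) on the qmaster messages file (path
being wherever your qmaster spool keeps it) returns a Counter whose
most_common() shows which thread is producing which message most often.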


patterso at cfengine1:/panfs/pan1.be-md.ncbi.nlm.nih.gov> qconf -sconf 
#global:
execd_spool_dir              /var/sge/ncbi/spool
mailer                       /netmnt/sge62/util/mailer.sh
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,bash,csh,tcsh,zsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 false
load_report_time             00:01:00
max_unheard                  00:60:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           sgeadmin at ncbi.nlm.nih.gov
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               ENABLE_FORCED_QDEL=true,MAX_DYN_EC=1024, \
                             SCHEDULER_TIMEOUT=1200
execd_params                 INHERIT_ENV=false,SGE_LIB_PATH=true, \
                             NOTIFY_KILL=TERM
reporting_params             accounting=true reporting=true \
                             flush_time=00:00:15 joblog=true sharelog=00:00:00
finished_jobs                100
gid_range                    37000-39999
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   50000
max_jobs                     100000
max_advance_reservations     10000
auto_user_oticket            0
auto_user_fshare             100
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 0
libjvm_path                  /usr/java/jdk1.6.0_03/jre/lib/amd64/server/libjvm.so
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
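
To double-check which qmaster_params are actually in effect, the output
above can be parsed rather than read by eye. This is a small sketch under
my own assumptions: it expects "name value" lines with a trailing
backslash marking a continued line, as in the listing above, and the
function names parse_qconf and split_params are hypothetical.

```python
def parse_qconf(text):
    """Parse 'name value' lines into a dict, joining lines that end
    with a backslash to the line that follows them."""
    settings = {}
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            # Drop the backslash and keep accumulating.
            pending += line[:-1].strip() + " "
            continue
        full = (pending + line).strip()
        pending = ""
        parts = full.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def split_params(value):
    """Split a comma-separated KEY=VALUE list such as qmaster_params."""
    params = {}
    if value.lower() == "none":
        return params
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition("=")
        params[key] = val
    return params
```

Feeding it the text printed by "qconf -sconf" and then calling
split_params(conf["qmaster_params"]) gives a dict in which the timeout
setting can be looked up by name.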

patterso at cfengine1:/panfs/pan1.be-md.ncbi.nlm.nih.gov> qconf -ssconf
algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50,mem_free=2G
load_adjustment_decay_time        0:5:00
load_formula                      np_load_avg
schedd_job_info                   false
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         10000
weight_tickets_share              0
share_override_tickets            FALSE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   1000
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     0.100000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   0.100000
max_reservation                   0
default_duration                  INFINITY

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211721
