[GE users] sge6.2u3 - scheduler dying intermittently

rpatterson patterso at mail.nih.gov
Mon Aug 17 20:55:34 BST 2009


I finally managed to get the master to run with debug level 2 turned on,
and it ran for about an hour before it started to fail again (same
symptoms: sge_qmaster is up and running, but no clients can connect, and
no queued jobs are dispatched).  I now have a 3.4G log. Grepping for
"fail|error" generates over 8000 lines. I've been looking through it
myself trying to see exactly where it goes south, but I can't really
tell. Is there anything else I should be looking for specifically?
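In the meantime, a cron-able check along these lines can at least flag the hung state quickly. This is only a sketch: the admin address is hypothetical, and it assumes the SGE client tools are on PATH and that a `timeout` utility (GNU coreutils or similar) is available.

```shell
#!/bin/sh
# Sketch of a watchdog for the "qmaster up but not answering" state.
# ADMIN is a hypothetical address; adjust for your site.
ADMIN="sgeadmin@example.com"

if command -v qstat >/dev/null 2>&1; then
    # A healthy qmaster answers qstat quickly; when the scheduler/listener
    # is hung the client just blocks, so bound the wait with timeout(1).
    if ! timeout 30 qstat >/dev/null 2>&1; then
        msg="sge_qmaster is not answering client requests"
        if command -v mail >/dev/null 2>&1; then
            echo "$msg" | mail -s "SGE watchdog alert" "$ADMIN"
        else
            echo "$msg" >&2
        fi
    fi
fi
```

Restarting qmaster from the watchdog is tempting, but alerting first preserves the hung process for debugging.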

# egrep 'fail|error' ./sgemaster-debug.log | <parse first two fields out> | sort | uniq -c
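For reference, one possible expansion of the parsing step is an awk field grab. This is a sketch; it assumes the thread name and first message token are the first two whitespace-separated fields, as in the summary below.

```shell
# Count error/fail lines per (thread, first message token).
# LOG path is taken from the message above; guard so this is a
# no-op when the file isn't present.
LOG=${LOG:-./sgemaster-debug.log}
if [ -f "$LOG" ]; then
    egrep 'fail|error' "$LOG" \
        | awk '{print $1, $2}' \
        | sort | uniq -c | sort -rn
fi
```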

    243 scheduler000     pthread_cond_timedwait for events failed 110
   1162 worker000     after sge_resolve_host() which returned no error happened
    357 worker000     error: one of the required parameters is NULL
   1220 worker000 --> centry_list_has_error() {
   1220 worker000 <-- centry_list_has_error() ../libs/sgeobj/sge_centry.c 1347 }
   1059 worker001     after sge_resolve_host() which returned no error happened
    522 worker001     error: one of the required parameters is NULL
   1116 worker001 --> centry_list_has_error() {
   1116 worker001 <-- centry_list_has_error() ../libs/sgeobj/sge_centry.c 1347 }

Without any debugging enabled, our cluster seems to run for several days
at a time before entering this state again. I'm guessing that the
debugging/logging is slowing down the master enough to trigger the issue
faster...?

Ron

-----Original Message-----
From: templedf [mailto:dan.templeton at sun.com] 
Sent: Monday, August 10, 2009 3:18 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] sge6.2u3 - scheduler dying intermittently

If you don't mind tons of debug output, I would turn on either debug 
level 2 or 9.  See:

http://blogs.sun.com/templedf/entry/using_debugging_output

That'll give us a much clearer picture of what's happening.
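If the dl.sh helper described in that post isn't handy, the trace level can also be set via the environment directly. The exact field layout of SGE_DEBUG_LEVEL (eight space-separated levels, one per trace layer) and the mapping from "debug level 2" to the first field are my recollection; double-check against the blog post above, or just source dl.sh, which does the mapping for you.

```shell
# Sketch: enable qmaster debug tracing and capture it to a file.
# SGE_DEBUG_LEVEL format assumed per the blog post referenced above.
export SGE_DEBUG_LEVEL="2 0 0 0 0 0 0 0"
export SGE_ND=true    # keep sge_qmaster in the foreground

# Guarded so the line is a no-op on hosts without SGE installed:
{ command -v sge_qmaster >/dev/null 2>&1 && \
    sge_qmaster > /var/tmp/qmaster-debug.log 2>&1; } || true
```

Remember to unset SGE_DEBUG_LEVEL before restarting qmaster normally, or the log volume will stay enormous.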

Daniel

rpatterson wrote:
> Recently, I have been having trouble with the scheduler thread dying on
> our master. I assume that this is what's happening because the
> sge_qmaster process is still running, and running jobs continue on
> without a problem, but client requests (qsub/qstat) can no longer make a
> connection, and no new jobs are dispatched. Recently, this has been
> happening about once a week.
>
> I started seeing this with SGE 6.2u1 and have recently upgraded to 6.2u3
> and have the same problem. Our cluster has about 250 nodes, with a large
> number of fairly short jobs running/queued all the time (about 1200
> running jobs, and 20K-30K queued). The performance *seems* ok other than
> this issue. All server hosts are running SUSE SLES9 x86_64.
>
> Included below is a snippet of the qmaster log just before the outage
> begins (nothing is logged after this until I restart qmaster). The
> errors looked similar to an issue I saw on this list, so I updated
> "qmaster_params" to include SCHEDULER_TIMEOUT=1200, but that did not
> seem to help. I'm curious whether a schedule_interval of 0:0:15 is too
> long in our environment, or if it makes sense to adjust it at all. Right
> now I'm running sge_qmaster with "SGE_ND=true" and logging the output.
> Any other debugging tips would be appreciated.
>
>
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:52:22|event_|sgemaster02|E|no event client known with id 105 to process acknowledgements
> 08/05/2009 23:53:32|event_|sgemaster02|E|no event client known with id 150 to process acknowledgements
> 08/05/2009 23:53:38|event_|sgemaster02|E|no event client known with id 167 to process acknowledgements
> 08/05/2009 23:54:07|event_|sgemaster02|E|no event client known with id 159 to process acknowledgements
> 08/05/2009 23:54:08|event_|sgemaster02|E|no event client known with id 420 to process acknowledgements
> 08/05/2009 23:54:17|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49551) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49555) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49559) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49561) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49569) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49573) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49574) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49576) reregistered - it will need a total update
> 08/05/2009 23:54:19|event_|sgemaster02|E|no event client known with id 172 to process acknowledgements
>
>
> patterso at cfengine1:/panfs/pan1.be-md.ncbi.nlm.nih.gov> qconf -sconf 
> #global:
> execd_spool_dir              /var/sge/ncbi/spool
> mailer                       /netmnt/sge62/util/mailer.sh
> xterm                        /usr/bin/X11/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 sh,ksh,bash,csh,tcsh,zsh
> min_uid                      0
> min_gid                      0
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 false
> load_report_time             00:01:00
> max_unheard                  00:60:00
> reschedule_unknown           00:00:00
> loglevel                     log_warning
> administrator_mail           sgeadmin at ncbi.nlm.nih.gov
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               ENABLE_FORCED_QDEL=true,MAX_DYN_EC=1024, \
>                              SCHEDULER_TIMEOUT=1200
> execd_params                 INHERIT_ENV=false,SGE_LIB_PATH=true, \
>                              NOTIFY_KILL=TERM
> reporting_params             accounting=true reporting=true \
>                              flush_time=00:00:15 joblog=true sharelog=00:00:00
> finished_jobs                100
> gid_range                    37000-39999
> qlogin_command               builtin
> qlogin_daemon                builtin
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   50000
> max_jobs                     100000
> max_advance_reservations     10000
> auto_user_oticket            0
> auto_user_fshare             100
> auto_user_default_project    none
> auto_user_delete_time        86400
> delegated_file_staging       false
> reprioritize                 0
> libjvm_path                  /usr/java/jdk1.6.0_03/jre/lib/amd64/server/libjvm.so
> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
> patterso at cfengine1:/panfs/pan1.be-md.ncbi.nlm.nih.gov> qconf -ssconf
> algorithm                         default
> schedule_interval                 0:0:15
> maxujobs                          0
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50,mem_free=2G
> load_adjustment_decay_time        0:5:00
> load_formula                      np_load_avg
> schedd_job_info                   false
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> reprioritize_interval             0:0:0
> halftime                          168
> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> compensation_factor               5.000000
> weight_user                       0.250000
> weight_project                    0.250000
> weight_department                 0.250000
> weight_job                        0.250000
> weight_tickets_functional         10000
> weight_tickets_share              0
> share_override_tickets            FALSE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   1000
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         50
> halflife_decay_list               none
> policy_hierarchy                  OFS
> weight_ticket                     0.100000
> weight_waiting_time               0.000000
> weight_deadline                   3600000.000000
> weight_urgency                    0.100000
> weight_priority                   0.100000
> max_reservation                   0
> default_duration                  INFINITY
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211721
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>
