[GE users] sge6.2u3 - scheduler dying intermittently

rpatterson patterso at mail.nih.gov
Tue Aug 11 12:55:50 BST 2009


Dan,

Thanks for the reply. We had network problems yesterday, and the cluster
went down again while I was logging qmaster output with only
SGE_ND="true" set. The last few lines of output from that log are below.


I tried restarting the master with debug level 2 set, but it never
completely came back up. I have logs for all of that, but I think they
are a bit too large to include here.

I finally unset SGE_DEBUG_LEVEL and restarted with just SGE_ND set
again, and it came back up.

Each time I've seen this issue I've also seen evidence of drmaa jobs
running at the time, so I'm wondering if someone might be hammering the
master from a drmaa job and killing the scheduler somehow.

We do keep our $SGE_ROOT on an NFS server (a NetApp), so yesterday's
network issues may have been the culprit.

I have increased the "schedule_interval" from 0:0:15 to 0:1:0.
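
For scale, here is a quick sketch of what that change means in
timer-driven scheduler runs per hour. The helper name
interval_to_seconds is made up for illustration and only assumes the
h:m:s form that "qconf -ssconf" prints; it is not part of SGE, and the
numbers are an upper bound on timer-driven runs, not a measurement.

```python
def interval_to_seconds(value: str) -> int:
    """Convert an SGE-style h:m:s interval such as '0:0:15' to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Old and new schedule_interval values from this message.
old = interval_to_seconds("0:0:15")
new = interval_to_seconds("0:1:0")

print(f"old: {old}s -> at most {3600 // old} timer-driven runs/hour")
print(f"new: {new}s -> at most {3600 // new} timer-driven runs/hour")
```

So the change should cut the timer-driven scheduling runs to a quarter
of what they were, assuming nothing else triggers extra runs.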

Thanks for the help!
Ron

Q:814, AQ:822 J:26668(26668), H:245(245), C:67, A:19, D:4, P:4, CKPT:0,
US:287, PR:16, RQS:23, AR:0, S:nd:0/lf:0
--------------STOP-SCHEDULER-RUN-------------
Q:814, AQ:822 J:26667(26667), H:245(245), C:67, A:19, D:4, P:4, CKPT:0,
US:287, PR:16, RQS:23, AR:0, S:nd:0/lf:0
--------------STOP-SCHEDULER-RUN-------------
Q:814, AQ:822 J:26667(26667), H:245(245), C:67, A:19, D:4, P:4, CKPT:0,
US:287, PR:16, RQS:23, AR:0, S:nd:0/lf:0
--------------STOP-SCHEDULER-RUN-------------
Q:814, AQ:822 J:26667(26667), H:245(245), C:67, A:19, D:4, P:4, CKPT:0,
US:287, PR:16, RQS:23, AR:0, S:nd:0/lf:0
--------------STOP-SCHEDULER-RUN-------------
Q:814, AQ:822 J:26667(26667), H:245(245), C:67, A:19, D:4, P:4, CKPT:0,
US:287, PR:16, RQS:23, AR:0, S:nd:0/lf:0
--------------STOP-SCHEDULER-RUN-------------
Q:814, AQ:822 J:26666(26666), H:245(245), C:67, A:19, D:4, P:4, CKPT:0,
US:287, PR:16, RQS:23, AR:0, S:nd:0/lf:0
--------------STOP-SCHEDULER-RUN-------------
Q:814, AQ:822 J:26666(26666), H:245(245), C:67, A:19, D:4, P:4, CKPT:0,
US:287, PR:16, RQS:23, AR:0, S:nd:0/lf:0
--------------STOP-SCHEDULER-RUN-------------
Q:814, AQ:822 J:26666(26666), H:245(245), C:67, A:19, D:4, P:4, CKPT:0,
US:287, PR:16, RQS:23, AR:0, S:nd:0/lf:0
acknowledge timeout after 600 seconds for event client (qsub:1534) on
host "gizmo3.be-md.ncbi.nlm.nih.gov"
error: removing event client (qsub:1534) on host
"gizmo3.be-md.ncbi.nlm.nih.gov" after acknowledge timeout from event
client list
--------------STOP-SCHEDULER-RUN-------------
acknowledge timeout after 600 seconds for event client (drmaa:1540) on
host "loutgen.be-md.ncbi.nlm.nih.gov"
error: removing event client (drmaa:1540) on host
"loutgen.be-md.ncbi.nlm.nih.gov" after acknowledge timeout from event
client list
acknowledge timeout after 600 seconds for event client (drmaa:1541) on
host "loutgen.be-md.ncbi.nlm.nih.gov"
error: removing event client (drmaa:1541) on host
"loutgen.be-md.ncbi.nlm.nih.gov" after acknowledge timeout from event
client list
acknowledge timeout after 600 seconds for event client (drmaa:1543) on
host "linkgen.be-md.ncbi.nlm.nih.gov"
error: removing event client (drmaa:1543) on host
"linkgen.be-md.ncbi.nlm.nih.gov" after acknowledge timeout from event
client list
Q:814, AQ:822 J:26665(26665), H:245(245), C:67, A:19, D:4, P:4, CKPT:0,
US:287, PR:16, RQS:23, AR:0, S:nd:0/lf:0
acknowledge timeout after 600 seconds for event client (drmaa:1547) on
host "linkgen.be-md.ncbi.nlm.nih.gov"
error: removing event client (drmaa:1547) on host
"linkgen.be-md.ncbi.nlm.nih.gov" after acknowledge timeout from event
client list
acknowledge timeout after 600 seconds for event client (drmaa:1549) on
host "linkgen.be-md.ncbi.nlm.nih.gov"
error: removing event client (drmaa:1549) on host
"linkgen.be-md.ncbi.nlm.nih.gov" after acknowledge timeout from event
client list
acknowledge timeout after 600 seconds for event client (drmaa:1550) on
host "linkgen.be-md.ncbi.nlm.nih.gov"
error: removing event client (drmaa:1550) on host
"linkgen.be-md.ncbi.nlm.nih.gov" after acknowledge timeout from event
client list
acknowledge timeout after 600 seconds for event client (drmaa:1552) on
host "linkgen.be-md.ncbi.nlm.nih.gov"
error: removing event client (drmaa:1552) on host
"linkgen.be-md.ncbi.nlm.nih.gov" after acknowledge timeout from event
client list
error: commlib error: got select error (Broken pipe)
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4034")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4089")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4090")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4125")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4127")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4216")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4217")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4251")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4252")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4258")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4259")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4310")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4311")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4333")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4334")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4343")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4344")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4352")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4353")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4386")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4387")
error: commlib error: got select error (Broken pipe)
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4412")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4425")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4426")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4460")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4461")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4463")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4464")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4518")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4519")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4524")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4526")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4587")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4589")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4594")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4595")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4596")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4597")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4603")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4604")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4606")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4607")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4615")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4616")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4629")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/4630")
error: commlib error: got select error (Broken pipe)
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/5362")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/5387")
error: commlib error: got read error (closing
"linkgen.be-md.ncbi.nlm.nih.gov/drmaa/5391")
error: commlib error: got select error (Broken pipe)
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7135")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7139")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7140")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7146")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7147")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7230")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7231")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7238")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7239")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7265")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7266")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7358")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7359")
error: commlib error: got select error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7365")
error: commlib error: got read error (closing
"atom0.be-md.ncbi.nlm.nih.gov/qsub/7366")
error: commlib error: got select error (Broken pipe)
error: commlib error: got read error (closing
"linkgen.be-md.ncbi.nlm.nih.gov/drmaa/7974")

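To check the drmaa hunch against output like the above, the acknowledge
timeouts can be tallied per client type and host. This is only a rough
sketch: the regular expression is written against the message wording
seen in this log, tally_ack_timeouts is a hypothetical helper (not an
SGE tool), and whitespace is matched loosely because the mailer wrapped
each message across lines.

```python
import re
from collections import Counter

# Matches the wording seen in this log only. \s+ between words tolerates
# the line wraps the mailer inserted in the middle of each message.
ACK_TIMEOUT = re.compile(
    r'acknowledge\s+timeout\s+after\s+\d+\s+seconds\s+for\s+event\s+client\s+'
    r'\((\w+):(\d+)\)\s+on\s+host\s+"([^"]+)"'
)


def tally_ack_timeouts(log_text: str) -> Counter:
    """Count acknowledge timeouts per (client name, host)."""
    counts = Counter()
    for name, _client_id, host in ACK_TIMEOUT.findall(log_text):
        counts[(name, host)] += 1
    return counts


# A few lines copied from the output above, including the wraps.
sample = '''acknowledge timeout after 600 seconds for event client (qsub:1534) on
host "gizmo3.be-md.ncbi.nlm.nih.gov"
error: removing event client (qsub:1534) on host
"gizmo3.be-md.ncbi.nlm.nih.gov" after acknowledge timeout from event
client list
acknowledge timeout after 600 seconds for event client (drmaa:1540) on
host "loutgen.be-md.ncbi.nlm.nih.gov"
acknowledge timeout after 600 seconds for event client (drmaa:1543) on
host "linkgen.be-md.ncbi.nlm.nih.gov"
'''

for (name, host), n in sorted(tally_ack_timeouts(sample).items()):
    print(name, host, n)
```

Run over the full excerpt above, this should report one qsub client on
gizmo3, two drmaa clients on loutgen and five on linkgen. The "removing
event client" lines are deliberately not counted, since they repeat the
same client ids.
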
-----Original Message-----
From: templedf [mailto:dan.templeton at sun.com]
Sent: Monday, August 10, 2009 3:18 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] sge6.2u3 - scheduler dying intermittently

If you don't mind tons of debug output, I would turn on either debug
level 2 or 9.  See:

http://blogs.sun.com/templedf/entry/using_debugging_output

That'll give us a much clearer picture of what's happening.

Daniel

rpatterson wrote:
> Recently, I have been having trouble with the scheduler thread dying on
> our master. I assume that this is what's happening because the
> sge_qmaster process is still running, and running jobs continue
> without a problem, but client requests (qsub/qstat) can no longer make a
> connection, and no new jobs are dispatched. Lately, this has been
> happening about once a week.
>
> I started seeing this with SGE 6.2u1 and have recently upgraded to 6.2u3
> and have the same problem. Our cluster has about 250 nodes, with a large
> number of fairly short jobs running/queued all the time (about 1200
> running jobs, and 20K-30K queued). The performance *seems* ok other than
> this issue. All server hosts are running SUSE SLES9 x86_64.
>
> Included below is a snippet of the qmaster log just before the outage
> begins (nothing is logged after this until I restart qmaster). The
> errors looked similar to an issue I saw on this list, so I updated
> "qmaster_params" to include SCHEDULER_TIMEOUT=1200, but that did not
> seem to help. I'm curious if a schedule_interval of 0:0:15 is too long
> in our environment, or if it makes sense to adjust it at all. Right now
> I'm running sge_qmaster with "SGE_ND=true" and logging the output. Any
> other debugging tips would be appreciated.
>
>
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:52:22|event_|sgemaster02|E|no event client known with id 105 to process acknowledgements
> 08/05/2009 23:53:32|event_|sgemaster02|E|no event client known with id 150 to process acknowledgements
> 08/05/2009 23:53:38|event_|sgemaster02|E|no event client known with id 167 to process acknowledgements
> 08/05/2009 23:54:07|event_|sgemaster02|E|no event client known with id 159 to process acknowledgements
> 08/05/2009 23:54:08|event_|sgemaster02|E|no event client known with id 420 to process acknowledgements
> 08/05/2009 23:54:17|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49551) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49555) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49559) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49561) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49569) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49573) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49574) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49576) reregistered - it will need a total update
> 08/05/2009 23:54:19|event_|sgemaster02|E|no event client known with id 172 to process acknowledgements
>
>
> patterso at cfengine1:/panfs/pan1.be-md.ncbi.nlm.nih.gov> qconf -sconf
> #global:
> execd_spool_dir              /var/sge/ncbi/spool
> mailer                       /netmnt/sge62/util/mailer.sh
> xterm                        /usr/bin/X11/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 sh,ksh,bash,csh,tcsh,zsh
> min_uid                      0
> min_gid                      0
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 false
> load_report_time             00:01:00
> max_unheard                  00:60:00
> reschedule_unknown           00:00:00
> loglevel                     log_warning
> administrator_mail           sgeadmin at ncbi.nlm.nih.gov
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               ENABLE_FORCED_QDEL=true,MAX_DYN_EC=1024, \
>                              SCHEDULER_TIMEOUT=1200
> execd_params                 INHERIT_ENV=false,SGE_LIB_PATH=true, \
>                              NOTIFY_KILL=TERM
> reporting_params             accounting=true reporting=true \
>                              flush_time=00:00:15 joblog=true sharelog=00:00:00
> finished_jobs                100
> gid_range                    37000-39999
> qlogin_command               builtin
> qlogin_daemon                builtin
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   50000
> max_jobs                     100000
> max_advance_reservations     10000
> auto_user_oticket            0
> auto_user_fshare             100
> auto_user_default_project    none
> auto_user_delete_time        86400
> delegated_file_staging       false
> reprioritize                 0
> libjvm_path                  /usr/java/jdk1.6.0_03/jre/lib/amd64/server/libjvm.so
> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
> patterso at cfengine1:/panfs/pan1.be-md.ncbi.nlm.nih.gov> qconf -ssconf
> algorithm                         default
> schedule_interval                 0:0:15
> maxujobs                          0
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50,mem_free=2G
> load_adjustment_decay_time        0:5:00
> load_formula                      np_load_avg
> schedd_job_info                   false
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> reprioritize_interval             0:0:0
> halftime                          168
> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> compensation_factor               5.000000
> weight_user                       0.250000
> weight_project                    0.250000
> weight_department                 0.250000
> weight_job                        0.250000
> weight_tickets_functional         10000
> weight_tickets_share              0
> share_override_tickets            FALSE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   1000
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         50
> halflife_decay_list               none
> policy_hierarchy                  OFS
> weight_ticket                     0.100000
> weight_waiting_time               0.000000
> weight_deadline                   3600000.000000
> weight_urgency                    0.100000
> weight_priority                   0.100000
> max_reservation                   0
> default_duration                  INFINITY
>
> ------------------------------------------------------
>
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211721
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211732

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=211814