[GE users] sge6.2u3 - scheduler dying intermittently

rpatterson patterso at mail.nih.gov
Wed Aug 12 15:35:34 BST 2009


I tried to start SGE with debug level 2 and then with level 1
yesterday, and both attempts failed. I was finally able to get it
started with just SGE_ND="true" set. During the various start attempts
I saw a few error messages which I'm hoping may offer some clues about
what's happening. In this failed state the master is up but mostly
sleeping, using very little memory or CPU. It will remain that way
indefinitely, and no qhost/qsub/qstat requests succeed (they all fail
with the GDI timeout error).
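
For reference, this is roughly how I've been starting the daemon by
hand (a sketch of my procedure rather than an official recipe; the
output path is just where I happen to redirect things, and "dl" is the
helper function from $SGE_ROOT/util/dl.sh that Daniel's blog post
below describes):

# run as the SGE admin user on the master host
export SGE_ROOT=/netmnt/sge62

# for the debug-level attempts: source dl.sh to get the "dl" function,
# which (as I understand it) exports SGE_DEBUG_LEVEL for a preset level
. $SGE_ROOT/util/dl.sh
dl 2

# SGE_ND=true keeps sge_qmaster in the foreground instead of
# daemonizing, so everything it prints can be captured
export SGE_ND=true
$SGE_ROOT/bin/lx24-amd64/sge_qmaster > /tmp/qmaster.out 2>&1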

#### When trying to start with the debug level set to 1 or 2, I would
see the following:

13   3561 182894214976     locale directory: >/netmnt/sge62/locale<
14   3561 182894214976     package file:     >lx24-amd64/gridengine.mo<
15   3561 182894214976     language (LANG):  >C<
16   3561 182894214976     loading message file: /netmnt/sge62/locale/C/LC_MESSAGES/lx24-amd64/gridengine.mo
17   3561 182894214976     could not open message file - error

There is no "locale" directory in my $SGE_ROOT - should there be?

#### I also see these scheduler timeouts - is this significant? There
are around a dozen of these:

831045   2924  listener000 <-- do_gdi_packet() ../daemons/qmaster/sge_qmaster_process_message.c 287 }
831046   2924  listener000 <-- sge_qmaster_process_message() ../daemons/qmaster/sge_qmaster_process_message.c 175 }

831047   2924  listener000 --> sge_qmaster_process_message() {
831048   2924  listener000 --> do_c_ack() {
831049   2924 scheduler000     pthread_cond_timedwait for events failed 110

... or
843852   3561  listener000 <-- do_gdi_packet() ../daemons/qmaster/sge_qmaster_process_message.c 287 }
843853   3561  listener000 <-- sge_qmaster_process_message() ../daemons/qmaster/sge_qmaster_process_message.c 175 }
843854   3561  listener000 --> sge_qmaster_process_message() {
843855   3561 scheduler000     pthread_cond_timedwait for events failed 110
843856   3561 scheduler000 <-- sge_scheduler_wait_for_event() ../daemons/qmaster/sge_thread_scheduler.c 244 }
843857   3561 scheduler000 --> ec2_need_new_registration() {
843858   3561 scheduler000 <-- ec2_need_new_registration() ../libs/evc/sge_event_client.c 1047 }
843859   3561 scheduler000 --> ec2_set_busy() {
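
If it's relevant: the trailing "110" looks like an errno, and on Linux
errno 110 is ETIMEDOUT, so these lines may just be the scheduler's
timed event wait expiring rather than a hard failure. A quick way to
translate an errno number (assuming perl is on the box):

perl -e '$! = 110; print "$!\n"'    # prints: Connection timed out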


Also, I see the following when starting with just "SGE_ND=true" set,
with no debug info:


Reading in Master_Job_List.
........................................................................
.................................................
........................................................................
.................................................
.......................................................................

read job database with 31201 entries in 18 seconds
error: error opening file "/netmnt/sge62/ncbi/spool/qmaster/./sharetree" for reading: No such file or directory
qmaster hard descriptor limit is set to 8192
qmaster soft descriptor limit is set to 8192
qmaster will use max. 8172 file descriptors for communication
qmaster will accept max. 1024 dynamic event clients
starting up GE 6.2u3 (lx24-amd64)
error: commlib error: got read error (closing "linkgen.be-md.ncbi.nlm.nih.gov/drmaa/1")
Q:0, AQ:822 J:31201(31201), H:245(245), C:67, A:19, D:4, P:4, CKPT:0, US:287, PR:16, RQS:23, AR:0, S:nd:0/lf:0
rule "default rule (spool dir)" in spooling context "flatfile spooling"
failed writing an object
rule "default rule (spool dir)" in spooling context "flatfile spooling"
failed writing an object
rule "default rule (spool dir)" in spooling context "flatfile spooling"
failed writing an object
rule "default rule (spool dir)" in spooling context "flatfile spooling"
failed writing an object
rule "default rule (spool dir)" in spooling context "flatfile spooling"
failed writing an object
rule "default rule (spool dir)" in spooling context "flatfile spooling"
failed writing an object
rule "default rule (spool dir)" in spooling context "flatfile spooling"
failed writing an object
rule "default rule (spool dir)" in spooling context "flatfile spooling"
failed writing an object
rule "default rule (spool dir)" in spooling context "flatfile spooling"
failed writing an object
rule "default rule (spool dir)" in spooling context "flatfile spooling"
failed writing an object
rule "default rule (spool dir)" in spooling context "flatfile spooling"
failed writing an object
rule "default rule (spool dir)" in spooling context "flatfile spooling"
failed writing an object
.... many more of  these

We don't use the sharetree as far as I know, but I assume that the file
should exist under spool/qmaster. As noted before, we use classic
spooling, so I'm not sure what's going on with the "failed writing an
object" errors. They appear only once, right after the master is
started.
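
In case it matters, the share tree configuration can be inspected with
qconf (a quick sketch; -sstree prints the configured tree, or an error
if no share tree is defined):

qconf -sstree    # show the configured share tree, if any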

Any pointers would be appreciated!!



-----Original Message-----
From: templedf [mailto:dan.templeton at sun.com]
Sent: Monday, August 10, 2009 3:18 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] sge6.2u3 - scheduler dying intermittently

If you don't mind tons of debug output, I would turn on either debug
level 2 or 9.  See:

http://blogs.sun.com/templedf/entry/using_debugging_output

That'll give us a much clearer picture of what's happening.

Daniel

rpatterson wrote:
> Recently, I have been having trouble with the scheduler thread dying
> on our master. I assume that this is what's happening because the
> sge_qmaster process is still running, and running jobs continue on
> without a problem, but client requests (qsub/qstat) can no longer
> make a connection, and no new jobs are dispatched. Recently, this
> has been happening about once a week.
>
> I started seeing this with SGE 6.2u1 and have recently upgraded to
> 6.2u3 and have the same problem. Our cluster has about 250 nodes,
> with a large number of fairly short jobs running/queued all the time
> (about 1200 running jobs, and 20K-30K queued). The performance
> *seems* ok other than this issue. All server hosts are running SUSE
> SLES9 x86_64.
>
> Included below is a snippet of the qmaster log just before the outage
> begins (nothing is logged after this until I restart qmaster). The
> errors looked similar to an issue I saw on this list, so I updated
> "qmaster_params" to include SCHEDULER_TIMEOUT=1200, but that did not
> seem to help. I'm curious if a schedule_interval of 0:0:15 is too
> long in our environment, or if it makes sense to adjust it at all.
> Right now I'm running sge_qmaster with "SGE_ND=true" and logging the
> output. Any other debugging tips would be appreciated.
>
>
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:42:41|worker|sgemaster02|E|There are no jobs registered
> 08/05/2009 23:52:22|event_|sgemaster02|E|no event client known with id 105 to process acknowledgements
> 08/05/2009 23:53:32|event_|sgemaster02|E|no event client known with id 150 to process acknowledgements
> 08/05/2009 23:53:38|event_|sgemaster02|E|no event client known with id 167 to process acknowledgements
> 08/05/2009 23:54:07|event_|sgemaster02|E|no event client known with id 159 to process acknowledgements
> 08/05/2009 23:54:08|event_|sgemaster02|E|no event client known with id 420 to process acknowledgements
> 08/05/2009 23:54:17|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49551) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49555) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49559) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49561) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49569) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49573) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49574) reregistered - it will need a total update
> 08/05/2009 23:54:19|worker|sgemaster02|E|event client "qsub" (genome1.be-md.ncbi.nlm.nih.gov/qsub/49576) reregistered - it will need a total update
> 08/05/2009 23:54:19|event_|sgemaster02|E|no event client known with id 172 to process acknowledgements
>
>
> patterso at cfengine1:/panfs/pan1.be-md.ncbi.nlm.nih.gov> qconf -sconf
> #global:
> execd_spool_dir              /var/sge/ncbi/spool
> mailer                       /netmnt/sge62/util/mailer.sh
> xterm                        /usr/bin/X11/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 sh,ksh,bash,csh,tcsh,zsh
> min_uid                      0
> min_gid                      0
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 false
> load_report_time             00:01:00
> max_unheard                  00:60:00
> reschedule_unknown           00:00:00
> loglevel                     log_warning
> administrator_mail           sgeadmin at ncbi.nlm.nih.gov
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               ENABLE_FORCED_QDEL=true,MAX_DYN_EC=1024, \
>                              SCHEDULER_TIMEOUT=1200
> execd_params                 INHERIT_ENV=false,SGE_LIB_PATH=true, \
>                              NOTIFY_KILL=TERM
> reporting_params             accounting=true reporting=true \
>                              flush_time=00:00:15 joblog=true \
>                              sharelog=00:00:00
> finished_jobs                100
> gid_range                    37000-39999
> qlogin_command               builtin
> qlogin_daemon                builtin
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   50000
> max_jobs                     100000
> max_advance_reservations     10000
> auto_user_oticket            0
> auto_user_fshare             100
> auto_user_default_project    none
> auto_user_delete_time        86400
> delegated_file_staging       false
> reprioritize                 0
> libjvm_path                  /usr/java/jdk1.6.0_03/jre/lib/amd64/server/libjvm.so
> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
> patterso at cfengine1:/panfs/pan1.be-md.ncbi.nlm.nih.gov> qconf -ssconf
> algorithm                         default
> schedule_interval                 0:0:15
> maxujobs                          0
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50,mem_free=2G
> load_adjustment_decay_time        0:5:00
> load_formula                      np_load_avg
> schedd_job_info                   false
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> reprioritize_interval             0:0:0
> halftime                          168
> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> compensation_factor               5.000000
> weight_user                       0.250000
> weight_project                    0.250000
> weight_department                 0.250000
> weight_job                        0.250000
> weight_tickets_functional         10000
> weight_tickets_share              0
> share_override_tickets            FALSE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   1000
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         50
> halflife_decay_list               none
> policy_hierarchy                  OFS
> weight_ticket                     0.100000
> weight_waiting_time               0.000000
> weight_deadline                   3600000.000000
> weight_urgency                    0.100000
> weight_priority                   0.100000
> max_reservation                   0
> default_duration                  INFINITY
>

