Opened 8 years ago

Closed 5 years ago

#710 closed defect (fixed)

IZ3113: too many qsub -sync jobs wedge the qmaster on failover

Reported by: templedf     Owned by:
Priority:    high         Milestone:
Component:   sge          Version:   6.2u3
Severity:    minor        Keywords:  qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3113]

        Issue #: 3113              Platform: All        Reporter: templedf (templedf)
      Component: gridengine              OS: All
   Subcomponent: qmaster             Version: 6.2u3          CC: None defined
         Status: NEW                Priority: P1
     Resolution:                  Issue type: DEFECT
                            Target milestone: ---
    Assigned to: ernst (ernst)
     QA Contact: ernst
            URL:
        Summary: too many qsub -sync jobs wedge the qmaster on failover
Status whiteboard:
    Attachments:

Issue 3113 blocks:
Votes for issue 3113:


   Opened: Mon Aug 24 12:48:00 -0700 2009 
------------------------


If the qmaster restarts or fails over while too many qsub -sync processes are running, it becomes completely unresponsive to all client
commands. qping is still able to connect and show the communication traffic, however, and the dtrace script shows no activity at any of
the probe points. With up to 4 or 5 such processes the problem does not appear; with more than 30 it reliably appears. I did not test to
find the threshold at which the problem starts to occur.

To reproduce:

> qsub -sync y $SGE_ROOT/examples/jobs/sleeper.sh 600 &
[1] 16669
> Your job 29 ("Sleeper") has been submitted
[32 qsub's removed for ease of reading]
> qsub -sync y $SGE_ROOT/examples/jobs/sleeper.sh 600 &
[34] 16709
> Your job 59 ("Sleeper") has been submitted
> qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
     29 0.55500 Sleeper    dant         r     08/18/2009 13:15:22 all.q@ultra20                      1
     30 0.55500 Sleeper    dant         r     08/18/2009 13:15:22 all.q@ultra20                      1
     31 0.55500 Sleeper    dant         qw    08/18/2009 13:15:19                                    1
...
     59 0.00000 Sleeper    dant         qw    08/18/2009 13:15:26                                    1
> su -
Password:
Sun Microsystems Inc.   SunOS 5.11      snv_112 November 2008
# qconf -km
root@ultra20 kills qmaster
# pgrep master
# The qmaster has gone down.  Waiting to reconnect.
[33 copies of the error message removed for readability]
error: commlib error: can't connect to service (Connection refused)
[33 copies of the error message removed for readability]
error: unable to send message to qmaster using port 6444 on host "ultra20": got send error
[33 copies of the error message removed for readability]
# default/common/sgemaster
   starting sge_qmaster
#^D
> Reconnected to qmaster.
[33 copies of the message removed for readability]
[Wait about two minutes]
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
[This error message repeats forever]
> qstat
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
^C> qping ultra20 6444 qmaster 1
08/18/2009 13:19:23 endpoint ultra20/qmaster/1 at port 6444 is up since 208 seconds
08/18/2009 13:19:24 endpoint ultra20/qmaster/1 at port 6444 is up since 209 seconds
08/18/2009 13:19:25 endpoint ultra20/qmaster/1 at port 6444 is up since 210 seconds
^C>qping -info ultra20 6444 qmaster 1
08/18/2009 13:44:05:
SIRM version:             0.1
SIRM message id:          1
start time:               08/18/2009 13:15:55 (1250626555)
run time [s]:             1690
messages in read buffer:  74
messages in write buffer: 0
nr. of connected clients: 33
status:                   2
info:                     MAIN: E (1689.34) | signaler000: E (1689.27) | event_master000: E (1685.75) | timer000: E (1675.17) | worker000: E (1685.75) | worker001: E (1685.75) | listener000: E (1685.75) | listener001: E (1685.75) | scheduler000: E (0.17) | ERROR
Monitor:                  disabled

Sample DTrace output after master hangs:
# ./monitor.sh -interval 5
dtrace: script './monitor.d' matched 64 probes
CPU     ID                    FUNCTION:NAME
  0      1                           :BEGIN                 Time |   #wrt  wrt/ms |#rep #gdi #ack|   #dsp  dsp/ms    #sad|   #snd    #rcv|  #in++   #in--  #out++  #out--|  #lck0  #ulck0   #lck1  #ulck1
  0  87367                          :tick-5 2009 Aug 18 13:21:05 |       0       0|   0    0    0|      0       0       0|      0       0|      0       0       0       0|      0       0       1       1
  0  87367                          :tick-5 2009 Aug 18 13:21:05 |       0       0|   0    0    0|      0       0       0|      0       0|      0       0       0       0|      0       0       0       0
  0  87367                          :tick-5 2009 Aug 18 13:21:05 |       0       0|   0    0    0|      0       0       0|      0       0|      0       0       0       0|      0       0       0       0
  0  87367                          :tick-5 2009 Aug 18 13:21:05 |       0       0|   0    0    0|      0       0       0|      0       0|      0       0       0       0|      0       0       0       0
  0  87367                          :tick-5 2009 Aug 18 13:21:05 |       0       0|   0    0    0|      0       0       0|      0       0|      0       0       0       0|      0       0       0       0
  0  87367                          :tick-5 2009 Aug 18 13:21:06 |       0       0|   0    0    0|      0       0       0|      0       0|      0       0       0       0|      0       0       0       0
  0  87367                          :tick-5 2009 Aug 18 13:21:06 |       0       0|   0    0    0|      0       0       0|      0       0|      0       0       0       0|      0       0       0       0

Evaluation:

The root cause of this issue is the introduction of the worker-thread concept in the qmaster. As a result, the locking for GDI request
processing has become much more coarse-grained than before: basically, the entire GDI processing path is protected by GLOBAL_LOCK. The
access type of the request (read/write) matters only in that we grab the lock as a reader or as a writer accordingly.
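
As a rough sketch (the lock and function names here are illustrative, not SGE's actual API), the GDI path effectively does the following:

    /* Illustrative sketch only; global_lock and handle_gdi_request are
     * hypothetical names, not SGE code. */
    #include <pthread.h>

    static pthread_rwlock_t global_lock = PTHREAD_RWLOCK_INITIALIZER;  /* stands in for GLOBAL_LOCK */

    void handle_gdi_request(int is_write_request)
    {
        /* The whole request runs under GLOBAL_LOCK, however little qmaster
         * state it touches; the request type only decides reader vs. writer. */
        if (is_write_request)
            pthread_rwlock_wrlock(&global_lock);
        else
            pthread_rwlock_rdlock(&global_lock);

        /* ... process the GDI request ... */

        pthread_rwlock_unlock(&global_lock);
    }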

This works fine with one exception: the event master (i.e., within the qmaster itself) modifying an event client. That thread acquires
Event_Master_Control.mutex first and only then, just before updating the subscription list, takes GLOBAL_LOCK, which is the reverse of
the order used everywhere else.
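
A minimal self-contained illustration of that inversion (all names are hypothetical stand-ins, and the toy program below can genuinely
deadlock when both threads run at once, which is the point):

    /* Toy demonstration of the ABBA lock-order inversion; not SGE code.
     * Compile with: cc demo.c -lpthread */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_rwlock_t global_lock = PTHREAD_RWLOCK_INITIALIZER; /* stands in for GLOBAL_LOCK */
    static pthread_mutex_t  em_mutex    = PTHREAD_MUTEX_INITIALIZER;  /* stands in for Event_Master_Control.mutex */

    /* GDI worker: GLOBAL_LOCK first, event master mutex second. */
    static void *worker(void *unused)
    {
        (void)unused;
        pthread_rwlock_wrlock(&global_lock);
        pthread_mutex_lock(&em_mutex);           /* e.g. to hand an event to the event master */
        puts("worker: acquired both locks");
        pthread_mutex_unlock(&em_mutex);
        pthread_rwlock_unlock(&global_lock);
        return NULL;
    }

    /* Event master modifying an event client: mutex first, then GLOBAL_LOCK
     * just before touching the subscription list, i.e. the reverse order. */
    static void *event_master(void *unused)
    {
        (void)unused;
        pthread_mutex_lock(&em_mutex);
        pthread_rwlock_wrlock(&global_lock);     /* can deadlock against worker() */
        puts("event master: acquired both locks");
        pthread_rwlock_unlock(&global_lock);
        pthread_mutex_unlock(&em_mutex);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, event_master, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }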

At present, the routine sge_event_master_process_mod_event_client() is the only place that exhibits this incorrect ordering. The good
news: it looks like we can safely swap the locking order there without causing damage.
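
Sketched against the toy program above (again with hypothetical names, not the actual patch), the fix amounts to making the event master
use the same acquisition order as everyone else:

    /* Hypothetical fix: take GLOBAL_LOCK before Event_Master_Control.mutex,
     * matching the order used by the GDI workers. */
    static void *event_master_fixed(void *unused)
    {
        (void)unused;
        pthread_rwlock_wrlock(&global_lock);     /* 1st: same order as the workers */
        pthread_mutex_lock(&em_mutex);           /* 2nd: now safe */
        /* ... update the event client's subscription list ... */
        pthread_mutex_unlock(&em_mutex);
        pthread_rwlock_unlock(&global_lock);
        return NULL;
    }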

Change History (2)

comment:1 Changed 7 years ago by dlove

  • Priority changed from highest to high
  • Severity set to minor

comment:2 Changed 5 years ago by dlove

  • Resolution set to fixed
  • Status changed from new to closed