[GE issues] [Issue 3113] New - too many qsub -sync jobs wedge the qmaster on failover

templedf dan.templeton at sun.com
Mon Aug 24 20:48:25 BST 2009


http://gridengine.sunsource.net/issues/show_bug.cgi?id=3113
                 Issue #|3113
                 Summary|too many qsub -sync jobs wedge the qmaster on failover
               Component|gridengine
                 Version|6.2u3
                Platform|All
                     URL|
              OS/Version|All
                  Status|NEW
       Status whiteboard|
                Keywords|
              Resolution|
              Issue type|DEFECT
                Priority|P1
            Subcomponent|qmaster
             Assigned to|ernst
             Reported by|templedf


------- Additional comments from templedf at sunsource.net Mon Aug 24 12:48:22 -0700 2009 -------
If the qmaster restarts or fails over with too many qsub -sync processes running, it becomes
completely unresponsive to all client commands. qping is still able to connect and show the
communications traffic, however, and the dtrace script shows no activity at any of the probe
points. With up to 4 or 5 concurrent qsub -sync jobs the problem does not appear; with more than
30 it definitely does. I did not test to find the exact threshold at which the problem starts to
occur.

To reproduce:

> qsub -sync y $SGE_ROOT/examples/jobs/sleeper.sh 600 &
[1] 16669
> Your job 29 ("Sleeper") has been submitted
[32 qsubs removed for ease of reading]
> qsub -sync y $SGE_ROOT/examples/jobs/sleeper.sh 600 &
[34] 16709
> Your job 59 ("Sleeper") has been submitted
> qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
     29 0.55500 Sleeper    dant         r     08/18/2009 13:15:22 all.q@ultra20                         1        
     30 0.55500 Sleeper    dant         r     08/18/2009 13:15:22 all.q@ultra20                         1        
     31 0.55500 Sleeper    dant         qw    08/18/2009 13:15:19                                    1        
...
     59 0.00000 Sleeper    dant         qw    08/18/2009 13:15:26                                    1        
> su -
Password: 
Sun Microsystems Inc.	SunOS 5.11	snv_112	November 2008
# qconf -km
root@ultra20 kills qmaster
# pgrep master
# The qmaster has gone down.  Waiting to reconnect.
[33 copies of the error message removed for readability]
error: commlib error: can't connect to service (Connection refused)
[33 copies of the error message removed for readability]
error: unable to send message to qmaster using port 6444 on host "ultra20": got send error
[33 copies of the error message removed for readability]
# default/common/sgemaster 
   starting sge_qmaster
#^D
> Reconnected to qmaster.
[33 copies of the message removed for readability]
[Wait about two minutes]
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
[This error message repeats forever]
> qstat
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
^C> qping ultra20 6444 qmaster 1
08/18/2009 13:19:23 endpoint ultra20/qmaster/1 at port 6444 is up since 208 seconds
08/18/2009 13:19:24 endpoint ultra20/qmaster/1 at port 6444 is up since 209 seconds
08/18/2009 13:19:25 endpoint ultra20/qmaster/1 at port 6444 is up since 210 seconds
^C>qping -info ultra20 6444 qmaster 1
08/18/2009 13:44:05:
SIRM version:             0.1
SIRM message id:          1
start time:               08/18/2009 13:15:55 (1250626555)
run time [s]:             1690
messages in read buffer:  74
messages in write buffer: 0
nr. of connected clients: 33
status:                   2
info:                     MAIN: E (1689.34) | signaler000: E (1689.27) | event_master000: E (1685.75) | timer000: E (1675.17) | worker000: E (1685.75) | worker001: E (1685.75) | listener000: E (1685.75) | listener001: E (1685.75) | scheduler000: E (0.17) | ERROR
Monitor:                  disabled

Sample DTrace output after the qmaster hangs:
# ./monitor.sh -interval 5
dtrace: script './monitor.d' matched 64 probes
CPU     ID                    FUNCTION:NAME
  0      1                           :BEGIN                 Time |   #wrt  wrt/ms |#rep #gdi #ack|   #dsp  dsp/ms    #sad|   #snd    #rcv| #in++   #in--  #out++  #out--|  #lck0  #ulck0   #lck1  #ulck1
  0  87367                          :tick-5 2009 Aug 18 13:21:05 |       0       0|   0    0    0|      0       0       0|      0       0|     0       0       0       0|      0       0       1       1
  0  87367                          :tick-5 2009 Aug 18 13:21:05 |       0       0|   0    0    0|      0       0       0|      0       0|     0       0       0       0|      0       0       0       0
  0  87367                          :tick-5 2009 Aug 18 13:21:05 |       0       0|   0    0    0|      0       0       0|      0       0|     0       0       0       0|      0       0       0       0
  0  87367                          :tick-5 2009 Aug 18 13:21:05 |       0       0|   0    0    0|      0       0       0|      0       0|     0       0       0       0|      0       0       0       0
  0  87367                          :tick-5 2009 Aug 18 13:21:05 |       0       0|   0    0    0|      0       0       0|      0       0|     0       0       0       0|      0       0       0       0
  0  87367                          :tick-5 2009 Aug 18 13:21:06 |       0       0|   0    0    0|      0       0       0|      0       0|     0       0       0       0|      0       0       0       0
  0  87367                          :tick-5 2009 Aug 18 13:21:06 |       0       0|   0    0    0|      0       0       0|      0       0|     0       0       0       0|      0       0       0       0

Evaluation:

The root cause of this issue is the introduction of the worker-thread concept in qmaster. As a
result, the locking mechanism for GDI request processing has become much more coarse-grained than
before: essentially, the entire GDI processing path is protected by GLOBAL_LOCK. The access type
of the request (read/write) does not change that; it only determines whether the lock is grabbed
as reader or as writer.

This works fine with one exception: the event master (i.e., a thread inside qmaster) modifying an
event client. That thread acquires Event_Master_Control.mutex first and only then, just before
updating the subscription list, takes GLOBAL_LOCK. This inverts the order used elsewhere, where
GLOBAL_LOCK is held first, so under contention each thread can end up blocked on the lock the
other one holds, and the qmaster wedges.

At present, the routine sge_event_master_process_mod_event_client() is the only place that
exhibits this inverted ordering scheme. The good news: it looks like we can safely swap the
locking order there without causing any damage.
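
For illustration, a minimal pthreads sketch of the inversion and of the proposed swap. It assumes,
as the evaluation above implies, that the canonical order is GLOBAL_LOCK first; the function names
are hypothetical and this is not the actual qmaster code:

/* Hypothetical sketch of the lock-order inversion; not actual qmaster code. */
#include <pthread.h>

static pthread_rwlock_t global_lock        = PTHREAD_RWLOCK_INITIALIZER;  /* GLOBAL_LOCK */
static pthread_mutex_t  event_master_mutex = PTHREAD_MUTEX_INITIALIZER;   /* Event_Master_Control.mutex */

/* Canonical path: GLOBAL_LOCK first, then the event master mutex. */
static void worker_path(void)
{
    pthread_rwlock_wrlock(&global_lock);
    pthread_mutex_lock(&event_master_mutex);   /* may block behind the event master */
    /* ... touch event client state ... */
    pthread_mutex_unlock(&event_master_mutex);
    pthread_rwlock_unlock(&global_lock);
}

/* Inverted path, as described for sge_event_master_process_mod_event_client():
 * mutex first, then GLOBAL_LOCK.  Run concurrently with worker_path(), each
 * thread can end up holding the lock the other one needs -- a deadlock. */
static void event_master_path_broken(void)
{
    pthread_mutex_lock(&event_master_mutex);
    pthread_rwlock_wrlock(&global_lock);       /* may block behind a worker */
    /* ... update the subscription list ... */
    pthread_rwlock_unlock(&global_lock);
    pthread_mutex_unlock(&event_master_mutex);
}

/* Proposed fix: swap the order so every path takes GLOBAL_LOCK first. */
static void event_master_path_fixed(void)
{
    pthread_rwlock_wrlock(&global_lock);
    pthread_mutex_lock(&event_master_mutex);
    /* ... update the subscription list ... */
    pthread_mutex_unlock(&event_master_mutex);
    pthread_rwlock_unlock(&global_lock);
}

int main(void)
{
    worker_path();                  /* fine on its own */
    event_master_path_fixed();      /* fine on its own and under contention */
    event_master_path_broken();     /* safe single-threaded; deadlocks only under contention */
    return 0;
}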
