[GE users] Slotwise subordinate suspension ignores suspend_method

gracklewolf Gary_Smith at vrtx.com
Thu Apr 15 18:31:09 BST 2010

I've configured 3 queues into a slotwise subordinate configuration like so:

E.q: subordinate_list  slots=8(A.q:0:sr)
A.q: subordinate_list  slots=8(P.q:1:sr)
P.q: subordinate_list  NONE
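
For readers unfamiliar with the notation, each slotwise entry above appears to follow the queue_conf(5) form slots=<threshold>(<queue>:<seq_no>:<action>), where the action sr is understood to mean "suspend the shortest-running task first". Below is a minimal shell sketch that pulls one entry apart. Note that parse_slotwise is a hypothetical helper, not part of Grid Engine, and it assumes a single queue per entry.

```shell
# Hypothetical helper (not a Grid Engine command): split one slotwise
# subordinate_list entry of the form slots=N(queue:seq_no:action).
# Assumes exactly one queue inside the parentheses.
parse_slotwise() {
    entry=$1
    threshold=${entry#slots=}      # drop the leading "slots="
    threshold=${threshold%%\(*}    # keep only the number before "("
    inner=${entry#*\(}             # text after "("
    inner=${inner%\)}              # drop the trailing ")"
    queue=${inner%%:*}             # first field: subordinate queue name
    rest=${inner#*:}               # remaining "seq_no:action"
    seq_no=${rest%%:*}             # second field: sequence number
    action=${rest#*:}              # third field: sr or lr
    printf 'threshold=%s queue=%s seq_no=%s action=%s\n' \
        "$threshold" "$queue" "$seq_no" "$action"
}

parse_slotwise 'slots=8(A.q:0:sr)'   # entry from E.q
parse_slotwise 'slots=8(P.q:1:sr)'   # entry from A.q
```

Read this way, E.q subordinates A.q and A.q subordinates P.q, each with a slot threshold of 8.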

I'm running some OpenMPI jobs in P.q. I've configured P.q's suspend_method to use SIGTSTP and its resume_method to use SIGCONT so that the MPI jobs will suspend all of their children properly.

Everything works perfectly if I suspend an OpenMPI job by hand with `qmod -sj <mpi_job_id>'. The master mpirun receives the SIGTSTP signal and broadcasts the expected SIGSTOP signal to its children, and the children all stop. `qmod -usj <mpi_job_id>' resumes the whole MPI job.

However, if E.q or A.q fills up with jobs while an OpenMPI job is running in P.q, the MPI job shows status (S)ubordinate in qstat, but the suspend_method signal is not sent to the master MPI process as it is when the job is suspended by hand. Is this expected behavior? Why the disconnect between subordinate status and suspend_method?
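
One way to confirm on the execution host whether a given process was actually stopped is to look at its state in ps. The following is a rough sketch only: is_stopped is a hypothetical helper, and it relies on the stat output field, which is a common ps extension (procps, BSD) rather than something POSIX guarantees.

```shell
# Hypothetical helper: exit 0 if the process is stopped, 1 if it is not,
# 2 if the PID cannot be inspected. Relies on the non-POSIX "stat" field.
is_stopped() {
    state=$(ps -o stat= -p "$1" 2>/dev/null | tr -d ' ')
    [ -n "$state" ] || return 2    # no such process, or ps gave no output
    case $state in
        T*) return 0 ;;            # "T": stopped by a job-control signal
        *)  return 1 ;;
    esac
}
```

Running is_stopped against the PIDs of mpirun and its ranks while qstat reports the S state would show whether any stop signal reached them.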

Thanks for any help.

