Opened 13 years ago

Closed 9 years ago

#577 closed feature (fixed)

IZ2740: Parallel jobs should be handled as a group

Reported by: reuti Owned by: Dave Love <…>
Priority: normal Milestone:
Component: sge Version: 6.1u5
Severity: minor Keywords: kernel patch


[Imported from gridengine issuezilla]

        Issue #:      2740             Platform:     All       Reporter: reuti (reuti)
       Component:     gridengine          OS:        All
     Subcomponent:    kernel           Version:      6.1u5        CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    FEATURE
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     andreas
       * Summary:     Parallel jobs should be handled as a group
   Status whiteboard:

     Issue 2740 blocks:
   Votes for issue 2740:

   Opened: Sat Sep 27 13:38:00 -0700 2008 

If one process of a parallel job gets suspended for whatever reason, the complete job should be
affected, as a partial parallel job can't continue anyway. For now, a suspension of a queue with a slave
process on it is simply not delivered: neither to the slave process on this node nor, of course, to the
master of this parallel job.

Even if the intention is still that the suspension of a parallel job must be handled by a custom method,
this custom suspend_method should be invoked on the master node of this parallel job (like the
resume_method later on).

From the email discussion:

I meant something different. You have one parallel job with just 2 slots running on node1 (master) and
node2 (slave) in a queue called parallel.q. On node2 a serial job starts in a superordinated queue
serial.q (or another parallel job with a different node allocation). Although the queue instance
parallel@node2 is flagged as "S" suspended, no signal is sent to the parallel job running there.

This would lead to a further discussion: should the slave-execd talk to master-execd to suspend the
complete job? Most likely it can't run anyway when one of the slaves is suspended.

IMO, it should, since it really doesn't make sense to suspend part of a job any more than it makes sense
to kill part of a job or adjust the priority of part of a job.  A parallel job, whether it is distributed or not,
should be treated as a group of related processes where the entire job is treated as a unit.

   ------- Additional comments from rayson Sun Sep 28 23:27:46 -0700 2008 -------
If I read the code correctly, suspension on subordinate is triggered by qmaster,
not from the local execds. sge_signal_queue() and signal_slave_jobs_in_queue()
should be able to suspend the whole parallel job when any slave tasks get
subordinate suspended.

sge_signal_queue() calls signal_slave_jobs_in_queue(), which has:

  /* search master queue - needed for signalling of a job */

It eventually calls sge_signal_queue() when it finds the right master queue. And
then, we will hit the bug that Ron found when it calls signal_slave_tasks_of_job():

   if (!jep) {/* signalling a queue ? - handle slave jobs in this queue */
      signal_slave_jobs_in_queue(ctx, how, qep, monitor);
   } else {/* is this the master queue of this job to signal ? - then decide
              whether slave tasks also must get signalled */
      if (!strcmp(lGetString(lFirst(lGetList(jatep,
            JG_qname), lGetString(qep, QU_full_name))) {
         signal_slave_tasks_of_job(ctx, how, jep, jatep, monitor);

I believe we will wait for the answer from Shannon to see what else is needed...


   ------- Additional comments from reuti Mon Sep 29 06:16:02 -0700 2008 -------
But shouldn't all slots of the parallel job then show up as suspended (or more correctly: subordinated)
(please see below)? Maybe there should be another state for the job: "g" - suspended because at least
one process of the group was suspended (for whatever reason). By looking at "qstat -f", it would be
easy to spot the reason: all processes in "qstat -g t" are "g" for a job, and on at least one of them the
queue is in state "S" or "s".

This feature should also only be enabled by an additional switch or a qmaster_params entry.
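
The proposed "g" state can be sketched as a simple rule over the slots of one job. This is only an illustration of the proposal, not existing SGE code: `struct demo_slot` and `demo_group_state()` are hypothetical names, and the single-character states are assumed to match what qstat prints.

```c
#include <stddef.h>

/* Hypothetical model: one entry per slot of a parallel job. */
struct demo_slot {
    char queue_state; /* 'S' or 's' if the queue instance is suspended, 'r' otherwise */
};

/* Returns the state letter to show for every slot of the job:
   'g' if at least one slot sits in a suspended queue, 'r' otherwise. */
static char demo_group_state(const struct demo_slot *slots, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (slots[i].queue_state == 'S' || slots[i].queue_state == 's') {
            return 'g';
        }
    }
    return 'r';
}
```

Under this rule a single subordinated queue instance is enough to mark the whole job, which is the behaviour asked for in this issue.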


Strange observation: during my tests with a 4 node parallel job and one serial job, it happened that

- none of the parallel processes shows "S"
- only one of the parallel processes shows "S"
- two or more (even all) show "S"

$ qstat -g t
job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID
    386 0.61000   reuti        r     09/29/2008 14:39:04 parallel@node10                SLAVE
    386 0.61000   reuti        S     09/29/2008 14:39:04 parallel@node12                SLAVE
    386 0.61000   reuti        S     09/29/2008 14:39:04 parallel@node14                MASTER
    386 0.61000   reuti        S     09/29/2008 14:39:04 parallel@node15                SLAVE
    394 0.50125    reuti        r     09/29/2008 15:06:32 vast@node12                    MASTER

Why not the process on node10?

But indeed: the processes on node 14 get the STOP, although the serial job runs on node 12. So it's
really designated in some way already.

   ------- Additional comments from svdavidson Mon Sep 29 07:11:32 -0700 2008 -------
By looking at the trace files for the slave tasks, I noticed that when
suspending a queue using qmod -s, the local processes all get sent a SIGSTOP
signal, but the remote processes do not.  However, when unsuspending the queue,
both the local and the remote processes receive a SIGCONT signal.

   ------- Additional comments from svdavidson Mon Sep 29 11:05:53 -0700 2008 -------
The signal events are being sent by the qmaster to the parallel queues.

 82040  19471 46922316396864     JOB 62: sent signal STOP (retry after 60 seconds) host: prod-0001
 82136  19471 46922316396864     JOB 62: sent signal STOP (retry after 60 seconds) host: prod-0002

When the execution daemon receives the signal event, in
execd_signal_queue.c:signal_job(), the job state is marked as SUSPENDED and
sge_execd_deliver_signal() is called.

         state = lGetUlong(jatep, JAT_state);
         if (!ISSET(state,JSUSPENDED)) {
            suspend_change = 1;
         }
         SETBIT(JSUSPENDED, state);
         CLEARBIT(JRUNNING, state);
         lSetUlong(jatep, JAT_state, state);

         /* if this is a stop signal for a job
            which is in at least ONE queue
            which is already stopped we
            do not deliver the signal */

         getridofjob = sge_execd_deliver_signal(signal, jep, jatep);

In sge_execd_deliver_signal(), in the SIGSTOP handling code, this same state is
checked and if the state is SUSPENDED, no signal is sent.

   /* Simply apply signal to all subtasks of the job
      except in case of SGE_MIGRATE when there is a
      ckpt env with "migrate on suspend" configured */
   queue_already_suspended = (lGetUlong(jatep, JAT_state)&JSUSPENDED);
   if (!(sig == SGE_MIGRATE
         && (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND))
         && !queue_already_suspended) {
      lListElem *petep;
      /* signal each pe task */
      for_each (petep, lGetList(jatep, JAT_task_list)) {
         if (sge_kill((int)lGetUlong(petep, PET_pid), sig,
            lGetUlong(jep, JB_job_number), lGetUlong(jatep, JAT_task_number),
            lGetString(petep, PET_id))==-2)
            getridofjob = 1;
      }
   }

For SIGSTOP, the state is always suspended, so no STOP signals are ever sent to
slave tasks.
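
The ordering problem can be reproduced in isolation. The following is a minimal sketch, not the SGE code: the flag values, `struct demo_task` and the `demo_*` functions are hypothetical stand-ins for `JAT_state`, `signal_job()` and `sge_execd_deliver_signal()`.

```c
#include <stdbool.h>

/* Hypothetical flag values; the real JSUSPENDED/JRUNNING come from the SGE headers. */
#define DEMO_JRUNNING   0x1u
#define DEMO_JSUSPENDED 0x2u

struct demo_task {
    unsigned state;      /* stand-in for JAT_state */
    int stops_delivered; /* counts the STOP signals that would have been sent */
};

/* Mirrors the caller: the task is flagged as suspended before delivery. */
static void demo_mark_suspended(struct demo_task *t)
{
    t->state |= DEMO_JSUSPENDED;
    t->state &= ~DEMO_JRUNNING;
}

/* Mirrors the delivery step. With check_state set, the signal is skipped
   whenever the suspended bit is already present. */
static bool demo_deliver_stop(struct demo_task *t, bool check_state)
{
    if (check_state && (t->state & DEMO_JSUSPENDED)) {
        return false; /* nothing sent */
    }
    t->stops_delivered++;
    return true;
}

/* Runs the sequence "mark, then deliver" and reports whether a STOP went out. */
static bool demo_suspend(struct demo_task *t, bool check_state)
{
    demo_mark_suspended(t);
    return demo_deliver_stop(t, check_state);
}
```

With the check enabled, `demo_suspend()` can never deliver a STOP, because the bit it tests was set one step earlier. With the check disabled, the signal goes out once.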

The variable name is queue_already_suspended, but it's actually checking to see
if the task has been suspended. Can anyone explain this check?

Note: This code also has the same problem reported by Ron, where the
CHECKPOINT_SUSPEND is being compared with an OR instead of an AND.
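
To see why the OR form is wrong, here is a small self-contained sketch. The bit value `DEMO_CHECKPOINT_SUSPEND` is an assumption made for illustration (the real `CHECKPOINT_SUSPEND` is defined in the SGE headers), and both helper functions are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical bit value; any non-zero constant shows the same effect. */
#define DEMO_CHECKPOINT_SUSPEND 0x4u

/* Buggy form: OR-ing with a non-zero constant can never yield zero,
   so this is true for every job, checkpointing or not. */
static bool demo_ckpt_suspend_or(unsigned attr)
{
    return (attr | DEMO_CHECKPOINT_SUSPEND) != 0;
}

/* Intended form: AND isolates the bit, so the result depends on attr. */
static bool demo_ckpt_suspend_and(unsigned attr)
{
    return (attr & DEMO_CHECKPOINT_SUSPEND) != 0;
}
```

For a job with no checkpoint attributes (`attr == 0`) the OR form still reports "migrate on suspend", while the AND form does not.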

   ------- Additional comments from svdavidson Mon Sep 29 12:28:36 -0700 2008 -------
The actual problem is that parallel job suspension is broken.  I removed the
check of queue_already_suspended in sge_execd_deliver_signal(), and suspending
of parallel jobs works now. Suspending of queues, jobs, and queue subordination
all started working for parallel jobs. The only remaining question is what is
supposed to be the purpose of the queue_already_suspended check?

   ------- Additional comments from svdavidson Sat Oct 4 13:13:09 -0700 2008 -------
I have been running the patched code for about a week and it is working for
suspending parallel jobs.  The patches I used are included below.  The patches
are based on the SGE 6.1u3 source code.

diff -ru gridengine-V61u3.ORIG/source/daemons/execd/execd_signal_queue.c
--- gridengine-V61u3.ORIG/source/daemons/execd/execd_signal_queue.c     2007-05-09 05:36:51.000000000 -0500
+++ gridengine-V61u3/source/daemons/execd/execd_signal_queue.c  2008-10-04 12:56:55.630669364 -0500
@@ -236,7 +236,9 @@
 lListElem *jep,
 lListElem *jatep
 ) {
+#if 0
    int queue_already_suspended;
    int getridofjob = 0;

    DENTER(TOP_LAYER, "sge_execd_deliver_signal");
@@ -288,19 +290,27 @@

    DPRINTF(("(sig==SGE_MIGRATE) = %d (ckpt on suspend) = %d %d\n",
-      (sig == SGE_MIGRATE), lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND,
+      (sig == SGE_MIGRATE), lGetUlong(jep, JB_checkpoint_attr)&CHECKPOINT_SUSPEND,
          lGetUlong(jep, JB_checkpoint_attr)));
    /* Simply apply signal to all subtasks of the job
       except in case of SGE_MIGRATE when there is a
       ckpt env with "migrate on suspend" configured */
+#if 0
    queue_already_suspended = (lGetUlong(jatep, JAT_state)&JSUSPENDED);
-   if (!(sig == SGE_MIGRATE
-         && (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND))
+   DPRINTF(("queue_already_suspended = %d\n", queue_already_suspended));
+   if (!(sig == SGE_MIGRATE
+         && (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND))
          && !queue_already_suspended) {
+   if (!(sig == SGE_MIGRATE
+         && (lGetUlong(jep, JB_checkpoint_attr)&CHECKPOINT_SUSPEND))) {
       lListElem *petep;
+      DPRINTF(("signaling each pe task with signal %d\n", sig));
       /* signal each pe task */
       for_each (petep, lGetList(jatep, JAT_task_list)) {
+         DPRINTF(("signaling pe task pid=%d\n", lGetUlong(petep, PET_pid)));
          if (sge_kill((int)lGetUlong(petep, PET_pid), sig,
             lGetUlong(jep, JB_job_number), lGetUlong(jatep, JAT_task_number),
             lGetString(petep, PET_id))==-2)
@@ -308,6 +318,8 @@

+   DPRINTF(("lGetUlong(jatep, JAT_status) = %d, JSLAVE = %d\n", lGetUlong(jatep, JAT_status), JSLAVE));
    if (lGetUlong(jatep, JAT_status)!=JSLAVE)
       if (sge_kill((int)lGetUlong(jatep, JAT_pid), sig, lGetUlong(jep,
                         lGetUlong(jatep, JAT_task_number), NULL)==-2)
diff -ru gridengine-V61u3.ORIG/source/daemons/qmaster/sge_qmod_qmaster.c
--- gridengine-V61u3.ORIG/source/daemons/qmaster/sge_qmod_qmaster.c     2007-10-22 05:05:04.000000000 -0500
+++ gridengine-V61u3/source/daemons/qmaster/sge_qmod_qmaster.c  2008-09-26 15:51:43.473894911 -0500
@@ -1302,7 +1302,7 @@
    /* do not signal slave tasks in case of checkpointing jobs with
       STOP/CONT when suspending means migration */
    if ((how==SGE_SIGCONT || how==SGE_SIGSTOP) &&
-      (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND)!=0) {
+      (lGetUlong(jep, JB_checkpoint_attr)&CHECKPOINT_SUSPEND)!=0) {
       DPRINTF(("omit signaling - checkpoint script does action for whole job\n"));

   ------- Additional comments from reuti Sat Oct 4 14:43:33 -0700 2008 -------
Did you also try with subordination like my configuration which I posted above? All slave tasks are always
honored and none is skipped?

Change History (2)

comment:1 Changed 10 years ago by dlove

  • Keywords patch added; removed
  • Severity set to minor

comment:2 Changed 9 years ago by Dave Love <…>

  • Owner set to Dave Love <…>
  • Resolution set to fixed
  • Status changed from new to closed

In [4278/sge]:

Fix #577: /IZ2740 fix signalling parallel jobs (from Shannon Davidson)
Doesn't remove the queue_already_suspended check
