[GE users] Suspending Parallel Jobs

Ron Chen ron_chen_123 at yahoo.com
Thu Sep 25 22:43:35 BST 2008


I think the problem is in signal_slave_tasks_of_job():

   /* do not signal slave tasks in case of checkpointing jobs with
      STOP/CONT when suspending means migration */
   if ((how==SGE_SIGCONT || how==SGE_SIGSTOP) &&
      (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND)!=0) {
      ...
      return;
   }

I think that check is meant to apply only to checkpointing jobs, but "(lGetUlong(jep, JB_checkpoint_attr) | CHECKPOINT_SUSPEND) != 0" is always true, so the function returns early for every job that is sent SGE_SIGSTOP or SGE_SIGCONT.

The reason is that even if lGetUlong(jep, JB_checkpoint_attr) returns 0, CHECKPOINT_SUSPEND (#defined as 0x00000004) is nonzero, so the bitwise OR always yields a nonzero value.

I think the fix is to replace "|" with "&", so that the test is true only when the CHECKPOINT_SUSPEND bit is actually set:

   if ((how==SGE_SIGCONT || how==SGE_SIGSTOP) &&
      (lGetUlong(jep, JB_checkpoint_attr)&CHECKPOINT_SUSPEND)!=0) {
      ...
      return;
   }
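
To see the difference outside of SGE, here is a minimal standalone sketch. It assumes only the CHECKPOINT_SUSPEND value quoted above; the function names are made up for illustration, and the "attr" argument stands in for the result of lGetUlong(jep, JB_checkpoint_attr). It is not the actual SGE code.

```c
#include <assert.h>
#include <stdbool.h>

/* Value as quoted in this message; the real definition lives in the SGE headers. */
#define CHECKPOINT_SUSPEND 0x00000004

/* Hypothetical stand-in for the original test.
   attr plays the role of lGetUlong(jep, JB_checkpoint_attr). */
bool suspend_flag_set_with_or(unsigned long attr)
{
   /* OR-ing with a nonzero constant can never produce 0 */
   return (attr | CHECKPOINT_SUSPEND) != 0;
}

/* Hypothetical stand-in for the proposed fix. */
bool suspend_flag_set_with_and(unsigned long attr)
{
   /* AND masks out everything except the CHECKPOINT_SUSPEND bit */
   return (attr & CHECKPOINT_SUSPEND) != 0;
}

/* Illustrative self-check, not part of SGE. */
void check_suspend_flag_tests(void)
{
   /* attr == 0: a job with no checkpoint attributes set */
   assert(suspend_flag_set_with_or(0UL));      /* wrongly treated as set */
   assert(!suspend_flag_set_with_and(0UL));    /* correctly treated as not set */

   /* attr with only the CHECKPOINT_SUSPEND bit set */
   assert(suspend_flag_set_with_and(0x4UL));

   /* attr with other bits set, but not CHECKPOINT_SUSPEND */
   assert(!suspend_flag_set_with_and(0x3UL));
}
```

With "|", the early return is taken whatever the attribute value is; with "&", it is taken only when the CHECKPOINT_SUSPEND bit is set.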

Let me know if changing this code fixes anything.

 -Ron



--- On Fri, 9/26/08, Shannon V. Davidson <svdavidson at charter.net> wrote:
> Thanks Ron - I'll dig through the code and see if I can find it.
> 
> Shannon
> 
> Ron Chen wrote:
> > I remember seeing SGE code that specifically blocks sending the
> > suspend signal to the MPI tasks. From the list discussions, the
> > reason is that if an MPI job is suspended, then the TCP/IP network
> > socket calls will time out, and the job will then fail.
> >
> > I think if we comment out a few lines of code, or only enable that
> > code by a switch, then it will make many people on this list happy,
> > as it is a FAQ.
> >   
> >  -Ron
> >
> >
> > --- On Fri, 9/26/08, Shannon V. Davidson <svdavidson at charter.net> wrote:
> >
> >> I'm trying to suspend a parallel job using a tight PE integration,
> >> but the non-local MPI tasks are not being suspended. Is the tight
> >> PE integration code supposed to send the SIGSTOP signal to every
> >> MPI task in the job? Is the suspend method executed on every
> >> execution host in a parallel job?
> >>
> >> Thanks,
> >> Shannon


      

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



