[GE users] Suspending Parallel Jobs

Ron Chen ron_chen_123 at yahoo.com
Thu Sep 25 21:50:49 BST 2008


I remember seeing SGE code that specifically blocks sending the suspend signal to the MPI tasks. From the list discussions, the reason is that if an MPI job is suspended, the TCP/IP socket calls will time out, and the job will then fail.

I think if we comment out a few lines of code, or enable that code only via a switch, it would make many people on this list happy, as this is a FAQ.
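
To make the switch idea concrete, here is a minimal sketch in Python (not SGE's actual C code, which I have not reproduced here). The names should_forward, forward, and allow_parallel_suspend are all hypothetical; the only real calls are os.killpg and the signal constants. It assumes each task runs in its own process group, and it gates only SIGSTOP: everything else is always passed on.

```python
import os
import signal


def should_forward(signum, is_parallel_task, allow_parallel_suspend):
    """Decide whether a signal is passed on to a task.

    Hypothetical policy: SIGSTOP reaches a parallel (MPI) task only when
    the switch is on. All other signals, and all signals to non-parallel
    tasks, are forwarded unconditionally.
    """
    if is_parallel_task and signum == signal.SIGSTOP:
        return allow_parallel_suspend
    return True


def forward(pgid, signum, is_parallel_task, allow_parallel_suspend=False):
    """Send signum to process group pgid if the policy allows it.

    Returns True if the signal was sent, False if it was blocked.
    The default (switch off) mirrors the behaviour described above,
    where suspend is not delivered to the MPI tasks.
    """
    if not should_forward(signum, is_parallel_task, allow_parallel_suspend):
        return False
    os.killpg(pgid, signum)  # signal every process in the task's group
    return True
```

With the switch left off, forward(pgid, signal.SIGSTOP, True) returns False and sends nothing; with allow_parallel_suspend=True the whole process group is stopped and can later be resumed with signal.SIGCONT. Whether the MPI library survives the pause still depends on its socket timeouts, which this sketch does not address.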

 -Ron


--- On Fri, 9/26/08, Shannon V. Davidson <svdavidson at charter.net> wrote:
> I'm trying to suspend a parallel job using a tight PE integration, but
> the non-local MPI tasks are not being suspended.  Is the tight PE
> integration code supposed to send the SIGSTOP signal to every MPI task
> in the job?  Is the suspend method executed on every execution host in a
> parallel job?
> 
> Thanks,
> Shannon
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net


      




