[GE users] Suspending Parallel Jobs

Reuti reuti at staff.uni-marburg.de
Thu Sep 25 23:10:40 BST 2008


On 25.09.2008 at 22:50, Ron Chen wrote:

> I remember seeing SGE code that specifically blocks sending the
> suspend signal to the MPI tasks. From the list discussions, the
> reason is that if an MPI job is suspended, the TCP/IP network
> socket calls will time out, and the job will then fail.

This can be one reason why it's not supported out-of-the-box.
Additional reasons might be:

- you can't stop all processes on all nodes at the same time, which  
can lead to lost messages

- qmod -sj can work (as it targets a complete job), but qmod -sq for
queue instances would stop only some of the processes (as with local
subordination), leaving the others idle and waiting. AFAICS,
subordination also works only on the node where the master task
runs; on nodes where slave tasks are running, SIGSTOP is not sent to
the slave processes either.
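
To illustrate the partial-suspend problem from the last point, here is a
minimal local sketch in Python (POSIX only). It is not SGE code: the two
child processes are hypothetical stand-ins for tasks of one parallel job,
and partial_suspend_demo is a name made up for this example. Only one of
the tasks receives SIGSTOP, as would happen when a single queue instance
is suspended.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for one task of a parallel job: it just sleeps briefly.
TASK = [sys.executable, "-c", "import time; time.sleep(0.3)"]


def partial_suspend_demo():
    """Stop one of two tasks and report what happens to each of them."""
    stopped_task = subprocess.Popen(TASK)  # plays a task on the suspended queue instance
    running_task = subprocess.Popen(TASK)  # plays a task elsewhere, never signalled

    os.kill(stopped_task.pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid return as soon as the child has actually stopped.
    _, status = os.waitpid(stopped_task.pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)

    running_rc = running_task.wait()      # the unsignalled task runs to completion
    frozen = stopped_task.poll() is None  # the stopped task has not finished

    os.kill(stopped_task.pid, signal.SIGCONT)  # resume, as an unsuspend would
    stopped_rc = stopped_task.wait()
    return was_stopped, running_rc, frozen, stopped_rc


if __name__ == "__main__":
    print(partial_suspend_demo())
```

In a real MPI job the unsignalled tasks would not exit on their own; they
would sit waiting for messages from the stopped peer, which is the idling
described above.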

-- Reuti


> I think if we comment out a few lines of code, or only enable that  
> code by a switch, then it will make many people on this list happy,  
> as it is a FAQ.
>
>  -Ron
>
>
> --- On Fri, 9/26/08, Shannon V. Davidson <svdavidson at charter.net>  
> wrote:
>> I'm trying to suspend a parallel job using a tight PE integration,
>> but the non-local MPI tasks are not being suspended.  Is the tight
>> PE integration code supposed to send the SIGSTOP signal to every
>> MPI task in the job?  Is the suspend method executed on every
>> execution host in a parallel job?
>>
>> Thanks,
>> Shannon
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail:
>> users-help at gridengine.sunsource.net