[GE users] Suspending Parallel Jobs

Reuti reuti at staff.uni-marburg.de
Fri Sep 26 10:55:39 BST 2008


On 26.09.2008 at 05:23, Shannon V. Davidson wrote:

> Reuti wrote:
>> On 25.09.2008 at 22:50, Ron Chen wrote:
>>
>>> I remember seeing SGE code that specifically blocks sending the
>>> suspend signal to the MPI tasks. From the list discussions, the
>>> reason is that if an MPI job is suspended, then the TCP/IP network
>>> socket calls will time out, and the job will then fail.
>>
>> This can be one reason why it's not supported out-of-the-box.
>> Additional reasons might be:
>>
>> - you can't stop all processes on all nodes at the same time,  
>> which can lead to lost messages
>>
>> - qmod -sj can work (it addresses a complete job), but qmod -sq for
>> queue instances would stop only some of the processes (as with local
>> subordination), leaving the others idling and waiting (AFAICS,
>> subordination also only works on the node where the master runs;
>> on nodes where slave tasks are running, SIGSTOP is not sent to the
>> slave processes running there either)
>
> Subordination is what I was trying to use.  It seemed like the
> easiest way to allow a higher priority parallel job to jump in
> and steal the resources from lower priority jobs until it
> completed.  As you mention above, it suspended the local tasks but
> not the remote tasks.

I meant something different. You have one parallel job with just 2
slots running on node1 (master) and node2 (slave) in a queue called
parallel.q. On node2 a serial job starts in a superordinated queue
serial.q (or another parallel job with a different node allocation).
Although the queue instance of parallel.q on node2 is flagged as "S"
(suspended), no signal is sent to the parallel job running there.

This would lead to a further discussion: should the slave-execd talk  
to master-execd to suspend the complete job? Most likely it can't run  
anyway when one of the slaves is suspended.
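
To make the two behaviours concrete, here is a small toy model in Python. It is not SGE code and calls no SGE API: the names Task, ParallelJob, suspend_instance_observed and suspend_instance_escalated are all hypothetical, and the "escalated" variant is only one possible reading of the idea above (a slave-side suspend is escalated to the whole job).

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    node: str
    role: str  # "master" or "slave"
    stopped: bool = False


@dataclass
class ParallelJob:
    queue: str
    tasks: list = field(default_factory=list)


def suspend_instance_observed(job, queue, node):
    """Behaviour as described in this thread: only a master task on the
    suspended queue instance is stopped; slave tasks get no signal."""
    if job.queue != queue:
        return []
    hit = [t for t in job.tasks if t.node == node and t.role == "master"]
    for t in hit:
        t.stopped = True
    return hit


def suspend_instance_escalated(job, queue, node):
    """Hypothetical alternative: a suspend on any node that hosts part of
    the job is escalated so that every task of the job is stopped."""
    if job.queue != queue or not any(t.node == node for t in job.tasks):
        return []
    for t in job.tasks:
        t.stopped = True
    return list(job.tasks)


def make_job():
    # The example from above: 2 slots, master on node1, slave on node2.
    return ParallelJob("parallel.q",
                       [Task("node1", "master"), Task("node2", "slave")])


observed = make_job()
suspend_instance_observed(observed, "parallel.q", "node2")
print([t.stopped for t in observed.tasks])   # [False, False]

escalated = make_job()
suspend_instance_escalated(escalated, "parallel.q", "node2")
print([t.stopped for t in escalated.tasks])  # [True, True]
```

In the first case the job keeps running on both nodes although parallel.q on node2 shows "S"; in the second the whole job stops, which matches the point that it most likely can't make progress with one slave suspended anyway.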

-- Reuti


> Shannon
>
>
>>
>>
>> -- Reuti
>>
>>
>>> I think if we comment out a few lines of code, or only enable  
>>> that code by a switch, then it will make many people on this list  
>>> happy, as it is a FAQ.
>>>
>>>  -Ron
>>>
>>>
>>> --- On Fri, 9/26/08, Shannon V. Davidson <svdavidson at charter.net>  
>>> wrote:
>>>> I'm trying to suspend a parallel job using a tight PE
>>>> integration, but
>>>> the non-local MPI tasks are not being suspended.  Is the
>>>> tight PE
>>>> integration code supposed to send the SIGSTOP signal to
>>>> every MPI task
>>>> in the job?  Is the suspend method executed on every
>>>> execution host in a
>>>> parallel job?
>>>>
>>>> Thanks,
>>>> Shannon
>>>>
>>>>
>>>> ------------------------------------------------------------------- 
>>>> --
>>>> To unsubscribe, e-mail:
>>>> users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail:
>>>> users-help at gridengine.sunsource.net
>>
>
> -- 
> _________________________________________
>
> Shannon V. Davidson <sdavidson at appro.com>
> Software Engineer     Appro International
> 636-633-0380 (office)  443-383-0331 (fax)
> _________________________________________
>






More information about the gridengine-users mailing list