[GE users] Suspending Parallel Jobs

Reuti reuti at staff.uni-marburg.de
Sat Sep 27 21:40:09 BST 2008


On 26.09.2008 at 15:09, Shannon V. Davidson wrote:

> Reuti wrote:
>> On 26.09.2008 at 05:23, Shannon V. Davidson wrote:
>>
>>> Reuti wrote:
>>>> On 25.09.2008 at 22:50, Ron Chen wrote:
>>>>
>>>>> I remember seeing SGE code that specifically blocks sending the
>>>>> suspend signal to the MPI tasks. From the list discussions, the
>>>>> reason is that if an MPI job is suspended, then the TCP/IP
>>>>> network socket calls will time out, and the job will then fail.
>>>>
>>>> This can be one reason why it's not supported out-of-the-box.
>>>> Additional reasons might be:
>>>>
>>>> - you can't stop all processes on all nodes at the same time,  
>>>> which can lead to lost messages
>>>>
>>>> - qmod -sj can work (i.e. for a complete job), but qmod -sq for
>>>> queue instances would stop only some of the processes (like
>>>> local subordination), leaving others idling and waiting (AFAICS,
>>>> subordination also works only for the node where the master
>>>> runs; on nodes where slave tasks are running, SIGSTOP is not
>>>> sent to the slave processes running there)
>>>
>>> Subordination is what I was trying to use.  It seemed like the
>>> easiest way to allow a higher priority parallel job to jump in
>>> and steal the resources from lower priority jobs until it
>>> completed.  As you mention above, it suspended the local tasks
>>> but not the remote tasks.
>>
>> I meant something different. You have one parallel job with just 2  
>> slots running on node1 (master) and node2 (slave) in a queue  
>> called parallel.q. On node2 a serial job starts in a  
>> superordinated queue serial.q (or another parallel job with a  
>> different node allocation). Although the queue instance
>> parallel@node2 is flagged as "S" (suspended), no signal is sent to
>> the parallel job running there.
>>
>> This would lead to a further discussion: should the slave-execd  
>> talk to master-execd to suspend the complete job? Most likely it  
>> can't run anyway when one of the slaves is suspended.
>
> IMO, it should, since it really doesn't make sense to suspend part  
> of a job any more than it makes sense to kill part of a job or  
> adjust the priority of part of a job.  A parallel job, whether it  
> is distributed or not, should be treated as a group of related  
> processes where the entire job is treated as a unit.

I entered an issue:
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2740
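
To illustrate what "treating the job as a unit" means on a single host, here is a minimal Python sketch. It is not SGE code: the function names and the idea of handing in a list of process group IDs are assumptions made for this example, and it only reaches local processes. Stopping the tasks on the slave nodes would still need the execds to talk to each other, as discussed above.

```python
import os
import signal


def signal_job(pgids, sig):
    """Send sig to every process group in pgids.

    pgids is a hypothetical list of process group IDs, one per local
    task of the job. Returns the IDs that no longer exist.
    """
    missing = []
    for pgid in pgids:
        try:
            os.killpg(pgid, sig)  # signals all processes in the group
        except ProcessLookupError:
            missing.append(pgid)  # group already gone
    return missing


def suspend_job(pgids):
    """Stop all local tasks of the job with SIGSTOP."""
    return signal_job(pgids, signal.SIGSTOP)


def resume_job(pgids):
    """Continue all local tasks of the job with SIGCONT."""
    return signal_job(pgids, signal.SIGCONT)
```

A non-empty return value from suspend_job() would mean that part of the job had already exited, so the caller can decide whether a partial suspend is acceptable or whether to resume the rest.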

-- Reuti


> Shannon
>
>
>> -- Reuti
>>
>>
>>> Shannon
>>>
>>>
>>>>
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> I think if we comment out a few lines of code, or only enable  
>>>>> that code by a switch, then it will make many people on this  
>>>>> list happy, as it is a FAQ.
>>>>>
>>>>>  -Ron
>>>>>
>>>>>
>>>>> --- On Fri, 9/26/08, Shannon V. Davidson  
>>>>> <svdavidson at charter.net> wrote:
>>>>>> I'm trying to suspend a parallel job using a tight PE integration,
>>>>>> but the non-local MPI tasks are not being suspended.  Is the tight
>>>>>> PE integration code supposed to send the SIGSTOP signal to every
>>>>>> MPI task in the job?  Is the suspend method executed on every
>>>>>> execution host in a parallel job?
>>>>>>
>>>>>> Thanks,
>>>>>> Shannon
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>
>>>
>>> -- 
>>> _________________________________________
>>>
>>> Shannon V. Davidson <sdavidson at appro.com>
>>> Software Engineer     Appro International
>>> 636-633-0380 (office)  443-383-0331 (fax)
>>> _________________________________________
>>>
>
>





