[GE users] Suspending Parallel Jobs

Shannon V. Davidson svdavidson at charter.net
Fri Sep 26 14:09:57 BST 2008



Reuti wrote:
> On 26.09.2008 at 05:23, Shannon V. Davidson wrote:
>
>> Reuti wrote:
>>> On 25.09.2008 at 22:50, Ron Chen wrote:
>>>
>>>> I remember seeing SGE code that specifically blocks sending the 
>>>> suspend signal to the MPI tasks. From the list discussions, the 
>>>> reason is that if an MPI job is suspended, then the TCP/IP network 
>>>> socket calls will time out, and the job will then fail.
>>>
>>> This can be one reason why it's not supported out of the box. 
>>> Additional reasons might be:
>>>
>>> - you can't stop all processes on all nodes at the same time, which 
>>> can lead to lost messages
>>>
>>> - qmod -sj can work (as it addresses a complete job), but qmod -sq 
>>> for queue instances would stop only some of the processes (like 
>>> local subordination), leaving the others idling and waiting. AFAICS, 
>>> subordination also only works on the node where the master runs; 
>>> on nodes where slave tasks are running, SIGSTOP is not sent to the 
>>> slave processes running there either.
>>
>> Subordination is what I was trying to use.  It seemed like the 
>> easiest way to allow a higher priority parallel job to jump in and 
>> steal the resources from lower priority jobs until it completed.  As 
>> you mention above, it suspended the local tasks but not the remote 
>> tasks.
>
> I meant something different. You have one parallel job with just 2 
> slots running on node1 (master) and node2 (slave) in a queue called 
> parallel.q. On node2 a serial job starts in a superordinated queue 
> serial.q (or another parallel job with a different node allocation). 
> Although the queue instance parallel.q at node2 is flagged as "S" 
> (suspended), no signal is sent to the parallel job running there.
>
> This would lead to a further discussion: should the slave-execd talk 
> to master-execd to suspend the complete job? Most likely it can't run 
> anyway when one of the slaves is suspended.

IMO, it should, since it really doesn't make sense to suspend part of a 
job any more than it makes sense to kill part of a job or adjust the 
priority of part of a job.  A parallel job, whether it is distributed or 
not, should be treated as a group of related processes where the entire 
job is treated as a unit.
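
To illustrate the "job as a unit" idea on a single host, here is a minimal
Python sketch. It is not how SGE implements suspension; the helper names
(start_job, suspend_job, resume_job, wait_until_stopped) are hypothetical,
and it assumes a POSIX system. The tasks are placed in one process group so
that one SIGSTOP or SIGCONT reaches all of them together. A distributed job
would need something equivalent on every execution host.

```python
import os
import signal
import subprocess
import sys

# Stand-in for one task of a parallel job (hypothetical workload).
TASK_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def start_job(n_tasks):
    """Start n_tasks children in one process group; return (pgid, tasks)."""
    # The first child creates a new process group and becomes its leader.
    leader = subprocess.Popen(TASK_CMD, preexec_fn=os.setpgrp)
    pgid = leader.pid  # a group leader's PID is also the group ID
    # The remaining children join the leader's group before they exec.
    others = [
        subprocess.Popen(TASK_CMD, preexec_fn=lambda: os.setpgid(0, pgid))
        for _ in range(n_tasks - 1)
    ]
    return pgid, [leader] + others


def suspend_job(pgid):
    """Send SIGSTOP to every process in the group at once."""
    os.killpg(pgid, signal.SIGSTOP)


def resume_job(pgid):
    """Send SIGCONT to every process in the group at once."""
    os.killpg(pgid, signal.SIGCONT)


def wait_until_stopped(task):
    """Block until the child changes state; return True if it is stopped."""
    _, status = os.waitpid(task.pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)
```

A caller would run start_job(n), then suspend_job(pgid) and later
resume_job(pgid). No task in the group is left running while the others
are stopped, which is the behaviour argued for above.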

Shannon


> -- Reuti
>
>
>> Shannon
>>
>>
>>>
>>>
>>> -- Reuti
>>>
>>>
>>>> I think if we comment out a few lines of code, or only enable that 
>>>> code by a switch, then it will make many people on this list happy, 
>>>> as it is a FAQ.
>>>>
>>>>  -Ron
>>>>
>>>>
>>>> --- On Fri, 9/26/08, Shannon V. Davidson <svdavidson at charter.net> 
>>>> wrote:
>>>>> I'm trying to suspend a parallel job using a tight PE
>>>>> integration, but the non-local MPI tasks are not being
>>>>> suspended.  Is the tight PE integration code supposed to send
>>>>> the SIGSTOP signal to every MPI task in the job?  Is the
>>>>> suspend method executed on every execution host in a parallel
>>>>> job?
>>>>>
>>>>> Thanks,
>>>>> Shannon
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail:
>>>>> users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail:
>>>>> users-help at gridengine.sunsource.net
>>
>> -- 
>> _________________________________________
>>
>> Shannon V. Davidson <sdavidson at appro.com>
>> Software Engineer     Appro International
>> 636-633-0380 (office)  443-383-0331 (fax)
>> _________________________________________

-- 
_________________________________________

Shannon V. Davidson <sdavidson at appro.com>
Software Engineer     Appro International
636-633-0380 (office)  443-383-0331 (fax)
_________________________________________


