[GE users] Suspending Parallel Jobs

Shannon V. Davidson svdavidson at charter.net
Fri Sep 26 04:23:17 BST 2008



Reuti wrote:
> Am 25.09.2008 um 22:50 schrieb Ron Chen:
>
>> I remember seeing SGE code that specifically blocks sending the 
>> suspend signal to the MPI tasks. From the list discussions, the 
>> reason is that if an MPI job is suspended, then the TCP/IP network 
>> socket calls will time out, and the job will then fail.
>
> This can be one reason why it's not supported out-of-the-box. 
> Additional reasons might be:
>
> - you can't stop all processes on all nodes at the same time, which 
> can lead to lost messages
>
> - qmod -sj can work (as it targets the complete job), but qmod -sq for 
> queue instances would stop only some of the processes (like local 
> subordination), leaving the others idle and waiting (AFAICS, 
> subordination also works only on the node where the master runs; 
> on nodes where slave tasks are running, SIGSTOP is not sent to 
> the slave processes there either)

Subordination is what I was trying to use.  It seemed like the easiest 
way to allow a higher-priority parallel job to jump in and take the 
resources from lower-priority jobs until it completed.  As you mention 
above, it suspended the local tasks but not the remote tasks.
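
To illustrate the gap (a rough sketch, not SGE code): a stop signal only
reaches processes on the host where it is delivered, so a job spread over
several nodes needs one delivery per host.  The task list, host names and
process-group IDs below are made up, and how the call would actually be
run on each remote host is left open.

```python
import os
import signal
from collections import defaultdict


def group_by_host(tasks):
    """Map each host name to the process-group IDs of the tasks running there."""
    by_host = defaultdict(list)
    for host, pgid in tasks:
        by_host[host].append(pgid)
    return dict(by_host)


def signal_local_groups(pgids, suspend=True):
    """Send SIGSTOP (or SIGCONT) to each process group on the local host.

    os.killpg only affects processes on the machine where it is called,
    so this would have to run once on every host that holds tasks of the job.
    """
    sig = signal.SIGSTOP if suspend else signal.SIGCONT
    for pgid in pgids:
        os.killpg(pgid, sig)


# Hypothetical (host, process-group ID) pairs for one parallel job.
tasks = [("node01", 4100), ("node02", 5200), ("node02", 5300)]

# Each host gets its own list of groups; suspending only on node01
# would leave the two groups on node02 running.
plan = group_by_host(tasks)
```

Calling signal_local_groups(plan["node01"]) on node01 alone corresponds to
what I am seeing with subordination: the local tasks stop and the rest keep
running.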

Shannon


>
>
> -- Reuti
>
>
>> I think if we comment out a few lines of code, or only enable that 
>> code by a switch, then it will make many people on this list happy, 
>> as it is a FAQ.
>>
>>  -Ron
>>
>>
>> --- On Fri, 9/26/08, Shannon V. Davidson <svdavidson at charter.net> wrote:
>>> I'm trying to suspend a parallel job using a tight PE integration, but
>>> the non-local MPI tasks are not being suspended.  Is the tight PE
>>> integration code supposed to send the SIGSTOP signal to every MPI task
>>> in the job?  Is the suspend method executed on every execution host in a
>>> parallel job?
>>>
>>> Thanks,
>>> Shannon
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail:
>>> users-help at gridengine.sunsource.net
>>
>>
>>
>>
>
>
>
>

-- 
_________________________________________

Shannon V. Davidson <sdavidson at appro.com>
Software Engineer     Appro International
636-633-0380 (office)  443-383-0331 (fax)
_________________________________________






