[GE users] Rogue MPI processes even with tight integration

Reuti reuti at staff.uni-marburg.de
Tue Sep 4 13:39:41 BST 2007


On 04.09.2007 at 14:15, Chris Rudge wrote:

> I can't reproduce this with all MPI codes and I'm not convinced that it
> happens every time when the job ends with h_rt. I've not rigorously
> tested this code with different ways of killing it, qdel or killing a
> process etc but I did some tests with another code and the most reliable
> way to reproduce this problem is to manually kill one of the MPI
> processes on a slave node. SGE appears to spot that there's some sort of
> exit from the job but it doesn't kill all of the processes on the slave
> nodes.
>
> I've used the start|stopmpi.sh and sge_mpirun scripts in the mpi/myrinet
> folder as a basis but these don't quite work so I made minor
> adjustments.

This might be the problem: I would suggest not using this folder with
the current Myrinet software. AFAIK it was only necessary for older
Myrinet software.
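
The check mentioned further down in this thread (all children of the
qrsh_starter having the same group id) can be sketched in Python as
below. This is only an illustration under assumptions: it takes "group
id" to mean the process group ID column of a listing such as
`ps -e -o pid,ppid,pgid,command`, and the sample listing, PIDs and paths
are made up for the example, not taken from a real node.

```python
from collections import defaultdict

# Hypothetical listing in the format "PID PPID PGID COMMAND".
# All PIDs and paths here are invented for illustration.
SAMPLE = """
  PID  PPID  PGID COMMAND
 4100     1  4100 sge_execd
 4200  4100  4200 sge_shepherd-123 -bg
 4201  4200  4201 /usr/sge/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/node01/active_jobs/123.1/1.node01
 4202  4201  4201 ./mpi_app
 4203  4202  4201 ./mpi_app
"""


def parse_ps(text):
    """Parse 'PID PPID PGID COMMAND' lines; header and odd lines are skipped."""
    procs = []
    for line in text.strip().splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4 or not parts[0].isdigit():
            continue
        pid, ppid, pgid = (int(p) for p in parts[:3])
        procs.append({"pid": pid, "ppid": ppid, "pgid": pgid, "cmd": parts[3]})
    return procs


def descendants(procs, root_pid):
    """Return every process below root_pid in the parent/child tree."""
    by_parent = defaultdict(list)
    for p in procs:
        by_parent[p["ppid"]].append(p)
    found, stack = [], [root_pid]
    while stack:
        for child in by_parent[stack.pop()]:
            found.append(child)
            stack.append(child["pid"])
    return found


def group_ids_under_starter(procs, starter_name="qrsh_starter"):
    """Map each starter PID to the set of PGIDs seen among its descendants."""
    result = {}
    for p in procs:
        if starter_name in p["cmd"]:
            result[p["pid"]] = {c["pgid"] for c in descendants(procs, p["pid"])}
    return result


if __name__ == "__main__":
    for pid, pgids in group_ids_under_starter(parse_ps(SAMPLE)).items():
        status = "one group" if len(pgids) == 1 else "mixed groups"
        print(pid, sorted(pgids), status)
```

A set with a single entry per qrsh_starter corresponds to the "looks
perfect" case; more than one entry would point to a child that left the
group.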

-- Reuti

> Chris
>
> On Mon, 2007-09-03 at 21:59 +0200, Reuti wrote:
>> Well, this looks perfect. All children of the qrsh_starter have the
>> same group id. Are you also experiencing this if you issue a qdel?
>> There was a race condition for h_cpu, but even then the qrsh_starter
>> disappeared and only an idling process was left. With h_rt I have
>> never seen this behavior before.
>>
>> Did you apply the Myrinet scripts from the mpi folder (which I
>> wouldn't suggest using with the latest version of the Myrinet
>> software)?
>>
>> -- Reuti
>>
>>
>
> -- 
> Dr Chris Rudge
> chris.rudge at astro.le.ac.uk
>
> UKAFF Facility Manager & Dept. Research Computing Manager
> Dept of Physics & Astronomy
> University of Leicester
> LE1 7RH
>
> web.  www.ukaff.ac.uk
> Tel.  +44 (0)116 2523331
> Fax.  +44 (0)116 2231283
> Mob.  +44 (0)794 1379420
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
