[GE users] Rogue MPI processes even with tight integration

Chris Rudge chris.rudge at astro.le.ac.uk
Tue Sep 4 13:15:00 BST 2007


Reuti,

I can't reproduce this with all MPI codes, and I'm not convinced that it
happens every time a job ends on h_rt. I haven't rigorously tested this
code against the different ways of ending it (qdel, killing a single
process, etc.), but I did run some tests with another code. There, the
most reliable way to reproduce the problem is to manually kill one of
the MPI processes on a slave node. SGE appears to notice that something
in the job has exited, but it doesn't kill all of the job's processes on
the slave nodes.
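
For anyone wanting to check a slave node for leftovers after such a
kill, here is a minimal sketch, not anything from the SGE distribution.
It assumes Linux-style `ps -eo pid,pgid,comm` output, and it takes the
"group id" in the quoted reply below to mean the process group id, which
is an assumption. The function names, process names and ids in the
sample are made up for illustration.

```python
from collections import defaultdict

def group_by_pgid(ps_output):
    """Map each process group id to the (pid, command) pairs in it.

    Expects text in the layout of `ps -eo pid,pgid,comm` (assumed format).
    """
    groups = defaultdict(list)
    for line in ps_output.strip().splitlines():
        fields = line.split(None, 2)
        # Skip the header row and anything that is not "pid pgid command".
        if len(fields) < 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        pid, pgid, comm = int(fields[0]), int(fields[1]), fields[2].strip()
        groups[pgid].append((pid, comm))
    return dict(groups)

def leftovers(ps_output, pgid):
    """Return the processes still alive in the given process group."""
    return group_by_pgid(ps_output).get(pgid, [])

# Hypothetical sample: a qrsh_starter and two MPI ranks sharing one group.
SAMPLE = """\
  PID  PGID COMMAND
 4101  4101 qrsh_starter
 4102  4101 mpi_worker
 4103  4101 mpi_worker
 5000  5000 sshd
"""
```

Running `ps -eo pid,pgid,comm` on the slave node after the job has gone
from qstat and passing its output to leftovers() with the starter's
group id would list whatever was not cleaned up.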

I've used the startmpi.sh, stopmpi.sh and sge_mpirun scripts in the
mpi/myrinet folder as a basis, but these don't quite work, so I made
minor adjustments.

Chris

On Mon, 2007-09-03 at 21:59 +0200, Reuti wrote:
> Well, this looks perfect. All kids of the qrsh_starter have the same  
> group id. Are you experiencing this also if you issue a qdel? There  
> was a race condition for h_cpu, but even then the qrsh_starter  
> disappeared and only an idling process was left. With h_rt, this is
> behavior I have never seen before.
> 
> Did you apply the Myrinet scripts from the mpi folder (which I
> wouldn't suggest using with the latest version of the Myrinet
> software)?
> 
> -- Reuti
> 
> 

-- 
Dr Chris Rudge
chris.rudge at astro.le.ac.uk

UKAFF Facility Manager & Dept. Research Computing Manager
Dept of Physics & Astronomy
University of Leicester
LE1 7RH

web.  www.ukaff.ac.uk
Tel.  +44 (0)116 2523331
Fax.  +44 (0)116 2231283
Mob.  +44 (0)794 1379420

