[GE users] Rogue MPI processes even with tight integration

Chris Rudge chris.rudge at astro.le.ac.uk
Tue Sep 4 13:15:00 BST 2007


I can't reproduce this with all MPI codes, and I'm not convinced that it
happens every time the job ends with h_rt. I've not rigorously
tested this code with different ways of killing it (qdel, killing a
process, etc.), but I did some tests with another code, and the most reliable
way to reproduce the problem is to manually kill one of the MPI
processes on a slave node. SGE appears to spot that there's some sort of
exit from the job, but it doesn't kill all of the processes on the slave

I've used the startmpi.sh/stopmpi.sh and sge_mpirun scripts in the mpi/myrinet
folder as a basis, but these don't quite work, so I made minor


On Mon, 2007-09-03 at 21:59 +0200, Reuti wrote:
> Well, this looks perfect. All kids of the qrsh_starter have the same  
> group id. Are you experiencing this also if you issue a qdel? There  
> was a race condition for h_cpu, but even then the qrsh_starter  
> disappeared and only an idling process was left. With h_rt this is  
> something I never saw before with this behavior.
> Did you apply the Myrinet scripts from the mpi folder (which I  
> wouldn't suggest to use with the latest version of the Myrinet  
> software)?
> -- Reuti
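
One way to check the "same group id" observation on a slave node is sketched below. It is only an illustration and assumes that "group id" refers to the process group ID reported by ps; it could equally mean the additional group ID that SGE attaches to job processes, which this sketch does not inspect. The function names parse_ps, group_members and list_groups are hypothetical and are not part of SGE.

```python
import subprocess
from collections import defaultdict


def parse_ps(text):
    """Parse `ps -eo pid,pgid,comm` output into (pid, pgid, comm) tuples."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) == 3:
            rows.append((int(parts[0]), int(parts[1]), parts[2]))
    return rows


def group_members(rows, name="qrsh_starter"):
    """Map each process group containing a process called `name` to its members."""
    pgids = {pgid for _, pgid, comm in rows if comm == name}
    groups = defaultdict(list)
    for pid, pgid, comm in rows:
        if pgid in pgids:
            groups[pgid].append((pid, comm))
    return dict(groups)


def list_groups(name="qrsh_starter"):
    """Run ps on the local node and group processes as above (assumes a POSIX ps)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pgid,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return group_members(parse_ps(out), name)
```

Calling list_groups() on the slave node before and after killing one MPI process would show which processes still share a process group with a qrsh_starter. A process that has moved to its own process group, or whose qrsh_starter has already exited, will not appear in this listing, so an empty result does not prove that nothing was left behind.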

Dr Chris Rudge
chris.rudge at astro.le.ac.uk

UKAFF Facility Manager & Dept. Research Computing Manager
Dept of Physics & Astronomy
University of Leicester

web.  www.ukaff.ac.uk
Tel.  +44 (0)116 2523331
Fax.  +44 (0)116 2231283
Mob.  +44 (0)794 1379420
