[GE users] Rogue MPI processes even with tight integration

Chris Rudge chris.rudge at astro.le.ac.uk
Tue Sep 4 14:07:20 BST 2007


This doesn't really make sense. It's not obvious why using
mpi/startmpi.sh should behave any differently from
mpi/myrinet/startmpi.sh. Even the sge_mpirun script is only a
convenience wrapper that saves the user from specifying the number of
CPUs and the machinefile in their scripts.

I wonder whether the cause of this problem is actually hidden in
mpi/README. In the section about mpich.template (which is similar to
myrinet) it states:

   - resource limits are enforced also for tasks at slave hosts
   - can't trigger job finish if application finishes partially

Is it possible that SGE notices a process on a slave host exceeding the
walltime limit first and kills this process? This would be a situation
where the application finishes partially, and SGE can't deal with this.

On Tue, 2007-09-04 at 14:39 +0200, Reuti wrote:
> >
> > I've used the start|stopmpi.sh and sge_mpirun scripts in the
> > mpi/myrinet folder as a basis but these don't quite work so I made
> > minor adjustments.
> This might be the problem: I would suggest not using this folder with
> the current Myrinet software. AFAIK this was only necessary for older
> Myrinet software.
> -- Reuti

Dr Chris Rudge
chris.rudge at astro.le.ac.uk

UKAFF Facility Manager & Dept. Research Computing Manager
Dept of Physics & Astronomy
University of Leicester

web.  www.ukaff.ac.uk
Tel.  +44 (0)116 2523331
Fax.  +44 (0)116 2231283
Mob.  +44 (0)794 1379420