[GE users] Rogue MPI processes even with tight integration

Reuti reuti at staff.uni-marburg.de
Tue Sep 4 15:09:36 BST 2007


On 04.09.2007 at 15:07, Chris Rudge wrote:

> This doesn't really make sense. It's not obvious why using
> mpi/startmpi.sh should behave any differently to
> mpi/myrinet/startmpi.sh. Even the sge_mpirun script is only a useful
> wrapper to save the user specifying the number of cpus and machinefile
> in their scripts.

But I saw some kill options in there (--gm-kill 15). Maybe this
explains why the qrsh_starter survives while its parents are missing.
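
To check for this symptom on a slave node, something like the following could help. It is only a rough sketch, assuming that a "missing parent" shows up as PPID 1 (the process was reparented to init) and that the input is the text printed by `ps -eo pid,ppid,comm`. The function name `find_orphans` and the sample listing are made up for illustration and are not taken from a real cluster.

```python
def find_orphans(ps_output, names=("qrsh_starter",)):
    """Return (pid, command) pairs for matching processes whose PPID is 1.

    Assumption: a process that lost its parent was reparented to init
    (PID 1). On systems with a different reaper this check needs adapting.
    """
    orphans = []
    # Skip the header line printed by ps.
    for line in ps_output.strip().splitlines()[1:]:
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # ignore malformed or empty lines
        pid, ppid, comm = fields
        if ppid == "1" and any(name in comm for name in names):
            orphans.append((int(pid), comm.strip()))
    return orphans


# Hypothetical listing in the format of `ps -eo pid,ppid,comm`.
SAMPLE = """\
  PID  PPID COMMAND
 4021     1 qrsh_starter
 4022  4021 mpi_app
 3990  3900 sge_shepherd
"""

print(find_orphans(SAMPLE))
```

With a healthy tight integration the qrsh_starter should still have a live parent, so an empty result is what one would hope to see after the job has ended.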

We have no Myrinet, but some users stated before that it's not  
necessary to use the Myrinet sub-folder.

> I wonder whether the cause of this problem is actually hidden in
> mpi/README. In the section about mpich.template (which is similar to
> myrinet) it states
>
>    - resource limits are enforced also for tasks at slave hosts
> and
>    - can't trigger job finish if application finishes partially
>
> Is it possible that SGE notices a process on a slave host exceeding the
> walltime limit first and kills this process? This would be a situation
> where the application finishes partially and SGE can't deal with this
> properly.

This should only be the case if one slave finishes normally before
reaching any time limit.

-- Reuti


> Regards,
> Chris
>
>
> On Tue, 2007-09-04 at 14:39 +0200, Reuti wrote:
>>>
>>> I've used the start|stopmpi.sh and sge_mpirun scripts in the
>>> mpi/myrinet folder as a basis but these don't quite work so I made
>>> minor adjustments.
>>
>> This might be the problem: I would suggest not using this folder with
>> current Myrinet software. AFAIK it was only necessary for older
>> Myrinet software.
>>
>> -- Reuti
>
> -- 
> Dr Chris Rudge
> chris.rudge at astro.le.ac.uk
>
> UKAFF Facility Manager & Dept. Research Computing Manager
> Dept of Physics & Astronomy
> University of Leicester
> LE1 7RH
>
> web.  www.ukaff.ac.uk
> Tel.  +44 (0)116 2523331
> Fax.  +44 (0)116 2231283
> Mob.  +44 (0)794 1379420
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
