[GE users] Stopping MPICH child processes on qdel

Reuti reuti at staff.uni-marburg.de
Wed Nov 23 21:23:16 GMT 2005


Hi again Brian,

On 23.11.2005 at 00:14, Brian Smith wrote:

> To all:
>
> Issuing qdel to an mpich job does not kill the child processes in SGE
> 6.0u6 with Myrinet tight-integration.
>
> PE looks like so:
>
> pe_name           mpich-pgi
> slots             20
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /usr/local/sge/mpi/myrinet/startmpi.sh -catch-rsh -unique \
>                   $pe_hostfile /usr/local/x86_64/pgi/mpich/bin/mpirun
> stop_proc_args    /usr/local/sge/mpi/myrinet/stopmpi.sh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> I have checked over

You don't need the mpi/myrinet subdirectory any longer; it was only
necessary with earlier versions of Myrinet. The conventional mpi scripts
and templates will do.
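
For reference, a PE that uses the conventional scripts might look like
the sketch below. It is modeled on your mpich-pgi definition, with only
start_proc_args and stop_proc_args changed. The /usr/local/sge fallback
for $SGE_ROOT, the file name mpich-pgi.pe, and the -catch_rsh spelling
(with an underscore, as I recall it from the stock startmpi.sh) are
assumptions to check against your installation.

```shell
#!/bin/sh
# Sketch only: write a PE definition that points at the generic MPI scripts.
# Assumption: SGE_ROOT falls back to /usr/local/sge if it is not set.
SGE_ROOT="${SGE_ROOT:-/usr/local/sge}"
PE_FILE="${PE_FILE:-./mpich-pgi.pe}"   # hypothetical file name

# The quoted 'EOF' keeps $pe_hostfile and $round_robin literal, so SGE
# expands them later; sed only substitutes the @SGE_ROOT@ placeholder.
sed "s|@SGE_ROOT@|$SGE_ROOT|g" > "$PE_FILE" <<'EOF'
pe_name           mpich-pgi
slots             20
user_lists        NONE
xuser_lists       NONE
start_proc_args   @SGE_ROOT@/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args    @SGE_ROOT@/mpi/stopmpi.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
EOF

# Show the result for review before touching the cluster configuration.
cat "$PE_FILE"
# To load it into a live cluster (not run here): qconf -Mp "$PE_FILE"
```

The script only writes and prints the file, so the definition can be
reviewed first; the qconf step is left as a comment on purpose.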

> http://gridengine.sunsource.net/howto/mpich-integration.html
>
> and it appears quite outdated.  I have been unable to locate any recent
> documentation on resolving this problem.

Yes and no. I already discussed the issue of newer Myrinet versions
with another SGE user (we have no Myrinet here), and we moved the
discussion to private mail. But suddenly he refused to send me any of
the results he got while we were trying to solve this issue... :-(

But as I recall, the point was that my suggested patch to
mpirun.ch_gm.pl is no longer necessary for newer Myrinet versions, as
it's already included in Myrinet itself :-). As I never got a final
confirmation of this, I have held off on adjusting the Howto until now.

Can you please post the process tree of a running job on the head node
and on one slave with:

ps -e f -o pid,ppid,pgrp,command

after you have adjusted your PE setup to use the scripts from $SGE_ROOT/mpi?
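
If the full listing is too long to post, a small filter along these
lines could narrow it down to a single process group. This is only a
sketch: pgrp_members is a hypothetical helper name (nothing shipped
with SGE), and the process group ID and host name in the usage comments
are placeholders.

```shell
#!/bin/sh
# Print the header line plus every process whose PGRP column (3rd field)
# matches the given process group ID. Reads `ps` output on stdin.
pgrp_members() {
    awk -v pgrp="$1" 'NR == 1 || $3 == pgrp'
}

# Usage on the head node (12345 is a hypothetical process group ID):
#   ps -e f -o pid,ppid,pgrp,command | pgrp_members 12345
# Usage for a slave via ssh (node01 is a hypothetical host name):
#   ssh node01 ps -e f -o pid,ppid,pgrp,command | pgrp_members 12345
```

The filter relies on the column order pid,ppid,pgrp given to -o above,
so it needs adjusting if the columns are changed.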

Cheers - Reuti

> Calling my MPI jobs with
>
> mpirun -np $NSLOTS -machinefile $TMPDIR/machines <binary>
>
> or
>
> sge_mpirun <binary>
>
> Yields identical results.
>
> a) Is there any recent documentation that covers this issue with
> Myrinet?
>
> b) When is this problem _finally_ going to be fixed?  I have been
> dealing with it since 6.0 was initially released.  We're all the way to
> update 6 and we're still incurring issues on our systems because of it.
>
> Any and all help is appreciated.
>
> Best Regards,
>
> Brian Smith
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

