[GE users] Mvapich processes not killed on qdel

Reuti reuti at staff.uni-marburg.de
Thu May 10 11:55:18 BST 2007


Hi,

Am 09.05.2007 um 21:53 schrieb Mike Hanby:

> I created a simple helloworld job that prints a message and then
> sleeps for 5 minutes. If I qdel the job after 1 minute, the job is
> removed from the queue but remains running on the nodes for 4 more
> minutes. I'm using rsh in this example; the ps info is below:

but the processes are still not children of the sge_execd/sge_shepherd,
so the rsh-wrapper isn't being used. Is the path to the rsh binary
hardcoded somewhere in your MPI scripts? The ps output mentions
/usr/bin/rsh - can you change it somewhere to read just rsh, so that the
rsh-wrapper is picked up instead of the binary?

-- Reuti


> I submitted the job using the following job script:
> #!/bin/bash
> #$ -S /bin/bash
> #$ -cwd
> #$ -N TestMVAPICH
> #$ -pe mvapich 4
> #$ -v MPIR_HOME=/usr/local/topspin/mpi/mpich
> #$ -v MPICH_PROCESS_GROUP=no
> #$ -V
> export MPI_HOME=/usr/local/topspin/mpi/mpich
> export LD_LIBRARY_PATH=/usr/local/topspin/lib64:$MPI_HOME/lib64:$LD_LIBRARY_PATH
> export PATH=$TMPDIR:$MPI_HOME/bin:$PATH
> MPIRUN=${MPI_HOME}/bin/mpirun_rsh
> $MPIRUN -rsh -np $NSLOTS -machinefile $TMPDIR/machines ./hello-mvapich
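>
> (A quick check that could go into the job script right before the
> mpirun_rsh line - just a sketch; the expected results in the comments
> are assumptions based on the standard SGE mpi templates, which link
> the rsh wrapper into $TMPDIR:)
>
>   ls -l $TMPDIR     # should list the rsh wrapper and the machines file
>   type rsh          # should resolve to $TMPDIR/rsh, not /usr/bin/rsh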
>
> This is the ps output on the node while the job is running in the  
> queue:
> $ ssh compute-0-7 "ps -e f -o pid,ppid,pgrp,command | grep myuser | grep -v grep"
>  1460  3611  1460  \_ sshd: myuser [priv]
>  1464  1460  1460      \_ sshd: myuser at notty
>   951   947   951  |   \_ bash -c cd /home/myuser/pmemdTest-mvapich;
> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
> MPIRUN_PORT=32826
> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
> MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
>   954   948   954  |   \_ bash -c cd /home/myuser/pmemdTest-mvapich;
> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
> MPIRUN_PORT=32826
> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
> MPIRUN_RANK=1 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
>   955   949   955  |   \_ bash -c cd /home/myuser/pmemdTest-mvapich;
> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
> MPIRUN_PORT=32826
> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
> MPIRUN_RANK=2 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
>   966   950   966      \_ bash -c cd /home/myuser/pmemdTest-mvapich;
> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
> MPIRUN_PORT=32826
> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
> MPIRUN_RANK=3 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
>   943   942   938              \_ /usr/bin/rsh compute-0-7 cd
> /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0
> MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826
> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
> MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
>   944   942   938              \_ /usr/bin/rsh compute-0-7 cd
> /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0
> MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826
> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
> MPIRUN_RANK=1 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
>   945   942   938              \_ /usr/bin/rsh compute-0-7 cd
> /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0
> MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826
> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
> MPIRUN_RANK=2 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
>   946   942   938              \_ /usr/bin/rsh compute-0-7 cd
> /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0
> MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826
> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
> MPIRUN_RANK=3 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
>
> And the ps output after I qdel the job:
> $ ssh compute-0-7 "ps -e f -o pid,ppid,pgrp,command | grep myuser | grep -v grep"
>  1735  3611  1735  \_ sshd: myuser [priv]
>  1739  1735  1735      \_ sshd: myuser at notty
>   951   947   951  |   \_ bash -c cd /home/myuser/pmemdTest-mvapich;
> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
> MPIRUN_PORT=32826
> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
> MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
>   954   948   954  |   \_ bash -c cd /home/myuser/pmemdTest-mvapich;
> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
> MPIRUN_PORT=32826
> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
> MPIRUN_RANK=1 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
>   955   949   955  |   \_ bash -c cd /home/myuser/pmemdTest-mvapich;
> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
> MPIRUN_PORT=32826
> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
> MPIRUN_RANK=2 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
>   966   950   966      \_ bash -c cd /home/myuser/pmemdTest-mvapich;
> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
> MPIRUN_PORT=32826
> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
> MPIRUN_RANK=3 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
>
> -----Original Message-----
> From: Mike Hanby [mailto:mhanby at uab.edu]
> Sent: Wednesday, May 09, 2007 11:59
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] Mvapich processes not killed on qdel
>
> Hmm, I changed the mpirun command to mpirun_rsh -rsh and submitted the
> job. It started and then failed with a bunch of "connection refused"
> errors; by default Rocks disables rsh.
>
> Does tight integration only work with rsh? If so, I'll see if I can  
> get
> that enabled and try again.
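>
> I'll also check whether the transport can simply be pointed at ssh
> instead: as far as I understand, the SGE rsh-wrapper calls "qrsh
> -inherit", which uses whatever rsh_command/rsh_daemon are set in the
> cluster configuration. A quick look at the current settings (just a
> sketch of the check, nothing cluster-specific assumed):
>
>   qconf -sconf | egrep 'rsh_command|rsh_daemon'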
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wednesday, May 09, 2007 11:27
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Mvapich processes not killed on qdel
>
> Hi,
>
> can you please post the process tree (master and slave) of a running
> job on a node, using the ps command:
>
> ps -e f -o pid,ppid,pgrp,command
>
> Are you sure that the SGE rsh-wrapper is used, as you mentioned
> mpirun_ssh?
>
> -- Reuti
>
>
> Am 09.05.2007 um 17:43 schrieb Mike Hanby:
>
>> Howdy,
>>
>> I have GE 6.0u8 on a Rocks 4.2.1 cluster with Infiniband and the
>> Topspin roll (which includes mvapich).
>>
>>
>>
>> When I qdel an mvapich job, the job is immediately removed from the
>> queue; however, most of the processes on the nodes do not get
>> killed. It appears that the mpirun_ssh process does get killed,
>> but the actual job executables (sander.MPI) don't.
>>
>>
>>
>> I followed the directions for tight integration of Mvapich:
>>
>> http://gridengine.sunsource.net/project/gridengine/howto/mvapich/MVAPICH_Integration.html
>>
>>
>>
>> The job runs fine, but again it doesn't kill off processes when
>> qdel'd.
>>
>>
>>
>> Here's the pe:
>>
>> $ qconf -sp mvapich
>>
>> pe_name           mvapich
>> slots             9999
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /share/apps/gridengine/mvapich/startmpi.sh -catch_rsh $pe_hostfile
>> stop_proc_args    /share/apps/gridengine/mvapich/stopmpi.sh
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>>
>>
>>
>> The only modification made to the startmpi.sh script was to change
>> the location of the hostname and rsh wrapper scripts from $SGE_ROOT
>> to /share/apps/gridengine/mvapich.
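>>
>> (Roughly, the part of startmpi.sh I touched now looks like the lines
>> below - quoting from memory, so the variable names are not verbatim,
>> just a sketch of the change:)
>>
>>   # with -catch_rsh, link the rsh wrapper into $TMPDIR so a bare
>>   # "rsh" is found there before /usr/bin/rsh via the job's PATH
>>   rsh_wrapper=/share/apps/gridengine/mvapich/rsh
>>   ln -s $rsh_wrapper $TMPDIR/rsh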
>>
>>
>>
>> Any suggestions on what I should look for?
>>
>>
>>
>> Thanks, Mike
>>
>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



