[GE users] Mvapich processes not killed on qdel

Mike Hanby mhanby at uab.edu
Wed May 9 20:53:25 BST 2007


I created a simple helloworld job that prints a message and then sleeps
for 5 minutes. If I qdel the job after 1 minute, the job is removed from
the queue but remains running on the nodes for 4 more minutes. I'm using
rsh in this example; the ps output is below.

I submitted the job using the following job script:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N TestMVAPICH
#$ -pe mvapich 4
#$ -v MPIR_HOME=/usr/local/topspin/mpi/mpich
#$ -v MPICH_PROCESS_GROUP=no
#$ -V
export MPI_HOME=/usr/local/topspin/mpi/mpich
export LD_LIBRARY_PATH=/usr/local/topspin/lib64:$MPI_HOME/lib64:$LD_LIBRARY_PATH
export PATH=$TMPDIR:$MPI_HOME/bin:$PATH
MPIRUN=${MPI_HOME}/bin/mpirun_rsh
$MPIRUN -rsh -np $NSLOTS -machinefile $TMPDIR/machines ./hello-mvapich
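
As a sanity check that the SGE rsh wrapper is the one being picked up
(rather than /usr/bin/rsh), something like this can go in the job
script right before the mpirun line -- a rough sketch, assuming
startmpi.sh -catch_rsh has populated $TMPDIR:

# Which rsh will be called? With PATH=$TMPDIR:... it should resolve
# to $TMPDIR/rsh (the SGE wrapper), not /usr/bin/rsh.
echo "rsh resolves to: $(type -p rsh)"
ls -l $TMPDIR    # wrapper script(s) plus the machines file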

This is the ps output on the node while the job is running in the queue:
$ ssh compute-0-7 "ps -e f -o pid,ppid,pgrp,command|grep myuser|grep -v grep"
 1460  3611  1460  \_ sshd: myuser [priv]
 1464  1460  1460      \_ sshd: myuser@notty
  951   947   951  |   \_ bash -c cd /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826 MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:' MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
  954   948   954  |   \_ bash -c cd /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826 MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:' MPIRUN_RANK=1 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
  955   949   955  |   \_ bash -c cd /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826 MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:' MPIRUN_RANK=2 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
  966   950   966      \_ bash -c cd /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826 MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:' MPIRUN_RANK=3 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
  943   942   938              \_ /usr/bin/rsh compute-0-7 cd /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826 MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:' MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
  944   942   938              \_ /usr/bin/rsh compute-0-7 cd /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826 MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:' MPIRUN_RANK=1 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
  945   942   938              \_ /usr/bin/rsh compute-0-7 cd /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826 MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:' MPIRUN_RANK=2 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
  946   942   938              \_ /usr/bin/rsh compute-0-7 cd /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826 MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:' MPIRUN_RANK=3 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
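
Note the pgrp column: each of the four "bash -c ... ./hello-mvapich"
ranks is its own process group leader (951, 954, 955, 966), while the
rsh clients all share group 938 under mpirun_rsh. If the kill at qdel
time is aimed at the job's process group, that would explain why the
rsh side dies but the ranks survive. To list everything in a given
group, e.g. 938 from the tree above:

# column 3 of this ps format is the process group id
ps -e -o pid,ppid,pgrp,command | awk '$3 == 938'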

And the ps output after I qdel the job:
$ ssh compute-0-7 "ps -e f -o pid,ppid,pgrp,command|grep myuser|grep -v grep"
 1735  3611  1735  \_ sshd: myuser [priv]
 1739  1735  1735      \_ sshd: myuser@notty
  951   947   951  |   \_ bash -c cd /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826 MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:' MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
  954   948   954  |   \_ bash -c cd /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826 MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:' MPIRUN_RANK=1 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
  955   949   955  |   \_ bash -c cd /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826 MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:' MPIRUN_RANK=2 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
  966   950   966      \_ bash -c cd /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826 MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:' MPIRUN_RANK=3 MPIRUN_NPROCS=4 MPIRUN_ID=942      ./hello-mvapich
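
Until the integration is sorted out, the stranded ranks can at least
be cleaned up by hand, keyed on the MPIRUN_ID that shows up in the
command lines above (942 for this job):

# pkill -f matches against the full command line, so this catches
# the surviving ranks of this particular job on the node
ssh compute-0-7 "pkill -f MPIRUN_ID=942"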

-----Original Message-----
From: Mike Hanby [mailto:mhanby at uab.edu] 
Sent: Wednesday, May 09, 2007 11:59
To: users at gridengine.sunsource.net
Subject: RE: [GE users] Mvapich processes not killed on qdel

Hmm, I changed the mpirun command to mpirun_rsh -rsh and submitted the
job; it started and then failed with a bunch of "connection refused"
errors. Rocks disables rsh by default.

Does tight integration only work with rsh? If so, I'll see if I can get
that enabled and try again.
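
For reference, the -catch_rsh mechanism shouldn't need real rsh
enabled on the nodes: the wrapper that startmpi.sh puts into $TMPDIR
is meant to rewrite each rsh call into qrsh -inherit, so the slave
tasks get started under sgeexecd's control. A stripped-down sketch of
the idea, not the actual $SGE_ROOT/mpi/rsh script:

#!/bin/sh
# Illustrative only: hand "rsh <host> <command...>" over to SGE
host=$1
shift
exec qrsh -inherit $host "$@"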

-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Wednesday, May 09, 2007 11:27
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Mvapich processes not killed on qdel

Hi,

can you please post the process tree (master and slave) of a running
job on a node, using the ps command:

ps -e f -o pid,ppid,pgrp,command

Are you sure that the SGE rsh wrapper is being used, given that you
mentioned mpirun_ssh?

-- Reuti


On 09.05.2007 at 17:43, Mike Hanby wrote:

> Howdy,
>
> I have GE 6.0u8 on a Rocks 4.2.1 cluster with Infiniband and the
> Topspin roll (which includes mvapich).
>
> When I qdel an mvapich job, the job is immediately removed from the
> queue; however, most of the processes on the nodes do not get
> killed. It appears that the mpirun_ssh process does get killed, but
> all of the actual job executables (sander.MPI) do not.
>
> I followed the directions for tight integration of Mvapich:
>
> http://gridengine.sunsource.net/project/gridengine/howto/mvapich/MVAPICH_Integration.html
>
> The job runs fine, but again it doesn't kill off processes when
> qdel'd.
>
> Here's the pe:
>
> $ qconf -sp mvapich
> pe_name           mvapich
> slots             9999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /share/apps/gridengine/mvapich/startmpi.sh -catch_rsh \
>                   $pe_hostfile
> stop_proc_args    /share/apps/gridengine/mvapich/stopmpi.sh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> The only modification made to the startmpi.sh script was to change
> the location of the hostname and rsh scripts from $SGE_ROOT to
> /share/apps/gridengine/mvapich.
>
> Any suggestions on what I should look for?
>
> Thanks, Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
