[GE users] Mvapich processes not killed on qdel

Mike Hanby mhanby at uab.edu
Thu May 10 17:44:35 BST 2007


I came up with a hacked solution; it's not pretty, but it works on this
specific cluster. The users of this cluster always request a number of
processors that fills up whole nodes (quad-core, so they always use an
-np that's divisible by 4). Thus the processes under their account on a
node should all belong to the job in question.

I modified the stopmpi.sh script as follows:

# Kill the user's processes on the scheduled slave nodes
CURNODE=$(hostname -s)
for i in $(cat $TMPDIR/machines | uniq); do
   if [[ "$i" == "$CURNODE" ]]; then
      echo "Skipping job master node for now, don't want to kill ourselves: $i"
   else
      echo "Killing job processes on node: $i"
      ssh $i "skill -u $USER"
   fi
done

rm -f $TMPDIR/machines

# Remove the rsh wrapper link that startmpi.sh created in $TMPDIR
rshcmd=rsh
case "$ARC" in
   hp|hp10|hp11|hp11-64) rshcmd=remsh ;;
   *) ;;
esac
rm -f $TMPDIR/$rshcmd

# Kill the user's processes on the job's master node (this node)
echo "Killing processes on the master node: $CURNODE"
skill -u $USER

exit 0

-----Original Message-----
From: Brian R. Smith [mailto:brs at usf.edu] 
Sent: Thursday, May 10, 2007 11:24
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Mvapich processes not killed on qdel

Never mind!  The patch was pretty trivial: the original code called 
execl whereas the patch changed it to call execlp.  It is attached, 
just so I don't keep talking about a phantom patch.  It's for version 
mvapich-0.9.5.117.  I can't for the life of me remember where I found 
this...  It will probably look familiar to someone here.
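
For the archives, the gist of the change is the execl() -> execlp()
switch; here is a minimal stand-alone illustration (placeholder host
and command, not the actual mvapich code or the patch itself):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *host = "compute-0-7";         /* placeholder host    */
    const char *cmd  = "echo hello via rsh";  /* placeholder command */

    /* original: hard-coded absolute path, $PATH ignored, so the SGE
       rsh wrapper in $TMPDIR is never used */
    /* execl("/usr/bin/rsh", "rsh", host, cmd, (char *)NULL); */

    /* patched: plain name looked up via $PATH, so $TMPDIR/rsh (the
       wrapper) is found first */
    execlp("rsh", "rsh", host, cmd, (char *)NULL);

    perror("execlp");  /* reached only if the exec fails */
    return 1;
}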

-Brian



Brian R. Smith wrote:
> Reuti,
>
> I haven't taken the time to really look at the code so I don't know 
> why it works the way it does, but I know it calls execlp. Also, the 
> patch for the beta appeared rather non-trivial (if we consider this to 
> be an easy-to-solve problem) so I assume there is more to the problem 
> than just the PATH issue.
>
> -Brian
>
> Reuti wrote:
>> On 10.05.2007 at 16:40, Brian R. Smith wrote:
>>
>>> Reuti & Mike,
>>>
>>> I dealt with mvapich and SGE tight integration (and hence abandoned 
>>> mvapich in favor of OpenMPI, which has wonderful SGE integration 
>>> support and a much cooler plugin-based framework, IMHO). The 
>>> mpirun_rsh command is actually a piece of C code where path values 
>>> for rsh and ssh are hard-coded into the program. Because this code 
>>> seems to blow away at least the PATH variable during execution, the 
>>> exec() call
>>
>> If the problem is just the execl() in the source, one could try 
>> execlp() with a plain rsh as it will honor the set path.
>>
>> -- Reuti
>>
>>> to just "rsh" will fail since no paths will be defined (and hence 
>>> attempts to set PATH in $SGE_ROOT/mpi/startmpi.sh will fail). There 
>>> was a patch floating around for a previous beta release, but it will 
>>> not apply to the current release cleanly. The file in question 
>>> in version 0.9.8 is
>>>
>>> mpid/ch_gen2/process/mpirun_rsh.c
>>>
>>> Beginning on line 130, I believe, you will see
>>>
>>> #define RSH_CMD "/usr/bin/rsh"
>>> #define SSH_CMD "/usr/bin/ssh"
>>>
>>> I looked around for fixes (as I said, you cannot just change these 
>>> to "rsh" or "ssh"; it will fail), but as of a couple of weeks ago no 
>>> one seems to have resolved this. I hope this helps.
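>>>
>>> For context, here is roughly the mechanism that the hard-coded path
>>> defeats. This is a sketch based on the stock $SGE_ROOT/mpi scripts
>>> from the tight-integration howto, not a verbatim copy:
>>>
>>> # startmpi.sh -catch_rsh links SGE's rsh wrapper into the job's TMPDIR
>>> ln -s $SGE_ROOT/mpi/rsh $TMPDIR/rsh
>>> # and the job script puts $TMPDIR first in PATH
>>> export PATH=$TMPDIR:$PATH
>>> # so a plain "rsh" now resolves to the wrapper, which starts the
>>> # slave processes via qrsh -inherit and keeps them under
>>> # sge_shepherd's control, where qdel can reach them
>>> which rsh    # should print $TMPDIR/rsh
>>>
>>> An execl("/usr/bin/rsh", ...) bypasses this PATH lookup entirely, so
>>> the slaves never run under the shepherd.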
>>>
>>> -Brian
>>>
>>>
>>>
>>> Reuti wrote:
>>>> Hi,
>>>>
>>>> On 09.05.2007 at 21:53, Mike Hanby wrote:
>>>>
>>>>> I created a simple helloworld job that prints a message and then 
>>>>> sleeps
>>>>> for 5 minutes. If I qdel the job after 1 minute, the job is 
>>>>> removed from
>>>>> the queue but remains running on the nodes for 4 more minutes. I'm 
>>>>> using rsh in this example; I have the ps info below:
>>>>
>>>> but the processes are still not children of the 
>>>> sge_execd/sge_shepherd, so the rsh-wrapper isn't being used. Is the 
>>>> path to the rsh binary hardcoded somewhere in your MPI scripts? There 
>>>> is /usr/bin/rsh mentioned - can you change it somewhere to read just 
>>>> rsh, so that the rsh-wrapper is used instead of the binary?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> I submitted the job using the following job script:
>>>>> #!/bin/bash
>>>>> #$ -S /bin/bash
>>>>> #$ -cwd
>>>>> #$ -N TestMVAPICH
>>>>> #$ -pe mvapich 4
>>>>> #$ -v MPIR_HOME=/usr/local/topspin/mpi/mpich
>>>>> #$ -v MPICH_PROCESS_GROUP=no
>>>>> #$ -V
>>>>> export MPI_HOME=/usr/local/topspin/mpi/mpich
>>>>> export LD_LIBRARY_PATH=/usr/local/topspin/lib64:$MPI_HOME/lib64:$LD_LIBRARY_PATH
>>>>> export PATH=$TMPDIR:$MPI_HOME/bin:$PATH
>>>>> MPIRUN=${MPI_HOME}/bin/mpirun_rsh
>>>>> $MPIRUN -rsh -np $NSLOTS -machinefile $TMPDIR/machines ./hello-mvapich
>>>>>
>>>>> This is the ps output on the node while the job is running in the 
>>>>> queue:
>>>>> $ ssh compute-0-7 "ps -e f -o pid,ppid,pgrp,command | grep myuser | grep -v grep"
>>>>> 1460 3611 1460 \_ sshd: myuser [priv]
>>>>> 1464 1460 1460 \_ sshd: myuser at notty
>>>>> 951 947 951 | \_ bash -c cd /home/myuser/pmemdTest-mvapich;
>>>>> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
>>>>> MPIRUN_PORT=32826
>>>>> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
>>>>> MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=942 ./hello-mvapich
>>>>> 954 948 954 | \_ bash -c cd /home/myuser/pmemdTest-mvapich;
>>>>> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
>>>>> MPIRUN_PORT=32826
>>>>> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
>>>>> MPIRUN_RANK=1 MPIRUN_NPROCS=4 MPIRUN_ID=942 ./hello-mvapich
>>>>> 955 949 955 | \_ bash -c cd /home/myuser/pmemdTest-mvapich;
>>>>> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
>>>>> MPIRUN_PORT=32826
>>>>> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
>>>>> MPIRUN_RANK=2 MPIRUN_NPROCS=4 MPIRUN_ID=942 ./hello-mvapich
>>>>> 966 950 966 \_ bash -c cd /home/myuser/pmemdTest-mvapich;
>>>>> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
>>>>> MPIRUN_PORT=32826
>>>>> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
>>>>> MPIRUN_RANK=3 MPIRUN_NPROCS=4 MPIRUN_ID=942 ./hello-mvapich
>>>>> 943 942 938 \_ /usr/bin/rsh compute-0-7 cd
>>>>> /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0
>>>>> MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826
>>>>> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
>>>>> MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=942 ./hello-mvapich
>>>>> 944 942 938 \_ /usr/bin/rsh compute-0-7 cd
>>>>> /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0
>>>>> MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826
>>>>> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
>>>>> MPIRUN_RANK=1 MPIRUN_NPROCS=4 MPIRUN_ID=942 ./hello-mvapich
>>>>> 945 942 938 \_ /usr/bin/rsh compute-0-7 cd
>>>>> /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0
>>>>> MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826
>>>>> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
>>>>> MPIRUN_RANK=2 MPIRUN_NPROCS=4 MPIRUN_ID=942 ./hello-mvapich
>>>>> 946 942 938 \_ /usr/bin/rsh compute-0-7 cd
>>>>> /home/myuser/pmemdTest-mvapich; /usr/bin/env MPIRUN_MPD=0
>>>>> MPIRUN_HOST=compute-0-7.local MPIRUN_PORT=32826
>>>>> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
>>>>> MPIRUN_RANK=3 MPIRUN_NPROCS=4 MPIRUN_ID=942 ./hello-mvapich
>>>>>
>>>>> And the ps after I qdel the job
>>>>> $ ssh compute-0-7 "ps -e f -o pid,ppid,pgrp,command | grep myuser | grep -v grep"
>>>>> 1735 3611 1735 \_ sshd: myuser [priv]
>>>>> 1739 1735 1735 \_ sshd: myuser at notty
>>>>> 951 947 951 | \_ bash -c cd /home/myuser/pmemdTest-mvapich;
>>>>> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
>>>>> MPIRUN_PORT=32826
>>>>> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
>>>>> MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=942 ./hello-mvapich
>>>>> 954 948 954 | \_ bash -c cd /home/myuser/pmemdTest-mvapich;
>>>>> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
>>>>> MPIRUN_PORT=32826
>>>>> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
>>>>> MPIRUN_RANK=1 MPIRUN_NPROCS=4 MPIRUN_ID=942 ./hello-mvapich
>>>>> 955 949 955 | \_ bash -c cd /home/myuser/pmemdTest-mvapich;
>>>>> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
>>>>> MPIRUN_PORT=32826
>>>>> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
>>>>> MPIRUN_RANK=2 MPIRUN_NPROCS=4 MPIRUN_ID=942 ./hello-mvapich
>>>>> 966 950 966 \_ bash -c cd /home/myuser/pmemdTest-mvapich;
>>>>> /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=compute-0-7.local
>>>>> MPIRUN_PORT=32826
>>>>> MPIRUN_PROCESSES='compute-0-7:compute-0-7:compute-0-7:compute-0-7:'
>>>>> MPIRUN_RANK=3 MPIRUN_NPROCS=4 MPIRUN_ID=942 ./hello-mvapich
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mike Hanby [mailto:mhanby at uab.edu]
>>>>> Sent: Wednesday, May 09, 2007 11:59
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: RE: [GE users] Mvapich processes not killed on qdel
>>>>>
>>>>> Hmm, I changed the mpirun command to mpirun_rsh -rsh and submitted
>>>>> the job; it started and failed with a bunch of connection-refused
>>>>> errors. By default Rocks disables RSH.
>>>>>
>>>>> Does tight integration only work with rsh? If so, I'll see if I 
>>>>> can get
>>>>> that enabled and try again.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>>> Sent: Wednesday, May 09, 2007 11:27
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] Mvapich processes not killed on qdel
>>>>>
>>>>> Hi,
>>>>>
>>>>> can you please post the process tree (master and slave) of a running
>>>>> job on a node by using the ps command:
>>>>>
>>>>> ps -e f -o pid,ppid,pgrp,command
>>>>>
>>>>> Are you sure that the SGE rsh-wrapper is used, since you mentioned
>>>>> mpirun_ssh?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>> On 09.05.2007 at 17:43, Mike Hanby wrote:
>>>>>
>>>>>> Howdy,
>>>>>>
>>>>>> I have GE 6.0u8 on a Rocks 4.2.1 cluster with Infiniband and the
>>>>>> Topspin roll (which includes mvapich).
>>>>>>
>>>>>> When I qdel an mvapich job, the job is immediately removed from the
>>>>>> queue; however, most of the processes on the nodes do not get
>>>>>> killed. It appears that the mpirun_ssh process does get killed,
>>>>>> but the actual job executables (sander.MPI) do not.
>>>>>>
>>>>>> I followed the directions for tight integration of Mvapich
>>>>>>
>>>>>> http://gridengine.sunsource.net/project/gridengine/howto/mvapich/MVAPICH_Integration.html
>>>>>>
>>>>>> The job runs fine, but again its processes are not killed off when
>>>>>> it is qdel'd.
>>>>>>
>>>>>> Here's the pe:
>>>>>>
>>>>>> $ qconf -sp mvapich
>>>>>> pe_name           mvapich
>>>>>> slots             9999
>>>>>> user_lists        NONE
>>>>>> xuser_lists       NONE
>>>>>> start_proc_args   /share/apps/gridengine/mvapich/startmpi.sh -catch_rsh \
>>>>>>                   $pe_hostfile
>>>>>> stop_proc_args    /share/apps/gridengine/mvapich/stopmpi.sh
>>>>>> allocation_rule   $round_robin
>>>>>> control_slaves    TRUE
>>>>>> job_is_first_task FALSE
>>>>>> urgency_slots     min
>>>>>>
>>>>>> The only modification made to the startmpi.sh script was to change
>>>>>> the location of the hostname and rsh scripts from $SGE_ROOT to
>>>>>> /share/apps/gridengine/mvapich
>>>>>>
>>>>>> Any suggestions on what I should look for?
>>>>>>
>>>>>> Thanks, Mike


-- 
--------------------------------------------------------
+ Brian R. Smith                                       +
+ HPC Systems Analyst & Programmer                     +
+ Research Computing, University of South Florida      +
+ 4202 E. Fowler Ave. LIB618                           +
+ Office Phone: 1 (813) 974-1467                       +
+ Mobile Phone: 1 (813) 230-3441                       +
+ Organization URL: http://rc.usf.edu                  +
--------------------------------------------------------

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



