[GE users] rsh zombies when using mpich2 -- johnny layne

Johnny Layne laynejg at vcu.edu
Tue Jul 31 20:24:54 BST 2007


Hi Reuti,
Sorry it took so long to reply; we've had a power outage & other 
distractions.

Well, the zombies that concern me happen so rarely (twice in dozens of 
mpich2 + VASP runs now) that I'm having trouble replicating the problem 
at the moment. I'm going for daemonless smpd, by the way. I'll do my 
best to find out all I can and pass it on. I do remember that it was 
stray rsh processes, but that's not much to go on. Yes, it would be 
excellent to have a simple process-group kill tidy things up nicely!
> when you observe this from time to time, it may be related to:
>
> http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=20877
>
> If you could just check, out of curiosity, whether a kill of the 
> process group removes all remnants of the job. The best would be to 
> fix it in SGE.
>
> -- Reuti
Hmm, I'll keep you posted on whatever I find. Perhaps this will be 
helpful: I launched a job just now, mpich2 and VASP on CentOS, to test 
some things in stopmpich2.sh. Well, my test blew up on me; deleting the 
job left orphans on the slave nodes, which I'm now killing "by hand". 
Anyway, getting the process group IDs:
ps -eo pid,user,args,pgrp | head -200 | grep jglayne
19645 jglayne /bin/sh /home/jg 19645
19646 jglayne /bin/sh /home/jg 19646
19703 jglayne /usr/global/vasp 19645
19702 jglayne /usr/global/vasp 19646
20022 root sshd: jglayne [p 20022
20024 jglayne sshd: jglayne@pt 20022
20025 jglayne -bash 20025
20057 jglayne ps -eo pid,user, 20057
20058 jglayne head -200 20057
20059 jglayne grep jglayne 20057
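(As an aside, ps can also select by user directly, which avoids relying 
on the head -200; something like
ps -u jglayne -o pid=,pgrp=,args=
gives the same picture.)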

Then I killall the ones I'm interested in:
killall -g vasp

That worked fine for me; perhaps it could be added in somewhere to 
automate killing "strays". But if I also had another job running VASP on 
that node, using killall like this of course wouldn't be good. So on 
another node, I used the following:
ps -eo pid,user,args,pgrp | head -200 | grep jglayne
31095 jglayne /bin/sh /home/jg 31095
31115 jglayne /usr/global/vasp 31095
31130 root sshd: jglayne [p 31130
31132 jglayne sshd: jglayne@pt 31130
31134 jglayne -bash 31134
31184 jglayne ps -eo pid,user, 31184
31185 jglayne head -200 31184
31186 jglayne grep jglayne 31184
kill -- -31095
to kill the processes in the 31095 group (if that's the terminology); 
pid 31095 is a little script that actually runs VASP in the right place. 
Anyway, that seemed to work like a charm as well, and could be used with 
much better precision than scripting something around killall. If I see 
the rsh strays again I'll try this on the head node as well, and 
whatever else I can to play around with it.
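For what it's worth, here's a rough sketch of how that cleanup might be 
automated, say from something hooked into stopmpich2.sh. The user and 
command names are only placeholders I made up for illustration, not 
anything taken from the real scripts:

#!/bin/sh
# placeholder values -- a real cleanup would get these from the job context
STRAY_USER=jglayne
STRAY_CMD=vasp
# collect the distinct process group IDs of that user's stray processes
pgids=`ps -eo pgrp=,user=,comm= | awk -v u=$STRAY_USER -v c=$STRAY_CMD '$2 == u && $3 == c {print $1}' | sort -un`
for pgid in $pgids
do
    echo "killing process group $pgid"
    kill -- -$pgid
done
# note: this kills *every* matching group owned by that user on the node,
# so it's only safe when no other job of theirs should still be running there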

Well, off to read some of these other entries. Thanks for all the info!
johnny

PS: By the way, here's a script generated by my Perl code (which creates 
the script & submits it to SGE), in case my mpiexec command line might 
be helpful:

#!/bin/bash
####################################################################################
# execute script in current directory
#$ -cwd
# want any .e/.o stuff to show up here too
#$ -e ./
#$ -o ./
# shell for SGE to use for this job:
#$ -S /bin/bash
# name for the job; used by qstat
#$ -N junk2
#$ -pe mpich2_smpd 8
####################################################################################
echo "------------------------------------------------------------------"
echo "script.Tue_Jul_31_15_00_56:"
echo " My user name is `whoami`..."
echo " I'm on `hostname`.............."
echo " Beginning @ `date`..."
port=$((JOB_ID % 5000 + 20000))
export PWD=$PWD

# create a little helper script for this job
echo "#!/bin/sh" > script.$port
echo "cd $PWD" >> script.$port
echo "/usr/global/vasp-p/mpichvasptest/vasp.4.6/vasp" >> script.$port
echo "script == script.$port"
chmod +x script.$port
mv script.$port ~/

# launch VASP with mpiexec using the daemonless smpd startup (-rsh -nopm),
# the machine file from $TMPDIR, and the SGE slot count
/usr/global/mpich2smpd-intel/bin/mpiexec \
    -rsh \
    -nopm \
    -machinefile $TMPDIR/machines \
    -smpdfile ~/.smpd.$JOB_ID \
    -port $port \
    -np $NSLOTS \
    ~/script.$port
rm -f ~/script.$port

echo "$script: Ending @ `date`..."
echo "------------------------------------------------------------------"
####################################################################################
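(The Perl code then just submits the generated file with a plain qsub, e.g.
qsub some_generated_name.sh
where the file name here is made up; the -N junk2 line above is what 
actually names the job in qstat.)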


