[GE users] rsh zombies when using mpich2 -- johnny layne

Reuti reuti at staff.uni-marburg.de
Wed Aug 1 10:50:02 BST 2007


Hi,

On 31.07.2007 at 21:24, Johnny Layne wrote:

>     Well the zombies I've seen that concern me happen so rarely  
> (twice in dozens of mpich2 + VASP runs now) I'm having trouble  
> replicating it at the moment.  I'm trying for daemonless smpd by  
> the way.  I'll do my best to find out all I can and pass it on.  I  
> do remember that it was stray rsh processes but that's not much to  
> go on.  Yes it would be excellent to have a simple process group  
> kill tidy things up nicely!
>> when you observe this from time to time, it may be related to:
>>
>> http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=20877
>>
>> Could you check, just out of curiosity, whether killing the  
>> process group removes all remnants of the job? The best solution  
>> would be to fix it in SGE.
>>
>> -- Reuti 
>     Hmm, I'll keep you posted on whatever I find.  Perhaps this  
> will be helpful.  I launched a job just now, mpich2 and VASP on  
> CentOS, to test some things in stopmpich2.sh.  Well my test blew up  
> on me when I deleted the job & left orphans on the slave nodes  
> which I'm now killing "by hand".  Anyway, getting the process group  
> IDs:
> ps -eo pid,user,args,pgrp | head -200 | grep jglayne
> 19645 jglayne  /bin/sh /home/jg 19645
> 19646 jglayne  /bin/sh /home/jg 19646
> 19703 jglayne  /usr/global/vasp 19645
> 19702 jglayne  /usr/global/vasp 19646
> 20022 root     sshd: jglayne [p 20022
> 20024 jglayne  sshd: jglayne@pt 20022
> 20025 jglayne  -bash            20025
> 20057 jglayne  ps -eo pid,user, 20057
> 20058 jglayne  head -200        20057
> 20059 jglayne  grep jglayne     20057
>
> then I killall the ones I'm interested in:
> killall -g vasp
>
> worked fine for me; perhaps this could be added somewhere to  
> automate killing "strays".  But if I had another job running VASP  
> on that node, using killall like this of course wouldn't be good.  
> So on another node, I used the following:
> ps -eo pid,user,args,pgrp | head -200 | grep jglayne
> 31095 jglayne  /bin/sh /home/jg 31095
> 31115 jglayne  /usr/global/vasp 31095
> 31130 root     sshd: jglayne [p 31130
> 31132 jglayne  sshd: jglayne@pt 31130
> 31134 jglayne  -bash            31134
> 31184 jglayne  ps -eo pid,user, 31184
> 31185 jglayne  head -200        31184
> 31186 jglayne  grep jglayne     31184
> kill -- -31095
> to kill the processes in the 31095 group (if that's the terminology);

To see the parent/child dependencies you can use:

ps f -eo pid,ppid,pgrp,command --cols=120

The process group is the third column, so this way you can check  
whether it is the same for all processes of the started application.
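
A minimal sketch of how the "kill -- -PGID" approach could be scripted, assuming bash, awk and a procps-style ps on the node. The function names members_of_group and kill_group and the grace period are hypothetical, not part of SGE or mpich2:

```shell
#!/bin/bash
# Print the PIDs whose process group (column 2) equals $1.
# Expects "pid pgrp" pairs on stdin, e.g. from: ps -eo pid=,pgrp=
members_of_group() {
    awk -v g="$1" '$2 == g { print $1 }'
}

# Hypothetical helper: TERM the whole group, wait, then KILL any survivors.
kill_group() {
    local pgid=$1 grace=${2:-5}
    # Refuse anything that is not a plain number greater than 1.
    case $pgid in ''|*[!0-9]*) return 2 ;; esac
    [ "$pgid" -gt 1 ] || return 2

    # A negative PID argument addresses the whole process group.
    kill -s TERM -- "-$pgid" 2>/dev/null || true
    sleep "$grace"
    if [ -n "$(ps -eo pid=,pgrp= | members_of_group "$pgid")" ]; then
        kill -s KILL -- "-$pgid" 2>/dev/null || true
    fi
    return 0
}
```

On the node from the listing above this would be called as "kill_group 31095". Unlike killall, it only touches processes in that one group, so a second VASP job on the same node should be left alone, provided it runs in a different process group.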

> pid 31095 == a little script to actually run VASP in the right  
> place.  Anyway, that seemed to work like a charm as well and  
> perhaps could be used for much better precision than scripting  
> something around killall.  If I see the rsh strays again I'll try  
> this on the head node as well, whatever I can to play around with it.
>
>     Well, off to read some of these other entries.  Thanks for all  
> the info!
>     johnny
>
> PS:  Here's a script generated by my Perl code which creates this  
> script & submits it to the SGE, by the way, in case my command line  
> might be helpful:
>
> #!/bin/bash
> ####################################################################################
> #  execute script in current directory
> #$ -cwd
> #  want any .e/.o stuff to show up here too
> #$ -e ./
> #$ -o ./

Since -cwd is already given (either as an option to qsub or as  
"#$ -cwd" in the script), maybe the lines:

#$ -e ./
#$ -o ./

and

export PWD=$PWD
cd $PWD

are superfluous?
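
For illustration, a trimmed header under that assumption might look like the sketch below. As far as I know, with -cwd the job starts in the submission directory and the .o/.e files are written there by default. The last two lines only record that directory for inspection and are not part of the original script:

```shell
#!/bin/bash
#  execute script in current directory; .e/.o files should land here too
#$ -cwd
#  shell to use:
#$ -S /bin/bash
#  name for the job; used by qstat
#$ -N junk2
#$ -pe mpich2_smpd 8

# Illustration only: show where the job script is actually running.
workdir=$(pwd)
echo "running in $workdir"
```

Whether the "cd $PWD" inside the generated helper script is also unnecessary depends on where the processes started on the slave nodes end up, which this header does not control.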

> #  shell for qdef to use:
> #$ -S /bin/bash
> #  name for the job; used by qstat
> #$ -N junk2
> #$ -pe mpich2_smpd 8
> ####################################################################################
> echo "------------------------------------------------------------------"
> echo "script.Tue_Jul_31_15_00_56:"
> echo "   My user name is `whoami`..."
> echo "   I'm on `hostname`.............."
> echo "   Beginning @ `date`..."
> port=$((JOB_ID % 5000 + 20000))
> export PWD=$PWD
>
> #  create a little helper script for this job
> echo "#!/bin/sh" > script.$port
> echo "cd $PWD" >>  script.$port
> echo "/usr/global/vasp-p/mpichvasptest/vasp.4.6/vasp" >> script.$port
> echo "script == script.$port"
> chmod +x script.$port
> mv script.$port ~/
>
> /usr/global/mpich2smpd-intel/bin/mpiexec\
>  -rsh\
>  -nopm\
>  -machinefile $TMPDIR/machines\
>  -smpdfile ~/.smpd.$JOB_ID\
>  -port $port\
>  -np $NSLOTS\
>  ~/script.$port
> rm -f ~/script.$port
>
> echo "script.Tue_Jul_31_15_00_56:  Ending @ `date`..."
> echo "------------------------------------------------------------------"
> ####################################################################################

Why are you using a wrapper script for the program? I mean, instead  
of generating and executing the wrapper script, the line:

~/script.$port

could just be replaced with:

/usr/global/vasp-p/mpichvasptest/vasp.4.6/vasp
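
A rough sketch of what the launch part could then look like, reusing the options and paths from the quoted script. The function name sge_port and the fallback values for JOB_ID, NSLOTS and TMPDIR are assumptions added so the snippet can be run outside SGE; it only prints the command rather than executing it:

```shell
#!/bin/bash
# Same port formula as in the quoted script, wrapped in a (hypothetical) function.
sge_port() {
    echo $(( $1 % 5000 + 20000 ))
}

# Fallbacks for trying this outside SGE; inside a job SGE sets these itself.
JOB_ID=${JOB_ID:-12345}
NSLOTS=${NSLOTS:-8}
TMPDIR=${TMPDIR:-/tmp}

port=$(sge_port "$JOB_ID")

# mpiexec is given the VASP binary directly instead of ~/script.$port.
cmd=(/usr/global/mpich2smpd-intel/bin/mpiexec
     -rsh -nopm
     -machinefile "$TMPDIR/machines"
     -smpdfile "$HOME/.smpd.$JOB_ID"
     -port "$port"
     -np "$NSLOTS"
     /usr/global/vasp-p/mpichvasptest/vasp.4.6/vasp)

# Print the command for inspection; in a real job run "${cmd[@]}" instead.
printf '%s ' "${cmd[@]}"; echo
```

This drops the helper script and with it the "cd $PWD" on the slave nodes, so whether VASP still starts in the right directory there is exactly what would need checking.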

Did you encounter any error with it? - Reuti
