[GE users] rsh zombies when using mpich2 -- johnny layne

Reuti reuti at staff.uni-marburg.de
Tue Jul 31 11:46:56 BST 2007


On 30.07.2007 at 17:03, Johnny Layne wrote:

>   I'm playing around with mpich2, running some VASP jobs.  I'm  
> noticing that occasionally some rsh processes become zombies,  
> anybody else seeing this?  Right now I suspect it's possibly due to  
> not using a job-specific .smpd file, I'm going to play around & see  
> if creating a specific one for each job seems to help.  So I wonder  
> if launching a bunch of these jobs in quick succession is causing  
> problems when the jobs finish & the .smpd has changed.
>   I've got everything set up following Reuti's tight integration
> with mpich 2
> (http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html)
> and in general it works great, I've just noticed this happening a
> couple times, and couldn't find (so far) any similar postings in the
> mailing list archive.
>   I could add this guy's solution to my stopmpich2.sh to kill any
> zombies I suppose:
> https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2004-January/004113.html
> or do something along those lines anyway in the kill code.
>   It's not a big problem for me as I'll hunt down zombie processes  
> & kill 'em, but I hardly trust our users to do that when we turn  
> this stuff loose to them!  Thanks for any advice & info in  
> advance.  I'll continue playing around with things & post if  
> something seems to work.

If you observe this only from time to time, it may be related to:


Could you check, out of curiosity, whether killing the process group
removes all remnants of the job? The best solution would then be to
fix it in SGE.
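
A minimal sketch of how such a check could be done by hand on an
execution host. It assumes a Linux-style ps that accepts the "stat"
output field, and all function names here are hypothetical helpers, not
part of SGE or MPICH2. Note that a zombie itself cannot be killed; it
only disappears once its parent reaps it or exits, so the point of the
test is to see whether anything in the group survives the signal.

```shell
#!/bin/sh
# Sketch only: these function names are illustrative assumptions,
# not part of SGE or MPICH2.

# Read "pid ppid pgid stat comm" lines on stdin and print only the
# zombies (process state starting with Z) as "pid ppid pgid comm".
filter_zombies() {
    awk '$4 ~ /^Z/ { print $1, $2, $3, $5 }'
}

# List zombies owned by a user (default: the current user).
list_zombies() {
    ps -u "${1:-$(id -un)}" \
       -o pid= -o ppid= -o pgid= -o stat= -o comm= | filter_zombies
}

# Print the processes that are still members of process group $1.
pgroup_members() {
    ps -e -o pid= -o ppid= -o pgid= -o stat= -o comm= |
        awk -v g="$1" '$3 == g'
}

# Send SIGTERM to every process in group $1, wait $2 seconds
# (default 2), then send SIGKILL to whatever is left. The "--" keeps
# the negative PGID from being parsed as an option.
kill_pgroup() {
    kill -s TERM -- "-$1" 2>/dev/null
    sleep "${2:-2}"
    kill -s KILL -- "-$1" 2>/dev/null
    return 0
}
```

For example, take the PGID column from list_zombies, run
"kill_pgroup <pgid>", and then check that "pgroup_members <pgid>"
prints nothing.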

-- Reuti
