[GE users] MPICH process groups

Jeroen M. Kleijer jeroen.m.kleijer at philips.com
Mon Dec 12 17:21:08 GMT 2005


Hi everyone,

I've followed the instructions for tight integration with MPICH to the 
letter, but somehow qrsh won't pass the variable MPICH_PROCESS_GROUP 
along, causing every new command started by sge_execd on the slave node 
to have its own process group.
To be exact, I did the following:
- in the submit script I put the line "export MPICH_PROCESS_GROUP=no" 
before the mpirun line.
- I edited the rsh wrapper to read "qrsh -V ..." (even did "qrsh -V -v 
MPICH_PROCESS_GROUP=no" to be on the safe side but to no avail)
- I edited the submit command to read "qsub -v MPICH_PROCESS_GROUP=no 
....", also to no avail.
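
One way to narrow down where the variable gets lost is to print it at each 
hop. The following is only a minimal sketch: the function name report_var is 
made up for illustration, and it assumes printenv is available on the nodes.

```shell
#!/bin/sh
# Report whether a variable is visible in the current environment.
# report_var is a hypothetical helper name, not part of SGE or MPICH.
report_var() {
    name=$1
    # printenv exits non-zero when the variable is not set
    if value=$(printenv "$name"); then
        echo "$name=$value"
    else
        echo "$name is not set"
    fi
}

report_var MPICH_PROCESS_GROUP
```

Running this once directly in the submit script and once through the rsh 
wrapper (for example via "qrsh -inherit -V <host> <path to this script>", 
with the path being site-specific) should show whether the value survives 
the trip to the slave node.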

The job runs fine when submitted, but when it gets killed the remaining 
processes keep on running on the slave nodes even though they are 
children of sge_execd. (After doing this for a while, starting jobs and 
killing them, I run out of semaphores, which is annoying as hell since I 
then have to clean them up manually. Does anyone have any advice on how 
to handle this?)
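
For reference, the manual semaphore cleanup can be scripted roughly as 
follows. This is a sketch under assumptions: it expects the Linux "ipcs -s" 
column layout (key, semid, owner, perms, nsems), uses "ipcrm -s", which may 
be spelled differently on other systems, and the function names are made up 
for illustration.

```shell
#!/bin/sh
# Print the IDs of System V semaphore arrays owned by a given user.
# Reads "ipcs -s" style output on stdin; data rows are recognised by
# their hexadecimal key in the first column (Linux layout assumed).
semids_owned_by() {
    awk -v owner="$1" '$1 ~ /^0x/ && $3 == owner { print $2 }'
}

# Remove every semaphore array owned by the given user.
# Only safe when none of that user's jobs are still running on the node.
cleanup_semaphores() {
    ipcs -s | semids_owned_by "$1" | while read -r id; do
        ipcrm -s "$id"
    done
}

# Example (not executed here): cleanup_semaphores "$USER"
```

This only removes the leftovers; it does not address why the slave 
processes survive the qdel in the first place.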

As this particular product is an off-the-shelf product (Msc.Marc) with 
its own compiled version of MPICH, I really don't want to roll my own 
version of MPICH and try integrating that one.
Does anybody know what I'm doing wrong here? Why won't the processes on 
the slave nodes die when a qdel command is issued?

Met vriendelijke groeten / Kind regards

Jeroen Kleijer
Unix Systeembeheer
Philips Applied Technologies


