[GE users] MPICH process groups

Reuti reuti at staff.uni-marburg.de
Mon Dec 12 17:45:54 GMT 2005


Hi,

On 12.12.2005 at 18:21, Jeroen M. Kleijer wrote:

>
> Hi everyone,
>
> I've followed the instructions for tight integration with MPICH to
> the letter, but somehow qrsh won't pass the variable
> MPICH_PROCESS_GROUP along, causing every new command started from
> the sge_execd slave node to have its own process group.
> To be exact, I did the following:
> - in the submit script I put the line "export  
> MPICH_PROCESS_GROUP=no" before the mpirun line.
> - I edited the rsh wrapper to read "qrsh -V ..." (even did "qrsh -V  
> -v MPICH_PROCESS_GROUP=no" to be on the safe side but to no avail)
> - I edited the submit command to read "qsub -v  
> MPICH_PROCESS_GROUP=no ....", also to no avail.
>
> The job runs fine when submitted, but when it gets killed, the
> remaining processes keep running on the slave nodes even
> though they are children of sge_execd. (After doing this for a
> while, starting jobs and killing them, I run out of semaphores,
> which is annoying as hell since I then have to clean them up
> manually. Does anyone have any advice on how to handle this?)
>
> As this particular product is an off-the-shelf product (Msc.Marc)
> with its own compiled version of mpich, I really don't want to
> roll my own version of mpich and try integrating that one.
> Does anybody know what I'm doing wrong here? Why won't the  
> processes on the slave nodes die when a qdel command is issued?

When semaphores are left behind, this usually means that MPICH was
compiled to use shared memory on a node. It would then also need a
machinefile of the type:

node01:2
node02:2

instead of

node01
node01
node02
node02

The second form will allocate the shared memory but won't use it (at
least it's allocated twice on each node). Can you please post a
process tree, as described in the Howto, for the master and a slave
node? I found something about run_marc. Is this still used, or is it
now a plain mpirun?
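
To illustrate the machinefile format I mean, here is a minimal sketch of how
the "node:slots" form could be generated inside the job. It assumes the usual
$PE_HOSTFILE layout (hostname, slot count, queue, processor range per line).
The function name pe_hostfile_to_machinefile is made up for this example, and
whether the MPICH shipped with your application accepts this format is
something you would have to check.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or MPICH): convert an SGE PE hostfile
# into a machinefile with one "host:slots" line per node.
# Assumed input format per line: hostname slots queue processor-range
pe_hostfile_to_machinefile() {
    # $1: path to the PE hostfile (inside a parallel job: "$PE_HOSTFILE")
    awk '
        {
            # Sum the slots per host, remembering first-seen order.
            if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
            slots[$1] += $2
        }
        END {
            for (i = 1; i <= n; i++) print order[i] ":" slots[order[i]]
        }
    ' "$1"
}
```

In a start script for the parallel environment this could be called as
pe_hostfile_to_machinefile "$PE_HOSTFILE" > "$TMPDIR/machines", so that each
node appears once with its slot count instead of once per slot.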

-- Reuti

