[GE users] Child Processes on parallel MPICH jobs

Reuti reuti at staff.uni-marburg.de
Thu Dec 22 18:34:56 GMT 2005



Am 22.12.2005 um 18:51 schrieb Raymond Chan:

> Hi all,
> I looked through the archives and saw a few messages pertaining to  
> my problem, but not sure if the symptoms were quite the same, and  
> when I tried to follow some of the solutions, the problem still  
> persists.
> Sorry to everyone who is tired of hearing about this same problem  
> again, but hope someone can help:
> I'm running SGE6 & MPIBLAST-1.4.0 on Dual AMD Opteron systems using  
> ROCKS 4.0.0 cluster.  I recently noticed that while
We don't run ROCKS here, but AFAIR you have to modify startmpi.sh  
near the beginning, so that instead of:

          echo $host

it reads:

          echo $host.local

But this is only for ROCKS, as you might get a wrong distribution of  
processes to the nodes otherwise: on ROCKS, the hostname command  
reportedly returns the name including the .local suffix.
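As a sketch of that one-line edit (the hostname is made up, and  
rocks_name is just an illustrative helper - the real fix is simply  
changing the echo line inside startmpi.sh's host loop):

```shell
#!/bin/sh
# Hypothetical helper mirroring the one-line edit in startmpi.sh:
# on ROCKS, append the ".local" suffix so the machines file matches
# what `hostname` reports on the compute nodes.
rocks_name() {
    echo "$1.local"
}

rocks_name compute-0-0    # prints: compute-0-0.local
```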
> running MPIBLAST w/ Sun Grid Engine, when I delete the job in SGE,  
> my MPIBLAST processes stay on the compute nodes.  I assume this  
> problem will come back and bite me w/ any app using MPICH & SGE as  
> well when I try to delete a running parallel MPICH job from the  
> queue.  I followed the tight integration instructions here:
> http://gridengine.sunsource.net/howto/mpich-integration.html (by  
> choosing to set the environment variable MPICH_PROCESS_GROUP=no in  
> my own user .bashrc file, in the shell script I'm submitting to  
> SGE, and even in the .profile of the head and compute nodes.  I  
> also added the -V to the qrsh command in the rsh wrapper).
It shouldn't be necessary to add it to .bashrc or .profile. Can you  
echo $PATH in your script before the mpirun, to check whether the  
rsh wrapper is used at all? The first directory should be $TMPDIR.  
Also the output of the mentioned

ps -e f -o pid,ppid,pgrp,command

in the Howto would be useful to post here (the relevant lines on the  
head node of the parallel job and on one slave). There you can check  
whether the qrsh was used and all processes are children of the  
shepherd on all the nodes.
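The PATH check could be sketched like this inside the job script  
(check_wrapper_first and the example paths are illustrative, not  
part of SGE):

```shell
#!/bin/sh
# check_wrapper_first: succeed only when the directory holding the
# qrsh-based rsh wrapper ($TMPDIR) is the first entry in PATH, so
# the wrapper shadows the plain rsh that mpirun would otherwise call.
check_wrapper_first() {
    wrapper_dir=$1
    path_value=$2
    case "$path_value" in
        "$wrapper_dir":*|"$wrapper_dir") return 0 ;;
        *)                               return 1 ;;
    esac
}

# In the job script, before mpirun (example values):
#   check_wrapper_first "$TMPDIR" "$PATH" || echo "WARNING: wrapper not first"
check_wrapper_first /tmp/123.1.all.q /tmp/123.1.all.q:/usr/bin && echo ok
```

On a node you would then run the ps command above and confirm every  
MPI task shares the shepherd's process group (pgrp column).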
> Upon closer inspection of the stopmpi.sh script that SGE uses for  
> its parallel jobs w/ MPI/MPICH, all the script seems to do is  
> delete the machine file SGE creates for MPI.  It does not even  
> mention anything about killing created processes.  Is there a need to
Removing the machine file in $TMPDIR is in some way cosmetic, as  
the whole $TMPDIR will be removed at the end of the job anyway.
> modify the stopmpi.sh script as well to kill processes, or should  
> what I did above by following the tight integration article be  
> enough?  I'm asking this because I was also working w/ parallel  
> jobs in SGE w/ PVM, and the stoppvm.sh script that is included with  
> SGE does indeed seem to explicitly kill child processes.  I'm  
> probably missing something here?
With a Loose Integration of PVM the kill isn't used, as all processes  
are shut down by a pvm_halt(). Killing the processes is only an  
option, which isn't used unless you set the variable use_kill=true.  
With a Tight Integration, the daemons are under full control of SGE  
and are shut down in case of a qdel by killing the whole process  
group, which includes the pvmd. The usual pvm_halt() is still used in  
case of a proper shutdown.
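The process-group mechanics behind that qdel behaviour can be  
sketched like this (nothing SGE-specific - it only shows why  
signalling the negative PGRP reaches all children; the kill itself  
is left as a comment):

```shell
#!/bin/sh
# Children started by a batch script inherit its process group,
# which is why a "kill -TERM -- -PGRP" issued for the shepherd's
# group also takes down the pvmd / MPI tasks.
pgrp=$(ps -o pgrp= -p $$ | tr -d ' ')
child_pgrp=$(sh -c 'ps -o pgrp= -p $$' | tr -d ' ')
echo "script pgrp: $pgrp"
echo "child pgrp:  $child_pgrp"
# With job control off (as in a batch job) both values match; the
# actual signal would be:  kill -TERM -- "-$pgrp"  (not run here).
```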

Cheers - Reuti
> If anyone has gotten tight integration working with MPIBLAST where  
> when you kill a job via qdel and all child processes on the compute  
> nodes also get killed, can you point me in the right direction?
> Thank you in advance,
> Ray
> Univ of CA Davis

To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net