[GE users] Child Processes on parallel MPICH jobs

Raymond Chan raychan at ucdavis.edu
Thu Dec 22 17:51:45 GMT 2005


Hi all,

 

I looked through the archives and saw a few messages pertaining to my
problem, but I'm not sure the symptoms were quite the same, and when I
tried some of the suggested solutions, the problem still persisted.

Sorry to everyone who is tired of hearing about this same problem yet
again, but I hope someone can help:

 

I'm running SGE6 & MPIBLAST-1.4.0 on dual AMD Opteron systems in a ROCKS
4.0.0 cluster.  I recently noticed that when I delete an MPIBLAST job
running under Sun Grid Engine, the MPIBLAST processes stay alive on the
compute nodes.  I assume this problem will come back and bite me with any
application using MPICH & SGE when I try to delete a running parallel
MPICH job from the queue.
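
Concretely, what I see is something like this (the job id and node name
are made up, just to show what I am doing):

  qdel 1234                                 # remove the running MPIBLAST job
  ssh compute-0-0 'ps -ef | grep mpiblast'  # leftover mpiblast processes still show up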

 

I followed the tight integration instructions at
http://gridengine.sunsource.net/howto/mpich-integration.html (I chose to
set the environment variable MPICH_PROCESS_GROUP=no in my own user .bashrc
file, in the shell script I submit to SGE, and even in the .profile of the
head and compute nodes, and I also added the -V flag to the qrsh command
in the rsh wrapper).
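
In case it helps, here is roughly what the relevant pieces look like on my
end after following the howto; the mpich PE name, the paths, and the
wrapper variable names are from my own setup and memory, so please read
them as approximate rather than exact:

  # submit script, run with: qsub -pe mpich 8 mpiblast-job.sh
  #$ -S /bin/bash
  #$ -V
  export MPICH_PROCESS_GROUP=no       # per the mpich-integration howto
  mpirun -np $NSLOTS -machinefile $TMPDIR/machines mpiblast ...  # usual mpiblast arguments

  # the qrsh line in the rsh wrapper ($SGE_ROOT/mpi/rsh), with -V added
  # (surrounding wrapper logic omitted; $rhost and $cmd are placeholders)
  exec $SGE_ROOT/bin/$ARC/qrsh -inherit -V $rhost $cmd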

 

Upon closer inspection of the stopmpi.sh script that SGE uses for its
parallel MPI/MPICH jobs, all the script seems to do is delete the machine
file that SGE creates for MPI; it says nothing about killing the processes
the job created.  Do I need to modify stopmpi.sh to kill those processes
as well, or should what I did above by following the tight integration
article be enough?  I ask because I have also been working with parallel
PVM jobs in SGE, and the stoppvm.sh script included with SGE does seem to
explicitly kill child processes.  I'm probably missing something here.
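
To make the question concrete, below is the kind of cleanup I was
wondering whether stopmpi.sh needs.  This is purely my own sketch, modeled
loosely on what stoppvm.sh appears to do; the use of plain rsh and of
pkill by process name is just a guess on my part, not something from the
howto:

  # hypothetical addition to stopmpi.sh (not part of the SGE distribution)
  machines=$TMPDIR/machines
  if [ -f "$machines" ]; then
      for host in `sort -u "$machines"`; do
          # crude cleanup: kill any leftover mpiblast processes I own on that node
          rsh $host "pkill -u $USER mpiblast"
      done
  fi
  rm -f "$machines"

Killing by process name like this obviously isn't general, which is part
of why I suspect I'm missing the intended mechanism.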

 

If anyone has gotten tight integration working with MPIBLAST such that
killing a job via qdel also kills all of the child processes on the
compute nodes, can you point me in the right direction?

 

Thank you in advance,

Ray

Univ of CA Davis

 



