[GE users] SGE+mvapich2 tight integration

soliday soliday at aps.anl.gov
Wed Jul 28 16:33:08 BST 2010

I use SGE to submit mvapich2 jobs to our cluster. What I would like is 
to tightly integrate the two so that when I use the qdel command it will 
find and kill all of the job's processes. Currently I have it set up so 
that SGE creates a hostfile and then calls mpirun_rsh:

/act/mvapich2-1.5/gnu/bin/mpirun_rsh -rsh -hostfile \\\$TMPDIR/machines 
-np $mvapich2 MV2_ENABLE_AFFINITY=0 MV2_ON_DEMAND_THRESHOLD=5000 $command
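
For readers unfamiliar with the hostfile step, here is a minimal sketch 
of one common way to produce $TMPDIR/machines from the $PE_HOSTFILE that 
SGE hands to a parallel job. It is not necessarily how the setup above 
does it. The function name pe_hostfile_to_machines is made up for 
illustration, and the sketch assumes each line of $PE_HOSTFILE has the 
usual "host slots queue processor-range" layout.

```shell
#!/bin/sh
# Sketch: expand an SGE PE hostfile into a one-host-per-line machines file.
# Assumption: each input line looks like "<host> <slots> <queue> <range>".
# pe_hostfile_to_machines is a hypothetical helper name.
pe_hostfile_to_machines() {
    # $1 = path to the PE hostfile; the expanded list goes to stdout
    while read -r host slots _rest; do
        [ -n "$host" ] || continue      # skip blank lines
        i=0
        while [ "$i" -lt "$slots" ]; do
            echo "$host"                # one line per granted slot
            i=$((i + 1))
        done
    done < "$1"
}

# Inside a running SGE parallel job both variables are set by SGE.
if [ -n "${PE_HOSTFILE:-}" ] && [ -n "${TMPDIR:-}" ]; then
    pe_hostfile_to_machines "$PE_HOSTFILE" > "$TMPDIR/machines"
fi
```

A host granted two slots appears twice in the output, so the number of 
lines matches the slot count passed to -np.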

I really like the mpirun_rsh command because I don't have to have an mpd 
ring already running. We used to run an mpd ring, but a single node going 
down would always screw it up.

I have built a special version of qdel that will identify all the 
threads on all the nodes prior to doing a basic qdel. It will then do a 
manual kill on all the leftover PIDs. This works, but I would prefer a 
tight integration. I've been reading up on it and it looked to me like 
the essential part is to use "qrsh -inherit -V" in place of rsh. So I 
tried editing src/pm/mpirun/mpirun_rsh.c and 
src/pm/mpirun/include/mpirun_rsh.h so that it would use qrsh instead of 
rsh. Unfortunately when I go to launch a program now I get:

(gnome-ssh-askpass:20089): Gtk-WARNING **: cannot open display:
Host key verification failed.
Error in init phase...wait for cleanup! (1/2 mpispawn connections)
Failed in initilization phase, cleaned up all the mpispawn!
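
The "qrsh -inherit -V in place of rsh" idea can also be sketched as a 
small wrapper script instead of a hard-coded change in the source. This 
is only an illustration under assumptions: it presumes the launcher 
calls its remote shell in the classic "rsh [-n] [-l user] host command" 
form, which may not match what mpirun_rsh actually passes. The names 
qrsh_wrapper and QRSH_WRAPPER_DRYRUN are hypothetical.

```shell
#!/bin/sh
# Hypothetical rsh stand-in: turns "rsh [-n] [-l user] host cmd args..."
# into "qrsh -inherit -V host cmd args...".
# Assumption: the caller uses classic rsh-style arguments.
qrsh_wrapper() {
    # Drop rsh-only options that have no meaning for qrsh -inherit.
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -n) shift ;;
            -l) [ "$#" -ge 2 ] || return 2
                shift 2 ;;
            *)  break ;;
        esac
    done
    if [ "$#" -lt 2 ]; then
        echo "usage: rsh [-n] [-l user] host command [args...]" >&2
        return 2
    fi
    host=$1
    shift
    if [ -n "${QRSH_WRAPPER_DRYRUN:-}" ]; then
        # Illustrative dry-run switch: print the command instead of running it.
        echo qrsh -inherit -V "$host" "$@"
    else
        # Start the remote task as part of the current SGE job.
        qrsh -inherit -V "$host" "$@"
    fi
}

# When installed as a script, forward the caller's arguments (if any).
[ "$#" -eq 0 ] || qrsh_wrapper "$@"
```

Note that qrsh -inherit only works from inside a running job whose 
parallel environment has control_slaves set to TRUE, and only for hosts 
that were granted to that job.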

So my question is: is it possible to get SGE+mvapich2 tight integration 
working with the mpirun_rsh launch method?

--Bob Soliday

