[GE users] SGE+mvapich2 tight integration

soliday soliday at aps.anl.gov
Wed Jul 28 19:27:52 BST 2010


I think I solved the problem.

Sorry, I should have pointed out that the mpirun_rsh line I listed came from a Tcl script we use for submitting SGE jobs. Our Tcl script works with several different MPI parallel environments.

usage: csub [-priority <number>] [-mvapich2 <jobs>] [-lam <jobs>] [-name <string>] [-hostList <listOfNames>]  [-noEmail 1] <command>

$mvapich2 is just the number of slots the user specified on the command line.
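
To give a rough idea of what the script ends up submitting for the mvapich2 case, a minimal sketch is below. The PE name "mvapich2", the -cwd option, and the file names are illustrative only; the real Tcl script does considerably more, and the MV2 environment settings are left out here.

cat > job.sh <<'EOF'
#!/bin/sh
# startmpi.sh writes $TMPDIR/machines from $pe_hostfile
/act/mvapich2/gnu/bin/mpirun_rsh -rsh -hostfile $TMPDIR/machines -np $NSLOTS \
  /home/borland/beta/Pelegant manyParticles_p.ele
EOF
qsub -pe mvapich2 5 -cwd job.sh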

As for mpirun_rsh, this is a special launcher that is not available in plain MPICH2. I don't know exactly how it works, but I do have a process tree from two different nodes running the same mvapich2 job.

Node weed19:
 4363     1  4363 /act/sge/bin/lx24-amd64/sge_execd
10473  4363 10473  \_ sge_shepherd-527636 -bg
10508 10473 10508      \_ /bin/sh /act/sge/default/spool/weed19/job_scripts/527636
10510 10508 10508          \_ /act/mvapich2/gnu/bin/mpirun_rsh -rsh -hostfile /tmp/527636.1.all.q/machines -np 5 MV2_ENA
10511 10510 10508              \_ /usr/bin/rsh weed19 cd /home/soliday/oag/apps/src/elegant/examples/Pelegant_ringTracki
10520 10511 10508              |   \_ [rsh] <defunct>
10512 10510 10508              \_ /usr/bin/rsh weed14 cd /home/soliday/oag/apps/src/elegant/examples/Pelegant_ringTracki
10518 10512 10508              |   \_ [rsh] <defunct>
10513 10510 10508              \_ /usr/bin/rsh weed10 cd /home/soliday/oag/apps/src/elegant/examples/Pelegant_ringTracki
10516 10513 10508              |   \_ [rsh] <defunct>
10514 10510 10508              \_ /usr/bin/rsh weed30 cd /home/soliday/oag/apps/src/elegant/examples/Pelegant_ringTracki
10517 10514 10508                  \_ [rsh] <defunct>

 3670     1  3670 xinetd -stayalive -pidfile /var/run/xinetd.pid
10515  3670 10515  \_ in.rshd
10519 10515 10519  |   \_ bash -c cd /home/soliday/oag/apps/src/elegant/examples/Pelegant_ringTracking1; /usr/bin/env LD
10545 10519 10519  |       \_ /act/mvapich2/gnu/bin/mpispawn 0
10546 10545 10519  |           \_ /home/borland/beta/Pelegant manyParticles_p.ele


Node weed14:
 3675     1  3675 xinetd -stayalive -pidfile /var/run/xinetd.pid
14831  3675 14831  \_ in.rshd
14832 14831 14832  |   \_ bash -c cd /home/soliday/oag/apps/src/elegant/examples/Pelegant_ringTracking1; /usr/bin/env LD
14857 14832 14832  |       \_ /act/mvapich2/gnu/bin/mpispawn 0
14858 14857 14832  |           \_ /home/borland/beta/Pelegant manyParticles_p.ele


So this is where I saw it was using /usr/bin/rsh. 

My PE's start_proc_args is set to: startmpi.sh -catch_rsh $pe_hostfile
So I expected it to use the rsh wrapper that points to qrsh. When I looked in the mvapich2 source, however, I saw that /usr/bin/rsh is explicitly hard-coded as the rsh command. I changed this to /act/sge/mvapich2-1.5/rsh, which is the qrsh wrapper script.
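
For anyone following along, a qrsh-based rsh wrapper only needs to be a few lines. A minimal sketch of such a wrapper is below; I am not claiming this is exactly what is in /act/sge/mvapich2-1.5/rsh.

#!/bin/sh
# mpirun_rsh invokes:  rsh <host> <command ...>
# Hand the host and command over to SGE so the slave task is started
# (and accounted for / controllable) under sge_execd.
host=$1
shift
exec qrsh -inherit -nostdin "$host" "$@"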

Even after that change, I was still getting:

(gnome-ssh-askpass:13995): Gtk-WARNING **: cannot open display:  
Host key verification failed.
Error in init phase...wait for cleanup! (1/2 mpispawn connections)

I have passwordless ssh set up, but I noticed that when I tried it manually it would prompt because the host name was not in the known_hosts file. So I added all of the nodes to known_hosts, and now the job runs and completes. I can even do a qdel and all the processes on all the nodes get cleaned up. Now I just have to read through your link about how to avoid the known_hosts file issue.
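
For anyone who hits the same thing, two workarounds I have seen are pre-populating known_hosts with ssh-keyscan, or relaxing host-key checking for the cluster-internal names in ssh_config. Both are untested sketches on my part, "weed*" is just our node naming, and whether disabling strict host-key checking is acceptable depends on your site's security policy.

# pre-populate known_hosts for the compute nodes
ssh-keyscan weed10 weed14 weed19 weed30 >> ~/.ssh/known_hosts

# or, in ~/.ssh/config (or the system-wide ssh_config):
Host weed*
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null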

As for your question about rsh_command and rsh_daemon.
$ qconf -sconf | fgrep rsh
rsh_command                  /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd -i
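
So, as I understand it, qrsh -inherit should end up going over ssh with that configuration. A quick sanity check, run from inside a job script on the master node of a parallel job, is something like the following ("<slave-host>" is a placeholder for one of the hosts granted to the job):

# should print the slave host's name via the SGE-managed connection
qrsh -inherit -nostdin <slave-host> hostname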

Thanks,
--Bob Soliday

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=270840
