[GE users] Questions on SSH tight integration + Rocks OS 4.3

Reuti reuti at staff.uni-marburg.de
Fri Nov 30 22:37:51 GMT 2007


On 30.11.2007 at 22:59, VS Ang wrote:

> More investigation..
>
> I submitted a simple job to list what's in the /tmp directory where  
> the job is running:
>
> $echo "ls -lR /tmp" | qsub
>
> $cat STDIN.o75
> /tmp:
> total 12
> drwxr-xr-x  2 srihari srihari 4096 Nov 30 16:55 75.1.all.q
> -r--r--r--  1 root    root      32 Oct  2 12:07 modprobe.conf.rocks
> drwxr-xr-x  2 root    root    4096 Oct  2 12:07 RCS
>
> /tmp/75.1.all.q:
> total 0  <---- Nothing (no PID file).

qrsh -inherit should create it, but it's not being called at the moment.

-- Reuti


> Who is supposed to create this PID file there?
>
>
> ----- Original Message ----
> From: VS Ang <vs_ang at yahoo.com>
> To: users at gridengine.sunsource.net
> Sent: Friday, November 30, 2007 4:39:39 PM
> Subject: Re: [GE users] Questions on SSH tight integration + Rocks  
> OS 4.3
>
> Ok, I tried another trick. I modified the mpirun script to force
> "RSHCOMMAND" to "rsh", because it didn't seem to be reading the
> environment variable. After this change, my job started but finished
> very quickly. Now the job output has a bunch of error messages:
>
> /opt/gridengine/bin/lx26-amd64/qrsh -V -inherit -nostdin  
> compute-1-1 cd /home2/srihari && exec env MXMPI_MASTER=compute-1-5.l
> ocal MXMPI_PORT=36521 MX_DISABLE_SHMEM=0 MXMPI_SIGCATCH=1  
> LD_LIBRARY_PATH=/opt/intel/mkl/9.1.021/lib/em64t:/opt/gridengine/li
> b/lx26-amd64:/opt/gridengine/lib/lx26-amd64 MXMPI_MAGIC=8327365  
> MXMPI_ID=6 MXMPI_NP=12 MXMPI_BOARD=-1 MXMPI_SLAVE=<ip_addr> /home2/ 
> srihari/IMB-MPI1-MPICH-MX
> Pseudo-terminal will not be allocated because stdin is not a terminal.
> Pseudo-terminal will not be allocated because stdin is not a terminal.
> Pseudo-terminal will not be allocated because stdin is not a terminal.
> Pseudo-terminal will not be allocated because stdin is not a terminal.
> Pseudo-terminal will not be allocated because stdin is not a terminal.
> ssh_askpass: exec(/usr/local/openssh-sge/libexec/ssh-askpass): No  
> such file or directory
> Host key verification failed.
> can't open file /tmp/73.1.all.q/pid.1.compute-1-5: No such file or  
> directory
> ssh_askpass: exec(/usr/local/openssh-sge/libexec/ssh-askpass): No  
> such file or directory
> Host key verification failed.
> can't open file /tmp/73.1.all.q/pid.2.compute-1-5: No such file or  
> directory
> ssh_askpass: exec(/usr/local/openssh-sge/libexec/ssh-askpass): No  
> such file or directory
> Host key verification failed.
> ssh_askpass: exec(/usr/local/openssh-sge/libexec/ssh-askpass): No  
> such file or directory
> ssh_askpass: exec(/usr/local/openssh-sge/libexec/ssh-askpass): No  
> such file or directory
> Host key verification failed.
> Host key verification failed.
> can't open file /tmp/73.1.all.q/pid.1.compute-1-1: No such file or  
> directory
> can't open file /tmp/73.1.all.q/pid.2.compute-1-1: No such file or  
> directory
> can't open file /tmp/73.1.all.q/pid.3.compute-1-1: No such file or  
> directory
> Pseudo-terminal will not be allocated because stdin is not a terminal.
> ssh_askpass: exec(/usr/local/openssh-sge/libexec/ssh-askpass): No  
> such file or directory
> Host key verification failed.
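A side note on the errors above: "Host key verification failed" means the patched ssh fell back to interactive host-key confirmation and, finding no terminal, tried ssh-askpass, which isn't installed. Independent of the tight-integration fix, a batch-friendly stanza in the patched OpenSSH's ssh_config avoids the prompt entirely. A sketch, assuming the function name, path, and "compute-*" host pattern below (they are illustrative, not from this thread) and assuming relaxed host-key checking is acceptable on your cluster network:

```shell
# write_batch_ssh_config <path-to-ssh_config>: append a stanza that
# disables interactive host-key confirmation for the compute nodes.
write_batch_ssh_config() {
    cat >> "$1" <<'EOF'
Host compute-*
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
EOF
}

# e.g. (path is a guess based on the install prefix mentioned below):
# write_batch_ssh_config /usr/local/openssh-sge/etc/ssh_config
```

The "can't open file /tmp/73.1.all.q/pid.*" errors are a separate symptom: those pid files only appear once the rsh-wrapper / qrsh -inherit path is actually taken.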
>
> ----- Original Message ----
> From: VS Ang <vs_ang at yahoo.com>
> To: users at gridengine.sunsource.net
> Sent: Friday, November 30, 2007 4:20:22 PM
> Subject: Re: [GE users] Questions on SSH tight integration + Rocks  
> OS 4.3
>
> Reuti,
>
> Thanks for the suggestions. Here is the update (still not much luck).
>
> My PE configuration looks like this:
>
> pe_name           mpich
> slots             9999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /opt/gridengine/mpi/startmpi.sh -catch_rsh  
> $pe_hostfile
> stop_proc_args    /opt/gridengine/mpi/stopmpi.sh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task TRUE
> urgency_slots     min
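For context, the -catch_rsh argument makes startmpi.sh place a wrapper named rsh into the job's $TMPDIR, which SGE puts at the front of PATH, so remote-shell calls made inside the job are intercepted. A simplified sketch of that step; the function name is mine, and the shipped startmpi.sh does this with more bookkeeping:

```shell
# catch_rsh_link <sge_root> <tmpdir>: shadow the system rsh with SGE's
# rsh-wrapper by linking it into the job's scratch directory, which is
# first in the job's PATH.
catch_rsh_link() {
    sge_root="$1"; tmpdir="$2"
    ln -s "$sge_root/mpi/rsh" "$tmpdir/rsh"
}
```

Note that because this MPICH-MX was compiled to call ssh, the wrapper is only hit after forcing RSHCOMMAND/P4_RSHCOMMAND to rsh, as discussed elsewhere in this thread.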
>
> Also, I added the environment variables in the job script:
>
> #!/bin/bash
> #
> #$ -cwd
> #$ -j y
> #$ -S /bin/bash
> #$ -pe mpich 12
> #
> export MPI_HOME=/home/util/mpich-mx-1.2.7..1/icc/9.1.051
> export P4_GLOBMEMSIZE=134217728
> export P4_RSHCOMMAND=rsh
> export LD_LIBRARY_PATH=/opt/intel/mkl/9.1.021/lib/em64t: 
> $LD_LIBRARY_PATH
> export MPICH_PROCESS_GROUP=no
>
> echo "Hostname is: " $HOSTNAME
> echo "ID is:"
> /usr/bin/id -a
> $MPI_HOME/bin/mpirun -np $NSLOTS -machinefile $TMP/machines IMB- 
> MPI1-MPICH-MX
>
> Now the process tree on the first node looks as follows (which is  
> different from before..):
>
> 22638 ?        S      0:07 /opt/gridengine/bin/lx26-amd64/sge_execd
> 14759 ?        S      0:00  \_ sge_shepherd-72 -bg
> 14803 ?        Ss     0:00      \_ /bin/bash /opt/gridengine/ 
> default/spool/compute-1-1/job_scripts/72
> 14805 ?        S      0:00          \_ perl -S -w /home/util/mpich- 
> mx-1.2.7..1/icc/9.1.051/bin/mpirun.ch_mx.pl -np 12 -ma
> 14834 ?        S      0:00              \_ perl -S -w /home/util/ 
> mpich-mx-1.2.7..1/icc/9.1.051/bin/mpirun.ch_mx.pl -np 12
> 14835 ?        S      0:00              \_ ssh compute-1-1 cd / 
> home2/srihari && exec env  MXMPI_MASTER=compute-1-1.local MXMP
> 14836 ?        S      0:00              \_ ssh compute-1-1 -n cd / 
> home2/srihari && exec env  MXMPI_MASTER=compute-1-1.local M
> 14837 ?        S      0:00              \_ ssh compute-1-1 -n cd / 
> home2/srihari && exec env  MXMPI_MASTER=compute-1-1.local M
> 14838 ?        S      0:00              \_ ssh compute-1-1 -n cd / 
> home2/srihari && exec env  MXMPI_MASTER=compute-1-1.local M
> 14839 ?        S      0:00              \_ ssh compute-1-4 -n cd / 
> home2/srihari && exec env  MXMPI_MASTER=compute-1-1.local M
> 14840 ?        S      0:00              \_ ssh compute-1-4 -n cd / 
> home2/srihari && exec env  MXMPI_MASTER=compute-1-1.local M
> 14842 ?        S      0:00              \_ ssh compute-1-4 -n cd / 
> home2/srihari && exec env  MXMPI_MASTER=compute-1-1.local M
> 14843 ?        S      0:00              \_ ssh compute-1-4 -n cd / 
> home2/srihari && exec env  MXMPI_MASTER=compute-1-1.local M
> 14845 ?        S      0:00              \_ ssh compute-1-5 -n cd / 
> home2/srihari && exec env  MXMPI_MASTER=compute-1-1.local M
> 14846 ?        S      0:00              \_ ssh compute-1-5 -n cd / 
> home2/srihari && exec env  MXMPI_MASTER=compute-1-1.local M
> 14849 ?        S      0:00              \_ ssh compute-1-5 -n cd / 
> home2/srihari && exec env  MXMPI_MASTER=compute-1-1.local M
> 14850 ?        S      0:00              \_ ssh compute-1-5 -n cd / 
> home2/srihari && exec env  MXMPI_MASTER=compute-1-1.local M
>
> However, the tree still looks like this on the rest of the nodes:
>
> root      3525  0.0  0.0 21928 1268 ?        Ss   Nov15   0:02 /usr/ 
> sbin/sshd
> root     16384  0.0  0.0 37092 2540 ?        Ss   16:09   0:00  \_  
> sshd: srihari [priv]
> srihari  16392  0.0  0.0 37224 1792 ?        S    16:09   0:00  |    
> \_ sshd: srihari at notty
> srihari  16399 96.0  0.1 24828 10828 ?       Rsl  16:10   0:29   
> |       \_ /home2/srihari/IMB-MPI1-MPICH-MX
> root     16386  0.0  0.0 37092 2540 ?        Ss   16:09   0:00  \_  
> sshd: srihari [priv]
> srihari  16393  0.0  0.0 37224 1808 ?        S    16:09   0:00  |    
> \_ sshd: srihari at notty
> srihari  16401 92.1  0.1 25228 11248 ?       Rsl  16:10   0:28   
> |       \_ /home2/srihari/IMB-MPI1-MPICH-MX
> root     16388  0.0  0.0 37092 2540 ?        Ss   16:09   0:00  \_  
> sshd: srihari [priv]
> srihari  16394  0.0  0.0 37224 1808 ?        S    16:09   0:00  |    
> \_ sshd: srihari at notty
> srihari  16414 65.6  0.1 24716 10720 ?       Rsl  16:10   0:20   
> |       \_ /home2/srihari/IMB-MPI1-MPICH-MX
> root     16389  0.0  0.0 37092 2540 ?        Ss   16:09   0:00  \_  
> sshd: srihari [priv]
> srihari  16395  0.0  0.0 37224 1808 ?        S    16:09   0:00  |    
> \_ sshd: srihari at notty
> srihari  16413 79.4  0.1 25228 11240 ?       Rsl  16:10   0:24   
> |       \_ /home2/srihari/IMB-MPI1-MPICH-MX
> root     17231  0.0  0.0 37092 2592 ?        Ss   16:10   0:00  \_  
> sshd: srihari [priv]
> srihari  17233  0.0  0.0 37092 1760 ?        S    16:10   0:00       
> \_ sshd: srihari at pts/0
> srihari  17234  0.0  0.0 55028 1568 pts/0    Ss   16:10    
> 0:00          \_ -bash
> srihari  17449  0.0  0.0  5440  816 pts/0    R+   16:10    
> 0:00              \_ ps auxf
>
> When I did qdel, it just killed one IMB-MPI1-MPICH-MX process on  
> the first node, leaving the rest of them around.
>
> ----- Original Message ----
> From: Reuti <reuti at staff.uni-marburg.de>
> To: users at gridengine.sunsource.net
> Sent: Friday, November 30, 2007 8:48:11 AM
> Subject: Re: [GE users] Questions on SSH tight integration + Rocks  
> OS 4.3
>
> Hi,
>
> On 30.11.2007 at 04:38, VS Ang wrote:
>
> > Ron, thank you for your original response. I have been
> > experimenting with this the last few days. However, I haven't had
> > much success.
> >
> > First, I tried building the code with the tight-integration
> > procedure that was outlined in the presentation you sent and also
> > the Tokyo institute paper. I created the patched OpenSSH binaries,
> > and also specified these for the qlogin, etc. commands in the
> > global configuration. (I installed the patched OpenSSH in /usr/
> > local/openssh-sge).
> >
> > qlogin_command              /opt/gridengine/bin/rocks-qlogin.sh
> > qlogin_daemon                /usr/local/openssh-sge/sbin/sshd -i
> > rlogin_daemon                /usr/local/openssh-sge/sbin/sshd -i
> > qrsh_command                /usr/local/openssh-sge/bin/ssh
> > rsh_command                  /usr/local/openssh-sge/bin/ssh -t -X
> > rlogin_command              /usr/local/openssh-sge/bin/ssh
> > rsh_daemon                  /usr/local/openssh-sge/sbin/sshd -i
> > qrsh_daemon                  /usr/local/openssh-sge/sbin/sshd
>
> qrsh_command
> qrsh_daemon
>
> shouldn't be necessary to set.
>
> > In addition, I also specified the "enable_addgrp_kill" flag with
> > the gid_range parameter on the qmaster:
> >
> > enable_addgrp_kill          true
> > gid_range                    20000-21000
> >
> > Now, when I launch MPICH-MX jobs on the cluster, I observe the
> > following process tree. On the first node, where the mpirun command
> > was started:
> >
> > root    22638  0.2  0.0 58448 1968 ?        S    22:12  0:01 /opt/
> > gridengine/bin/lx26-amd64/sge_execd
> > root    26102  0.0  0.0  8508  944 ?        S    22:22  0:00  \_
> > sge_shepherd-51 -bg
> > srihari  26142  0.0  0.0 53836 1144 ?        Ss  22:22  0:00
> > \_ /bin/bash /opt/gridengine/default/spool/compute-1-1/jo
> > srihari  26144  0.0  0.0 69404 4572 ?        S    22:22
> > 0:00          \_ perl -S -w /home/ibm/util/mpich-mx-1.2.7..1/icc/9.
> > srihari  26146  0.0  0.0 69404 3800 ?        S    22:22
> > 0:00              \_ perl -S -w /home/ibm/util/mpich-mx-1.2.7..1/ic
>
> It seems the mpich job is just calling the conventional ssh, not the
> one you supplied. For this to work, the rsh-wrapper must be used; it
> is symbolically linked into the $TMPDIR of the job by start_proc_args
> (of the PE definition).
>
> You can try to set:
>
> export P4_RSHCOMMAND=rsh
> export MPICH_PROCESS_GROUP=no
>
> in your jobscript. MPICH will then be tweaked to use rsh instead of
> the compiled-in ssh; this calls the rsh-wrapper, which in turn issues
> a "qrsh -inherit", which will then use the command defined for
> rsh_command, i.e. the new ssh.
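The chain described above (rsh call → wrapper → qrsh -inherit) boils down to roughly the following. This is an illustrative sketch, not the shipped $SGE_ROOT/mpi/rsh script, whose argument handling is more involved:

```shell
# rsh_wrapper [-n] <host> <command...>: rewrite an MPICH-style rsh call
# into "qrsh -inherit", which reuses the slot SGE already granted this
# job on <host>, so the remote rank runs under sge_shepherd and qdel
# can reach it.
rsh_wrapper() {
    nostdin=""
    if [ "$1" = "-n" ]; then   # -n: redirect remote stdin from /dev/null
        nostdin="-nostdin"
        shift
    fi
    host="$1"; shift
    qrsh -inherit $nostdin "$host" "$@"
}
```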
>
> http://gridengine.sunsource.net/howto/mpich-integration.html (BTW:
> for current Myrinet versions you no longer need to change the Perl
> script; but the other hints might still be useful).
>
> -- Reuti
>
>
> > On the other nodes, the process tree looks like this:
> >
> > root      3525  0.0  0.0 21928 1268 ?        Ss  Nov15  0:02 /usr/
> > sbin/sshd
> > root    31074  0.0  0.0 37092 2540 ?        Ss  22:30  0:00  \_
> > sshd: srihari [priv]
> > srihari  31082  0.0  0.0 37224 1812 ?        S    22:30  0:00  |
> > \_ sshd: srihari at notty
> > srihari  31310  100  0.1 26208 12216 ?      Rsl  22:30  0:29
> > |      \_ /home2/srihari/IMB_3.0/src/IMB-MPI1
> > root    31076  0.0  0.0 37092 2540 ?        Ss  22:30  0:00  \_
> > sshd: srihari [priv]
> > srihari  31083  0.0  0.0 37224 1812 ?        S    22:30  0:00  |
> > \_ sshd: srihari at notty
> > srihari  31318 99.8  0.1 26208 12216 ?      Rsl  22:30  0:28
> > |      \_ /home2/srihari/IMB_3.0/src/IMB-MPI1
> > root    31077  0.0  0.0 37092 2540 ?        Ss  22:30  0:00  \_
> > sshd: srihari [priv]
> > srihari  31084  0.0  0.0 37224 1812 ?        S    22:30  0:00  |
> > \_ sshd: srihari at notty
> > srihari  31328 99.3  0.1 27232 13252 ?      Rsl  22:30  0:28
> > |      \_ /home2/srihari/IMB_3.0/src/IMB-MPI1
> > root    31079  0.0  0.0 37092 2540 ?        Ss  22:30  0:00  \_
> > sshd: srihari [priv]
> > srihari  31085  0.0  0.0 37224 1812 ?        S    22:30  0:00  |
> > \_ sshd: srihari at notty
> > srihari  31331 99.3  0.1 27232 13252 ?      Rsl  22:30  0:28
> > |      \_ /home2/srihari/IMB_3.0/src/IMB-MPI1
> >
> > Now, when I do "qdel" on this job, it doesn't kill all the MPI
> > processes in the tree as I was hoping. So I must still be missing
> > something here..
> >
> > Srihari
> >
> > ----- Original Message ----
> > From: Ron Chen <ron_chen_123 at yahoo.com>
> > To: users at gridengine.sunsource.net
> > Sent: Monday, November 19, 2007 1:27:54 PM
> > Subject: Re: [GE users] Questions on SSH tight integration + Rocks
> > OS 4.3
> >
> > --- VS Ang <vs_ang at yahoo.com> wrote:
> > > First, it's not clear to me what supplementary group IDs are.
> > > Also, the release notes refer to the gid_range parameter for
> > > the execd "local" configuration on each node. Are these group
> > > ID ranges supposed to be non-overlapping on each node?
> >
> > The "gid_range" is documented in sge_conf(5). As long as the
> > range is bigger than the max. number of jobs per node, you don't
> > need to change it.
> >
> > Also, the "gid_range" of one node can overlap with that of other
> > nodes, as the group ID space is not shared between the nodes.
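To make the supplementary-group mechanism concrete: SGE attaches one unused group ID from gid_range to every process of a job as an additional (supplementary) group, so the whole process tree can be identified, and with enable_addgrp_kill also killed, by that single ID. A sketch of the identification step (the function name is mine; SGE does this internally):

```shell
# job_pids_by_gid <gid> [proc-root]: list the PIDs whose status file
# carries <gid> among its supplementary groups ("Groups:" line) -- the
# same marker enable_addgrp_kill uses to find every process of a job.
job_pids_by_gid() {
    gid="$1"; proc="${2:-/proc}"
    for status in "$proc"/[0-9]*/status; do
        [ -r "$status" ] || continue
        if grep "^Groups:" "$status" 2>/dev/null | grep -qw "$gid"; then
            pid="${status%/status}"
            echo "${pid##*/}"
        fi
    done
}
```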
> >
> >
> > > Also, when I try to edit the execd configuration on the node
> > > using qconf, I am getting the following errors. Does it mean
> > > the gid_range parameter is not supported in this version (even
> > > though this is 6.0u8)?
> >
> > "gid_range" was supported even before SGE 5.x.
> >
> >
> > > 2) >./aimk -gcc -no-java -no-jni -no-qtcsh -spool-classic
> > > -tight-ssh
> > >
> > > I would appreciate it if someone can point me to the "right"
> > > documentation on implementing the SSH tight integration and
> > > tell me the requirements for building the code with tight
> > > integration support.
> >
> > You can take a look at the SGE workshop presentation:
> >
> > "SGE-openSSH Tight Integration":
> >
> > http://gridengine.sunsource.net/download/workshop10-12_09_07/SGE-
> > WS2007-openSSHTightIntegration_RonChen.pdf
> >
> > -Ron
> >
> >
> >
> > >
> > > Thank you,
> > > Srihari
> > > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> >
> >
> >
> >
> >
> >
> >
>
>
>
>
>
>




More information about the gridengine-users mailing list