[GE users] PVM, SSh and vendor-specific host file

Bisbal, Prentice PBisbal at LexPharma.com
Tue Oct 3 20:24:26 BST 2006


Again, I apologize for top-posting.

`hostname` does return the FQDN for the hosts on all platforms, which is correct. In my original version of the script, I was using `uname -n`, which returns only the short name on IRIX. That *did* cause problems. All of my systems are set up to use the FQDN, both at the operating system level and in SGE (confirmed with the output of 'qconf -sel').
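
To illustrate why a short name is a problem here: the conversion function quoted below compares the local name against the first column of the PE hostfile with an exact string match, so a short name never matches an FQDN entry. A minimal sketch of that check follows. The helper name host_in_pe_hostfile is hypothetical and not part of startpvm.sh, and it assumes an awk that supports -v (on older systems that may mean nawk).

```shell
# Hypothetical helper, not part of startpvm.sh.
# Succeeds (exit 0) only if NAME matches column 1 of the PE hostfile exactly.
# Usage: host_in_pe_hostfile NAME HOSTFILE
host_in_pe_hostfile() {
    awk -v name="$1" '
        $1 == name { found = 1 }
        END { exit found ? 0 : 1 }
    ' "$2"
}
```

Run on an execution host as, for example, host_in_pe_hostfile "`hostname`" "$pe_hostfile" and again with "`uname -n`"; whichever command fails the check is producing a name that the hostfile comparison will never match.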

IRIX doesn't have a ps command capable of duplicating the tree structure of `ps -e f` on Linux, so here's just the output of 'ps -ef' with two processes running (one master and one slave). It's harder to read, but if you look at the PID and PPID columns, you can figure out what's going on.

This is the output of 'ps -ef' when using loose PVM integration on an IRIX host:

$ ps -ef | egrep "pbisbal|sge" | sort -k 2
sgeadmin     958543          1  0   Aug 15 ?      12:19 /usr/local/share/sge/bin/irix65/sge_execd
 pbisbal    1054503    1080539  0 15:09:37 pts/0   0:00 ps -ef
 pbisbal    1076737          1  0 15:01:24 ?       0:00 /usr/local/share/pvm3/lib/SGI64/pvmd3 /tmp/542.1.all.q/hostfile
 pbisbal    1080539    1080862  0 10:50:02 pts/0   0:01 -bash
 pbisbal    1080862    1080822  0 10:50:02 ?       0:01 /opt/sbin/sshd -R
sgeadmin    1081200     958543  0 15:01:24 ?       0:00 sge_shepherd-542 -bg
 pbisbal    1081403    1081457  0 15:01:34 ?       7:57 /usr/local/share/pvm3/bin/SGI64/omega -in XXXXXX.in -out XXXXXXX.out
 pbisbal    1081457    1081200  0 15:01:34 ?       0:00 /bin/sh /var/local/sge/default/spool/hw-diesel/job_scripts/542
 pbisbal    1081549    1076737  0 15:01:34 ?       8:02 /usr/local/share/pvm3/bin/SGI64/omega run_in_pvm_slave_mode
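
The PID/PPID chain in a flat listing like the one above can also be followed mechanically. This is only a sketch: the helper name pid_ancestry is hypothetical, and it assumes PID is column 2 and PPID is column 3 of the 'ps -ef' output, as in the listing above, and an awk that supports -v.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print the chain of
# PIDs from the given PID up to init (PID 1) or the last parent that is listed.
# Usage: ps -ef | pid_ancestry PID
pid_ancestry() {
    awk -v pid="$1" '
        { parent[$2] = $3 }          # map each PID to its PPID
        END {
            chain = pid
            # Follow PPID links; stop at PID 1, at a parent that is not
            # in the listing, or at a process that is its own parent.
            while ((pid in parent) && pid != "1" && parent[pid] != pid) {
                pid = parent[pid]
                chain = chain " <- " pid
            }
            print chain
        }
    '
}
```

Fed the listing above, pid_ancestry 1081403 (the master omega) prints "1081403 <- 1081457 <- 1081200 <- 958543 <- 1", so it passes through sge_shepherd and sge_execd, while pid_ancestry 1081549 (the slave omega) prints "1081549 <- 1076737 <- 1", so it hangs off pvmd3 and init only.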

Here's how the same job looks on a Linux system with loose PVM integration:

$ ps -e f
17576 ?        S     10:36 /usr/local/share/sge/bin/lx24-x86/sge_execd
20089 ?        S      0:00  \_ sge_shepherd-543 -bg
20111 ?        S      0:00      \_ /bin/sh /var/local/sge/default/spool/hw-appsrv05/job_scripts/543
20116 ?        R      0:11          \_ /usr/local/share/pvm3/bin/LINUXI386/omega -in XXXXXXXX.in -out XXXXX.out -pvmconf /tmp/543.1.a
20108 ?        S      0:00 /usr/local/share/pvm3/lib/LINUXI386/pvmd3 /tmp/543.1.all.q/hostfile
20117 ?        R      0:14  \_ /usr/local/share/pvm3/bin/LINUXI386/omega run_in_pvm_slave_mode

And here's how it looks on a Linux system with tight PVM integration:
$ ps -e f
17576 ?        S     10:37 /usr/local/share/sge/bin/lx24-x86/sge_execd
20158 ?        S      0:00  \_ sge_shepherd-544 -bg
20200 ?        S      0:00  |   \_ /bin/sh /var/local/sge/default/spool/hw-appsrv05/job_scripts/544
20206 ?        R      0:09  |       \_ /usr/local/share/pvm3/bin/LINUXI386/omega -in XXXXXX.in -out XXXXXX.out -pvmconf /tmp/544.1.a
20181 ?        S      0:00  \_ sge_shepherd-544 -bg
20182 ?        S      0:00      \_ sshd: pbisbal [priv]
20185 ?        S      0:00          \_ sshd: pbisbal@notty
20186 ?        S      0:00              \_ /usr/local/share/sge/utilbin/lx24-x86/qrsh_starter /var/local/sge/default/spool/hw-appsrv05/active_jobs/5
20198 ?        S      0:00                  \_ /usr/local/share/pvm3/lib/LINUXI386/pvmd3 /tmp/544.1.all.q/hostfile
20207 ?        R      0:10                      \_ /usr/local/share/pvm3/bin/LINUXI386/omega run_in_pvm_slave_mode
20178 ?        S      0:00 /usr/local/share/sge/bin/lx24-x86/qrsh -V -inherit hw-appsrv05.lexpharma.com env PVM_TMP=$TMPDIR /usr/local/share/pvm3/li
20183 ?        S      0:00  \_ /usr/bin/ssh -x -p 42731 hw-appsrv05.lexpharma.com exec '/usr/local/share/sge/utilbin/lx24-x86/qrsh_starter' '/var/lo

I can't show the process tree for tight integration on IRIX, since it never makes it that far. 

Prentice


-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Tue 10/3/2006 1:03 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] PVM, SSh and vendor-specific host file
 
Hi,

Am 03.10.2006 um 18:23 schrieb Bisbal, Prentice:

> I modified my startpvm.sh script as recommended by Reuti (see below,  
> and sorry for top-posting, but I'm forced to use Outlook). I  
> borrowed from the startmpi.sh script to create this function:
>
> PeHostfile2OpenEyeHostFile()
> {
>    # Convert the SGE PE hostfile ($1) into "host <name> <slots>" lines.
>    # The local host's slot count is reduced by one; a local host with
>    # only a single slot is left out entirely.
>    myname=`hostname`
>    while read host nslots rest; do
>       if [ "$host" = "$myname" ]; then
>           if [ "$nslots" -eq 1 ]; then
>               continue
>           elif [ "$nslots" -gt 1 ]; then
>               nslots=`expr "$nslots" - 1`
>           fi
>       fi
>       echo "host $host $nslots"
>    done < "$1"
> }
>
> This function is called later in the script like this (again  
> mimicking startmpi.sh):
>
> oe_hosts="$TMPDIR/oe_hosts"
> PeHostfile2OpenEyeHostFile $pe_hostfile >> $oe_hosts
>
> I've attached a patchfile containing these changes. This works  
> exactly as desired when my PVM PE is configured for loose  
> integration, as described in
> http://gridengine.sunsource.net/howto/pvm-integration/pvm-integration.html
>
> However, when I switch my PVM PE configuration to tight  
> integration, it works fine on my Linux execution hosts, but fails on  
> my IRIX 6.5 hosts. I get the following errors in my .pe### file:
>
> $ more tester_tight.sh.pe535
> [pvmd pid257920] 10/03 11:15:29 usage: pvmd3 [-ddebugmask] [-nhostname] [hostfile]
> [pvmd pid257920] 10/03 11:15:29 pvmbailout(0)
> libpvm [pid1066678] /tmp/535.1.all.q/pvmd.2500: No such file or directory
> libpvm [pid1066678] /tmp/535.1.all.q/pvmd.2500: No such file or directory
> libpvm [pid1066678] /tmp/535.1.all.q/pvmd.2500: No such file or directory
> libpvm [pid1066678]: pvm_mytid(): Can't contact local daemon
>
> My job script looks like this:
>
> #!/bin/sh
> PVM_ROOT=/usr/local/share/pvm3
> PVM_ARCH=`$PVM_ROOT/lib/pvmgetarch`
>
> PVM_TMP=$TMPDIR
> export PVM_TMP
>
> $PVM_ROOT/bin/$PVM_ARCH/omega -in XXXXXXX.in -out XXXXXXX.out -pvmconf $TMPDIR/oe_hosts
>
> I did some investigating, and I noticed that the PVM temp dirs do  
> not get created in $TMPDIR. Any idea why this works for Linux, but  
> not IRIX? Again, loose integration works fine. Both architectures  
> are using the same version of PVM compiled/configured the same way.  
> I'm using SSH instead of RSH. Both architectures are using the same  
> version of OpenSSH, but I haven't recompiled it yet with the patch  
> for tight integration. I don't think that's the problem, since  
> tight PVM integration works fine for my Linux systems.
>

Did you check the process tree with:

$ ps -e f

and are all PVM-generated processes on the slave nodes children of an  
sge_shepherd when using ssh?

Is the command `hostname` behaving differently on IRIX compared to  
Linux? Are you getting the FQDN there, while SGE and the Linux boxes  
were set up to work only with the short hostname?

-- Reuti

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net