[GE users] PVM, SSh and vendor-specific host file

Reuti reuti at staff.uni-marburg.de
Thu Oct 5 22:04:32 BST 2006


On 05.10.2006 at 22:31, Bisbal, Prentice wrote:

> Which rsh-wrapper are you referring to? Do you mean $SGE_ROOT/pvm/rsh?

Did you replace this with the scripts from the Howto? The SGE  
distribution doesn't contain them.

> If so, I made the change you requested. I looked through that  
> script, and didn't see any bash-specific code. It all looked like  
> standard Bourne-shell code.

It was just an idea ;-)
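
In case it helps to rule the interpreter in or out, here is a minimal sketch for checking which shell a wrapper script requests on its first line. The function name script_interpreter is made up for this example, and $SGE_ROOT/pvm/rsh is simply the path mentioned above; adjust it to wherever the wrapper actually lives.

```shell
# Print the interpreter named on a script's "#!" line, or "none".
# script_interpreter is a hypothetical helper, not part of SGE or PVM.
script_interpreter() {
    first_line=$(sed -n '1p' "$1")
    case $first_line in
        '#!'*)
            # Drop "#!", any blanks after it, and any interpreter arguments.
            printf '%s\n' "$first_line" |
                sed -e 's/^#![[:space:]]*//' -e 's/[[:space:]].*$//'
            ;;
        *)
            echo "none"
            ;;
    esac
}

# Example call (commented out so the snippet only defines the function):
# script_interpreter "$SGE_ROOT/pvm/rsh"
```

Running 'ls -l /bin/sh' on each platform would additionally show whether sh is a link to another shell there.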

>
> I think I found a clue. Look at the SSH error on the first line
> below, from the .pe### file of the failed job:
>
> head tester_tight.sh.pe579
> ssh_exchange_identification: Connection closed by remote host

I have seen this before, even without SGE involved. Can you try removing
the line for the IRIX machine from known_hosts and logging in again once
to get an updated entry? Another option might be to turn on the
most verbose debug level, -vvv, in the ssh call.
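
A rough sketch of both suggestions, assuming the file in question is ~/.ssh/known_hosts and that the host appears there under the plain name you pass in (hashed host names would not match). The helper name drop_known_host and the host name in the usage comments are placeholders.

```shell
# Print a known_hosts-style file without the entries for one host.
# The first field of each line is a comma-separated list of names/addresses.
# drop_known_host is a hypothetical helper for illustration only.
drop_known_host() {
    # $1 = known_hosts file, $2 = host name to forget
    awk -v h="$2" '{
        n = split($1, names, ",")
        for (i = 1; i <= n; i++)
            if (names[i] == h) next   # skip every line that lists this host
        print
    }' "$1"
}

# Possible usage (commented out; irix-host.example.com is a placeholder):
# drop_known_host "$HOME/.ssh/known_hosts" irix-host.example.com \
#     > "$HOME/.ssh/known_hosts.new" &&
#     mv "$HOME/.ssh/known_hosts.new" "$HOME/.ssh/known_hosts"
# ssh -vvv irix-host.example.com true
```

The next interactive login should then ask to accept the host key again, and the -vvv output shows how far the connection gets before it is closed.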

-- Reuti


> libpvm [pid1195109] /tmp/579.1.all.q/pvmd.2500: No such file or  
> directory
> libpvm [pid1195109] /tmp/579.1.all.q/pvmd.2500: No such file or  
> directory
> libpvm [pid1195109] /tmp/579.1.all.q/pvmd.2500: No such file or  
> directory
> libpvm [pid1195109]: pvm_mytid(): Can't contact local daemon
>
> I googled the SSH error and recompiled SSH without TCP wrappers,
> which was the only advice I could find for this error. When SGE
> encounters this error, is it running under my username or as the
> sgeadmin user (the username sge_execd runs as)? I've turned
> the loglevel up all the way on the IRIX execution hosts, but
> haven't found any useful error messages there.
>
> Prentice
>
>
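
The libpvm messages quoted above say that a file under the job's temporary directory could not be found. A small check like the following, dropped into the job script or the wrapper, might narrow down whether the whole directory or only the file is missing on the IRIX side. It assumes the file is named pvmd.<uid> with the numeric user ID as suffix, which is only a guess based on the path in the error output; the function name pvmd_file_status is hypothetical.

```shell
# Report whether the PVM daemon file is present in a given directory.
# Prints one of: missing-dir, missing-file, ok.
# pvmd_file_status is a made-up helper; the pvmd.<uid> naming is an assumption.
pvmd_file_status() {
    dir=$1
    uid=$2
    if [ ! -d "$dir" ]; then
        echo "missing-dir"
    elif [ ! -e "$dir/pvmd.$uid" ]; then
        echo "missing-file"
    else
        echo "ok"
    fi
}

# Possible usage inside the job script or wrapper (commented out):
# id >&2    # records which user the step really runs as
# pvmd_file_status "${PVM_TMP:-$TMPDIR}" "`id -u`" >&2
```

If 'id -u' is not available on a platform, the numeric UID can be passed by hand. The plain 'id' line would also answer the question of which user the failing step runs as.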
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wed 10/4/2006 6:38 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] PVM, SSh and vendor-specific host file
>
> On 03.10.2006 at 23:36, Bisbal, Prentice wrote:
>
>> Reuti,
>>
>> It seems that the values for TMPDIR and PVM_TMP aren't getting passed
>> correctly. What could cause that?
>>
>
> Which versions of sh and bash are installed on IRIX? Can you try
> editing the first line of the rsh-wrapper to read:
>
> #!/bin/bash
>
> Under Linux, sh is most often a link to bash. Maybe not on IRIX.
>
> -- Reuti
>
>
>> I am using the same version of OpenSSH for both, version 3.9p1, which I
>> compiled and installed myself. The only difference is that the OpenSSH on
>> the Linux systems was built from the Fedora Core 2 SRPM, so some patches
>> were included with that SRPM. I looked through the patches quickly and
>> don't think they should have an effect. I'm going through the sshd_config
>> and ssh_config files on all the hosts right now to make sure they're all
>> the same.
>>
>>
>> -- 
>> Prentice
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Tuesday, October 03, 2006 5:10 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] PVM, SSh and vendor-specific host file
>>
>> On 03.10.2006 at 21:24, Bisbal, Prentice wrote:
>>
>>> Again, I apologize for top-posting.
>>>
>>> `hostname` does return the FQDN for the hosts on all platforms, which
>>> is correct. In my original version of the script, I was using
>>> `uname -n`, which returns only the short name on IRIX. That *did* cause
>>> problems. All of my systems are set up to use the FQDN, both at the
>>> operating system level and in SGE (confirmed with the output of
>>> 'qconf -sel').
>>>
>>> IRIX doesn't have a ps command capable of duplicating the tree
>>> structure of `ps -e f` on Linux, so here's just the output of
>>> 'ps -ef' with two processes running (one master and one slave). It's
>>> harder to look at, but if you look at the PID and PPID columns, you can
>>> figure out what's going on.
>>>
>>> This is the output of 'ps -ef' when using loose PVM integration on an
>>> IRIX host:
>>>
>>> $ ps -ef | egrep "pbisbal|sge" | sort -k 2
>>> sgeadmin     958543          1  0   Aug 15 ?      12:19 /usr/local/share/sge/bin/irix65/sge_execd
>>>  pbisbal    1054503    1080539  0 15:09:37 pts/0   0:00 ps -ef
>>>  pbisbal    1076737          1  0 15:01:24 ?       0:00 /usr/local/share/pvm3/lib/SGI64/pvmd3 /tmp/542.1.all.q/hostfile
>>>  pbisbal    1080539    1080862  0 10:50:02 pts/0   0:01 -bash
>>>  pbisbal    1080862    1080822  0 10:50:02 ?       0:01 /opt/sbin/sshd -R
>>> sgeadmin    1081200     958543  0 15:01:24 ?       0:00 sge_shepherd-542 -bg
>>>  pbisbal    1081403    1081457  0 15:01:34 ?       7:57 /usr/local/share/pvm3/bin/SGI64/omega -in XXXXXX.in -out XXXXXXX.out
>>>  pbisbal    1081457    1081200  0 15:01:34 ?       0:00 /bin/sh /var/local/sge/default/spool/hw-diesel/job_scripts/542
>>>  pbisbal    1081549    1076737  0 15:01:34 ?       8:02 /usr/local/share/pvm3/bin/SGI64/omega run_in_pvm_slave_mode
>>
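
Since the IRIX ps cannot draw the tree, the PID/PPID walk described above can also be done mechanically. A minimal sketch, assuming the PID is in column 2 and the PPID in column 3 as in the 'ps -ef' output shown; the function name pid_chain is made up for this example.

```shell
# Follow the parent links of one process through `ps -ef` style input.
# Reads the listing on stdin and prints "pid <- parent <- grandparent ...".
# pid_chain is a hypothetical helper; column positions are an assumption.
pid_chain() {
    # $1 = PID to start from
    awk -v start="$1" '
        $2 ~ /^[0-9]+$/ { parent[$2] = $3 }    # skips the header line
        END {
            pid = start
            out = pid
            # Stop at an unknown parent, a self-parent, or after 64 hops.
            for (n = 0; n < 64 && (pid in parent) && parent[pid] != pid; n++) {
                pid = parent[pid]
                out = out " <- " pid
            }
            print out
        }'
}

# Possible usage (commented out):
# ps -ef | pid_chain 1081549
```

Fed with the IRIX listing above, 'pid_chain 1081549' gives "1081549 <- 1076737 <- 1", so the slave task hangs off a pvmd3 whose parent is PID 1 rather than sge_shepherd, while 'pid_chain 1081403' leads back through the job script and sge_shepherd-542 to sge_execd.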
>> Can you please check the available options to sshd on IRIX? It might be
>> that they are different from the Linux ones (I remember an issue on
>> Solaris, where -i wasn't available).
>>
>> In the worst case, the use of OpenSSH might help.
>>
>>> Here's how the same job looks on a Linux system with loose PVM
>>> integration:
>>>
>>> $ ps -e f
>>> 17576 ?        S     10:36 /usr/local/share/sge/bin/lx24-x86/sge_execd
>>> 20089 ?        S      0:00  \_ sge_shepherd-543 -bg
>>> 20111 ?        S      0:00      \_ /bin/sh /var/local/sge/default/spool/hw-appsrv05/job_scripts/543
>>> 20116 ?        R      0:11          \_ /usr/local/share/pvm3/bin/LINUXI386/omega -in XXXXXXXX.in -out XXXXX.out -pvmconf /tmp/543.1.a
>>> 20108 ?        S      0:00 /usr/local/share/pvm3/lib/LINUXI386/pvmd3 /tmp/543.1.all.q/hostfile
>>> 20117 ?        R      0:14  \_ /usr/local/share/pvm3/bin/LINUXI386/omega run_in_pvm_slave_mode
>>>
>>> And here's how it looks on a Linux system with tight PVM  
>>> integration:
>>> $ ps -e f
>>> 17576 ?        S     10:37 /usr/local/share/sge/bin/lx24-x86/sge_execd
>>> 20158 ?        S      0:00  \_ sge_shepherd-544 -bg
>>> 20200 ?        S      0:00  |   \_ /bin/sh /var/local/sge/default/spool/hw-appsrv05/job_scripts/544
>>> 20206 ?        R      0:09  |       \_ /usr/local/share/pvm3/bin/LINUXI386/omega -in XXXXXX.in -out XXXXXX.out -pvmconf /tmp/544.1.a
>>> 20181 ?        S      0:00  \_ sge_shepherd-544 -bg
>>> 20182 ?        S      0:00      \_ sshd: pbisbal [priv]
>>> 20185 ?        S      0:00          \_ sshd: pbisbal@notty
>>> 20186 ?        S      0:00              \_ /usr/local/share/sge/utilbin/lx24-x86/qrsh_starter /var/local/sge/default/spool/hw-appsrv05/active_jobs/5
>>> 20198 ?        S      0:00                  \_ /usr/local/share/pvm3/lib/LINUXI386/pvmd3 /tmp/544.1.all.q/hostfile
>>> 20207 ?        R      0:10                      \_ /usr/local/share/pvm3/bin/LINUXI386/omega run_in_pvm_slave_mode
>>> 20178 ?        S      0:00 /usr/local/share/sge/bin/lx24-x86/qrsh -V -inherit hw-appsrv05.lexpharma.com env PVM_TMP=$TMPDIR /usr/local/share/pvm3/li
>>> 20183 ?        S      0:00  \_ /usr/bin/ssh -x -p 42731 hw-appsrv05.lexpharma.com exec '/usr/local/share/sge/utilbin/lx24-x86/qrsh_starter' '/var/lo
>>
>> Apart from the accounting being wrong (these processes are missing the
>> additional group ID, hence the Tight SSH patch), this looks okay.
>>
>> -- Reuti
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
>>
>>
>>
>> The contents of this communication, including any attachments, may
>> be confidential, privileged or otherwise protected from
>> disclosure.  They are intended solely for the use of the individual
>> or entity to whom they are addressed.  If you are not the intended
>> recipient, please do not read, copy, use or disclose the contents
>> of this communication.  Please notify the sender immediately and
>> delete the communication in its entirety.



