[GE users] PVM, SSh and vendor-specific host file

Bisbal, Prentice PBisbal at LexPharma.com
Fri Oct 6 18:49:46 BST 2006



Reuti, 

I ssh between these hosts all day long, so I know that the host key information is correct. I did add '-vvv' to the ssh commands on the queue. I get lots of output, but still no smoking gun. I've attached the output from a recently failed run. Maybe you can see something I didn't.


Prentice 



-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Thu 10/5/2006 5:04 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] PVM, SSh and vendor-specific host file
 
On 05.10.2006 at 22:31, Bisbal, Prentice wrote:

> Which rsh-wrapper are you referring to? Do you mean $SGE_ROOT/pvm/rsh?

Did you replace this with the scripts from the Howto? The SGE  
distribution doesn't contain them.

> If so, I made the change you requested. I looked through that  
> script, and didn't see any bash-specific code. It all looked like  
> standard Bourne-shell code.

It was just an idea ;-)

>
> I think I found a clue. Look at the SSH error on the first line
> below, from the .pe### file of a failed job:
>
> head tester_tight.sh.pe579
> ssh_exchange_identification: Connection closed by remote host

I have seen this before, even without any SGE involved. Can you try to
remove the line in known_hosts for the IRIX machine and log in again
once to get an updated entry? Another option might be to turn on the
most verbose level, -vvv, in the ssh call.

-- Reuti
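
The host key cleanup suggested above can be sketched as a small shell function. This is only an illustration, not a command taken from this thread: the name remove_host_key is hypothetical, and the sketch assumes an unhashed known_hosts file in which each line starts with a comma-separated list of host names. Newer OpenSSH releases also provide ssh-keygen -R for the same purpose.

```shell
#!/bin/sh
# Hypothetical helper: drop every entry for HOST from a known_hosts-style file.
# Assumes unhashed entries of the form "host1,host2 keytype key".
remove_host_key() {
    host=$1
    file=$2
    tmp="$file.tmp.$$"
    # Keep only the lines whose first field does not list HOST exactly.
    awk -v h="$host" '{
        n = split($1, names, ",")
        for (i = 1; i <= n; i++) if (names[i] == h) next
        print
    }' "$file" > "$tmp" && mv "$tmp" "$file"
}
```

After a call such as remove_host_key <irix-host> "$HOME/.ssh/known_hosts" (the host name is a placeholder), one interactive login to that host records a fresh entry.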


> libpvm [pid1195109] /tmp/579.1.all.q/pvmd.2500: No such file or  
> directory
> libpvm [pid1195109] /tmp/579.1.all.q/pvmd.2500: No such file or  
> directory
> libpvm [pid1195109] /tmp/579.1.all.q/pvmd.2500: No such file or  
> directory
> libpvm [pid1195109]: pvm_mytid(): Can't contact local daemon
>
> I googled the SSH error and recompiled SSH without TCP wrappers,
> which was the only advice I could find for this error. When SGE
> encounters this error, is it running under my username or as the
> sgeadmin user (the user sge_execd is running as)? I've turned the
> log level all the way up on the IRIX execution hosts, but haven't
> found any useful error messages there.
>
> Prentice
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wed 10/4/2006 6:38 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] PVM, SSh and vendor-specific host file
>
> On 03.10.2006 at 23:36, Bisbal, Prentice wrote:
>
>> Reuti,
>>
>> It seems that the values for TMPDIR and PVM_TMP aren't getting passed
>> correctly. What could cause that?
>>
>
> Which version of sh and bash are installed on IRIX? Can you try to
> edit the first line of the rsh-wrapper to read:
>
> #!/bin/bash
>
> Under Linux the sh is most often a link to bash. Maybe not on IRIX.
>
> -- Reuti
>
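
One way to see whether TMPDIR and PVM_TMP actually reach the remote side is to record them at the top of the wrapper script. The sketch below is an assumption for debugging, not part of the Howto scripts: the function name log_env and the log file path are hypothetical.

```shell
#!/bin/sh
# Hypothetical debugging helper: append the values of the named
# environment variables to a log file, one NAME=value line each.
log_env() {
    logfile=$1
    shift
    for name in "$@"; do
        # Indirect lookup via eval; a variable that is not set at all
        # is reported as UNSET.
        eval "value=\${$name-UNSET}"
        printf '%s=%s\n' "$name" "$value" >> "$logfile"
    done
}
```

Calling log_env /tmp/rsh-wrapper-env.log TMPDIR PVM_TMP near the start of the wrapper (the path is a placeholder) leaves a record that can be compared between the Linux and IRIX hosts.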
>
>> I am using the same version of OpenSSH for both: version 3.9p1, which
>> I compiled and installed myself. The only difference is that the
>> OpenSSH on the Linux systems was built from the Fedora Core 2 SRPM, so
>> some patches were included with that SRPM. I looked through the
>> patches quickly and don't think they should have an effect. I'm going
>> through the sshd_config and ssh_config files on all the hosts right
>> now to make sure they're all the same.
>>
>>
>> -- 
>> Prentice
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Tuesday, October 03, 2006 5:10 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] PVM, SSh and vendor-specific host file
>>
>> On 03.10.2006 at 21:24, Bisbal, Prentice wrote:
>>
>>> Again, I apologize for top-posting.
>>>
>>> `hostname` does return the FQDN for the hosts on all platforms, which
>>> is correct. In my original version of the script, I was using
>>> `uname -n`, which returns only the short name on IRIX. That *did*
>>> cause problems. All of my systems are set up to use the FQDN, both at
>>> the operating system level and in SGE (confirmed with the output of
>>> 'qconf -sel').
>>>
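
A script that depends on fully qualified names can guard against the short-name case described above. This is a minimal sketch, assuming that "contains a dot" is a good enough test for a FQDN; the function name is_fqdn is hypothetical and does not come from the scripts discussed here.

```shell
#!/bin/sh
# Hypothetical check: succeed only if the given name has a domain part,
# i.e. contains at least one dot.
is_fqdn() {
    case $1 in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}
```

For example, is_fqdn "$(uname -n)" || echo "short host name" >&2 would flag the IRIX behaviour before the name is written into a host file.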
>>> IRIX doesn't have a ps command capable of duplicating the tree
>>> structure of `ps -e f` on Linux, so here's just the output of
>>> 'ps -ef' with two processes running (one master and one slave). It's
>>> harder to look at, but if you look at the PID and PPID columns, you
>>> can figure out what's going on.
>>>
>>> This is the output of 'ps -ef' when using loose PVM integration on an
>>> IRIX host:
>>>
>>> $ ps -ef | egrep "pbisbal|sge" | sort -k 2
>>> sgeadmin     958543          1  0   Aug 15 ?      12:19 /usr/local/share/sge/bin/irix65/sge_execd
>>>  pbisbal    1054503    1080539  0 15:09:37 pts/0   0:00 ps -ef
>>>  pbisbal    1076737          1  0 15:01:24 ?       0:00 /usr/local/share/pvm3/lib/SGI64/pvmd3 /tmp/542.1.all.q/hostfile
>>>  pbisbal    1080539    1080862  0 10:50:02 pts/0   0:01 -bash
>>>  pbisbal    1080862    1080822  0 10:50:02 ?       0:01 /opt/sbin/sshd -R
>>> sgeadmin    1081200     958543  0 15:01:24 ?       0:00 sge_shepherd-542 -bg
>>>  pbisbal    1081403    1081457  0 15:01:34 ?       7:57 /usr/local/share/pvm3/bin/SGI64/omega -in XXXXXX.in -out XXXXXXX.out
>>>  pbisbal    1081457    1081200  0 15:01:34 ?       0:00 /bin/sh /var/local/sge/default/spool/hw-diesel/job_scripts/542
>>>  pbisbal    1081549    1076737  0 15:01:34 ?       8:02 /usr/local/share/pvm3/bin/SGI64/omega run_in_pvm_slave_mode
>>
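
Following the PID and PPID columns by hand, as described above, can be automated with a few lines of awk. This is a sketch under stated assumptions: the function name ancestry is hypothetical, and it assumes 'ps -ef'-style input on one line per process, with the PID in column 2 and the PPID in column 3.

```shell
#!/bin/sh
# Hypothetical helper: print the chain of parent PIDs for one process,
# reading "ps -ef"-style lines (UID PID PPID ...) from standard input.
ancestry() {
    awk -v pid="$1" '
        { parent[$2] = $3 }    # column 2 is the PID, column 3 the PPID
        END {
            chain = pid
            # Walk upward until the parent is not in the listing; the
            # second test stops at a process that is its own parent.
            while ((pid in parent) && parent[pid] != pid) {
                pid = parent[pid]
                chain = chain " <- " pid
            }
            print chain
        }'
}
```

Fed the IRIX listing above, ancestry 1081403 gives 1081403 <- 1081457 <- 1081200 <- 958543 <- 1, so the master omega process sits under sge_shepherd and sge_execd. The slave, 1081549, leads only to pvmd3 (1076737) and then to PID 1, outside the SGE process tree.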
>> Can you please check the available options to sshd on IRIX? It might
>> be that they are different from the Linux ones (I remember an issue on
>> Solaris, where -i wasn't available).
>>
>> In the worst case, the use of OpenSSH might help.
>>
>>> Here's how the same job looks on a Linux system with loose PVM
>>> integration:
>>>
>>> $ ps -e f
>>> 17576 ?        S     10:36 /usr/local/share/sge/bin/lx24-x86/sge_execd
>>> 20089 ?        S      0:00  \_ sge_shepherd-543 -bg
>>> 20111 ?        S      0:00      \_ /bin/sh /var/local/sge/default/spool/hw-appsrv05/job_scripts/543
>>> 20116 ?        R      0:11          \_ /usr/local/share/pvm3/bin/LINUXI386/omega -in XXXXXXXX.in -out XXXXX.out -pvmconf /tmp/543.1.a
>>> 20108 ?        S      0:00 /usr/local/share/pvm3/lib/LINUXI386/pvmd3 /tmp/543.1.all.q/hostfile
>>> 20117 ?        R      0:14  \_ /usr/local/share/pvm3/bin/LINUXI386/omega run_in_pvm_slave_mode
>>>
>>> And here's how it looks on a Linux system with tight PVM integration:
>>>
>>> $ ps -e f
>>> 17576 ?        S     10:37 /usr/local/share/sge/bin/lx24-x86/sge_execd
>>> 20158 ?        S      0:00  \_ sge_shepherd-544 -bg
>>> 20200 ?        S      0:00  |   \_ /bin/sh /var/local/sge/default/spool/hw-appsrv05/job_scripts/544
>>> 20206 ?        R      0:09  |       \_ /usr/local/share/pvm3/bin/LINUXI386/omega -in XXXXXX.in -out XXXXXX.out -pvmconf /tmp/544.1.a
>>> 20181 ?        S      0:00  \_ sge_shepherd-544 -bg
>>> 20182 ?        S      0:00      \_ sshd: pbisbal [priv]
>>> 20185 ?        S      0:00          \_ sshd: pbisbal@notty
>>> 20186 ?        S      0:00              \_ /usr/local/share/sge/utilbin/lx24-x86/qrsh_starter /var/local/sge/default/spool/hw-appsrv05/active_jobs/5
>>> 20198 ?        S      0:00                  \_ /usr/local/share/pvm3/lib/LINUXI386/pvmd3 /tmp/544.1.all.q/hostfile
>>> 20207 ?        R      0:10                      \_ /usr/local/share/pvm3/bin/LINUXI386/omega run_in_pvm_slave_mode
>>> 20178 ?        S      0:00 /usr/local/share/sge/bin/lx24-x86/qrsh -V -inherit hw-appsrv05.lexpharma.com env PVM_TMP=$TMPDIR /usr/local/share/pvm3/li
>>> 20183 ?        S      0:00  \_ /usr/bin/ssh -x -p 42731 hw-appsrv05.lexpharma.com exec '/usr/local/share/sge/utilbin/lx24-x86/qrsh_starter' '/var/lo
>>
>> Apart from the accounting being wrong (the additional group ID is
>> missing for these processes, hence the Tight SSH patch), this looks
>> okay.
>>
>> -- Reuti
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
>>
>>
>>
>> The contents of this communication, including any attachments, may
>> be confidential, privileged or otherwise protected from
>> disclosure.  They are intended solely for the use of the individual
>> or entity to whom they are addressed.  If you are not the intended
>> recipient, please do not read, copy, use or disclose the contents
>> of this communication.  Please notify the sender immediately and
>> delete the communication in its entirety.
>>
>>
>




    [ Part 2: "Attached Text" ]
