[GE users] qrsh fails

Jean-Paul Minet minet at cism.ucl.ac.be
Fri Jan 27 10:23:04 GMT 2006



Reuti,


>> 'Qrsh' functionality in SGE is rather a crap-shoot.  I've got two
>> systems running SGE V60u3.  On one, a Caos Linux with kernel 2.6.13.4
>> and Glibc 2.3.3, 'qrsh' and 'qlogin' work fine right out of the box.
>> On the other, Red Hat 9 with kernel 2.4.20-31 and Glibc 2.3.2, I have
>> to make '$SGE_ROOT/bin/lx24-x86/qsh' SUID 'root' in order for 'qrsh'
>> to work properly.  On a RHEL 3 system with kernel 2.4.21-32, Glibc
>> 2.3.2, SGE V60u4, again 'qrsh' and 'qlogin' work in the default
>> installation.  On another Caos system, kernel 2.6.13, Glibc 2.3.3,
>> SGE V60u6, 'qlogin' works, but 'qrsh' doesn't, no matter whether
>> '$SGE_ROOT/bin/lx24-x86/qsh' is SUID 'root' or not.
>>
> 
> I never needed to make qrsh/qlogin SUID to any special account. The
> interactive qrsh needs some tools from the rlogin package, and qlogin
> some from the telnet package. In both cases Linux doesn't need to start
> any daemons on its own; the programs just have to be present.
> 
> To get a passwordless login for qrsh, the one and only entry in
> /etc/hosts.equiv on the nodes has to name the login/head node of the
> cluster.
> 
> Other places to look into are hosts.allow/deny, PAM and firewall  settings.
> 
> What were the error messages you got without SUID? - Reuti
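Reuti's hosts.equiv advice boils down to something like this (a sketch only: "lmexec-86" is assumed to be the login/head node from the session below, and a temp file stands in for the real /etc/hosts.equiv so it is safe to try anywhere without root):

```shell
# Sketch of the /etc/hosts.equiv setup described above.
# Assumptions: "lmexec-86" is the login/head node; a temp file stands in
# for the real /etc/hosts.equiv so this can be run without root.
HEADNODE=lmexec-86
EQUIV=$(mktemp)

# The one and only entry names the head node qrsh sessions originate from:
echo "$HEADNODE" > "$EQUIV"
cat "$EQUIV"
```

On a real execution node the file is /etc/hosts.equiv itself (root-owned); if logins still prompt for a password, also check /etc/hosts.allow, /etc/hosts.deny and the PAM configuration for rsh/rlogin, as Reuti says.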

As root, qrsh works fine.  As a normal user, I get:
minet@lmexec-86 ~ >qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-75 date
your job 2496 ("date") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 2496 has been successfully scheduled.
Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host lmexec-75 ...
rcmd: socket: Permission denied

Note that qlogin works as a normal user:
minet@lmexec-86 ~ >qlogin -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-75
your job 2497 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 2497 has been successfully scheduled.
Establishing telnet session to host lmexec-75 ...
Trying 192.168.241.75...
Connected to lmexec-75.
Escape character is '^]'.
Welcome to SUSE LINUX Enterprise Server 9 (x86_64) - Kernel 2.6.5-7.97-smp (1).

This is with the SUID bit set on utilbin/rlogin and rsh (as explained in the howtos).
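For what it's worth, "rcmd: socket: Permission denied" is the classic symptom of the rsh client being unable to bind a privileged source port (below 1024), which is exactly what the setuid-root bit is there for. A quick way to check the bit (illustrated on a temp file so it runs without root; on the cluster, point the same test at the utilbin binaries):

```shell
# Check whether a binary carries the setuid bit the SGE rsh/rlogin
# helpers need in order to bind a privileged port as a normal user.
# Illustrated on a temp file (you may chmod your own file without root);
# on the cluster, test e.g. /gridware/sge/utilbin/lx24-amd64/rsh instead.
f=$(mktemp)
chmod 4755 "$f"                 # setuid + rwxr-xr-x, as the howtos suggest
if [ -u "$f" ]; then
    echo "setuid bit set"
else
    echo "setuid bit missing"
fi
rm -f "$f"
```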

Any hint?

jp


> 
>> All in all, a crap-shoot.
>>
>> David S.
>>
>> On Mon, Jan 16, 2006 at 09:16:50AM +0100, Jean-Paul Minet wrote:
>>
>>> Reuti,
>>>
>>>>> I am trying to get tight integration to work (MPICH 1.2.6 et SGE
>>>>> 6.0u6) and face a problem with qrsh.  Trying to debug it separately
>>>>> from the integration bit, I obtain a "poll:protocol failure in
>>>>> circuit setup" on the host initiating the qrsh (cf. below).  On the
>>>>> target host, I get the following weird messages:
>>>>>
>>>>> Message from syslogd at lmexec-92 at Fri Jan 13 10:47:21 2006 ...
>>>>> lmexec-92 kernel: Oops: 0000 [2] SMP
>>>>>
>>>>> Message from syslogd at lmexec-92 at Fri Jan 13 10:47:21 2006 ...
>>>>> lmexec-92 kernel: CR2: 0000000000000108
>>>>>
>>>>> We use SUSE 9.0 (kernel 2.6.5-7.97-smp) on Sun V20z (bi-opteron).
>>>>>
>>>>
>>>> this looks like a bug in the kernel - was the 2.6.5-7.97-smp kernel
>>>> the latest for 9.0?
>>>
>>>
>>> We actually use SLES 9 (the enterprise version).  The cluster was
>>> purchased and installed last quarter.  I checked on the Novell site and
>>> didn't see any subsequent release.
>>>
>>>> - Is this on all hosts or only on one specific one?
>>>
>>>
>>> Just tried with a few hosts, and the behavior is the same...
>>>
>>>> - Is this new, or did it work before? As 9.0 isn't the latest of 9.x,
>>>> I'd assume your cluster has already been in operation for some time now.
>>>
>>>
>>> It never worked before.  The install is new; SGE is configured and
>>> more or less working, except for bits and pieces here and there, among
>>> which tight integration for the mpich/ethernet interconnect.  I also
>>> have trouble with the Infiniband interconnect integration: the patch
>>> for mpich/Infiniband and SGE tight integration, available on the HowTo
>>> site, doesn't match the version of mpich supplied and customized by
>>> the Infiniband vendor.  I am awaiting support from the Infiniband
>>> vendor to get the latest mpich/mvapich version installed/customized.
>>>
>>> thnks & rgds
>>>
>>> Jean-Paul
>>>
>>>> -- Reuti
>>>>
>>>>> Would someone have an idea on how to further debug the problem (I
>>>>> have tried using tcpdump between the submit host and the target host,
>>>>> as well as the qmaster host and the target host, to dig into the
>>>>> communication bits, but it's getting complicated...)?
>>>>>
>>>>> Thks for any help
>>>>>
>>>>> Jean-paul
>>>>>
>>>>> ---- qrsh command and output ----
>>>>> lemaitre /gridware/sge/bin/lx24-amd64 # qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
>>>>> local configuration lemaitre.cism.ucl.ac.be not defined - using global configuration
>>>>> your job 1788 ("date") has been submitted
>>>>> waiting for interactive job to be scheduled ...
>>>>> Your interactive job 1788 has been successfully scheduled.
>>>>> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host lmexec-92 ...
>>>>> poll: protocol failure in circuit setup
>>>>> /gridware/sge/utilbin/lx24-amd64/rsh exited with exit code 1
>>>>> reading exit code from shepherd ... 129
>>>>>
>>>>> -- 
>>>>> Jean-Paul Minet
>>>>> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
>>>>> Université Catholique de Louvain
>>>>> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>
>>>>
>>>>
>>>
>>
>>
> 
> 

-- 
Jean-Paul Minet
Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
Université Catholique de Louvain
Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52




