[GE users] qrsh fails

Reuti reuti at staff.uni-marburg.de
Fri Jan 27 10:44:47 GMT 2006



On 27.01.2006, at 11:23, Jean-Paul Minet wrote:

> Reuti,
>
>
>>> 'Qrsh' functionality in SGE is rather a crap-shoot.  I've got two
>>> systems running SGE V60u3.  On one, a Caos Linux with kernel
>>> 2.6.13.4 and Glibc 2.3.3, 'qrsh' and 'qlogin' work fine right out
>>> of the box.  On the other, Red Hat 9 with kernel 2.4.20-31 and
>>> Glibc 2.3.2, I have to make '$SGE_ROOT/bin/lx24-x86/qsh' SUID
>>> 'root' in order for 'qrsh' to work properly.  On a RHEL 3 system
>>> with kernel 2.4.21-32, Glibc 2.3.2, SGE V60u4, again 'qrsh' and
>>> 'qlogin' work in the default installation.  On another Caos
>>> system, kernel 2.6.13, Glibc 2.3.3, SGE V60u6, 'qlogin' works, but
>>> 'qrsh' doesn't, no matter whether '$SGE_ROOT/bin/lx24-x86/qsh' is
>>> SUID 'root' or not.
>>>
>> I never needed to make qrsh/qlogin SUID to any special account.
>> The interactive qrsh needs some tools from the rlogin package, and
>> qlogin needs some from the telnet package. In both cases there is
>> no need for Linux to start up any daemons on its own; the programs
>> just have to be there.
>> To get a passwordless login for qrsh, the one and only entry in
>> /etc/hosts.equiv on the nodes has to name the login/headnode of
>> the cluster; see the sketch below.
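>>
>> For illustration, a minimal sketch ("lmaster" is just a placeholder
>> for the real headnode name):
>>
>> $ cat /etc/hosts.equiv
>> lmaster
>>
>> With this single line in place, rshd on the node trusts same-named
>> users connecting from lmaster, which is what qrsh's passwordless
>> login relies on.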
>> Other places to look into are hosts.allow/deny, PAM, and firewall
>> settings.
>> What were the error messages you got without SUID? - Reuti
>
> As root, qrsh is working OK.  As a normal user, I get:
> minet@lmexec-86 ~ >qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-75 date
> your job 2496 ("date") has been submitted
> waiting for interactive job to be scheduled ...
> Your interactive job 2496 has been successfully scheduled.
> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host lmexec-75 ...
> rcmd: socket: Permission denied
>

What do

$ ls -lh /gridware/sge/utilbin/lx24-amd64/rlogin
$ ls -lh /gridware/sge/utilbin/lx24-amd64/rsh

show? - Reuti
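
For background: SGE's rsh wrapper is an ordinary rsh client, and rsh
uses rcmd(3), which must bind a reserved port (below 1024) - something
only root may do. If the binary is not setuid root, the bind fails and
you get exactly "rcmd: socket: Permission denied". On a working
installation the listing would look something like this (a sketch -
owner, size, and date will differ on your system):

$ ls -lh /gridware/sge/utilbin/lx24-amd64/rsh
-rwsr-xr-x 1 root root ... /gridware/sge/utilbin/lx24-amd64/rsh

i.e. owned by root with the setuid bit (the "s" in "rws") set. If it
shows "-rwxr-xr-x" or an owner other than root, then (as root):

# chown root /gridware/sge/utilbin/lx24-amd64/rsh
# chmod u+s /gridware/sge/utilbin/lx24-amd64/rsh

qlogin is unaffected because it rides on telnet, which connects from an
unprivileged port. (The "poll: protocol failure in circuit setup" in
the older message quoted below is a different rsh quirk: for stderr the
rsh protocol opens a second connection back from the execution host to
a reserved port on the client, and a firewall dropping that
back-channel produces exactly this failure.)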


> Note that qlogin works as normal user:
> minet@lmexec-86 ~ >qlogin -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-75
> your job 2497 ("QLOGIN") has been submitted
> waiting for interactive job to be scheduled ...
> Your interactive job 2497 has been successfully scheduled.
> Establishing telnet session to host lmexec-75 ...
> Trying 192.168.241.75...
> Connected to lmexec-75.
> Escape character is '^]'.
> Welcome to SUSE LINUX Enterprise Server 9 (x86_64) - Kernel 2.6.5-7.97-smp (1).
>
> This is with SUID on utilbin/rlogin and rsh (as explained in the howtos).
>
> Any hint?
>
> jp
>
>
>>> All in all, a crap-shoot.
>>>
>>> David S.
>>>
>>> On Mon, Jan 16, 2006 at 09:16:50AM +0100, Jean-Paul Minet wrote:
>>>
>>>> Reuti,
>>>>
>>>>>> I am trying to get tight integration to work (MPICH 1.2.6 and
>>>>>> SGE 6.0u6) and face a problem with qrsh.  Trying to debug it
>>>>>> separately from the integration bit, I obtain a "poll: protocol
>>>>>> failure in circuit setup" on the host initiating the qrsh (cf.
>>>>>> below).  On the target host, I get the following weird messages:
>>>>>>
>>>>>> Message from syslogd@lmexec-92 at Fri Jan 13 10:47:21 2006 ...
>>>>>> lmexec-92 kernel: Oops: 0000 [2] SMP
>>>>>>
>>>>>> Message from syslogd@lmexec-92 at Fri Jan 13 10:47:21 2006 ...
>>>>>> lmexec-92 kernel: CR2: 0000000000000108
>>>>>>
>>>>>> We use SUSE 9.0 (kernel 2.6.5-7.97-smp) on Sun V20z (bi-opteron).
>>>>>>
>>>>>
>>>>> this looks like a bug in the kernel - was the 2.6.5-7.97-smp
>>>>> kernel the latest for 9.0?
>>>>
>>>>
>>>> We actually use SLES 9 (the enterprise version).  The cluster was
>>>> purchased and installed last quarter.  I checked on the Novell
>>>> site and didn't see any subsequent release.
>>>>
>>>>> - Is this on all hosts or only on one specific one?
>>>>
>>>>
>>>> Just tried with a few hosts, and the behavior is the same...
>>>>
>>>>> - Is this new and worked before? As 9.0 isn't the latest of 9.x,
>>>>> I'd assume that your cluster has already been in operation for
>>>>> some time now.
>>>>
>>>>
>>>> It never worked before.  The install is new; SGE is configured
>>>> and more or less working, except for bits and pieces here and
>>>> there, among which tight integration for the mpich/ethernet
>>>> interconnect.  I also have trouble with the Infiniband
>>>> interconnect integration: the patch for mpich/Infiniband and SGE
>>>> tight integration, available on the HowTo site, doesn't match the
>>>> version of mpich supplied and customized by the Infiniband
>>>> vendor.  I am awaiting support from the Infiniband vendor to get
>>>> the latest mpich/mvapich version installed/customized.
>>>>
>>>> thanks & regards
>>>>
>>>> Jean-Paul
>>>>
>>>>> -- Reuti
>>>>>
>>>>>> Would someone have an idea on how to further debug the problem?
>>>>>> (I have tried using tcpdump between the submit host and the
>>>>>> target host, as well as between the qmaster host and the target
>>>>>> host, to dig into the communication bits, but it's getting
>>>>>> complicated...)
>>>>>>
>>>>>> Thanks for any help
>>>>>>
>>>>>> Jean-Paul
>>>>>>
>>>>>> ---- qrsh command and output ----
>>>>>> lemaitre /gridware/sge/bin/lx24-amd64 # qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
>>>>>> local configuration lemaitre.cism.ucl.ac.be not defined - using
>>>>>> global configuration
>>>>>> your job 1788 ("date") has been submitted
>>>>>> waiting for interactive job to be scheduled ...
>>>>>> Your interactive job 1788 has been successfully scheduled.
>>>>>> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host lmexec-92 ...
>>>>>> poll: protocol failure in circuit setup
>>>>>> /gridware/sge/utilbin/lx24-amd64/rsh exited with exit code 1
>>>>>> reading exit code from shepherd ... 129
>>>>>>
>
> -- 
> Jean-Paul Minet
> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de  
> Masse
> Université Catholique de Louvain
> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



