[GE users] qrsh fails

Jean-Paul Minet minet at cism.ucl.ac.be
Fri Jan 27 11:09:09 GMT 2006



Reuti,

<snip>

>> As root, qrsh works OK.  As a normal user, I get:
>> minet@lmexec-86 ~ >qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-75 date
>> your job 2496 ("date") has been submitted
>> waiting for interactive job to be scheduled ...
>> Your interactive job 2496 has been successfully scheduled.
>> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host lmexec-75 ...
>> rcmd: socket: Permission denied
>>
> 
> what is:
> 
> $ ls -lh /gridware/sge/utilbin/lx24-amd64/rlogin
> $ ls -lh /gridware/sge/utilbin/lx24-amd64/rsh

OK, got it.  rsh and rlogin were owned by sgeadmin :-( (as installed by the Sun
engineers).  I changed their owner to root and qrsh now works.  Thanks again for
your help.
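
For anyone hitting the same "rcmd: socket: Permission denied": rsh has to open a reserved port (below 1024), which needs root, so the SUID bit only helps when the binary is owned by root. Below is a minimal sketch of a check, assuming a POSIX shell. The function name check_suid_root is made up for illustration and is not part of SGE.

```shell
# Report whether a file is owned by root (uid 0) and has the setuid bit set.
# Returns 0 if both hold, 1 if either check fails, 2 if the file is missing.
check_suid_root() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "$f: missing"
        return 2
    fi
    # Third field of "ls -ldn" is the numeric owner id.
    uid=$(ls -ldn "$f" | awk '{print $3}')
    if [ "$uid" != "0" ]; then
        echo "$f: owned by uid $uid, not root"
        return 1
    fi
    # "-u" is true when the setuid bit is set.
    if [ ! -u "$f" ]; then
        echo "$f: setuid bit not set"
        return 1
    fi
    echo "$f: ok"
    return 0
}
```

Called as check_suid_root /gridware/sge/utilbin/lx24-amd64/rsh (and likewise for rlogin), it should print "ok" once the owner is root. Note that on Linux a chown can clear the SUID bit, so it may need to be set again after changing the owner.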

jp
> 
> saying? - Reuti
> 
> 
>> Note that qlogin works as a normal user:
>> minet@lmexec-86 ~ >qlogin -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-75
>> your job 2497 ("QLOGIN") has been submitted
>> waiting for interactive job to be scheduled ...
>> Your interactive job 2497 has been successfully scheduled.
>> Establishing telnet session to host lmexec-75 ...
>> Trying 192.168.241.75...
>> Connected to lmexec-75.
>> Escape character is '^]'.
>> Welcome to SUSE LINUX Enterprise Server 9 (x86_64) - Kernel 2.6.5-7.97-smp (1).
>>
>> This is with the SUID bit set on utilbin/rlogin and utilbin/rsh (as explained in the HowTos).
>>
>> Any hint?
>>
>> jp
>>
>>
>>>> All in all, a crap-shoot.
>>>>
>>>> David S.
>>>>
>>>> On Mon, Jan 16, 2006 at 09:16:50AM +0100, Jean-Paul Minet wrote:
>>>>
>>>>> Reuti,
>>>>>
>>>>>>> I am trying to get tight integration to work (MPICH 1.2.6 and SGE
>>>>>>> 6.0u6) and face a problem with qrsh.  Trying to debug it separately
>>>>>>> from the integration bit, I obtain a "poll: protocol failure in
>>>>>>> circuit setup" on the host initiating the qrsh (cf. below).  On the
>>>>>>> target host, I get the following weird messages:
>>>>>>>
>>>>>>> Message from syslogd@lmexec-92 at Fri Jan 13 10:47:21 2006 ...
>>>>>>> lmexec-92 kernel: Oops: 0000 [2] SMP
>>>>>>>
>>>>>>> Message from syslogd@lmexec-92 at Fri Jan 13 10:47:21 2006 ...
>>>>>>> lmexec-92 kernel: CR2: 0000000000000108
>>>>>>>
>>>>>>> We use SUSE 9.0 (kernel 2.6.5-7.97-smp) on Sun V20z (dual-Opteron) machines.
>>>>>>>
>>>>>>
>>>>>> this looks like a bug in the kernel - was the 2.6.5-7.97-smp kernel
>>>>>> the latest for 9.0?
>>>>>
>>>>>
>>>>>
>>>>> We actually use SLES 9 (enterprise version).  The cluster was
>>>>> purchased and installed last quarter.  I checked on the Novell site
>>>>> and didn't see any subsequent release.
>>>>>
>>>>>> - Is this on all hosts or only on one specific one?
>>>>>
>>>>>
>>>>>
>>>>> Just tried with a few hosts, and the behavior is the same...
>>>>>
>>>>>> - Is this new, or did it work before? As 9.0 isn't the latest of 9.x, I'd
>>>>>> assume that your cluster has already been in operation for some time now.
>>>>>
>>>>>
>>>>>
>>>>> It never worked before.  The install is new; SGE is configured and more
>>>>> or less working, except for bits and pieces here and there, among which
>>>>> is tight integration for the mpich/ethernet interconnect.  I also have
>>>>> trouble with the infiniband interconnect integration: the patch for
>>>>> mpich/infiniband and SGE tight integration, available on the HowTo site,
>>>>> doesn't match the version of mpich supplied and customized by the
>>>>> Infiniband vendor.  I am awaiting support from the Infiniband vendor to
>>>>> get the latest mpich/mvapich version installed/customized.
>>>>>
>>>>> Thanks & regards
>>>>>
>>>>> Jean-Paul
>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>> Would someone have an idea on how to further debug the problem (I
>>>>>>> have tried using tcpdump between the submit host and the target host,
>>>>>>> as well as the qmaster host and the target host, to dig into
>>>>>>> communication bits, but it's getting complicated...)?
>>>>>>>
>>>>>>> Thanks for any help
>>>>>>>
>>>>>>> Jean-Paul
>>>>>>>
>>>>>>> ---- qrsh command and output ----
>>>>>>> lemaitre /gridware/sge/bin/lx24-amd64 # qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
>>>>>>> local configuration lemaitre.cism.ucl.ac.be not defined - using global configuration
>>>>>>> your job 1788 ("date") has been submitted
>>>>>>> waiting for interactive job to be scheduled ...
>>>>>>> Your interactive job 1788 has been successfully scheduled.
>>>>>>> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host lmexec-92 ...
>>>>>>> poll: protocol failure in circuit setup
>>>>>>> /gridware/sge/utilbin/lx24-amd64/rsh exited with exit code 1
>>>>>>> reading exit code from shepherd ... 129
>>>>>>>
>>>>>>> -- 
>>>>>>> Jean-Paul Minet
>>>>>>> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
>>>>>>> Université Catholique de Louvain
>>>>>>> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net

-- 
Jean-Paul Minet
Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
Université Catholique de Louvain
Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
