[GE users] qrsh fails

Reuti reuti at staff.uni-marburg.de
Fri Jan 13 19:55:55 GMT 2006


Hi,

On 13.01.2006 at 17:49, Jean-Paul Minet wrote:

> Hi,
>
> I am trying to get tight integration to work (MPICH 1.2.6 and SGE  
> 6.0u6) and face a problem with qrsh.  Trying to debug it separately  
> from the integration bit, I obtain a "poll: protocol failure in  
> circuit setup" on the host initiating the qrsh (cf. below).  On  
> the target host, I get the following weird messages:
>
> Message from syslogd@lmexec-92 at Fri Jan 13 10:47:21 2006 ...
> lmexec-92 kernel: Oops: 0000 [2] SMP
>
> Message from syslogd@lmexec-92 at Fri Jan 13 10:47:21 2006 ...
> lmexec-92 kernel: CR2: 0000000000000108
>
> We use SUSE 9.0 (kernel 2.6.5-7.97-smp) on Sun V20z (dual Opteron).
>

This looks like a bug in the kernel. Was 2.6.5-7.97-smp the latest  
kernel for 9.0?

- Is this on all hosts or only on one specific one?

- Is this new, and did it work before? As 9.0 isn't the latest of 9.x, I'd  
assume that your cluster has already been in operation for some time.
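
To answer the first question quickly, a loop along these lines could run the same probe against each execution host. It is only a sketch: probe_host and check_hosts are made-up helper names, the QRSH variable is an assumption added so the command can be swapped out, and the host names you pass in must be your own.

```shell
#!/bin/sh
# Sketch only: probe_host/check_hosts are hypothetical helper names.
# QRSH is an assumed override hook; by default plain "qrsh" is used.

# Run a trivial job ("date") in the all.q queue instance of one host.
# The exit status of qrsh tells us whether the session was established.
probe_host() {
    host=$1
    ${QRSH:-qrsh} -q "all.q@${host}" date >/dev/null 2>&1
}

# Probe every host given on the command line and print one line per host.
check_hosts() {
    for host in "$@"; do
        if probe_host "$host"; then
            echo "$host ok"
        else
            echo "$host FAILED"
        fi
    done
}
```

Called as "check_hosts lmexec-91 lmexec-92" (lmexec-91 being a hypothetical second host), it prints one line per host, which shows at a glance whether only lmexec-92 is affected.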

-- Reuti

> Would someone have an idea on how to further debug the problem (I  
> have tried using tcpdump between the submit host and the target  
> host, as well as the qmaster host and the target host, to dig into  
> communication bits, but it's getting complicated...)?
>
> Thanks for any help
>
> Jean-Paul
>
> ---- qrsh command and output ----
> lemaitre /gridware/sge/bin/lx24-amd64 # qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
> local configuration lemaitre.cism.ucl.ac.be not defined - using  
> global configuration
> your job 1788 ("date") has been submitted
> waiting for interactive job to be scheduled ...
> Your interactive job 1788 has been successfully scheduled.
> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host  
> lmexec-92 ...
> poll: protocol failure in circuit setup
> /gridware/sge/utilbin/lx24-amd64/rsh exited with exit code 1
> reading exit code from shepherd ... 129
>
> -- 
> Jean-Paul Minet
> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de  
> Masse
> Université Catholique de Louvain
> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net