[GE users] qrsh fails

Reuti reuti at staff.uni-marburg.de
Mon Jan 16 14:27:29 GMT 2006



On 16.01.2006 at 09:16, Jean-Paul Minet wrote:

> Reuti,
>
>>> I am trying to get tight integration to work (MPICH 1.2.6 and SGE
>>> 6.0u6) and face a problem with qrsh.  Trying to debug it
>>> separately from the integration bit, I obtain a "poll: protocol
>>> failure in circuit setup" on the host initiating the qrsh (cf.
>>> below).  On the target host, I get the following weird messages:
>>>
>>> Message from syslogd@lmexec-92 at Fri Jan 13 10:47:21 2006 ...
>>> lmexec-92 kernel: Oops: 0000 [2] SMP
>>>
>>> Message from syslogd@lmexec-92 at Fri Jan 13 10:47:21 2006 ...
>>> lmexec-92 kernel: CR2: 0000000000000108
>>>
>>> We use SUSE 9.0 (kernel 2.6.5-7.97-smp) on Sun V20z (bi-opteron).
>>>
>> This looks like a bug in the kernel - was the 2.6.5-7.97-smp
>> kernel the latest for 9.0?
>
> We actually use SLES 9 (enterprise version).  The cluster was
> purchased and installed last quarter.  I checked on the Novell site
> and didn't see any subsequent release.
>
>> - Is this on all hosts or only on one specific one?
>
> Just tried with a few hosts, and the behavior is the same...
>

Okay, so it's not a hardware problem. Can you please check
/var/log/messages on the nodes (not the messages file from SGE)? What
type of network card is installed, and which module is loaded for it?

lsmod
lspci

might give you some hints. - Reuti
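
The checks above could be scripted roughly as follows. This is only a sketch: the function names kernel_oops_lines and show_nic_info are made up for illustration, and the grep pattern is an assumption about what the Oops lines look like in the syslog file (based on the messages quoted in this thread).

```shell
#!/bin/sh
# Hypothetical helper names; adjust the log path for your distribution.

# Print kernel Oops-related lines from a syslog file
# (default: /var/log/messages, or the path given as first argument).
kernel_oops_lines() {
    log="${1:-/var/log/messages}"
    grep -E 'kernel: .*(Oops|CR2|RIP|Call Trace)' "$log"
}

# List Ethernet controllers and the currently loaded kernel modules.
show_nic_info() {
    lspci | grep -i ethernet
    lsmod
}
```

Run on an execution node (e.g. lmexec-92), "kernel_oops_lines" followed by "show_nic_info" should show whether the Oops coincides with the qrsh attempt and which driver module belongs to the network card.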

>> - Is this new, or did it work before? As 9.0 isn't the latest of 9.x,
>> I'd assume that your cluster has already been in operation for some
>> time now.
>
> It never worked before.  The install is new; SGE is configured and
> more or less working, except for bits and pieces here and there,
> among them tight integration for the mpich/ethernet interconnect.  I
> also have trouble with the infiniband interconnect integration: the
> patch for mpich/infiniband and SGE tight integration, available on
> the HowTo site, doesn't match the version of mpich supplied and
> customized by the Infiniband vendor.  I am awaiting support from the
> Infiniband vendor to get the latest mpich/mvapich version
> installed/customized.
>
> Thanks & regards
>
> Jean-Paul
>
>> -- Reuti
>>> Would someone have an idea on how to further debug the problem
>>> (I have tried using tcpdump between the submit host and the
>>> target host, as well as the qmaster host and the target host, to
>>> dig into the communication bits, but it's getting complicated...)?
>>>
>>> Thanks for any help
>>>
>>> Jean-paul
>>>
>>> ---- qrsh command and output ----
>>> lemaitre /gridware/sge/bin/lx24-amd64 # qrsh -verbose -l
>>> mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
>>> local configuration lemaitre.cism.ucl.ac.be not defined - using
>>> global configuration
>>> your job 1788 ("date") has been submitted
>>> waiting for interactive job to be scheduled ...
>>> Your interactive job 1788 has been successfully scheduled.
>>> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to
>>> host lmexec-92 ...
>>> poll: protocol failure in circuit setup
>>> /gridware/sge/utilbin/lx24-amd64/rsh exited with exit code 1
>>> reading exit code from shepherd ... 129
>>>
>>> -- 
>>> Jean-Paul Minet
>>> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage  
>>> de  Masse
>>> Université Catholique de Louvain
>>> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net