[GE users] qrsh fails

Jean-Paul Minet minet at cism.ucl.ac.be
Mon Jan 16 08:16:50 GMT 2006


Reuti,

>> I am trying to get tight integration to work (MPICH 1.2.6 and SGE
>> 6.0u6) and face a problem with qrsh.  Trying to debug it separately
>> from the integration bit, I obtain a "poll: protocol failure in
>> circuit setup" on the host initiating the qrsh (see below).  On the
>> target host, I get the following weird messages:
>>
>> Message from syslogd@lmexec-92 at Fri Jan 13 10:47:21 2006 ...
>> lmexec-92 kernel: Oops: 0000 [2] SMP
>>
>> Message from syslogd@lmexec-92 at Fri Jan 13 10:47:21 2006 ...
>> lmexec-92 kernel: CR2: 0000000000000108
>>
>> We use SUSE 9.0 (kernel 2.6.5-7.97-smp) on Sun V20z (dual Opteron).
>>
> 
> This looks like a bug in the kernel - was the 2.6.5-7.97-smp kernel the
> latest for 9.0?

We actually use SLES 9 (the Enterprise version).  The cluster was purchased and
installed last quarter.  I checked the Novell site and didn't see any
subsequent kernel release.

> - Is this on all hosts or only on one specific one?

Just tried with a few hosts, and the behavior is the same...
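
To make the per-host comparison easier to repeat, something like the
following could be used.  It is only a rough sketch: the function name
summarize_qrsh_log is made up for illustration, and the two patterns it
looks for are simply the lines from the qrsh -verbose output quoted
further down ("protocol failure in circuit setup" and "reading exit
code from shepherd").

```sh
#!/bin/sh
# Sketch: summarize one `qrsh -verbose` transcript read from stdin.
# summarize_qrsh_log is a hypothetical helper name, not part of SGE.
summarize_qrsh_log() {
    awk '
        # The rsh session could not be set up.
        /protocol failure in circuit setup/ { failed = 1 }
        # The last field of this line is the exit code reported by the shepherd.
        /reading exit code from shepherd/   { code = $NF }
        END {
            printf "rsh_setup_failed=%s shepherd_exit=%s\n",
                   (failed ? "yes" : "no"), (code == "" ? "unknown" : code)
        }
    '
}

# Possible usage (lmexec-93 is a placeholder host name):
#   for h in lmexec-92 lmexec-93; do
#       printf '%s: ' "$h"
#       qrsh -verbose -q "all.q@$h" date 2>&1 | summarize_qrsh_log
#   done
```

This only condenses what qrsh already prints; it does not explain the
kernel Oops on the target host.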

> - Is this new and worked before? As 9.0 isn't the latest of 9.x, I'd  
> assume that your cluster is already in operation for some time now.

It never worked before.  The install is new; SGE is configured and more or less
working, except for bits and pieces here and there, among them tight integration
for the mpich/ethernet interconnect.  I also have trouble with the Infiniband
interconnect integration: the patch for mpich/Infiniband and SGE tight
integration, available on the HowTo site, doesn't match the version of mpich
supplied and customized by the Infiniband vendor.  I am awaiting support from
the Infiniband vendor to get the latest mpich/mvapich version installed/customized.

Thanks and regards,

Jean-Paul

> -- Reuti
> 
>> Would someone have an idea on how to further debug the problem (I
>> have tried using tcpdump between the submit host and the target host,
>> as well as between the qmaster host and the target host, to dig into
>> the communication, but it's getting complicated...)?
>>
>> Thanks for any help
>>
>> Jean-Paul
>>
>> ---- qrsh command and output ----
>> lemaitre /gridware/sge/bin/lx24-amd64 # qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
>> local configuration lemaitre.cism.ucl.ac.be not defined - using global configuration
>> your job 1788 ("date") has been submitted
>> waiting for interactive job to be scheduled ...
>> Your interactive job 1788 has been successfully scheduled.
>> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host lmexec-92 ...
>> poll: protocol failure in circuit setup
>> /gridware/sge/utilbin/lx24-amd64/rsh exited with exit code 1
>> reading exit code from shepherd ... 129
>>
>> -- 
>> Jean-Paul Minet
>> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
>> Université Catholique de Louvain
>> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net

-- 
Jean-Paul Minet
Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
Université Catholique de Louvain
Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
