[GE users] qrsh fails

DGS dgs at gs.washington.edu
Mon Jan 16 08:54:06 GMT 2006


'Qrsh' functionality in SGE is rather a crap-shoot.  I've got two 
systems running SGE V60u3.  On one, a Caos Linux with kernel 2.6.13.4
and Glibc 2.3.3, 'qrsh' and 'qlogin' work fine right out of the box.
On the other, Red Hat 9 with kernel 2.4.20-31 and Glibc 2.3.2, I have
to make '$SGE_ROOT/bin/lx24-x86/qsh' SUID 'root' in order for 'qrsh'
to work properly.  On a RHEL 3 system with kernel 2.4.21-32, Glibc
2.3.2, SGE V60u4, again 'qrsh' and 'qlogin' work in the default 
installation.  On another Caos system, kernel 2.6.13, Glibc 2.3.3,
SGE V60u6, 'qlogin' works, but 'qrsh' doesn't, no matter whether
'$SGE_ROOT/bin/lx24-x86/qsh' is SUID 'root' or not.
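
When comparing hosts like these, it may help to record whether the binary actually carries the set-user-ID bit on each one. The following is only a minimal sketch: `check_suid` is a hypothetical helper name (not part of SGE), it relies on the POSIX `-u` file test, and it does not verify that the owner is 'root'. The path is the one mentioned above and assumes SGE_ROOT is set on the host.

```shell
#!/bin/sh
# check_suid: hypothetical helper, not part of SGE.
# Prints the set-user-ID state of a file and returns
#   0 if the bit is set, 1 if it is not, 2 if the file is missing.
check_suid() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "missing: $f"
        return 2
    fi
    if [ -u "$f" ]; then
        echo "suid: $f"
        return 0
    fi
    echo "not suid: $f"
    return 1
}

# Assumption: SGE_ROOT is set and the lx24-x86 binaries are installed.
# Skipped silently when SGE_ROOT is unset.
if [ -n "${SGE_ROOT:-}" ]; then
    check_suid "$SGE_ROOT/bin/lx24-x86/qsh" || true
fi
```

Running it on each execution host and comparing the output against whether 'qrsh' works there would at least show whether the SUID bit is the variable that matters.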

All in all, a crap-shoot. 

David S.

On Mon, Jan 16, 2006 at 09:16:50AM +0100, Jean-Paul Minet wrote:
> Reuti,
> 
> >>I am trying to get tight integration to work (MPICH 1.2.6 and SGE
> >>6.0u6) and am facing a problem with qrsh.  Trying to debug it separately
> >>from the integration bit, I get a "poll: protocol failure in
> >>circuit setup" on the host initiating the qrsh (see below).  On the
> >>target host, I get the following weird messages:
> >>
> >>Message from syslogd@lmexec-92 at Fri Jan 13 10:47:21 2006 ...
> >>lmexec-92 kernel: Oops: 0000 [2] SMP
> >>
> >>Message from syslogd@lmexec-92 at Fri Jan 13 10:47:21 2006 ...
> >>lmexec-92 kernel: CR2: 0000000000000108
> >>
> >>We use SUSE 9.0 (kernel 2.6.5-7.97-smp) on Sun V20z (dual Opteron).
> >>
> >
> >This looks like a bug in the kernel. Was the 2.6.5-7.97-smp kernel the
> >latest for 9.0?
> 
> We actually use SLES 9 (the enterprise edition).  The cluster was
> purchased and installed last quarter.  I checked on the Novell site and
> didn't see any subsequent release.
> 
> >- Is this on all hosts or only on one specific one?
> 
> Just tried with a few hosts, and the behavior is the same...
> 
> >- Is this new, and did it work before? As 9.0 isn't the latest of 9.x, I'd
> >assume that your cluster has already been in operation for some time now.
> 
> It never worked before.  The install is new; SGE is configured and more or
> less working, except for bits and pieces here and there, among them tight
> integration for the MPICH/Ethernet interconnect.  I also have trouble with
> the InfiniBand interconnect integration: the patch for MPICH/InfiniBand and
> SGE tight integration, available on the HowTo site, doesn't match the
> version of MPICH supplied and customized by the InfiniBand vendor.  I am
> awaiting support from the InfiniBand vendor to get the latest MPICH/MVAPICH
> version installed and customized.
> 
> Thanks & regards
> 
> Jean-Paul
> 
> >-- Reuti
> >
> >>Would someone have an idea on how to debug the problem further (I
> >>have tried using tcpdump between the submit host and the target host,
> >>as well as between the qmaster host and the target host, to dig into
> >>the communication, but it's getting complicated...)?
> >>
> >>Thanks for any help
> >>
> >>Jean-paul
> >>
> >>---- qrsh command and output ----
> >>lemaitre /gridware/sge/bin/lx24-amd64 # qrsh -verbose -l mem_free=10M
> >>-l num_proc=2 -q all.q@lmexec-92 date
> >>local configuration lemaitre.cism.ucl.ac.be not defined - using  
> >>global configuration
> >>your job 1788 ("date") has been submitted
> >>waiting for interactive job to be scheduled ...
> >>Your interactive job 1788 has been successfully scheduled.
> >>Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host  
> >>lmexec-92 ...
> >>poll: protocol failure in circuit setup
> >>/gridware/sge/utilbin/lx24-amd64/rsh exited with exit code 1
> >>reading exit code from shepherd ... 129
> >>
> >>-- 
> >>Jean-Paul Minet
> >>Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
> >>Université Catholique de Louvain
> >>Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> -- 
> Jean-Paul Minet
> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
> Université Catholique de Louvain
> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
> 