[GE users] GE 5.3p6 on Centos 3.6/ia64

James Chamberlain jamesc at exa.com
Tue Jan 10 23:38:33 GMT 2006


Hi folks,

I'm having a bit of trouble with SGE on a cluster of Itaniums running CentOS 
3.6 (essentially, RHEL 3).  I can start the qmaster on the head node, but the 
execd processes hang on all the compute nodes, just after the following 
output from rcsge:

[root@copper30 root]# /etc/init.d/rcsge start
    starting sge_execd
starting program: /opt/sge/bin/ia64linux/sge_commd
using service "sge_commd"
bound to port 536

Running "qstat -f" at this point sometimes tells me that copper30 is down, 
and sometimes tells me "failed sending gdi request".  The head node's queue 
shows up as being up and running, with everything (as near as I can tell) 
correct.  If I hit '^C' to break out of the rcsge script, I can see that 
sge_commd is running, but sge_execd is not.  If I then ask rcsge to stop, I 
get output as follows:

[root@copper30 root]# /etc/init.d/rcsge stop
ls: /opt/sge/default/spool/copper30/active_jobs: No such file or directory
    Shutting down Grid Engine communication daemon
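
For anyone wanting to reproduce the "sge_commd is running, sge_execd is not" 
check on a compute node, a minimal sketch follows.  The daemon names come 
from the output above; the function name report_sge_daemons is made up for 
illustration, and the sketch assumes a ps that supports "-e -o comm=".

```shell
#!/bin/sh
# Report which Grid Engine daemons appear in a process listing on stdin.
# The listing is expected to hold one command name per line, for example
# the output of: ps -e -o comm=
# report_sge_daemons is a hypothetical helper, not part of SGE.
report_sge_daemons() {
    listing=$(cat)
    for daemon in sge_commd sge_execd; do
        # -x matches whole lines only, -F treats the name as a fixed string
        if printf '%s\n' "$listing" | grep -qxF "$daemon"; then
            echo "$daemon: running"
        else
            echo "$daemon: not running"
        fi
    done
}
```

On a compute node this would be run as "ps -e -o comm= | report_sge_daemons" 
after breaking out of the rcsge script.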

There is a firewall running on the head node, but it is doing masquerading 
and no filtering.  I can see 536/tcp open if I nmap the head node from the 
compute node.
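
Since sge_commd reports that it is using the service "sge_commd", one more 
thing that could be compared between the head node and a compute node is the 
port registered under that service name.  The sketch below is only an 
assumption about how to check that: it reads a services-format file 
(/etc/services by default), and the function name service_port is 
hypothetical.

```shell
#!/bin/sh
# Print the port/protocol field registered for a service name.
# Usage: service_port NAME [FILE]    (FILE defaults to /etc/services)
# service_port is a hypothetical helper, not part of SGE.
service_port() {
    # Field 1 is the service name, field 2 is "port/protocol".
    # Comment lines start with "#", so they never match the name.
    awk -v name="$1" '$1 == name { print $2 }' "${2:-/etc/services}"
}
```

Running "service_port sge_commd" on both machines and comparing the output 
would show whether they agree on the port.  This only covers entries in the 
local file, not ones served by NIS or another name service.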

Anyone have any thoughts?

Thanks,

James
