[GE users] GE 5.3p6 on Centos 3.6/ia64

Chris Dagdigian dag at sonsorol.org
Tue Jan 10 23:46:30 GMT 2006


My $.02

Root causes for things like this can usually be traced to:

- forward and reverse DNS resolution failures

- routing/naming issues: cluster nodes trying to speak to the wrong  
NIC on the SGE master node because the master wrote its *public*  
hostname into $SGE_ROOT/default/common/act_qmaster. The fix for this  
is to use the SGE 'host_aliases' file to point the compute nodes at  
the proper IP/hostname for the master node.

- firewalls on the nodes or the qmaster

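- NFS exports with root squashing enabled. The sge_execd daemons  
need to be started by root, and a root-squashed export maps root to  
an unprivileged user, so a root-started daemon may not be able to  
read or write what it needs under an NFS-mounted $SGE_ROOT.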
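
For the first two items, a quick check from a compute node can narrow things down. Here is a minimal Python sketch, not anything that ships with SGE: the function names read_act_qmaster and check_dns are made up for illustration, and treating a short-name match as "good enough" is my assumption, not an SGE rule.

```python
import os
import socket


def read_act_qmaster(sge_root, cell="default"):
    """Return the master hostname stored in <sge_root>/<cell>/common/act_qmaster."""
    path = os.path.join(sge_root, cell, "common", "act_qmaster")
    with open(path) as fh:
        return fh.readline().strip()


def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Forward-resolve hostname, reverse-resolve the result, and compare.

    Returns (ok, detail). The resolver functions are parameters so the
    logic can be exercised without touching the real DNS.
    """
    try:
        ip = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup of {hostname} failed: {exc}"
    try:
        name, aliases, _ = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"reverse lookup of {ip} failed: {exc}"

    names = {name.lower(), *(a.lower() for a in aliases)}
    short_names = {n.split(".")[0] for n in names}
    wanted = hostname.lower()
    # Assumption: a match on the short (unqualified) name counts as consistent.
    if wanted in names or wanted.split(".")[0] in short_names:
        return True, f"{hostname} -> {ip} -> {name}"
    return False, f"{hostname} -> {ip} -> {name} (name mismatch)"
```

On a compute node you would call check_dns() once with the node's own hostname and once with whatever read_act_qmaster() returns for your $SGE_ROOT. If the master name resolves to the public interface rather than the cluster-internal one, that points at the host_aliases fix above.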

-chris





On Jan 10, 2006, at 6:38 PM, James Chamberlain wrote:

> Hi folks,
>
> I'm having a bit of trouble with SGE on a cluster of Itaniums  
> running CentOS 3.6 (essentially, RHEL 3).  I can start the qmaster  
> on the head node, but the execd processes hang on all the compute  
> nodes, just after the following output from rcsge:
>
> [root@copper30 root]# /etc/init.d/rcsge start
>    starting sge_execd
> starting program: /opt/sge/bin/ia64linux/sge_commd
> using service "sge_commd"
> bound to port 536
>
> Running "qstat -f" at this point sometimes tells me that copper30  
> is down, and sometimes tells me "failed sending gdi request".  The  
> head node's queue shows up as being up and running, with everything  
> (near as I can tell) correct.  If I hit '^C' to break out of the  
> rcsge script, I can see that sge_commd is running - but not  
> sge_execd.  If I then ask rcsge to stop, I get output as follows:
>
> [root@copper30 root]# /etc/init.d/rcsge stop
> ls: /opt/sge/default/spool/copper30/active_jobs: No such file or  
> directory
>    Shutting down Grid Engine communication daemon
>
> There is a firewall running on the head node, but it is doing  
> masquerading and no filtering.  I can see 536/tcp open if I nmap  
> the head node from the compute node.
>
> Anyone have any thoughts?
>
> Thanks,
>
> James
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net