[GE users] GE 5.3p6 on Centos 3.6/ia64

James Chamberlain jamesc at exa.com
Wed Jan 11 00:03:21 GMT 2006


Thanks Chris, that's done it.  I had an oops in my /etc/hosts file.  That's 
working a lot better now.
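
For anyone landing here with the same symptom: the message above does not say what the /etc/hosts mistake was. One common kind of mistake is a node's own name ending up on a loopback line, so a quick first check is to list every address the file assigns to a given name. The following is only a minimal sketch, assuming a standard hosts-format file; hosts_entries_for is a made-up helper name for illustration, not part of Grid Engine.

```shell
#!/bin/sh
# Sketch only: print each address that a hosts-format file assigns to a
# name, and mark loopback addresses. hosts_entries_for is a hypothetical
# helper, not an SGE tool.
hosts_entries_for() {
    # $1 = hostname to look for, $2 = hosts file (defaults to /etc/hosts)
    awk -v name="$1" '
        { sub(/#.*/, "") }      # strip trailing comments
        NF < 2 { next }         # skip blank or address-only lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i == name) {
                    flag = ($1 ~ /^127\./ || $1 == "::1") ? " LOOPBACK" : ""
                    print $1 flag
                    break
                }
            }
        }
    ' "${2:-/etc/hosts}"
}
```

Running "hosts_entries_for copper30" on a compute node and again on the head node should show the same non-loopback address in both places; a line marked LOOPBACK, or a disagreement between nodes, is worth a closer look.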

James

On Tue, 10 Jan 2006, Chris Dagdigian wrote:

>
> My $.02
>
> Root causes for things like this can usually be traced to:
>
> - forward and reverse DNS resolution failures
>
> - routing/naming issues: cluster nodes trying to speak to the wrong NIC on 
> the sge master node because the master wrote its *public* hostname into 
> $SGE_ROOT/default/common/act_qmaster. The fix for this is using the SGE 
> 'host_aliases' file to point the compute nodes at the proper IP/hostname for 
> the master node.
>
> - firewalls on the nodes or the qmaster
>
> - NFS exports with root-squashing enabled. The sge_execd daemons need to be 
> started by root.
>
> -chris
>
> On Jan 10, 2006, at 6:38 PM, James Chamberlain wrote:
>
>> Hi folks,
>> 
>> I'm having a bit of trouble with SGE on a cluster of Itaniums running 
>> CentOS 3.6 (essentially, RHEL 3).  I can start the qmaster on the head 
>> node, but the execd processes hang on all the compute nodes, just after the 
>> following output from rcsge:
>> 
>> [root@copper30 root]# /etc/init.d/rcsge start
>>   starting sge_execd
>> starting program: /opt/sge/bin/ia64linux/sge_commd
>> using service "sge_commd"
>> bound to port 536
>> 
>> Running "qstat -f" at this point sometimes tells me that copper30 is down, 
>> and sometimes tells me "failed sending gdi request".  The head node's queue 
>> shows up as being up and running, with everything (near as I can tell) 
>> correct.  If I hit '^C' to break out of the rcsge script, I can see that 
>> sge_commd is running - but not sge_execd.  If I then ask rcsge to stop, I 
>> get output as follows:
>> 
>> [root@copper30 root]# /etc/init.d/rcsge stop
>> ls: /opt/sge/default/spool/copper30/active_jobs: No such file or directory
>>   Shutting down Grid Engine communication daemon
>> 
>> There is a firewall running on the head node, but it is doing masquerading 
>> and no filtering.  I can see 536/tcp open if I nmap the head node from the 
>> compute node.
>> 
>> Anyone have any thoughts?
>> 
>> Thanks,
>> 
>> James
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>




