[GE users] Cannot start SGE on master

Richard Bohn rxbeee at rit.edu
Thu Mar 15 21:22:11 GMT 2007


Hi Reuti,

I solved the original problem and now have the master and scheduler running.
However, submitted jobs never get executed because the qmaster cannot
reach the execd on the compute nodes. I see messages like the following
in the log:

qmaster|cluster|E|got max. unheard timeout for target "execd" on host
"compute-17.local", can't delivering job "3"

I can do a qstat on the remote node and see the job waiting in the
queue.

We upgraded the local LAN switch during the move and set the local
machines and head node to use jumbo frames, which the switch supports. I
don't know whether SGE is sensitive to this. I haven't seen any other
problems on the local LAN or when talking to the remote nodes.
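
One way to test whether jumbo frames reach a node end to end is to ping it with fragmentation disallowed and a payload sized to the MTU. This is a minimal sketch, assuming Linux iputils ping (the -M do option) and a 9000-byte MTU; the MTU value and the function names are illustrative and not taken from this cluster's configuration.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Ping a host with the don't-fragment bit set (Linux iputils syntax).
# The MTU defaults to 9000, which is an assumption; pass the value
# actually configured on the interfaces as the second argument.
check_jumbo_path() {
    host=$1
    mtu=${2:-9000}
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$host"
}

# Example, using the node named in the log message:
# check_jumbo_path compute-17.local 9000
```

If the large ping fails while a default-size ping succeeds, some device on the path is probably not passing jumbo frames, which could affect larger SGE messages while leaving small ones intact.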

I rebooted the head node. Looking at the qmaster log, there are a
number of entries saying that the execd on the various compute nodes
registered. Running qstat -f shows nothing in the status column. When I
run
qsub -b y /bin/hostname
the job stays pending, a few nodes show "au" in the status column, and
the above error appears in the qmaster log.
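
The execd on an affected node can also be probed directly from the head node. The following is only a sketch, assuming the qping utility is available in this SGE release and on the PATH; the wrapper name ping_execd is hypothetical, and the execd port has to be supplied because it is not stated anywhere in this thread.

```shell
#!/bin/sh
# Ask the execd on one compute node for its status via qping.
# Both arguments are required; the port is whatever sge_execd listens on
# at this site (an assumption here, not something stated in this thread).
ping_execd() {
    host=$1
    port=$2
    # "execd 1" is the component name and id of the execution daemon.
    qping -info "$host" "$port" execd 1
}

# Example, using the node named in the log message:
# ping_execd compute-17.local "$SGE_EXECD_PORT"
```

If this times out while ordinary pings to the node succeed, the problem is more likely in the communication between the SGE daemons than in basic connectivity.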

I have also restarted sge_execd on all the compute nodes.

Thanks for the help.

Rick


-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Thursday, March 15, 2007 1:44 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Cannot start SGE on master

Hi,

On 15.03.2007 at 17:44, Richard Bohn wrote:

> Hello All,
>
>
>
> I'm running SGE (version 6 u6) under the ROCKS clustering software.
> It had been working fine until we moved the cluster, which meant
> changing the IP address of the head node's public interface. Now when
> I try to start SGE I get the following error:
>
>
>
> ./sgemaster start
>
>    starting sge_qmaster
>
>    starting sge_schedd
>
> error: unable to read qmaster name: qmaster hostname in
> "/opt/gridengine/default/common/act_qmaster" has zero length
>
> critical error: unable to read qmaster name:
> /opt/gridengine/default/common/act_qmaster
>
>
>
> Indeed, the act_qmaster file is zero length, but if I set it to the
> FQDN of the machine and then restart SGE, the file gets reset back
> to zero length. The configuration file in the same directory is
> also zero length.
Did you also adjust the /etc/hosts file and/or the DNS entry to reflect
the new (or old) name under the changed TCP/IP address? You can check
this with the tools in $SGE_ROOT/utilbin/<your_arch>/
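
A sketch of such a check, assuming the gethostname, gethostbyname and gethostbyaddr helpers are present in that directory. The function names are hypothetical, /opt/gridengine is inferred from the act_qmaster path in the error message, and the host name and address in the example are placeholders.

```shell
#!/bin/sh
# Build the path to the SGE utility binaries for a given root and architecture.
sge_utilbin_dir() {
    echo "$1/utilbin/$2"
}

# Print what the host calls itself, then what the resolver returns for
# a name and for an address, using the tools in the utilbin directory.
check_resolution() {
    dir=$1    # e.g. the output of sge_utilbin_dir
    name=$2   # host name to look up
    addr=$3   # IP address expected for that name
    "$dir/gethostname"
    "$dir/gethostbyname" "$name"
    "$dir/gethostbyaddr" "$addr"
}

# Example; the root is inferred from the error message, and the name and
# address below are placeholders, not values from this cluster:
# dir=$(sge_utilbin_dir /opt/gridengine "$(/opt/gridengine/util/arch)")
# check_resolution "$dir" headnode.example.org 192.0.2.10
```

The name reported by gethostname should agree with what the other two tools return for the new address; a mismatch would point at /etc/hosts or DNS.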

-- Reuti

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
