[GE users] Cannot start SGE on master

Richard Bohn rxbeee at rit.edu
Fri Mar 16 13:24:18 GMT 2007


Hello,

It appears that I fixed the "got max. unheard timeout for target"
problem by using a configuration file from another SGE (6.0u8)
installation and replacing the one in $SGE_ROOT/default/common. 

My question now is: how do you normally regenerate the configuration
file? The file says "DO NOT MODIFY THIS FILE MANUALLY!". For some reason
my configuration file got corrupted; after I stop/start SGE, it contains
only the following lines:

--------------------------------
# Version: 6.0u6
#
# DO NOT MODIFY THIS FILE MANUALLY!
#
conf_version              2
----------------------------------
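Before copying in a file from another installation, it may be worth confirming that the file really is truncated. The following is a minimal sketch, not an official SGE tool. It assumes the cell is `default` and that a healthy file contains setting lines beyond the comment header and `conf_version`. The function name `config_has_settings` is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether an SGE "configuration" file holds real settings
# or only the header seen in the corrupted file (comments + conf_version).
# Assumption: any non-blank, non-comment line other than conf_version
# counts as a setting.

# config_has_settings FILE
#   Returns 0 if FILE has at least one setting line besides conf_version,
#   1 if it holds only comments, blank lines and conf_version,
#   2 if FILE cannot be read.
config_has_settings() {
    file=$1
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    # Count lines that are not blank, not comments and not conf_version.
    # "|| true" keeps a zero count (grep exit status 1) from being a failure.
    count=$(grep -c -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' \
        -e '^conf_version[[:space:]]' "$file" || true)
    [ "$count" -gt 0 ]
}

# Example (cell "default" assumed, as in the path above):
#   config_has_settings "$SGE_ROOT/default/common/configuration" \
#       || echo "configuration looks truncated"
```

It only diagnoses the problem; it does not regenerate the file.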

Thanks
Rick


-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Thursday, March 15, 2007 5:59 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Cannot start SGE on master

Hi,

On 15.03.2007 at 22:22, Richard Bohn wrote:

> Hi Reuti,
>
> I solved the original problem and have the master and scheduler
> running. But now when I submit a job, it never gets executed because
> the scheduler cannot connect to the compute nodes. I see messages like
> the following in the log:
>
> qmaster|cluster|E|got max. unheard timeout for target "execd" on host
> "compute-17.local", can't delivering job "3"

Are all the nodes also aware of the new address of the master? Are you
using a host_aliases file that still mentions the old name?
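One way to act on this question is to search the host_aliases file for the old head node name. This sketch assumes the file is `$SGE_ROOT/default/common/host_aliases` (cell `default`) and that each line is a whitespace-separated list of names for one host, with `#` starting a comment line. The function `find_alias_entries` and the example name `oldhead.example.org` are hypothetical.

```shell
#!/bin/sh
# find_alias_entries FILE NAME
#   Prints "lineno: line" for every non-comment line of FILE in which NAME
#   appears as a complete whitespace-separated field.
#   Returns 0 if at least one line matches, 1 otherwise (or if unreadable).
find_alias_entries() {
    file=$1
    name=$2
    [ -r "$file" ] || return 1
    awk -v n="$name" '
        /^[ \t]*#/ { next }                  # skip comment lines
        {
            for (i = 1; i <= NF; i++)
                if ($i == n) { print FNR ": " $0; found = 1; break }
        }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Example (hypothetical old name):
#   find_alias_entries "$SGE_ROOT/default/common/host_aliases" oldhead.example.org
```

Exact field matching avoids false hits on names that merely contain the old name as a substring.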

-- Reuti


> I can do a qstat on the remote node and see the job waiting in the
> queue.
>
> We did upgrade the local LAN switch during the move and set the local
> machines and head node to use jumbo frames, which the switch supports.
> I don't know whether SGE would be sensitive to this. I haven't seen any
> other problems with the local LAN or with talking to the remote nodes.
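Whether or not SGE itself is sensitive to jumbo frames, it is easy to confirm that the head node and every compute node use the same MTU. A minimal sketch, assuming Linux hosts (ROCKS runs on a Linux distribution) where each interface exposes its MTU under /sys/class/net. The function name `list_mtus` is hypothetical.

```shell
#!/bin/sh
# list_mtus [SYSFS_NET_DIR]
#   Prints "interface mtu" for each interface directory under SYSFS_NET_DIR
#   (default: /sys/class/net, the Linux sysfs location) that has an mtu file.
list_mtus() {
    dir=${1:-/sys/class/net}
    for iface in "$dir"/*; do
        # Skip entries without a readable mtu file (or an unmatched glob).
        [ -r "$iface/mtu" ] || continue
        printf '%s %s\n' "$(basename "$iface")" "$(cat "$iface/mtu")"
    done
}

# Example: run on the head node and on a compute node, then compare.
#   list_mtus
```

Running it on the head node and on a node such as compute-17.local shows at a glance whether the private interfaces agree.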
>
> I rebooted the head node, and the qmaster log has a number of entries
> saying that the execd on each of the various compute nodes registered.
> Running qstat -f shows nothing in the status column. When I submit
> qsub -b y /bin/hostname, the job stays pending, a few nodes show au in
> the status field, and I see the above error in the qmaster log.
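In qstat -f output, the `u` state means the queue's sge_execd cannot be contacted, and `a` is a load alarm, so `au` fits the "unheard timeout" error above. A sketch that picks out those queue instances, assuming the SGE 6.0 layout where field 1 is the queue instance (queue@host) and field 6, when present, holds the state letters. `unknown_queues` is a hypothetical name.

```shell
#!/bin/sh
# unknown_queues
#   Reads `qstat -f` output on stdin and prints each queue instance whose
#   state field contains "u" (execd cannot be contacted).
# Assumption: queue lines have "queue@host" in field 1 and the state
# letters, when any are set, in field 6.
unknown_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /u/ { print $1 }'
}

# Example:
#   qstat -f | unknown_queues
```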
>
> I have also restarted the SGE execd on all the compute nodes.
>
> Thanks for the help.
>
> Rick
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Thursday, March 15, 2007 1:44 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Cannot start SGE on master
>
> Hi,
>
> On 15.03.2007 at 17:44, Richard Bohn wrote:
>
>> Hello All,
>>
>> I'm running SGE (version 6.0u6) under the ROCKS clustering software.
>> It had been working fine until we moved the cluster, which meant
>> changing the IP address of the head node's public interface. Now when
>> I try to start SGE, I get the following error:
>>
>> ./sgemaster start
>>
>>    starting sge_qmaster
>>    starting sge_schedd
>>
>> error: unable to read qmaster name: qmaster hostname in "/opt/gridengine/default/common/act_qmaster" has zero length
>>
>> critical error: unable to read qmaster name: /opt/gridengine/default/common/act_qmaster
>>
>> Indeed, act_qmaster is zero length, but if I set it to the FQDN
>> hostname of the machine and then restart SGE, the file gets reset to
>> zero length. The configuration file in the same directory is also zero
>> length.
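Since both files keep reverting to zero length, a quick check of which ones are empty can be run after each restart attempt. This sketch relies only on `test -s` (true when a file exists and is non-empty). The function name `report_empty_common_files` is hypothetical, and the directory is the one from the error message above.

```shell
#!/bin/sh
# report_empty_common_files COMMON_DIR
#   Prints the name of each of act_qmaster and configuration in COMMON_DIR
#   that is missing or zero length. Returns 1 if any are, 0 otherwise.
report_empty_common_files() {
    dir=$1
    rc=0
    for f in act_qmaster configuration; do
        # test -s fails for missing files as well as empty ones.
        if [ ! -s "$dir/$f" ]; then
            echo "$f"
            rc=1
        fi
    done
    return $rc
}

# Example:
#   report_empty_common_files /opt/gridengine/default/common
```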
> Did you also adjust the /etc/hosts file and/or the DNS entry to map the
> new (or old) name to the changed TCP/IP address? You can check this
> with the tools in $SGE_ROOT/utilbin/<your_arch>/.
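This suggestion can be scripted roughly as follows. It is a sketch under assumptions: that `$SGE_ROOT/util/arch` prints the architecture directory name, and that `utilbin/<arch>/` contains the `gethostname` and `gethostbyname` helpers, which print how SGE resolves the local host and a named host. The function `check_sge_resolution` is hypothetical.

```shell
#!/bin/sh
# check_sge_resolution SGE_ROOT HOSTNAME
#   Runs the SGE resolver helpers for the local host and for HOSTNAME so the
#   names/addresses SGE sees can be compared with /etc/hosts and DNS.
#   Returns 2 if the arch script or a helper is missing.
# Assumption: SGE_ROOT/util/arch prints the arch directory name, and
# SGE_ROOT/utilbin/<arch>/ holds gethostname and gethostbyname.
check_sge_resolution() {
    root=$1
    host=$2
    arch=$("$root/util/arch") || return 2
    bin="$root/utilbin/$arch"
    for tool in gethostname gethostbyname; do
        if [ ! -x "$bin/$tool" ]; then
            echo "missing $bin/$tool" >&2
            return 2
        fi
    done
    echo "== local host as seen by SGE =="
    "$bin/gethostname"
    echo "== $host as seen by SGE =="
    "$bin/gethostbyname" "$host"
}

# Example:
#   check_sge_resolution "$SGE_ROOT" compute-17.local
```

The reported names and addresses can then be compared with /etc/hosts and DNS on the head node and on the compute nodes.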
>
> -- Reuti
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
