[GE users] Cannot start SGE on master

Reuti reuti at staff.uni-marburg.de
Thu Mar 15 21:58:54 GMT 2007


Hi,

On 15.03.2007 at 22:22, Richard Bohn wrote:

> Hi Reuti,
>
> I solved the original problem and have the master and scheduler
> running. But now when I submit a job it never gets executed because
> the scheduler cannot connect to the compute nodes. I see messages like
> the following in the log:
>
> qmaster|cluster|E|got max. unheard timeout for target "execd" on host
> "compute-17.local", can't delivering job "3"

Are all the nodes also aware of the new address of the master? Are
you using a host_aliases file in which the old name is still mentioned?
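
Both points can be checked on a node with a few lines of Python. This is only a minimal sketch, not an SGE tool: the helper names and the old hostname "old-head.example.com" are made up for illustration, the default path is taken from the error message quoted below, and skipping blank and "#" lines in host_aliases is an assumption.

```python
import os
import socket


def read_master_name(text):
    """Return the qmaster hostname from the contents of act_qmaster ('' if empty)."""
    lines = text.strip().splitlines()
    return lines[0].strip() if lines else ""


def stale_alias_lines(text, old_name):
    """Return (line_number, line) pairs of host_aliases lines that mention old_name.

    Assumes one whitespace-separated list of host names per line.
    Skipping blank lines and '#' lines is an assumption of this sketch.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        names = [name.lower() for name in stripped.split()]
        if old_name.lower() in names:
            hits.append((number, stripped))
    return hits


def resolve(name):
    """Forward-resolve a host name; return its IPv4 address or None on failure."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None


if __name__ == "__main__":
    # Fallback values mirror the paths in the quoted error message.
    common = os.path.join(os.environ.get("SGE_ROOT", "/opt/gridengine"),
                          os.environ.get("SGE_CELL", "default"), "common")
    with open(os.path.join(common, "act_qmaster")) as fh:
        master = read_master_name(fh.read())
    print("act_qmaster:", master or "<empty>")
    print("resolves to:", resolve(master) if master else None)

    aliases = os.path.join(common, "host_aliases")
    if os.path.exists(aliases):
        with open(aliases) as fh:
            # "old-head.example.com" is a placeholder for the old master name.
            for number, line in stale_alias_lines(fh.read(), "old-head.example.com"):
                print("host_aliases line %d still mentions the old name: %s"
                      % (number, line))
```

Run it on the head node and on one of the compute nodes; the name in act_qmaster should resolve to the new address on both.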

-- Reuti


> I can do a qstat on the remote node and see the job waiting in the
> queue.
>
> We did upgrade the local LAN switch during the move and set the local
> machines and head node to use jumbo frames which the switch supports.
> I don't know if SGE would be sensitive to this. I haven't seen any
> other problems with the local LAN and talking to the remote nodes.
>
> I rebooted the head node, and the qmaster log now has a number of
> entries saying that the execd on the various compute nodes was
> registered. A qstat -f shows nothing in the status column. When I run
> qsub -b y /bin/hostname, the job stays pending, a few nodes show "au"
> in the status field, and I see the above error in the qmaster log.
>
> I have also restarted the SGE execd on all the compute nodes.
>
> Thanks for the help.
>
> Rick
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Thursday, March 15, 2007 1:44 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Cannot start SGE on master
>
> Hi,
>
> On 15.03.2007 at 17:44, Richard Bohn wrote:
>
>> Hello All,
>>
>> I'm running SGE (version 6 u6) under the ROCKS clustering software.
>> It had been working fine until we moved the cluster, which meant
>> changing the IP address of the head node's public interface. Now when
>> I try to start SGE I get the following error:
>>
>> ./sgemaster start
>>    starting sge_qmaster
>>    starting sge_schedd
>> error: unable to read qmaster name: qmaster hostname in
>> "/opt/gridengine/default/common/act_qmaster" has zero length
>> critical error: unable to read qmaster name:
>> /opt/gridengine/default/common/act_qmaster
>>
>> Indeed act_qmaster is zero length, but if I set it to the FQDN
>> hostname of the machine and then restart SGE, the file gets reset
>> back to zero length. The configuration file in the same directory is
>> also zero length.
>
> Did you also adjust the /etc/hosts file and/or DNS entry to reflect the
> new (or old) name under the changed TCP/IP address? You can check
> this with the tools in $SGE_ROOT/utilbin/<your_arch>/
>
> -- Reuti
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>




