[GE users] Cannot start SGE on master

Daniel Templeton Dan.Templeton at Sun.COM
Thu Mar 15 23:20:54 GMT 2007


If you're ever able to do a qstat and see the queues not in "au" state, 
then the execution daemon(s) contacted the qmaster within the last 
timeout period.

Daniel
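
A rough way to act on the point above is to filter the `qstat -f` output for queue instances whose state still contains "u". This is only a sketch: `unreachable_queues` is a made-up helper name, and it assumes the usual `qstat -f` layout with the `queue@host` name in the first column and the state in the sixth.

```shell
# Print queue instances whose state column contains "u" (unknown),
# reading `qstat -f` output on stdin.
# Assumptions: column 1 is queue@host, column 6 is the state field.
# unreachable_queues is a hypothetical helper, not part of SGE.
unreachable_queues() {
    awk '$1 ~ /@/ && $6 ~ /u/ { print $1 }'
}

# Usage on the master (needs a working SGE environment):
#   qstat -f | unreachable_queues
```

If this prints nothing, no queue instance is currently in an unknown state, which by the reasoning above would mean the execds have reported in within the timeout period.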

Richard Bohn wrote:
> Hi,
>  
> There is no host_aliases file on any of the nodes. I can ping the master
> by its FQDN from a compute node, and a DNS lookup on the compute node
> resolves correctly. SSH from the compute node to the master by its DNS
> name also works.
>
> If the SGE master logs that the compute nodes are registered, that has to
> mean the nodes are contacting the master on port 536. Is that correct?
>  
> Rick 
>
> ________________________________
>
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Thu 3/15/2007 5:58 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Cannot start SGE on master
>
>
>
> Hi,
>
> On 15.03.2007 at 22:22, Richard Bohn wrote:
>
>   
>> Hi Reuti,
>>
>> I solved the original problem and have the master and scheduler running.
>> But now when I submit a job it never gets executed because the scheduler
>> cannot connect to the compute nodes. I see messages like the following
>> in the log:
>>
>> qmaster|cluster|E|got max. unheard timeout for target "execd" on host
>> "compute-17.local", can't delivering job "3"
>>     
>
> Are all the nodes also aware of the new address of the master? Are you
> using a host_aliases file that still mentions the old name?
>
> -- Reuti
>
>
>   
>> I can do a qstat on the remote node and see the job waiting in the queue.
>>
>> We did upgrade the local LAN switch during the move and set the local
>> machines and head node to use jumbo frames, which the switch supports. I
>> don't know if SGE would be sensitive to this. I haven't seen any other
>> problems with the local LAN or with talking to the remote nodes.
>>
>> I rebooted the head node, and the qmaster log now has a number of entries
>> saying the execd on the various compute nodes was registered. A qstat -f
>> shows nothing in the status column. When I run
>> qsub -b y /bin/hostname
>> the job stays pending, a few nodes show "au" in the status field, and I
>> see the above error in the qmaster log.
>>
>> I also have restarted sge execd on all the compute nodes.
>>
>> Thanks for the help.
>>
>> Rick
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Thursday, March 15, 2007 1:44 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Cannot start SGE on master
>>
>> Hi,
>>
>> On 15.03.2007 at 17:44, Richard Bohn wrote:
>>
>>     
>>> Hello All,
>>>
>>>
>>>
>>> I'm running SGE (version 6 u6) under the ROCKS clustering software.
>>> It had been working fine until we moved the cluster, which meant
>>> changing the IP address of the head node's public interface. Now when
>>> I try to start SGE I get the following error:
>>>
>>>
>>>
>>> ./sgemaster start
>>>
>>>    starting sge_qmaster
>>>
>>>    starting sge_schedd
>>>
>>> error: unable to read qmaster name: qmaster hostname in
>>> "/opt/gridengine/default/common/act_qmaster" has zero length
>>>
>>> critical error: unable to read qmaster name:
>>> /opt/gridengine/default/common/act_qmaster
>>>
>>>
>>>
>>> Indeed act_qmaster is zero length, but if I set it to the FQDN of the
>>> machine and then restart SGE, the file gets reset back to zero length.
>>> The configuration file in the same directory is also zero length.
>>>       
>> Did you also adjust the /etc/hosts file and/or the DNS entry to reflect
>> the new (or old) name under the changed TCP/IP address? You can check
>> this with the tools in $SGE_ROOT/utilbin/<your_arch>/
>>
>> -- Reuti
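
The check Reuti suggests could look roughly like the following. This is a sketch under assumptions: `check_sge_names` is a made-up helper name, `SGE_ROOT` is expected to be set, and the architecture string is taken from the `util/arch` script shipped with SGE. The `gethostname` and `gethostbyname` programs are the ones found in `$SGE_ROOT/utilbin/<your_arch>/`.

```shell
# Sketch: ask SGE's own resolver helpers how this host and the master
# resolve. check_sge_names is a hypothetical helper, not part of SGE.
check_sge_names() {
    master=$1
    arch_script="${SGE_ROOT:-}/util/arch"
    if [ ! -x "$arch_script" ]; then
        echo "cannot find $arch_script; is SGE_ROOT set correctly?" >&2
        return 1
    fi
    bin="$SGE_ROOT/utilbin/$("$arch_script")"
    "$bin/gethostname"              # name this host reports for itself
    "$bin/gethostbyname" "$master"  # forward lookup of the master's name
}

# Usage, on the master and on a compute node (hostname is a placeholder):
#   check_sge_names master.example.com
```

Running it on both the master and a compute node and comparing the names and addresses against the contents of act_qmaster should show whether the old name or address is still being resolved somewhere.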
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net



