[GE users] Cannot start SGE on master

Daniel Templeton Dan.Templeton at Sun.COM
Thu Mar 15 21:26:53 GMT 2007


What does qstat -j <jobid> tell you about why the job isn't being scheduled?

Daniel

Richard Bohn wrote:
> Hi Reuti,
>
> I solved the original problem and have the master and scheduler running.
> But now when I submit a job it never gets executed because the scheduler
> cannot connect to the compute nodes. I see messages like the following
> in the log:
>
> qmaster|cluster|E|got max. unheard timeout for target "execd" on host
> "compute-17.local", can't delivering job "3"
>
> I can do a qstat on the remote node and see the job waiting in the
> queue.
>
> We did upgrade the local LAN switch during the move and set the local
> machines and head node to use jumbo frames which the switch supports. I
> don't know if SGE would be sensitive to this. I haven't seen any other
> problems with the local LAN and talking to the remote nodes.
>
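
One way to rule the jumbo-frame change in or out is to send non-fragmentable pings sized to the full MTU from the head node to a compute node. The sketch below is only an illustration under assumptions not stated in the thread: an MTU of 9000 bytes, IPv4, and the Linux iputils `ping` (its `-M do` option forbids fragmentation). The function names are hypothetical.

```python
import subprocess

IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_command(host, mtu=9000, count=3):
    """Build a ping command whose packets exactly fill the given MTU."""
    # The ICMP payload is whatever is left after both headers.
    payload = mtu - IP_HEADER - ICMP_HEADER
    # -M do: set the Don't Fragment bit (Linux iputils ping, assumed here).
    return ["ping", "-M", "do", "-c", str(count), "-s", str(payload), host]


def path_carries_mtu(host, mtu=9000):
    """Return True if full-size, non-fragmentable pings reach the host."""
    result = subprocess.run(ping_command(host, mtu),
                            capture_output=True, text=True)
    return result.returncode == 0
```

If `path_carries_mtu("compute-17.local")` fails while the same call with `mtu=1500` succeeds, some device on the path is probably not passing jumbo frames, which could affect any traffic with large packets, not only SGE.
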
> I rebooted the head node, and the qmaster log now has a number of
> entries saying that the execd on the various compute nodes registered.
> Running qstat -f shows nothing in the status column. When I run
> qsub -b y /bin/hostname, the job stays pending, a few nodes show "au"
> in the status column, and the above error appears in the qmaster
> log.
>
> I also have restarted sge execd on all the compute nodes.
>
> Thanks for the help.
>
> Rick
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de] 
> Sent: Thursday, March 15, 2007 1:44 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Cannot start SGE on master
>
> Hi,
>
> On 15.03.2007 at 17:44, Richard Bohn wrote:
>
>   
>> Hello All,
>>
>>
>>
>> I'm running SGE (version 6 u6) under the ROCKS clustering software.
>> It had been working fine until we moved the cluster, which meant
>> changing the IP address of the head node's public interface. Now,
>> when I try to start SGE, I get the following error:
>>
>>
>>
>> ./sgemaster start
>>
>>    starting sge_qmaster
>>
>>    starting sge_schedd
>>
>> error: unable to read qmaster name: qmaster hostname in
>> "/opt/gridengine/default/common/act_qmaster" has zero length
>>
>> critical error: unable to read qmaster name:
>> /opt/gridengine/default/common/act_qmaster
>>
>>
>>
>> Indeed, act_qmaster is zero length, but if I set it to the FQDN of
>> the machine and then restart SGE, the file gets reset to zero
>> length. The configuration file in the same directory is also zero
>> length.
>>     
> Did you also adjust the /etc/hosts file and/or the DNS entry so that
> the hostname (new or old) maps to the changed IP address? You can
> check this with the tools in $SGE_ROOT/utilbin/<your_arch>/
>
> -- Reuti
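
The same kind of check can be roughed out without the SGE utilities. The following is a minimal sketch, not a replacement for the tools Reuti mentions: it uses the system resolver through Python's standard `socket` module, so its answers may differ from what SGE itself sees. The function name and the injectable `forward`/`reverse` parameters are illustrative assumptions.

```python
import socket


def check_name_resolution(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return a list of problems found resolving hostname forward and back."""
    try:
        address = forward(hostname)          # name -> IPv4 address
    except OSError as exc:                   # gaierror/herror are OSErrors
        return [f"forward lookup of {hostname!r} failed: {exc}"]
    try:
        primary, aliases, _addrs = reverse(address)   # address -> names
    except OSError as exc:
        return [f"reverse lookup of {address} failed: {exc}"]
    # Compare case-insensitively against the primary name and its aliases.
    names = {primary.lower()} | {alias.lower() for alias in aliases}
    if hostname.lower() not in names:
        return [f"{hostname!r} -> {address} -> {primary!r}: names do not match"]
    return []
```

Running it on the head node for its own FQDN and for a compute node such as `compute-17.local` should return an empty list if /etc/hosts and DNS agree with the new address. A mismatch after an IP change would point at a stale entry.
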
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>   
