[GE users] Cannot start SGE on master

Richard Bohn rxbeee at rit.edu
Thu Mar 15 23:05:28 GMT 2007



Hi,
 
There is no host_aliases file on any of the nodes. I can ping the master by its FQDN from a compute node, and a DNS lookup on the compute node resolves correctly. SSH from the compute node to the master also works using its DNS name.
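
A forward lookup alone does not show whether the address maps back to the same name. The following is a minimal sketch of a forward-then-reverse check, assuming Python is available on the node. The function name check_round_trip and the example hostname are hypothetical, and this is not a replacement for the tools under $SGE_ROOT/utilbin/<your_arch>/, which may resolve names differently.

```python
import socket


def check_round_trip(hostname,
                     forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Resolve hostname to an address, then map the address back to names.

    Returns (address, primary_name, consistent), where consistent is True
    when the reverse lookup lists the name we started from.
    The forward/reverse parameters exist only so the check can be exercised
    without a live resolver; by default the system resolver is used.
    """
    address = forward(hostname)
    primary, aliases, _addresses = reverse(address)
    known_names = {primary.lower()} | {alias.lower() for alias in aliases}
    return address, primary, hostname.lower() in known_names


# Example call (hypothetical hostname, run on a compute node):
#   print(check_round_trip("headnode.example.org"))
```

Running the same check on the master and on a compute node and comparing the results would show whether both sides agree on the master's name after the address change.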
 
If the SGE master is logging that the compute nodes are registered, that has to mean the nodes are reaching the master on port 536. Would this be a correct statement?
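
Registration only exercises the node-to-master direction, while the logged timeout concerns the master reaching the execd on a node. The following is a minimal sketch for testing a plain TCP connection in either direction. The function name can_connect is hypothetical. Port 536 is the qmaster port mentioned above; the execd port is not stated in this thread and has to be taken from the local setup (for example SGE_EXECD_PORT or the sge_execd entry in /etc/services).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    This only tests that something accepts connections on the port;
    it does not speak the SGE protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False


# Example calls (hostnames and EXECD_PORT are placeholders):
#   on a compute node:  can_connect("headnode.example.org", 536)
#   on the master:      can_connect("compute-17.local", EXECD_PORT)
```

If the first call succeeds and the second fails, the problem is likely in the master-to-node direction rather than in registration.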
 
Rick 

________________________________

From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Thu 3/15/2007 5:58 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Cannot start SGE on master



Hi,

On 15.03.2007 at 22:22, Richard Bohn wrote:

> Hi Reuti,
>
> I solved the original problem and have the master and scheduler running.
> But now when I submit a job it never gets executed because the scheduler
> cannot connect to the compute nodes. I see messages like the following
> in the log:
>
> qmaster|cluster|E|got max. unheard timeout for target "execd" on host
> "compute-17.local", can't delivering job "3"

Are all the nodes also aware of the new address of the master? Are
you using a host_aliases file in which the old name is still mentioned?
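
A minimal sketch of such a check, assuming the usual host_aliases layout of one primary hostname followed by its aliases on each line, separated by whitespace. Skipping "#" comment lines is an assumption, and the function name find_alias_lines and the hostnames in the usage comment are hypothetical.

```python
def find_alias_lines(text, name):
    """Return (line_number, fields) for every line that mentions name.

    text is the content of a host_aliases-style file; the comparison is
    case-insensitive and matches whole hostnames only.
    """
    wanted = name.lower()
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Assumption: blank lines and "#" comments carry no aliases.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if wanted in (field.lower() for field in fields):
            hits.append((number, fields))
    return hits


# Example (path is the usual location, old hostname is a placeholder):
#   with open("/opt/gridengine/default/common/host_aliases") as handle:
#       print(find_alias_lines(handle.read(), "old-head.example.org"))
```

An empty result means the old name does not appear in the file as a whole hostname.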

-- Reuti


> I can do a qstat on the remote node and see the job waiting in the
> queue.
>
> We did upgrade the local LAN switch during the move and set the local
> machines and head node to use jumbo frames, which the switch supports. I
> don't know if SGE would be sensitive to this. I haven't seen any other
> problems with the local LAN and talking to the remote nodes.
>
> I rebooted the head node, and the qmaster log has a number of entries
> saying the execd on the various compute nodes was registered. Doing a
> qstat -f shows nothing in the status column. When I do
> qsub -b y /bin/hostname
> the job stays pending, a few nodes show "au" in the status field, and
> I see the above error in the qmaster log.
>
> I also have restarted sge execd on all the compute nodes.
>
> Thanks for the help.
>
> Rick
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Thursday, March 15, 2007 1:44 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Cannot start SGE on master
>
> Hi,
>
> On 15.03.2007 at 17:44, Richard Bohn wrote:
>
>> Hello All,
>>
>>
>>
>> I'm running SGE (version 6 u6) under the ROCKS clustering software.
>> It had been working fine until we moved the cluster which meant
>> changing IP address of head node public interface. Now when I try
>> to start SGE I get the following error:
>>
>>
>>
>> ./sgemaster start
>>
>>    starting sge_qmaster
>>
>>    starting sge_schedd
>>
>> error: unable to read qmaster name: qmaster hostname in
>> "/opt/gridengine/default/common/act_qmaster" has zero length
>>
>> critical error: unable to read qmaster name:
>> /opt/gridengine/default/common/act_qmaster
>>
>>
>>
>> Indeed, act_qmaster is zero length, but if I set it to the FQDN
>> hostname of the machine and then restart SGE, the file gets reset
>> back to zero length. The configuration file in the same directory
>> is also zero length.
>
> Did you also adjust the /etc/hosts file and/or DNS entry to reflect the
> new (or old) name under the changed TCP/IP address? You can check
> this with the tools in $SGE_ROOT/utilbin/<your_arch>/
>
> -- Reuti
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>









