[GE users] Odd behavior with act_qmaster - file contents change

Chris Dagdigian dag at sonsorol.org
Wed Oct 22 16:37:39 BST 2008


Brett,

The act_qmaster file lists the hostname of the currently running
qmaster system. The SGE clients read this file at startup to learn
which host to connect to. The only reason the hostname would change
automatically is if shadow masters are configured. In that case, if
the qmaster can't be contacted within a timeout period, a new qmaster
starts up on a shadow host, reads the spool, and then writes its own
hostname to the act_qmaster file.
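
To see which host the clients will try to contact, you can simply read
the file. A minimal Python sketch, assuming the usual location
$SGE_ROOT/$SGE_CELL/common/act_qmaster and a cell named "default"; the
function names are only for illustration and are not part of SGE:

```python
import os
from pathlib import Path


def read_act_qmaster(sge_root, sge_cell="default"):
    """Return the hostname stored in <sge_root>/<sge_cell>/common/act_qmaster."""
    # Assumed layout: the file holds a single line with the qmaster hostname.
    path = Path(sge_root) / sge_cell / "common" / "act_qmaster"
    return path.read_text().strip()


def short_name(host):
    """Lower-cased hostname without its domain part."""
    return host.strip().lower().split(".")[0]


def qmaster_has_moved(sge_root, expected_host, sge_cell="default"):
    """True if act_qmaster names a host other than expected_host."""
    current = read_act_qmaster(sge_root, sge_cell)
    return short_name(current) != short_name(expected_host)


if __name__ == "__main__":
    root = os.environ.get("SGE_ROOT")
    cell = os.environ.get("SGE_CELL", "default")
    if root:
        print("act_qmaster says:", read_act_qmaster(root, cell))
    else:
        print("SGE_ROOT is not set")
```

Run from cron, something like qmaster_has_moved(root, "your-master-host")
would flag an unexpected failover soon after it happens.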

If you are looking for docs on this, search for "shadow master".
It sounds like you did not intentionally set up, or expect to see,
failover behavior.

If a shadow master is not the culprit, then the only other explanation
is that someone manually tried to start the qmaster on a different
host. You'd see the same symptoms then (a new hostname in act_qmaster).
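
One way to tell the two cases apart is to check whether any shadow
hosts are configured at all. A rough Python sketch, assuming the shadow
host list lives in $SGE_ROOT/$SGE_CELL/common/shadow_masters (one
hostname per line, primary master first) and that a missing file means
no shadow masters were set up; list_shadow_hosts is a made-up helper
name, not an SGE tool:

```python
import os
from pathlib import Path


def list_shadow_hosts(sge_root, sge_cell="default"):
    """Return the hostnames listed in the shadow_masters file, or [] if absent."""
    # Assumed location and format: one hostname per line.
    path = Path(sge_root) / sge_cell / "common" / "shadow_masters"
    if not path.is_file():
        return []
    # Keep non-empty lines only, stripped of surrounding whitespace.
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


if __name__ == "__main__":
    root = os.environ.get("SGE_ROOT")
    cell = os.environ.get("SGE_CELL", "default")
    if not root:
        print("SGE_ROOT is not set")
    else:
        hosts = list_shadow_hosts(root, cell)
        if hosts:
            print("hosts listed in shadow_masters:", ", ".join(hosts))
        else:
            print("no shadow_masters file or no hosts listed")
```

An empty result only suggests that failover was never configured; it
does not prove that nobody started a qmaster by hand on another host.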

-Chris

On Oct 22, 2008, at 11:27 AM, Brett W Grant wrote:

>
> I am running 6.1 on a cluster of Macs.  All but two of the Macs run
> 10.4 Tiger; the other two run 10.5.5 Leopard.  At 7:16 local time this
> morning, the contents of the act_qmaster file changed from the qmaster
> to one of these Leopard Macs.  In the spool/qmaster/messages file at
> 7:17 there is a message about a corrupted database being detected,
> then a DB_RUNRECOVERY message, and then a number of messages where
> gethostbyname fails.
>
> If I look at the messages file on the host that was named in the
> self-modified act_qmaster file, it simply says at 7:20 that it
> couldn't connect to service.
>
> There was no longer a sgemaster process running on the original  
> qmaster host.
>
> This system has been running just fine for over a year.  I did add
> the two Leopard clients about a month ago, but they have been
> working fine since then.
>
> I guess that I don't really understand what the act_qmaster file is  
> for.  I didn't see an entry in the Manual section.  How could it  
> change by itself?  What should I do to prevent this from happening  
> in the future?  Where else can I look to see what happened?  I  
> didn't see anything at all in the system logs.
>
> Thanks,
> Brett Grant

