[GE users] Odd behavior with act_qmaster - file contents change

Brett W Grant Brett_W_Grant at raytheon.com
Wed Oct 22 16:49:49 BST 2008


I do not have shadow masters or failover set up.  I did look at both of the 
Leopard machines, which I set up with plist files in LaunchDaemons, 
and noticed that on one of them I had only put in an sgeexecd.plist file, 
but on the machine that "took over" the act_qmaster file, I had put an 
sgemaster.plist file in it.  I removed this file and restarted the 
machine.  I am wondering whether there was some sort of blip that caused 
this machine to start sgemaster.  Could this be seen as the manual start 
of sgemaster that you mentioned?
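
A check like the one described above (looking for a launchd job that would start sgemaster on a machine meant to be an execution host only) could be scripted roughly as follows. This is only a sketch in Python: the function names are made up for illustration, and it assumes the jobs launch the startup script through the standard launchd `Program` or `ProgramArguments` keys and that the script path contains the string "sgemaster".

```python
import plistlib
from pathlib import Path


def job_command(plist_path):
    """Return the command words a launchd job definition would run."""
    with open(plist_path, "rb") as fp:
        job = plistlib.load(fp)
    words = list(job.get("ProgramArguments", []))
    if "Program" in job:
        words.insert(0, job["Program"])
    return [str(w) for w in words]


def jobs_starting(directory, needle="sgemaster"):
    """List the plist files in directory whose command mentions needle."""
    hits = []
    for plist_path in sorted(Path(directory).glob("*.plist")):
        try:
            words = job_command(plist_path)
        except Exception:
            continue  # unreadable or malformed plist: skip it
        if any(needle in w for w in words):
            hits.append(plist_path.name)
    return hits
```

Calling `jobs_starting("/Library/LaunchDaemons")` on each execution host should return an empty list if no job there launches the master.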

Thanks,
Brett Grant




From: Chris Dagdigian <dag at sonsorol.org>
Date: 10/22/08 08:39 AM
Reply-To: users at gridengine.sunsource.net
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Odd behavior with act_qmaster - file contents change






Brett,

The act_qmaster file lists the hostname of the currently running 
qmaster system. The SGE clients read this file at startup to learn 
what host to connect to. The only reason the hostname would change 
automatically would be if you had configured shadow masters - in that 
case if the qmaster can't be contacted within a timeout period, a new 
qmaster starts up, reads the spool and then writes its hostname to the 
act_qmaster file.
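
To catch an unexpected change like this early, the file can be compared against the host that is supposed to be the master. The following is a minimal Python sketch, not part of SGE: the helper names are invented for illustration, the path assumes the usual $SGE_ROOT/$SGE_CELL/common/act_qmaster layout with the cell named "default", and comparing only the short hostname is a simplifying assumption.

```python
from pathlib import Path


def act_qmaster_path(sge_root, sge_cell="default"):
    """Return the conventional location of the act_qmaster file."""
    return Path(sge_root) / sge_cell / "common" / "act_qmaster"


def read_act_qmaster(path):
    """Return the hostname recorded in act_qmaster (first non-empty line)."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:
            return line
    raise ValueError(f"{path} is empty")


def same_host(recorded, expected):
    """Compare hostnames case-insensitively, ignoring any domain suffix."""
    def short(name):
        return name.strip().lower().split(".")[0]
    return short(recorded) == short(expected)
```

Run periodically (for example from cron), a check such as `same_host(read_act_qmaster(act_qmaster_path(sge_root)), expected_master)` returning False would flag that some other host has written its name into the file.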

If you are looking for docs on this, I'd search for "shadow master". 
It sounds like you did not intentionally set up, or expect to see, 
failover behavior.

If shadow master is not the culprit, then the only other reason would 
be that someone manually tried to start the qmaster on a different 
host -- you'd see the same symptoms then (new hostname in act_qmaster).

-Chris

On Oct 22, 2008, at 11:27 AM, Brett W Grant wrote:

>
> I am running 6.1 on a cluster of Macs.  All but two of the Macs run 
> 10.4 Tiger; two run 10.5.5 Leopard.  At 7:16 local time this 
> morning, the contents of the act_qmaster file changed from the 
> qmaster host to one of these Leopard Macs.  In the 
> spool/qmaster/messages file at 7:17 there is a message about a 
> corrupted database being detected, then a DB_RUNRECOVERY message, 
> and then a number of messages where gethostbyname fails.
>
> If I look at the messages file on the host that was named in the 
> self-modified act_qmaster file, it simply says at 7:20 that it 
> couldn't connect to service.
>
> There was no longer a sgemaster process running on the original 
> qmaster host.
>
> This system has been running just fine for over a year.  I did add 
> the two Leopard clients about a month ago, but they have been 
> working fine since then.
>
> I guess that I don't really understand what the act_qmaster file is 
> for.  I didn't see an entry in the Manual section.  How could it 
> change by itself?  What should I do to prevent this from happening 
> in the future?  Where else can I look to see what happened?  I 
> didn't see anything at all in the system logs.
>
> Thanks,
> Brett Grant

