[GE users] Odd behavior with act_qmaster - file contents change

Brett W Grant Brett_W_Grant at raytheon.com
Wed Oct 22 16:27:49 BST 2008


I am running 6.1 on a cluster of Macs.  All but two of the Macs run 10.4
Tiger; the other two run 10.5.5 Leopard.  At 7:16 local time this morning,
the contents of the act_qmaster file changed from the qmaster host to one
of these Leopard Macs.  In the spool/qmaster/messages file at 7:17 there is
a message about a corrupted database being detected, then a DB_RUNRECOVERY
message, and then a number of messages where gethostbyname fails.

If I look at the messages file on the host that was named in the
self-modified act_qmaster file, it simply says at 7:20 that it couldn't
connect to the service.

There was no longer an sge_qmaster process running on the original qmaster
host.

This system has been running just fine for over a year.  I did add the two
Leopard clients about a month ago, but they have been working fine since
then.

I guess that I don't really understand what the act_qmaster file is for.  I
didn't see an entry in the Manual section.  How could it change by itself?
What should I do to prevent this from happening in the future?  Where else
can I look to see what happened?  I didn't see anything at all in the
system logs.

Thanks,
Brett Grant


