[GE users] Does a shadow master really improve reliability

templedf dan.templeton at sun.com
Tue Mar 17 15:50:58 GMT 2009


No.  It is the BDB server, not the shadow master, that is the additional 
single point of failure.  If the BDB server fails, the cluster will 
fail.  Let me try to explain that again.

The qmaster writes a heartbeat file to a shared file system every 
second.  The shadow master reads that heartbeat file periodically to 
make sure that the qmaster is still there.  If the heartbeat file goes 
stale, the shadow master will update the act_qmaster file in the cell 
directory and start its own qmaster.  In order for the new qmaster to 
get the cluster's state, it has to have access to the old qmaster's 
spool directory.  There are three options in this regard:

1) The old qmaster uses classic spooling over a shared file system.  
   Slow but simple, and if the cell directory is served from the same 
   place as the spool directory, the shared file system is still the 
   only single point of failure (SPoF).

2) The old qmaster uses local BDB spooling over NFSv4.  Still slow, 
   but a little less simple.  Again, as long as the cell directory is 
   served from the same place as the spool directory, the shared file 
   system is still the only SPoF.

3) The old qmaster uses BDB spooling to a remote BDB server.  Faster 
   than spooling over a shared file system, but much less simple.  
   With this option, the shadow still needs the shared file system to 
   read the qmaster's heartbeat file and update the act_qmaster file.  
   That's one SPoF.  But you now also have the BDB server itself to 
   worry about.  That's the second SPoF.

So with option 3, instead of eliminating the file system as a SPoF, 
you've really only added another one in the BDB server.
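
To make the counting concrete, here is a rough sketch in Python that 
models what a shadow master depends on in order to take over under each 
of the three options.  The component labels, option names, and helper 
functions are made up for illustration and are not Grid Engine 
identifiers.  For options 1 and 2 it assumes the cell directory is 
served from the same place as the spool directory.

```python
# Illustrative model only: labels and function names are hypothetical,
# not part of Grid Engine.
SHARED_FS = "shared file system"
BDB_SERVER = "BDB server"

# Components a shadow master needs in order to take over, per option.
# Options 1 and 2 assume the cell directory and the spool directory are
# served from the same shared file system.
FAILOVER_DEPENDENCIES = {
    "1) classic spooling on shared FS": {SHARED_FS},
    "2) local BDB spooling over NFSv4": {SHARED_FS},
    "3) BDB spooling to remote BDB server": {SHARED_FS, BDB_SERVER},
}


def single_points_of_failure(option):
    """Return the components whose loss alone prevents failover."""
    return sorted(FAILOVER_DEPENDENCIES[option])


def can_fail_over(option, failed_components):
    """True if none of the components needed for failover has failed."""
    return FAILOVER_DEPENDENCIES[option].isdisjoint(failed_components)


if __name__ == "__main__":
    for option in FAILOVER_DEPENDENCIES:
        spofs = single_points_of_failure(option)
        print(f"{option}: {len(spofs)} SPoF(s): {', '.join(spofs)}")
```

In this model, option 3 is the only one where losing the BDB server 
blocks failover, and it does so on top of the shared file system 
dependency that all three options share.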

Make sense?

Daniel

ddavies wrote:
> Hi,
> I'm still concerned about the sentence from the Grid whitepaper I quoted: "The goal of installing a shadow master is to eliminate a single point of failure in the cluster, but adding a remote Berkeley server for spooling adds an additional single point of failure."
>
> This implies that if the shadow master fails, then the Grid will fail. This is what "adds an additional single point of failure" means; correct?
> Is it true that if the shadow master fails, it will cause Grid to fail?
>
> Is there any documented way to install a shadow master such that Grid will not fail when shadow master fails?
>
> Regards,
> Dave Davies
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=134392
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=134401



