[GE users] problem with migrating to shadow master
reuti at staff.uni-marburg.de
Tue Nov 23 21:56:08 GMT 2010
On 23.11.2010 at 10:34, rumpelkeks wrote:
>>> I have just discovered that I can't seem to migrate my qmaster service
>>> any more (it definitely used to work).
>>> I get the "old qmaster did not write lock file. Cannot migrate qmaster."
>>> error. If I manually touch a file called 'lock' in the spool directory,
>>> all works fine. Oh plus it successfully shuts down the qmaster (just
>>> never starts one). So I've probably traced it down to 'no lock file'.
>>> Have found nothing in the logs to explain it either.
>>> Now. Need some pointers for further debugging.
>>> I don't quite understand this 'locking' mechanism - when and where (and
>>> where to) is the 'lock' written? (I can't really find anything in the
>>> startup script that writes the file, only things that check for it). Is
>>> this something that the old qmaster writes when it's shutting down and
>>> the new one only starts once it appears? (There certainly is no 'lock'
>>> file in the spool directory when the qmaster is running.)
>> AFAIK: the lock file is written by the qmaster, and removed when it's shut down in a proper fashion. But when it crashes, the file will stay there and the heartbeat file won't be updated any longer. Then the shadow master decides to take over and start a new qmaster.
>> Do you have a file $SGE_ROOT/default/common/shadow_masters?
>> -- Reuti
> I do have a shadow_masters file, yes.
> Failover in case one dies appears to work (i.e. if I kill the qmaster
> process, eventually the shadow master starts it). It's the manual
> migrate (by calling the startup script with -migrate) that's not working
> at the moment.
> I do not appear to have any lock file 'in residence' though. I've
> discovered (further testing) that it appears to work when migrating from
> 'master B' to 'master A', but not 'master A' to 'master B'. With the
> successful migration, a file $SGE_ROOT/$SGE_CELL/spool/qmaster/lock
> makes a brief appearance. It is, however, not there in normal operation
> - looks to me as if the shutting down master creates it only on
> shutdown, which puzzled me. (In case of the unsuccessful migration
> attempt, there never is a $SGE_ROOT/$SGE_CELL/spool/qmaster/lock file.
> Also if I manually touch one, migration works.)
It looks like the lock file is written to confirm a successful shutdown, then (the opposite of what I was used to). It would then keep a shadowd from acting on an unchanged heartbeat file, since -migrate first shuts down the current master and then starts its own.
Do you want two qmaster hosts, each able to start up when the other is missing, so that you have a shadowd running on both of them?
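
To see exactly when the lock file appears and disappears during a -migrate attempt, a small poller might help. This is only a sketch under assumptions: the path is the one you quoted ($SGE_ROOT/$SGE_CELL/spool/qmaster/lock), the names lock_path and watch are made up for illustration, and nothing here is part of Grid Engine itself.

```python
import os
import time


def lock_path(sge_root, sge_cell="default"):
    # Path as quoted in this thread; a site with a differently
    # configured qmaster spool directory would need to adjust it.
    return os.path.join(sge_root, sge_cell, "spool", "qmaster", "lock")


def watch(path, duration=60.0, interval=0.2, exists=os.path.exists):
    """Poll `path` for `duration` seconds and record state changes.

    Returns a list of (seconds_since_start, event) tuples. The first
    entry is the initial state ("present" or "absent"); later entries
    are "appeared" or "disappeared". `exists` is a hypothetical hook
    so the check can be replaced when testing.
    """
    start = time.monotonic()
    present = exists(path)
    events = [(0.0, "present" if present else "absent")]
    while time.monotonic() - start < duration:
        time.sleep(interval)
        now = exists(path)
        if now != present:
            elapsed = round(time.monotonic() - start, 2)
            events.append((elapsed, "appeared" if now else "disappeared"))
            present = now
    return events
```

One way to use it: call watch(lock_path(os.environ["SGE_ROOT"], os.environ.get("SGE_CELL", "default"))) on a host that sees the spool directory, then run the startup script with -migrate from another terminal and print the returned events. Comparing the output for 'master A' to 'master B' against the reverse direction should show whether the lock file is ever created in the failing case. A file that exists for less than the polling interval can be missed, so a small interval is advisable.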