[GE users] problem with migrating to shadow master

reuti reuti at staff.uni-marburg.de
Tue Nov 23 21:56:08 GMT 2010


Hi Tina,

On 23.11.2010 at 10:34, rumpelkeks wrote:

>>> I have just discovered that I can't seem to migrate my qmaster service
>>> any more (it definitely used to work).
>>> 
>>> I get the "old qmaster did not write lock file. Cannot migrate qmaster."
>>> error. If I manually touch a file called 'lock' in the spool directory,
>>> all works fine. Oh plus it successfully shuts down the qmaster (just
>>> never starts one). So I've probably traced it down to 'no lock file'.
>>> 
>>> Have found nothing in the logs to explain it either.
>>> 
>>> Now. Need some pointers for further debugging.
>>> 
>>> I don't quite understand this 'locking' mechanism - when and where (and
>>> where to) is the 'lock' written? (I can't really find anything in the
>>> startup script that writes the file, only things that check for it). Is
>>> this something that the old qmaster writes when it's shutting down and
>>> the new one only starts once it appears? (There certainly is no 'lock'
>>> file in the spool directory when the qmaster is running.)
>> 
>> AFAIK: the lock file is written by the qmaster, and removed when it's shut down in a proper fashion. But when it crashes, the file will stay there and the heartbeat file won't be updated any longer. Then the shadow master decides to take over and start a new qmaster.
>> 
>> Do you have a file $SGE_ROOT/default/common/shadow_masters?
>> 
>> -- Reuti
> 
> I do have a shadow_masters file, yes.
> 
> Failover in case one dies appears to work (i.e. if I kill the qmaster 
> process, eventually the shadow master starts it). It's the manual 
> migrate (by calling the startup script with -migrate) that's not working 
> at the moment.
> 
> I do not appear to have any lock file 'in residence' though. I've 
> discovered (further testing) that it appears to work when migrating from 
> 'master B' to 'master A', but not 'master A' to 'master B'. With the 
> successful migration, a file $SGE_ROOT/$SGE_CELL/spool/qmaster/lock 
> makes a brief appearance. It is, however, not there in normal operation 
> - looks to me as if the shutting down master creates it only on 
> shutdown, which puzzled me. (In case of the unsuccessful migration 
> attempt, there never is a $SGE_ROOT/$SGE_CELL/spool/qmaster/lock file. 
> Also if I manually touch one, migration works.)

So it looks like the lock file is written to confirm a successful shutdown (the opposite of what I was used to). It would then prevent a shadowd from acting on an unchanged heartbeat file, since -migrate first shuts down the current master and then starts its own.
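
To see whether the old master writes the lock file at all during a migration attempt, you could poll for it by hand. The following is only a sketch under my own assumptions: the function name wait_for_lock and the 30-second default are made up for illustration and are not taken from the startup script. Only the lock file location comes from your description.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): poll for a 'lock' file
# in the qmaster spool directory.
# Returns 0 as soon as the file exists, non-zero if it never shows up
# within the timeout.
wait_for_lock() {
    spool_dir=$1        # e.g. $SGE_ROOT/$SGE_CELL/spool/qmaster
    timeout=${2:-30}    # seconds to wait; 30 is an arbitrary default
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if [ -f "$spool_dir/lock" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # One last check after the loop, so a timeout of 0 still tests once.
    [ -f "$spool_dir/lock" ]
}
```

Running wait_for_lock "$SGE_ROOT/$SGE_CELL/spool/qmaster" 60 in a second terminal while you trigger the migration from 'master A' to 'master B' should show whether the file ever appears in the failing direction.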

Do you want two qmaster hosts, each able to start up when the other is missing (two-way failover), so that you have a shadowd running on both of them?

-- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=298089
