[GE users] problem with migrating to shadow master
tina.friedrich at diamond.ac.uk
Wed Nov 24 10:12:58 GMT 2010
>>>> I have just discovered that I can't seem to migrate my qmaster service
>>>> any more (it definitely used to work).
>>>> I get the "old qmaster did not write lock file. Cannot migrate qmaster."
>>>> error. If I manually touch a file called 'lock' in the spool directory,
>>>> all works fine. Oh plus it successfully shuts down the qmaster (just
>>>> never starts one). So I've probably traced it down to 'no lock file'.
>>>> Have found nothing in the logs to explain it either.
>>>> Now. Need some pointers for further debugging.
>>>> I don't quite understand this 'locking' mechanism - when and where (and
>>>> where to) is the 'lock' written? (I can't really find anything in the
>>>> startup script that writes the file, only things that check for it). Is
>>>> this something that the old qmaster writes when it's shutting down and
>>>> the new one only starts once it appears? (There certainly is no 'lock'
>>>> file in the spool directory when the qmaster is running.)
>>> AFAIK: the lock file is written by the qmaster, and removed when it's shut down in a proper fashion. But when it crashes, the file will stay there and the heartbeat file won't be updated any longer. Then the shadow master decides to take over and start a new qmaster.
>>> Do you have a file $SGE_ROOT/default/common/shadow_masters?
>>> -- Reuti
>> I do have a shadow_masters file, yes.
>> Failover in case one dies appears to work (i.e. if I kill the qmaster
>> process, eventually the shadow master starts it). It's the manual
>> migrate (by calling the startup script with -migrate) that's not working
>> at the moment.
>> I do not appear to have any lock file 'in residence' though. I've
>> discovered (further testing) that it appears to work when migrating from
>> 'master B' to 'master A', but not 'master A' to 'master B'. With the
>> successful migration, a file $SGE_ROOT/$SGE_CELL/spool/qmaster/lock
>> makes a brief appearance. It is, however, not there in normal operation
>> - looks to me as if the shutting down master creates it only on
>> shutdown, which puzzled me. (In case of the unsuccessful migration
>> attempt, there never is a $SGE_ROOT/$SGE_CELL/spool/qmaster/lock file.
>> Also if I manually touch one, migration works.)
> It looks like the lock file is written to confirm a successful shutdown then (the opposite of what I was used to). It would then prevent a shadowd from acting on the unchanged heartbeat file, since -migrate first shuts down the current master and then starts its own.
> Do you want two qmaster hosts, each able to start up when the other is missing, so that you have a shadowd running on both of them?
That seems to be the theory, only in my case it doesn't seem to work
very well (and I'm trying to find out why - it might be a timing issue,
my $SGE_ROOT/$SGE_CELL/spool etc is on an NFS share).
Yes I do want (and have) two servers running a shadowd (and one running
a qmaster) so they can take over if one fails. Which from my tests is
still working (I'll do some more testing).
However, not actually being able to cleanly migrate (well, not unless I
manually fake a lock file to appear) is annoying. It's a useful feature,
I think; I stumbled upon this problem when I wanted to migrate off the
current master to be able to take it down for maintenance. I am sure it
used to work, I tested it a lot when I set up the shadow master.
This was before I upgraded to 6.2u4 from 6.2u2 though - did the
mechanism on how a migration is handled change between u2 and u4 to
anyone's knowledge? I'm trying to find out if this is a problem within
SGE (odd timing or something), or a problem with my setup (which I don't
think changed since this was working). I can't find a lot of information
about the actual mechanism (i.e. who is supposed to write the lock
file, and when; stuff like that), which limits my debugging capabilities
a bit :)
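
One way to narrow down the timing question is to record exactly when the
lock file shows up and disappears during a -migrate run. What follows is
only a minimal Python sketch, not anything shipped with SGE: the helper
names (lock_path, watch) and the on_poll hook are made up for illustration,
and the only path it relies on is the
$SGE_ROOT/$SGE_CELL/spool/qmaster/lock location mentioned above.

```python
import time
from pathlib import Path


def lock_path(sge_root, sge_cell="default"):
    """Build the lock file path seen in the thread (hypothetical helper)."""
    return Path(sge_root) / sge_cell / "spool" / "qmaster" / "lock"


def watch(path, duration=60.0, interval=0.1, on_poll=None):
    """Poll `path` and return a list of (seconds_since_start, event) tuples.

    event is "appeared" or "removed". on_poll is an optional callback run
    after each poll; it is an assumption added only to make testing easy.
    """
    events = []
    start = time.monotonic()
    present = path.exists()
    while time.monotonic() - start < duration:
        now_present = path.exists()
        if now_present != present:
            # Record the transition with a coarse relative timestamp.
            elapsed = round(time.monotonic() - start, 3)
            events.append((elapsed, "appeared" if now_present else "removed"))
            present = now_present
        if on_poll is not None:
            on_poll()
        time.sleep(interval)
    return events
```

To use it, call watch(lock_path(os.environ["SGE_ROOT"], os.environ["SGE_CELL"]))
(with os imported) on the host that runs the startup script with -migrate,
start the migration in another shell, and compare the A-to-B and B-to-A
runs. With the spool directory on NFS, client-side caching may delay when a
file created on one host becomes visible on another, so an empty result on
the migrating host while the file is visible on the old master would point
towards a timing problem rather than a missing write.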
> -- Reuti
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442