[GE users] qmaster migration
pvdmeer at gmail.com
Thu Mar 11 11:49:56 GMT 2010
Dear gridengine users,
Recently I attempted to migrate the master to another machine in our cluster. What I did was basically this:
1) shut down the daemons
2) installed the "gridengine-master" package on the new machine and configured it
3) copied the "/var/lib/gridengine/default/common" directory from the old to the new machine, changed the NFS mount of this directory, and set the "act_qmaster" file to contain the new hostname
4) adjusted "/var/spool/gridengine/qmaster/jobseqnum" to reflect the current job number
5) started up all daemons
I think I missed something here. For one, it's pretty cumbersome. More importantly, users noticed that running "qdel" on their jobs could basically freeze a machine, raising the load average to about 20.0 (!). As suspected, there was something wrong with the job numbers: they seemed to start from 0 again. Oh no!
So I repeated the migration with one additional step: this time I made sure to also copy "/var/spool/gridengine/spooldb". So far, it seems to work. I can submit and delete jobs without a problem. But it seems like quite a chore.
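For reference, the file-copying part of the procedure above (steps 3 and 4 plus the extra "spooldb" copy) could be scripted roughly like this. It is only a sketch of what I did by hand, not an official migration procedure. The function name and its parameters are made up for illustration. It assumes all daemons are stopped and that the old machine's directories are reachable as ordinary paths (for example through a temporary mount). It carries the old "jobseqnum" over unchanged instead of editing it by hand, and it does not handle file ownership, NFS mounts, or package installation.

```python
import shutil
from pathlib import Path


def copy_qmaster_state(old_common, new_common, old_spool, new_spool, new_host):
    """Copy qmaster files to the new machine and point act_qmaster at new_host.

    Hypothetical helper: run only while all Grid Engine daemons are stopped.
    Returns the job sequence number carried over, as a string.
    """
    old_common, new_common = Path(old_common), Path(new_common)
    old_spool, new_spool = Path(old_spool), Path(new_spool)

    # Step 3: copy the cell's "common" directory (configuration, act_qmaster, ...).
    shutil.copytree(old_common, new_common, dirs_exist_ok=True)

    # Additional step: copy the spool database so existing job data comes along.
    shutil.copytree(old_spool / "spooldb", new_spool / "spooldb", dirs_exist_ok=True)

    # Step 4: carry over the job sequence number so job IDs do not restart.
    seq_src = old_spool / "qmaster" / "jobseqnum"
    seq_dst = new_spool / "qmaster" / "jobseqnum"
    seq_dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(seq_src, seq_dst)

    # Step 3, last part: act_qmaster must contain the new master's hostname.
    (new_common / "act_qmaster").write_text(new_host + "\n")

    return seq_dst.read_text().strip()
```

With the old machine's directories mounted under a hypothetical "/mnt/oldmaster", a call would look like copy_qmaster_state("/mnt/oldmaster/var/lib/gridengine/default/common", "/var/lib/gridengine/default/common", "/mnt/oldmaster/var/spool/gridengine", "/var/spool/gridengine", "newmaster"). Note that shutil does not preserve file ownership, so the copied files may need their owner fixed afterwards.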
Is there an official and clean 'n easy way to migrate? I'm running Debian on 64-bit Intel, with SGE 6.2u3 (IIRC).
Any hints are appreciated.
With kind regards,
Pieter van der Meer