[GE users] failover after a qmaster failure

Mulley, Nikhil Nikhil.Mulley at deshaw.com
Thu Aug 16 01:57:00 BST 2007


You need to update the act_qmaster file:

------
Changing the Master Host:
Because the spooling database cannot be located on an NFS-mounted file system, the
following procedure requires that the Berkeley DB RPC server be used for spooling.
If you configure spooling to a local file system, you must transfer the spooling
database to a local file system on the new sge_qmaster host.
To change the master host, do the following:
1. On the current master host, stop the master daemon and the scheduler daemon by
typing the following command:
qconf -ks -km
2. Edit the sge-root/cell/common/act_qmaster file according to the following
guidelines:
a. In the act_qmaster file, replace the current host name with the new master
host's name.
This name should be the same as the name returned by the gethostname
utility. To get that name, type the following command on the new master host:
sge-root/utilbin/$ARCH/gethostname
b. Replace the old name in the act_qmaster file with the name returned by the
gethostname utility.
3. On the new master host, run the following script:
sge-root/cell/common/sgemaster
This starts up sge_qmaster and sge_schedd on the new master host.

------
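
Since the clients find the master through the act_qmaster file, the error in your output suggests the copy your submit hosts read still names cerebro-rmn1.data. As a rough sketch of step 2 above, something like the following could be used. The function names set_act_qmaster and check_act_qmaster are my own and are not part of Grid Engine, and the sketch assumes act_qmaster holds just the master host name on its first line:

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with Grid Engine.

# Point an act_qmaster file at a new master host.
# usage: set_act_qmaster FILE NEW_HOST
set_act_qmaster() {
    file=$1
    new_host=$2
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    [ -n "$new_host" ] || { echo "empty host name" >&2; return 1; }
    cp "$file" "$file.bak" || return 1   # keep the previous contents
    printf '%s\n' "$new_host" > "$file"
}

# Succeed only if the first line of FILE equals EXPECTED_HOST.
# usage: check_act_qmaster FILE EXPECTED_HOST
check_act_qmaster() {
    current=$(head -n 1 "$1" 2>/dev/null) || return 1
    [ "$current" = "$2" ]
}
```

After stopping the daemons with qconf -ks -km, you would call set_act_qmaster on sge-root/cell/common/act_qmaster with the name reported by the gethostname utility on the new master. Running check_act_qmaster against the same path on each submit host then shows whether that host sees the new name or a stale copy.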

________________________________

	From: Hugo R. Hernandez-Mora [mailto:hugo.hernandez at loni.ucla.edu] 
	Sent: Thursday, August 16, 2007 4:14 AM
	To: users at gridengine.sunsource.net
	Subject: [GE users] failover after a qmaster failure
	
	
	Hello there,
	We have configured our cluster with a qmaster host (cerebro-rmn1.data) and a shadow host (cerebro-rmn2.data). We configured the failover to take effect after two minutes (we set SGE_CHECK_INTERVAL=45, SGE_GET_ACTIVE_INTERVAL=90 and SGE_DELAY_TIME=30). While testing the failover after a qmaster failure, we noticed that the submit nodes can no longer communicate with the "new" qmaster (the configured shadow host); they try to reach the dead qmaster instead of the current one:
	
	

		<hdezmora at cerebro-rsn1.data> qstat
		error: commlib error: can't connect to service (Connection refused)
		error: unable to contact qmaster using port 6444 on host "cerebro-rmn1.data"
		
		<hdezmora at cerebro-rsn2.data> qstat
		error: commlib error: can't connect to service (Connection refused)
		error: unable to contact qmaster using port 6444 on host "cerebro-rmn1.data"
		


	Any assistance would be greatly appreciated.  Thanks in advance.
	- Hugo
	
	
	-- 
	Hugo R. Hernandez-Mora
	System Administrator
	Laboratory of Neuro Imaging, UCLA
	635 Charles E. Young Drive South, Suite 225
	Los Angeles, CA 90095-7332
	Tel: 310.267.5076
	Fax: 310.206.5518
	hugo.hernandez at loni.ucla.edu
	--
	
	"If your efforts were met with indifference, do not be discouraged,
	for the sun puts on a marvelous show every morning
	while most people are still asleep"
	
