[GE users] failover after a qmaster failure

Hugo R. Hernandez-Mora hugo.hernandez at loni.ucla.edu
Thu Aug 16 20:46:13 BST 2007


Thanks for your suggestion, but after the failover the
$SGE_ROOT/$SGE_CELL/common/act_qmaster file is updated correctly.
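
For anyone who wants to repeat the check, here is a minimal sketch that can be run on a submit host as well as on the shadow host. The helper names read_act_qmaster and act_qmaster_is are made up for illustration and are not part of Grid Engine; the only thing taken from the installation is the act_qmaster path, and the sketch assumes SGE_ROOT and SGE_CELL are set in the environment.

```shell
#!/bin/sh
# Hypothetical helper: print the host name recorded in an act_qmaster file.
# The path argument defaults to the standard location under $SGE_ROOT/$SGE_CELL.
read_act_qmaster() {
    file="${1:-$SGE_ROOT/$SGE_CELL/common/act_qmaster}"
    # The file holds a single host name; take the first line and strip
    # blanks, tabs and carriage returns.
    head -n 1 "$file" | tr -d ' \t\r'
}

# Hypothetical helper: succeed only if the file names the expected master host.
# Usage: act_qmaster_is EXPECTED_HOST [FILE]
act_qmaster_is() {
    expected="$1"
    [ "$(read_act_qmaster "$2")" = "$expected" ]
}
```

For example, "act_qmaster_is cerebro-rmn2.data && echo ok" on cerebro-rsn1.data. If a submit host reads a different name than the shadow host does, the two are not looking at the same copy of the file.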

Mulley, Nikhil wrote:
> You need to update the act_qmaster file:
>
> ------
> Changing the Master Host : 
> Because the spooling database cannot be located on an NFS-mounted file system, the
> following procedure requires that the Berkeley DB RPC server be used for spooling.
> If you configure spooling to a local file system, you must transfer the spooling
> database to a local file system on the new sge_qmaster host.
> To change the master host, do the following:
> 1. On the current master host, stop the master daemon and the scheduler daemon by
> typing the following command:
> qconf -ks -km
> 2. Edit the sge-root/cell/common/act_qmaster file according to the following
> guidelines:
> a. In the act_qmaster file, replace the current host name with the new master
> host's name.
> This name should be the same as the name returned by the gethostname
> utility. To get that name, type the following command on the new master host:
> sge-root/utilbin/$ARCH/gethostname
> b. Replace the old name in the act_qmaster file with the name returned by the
> gethostname utility.
> 3. On the new master host, run the following script:
> sge-root/cell/common/sge5
> This starts up sge_qmaster and sge_schedd on the new master host.
>
> ------
>
> ________________________________
>
> 	From: Hugo R. Hernandez-Mora [mailto:hugo.hernandez at loni.ucla.edu] 
> 	Sent: Thursday, August 16, 2007 4:14 AM
> 	To: users at gridengine.sunsource.net
> 	Subject: [GE users] failover after a qmaster failure
> 	
> 	
> 	Hello there,
> 	We have configured our cluster with a qmaster host (cerebro-rmn1.data) and a shadow host (cerebro-rmn2.data). We configured the failover to take effect after two minutes (we set SGE_CHECK_INTERVAL=45, SGE_GET_ACTIVE_INTERVAL=90 and SGE_DELAY_TIME=30). While testing the failover after a qmaster failure, we noticed that the submit nodes can no longer communicate with the "new" qmaster (the configured shadow host); they keep trying to reach the dead qmaster instead of the current one:
> 	
> 	
>
> 		<hdezmora at cerebro-rsn1.data> qstat
> 		error: commlib error: can't connect to service (Connection refused)
> 		error: unable to contact qmaster using port 6444 on host "cerebro-rmn1.data"
> 		
> 		<hdezmora at cerebro-rsn2.data> qstat
> 		error: commlib error: can't connect to service (Connection refused)
> 		error: unable to contact qmaster using port 6444 on host "cerebro-rmn1.data"
> 		
>
>
> 	Any assistance would be greatly appreciated.  Thanks in advance.
> 	- Hugo
> 	
> 	
> 	-- 
> 	Hugo R. Hernandez-Mora
> 	System Administrator
> 	Laboratory of Neuro Imaging, UCLA
> 	635 Charles E. Young Drive South, Suite 225
> 	Los Angeles, CA 90095-7332
> 	Tel: 310.267.5076
> 	Fax: 310.206.5518
> 	hugo.hernandez at loni.ucla.edu
> 	--
> 	
> 	"If your efforts have been met with indifference, do not be discouraged, 
> 	for the sun puts on a marvelous show every morning 
> 	while most people are still asleep" 
> 	
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>   
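
Step 2 of the quoted procedure boils down to writing one host name into act_qmaster. A minimal sketch of that edit, assuming the name has already been obtained from the gethostname utility mentioned in step 2a and that the master daemon has been stopped as in step 1. set_act_qmaster is a hypothetical helper, not something shipped with Grid Engine, and the ".bak" backup copy is my own addition.

```shell
#!/bin/sh
# Hypothetical helper, not part of Grid Engine: record a new master host name
# in an act_qmaster file. Assumes sge_qmaster has already been stopped.
# Usage: set_act_qmaster NEW_HOST [FILE]
set_act_qmaster() {
    new_host="$1"
    file="${2:-$SGE_ROOT/$SGE_CELL/common/act_qmaster}"
    # Refuse an empty name rather than writing a blank file.
    [ -n "$new_host" ] || return 1
    # Keep a copy of the previous contents (assumed convention: FILE.bak).
    if [ -f "$file" ]; then
        cp "$file" "$file.bak"
    fi
    # The file holds exactly one line: the new master host name.
    printf '%s\n' "$new_host" > "$file"
}
```

In our case this would be "set_act_qmaster cerebro-rmn2.data", although the shadow daemon had already written that name during the failover.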

-- 
Hugo R. Hernandez-Mora
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Fax: 310.206.5518
hugo.hernandez at loni.ucla.edu
--
