[GE users] failover after a qmaster failure

Hugo R. Hernandez-Mora hugo.hernandez at loni.ucla.edu
Thu Aug 16 20:51:03 BST 2007


BTW, the problem affects only the submit hosts. The execution hosts and
the shadow host are working correctly after the failover.
- Hugo
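
One check that may help narrow this down is to confirm what each submit host actually reads from its own view of act_qmaster, since the qstat errors still name cerebro-rmn1.data. The following is only a minimal sketch: the function names read_act_qmaster and check_act_qmaster are hypothetical helpers, and it assumes that clients take the master name from $SGE_ROOT/$SGE_CELL/common/act_qmaster as seen on the submit host.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Grid Engine) for inspecting act_qmaster.
# Assumption: a client resolves the qmaster from the first line of this file.

# read_act_qmaster FILE: print the first line of FILE without whitespace.
read_act_qmaster() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
    # head keeps only the host name line; tr strips blanks, tabs, CR and LF.
    head -n 1 "$1" | tr -d ' \t\r\n'
}

# check_act_qmaster FILE EXPECTED: return 0 if FILE names EXPECTED, else 1.
check_act_qmaster() {
    seen=$(read_act_qmaster "$1") || return 1
    if [ "$seen" = "$2" ]; then
        echo "ok: $1 names $seen"
    else
        echo "mismatch: $1 names '$seen', expected '$2'"
        return 1
    fi
}
```

Run on a submit host after the failover, for example as
check_act_qmaster "$SGE_ROOT/$SGE_CELL/common/act_qmaster" cerebro-rmn2.data
a mismatch would suggest that the submit host sees a different or stale copy of the file than the shadow host does.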

Hugo R. Hernandez-Mora wrote:
> Thanks for your suggestion, but the
> $SGE_ROOT/$SGE_CELL/common/act_qmaster file is already updated
> correctly after the failover.
>
> Mulley, Nikhil wrote:
>> You need to update the act_qmaster file:
>>
>> ------
>> Changing the Master Host: Because the spooling database cannot be
>> located on an NFS-mounted file system, the following procedure
>> requires that the Berkeley DB RPC server be used for spooling. If you
>> configure spooling to a local file system, you must transfer the
>> spooling database to a local file system on the new sge_qmaster host.
>>
>> To change the master host, do the following:
>> 1. On the current master host, stop the master daemon and the
>>    scheduler daemon by typing the following command:
>>      qconf -ks -km
>> 2. Edit the sge-root/cell/common/act_qmaster file according to the
>>    following guidelines:
>>    a. In the act_qmaster file, replace the current host name with the
>>       new master host's name. This name should be the same as the
>>       name returned by the gethostname utility. To get that name,
>>       type the following command on the new master host:
>>         sge-root/utilbin/$ARCH/gethostname
>>    b. Replace the old name in the act_qmaster file with the name
>>       returned by the gethostname utility.
>> 3. On the new master host, run the following script:
>>      sge-root/cell/common/sge5
>>    This starts up sge_qmaster and sge_schedd on the new master host.
>>
>> ------
>>
>> ________________________________
>>
>>     From: Hugo R. Hernandez-Mora [mailto:hugo.hernandez at loni.ucla.edu]
>>     Sent: Thursday, August 16, 2007 4:14 AM
>>     To: users at gridengine.sunsource.net
>>     Subject: [GE users] failover after a qmaster failure
>>     
>>     
>>     Hello there,
>>     We have configured our cluster with a qmaster host
>>     (cerebro-rmn1.data) and a shadow host (cerebro-rmn2.data). We
>>     configured the failover to take effect after two minutes (we set
>>     SGE_CHECK_INTERVAL=45, SGE_GET_ACTIVE_INTERVAL=90, and
>>     SGE_DELAY_TIME=30). While testing the failover after a qmaster
>>     failure, we noticed that the submit nodes can no longer
>>     communicate with the "new" qmaster (the configured shadow host);
>>     they try to reach the dead qmaster instead of the current one:
>>     
>>     
>>
>>         <hdezmora at cerebro-rsn1.data> qstat
>>         error: commlib error: can't connect to service (Connection refused)
>>         error: unable to contact qmaster using port 6444 on host "cerebro-rmn1.data"
>>
>>         <hdezmora at cerebro-rsn2.data> qstat
>>         error: commlib error: can't connect to service (Connection refused)
>>         error: unable to contact qmaster using port 6444 on host "cerebro-rmn1.data"
>>        
>>
>>
>>     Any assistance would be greatly appreciated. Thanks in advance.
>>     - Hugo
>>     
>>     
>>     --
>>     Hugo R. Hernandez-Mora
>>     System Administrator
>>     Laboratory of Neuro Imaging, UCLA
>>     635 Charles E. Young Drive South, Suite 225
>>     Los Angeles, CA 90095-7332
>>     Tel: 310.267.5076
>>     Fax: 310.206.5518
>>     hugo.hernandez at loni.ucla.edu
>>     --
>>     
>>     "If your efforts were met with indifference, do not be discouraged,
>>     for the sun puts on a wonderful show every morning
>>     while most people are still asleep"
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>   
>
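
For reference, step 2 of the quoted procedure (replacing the host name in act_qmaster) could be sketched roughly as below. This is an illustration under stated assumptions, not the documented tool: set_act_qmaster is a hypothetical helper, the new host name is expected to be the one reported by the gethostname utility on the new master host, and the master and scheduler daemons are assumed to have been stopped first with qconf -ks -km as in step 1.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): rewrite act_qmaster.
# Assumption: the file holds a single line containing the master host name.

# set_act_qmaster FILE NEWHOST: back up FILE, then store NEWHOST in it.
set_act_qmaster() {
    file=$1
    newhost=$2
    [ -n "$newhost" ] || { echo "empty host name" >&2; return 1; }
    # Keep the previous contents so the change can be reverted by hand.
    cp -p "$file" "$file.bak" || return 1
    # Write to a temporary file first, then rename it into place, so a
    # reader never sees a half-written host name.
    printf '%s\n' "$newhost" > "$file.new" || return 1
    mv "$file.new" "$file"
}
```

A call might look like
set_act_qmaster "$SGE_ROOT/$SGE_CELL/common/act_qmaster" cerebro-rmn2.data
where the second argument is copied from the gethostname output rather than typed from memory. In the failover case described in this thread the shadow host already updates the file itself, so this sketch applies only to the manual procedure.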
