[GE users] failover after a qmaster failure

Hugo R. Hernandez-Mora hugo.hernandez at loni.ucla.edu
Thu Aug 16 21:05:07 BST 2007


You found the problem, Rayson!  Thanks!!!
We are in the testing stage for GE v6.1u2.  In production we currently
run SGE 6.0u6 on Solaris 10, and we are also changing the OS to
CentOS 4.5, as used by Rocks Cluster.  For testing we use a separate NFS
server, which will be replaced once we move to production.  The submit
hosts were mounting the NFS filesystems from the "production" server
instead of the "testing" server.  Once we changed that, so that the same
$SGE_ROOT directory is shared across all the hosts, everything worked
as expected after the failover.
Again, thank you very much!
- Hugo
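
For anyone who hits the same symptom, here is a minimal sketch of a check that can be run on each submit host after a failover. It only reads the act_qmaster file under the $SGE_ROOT that the host actually has mounted and compares it with the host that is expected to be master. The function names read_act_qmaster and check_act_qmaster are made up for this example and are not part of Grid Engine; the cell name falls back to "default" when none is given.

```shell
#!/bin/sh
# Sketch only: read_act_qmaster and check_act_qmaster are hypothetical
# helper names, not Grid Engine commands.

# Print the host name recorded in <sge_root>/<cell>/common/act_qmaster.
read_act_qmaster() {
    sge_root=$1
    sge_cell=${2:-default}
    file="$sge_root/$sge_cell/common/act_qmaster"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    # The file holds a single host name; take the first line and
    # strip any surrounding whitespace.
    head -n 1 "$file" | tr -d '[:space:]'
    echo
}

# Return 0 if act_qmaster names the expected master, 1 on a mismatch,
# 2 if the file cannot be read.
check_act_qmaster() {
    expected=$1
    sge_root=$2
    sge_cell=${3:-default}
    actual=$(read_act_qmaster "$sge_root" "$sge_cell") || return 2
    if [ "$actual" = "$expected" ]; then
        echo "OK: act_qmaster names $actual"
    else
        echo "MISMATCH: act_qmaster names $actual, expected $expected"
        return 1
    fi
}
```

On a submit host this could be called as `check_act_qmaster cerebro-rmn2.data "$SGE_ROOT" "$SGE_CELL"` once the shadow host has taken over. A mismatch on the submit hosts only, while the execution hosts report OK, suggests that the submit hosts are mounting a different $SGE_ROOT, which is what happened here.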

Rayson Ho wrote:
> Is the SGE_ROOT directory shared across all the hosts??
>
> Rayson
>
>
>
> On 8/16/07, Hugo R. Hernandez-Mora <hugo.hernandez at loni.ucla.edu> wrote:
>   
>> BTW, the problem is only with the submit hosts.  The execution hosts
>> and the shadow host are working correctly after the failover.
>> - Hugo
>>
>> Hugo R. Hernandez-Mora wrote:
>>     
>>> Thanks for your suggestion but after the failover, the
>>> $SGE_ROOT/$SGE_CELL/common/act_qmaster file is updated correctly.
>>>
>>> Mulley, Nikhil wrote:
>>>       
>>>> You need to update the act_qmaster file:
>>>>
>>>> ------
>>>> Changing the Master Host: Because the spooling database cannot be
>>>> located on an NFS-mounted file system, the following procedure
>>>> requires that the Berkeley DB RPC server be used for spooling.
>>>> If you configure spooling to a local file system, you must transfer
>>>> the spooling database to a local file system on the new sge_qmaster
>>>> host.
>>>> To change the master host, do the following:
>>>> 1. On the current master host, stop the master daemon and the
>>>>    scheduler daemon by typing the following command:
>>>>        qconf -ks -km
>>>> 2. Edit the sge-root/cell/common/act_qmaster file according to the
>>>>    following guidelines:
>>>>    a. In the act_qmaster file, replace the current host name with
>>>>       the new master host's name.
>>>>       This name should be the same as the name returned by the
>>>>       gethostname utility. To get that name, type the following
>>>>       command on the new master host:
>>>>           sge-root/utilbin/$ARCH/gethostname
>>>>    b. Replace the old name in the act_qmaster file with the name
>>>>       returned by the gethostname utility.
>>>> 3. On the new master host, run the following script:
>>>>        sge-root/cell/common/sge5
>>>>    This starts up sge_qmaster and sge_schedd on the new master host.
>>>>
>>>> ------
>>>>
>>>> ________________________________
>>>>
>>>>     From: Hugo R. Hernandez-Mora [mailto:hugo.hernandez at loni.ucla.edu]
>>>>     Sent: Thursday, August 16, 2007 4:14 AM
>>>>     To: users at gridengine.sunsource.net
>>>>     Subject: [GE users] failover after a qmaster failure
>>>>
>>>>
>>>>     Hello there,
>>>>     We have configured our cluster with a qmaster host
>>>> (cerebro-rmn1.data) and a shadow host (cerebro-rmn2.data).  We
>>>> configured the failover to take effect after two minutes (we set
>>>> SGE_CHECK_INTERVAL=45, SGE_GET_ACTIVE_INTERVAL=90 and
>>>> SGE_DELAY_TIME=30).  While testing the failover after a qmaster
>>>> failure, we noticed that the submit nodes can no longer communicate
>>>> with the "new" qmaster (the configured shadow host); they try to
>>>> reach the dead qmaster instead of the current one:
>>>>
>>>>
>>>>
>>>>         <hdezmora at cerebro-rsn1.data> qstat
>>>>         error: commlib error: can't connect to service (Connection refused)
>>>>         error: unable to contact qmaster using port 6444 on host "cerebro-rmn1.data"
>>>>
>>>>         <hdezmora at cerebro-rsn2.data> qstat
>>>>         error: commlib error: can't connect to service (Connection refused)
>>>>         error: unable to contact qmaster using port 6444 on host "cerebro-rmn1.data"
>>>>
>>>>
>>>>
>>>>     Any assistance would be much appreciated.  Thanks in advance.
>>>>     - Hugo
>>>>
>>>>
>>>>     --     Hugo R. Hernandez-Mora
>>>>     System Administrator
>>>>     Laboratory of Neuro Imaging, UCLA
>>>>     635 Charles E. Young Drive South, Suite 225
>>>>     Los Angeles, CA 90095-7332
>>>>     Tel: 310.267.5076
>>>>     Fax: 310.206.5518
>>>>     hugo.hernandez at loni.ucla.edu
>>>>     --
>>>>
>>>>     "If your efforts were met with indifference, do not lose heart,
>>>>     for the sun puts on a marvelous show every morning
>>>>     while most people are still asleep"
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>
>>>>
>>>>         
>




