[GE users] failover after a qmaster failure

Hugo R. Hernandez-Mora hugo.hernandez at loni.ucla.edu
Wed Aug 15 23:44:29 BST 2007



Hello there,
We have configured our cluster with a qmaster host (cerebro-rmn1.data)
and a shadow host (cerebro-rmn2.data).  We configured the failover to
take effect after two minutes (we set SGE_CHECK_INTERVAL=45,
SGE_GET_ACTIVE_INTERVAL=90 and SGE_DELAY_TIME=30).  While testing
failover after a qmaster failure, we noticed that the submit nodes can
no longer communicate with the "new" qmaster (the configured shadow
host); they keep trying to reach the dead qmaster instead of the current one:

    <hdezmora at cerebro-rsn1.data> qstat
    error: commlib error: can't connect to service (Connection refused)
    error: unable to contact qmaster using port 6444 on host "cerebro-rmn1.data"

    <hdezmora at cerebro-rsn2.data> qstat
    error: commlib error: can't connect to service (Connection refused)
    error: unable to contact qmaster using port 6444 on host "cerebro-rmn1.data"
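
For reference, a minimal sketch of how one might check which master a
given host currently resolves.  It assumes the usual Grid Engine layout,
in which client commands read the master's name from
$SGE_ROOT/$SGE_CELL/common/act_qmaster; the function name
act_qmaster_host is hypothetical and not part of Grid Engine.

```shell
#!/bin/sh
# Print the master host recorded in act_qmaster on this machine.
# Assumption: clients look the master up in
#   $SGE_ROOT/$SGE_CELL/common/act_qmaster
# and the cell name falls back to "default" when SGE_CELL is unset.
# act_qmaster_host is a hypothetical helper, not a Grid Engine command.
act_qmaster_host() {
    sge_root=${1:-$SGE_ROOT}
    sge_cell=${2:-${SGE_CELL:-default}}
    file="$sge_root/$sge_cell/common/act_qmaster"

    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi

    # The file is expected to hold a single host name; strip whitespace.
    tr -d ' \t\r\n' < "$file"
    echo
}
```

Running act_qmaster_host on the shadow host and on each submit host
after the failover would show whether they agree.  If a submit host
still prints cerebro-rmn1.data, it is probably reading a stale copy of
the file, for example because the common directory is not shared.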


Any assistance would be much appreciated.  Thanks in advance.
- Hugo

-- 
Hugo R. Hernandez-Mora
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Fax: 310.206.5518
hugo.hernandez at loni.ucla.edu
--

"If your efforts were met with indifference, do not lose heart,
for the sun puts on a marvelous show every morning
while most people are still asleep"



