[GE users] failover after a qmaster failure

Rayson Ho rayrayson at gmail.com
Thu Aug 16 20:53:40 BST 2007


Is the SGE_ROOT directory shared across all the hosts?
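
One quick way to check is to print what each host actually reads as the
current master. The sketch below assumes every host uses the standard
$SGE_ROOT/$SGE_CELL/common/act_qmaster layout; the function name
show_act_qmaster is made up for this example, and SGE_CELL falls back to
"default" when it is unset.

```shell
# Sketch only: show the master host name this machine reads from act_qmaster.
# show_act_qmaster is a hypothetical helper, not part of Grid Engine.
show_act_qmaster() {
    if [ -z "$SGE_ROOT" ]; then
        echo "SGE_ROOT is not set" >&2
        return 1
    fi
    # Assumption: the cell is named "default" unless SGE_CELL says otherwise.
    cell=${SGE_CELL:-default}
    file="$SGE_ROOT/$cell/common/act_qmaster"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    # The file is expected to hold one line: the current master's host name.
    head -n 1 "$file"
}
```

If a submit host prints cerebro-rmn1.data while the shadow host prints
cerebro-rmn2.data, the submit host is probably reading its own local copy
of the file rather than a shared one.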

Rayson



On 8/16/07, Hugo R. Hernandez-Mora <hugo.hernandez at loni.ucla.edu> wrote:
> BTW, the problem is only with the submit hosts. The execution hosts and
> the shadow host are working correctly after the failover.
> - Hugo
>
> Hugo R. Hernandez-Mora wrote:
> > Thanks for your suggestion, but after the failover the
> > $SGE_ROOT/$SGE_CELL/common/act_qmaster file is updated correctly.
> >
> > Mulley, Nikhil wrote:
> >> You need to update the act_qmaster file:
> >>
> >> ------
> >> Changing the Master Host: Because the spooling database cannot be
> >> located on an NFS-mounted file system, the following procedure
> >> requires that the Berkeley DB RPC server be used for spooling.
> >> If you configure spooling to a local file system, you must transfer
> >> the spooling database to a local file system on the new sge_qmaster
> >> host.
> >> To change the master host, do the following:
> >> 1. On the current master host, stop the master daemon and the
> >>    scheduler daemon by typing the following command:
> >>      qconf -ks -km
> >> 2. Edit the sge-root/cell/common/act_qmaster file according to the
> >>    following guidelines:
> >>    a. In the act_qmaster file, replace the current host name with
> >>       the new master host's name.
> >>       This name should be the same as the name returned by the
> >>       gethostname utility. To get that name, type the following
> >>       command on the new master host:
> >>         sge-root/utilbin/$ARCH/gethostname
> >>    b. Replace the old name in the act_qmaster file with the name
> >>       returned by the gethostname utility.
> >> 3. On the new master host, run the following script:
> >>      sge-root/cell/common/sge5
> >>    This starts up sge_qmaster and sge_schedd on the new master host.
> >>
> >> ------
> >>
> >> ________________________________
> >>
> >>     From: Hugo R. Hernandez-Mora [mailto:hugo.hernandez at loni.ucla.edu]
> >>     Sent: Thursday, August 16, 2007 4:14 AM
> >>     To: users at gridengine.sunsource.net
> >>     Subject: [GE users] failover after a qmaster failure
> >>
> >>
> >>     Hello there,
> >>     We have configured our cluster with a qmaster host
> >> (cerebro-rmn1.data) and a shadow host (cerebro-rmn2.data). We
> >> configured the failover to take effect after two minutes (we set
> >> SGE_CHECK_INTERVAL=45, SGE_GET_ACTIVE_INTERVAL=90 and
> >> SGE_DELAY_TIME=30). While testing the failover after a qmaster
> >> failure, we noticed that the submit nodes can no longer communicate
> >> with the "new" qmaster (the configured shadow host); they try to
> >> reach the dead qmaster instead of the current one:
> >>
> >>
> >>
> >>         <hdezmora at cerebro-rsn1.data> qstat
> >>         error: commlib error: can't connect to service (Connection refused)
> >>         error: unable to contact qmaster using port 6444 on host "cerebro-rmn1.data"
> >>
> >>         <hdezmora at cerebro-rsn2.data> qstat
> >>         error: commlib error: can't connect to service (Connection refused)
> >>         error: unable to contact qmaster using port 6444 on host "cerebro-rmn1.data"
> >>
> >>
> >>
> >>     Any assistance would be greatly appreciated. Thanks in advance.
> >>     - Hugo
> >>
> >>
> >>     --
> >>     Hugo R. Hernandez-Mora
> >>     System Administrator
> >>     Laboratory of Neuro Imaging, UCLA
> >>     635 Charles E. Young Drive South, Suite 225
> >>     Los Angeles, CA 90095-7332
> >>     Tel: 310.267.5076
> >>     Fax: 310.206.5518
> >>     hugo.hernandez at loni.ucla.edu
> >>     --
> >>
> >>     "If your efforts were met with indifference, do not be discouraged,
> >>     for the sun puts on a marvelous show every morning
> >>     while most people are still asleep"
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>
> >>
> >
>
