[GE users] Restarting sge_execd on all nodes
pcu-m at xs4all.nl
Tue Mar 3 22:33:30 GMT 2009
On Tuesday 03 March 2009, reuti wrote:
> Hi Paul,
> On 03.03.2009 at 00:12, paulu wrote:
> > This weekend, a fileserver failure took down the queue master
> > together with the sge_execd daemons on all nodes.
> > Everything is working again. Restarting all sge_execd daemons was
> > done by logging on remotely on each node and starting the daemon
> > from the commandline manually.
> > Is there a smarter way to do that, for example something analogous to
> > the 'qconf -ke all' command? I guess it is a bit of a catch-22
> > situation, because there is no daemon to talk to yet.
> > Of course I could do some scripting, using 'qselect -qs u' to
> > iterate over all unavailable nodes, but perhaps there is a more
> > elegant way.
> > Any suggestion would be welcome.
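The 'qselect -qs u' scripting approach can be sketched in Python roughly as follows. This is a minimal sketch, assuming that qselect prints one queue instance per line in the form queue@host, that passwordless ssh to each node works for a user allowed to start sge_execd (usually root), and that each node has a start script at a path like /etc/init.d/sgeexecd.p6444. That path and all function names are hypothetical; adjust them to the actual installation.

```python
import subprocess

# Hypothetical start script path; the real name depends on how SGE was installed.
DEFAULT_START_SCRIPT = "/etc/init.d/sgeexecd.p6444"


def unavailable_hosts(qselect_output: str) -> list[str]:
    """Return unique host names, in first-seen order, from `qselect -qs u` output.

    Each non-empty line is assumed to be a queue instance such as all.q@node01.
    """
    hosts: list[str] = []
    for line in qselect_output.splitlines():
        line = line.strip()
        if "@" not in line:
            continue  # skip blank or unexpected lines
        host = line.split("@", 1)[1]
        if host not in hosts:  # a host may carry several queue instances
            hosts.append(host)
    return hosts


def restart_commands(hosts: list[str], start_script: str = DEFAULT_START_SCRIPT) -> list[list[str]]:
    """Build one `ssh <host> <start_script> start` argument list per host."""
    return [["ssh", host, start_script, "start"] for host in hosts]


def restart_unknown_execds(start_script: str = DEFAULT_START_SCRIPT) -> None:
    """Ask qselect for queues in state 'u' and start sge_execd on their hosts."""
    result = subprocess.run(["qselect", "-qs", "u"], capture_output=True, text=True)
    for cmd in restart_commands(unavailable_hosts(result.stdout), start_script):
        print("running:", " ".join(cmd))
        subprocess.run(cmd)  # sequential; a failure on one node does not stop the loop
```

Calling restart_unknown_execds() on a host where qselect works would then visit the affected nodes one at a time; deduplicating the host names matters because a node with several queues shows up once per queue instance.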
> How did you install SGE? Usually links and scripts are
> installed in e.g. /etc/init.d and the rc3.d and rc5.d
> subdirectories to start it automatically during boot (under Linux
> the location depends on the distribution). The only pitfall is that
> they might start too early, before other necessary system daemons
> are up.
> I usually make SGE the last thing to start during boot and the
> first to stop during shutdown, i.e. entries like "S99sgemaster.p6444 ->
> ../sgemaster.p6444".
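The S99 naming relies on SysV-style init running the S links of a runlevel directory in ascending numeric order when starting services, and the K links in ascending order when stopping them, so S99 comes last and K01 comes first. A minimal Python sketch of that convention follows, assuming SuSE-style rc directories located inside /etc/init.d (hence the ../ target). The K01 priority and the helper names are illustrative assumptions, and distributions that manage these links with their own tools may rewrite them.

```python
import os


def rc_link_plan(script_name: str, start_prio: int = 99, kill_prio: int = 1) -> list[tuple[str, str]]:
    """Return (link_name, target) pairs for one rc directory.

    Lower numbers run earlier, so S99 starts last and K01 stops first.
    The target assumes the rc directory sits inside /etc/init.d (SuSE style).
    """
    target = f"../{script_name}"
    return [
        (f"S{start_prio:02d}{script_name}", target),
        (f"K{kill_prio:02d}{script_name}", target),
    ]


def install_rc_links(rc_dir: str, script_name: str) -> None:
    """Create (or replace) the planned symlinks inside rc_dir, e.g. /etc/init.d/rc3.d."""
    for link_name, target in rc_link_plan(script_name):
        link_path = os.path.join(rc_dir, link_name)
        if os.path.lexists(link_path):  # also catches dangling symlinks
            os.remove(link_path)
        os.symlink(target, link_path)
```

For example, install_rc_links("/etc/init.d/rc3.d", "sgemaster.p6444") would produce an S99sgemaster.p6444 -> ../sgemaster.p6444 entry plus the matching K01 link.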
The problem is not that the sge_execd daemons do not start on boot.
All nodes were still up and running; only the sge_execd daemon on
each node had died due to an NFS fileserver failure. SGE is installed
on that fileserver, and so are the spool directories and so on.
So I just wanted to know if there was an elegant way to start the
sge_execd daemons again on all nodes with a single command.
Somebody else suggested using pdsh, which I quite like. So that's the
path I will follow, I guess.
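With pdsh, the restart becomes a single command that runs on all nodes in parallel. A minimal sketch that assembles such a pdsh invocation from a host list (for example, the list of unavailable nodes) is shown below. pdsh's -w option takes a comma-separated list of target hosts; the start script path and the function names are hypothetical placeholders.

```python
import shlex
import subprocess

# Hypothetical path; use whatever start script the SGE installation provides.
START_SCRIPT = "/etc/init.d/sgeexecd.p6444"


def pdsh_command(hosts: list[str], remote_cmd: str) -> list[str]:
    """Build `pdsh -w host1,host2,... <remote_cmd>` as an argument list."""
    if not hosts:
        raise ValueError("no hosts given")
    return ["pdsh", "-w", ",".join(hosts), remote_cmd]


def restart_execds(hosts: list[str]) -> int:
    """Run the start script on all hosts in parallel via pdsh; return pdsh's exit code."""
    cmd = pdsh_command(hosts, f"{START_SCRIPT} start")
    print("running:", shlex.join(cmd))
    return subprocess.run(cmd).returncode
```

Compared with an ssh loop, pdsh contacts the nodes concurrently and prefixes each output line with the host name, which makes it easy to see which nodes failed to start the daemon.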