[GE users] Restarting sge_execd on all nodes

reuti reuti at staff.uni-marburg.de
Wed Mar 4 09:49:00 GMT 2009


On 03.03.2009 at 23:33, paulu wrote:

> On Tuesday 03 March 2009, reuti wrote:
>> Hi Paul,
>>
>> On 03.03.2009 at 00:12, paulu wrote:
>>> This weekend, due to a fileserver failure, the queue master went down
>>> together with the sge_execd daemons on all nodes.
>>>
>>> Everything is working again. Restarting all sge_execd daemons was
>>> done by logging in remotely to each node and starting the daemon
>>> manually from the command line.
>>>
>>> Is there some smarter way to do that, for example analogous to
>>> the 'qconf -ke all' command? I guess it is a bit of a catch-22
>>> situation, because there's no daemon to talk to yet.
>>>
>>> Of course I could do some scripting, using 'qselect -qs u' to
>>> iterate over all unavailable nodes, but perhaps there is a more
>>> elegant way.
>>>
>>> Any suggestion would be welcome.
>>
>> How did you install SGE? Usually there will be links and scripts
>> installed in e.g. /etc/init.d and the rc3.d and rc5.d
>> subdirectories to start it automatically during boot (under Linux
>> the location depends on the distribution). The only pitfall is that
>> they might start too early, before other necessary system daemons
>> are up.
>>
>> I usually move the startup of SGE to be the last during startup and
>> the first during shutdown, i.e. entries like "S99sgemaster.p6444 ->
>> ../sgemaster.p6444".
>
> Reuti,
>
> The problem is not that the sge_execd daemons do not start on boot.
>
> All nodes were still up and running, only the sge_execd daemon on
> each node had died due to an (NFS) fileserver failure. SGE is
> installed on that fileserver, and the spool directories and so on
> are on it as well.

Hi,

then I would suggest making at least the spool directories local, so
that this shouldn't happen again. The sge_execd should then survive a
fileserver outage. This was just discussed on the list:

http://gridengine.sunsource.net/howto/nfsreduce.html
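
To see where the execd spool directory currently points, a small helper
along these lines may be useful. It is only a sketch: it assumes that
'qconf -sconf' prints one "name value" pair per line and that the parameter
is called execd_spool_dir; the function name and the host node01 are made
up for illustration.

```shell
#!/bin/sh
# Print the value of execd_spool_dir from 'qconf -sconf' style output
# read on stdin (assumed format: "<name> <value>" per line).
spool_dir_from_conf() {
    awk '$1 == "execd_spool_dir" { print $2 }'
}

# Usage (not run here):
#   qconf -sconf        | spool_dir_from_conf   # global configuration
#   qconf -sconf node01 | spool_dir_from_conf   # local configuration of the
#                                               # hypothetical host node01
```

If the path printed lies on the NFS share, the local configuration of each
host (edited with 'qconf -mconf <host>') is the place where a node-local
directory would go; the howto above has the details.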

Anyway: was the NFS share mounted hard or soft?
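
One way to check this on a Linux node is to look at the mount options in
/proc/mounts. The following is a rough sketch; the function name is made up,
and a mount without an explicit "soft" option is reported as hard, which is
the usual NFS default.

```shell
#!/bin/sh
# Read lines in /proc/mounts format on stdin and print
# "<mount point> hard|soft" for every NFS or NFSv4 mount.
nfs_mount_modes() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        mode = "hard"                      # default when "soft" is absent
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "soft") mode = "soft"
        print $2, mode
    }'
}

# Usage on a node (not run here):
#   nfs_mount_modes < /proc/mounts
```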

-- Reuti


>
> So I just wanted to know if there was an elegant way to start the
> sge_execd daemons again on all nodes with a single command.
>
> Somebody else suggested using pdsh, which I quite like. So that's the
> path I will follow, I guess.
>
> Thanks.
>
> Paul.
>
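
The two ideas mentioned in this thread ('qselect -qs u' to find the
unavailable nodes, pdsh to reach them) could be combined roughly as below.
This is a sketch under assumptions: qselect is taken to print queue
instances as "queue@host", the startup script is assumed to be
$SGE_ROOT/$SGE_CELL/common/sgeexecd, the fallback values /opt/sge and
"default" are placeholders, and both function names are made up. The command
is only printed, so it can be checked before it is run.

```shell
#!/bin/sh
# Turn qselect output (one queue instance per line, e.g. "all.q@node01")
# into a comma-separated list of unique host names.
hosts_from_queue_instances() {
    cut -d@ -f2 | sort -u | paste -sd, -
}

# Print (not execute) a pdsh command that starts sge_execd on the given
# comma-separated hosts. The script path is an assumption, see above.
restart_command() {
    printf 'pdsh -w %s %s/%s/common/sgeexecd start\n' \
        "$1" "${SGE_ROOT:-/opt/sge}" "${SGE_CELL:-default}"
}

# Usage on the qmaster host once sge_qmaster is running again (not run here):
#   hosts=$(qselect -qs u | hosts_from_queue_instances)
#   [ -n "$hosts" ] && restart_command "$hosts"
```

Printing the command first keeps a typo in the host list from touching the
whole cluster; piping the printed line to sh runs it once it looks right.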
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=119883
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=120296

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].