[GE users] sge_execd fails to start after crash and reboot

Filipe Brandenburger filipe.brandenburger at idilia.com
Thu Jun 19 17:38:55 BST 2008


Hi,

One of my machines just crashed (kernel panic on Linux) and I rebooted
it quickly. After the reboot, sge_execd didn't start. I checked the
console (I edited the init script and commented out the exec >/dev/null)
and this was the message I got:

> error: commlib error: endpoint is not unique error (endpoint "node5.mydomain.com/execd/1" is already connected)
> error: getting configuration: unable to contact qmaster using port 536 on host "sgemaster"
> error: there is already a client endpoint node5.mydomain.com/execd/1 connected to qmaster service

I believe the problem was that the master host hadn't yet found out that
the machine was unreachable (it hadn't marked it as "u" in the "qstat -f"
output). So I guess the master thought that the node was still connected.

My question here is: is there a way to tune this to avoid this
situation? Can I change the timeouts/heartbeats to make the master see
more quickly that a machine is unreachable?

Or, on the other hand, can I configure it so that if the master sees a
machine connecting with a name it already knows, it discards the old one
and accepts the connection of the new one?

Or, still another option, can I configure sge_execd to retry
periodically after this kind of situation?
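
To make that last option concrete, here is a minimal sketch of the kind
of retry I have in mind, done from a wrapper rather than inside
sge_execd itself. It assumes that the start command exits with a
non-zero status when it fails to register with qmaster, which I have
not verified for sge_execd. The function name, the number of attempts
and the delay are all hypothetical values chosen for illustration.

```python
import subprocess
import time


def start_with_retry(cmd, attempts=10, delay=30.0,
                     run=subprocess.call, sleep=time.sleep):
    """Run cmd until it exits with status 0 or the attempts run out.

    Returns the 1-based number of the attempt that succeeded,
    or 0 if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        # ASSUMPTION: a failed start is reported as a non-zero exit status.
        if run(cmd) == 0:
            return attempt
        if attempt < attempts:
            # Wait before the next try, to give the master time to
            # notice that the old connection is gone.
            sleep(delay)
    return 0
```

The init script would then call start_with_retry() with the same
sge_execd command line it uses today, instead of running it once.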

I'm using SGE 6.0.

Any help will be appreciated.

Thanks,
Filipe
